Apache Spark: Ready, Get Set, Stream!

Nimisha Sharath
5 min read · Jun 9, 2017

For the past few weeks, I’ve been exploring the data streaming module on Spark at work and it only seemed fitting to write a post about this fantastic tool.

Let’s assume we have a data stream that we’ve gotta clean up, store in a sensible way and compute something out of. For the sake of simplicity, say it’s a bunch of messages coming in from a server that you’re listening to on a TCP socket. What’s your go-to game plan?

Write a simple Python program, collect the data, clean it, and then do whatever seems funky with it. Okay, you’re a Java person? Yeah, you’re gonna write a simple client receiver and spend a bit more time doing the same thing.
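For the socket flavour of that plan, a bare-bones sketch might look something like this (the host, port and the “cleaning” step are placeholders I’ve made up):

```python
import socket

# A bare-bones version of the "standalone script" plan: read newline-delimited
# messages off a TCP socket and clean them up as they arrive.
# HOST and PORT are placeholders for wherever your server is listening.
HOST, PORT = "localhost", 9999

with socket.create_connection((HOST, PORT)) as sock:
    buffer = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:          # server closed the connection
            break
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            message = line.decode("utf-8", errors="replace").strip()
            if message:        # "clean it, then do whatever seems funky with it"
                print(message)
```

It works, right up until you start caring about speed, volume, ordering and failures, which is where the rest of this post comes in.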

Let’s address the issues you’re likely to run into.

My program is kinda slow!

Trust me, I spent an entire semester in college writing Python scripts to parallelize operations with the multiprocessing module and thought there was nothing better. You know those map() and reduce() functions that make Python programming lambda heaven? Yeah, that’s way faster than a plain loop. And what’s faster than standalone map-reduce scripts? Hadoop. What’s faster than Hadoop? Spark. Yup. Spark has this amazing concept of in-memory processing. I’m just gonna ask you to have a look at this link right here for more information on that.
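To make the comparison concrete, here’s a rough sketch of the same map-and-reduce computation done first with multiprocessing and then with a Spark RDD (the dataset and app name are made up):

```python
from multiprocessing import Pool
from pyspark import SparkContext

def square(x):
    return x * x

if __name__ == "__main__":
    nums = list(range(1000000))

    # The "standalone script" way: one machine, a handful of worker processes.
    with Pool(processes=4) as pool:
        print(sum(pool.map(square, nums)))

    # The Spark way: the same map/reduce, but the data is partitioned across
    # executors and intermediate results stay in memory rather than on disk.
    sc = SparkContext(appName="map-reduce-sketch")   # made-up app name
    print(sc.parallelize(nums).map(square).reduce(lambda a, b: a + b))
    sc.stop()
```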

Another advantage Spark has here is that it can launch tasks much faster. MapReduce starts a new JVM for each task, which can take seconds once you account for loading JARs, JITing, parsing configuration XML, and so on. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and handing a Runnable to a thread pool, which takes single-digit milliseconds.

I have way too much data!

Spark Core handles the important plumbing: scheduling tasks, managing memory and coordinating input/output operations across the cluster. Its central abstraction is the RDD, or resilient distributed dataset, which is basically a collection of records partitioned across several machines connected over a network. You transform that data by chaining operations on it, mapping it, sorting it, reducing it and joining it with other datasets, and Spark takes care of spreading the work across the cluster.
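Here’s a tiny PySpark sketch of chaining those transformations on an RDD; the data and app name are toy values I’ve invented:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")   # made-up app name

# A toy RDD of (tool, count) records, partitioned across the cluster.
pairs = sc.parallelize([("spark", 1), ("hadoop", 1), ("spark", 1), ("flink", 1)])

counts = (pairs
          .reduceByKey(lambda a, b: a + b)    # reduce: total count per key
          .sortBy(lambda kv: -kv[1]))         # sort: most frequent first

# join: attach a (made-up) label to each tool
labels = sc.parallelize([("spark", "in-memory"), ("hadoop", "on-disk")])
print(counts.join(labels).collect())          # e.g. [('spark', (2, 'in-memory')), ...]

sc.stop()
```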

I can’t control my data!

If you’re one of those people who just play around with APIs, you know what I’m talking about. Packets go missing, they get delayed, they arrive out of sequence, and bolting features onto your standalone Python script to control the storage of this data is a mighty task. That being said, stateful stream processing is the need of the hour!

A stateful stream processing system is a system that needs to update its state as the stream of data comes in. Latency should be low for such a system, and even if a node fails, the state should not be lost (for example, computing the distance covered by a vehicle from a stream of its GPS locations, or counting the occurrences of the word “spark” in a stream of text).
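A minimal sketch of that “count the word spark” example with Spark Streaming’s updateStateByKey could look like this; the host, port and checkpoint directory are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Keep a running count of the word "spark" across micro-batches.
# Host, port and checkpoint directory below are placeholders.
sc = SparkContext(appName="stateful-sketch")
ssc = StreamingContext(sc, 1)                    # 1-second micro-batches
ssc.checkpoint("/tmp/stateful-checkpoint")       # state must be checkpointed

def update_count(new_values, running_count):
    # new_values: counts from the current batch; running_count: state so far
    return sum(new_values) + (running_count or 0)

lines = ssc.socketTextStream("localhost", 9999)
spark_mentions = (lines.flatMap(lambda line: line.split())
                       .filter(lambda word: word.lower() == "spark")
                       .map(lambda word: ("spark", 1))
                       .updateStateByKey(update_count))
spark_mentions.pprint()

ssc.start()
ssc.awaitTermination()
```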

Spark Streaming allows stateful computations, i.e., maintaining a state based on data coming in on a stream. It also allows window operations (i.e., the developer specifies a time frame and performs operations on the data flowing in within that window). The window has a sliding interval, which is how often the window is updated. If I define a time window of 10 seconds with a sliding interval of 2 seconds, I’d be performing my computation on the data that arrived in the past 10 seconds, and the window would be updating every 2 seconds!
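That 10-second window with a 2-second slide translates almost directly into code, for instance with reduceByKeyAndWindow (host, port and checkpoint path are again placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Word counts over a 10-second window that slides every 2 seconds.
# Host, port and checkpoint path are placeholders.
sc = SparkContext(appName="window-sketch")
ssc = StreamingContext(sc, 2)                    # batch interval: 2 seconds
ssc.checkpoint("/tmp/window-checkpoint")         # needed for the inverse-reduce below

lines = ssc.socketTextStream("localhost", 9999)
windowed_counts = (lines.flatMap(lambda line: line.split())
                        .map(lambda word: (word, 1))
                        .reduceByKeyAndWindow(lambda a, b: a + b,   # add batches entering the window
                                              lambda a, b: a - b,   # subtract batches leaving it
                                              windowDuration=10,
                                              slideDuration=2))
windowed_counts.pprint()

ssc.start()
ssc.awaitTermination()
```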

What if one of my nodes fails?

It’s perfectly normal to keep a backup of your data for reasons that needn’t be spelled out. But how often will you take a backup manually? Where will you store it? How efficiently will you store it?

Fault tolerance is the capability of a system to overcome failure. Fault tolerance in Spark Streaming is similar to fault tolerance in Spark. Like RDD partitions, DStream data is recomputed in case of a failure. The raw input is replicated in memory across the cluster of nodes, and if a node fails, the lost data can be reproduced from the lineage. The system can recover from a failure in less than one second.
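In practice you switch this on by checkpointing the streaming context, so a restarted driver can rebuild the DStream lineage instead of starting from scratch. A rough sketch, with the checkpoint directory and socket source as placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "/tmp/ft-checkpoint"   # placeholder; a real cluster would use HDFS/S3

def create_context():
    # Runs only on the very first start; builds the streaming graph.
    sc = SparkContext(appName="fault-tolerance-sketch")
    ssc = StreamingContext(sc, 2)
    ssc.checkpoint(CHECKPOINT_DIR)
    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
    lines.count().pprint()
    return ssc

# After a driver failure, getOrCreate rebuilds the context and the DStream
# lineage from the checkpoint instead of calling create_context again.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```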

I wanna do some cool stuff with the data I collected!

Who doesn’t? Spark has an amazing collection of libraries for machine learning, graph processing and statistics, all of which take RDDs as input. And by now you know that the data streaming into Spark is processed in micro-batches and stored as RDDs.

You want to work on data incrementally? Sure! We’ve got windows, we’ve got aggregate functions, we’ve even got Spark SQL, a module that lets you treat your data as if it were a relational database. Go ahead, write your complex queries. There’s even a query optimiser that plans how the work runs over the data partitions spread across your nodes. Bet that would’ve been hard to manage in your standalone script, huh?
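For example, you can run SQL over every micro-batch by turning each RDD into a DataFrame and registering it as a temporary view; a sketch along those lines (the source, table name and query are made up):

```python
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession
from pyspark.streaming import StreamingContext

# Run a SQL query over every micro-batch by turning each RDD into a DataFrame.
# Host, port and the "events" table name are placeholders for illustration.
sc = SparkContext(appName="sql-on-stream-sketch")
spark = SparkSession.builder.getOrCreate()
ssc = StreamingContext(sc, 5)

words = ssc.socketTextStream("localhost", 9999).flatMap(lambda line: line.split())

def run_sql(time, rdd):
    if rdd.isEmpty():
        return
    df = spark.createDataFrame(rdd.map(lambda w: Row(word=w)))
    df.createOrReplaceTempView("events")
    spark.sql("SELECT word, COUNT(*) AS hits "
              "FROM events GROUP BY word ORDER BY hits DESC").show()

words.foreachRDD(run_sql)

ssc.start()
ssc.awaitTermination()
```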

There’s gotta be a dark side.

Well, Spark is an open-source tool. There’s always going to be room for improvement.

Besides, “it’s not a bug, it’s a feature!”

That being said, there are some minor things I noticed that are still pending on Spark’s side. Despite all my boasting about SQL queries, Spark Structured Streaming doesn’t let you apply joins across multiple data streams. It also doesn’t allow left outer and right outer joins. I’m sure I found a bunch of other things like this, but they don’t seem to be deal breakers at the moment.

I’ll just leave this link here, which is an official introduction to Structured Streaming in Spark. It’s gonna give you a much better understanding of how streaming works in Spark, compared to this fangirl post.

I’ll perhaps write a better one once I finish the module I’m working on. :)

Let me know if I missed something or seemed off at any point.

Thanks for reading, there will be more such posts coming up!
