Continuing from my previous post on Spark, which provides a highly efficient parallel processing framework, Spark Streaming is a natural extension of the core programming paradigm to large-scale, real-time data processing. The biggest benefit of Spark Streaming is that it is based on the same programming paradigm as the core, so there is no need to develop and maintain two completely different programming models for batch and real-time processing.
Spark Core Programming Paradigm Recap

The core Spark programming paradigm consists of the following steps ...
- Take input data from an external data source and create an RDD (a data set distributed across many servers)
- Transform the RDD into another RDD (these transformations define a directed acyclic graph, or DAG, of dependencies between RDDs)
- Output the final RDD to an external data source
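The three steps above can be sketched in pure Python, using plain lists as stand-ins for RDDs. This is a conceptual model only, not the actual Spark API:

```python
# Step 1: "load" input data from an external source into an RDD-like list.
input_rdd = [1, 2, 3, 4, 5]

# Step 2: transform the RDD into another RDD; each transformation
# produces a new immutable data set (here, a map followed by a filter).
squared_rdd = [x * x for x in input_rdd]          # map(x => x * x)
filtered_rdd = [x for x in squared_rdd if x > 5]  # filter(x => x > 5)

# Step 3: "output" the final RDD to an external sink.
print(filtered_rdd)  # [9, 16, 25]
```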
Notice that an RDD is immutable, so the sequence of transformations is deterministic; recovery from an intermediate processing failure is therefore simply a matter of tracing back to the parent of the failed node (in the DAG) and redoing the processing from there.
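The lineage-based recovery idea can be sketched as follows: each RDD remembers its parent and the deterministic transformation that produced it, so any lost result can be recomputed by walking back up the chain. The classes here are illustrative, not Spark's internals:

```python
class RDD:
    """Toy RDD: holds either source data or a (parent, transformation) pair."""
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def compute(self):
        if self.data is not None:
            return self.data
        # Recover by re-running the deterministic transformation
        # on the parent's (possibly also recomputed) data.
        return [self.fn(x) for x in self.parent.compute()]

source = RDD(data=[1, 2, 3])
doubled = RDD(parent=source, fn=lambda x: x * 2)
incremented = RDD(parent=doubled, fn=lambda x: x + 1)
print(incremented.compute())  # [3, 5, 7]
```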
Spark Streaming

Spark Streaming introduces a data structure called DStream, which is basically a sequence of RDDs where each RDD contains the data that arrived within a particular time interval. A DStream is created with a frequency parameter (the batch interval) that defines how often a new RDD is appended to the sequence.
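The batching behavior can be illustrated with a small sketch that groups timestamped events into per-interval micro-batches, each batch standing in for one RDD. The function name and data are illustrative, not the Spark API:

```python
BATCH_INTERVAL = 1  # seconds; the "frequency parameter"

# Incoming events tagged with an arrival timestamp (in seconds).
events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]

def to_dstream(events, batch_interval):
    """Group events into per-interval batches (one list per 'RDD')."""
    batches = {}
    for ts, value in events:
        idx = int(ts // batch_interval)
        batches.setdefault(idx, []).append(value)
    n = max(batches) + 1
    return [batches.get(i, []) for i in range(n)]

dstream = to_dstream(events, BATCH_INTERVAL)
print(dstream)  # [['a', 'b'], ['c'], ['d', 'e']]
```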
Transformation of a DStream boils down to transformation of each RDD within the sequence of RDDs that the DStream contains. Within a transformation, an RDD inside the DStream can be joined with another RDD outside the DStream, providing a mixed processing paradigm between the DStream and other RDDs. Also, since each transformation produces an output RDD, transforming a DStream results in another sequence of RDDs that defines an output DStream.
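The join with a static RDD can be sketched by joining each micro-batch against a fixed reference data set, here keyed pairs joined on a user id (the data and names are illustrative):

```python
# Static reference "RDD" (e.g. user id -> user name), loaded once.
static_rdd = {"u1": "alice", "u2": "bob"}

# Streaming micro-batches of (user id, event value) pairs.
event_dstream = [[("u1", 10)], [("u2", 5), ("u1", 7)]]

# Join each batch in the DStream against the static reference data.
joined_dstream = [
    [(k, (v, static_rdd[k])) for k, v in batch if k in static_rdd]
    for batch in event_dstream
]
print(joined_dstream)
# [[('u1', (10, 'alice'))], [('u2', (5, 'bob')), ('u1', (7, 'alice'))]]
```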
The basic transformation is the one-to-one case, where each RDD in the output DStream has a one-to-one correspondence with an RDD in the input DStream.
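In the list-of-batches sketch used above, a one-to-one transformation simply maps each input batch to exactly one output batch:

```python
# Each output "RDD" is produced from exactly one input "RDD".
input_dstream = [[1, 2], [3], [4, 5]]
output_dstream = [[x * 10 for x in batch] for batch in input_dstream]
print(output_dstream)  # [[10, 20], [30], [40, 50]]
```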
Instead of performing a one-to-one transformation of each RDD in the DStream, Spark Streaming also enables a sliding window operation by defining a window that groups consecutive RDDs along the time dimension. A window is defined by two parameters ...
- Window length: defines how many consecutive RDDs will be combined for performing the transformation.
- Slide interval: defines how many RDDs will be skipped before the next windowed transformation executes.
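The interplay of the two parameters can be sketched as follows, again with lists standing in for RDDs; this models the idea behind windowing, not Spark's actual window() API:

```python
def windowed(dstream, window_length, slide_interval):
    """Combine window_length consecutive batches, advancing by
    slide_interval batches between successive windows."""
    out = []
    for start in range(0, len(dstream) - window_length + 1, slide_interval):
        combined = []
        for batch in dstream[start:start + window_length]:
            combined.extend(batch)
        out.append(combined)
    return out

batched_dstream = [[1], [2], [3], [4], [5]]
# Window of 3 consecutive RDDs, sliding forward by 2 RDDs each time.
print(windowed(batched_dstream, 3, 2))  # [[1, 2, 3], [3, 4, 5]]
```

Note that when the slide interval is smaller than the window length (as here), consecutive windows overlap and some RDDs contribute to more than one result.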
By providing a similar set of transformation operations for both RDDs and DStreams, Spark enables a unified programming paradigm across batch and real-time processing, and hence reduces the corresponding development and maintenance cost.