Friday, November 28, 2014

Spark Streaming

In this post, we'll discuss another important topic of big data processing: real-time stream processing area.  This is an area where Hadoop falls short because of its high latency, and another open source framework Storm is developed to cover the need in real-time processing.  Unfortunately, Hadoop and Storm provides quite different programming model, resulting in high development and maintenance cost.

Continue from my previous post on Spark, which provides a highly efficient parallel processing framework.  Spark streaming is a natural extension of its core programming paradigm to provide large-scale, real-time data processing.  The biggest benefits of using Spark Streaming is that it is based on a similar programming paradigm of its core and there is no need to develop and maintain a completely different programming paradigm for batch and realtime processing.

Spark Core Programming Paradigm Recap

The core Spark programming paradigm consists of the following steps ...
  1. Taking input data from an external data source and create an RDD (a distributed data set across many servers)
  2. Transform the RDD to another RDD (these transformation defines a direct acyclic graph of dependencies between RDD)
  3. Output the final RDD to an external data source

Notice that the RDD is immutable, therefore the sequence of transformations is deterministic and therefore recovery from intermediate processing failure is simply by tracing back to the parent of the failure node (in the DAG) and redo the processing from there.

Spark Streaming

Spark Streaming introduce a data structure call DStream which is basically a sequence of RDD where each RDD contains data associated with a time interval.  DStream is created with a frequency parameters which defines the frequency RDD creation into the sequence.

Transformation of a DStream boils down to transformation of each RDD (within the sequence of RDD that the DStream contains).  Within the transformation, the RDD inside the DStream can "join" with another RDD (outside the DStream), hence provide a mix processing paradigm between DStream and other RDDs.  Also, since each transformation produces an output RDD, the result of transforming a DStream results in another sequence of RDDs that defines an output DStream.

Here is the basic transformation where each RDD in the output DStream has a one to one correspondence with each RDD in the input DStream. 

Instead of performing a 1 to 1 transformation of each RDD in the DStream.  Spark streaming enable a sliding window operation by defining a WINDOW which groups consecutive RDDs along the time dimension.  There are 2 parameters that the window is defined ...
  1. Window length: defines how many consecutive RDDs will be combined for performing the transformation.
  2. Slide interval: defines how many RDD will be skipped before the next transformation executes.

By providing a similar set of transformation operation for both RDD and DStream, Spark enable a unified programming paradigm across both batch and real-time processing, and hence reduce the corresponding development and maintenance cost.

Wednesday, November 5, 2014

Common data science project flow

As working across multiple data science projects, I observed a similar pattern across a group of strategic data science projects where a common methodology can be used.  In this post, I want to sketch this methodology at a high level.

First of all, "data science" itself is a very generic term that means different things to different people.  For the projects I involved, many of them target to solve a very tactical and specific problem.  However, over the last few years more and more enterprises start to realize the strategic value of data.  I observed a growing number of strategic data science projects were started from a very broad scope and took a top-down approach to look at the overall business operation.  Along the way, the enterprise prioritize the most critical areas within their operation cycle and build sophisticated models to guide and automate the decision process.

Usually, my engagement started as a data scientist / consultant, with very little (or even no) domain knowledge.  Being unfamiliar with the domain is nothing to be proud of and often slow down my initial discussion.  Therefore, within a squeezed time period I need to quickly learn enough "basic domain knowledge" to facilitate the discussion smooth.  On the other hand, lacking a per-conceived model enables me (or you can say force me) to look from a fresh-eye view, from which I can trim off unnecessary details from the legacies and only focus on those essential elements that contributes to the core part of the data model.  It is also fun to go through the concept blending process between a data scientist and a domain expert.  I force them to think in my way and they force me to think in their way.  This is by far the most effective way for me to learn any new concepts.

Recently I had a discussion with a company who has a small, but very sophisticated data science team that build pricing model, and demand forecasting for their product line.  I am, by no means an expert in their domain.  But their problem (how to predict demand, and how to set price) is general enough across many industries.  Therefore, I will use this problem as an example to illustration the major steps in the common pattern that I describe above.

Problem Settings

Lets say a car manufacturer starts its quarterly planning process.  Here are some key decisions that need to be made by the management.
  • How many cars the company should produce for next year ?
  • What should be the renew price of the cars ?
First of all, we need to identify the ultimate "goal" of these decisions.  Such goal is usually easy to find as it usually in the company's mission statement.

In this problem, the goal is to ...
maximize: "Profit_2015"

In general, I find it is a good start to look at the problem from an "optimization" angle, from which we define our goal in terms of an objective function as well as a set of constraints.

Step 1:  Identify variables and define its dependency graph

Build the dependency graph between different variables starting from the Objective function.  Separate between the decision variables (where you have control) and environment variable (where you have no control).

As an illustrative example, we start from our objective function "Profit_2015" and define the dependency relationship below. Decision variable is highlighted in blue.

Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
Demand_2015 = H(GDP_2014, PhoneSold_2014)
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

Identifying these variable and their potential dependencies typically come from a well-studied theory from University, or domain experts in the industry.  At this stage, we don't need to know the exact formula of the function F/G/H.  We only need to capture the links between the variables.  It is also ok to include a link that shouldn't have exist (ie: there is no relationship between the 2 variables in reality).  However, it is not good if we miss a link (ie: fail to capture a strong, existing dependency).

This round usually involves 4 to 5 half day brainstorming sessions with the domain experts, facilitated by the data scientist/consultant who is familiar with the model building process.  There may be additional investigation, background studies if the subject matter experts doesn't exist.  Starting from scratch, this round can take somewhere between couple weeks to couple months

Step 2:  Define the dependency function

In this round,  we want to identify the relationship between variable using formula of F(), G(), H().

Well-Known Function
For some relationship that is well-studied, we can use a known mathematical model.

For example, in the relationship
Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)

We can use the following Mathematical formula in a very straightforward manner
Profit = (UnitSold * Price) - (UnitProduced * Cost)

Semi-Known Function
However, some of the relationship is not as straightforward as that.  For those relationship that we don't exactly know the formula, but can make a reasonable assumption on the shape of the formula, we can assume the relationship follows a family of models (e.g. Linear, Quadratic ... etc.), and then figure out the optimal parameters that best fit the historical data.

For example, in the relationship
Demand_2015 = H(GDP_2014, PhoneSold_2014)

Lets assume the "demand" is a linear combination of "GDP" and "Phone sold", which seems to be a reasonable assumption.

For the linear model we assume
Demand = w0 + (w1 * GDP) + (w2 * PhoneSold)

Then we feed the historical training data to a build a linear regression model and figure out what the fittest value of w0, w1, w2 should be.

Time-Series Function
In some cases, a variable depends only on its own past value but not other variables, here we can train a Time Series model to predict the variable based on its own past values.  Typically, the model is decomposed into 3 components; Noise, Trend and Seasonality.  One popular approach is to use exponential smoothing techniques such as Holt/Winters model.  Another popular approach is to use the ARIMA model which decomposed the value into "Auto-Regression" and "Moving-Average".

For example, in the relationship
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

We can use TimeSeries model to learn the relationship between the historical data to its future value.

Completely Unknown Function
But if we cannot even assume the model family, we can consider using "k nearest neighbor" approach to interpolate the output from its input.  We need to define the "distance function" between data points based on domain knowledge and also to figure out what the optimal value of k should be.  In many case, using a weighted average of the k-nearest neighbor is a good interpolation.

For example, in the relationship
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
 It is unclear what model to be used in representing UnitSold as a function of Supply, Demand, Price and CompetitorPrice.  So we go with a nearest neighbor approach.

Based on monthly sales of past 3 years, we can use "Euclidean distance" (we can also consider scaling the data to a comparable range by minus its mean and divide by its standard deviation) to find out the closest 5 neighbors, and then using the weighted average to predict the unit sold.

Step 3: Optimization

At this point, we have the following defined
  • A goal defined by maximizing (or minimizing) an objective function
  • A set of variables (including the decision and environment variables)
  • A set of functions that define how these variables are inter-related to each other.  Some of them is defined by a mathematical formula and some of them is defined as a black-box (base on a predictive model)
Our goal is to figure out what the decision variables (which we have control) should be set such that the objective function is optimized (maximized or minimized).

Determine the value of environment variables
For those environment variables that has no dependencies on other variables, we can acquire their value from external data sources.  For those environment variables that has dependencies on other environment variables (but not decision variables), we can estimate their value using the corresponding dependency function (of course, we need to estimate all its depending variables first).  For those environment variables that has dependencies (direct or indirect) on decision variables, leave it as undefined.

Determine the best value of decision variables
Once we formulate the dependency function, depends on the format of these function, we can employ different optimization methods.  Here is how I choose the appropriate method based on the formulation of dependency functions.

Additional Challenges

To summarize, we have following the process below
  1. Define an objective function, constraints, decision variables and environment variables
  2. Identify the relationship between different variables
  3. Collect or predict those environment variables
  4. Optimize those decision variables based on the objective functions
  5. Return the optimal value of decision variables as the answer
So far, our dependency graph is acyclic where our decision won't affect the underlying variables.  Although this is reasonably true if the enterprise is an insignificant small player in the market, it is no longer true if the enterprise is one of the few major players.  For example, their pricing strategy may causes other competitors to change their own pricing strategy as well.  And how the competitors would react is less predictable and historical data play a less important role here.  At some point, human judgement will get involved to fill the gaps.