Pragmatic Programming Techniques, by Ricky Ho

<h2>Structure Learning and Imitation Learning (2018-09-02)</h2>
In a classical prediction use case, the predicted output is either a number (for regression) or a category (for classification). A set of training data (x, y), where x is the input and y is the labeled output, is provided to train a parameterized predictive model.<br />
<br />
<ul>
<li>The model is characterized by a set of parameters w</li>
<li>Given an input x, the model predicts y_hat = f(x; w) for regression, or the probability of each possible class for classification</li>
<li>Define a loss function L(y, y_hat) for regression, or L(y, P(y=a | x), P(y=b | x) ...) for classification, and find the parameters w that minimize L</li>
</ul>
<div>
This problem is typically viewed as an optimization problem, and a gradient descent approach is used to solve it.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvxuWTtWqyaKF4ngBBaddnaT09UwLBc1H-nUolgTu0aSCulbjnVrsW0FifxgP-hspnBmspgLlb7GOxhw0DPAfLeAf3ChJrjQUIayWPNdKABAMDOPCrD7xePHGNeOezLBi1b_HN2JognLUY/s1600/Screen+Shot+2018-09-01+at+10.57.52+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="691" data-original-width="1600" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvxuWTtWqyaKF4ngBBaddnaT09UwLBc1H-nUolgTu0aSCulbjnVrsW0FifxgP-hspnBmspgLlb7GOxhw0DPAfLeAf3ChJrjQUIayWPNdKABAMDOPCrD7xePHGNeOezLBi1b_HN2JognLUY/s400/Screen+Shot+2018-09-01+at+10.57.52+AM.png" width="400" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<h3>
Need for Structure Learning</h3>
<div>
However, in some cases, y is not as simple as a number or a class. For example</div>
<div>
<ul>
<li>For machine translation, x is a sentence in English, and y is a translated sentence in French</li>
<li>For a self-driving vehicle, x is a camera image, and y is the control action on the steering wheel, brake, and gas pedal</li>
</ul>
<div>
In these cases, the output y can be viewed as an object. But wait, can we break the object down into multiple numbers / categories and use the classical regression / classification approach to solve it? Not quite, because the loss function cannot simply be formulated as the sum of the losses of the individual components. For example, two French sentences using different words may still both be very good translations of the same English sentence. So we need to generalize a bit more and introduce the concepts of an object and compatibility here.</div>
<div>
<br /></div>
<div>
The prediction problem can be generalized as: given an input x, find an object y that is the "most compatible" with x. The compatibility is a parameterized function that we are going to learn from the training data.</div>
</div>
<div>
<ul>
<li>The compatibility function is defined as F(x, y; w)</li>
<li>During the training phase, we tune the parameters w such that for every sample in the training data, F(x, y; w) is the maximum over all candidate outputs. In other words, F(x, Y=y; w) > F(x, Y=other_val; w) for any other value</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvvV3bmhDTaBOt4pb81o2ZpRwKIhPr_bVKPGtJiBa4_7aI1HrZ_M8K6Qt1s194NQHi1grx4Jj9z6FhsBTL4NgUQCTUFj47OksEkjQFJHgRPFc7EbCXOJ3JSGKzACVbgvi_fROKewvNfM8g/s1600/Screen+Shot+2018-09-01+at+5.02.25+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="656" data-original-width="1600" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvvV3bmhDTaBOt4pb81o2ZpRwKIhPr_bVKPGtJiBa4_7aI1HrZ_M8K6Qt1s194NQHi1grx4Jj9z6FhsBTL4NgUQCTUFj47OksEkjQFJHgRPFc7EbCXOJ3JSGKzACVbgvi_fROKewvNfM8g/s400/Screen+Shot+2018-09-01+at+5.02.25+PM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
</div>
<br />
Notice that the training process is different from classical ML in the following ways.<div>
<ul>
<li>There are two optimization loops here: a) given the parameters w, find the y_opt that maximizes F(x, y; w); b) given the loss, defined as the gap between F(x, y; w) and F(x, y_opt; w), find the w that minimizes that gap.</li>
<li>It turns out the first optimization is solved in a problem-specific way, while the second optimization can be solved by the classical gradient descent approach.</li>
</ul>
After we learn the compatibility function parameters, at inference time we apply the first optimization to the given input x to find the most compatible y_opt, i.e. the y for which F(x, y_opt; w) is the maximum.</div>
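<div>
Below is a minimal sketch of this two-loop training process in the spirit of a structured perceptron. The joint feature function, the tiny candidate set, and the training pairs are made-up placeholders for illustration; they are not the post's actual model.</div>
<pre>
import numpy as np

def features(x, y):
    # Hypothetical joint feature vector phi(x, y); in a real problem this
    # encodes how well the parts of y fit the parts of x.
    return np.array([x * y, y, 1.0])

def compatibility(x, y, w):
    # F(x, y; w) modeled as a linear function of the joint features
    return w @ features(x, y)

candidates = [0, 1, 2, 3]           # toy discrete space of possible outputs y
train = [(1, 2), (2, 3), (3, 3)]    # made-up (x, y) training pairs

w = np.zeros(3)
for epoch in range(10):
    for x, y in train:
        # Inner loop: find the currently most compatible output
        # (problem specific; here a brute-force search over a tiny set)
        y_opt = max(candidates, key=lambda c: compatibility(x, c, w))
        # Outer loop: if the true y does not win, move w so it scores higher
        if y_opt != y:
            w += features(x, y) - features(x, y_opt)

# Inference: pick the y that maximizes the learned compatibility
x_new = 2
y_pred = max(candidates, key=lambda c: compatibility(x_new, c, w))
</pre>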
<div>
<br /></div>
<div>
Rather than trying to exactly match y_hat to the y in the training data, structure learning enables us to learn a more abstract relationship (i.e. compatibility) between x and y, so that we can output another equally good y even if it is not the same as the y in the training data. This more generalized form is very powerful when we don't have a lot of training data. The downside of structure learning is that it is compute intensive, because the inference phase needs to solve an optimization problem, which is typically expensive.</div>
<div>
<br /></div>
<h3>
Imitation Learning</h3>
<div>
In a typical Reinforcement Learning setting, a digital agent observes the state of the environment, uses its policy to determine an action that it believes will maximize its cumulative future reward, takes the action, receives the reward from the environment, and transitions to the next state. Reinforcement learning is about how the agent optimizes its policy from its experience while interacting with the environment.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWOjhSAZ-jQ1ACa_KwDmp7GNIZcmoMP7XHAV1RZIxI2x52IZ_sCm02iy7rAZwoWRcNWYKKT_WXCNAPGb_kcXUeMoUOCiZdqpmExiYLH-041wCPdVMGlKEi7LffanhR_vDUu_4bdpFp931H/s1600/Screen+Shot+2018-09-02+at+9.33.25+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="842" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWOjhSAZ-jQ1ACa_KwDmp7GNIZcmoMP7XHAV1RZIxI2x52IZ_sCm02iy7rAZwoWRcNWYKKT_WXCNAPGb_kcXUeMoUOCiZdqpmExiYLH-041wCPdVMGlKEi7LffanhR_vDUu_4bdpFp931H/s320/Screen+Shot+2018-09-02+at+9.33.25+AM.png" width="320" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
For an <a href="http://horicky.blogspot.com/2017/08/reinforcement-learning-overview.html" target="_blank">overview of Reinforcement Learning</a> and its basic algorithms, you can visit my previous blog post <a href="http://horicky.blogspot.com/2017/08/reinforcement-learning-overview.html" target="_blank">here</a>.</div>
<div>
<br /></div>
<div>
Basically, reinforcement learning learns its lessons through a trial-and-error approach. This can be very costly because serious mistakes can be made during the trial process (imagine using reinforcement learning to learn to drive a car: we may crash many cars before we learn anything meaningful). In practice, people rely on a simulator to mimic the environment. However, coming up with a good simulator is not easy because it requires a very deep understanding of how the actual environment behaves; this is one of the limitations that prevents reinforcement learning from being broadly applied.</div>
<div>
<br /></div>
<div>
Another important design consideration is how the reward is assigned. Of course, we can use the actual reward from the environment to train the policy, but this is usually very inefficient. Imagine playing a game of chess: we only get the reward at the end, when we win or lose the game. Propagating this reward all the way back to each move is very inefficient. To make the learning faster, we usually use a technique called "reward shaping", which assigns some artificial reward along the trajectory to bias the agent towards certain desirable actions (based on domain knowledge).</div>
<div>
<br /></div>
<div>
One special form of reward shaping is "imitation learning", where we assign intermediate reward based on how "similar" the action is to what an expert does in real-life circumstances. Let's say we collect a set of observations of the expert taking action y in state x, and try to learn a model that biases the agent to take action y when it sees state x. But wait, doesn't that sound like a supervised learning problem? Can we just train a prediction model from x to y and be done?</div>
<div>
<br /></div>
<div>
Unfortunately, it is not that simple. Expert data is typically very sparse and expensive to get, meaning we usually don't have much data from the expert. Imagine a self-driving program: if we want to learn how to react when the car is about to crash, we may not find any such situation in the expert's observations, because the expert may never run into such a dangerous situation at all. On the other hand, we don't need to copy exactly what the expert did in every situation; we just need to copy the behavior that is relevant to the situation at hand.</div>
<div>
<br /></div>
<div>
"Inverse Reinforcement Learning" comes into rescue. Basically, it cuts off the reward from the environment and replace it with a "reward estimator function", which is trained from a set of expert behavior, assuming that expert behavior will achieve highest reward.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6NOh4qXHkFaL0TTFIpG-NiL_YWjVHszjPrJhu2Is3A9aizhEzDkAgxUgkXgoQoVuQsbX9Td3Kqk7UW-CW05FmCfSYX29e79_xG6-PnYdR4xrY8pAmj3c0aDDjAKlO2cLdqg96le3er_4v/s1600/Screen+Shot+2018-09-02+at+9.52.51+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="536" data-original-width="1028" height="166" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6NOh4qXHkFaL0TTFIpG-NiL_YWjVHszjPrJhu2Is3A9aizhEzDkAgxUgkXgoQoVuQsbX9Td3Kqk7UW-CW05FmCfSYX29e79_xG6-PnYdR4xrY8pAmj3c0aDDjAKlO2cLdqg96le3er_4v/s320/Screen+Shot+2018-09-02+at+9.52.51+AM.png" width="320" /></a></div>
<div>
<br /></div>
<div>
The underlying algorithm of inverse reinforcement learning is based on the "structure learning" algorithm. In this case, x is the start state and y is the expert's trajectory, which is basically the training data. y_opt is the trajectory produced by the agent's policy, which is learned from the reward function using a Reinforcement Learning algorithm. The compatibility function is basically our reward function, because we assume the expert behavior achieves the highest reward.</div>
<div>
<br /></div>
<div>
Then we plug this into the structure learning algorithm below ...</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCOd_iS2YzJM8bL70pIHlA9Vz_WSOu0LW_T3K_Pm24tUFE4AKKYeL2bknXeQy_GT9QU4_XcmO4Og6G3MCyRsTX8pZ6rJL6N6wlHSFw051ZuAW312kS2twIv7YCDSNwWeU5bTaxCkY7RZPa/s1600/Screen+Shot+2018-09-02+at+10.00.34+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="842" data-original-width="1592" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCOd_iS2YzJM8bL70pIHlA9Vz_WSOu0LW_T3K_Pm24tUFE4AKKYeL2bknXeQy_GT9QU4_XcmO4Og6G3MCyRsTX8pZ6rJL6N6wlHSFw051ZuAW312kS2twIv7YCDSNwWeU5bTaxCkY7RZPa/s400/Screen+Shot+2018-09-02+at+10.00.34+AM.png" width="400" /></a></div>
<div>
<br /></div>
<div>
The agent still needs to interact with the environment (or a simulator) to collect its trajectory, but the environment only needs to determine the next state, not the reward. Again, there are two nested optimization loops in the algorithm</div>
<div>
<ul>
<li>Given a reward function (characterized by w), use classical RL to learn the optimal policy</li>
<li>Use the optimal policy to interact with the environment to collect the total reward of each episode, then adjust the reward function parameters w such that the expert behavior always gets the highest total reward (see the sketch below).</li>
</ul>
<div>
<br /></div>
</div>
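<div>
A minimal sketch of these two nested loops, assuming a linear reward model where the reward of a trajectory is the dot product of w with its feature vector. The helpers learn_policy_with_rl and rollout are caller-supplied placeholders standing in for a full RL solver and the environment; none of these names come from the post.</div>
<pre>
import numpy as np

def irl_outer_loop(expert_trajectories, feature_fn, learn_policy_with_rl, rollout,
                   n_iters=50, lr=0.1):
    """Sketch of inverse RL framed as structure learning.
    feature_fn(trajectory) returns a feature vector; the reward of a
    trajectory is w @ feature_fn(trajectory)."""
    w = np.zeros(len(feature_fn(expert_trajectories[0])))
    expert_feat = np.mean([feature_fn(t) for t in expert_trajectories], axis=0)

    for _ in range(n_iters):
        # Inner loop: given the current reward weights w, use ordinary RL
        # (value iteration, policy gradient, ...) to learn the best policy.
        policy = learn_policy_with_rl(w)
        # Roll out the learned policy to see which trajectory the agent produces.
        agent_feat = feature_fn(rollout(policy))
        # Outer loop: nudge w so that the expert's trajectories score higher
        # than the agent's (the expert is assumed to achieve the highest reward).
        w += lr * (expert_feat - agent_feat)
    return w
</pre>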
<div>
<br /></div>
<h2>Reinforcement Learning Overview (2017-08-25)</h2>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
There are basically three different types of Machine Learning:<br />
<ul>
<li><b>Supervised Learning</b>: The major use case is prediction. We provide a set of training data including the input and output, then train a model that can predict the output for an unseen input.</li>
<li><b>Unsupervised Learning</b>: The major use case is pattern extraction. We provide a set of data that has no output labels; the algorithm tries to extract the underlying non-trivial structure within the data.</li>
<li><b>Reinforcement Learning</b>: The major use case is optimization. Mimicking how humans learn from childhood, we use a trial-and-error approach to find out which actions produce good outcomes, and bias our preference towards those good actions.</li>
</ul>
</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
In this post, I will provide an overview of the settings of Reinforcement Learning as well as some of its key algorithms.</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br /></div>
<h2 style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: large;">Agent / Environment Interaction</span></h2>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
Reinforcement Learning is all about how we can make good decisions through trial and error. It is the interaction between the "agent" and the "environment". </div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
Repeat the following steps until a termination condition is reached:</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<ol>
<li>The agent observes that the environment is in state s</li>
<li>Out of all possible actions, the agent needs to decide which action to take (this decision rule is called the "policy", a function that outputs an action given the current state)</li>
<li>The agent takes the action, and the environment receives that action</li>
<li>Through a transition matrix model, the environment determines the next state and proceeds to that state</li>
<li>Through a reward distribution model, the environment determines the reward given to the agent for taking action a at state s (a sketch of this loop follows below)</li>
</ol>
</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLNTO95YjdrPJdbF7h8EMhP2K3YMdNUvctP61R7eBxO0WQHXOkR0wDDuYKjRbJkpQR1Ixdo3MIVCGzp3h7hF7gQQu75t8NNxNNxo5ZHn8ogJmfTOmZHznHJybgy88-CP-Y1IQX-NmiEOiy/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1052" data-original-width="1525" height="275" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLNTO95YjdrPJdbF7h8EMhP2K3YMdNUvctP61R7eBxO0WQHXOkR0wDDuYKjRbJkpQR1Ixdo3MIVCGzp3h7hF7gQQu75t8NNxNNxo5ZHn8ogJmfTOmZHznHJybgy88-CP-Y1IQX-NmiEOiy/s400/Picture1.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
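<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
A minimal sketch of this interaction loop in Python. The ToyEnvironment and the random policy below are made-up placeholders for illustration only, not part of the original post.</div>
<pre>
import random

class ToyEnvironment:
    """Placeholder environment with five states and two actions."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        # Transition model: move to a (made-up) neighbouring state
        next_state = (self.state + action + random.choice([0, 1])) % 5
        # Reward distribution model: made-up reward for reaching state 4
        reward = 1.0 if next_state == 4 else 0.0
        self.state = next_state
        return next_state, reward

def random_policy(state):
    return random.choice([0, 1])        # policy: choose an action given the state

env = ToyEnvironment()
state, episode = env.state, []
for t in range(10):                     # termination condition: fixed horizon
    action = random_policy(state)       # step 2: the policy picks an action
    next_state, reward = env.step(action)   # steps 3-5: the environment reacts
    episode.append((state, action, reward))
    state = next_state
</pre>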
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
The goal for the agent is to determine an optimal policy such that the "value" of the start state is maximized.</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
Some terminology</div>
<div style="background-color: white; margin-bottom: 0px; margin-top: 0px;">
<ul>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">Episode: a sequence of (s1, a1, r1, s2, a2, r2, s3, a3, r3 .... st, at, rt ... sT, aT, rT)</span></li>
<li>Reward rt: the immediate reward the agent receives after taking its action in the state at time t</li>
<li>Return: the cumulative reward from the time the action is taken onward (sum of rt, r[t+1], ..., rT)</li>
<li>Value: Expected return at a particular state, called "state value" V(s), or expected return when taking action a at state s, called "Q Value" Q(s,a)</li>
</ul>
</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
The optimal policy can be formulated as choosing the action a* among all choices of a at state s such that Q(s, a*) is maximum.</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
To deal with never-ending interaction, we put a discount factor "gamma" on future rewards. This discount factor turns the sum of an infinite series into a finite number.</div>
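<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
For concreteness, the discounted return can be computed as in the small sketch below; the reward list is made up for illustration.</div>
<pre>
def discounted_return(rewards, gamma=0.9):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 0.0, 0.0, 5.0])   # made-up episode rewards
</pre>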
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<h3 style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: large;">Optimal Policy when model is known</span></h3>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">If we know the "model", then figuring out the policy is easy. We just need to use dynamic programming technique to compute the optimal policy offline and there is no need for learning. </span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">Two algorithms can be used: </span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">"Value iteration" starts with a random value and iteratively update the value based on the Bellman's equation, and finally compute the "value" of each state or state/action pair (also call Q state). The optimal policy for a given state s is to choose the action a* that maximize the Q value, Q(s, a). </span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiulZbyzl8Gk1X8venK7IkVq3klq8LN3qKeUXx1zHE7aFcwKAxo_a7PiMS9ExW64Ygym0t9J7axDewaHLCt9Qa9S2XWXlwJq1kjybNYiAi3Lh_NRLYEWVj3kMJxgU6kiMBwQmyeCgp0m8sC/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1030" data-original-width="1600" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiulZbyzl8Gk1X8venK7IkVq3klq8LN3qKeUXx1zHE7aFcwKAxo_a7PiMS9ExW64Ygym0t9J7axDewaHLCt9Qa9S2XWXlwJq1kjybNYiAi3Lh_NRLYEWVj3kMJxgU6kiMBwQmyeCgp0m8sC/s400/Picture1.png" width="400" /></a></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">Another algorithm "Policy iteration" starts with a random policy, and iteratively modifies the policy to make it better, until the </span><span style="font-size: 12pt;">policy at next iteration doesn't change any more.</span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhPM05tqnebAX7fJJUX1TnQvf_m2sKPOm51zyVn1oatrThX3kXsFVh8Md4jGkmKnS1iPp0_YCti3QMs9JJqHu3AYjsQawDSq0BFwCVuMli-WoeoSwLOC4EyqW0k-eaIVfOJlYL6YzXJKYA/s1600/Picture2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="605" data-original-width="1600" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhPM05tqnebAX7fJJUX1TnQvf_m2sKPOm51zyVn1oatrThX3kXsFVh8Md4jGkmKnS1iPp0_YCti3QMs9JJqHu3AYjsQawDSq0BFwCVuMli-WoeoSwLOC4EyqW0k-eaIVfOJlYL6YzXJKYA/s400/Picture2.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">However, in practice, we usually don't know the model, so we cannot compute the optimal policy as described above.</span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<h3 style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: large;">Optimal Policy when model is unknown</span></h3>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">One solution is the "model based" learning, we spare some time to find out the transition probability model as well as the reward distribution model. To make sure we experience all possible combinations of different state/action pairs, we will take random action in order to learn the model.</span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">Once we learn the model, we can go back to use the value iteration or policy iteration to determine the optimal policy.</span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">Learning has a cost though. Rather than taking the best action, we will take random action in order to explore new actions that we haven't tried before and it is very likely that the associated reward is not maximum. However we accumulate our knowledge about how the environment reacts under a wider range of scenarios and hopefully this will help us to get a better action in future. In other words, we sacrifice or trade off our short term gain for a long term gain. </span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;">Making the right balance is important. A common approach is to use the epsilon greedy algorithm. For each decision step, we allocate a small probability e where we take random action and probability (1-e) where we take the best known action we have explored before.</span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: 12pt;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnUOt2q7r5-1NB8j87c0uy2totDBHrpk350zBHBrpQivnB5k6V2GvdWI1Qd3Nc61Bpw41roBi_vNGh2RpiOzQ4u7APCOkc_veYpPDJ8Hmk5hKejlClwM_44EAsXmzb_XBYYTYtPDmPuZzg/s1600/Picture4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="851" data-original-width="913" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnUOt2q7r5-1NB8j87c0uy2totDBHrpk350zBHBrpQivnB5k6V2GvdWI1Qd3Nc61Bpw41roBi_vNGh2RpiOzQ4u7APCOkc_veYpPDJ8Hmk5hKejlClwM_44EAsXmzb_XBYYTYtPDmPuZzg/s320/Picture4.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
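<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
A minimal epsilon-greedy sketch; the q_values dictionary is a made-up stand-in for whatever value estimates the agent keeps.</div>
<pre>
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action to the current Q estimate for this state."""
    if random.random() < epsilon:               # explore with probability e
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)      # exploit: best known action

epsilon_greedy({"left": 0.2, "right": 0.7})     # made-up Q estimates
</pre>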
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: small;"><span style="font-size: 12pt;">Another solution approach is the "model free" learning. Lets go back to look at the detail formula under Value iteration and Policy iteration, the reason of knowing the model is to calculate the expected value of state value and Q value. Can we directly figure out the expected state and Q value through trial and error ?</span></span></div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<h3 style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: large;">Value based model free learning</span></h3>
<div style="background-color: white; margin-bottom: 0px; margin-top: 0px;">
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: small;"><span style="font-size: 12pt;">If we modify the Q value iteration algorithm to replace the expected reward/nextstate with the actual reward/nextstate, we arrive at the SARSA algorithm below.</span></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<div class="separator" style="clear: both; font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJxq5k2CAYYtKW1yaKsQRoNeMEnrxYruSg03HLjAn33-0N0_Q024-hWZfNKf7gaWOXr_IIAXDroSdoFzSueudR38FtTqOpC2mz1TVHei0KQSAo0dtG4OU_jttqXXvG6OH6V2lbkrpwIkmH/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="472" data-original-width="973" height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJxq5k2CAYYtKW1yaKsQRoNeMEnrxYruSg03HLjAn33-0N0_Q024-hWZfNKf7gaWOXr_IIAXDroSdoFzSueudR38FtTqOpC2mz1TVHei0KQSAo0dtG4OU_jttqXXvG6OH6V2lbkrpwIkmH/s400/Picture1.png" width="400" /></a></div>
<div class="separator" style="clear: both; font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px; text-align: center;">
<br /></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<h2 style="font-family: calibri, helvetica, sans-serif, serif, emojifont; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: large;">Deep Q Learning </span></h2>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;">The algorithm above requires us to keep a table to remember all Q(s,a) values which can be huge, and also becomes infinite if any of the state or action is continuous. To deal with this, we will introduce the idea of value function. The state and action will become the input parameters of this function, which will create "input features" and then feed into a linear model and finally output the Q value.</span></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<div class="separator" style="clear: both; font-family: calibri, helvetica, sans-serif, serif, emojifont; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZXs-K_S9cr8kcEBEp1iiqAScg6xDR7o96Whpa_j12DiDRHTGgjVFx5DDwmpb-DLb9zGqOZEBpH5XF7Bb4kBfVx4jksQ2J76Zc9yuNk37nfvBeLJHPiAtOCEuW3jmENJLzRrXofG1yq8I5/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="531" data-original-width="1600" height="131" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZXs-K_S9cr8kcEBEp1iiqAScg6xDR7o96Whpa_j12DiDRHTGgjVFx5DDwmpb-DLb9zGqOZEBpH5XF7Bb4kBfVx4jksQ2J76Zc9yuNk37nfvBeLJHPiAtOCEuW3jmENJLzRrXofG1yq8I5/s400/Picture1.png" width="400" /></a></div>
<div class="separator" style="clear: both; font-family: calibri, helvetica, sans-serif, serif, emojifont; text-align: center;">
<br /></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<br /></div>
<div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
Now we modify the previous SARSA algorithm to the following ...</div>
<br />
<ul>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">Instead of lookup the Q(s,a) value, we call the function (can be a DNN) to pass in the f(s, a) feature, and get its output</span></li>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">We randomly initialize the parameter of the function (can be weights if the function is a DNN)</span></li>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">We update the parameters using gradient descent on the lost which can be the difference between the estimated value and the target value (can be a one step look ahead estimation: r + gamma*max_a'[Q(s',a)] )</span></li>
</ul>
</div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<br /></div>
<div class="separator" style="clear: both; font-family: calibri, helvetica, sans-serif, serif, emojifont; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUbkWl-OtME2x-mUlM2UlV584Ms0UM8B6Nw5tuEiPD8sLVqjzckSO0UvkQOx9S2AKEXjX_M-ypNhE9INGpAJW5usumn87FPyl_f59CJD82ElXozmrgVbNnx8OIpKZYDSnOqrkA04JFCGvI/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="591" data-original-width="1548" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUbkWl-OtME2x-mUlM2UlV584Ms0UM8B6Nw5tuEiPD8sLVqjzckSO0UvkQOx9S2AKEXjX_M-ypNhE9INGpAJW5usumn87FPyl_f59CJD82ElXozmrgVbNnx8OIpKZYDSnOqrkA04JFCGvI/s640/Picture1.png" width="640" /></a></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;">If we further generalize the Q value function using a deep neural network, and update the parameter using back propagation, then we reach a simple version of Deep Q Learning.</span></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont;">
<div style="font-size: 16px;">
While this algorithm allows us to learn a Q value function that can represent a continuous state, we still need to evaluate every action and pick the one with the maximum Q value. In other words, the action space can only be discrete and finite.</div>
<div style="font-size: 16px;">
<br /></div>
<h3>
<span style="font-size: large;">Policy gradient</span></h3>
</div>
<div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;">Since the end goal is to pick the right action, and finding out the Q value is just the means (so we can pick the action of maximum Q), why don't we learn a function that takes a state and directly output an action. Using this policy function approach, we can handle both continuous or discrete action space as well.</span></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;">The key idea is to learn a function (given a state, output an action)</span></span></div>
<br />
<ul>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">If the action is discrete, it outputs a probability distribution of each action</span></li>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">It the action is continuous, it output the mean and variance of the action, assume normal distribution</span></li>
</ul>
The agent samples from the output distribution to determine the action, so its chosen action is stochastic (non-deterministic). Then the environment determines the reward and the next state, and the cycle repeats ...<br />
<br />
The goal is to find the best policy function such that the expected value of Q(s, a) is maximized. Notice that s and a are random variables whose distributions are parameterized by <span style="font-family: "cambria math";">θ.</span><br />
<span style="font-family: "cambria math";"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfEGvNtQ4upKfnlM8bs1GKxSM6k5Brq2KMqGNU4LkveoDTFjzY8Nqjemlv4dxnS2Ul8dJVoek5mpxy5gQ86vKWn9dykxcbnkDb0xdwQnF5Swr5CPx0LSEQRo1WA2kMVLgDASWYrwPDKunO/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="734" data-original-width="1564" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfEGvNtQ4upKfnlM8bs1GKxSM6k5Brq2KMqGNU4LkveoDTFjzY8Nqjemlv4dxnS2Ul8dJVoek5mpxy5gQ86vKWn9dykxcbnkDb0xdwQnF5Swr5CPx0LSEQRo1WA2kMVLgDASWYrwPDKunO/s400/Picture1.png" width="400" /></a></div>
<span style="font-family: "cambria math";"><br /></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
To maximize an "expected value" of a function with parameters <span style="font-family: "cambria math"; font-size: small;">θ</span>, we need to calculate the gradient of that function.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG8MaDiDTifZRC9X41Kf162ICgB1ejGdxx1pI-NFpQsg15ryIM5iWWJG74LB263hlNIxp9PSJGGRQuBfmmmOVY9KlqxAl77jalEPkw_YkSbkIpqILOomGA5NuhKDCuW96NV2YRlLFlYyGy/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="802" data-original-width="1600" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG8MaDiDTifZRC9X41Kf162ICgB1ejGdxx1pI-NFpQsg15ryIM5iWWJG74LB263hlNIxp9PSJGGRQuBfmmmOVY9KlqxAl77jalEPkw_YkSbkIpqILOomGA5NuhKDCuW96NV2YRlLFlYyGy/s400/Picture1.png" width="400" /></a></div>
<br /></div>
</div>
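<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
A minimal REINFORCE-style sketch of this policy gradient update for a softmax policy with linear features. The feature function and the env.reset()/env.step() interface are assumed placeholders, and theta is assumed to have shape (number of actions, number of features); none of these names come from the post.</div>
<pre>
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_episode(theta, feature_fn, env, n_actions, gamma=0.99, lr=0.01):
    """Run one episode, then take one policy gradient step on theta."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax(theta @ feature_fn(s))        # pi(a | s; theta)
        a = np.random.choice(n_actions, p=probs)      # sample a stochastic action
        s2, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2

    # Gradient ascent on the expected return, using grad log pi(a|s) * return
    g = 0.0
    for t in reversed(range(len(states))):
        g = rewards[t] + gamma * g                    # return from step t onward
        probs = softmax(theta @ feature_fn(states[t]))
        grad_log = np.outer(np.eye(n_actions)[actions[t]] - probs,
                            feature_fn(states[t]))    # d log pi / d theta
        theta = theta + lr * g * grad_log
    return theta
</pre>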
<div style="background-color: white; margin-bottom: 0px; margin-top: 0px;">
<h3 style="font-family: calibri, helvetica, sans-serif, serif, emojifont;">
<span style="font-size: large;">Actor Critic Algorithm</span></h3>
<div style="font-family: calibri, helvetica, sans-serif, serif, emojifont; font-size: 16px;">
<span style="font-size: small;"><span style="font-size: 12pt;">There are 2 moving targets in this equation:</span></span></div>
<br />
<ul>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">To improve the policy function, we need an accurate estimation of Q value and also need to know the gradient of log(s, a)</span></li>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">To make the Q value estimation more accurate, we need a stable policy function</span></li>
</ul>
<div>
<span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">We can break down these into two different roles</span></div>
<div>
<ul>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">An actor, whose job is to improve the policy function by tuning the policy function parameters</span></li>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">A critic, whose job is to fine tune the estimation of Q value based on current (incrementally improving) policy</span></li>
</ul>
<div>
<span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">The "actor critic" algorithm is shown below.</span></div>
</div>
<div>
<span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpXwHB2DgSvKx__hISXPp9suChzRvnNujneiNV7Lmu263ml5GXkJMM8vsFmGlWYkdPwXOBDKyrM-ZwapDBBdtsuikMg7sPX93jU2e9aCLki4EanZQ-huGAMAUwyHPZi1BsM3sWJnAsQiHr/s1600/Screen+Shot+2017-08-25+at+10.15.22+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="465" data-original-width="626" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpXwHB2DgSvKx__hISXPp9suChzRvnNujneiNV7Lmu263ml5GXkJMM8vsFmGlWYkdPwXOBDKyrM-ZwapDBBdtsuikMg7sPX93jU2e9aCLki4EanZQ-huGAMAUwyHPZi1BsM3sWJnAsQiHr/s400/Screen+Shot+2017-08-25+at+10.15.22+PM.png" width="400" /></a></div>
<div>
<span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";"><br /></span></div>
<div>
<span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">Then we enhance this algorithm by adding the following steps</span></div>
<div>
<ul>
<li><span style="font-family: "calibri" , "helvetica" , sans-serif , serif , "emojifont";">Replace the Q value function with an <b>Advantage</b> function, </span>where A(s, a) = Q(s, a) - Expected Q(s, *). ie: A(s, a) = Q(s, a) - V(s)</li>
<li>Run multiple thread Asynchronously</li>
</ul>
</div>
</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
This is the state of the art A3C algorithm.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWUiP6C-vVGt4W-1TKMxC2YwMk29mRgQoExNlb-XHojkJamSg14DtbdtmHxA5FabBHFZ0FqS8e1GPQm1LM9mFIwKnry7q7Z7Mxdr5bq16li6ohvUedbIO0Q4-oGP6bVt_8B0DkC0igcGNa/s1600/Screen+Shot+2017-08-25+at+10.56.36+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="552" data-original-width="812" height="270" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWUiP6C-vVGt4W-1TKMxC2YwMk29mRgQoExNlb-XHojkJamSg14DtbdtmHxA5FabBHFZ0FqS8e1GPQm1LM9mFIwKnry7q7Z7Mxdr5bq16li6ohvUedbIO0Q4-oGP6bVt_8B0DkC0igcGNa/s400/Screen+Shot+2017-08-25+at+10.56.36+PM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br /></div>
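<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
A minimal sketch of one advantage actor-critic update for a single transition, using a softmax policy and a linear state-value critic; all names (theta, w_v, the feature vectors, the probabilities) are illustrative placeholders rather than the paper's or the post's notation.</div>
<pre>
import numpy as np

def actor_critic_step(theta, w_v, s_feat, a, s2_feat, r, probs,
                      gamma=0.99, lr_actor=0.01, lr_critic=0.1, done=False):
    """theta: policy parameters (n_actions x n_features); w_v: value weights.
    probs is pi(. | s; theta) for the current state, supplied by the caller."""
    v_s = w_v @ s_feat
    v_s2 = 0.0 if done else w_v @ s2_feat
    # Critic: the TD error doubles as an estimate of the advantage A(s, a)
    advantage = r + gamma * v_s2 - v_s
    w_v = w_v + lr_critic * advantage * s_feat            # critic update
    # Actor: policy gradient step weighted by the advantage
    grad_log = np.outer(np.eye(len(probs))[a] - probs, s_feat)
    theta = theta + lr_actor * advantage * grad_log
    return theta, w_v
</pre>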
<h3 style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; margin-bottom: 0px; margin-top: 0px;">
<span style="font-size: large;">Learning resources and credits</span></h3>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br />
Some of the algorithms I discussed above are extracted from the following sources</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<ul>
<li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html" target="_blank">David Silver's excellent RL lecture</a></li>
<li><a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=0ahUKEwjtvbfJl_TVAhUM1GMKHX8nDNcQtwIIPzAE&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DaUrX-rP_ss4&usg=AFQjCNGU0p3ypyw08blK6cUmAE5xRk60pQ" target="_blank">John Schulman's RL tutorial</a></li>
<li><a href="https://arxiv.org/pdf/1602.01783.pdf" target="_blank">A3C algorithm paper</a></li>
</ul>
</div>
<div style="background-color: white; font-family: Calibri, Helvetica, sans-serif, serif, EmojiFont; font-size: 16px; margin-bottom: 0px; margin-top: 0px;">
<br /></div>
<div>
<span style="font-size: small;"><span style="font-size: 12pt;"><br /></span></span></div>
<h2>Regression model outputting probability density distribution (2017-07-15)</h2>
<span style="font-family: "times" , "times new roman" , serif;">For a classification problem (let's say the output is one of the labels R, G, B), how do we make a prediction?</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">There are two formats that we can report our prediction</span><br />
<ol>
<li><span style="font-family: "times" , "times new roman" , serif;">Output a single value which is most probable outcome. e.g. output </span>"B" if P(B) > P(R) and P(B) > P(G)</li>
<li><span style="font-family: "times" , "times new roman" , serif;">Output the probability estimation of each label. (e.g. R=0.2, G=0.3, B=0.4)</span></li>
</ol>
<div>
<span style="font-family: "times" , "times new roman" , serif;">But if we look at regression problem (lets say we output a numeric value v), most regression model only output a single value (that minimize the RMSE). In this article, we will look at some use cases where outputting a probability density function is much preferred.</span></div>
<div>
<br /></div>
<h3>
Predict the event occurrence time</h3>
As an illustrative example, we want to predict when a student will finish her work given that she has already spent some time s on it. In other words, we want to estimate E[t | t > s], where t is a random variable representing the total duration and s is the elapsed time so far.<br />
<br />
Estimating the time t is generally hard if the model only outputs an expectation. Notice that the model sees the same set of features, except that the elapsed time changes continuously as time passes.<br />
<br />
<span style="font-family: "times" , "times new roman" , serif;">Lets look at how we can train a prediction model that can output a density distribution.</span><br />
<br />
<span style="font-family: "times" , "times new roman" , serif;">Lets say our raw data schema: </span><span style="font-family: "times" , "times new roman" , serif; font-size: 16px;">[feature, duration]</span><br />
<div style="font-size: 16px;">
</div>
<ul>
<li><span style="font-family: "times" , "times new roman" , serif;">f1, 13.30</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f2, 14.15</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f3, 15.35</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f4, 15.42</span></li>
</ul>
<div style="font-size: 16px;">
<span style="font-family: "times" , "times new roman" , serif;">Take a look at the range (ie. min and max) of the output value. We transform into the training data of the following schema:</span></div>
<div style="font-size: 16px;">
<span style="font-family: "times" , "times new roman" , serif;">[feature, dur<13, dur<14, dur<15, dur<16]</span></div>
<div style="font-size: 16px;">
</div>
<div style="font-size: 16px;">
</div>
<ul>
<li><span style="font-family: "times" , "times new roman" , serif;">f1, 0, 1, 1, 1</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f2, 0, 0, 1, 1</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f3, 0, 0, 0, 1</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f4, 0, 0, 0, 1</span></li>
</ul>
After that, we train 4 classification models.<br />
<div style="font-size: 16px;">
</div>
<ul>
<li><span style="font-family: "times" , "times new roman" , serif;">feature, dur<13</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">feature, dur<14</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">feature, dur<15</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">feature, dur<16</span></li>
</ul>
<br />
<span style="font-family: "times" , "times new roman" , serif;"></span><br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; margin: 0px; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<span style="background-color: white; font-size: 16px;"><span style="font-family: "times" , "times new roman" , serif;">Now, given a new observation with corresponding feature, we can invoke these 4 model to output the probability of binary classification (cumulative probability). If we want the probability density, simply take the difference (ie: differentiation of cumulative probability).</span></span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; margin: 0px; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<span style="background-color: white; font-size: 16px;"><span style="font-family: "times" , "times new roman" , serif;"><br /></span></span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; margin: 0px; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<span style="font-family: "times" , "times new roman" , serif;"><span style="background-color: white;">At this moment, we can output a probability distribution given its input feature.</span></span></div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; margin: 0px; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<span style="font-family: "times" , "times new roman" , serif;"><span style="background-color: white;"><br /></span></span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz8y4XeaxcaAvpHR1c0P84h7MrZ5IsI_C4flOiq5Q0qh3dZthfm5fPkVvirdUEiZBr-yqsLzyqA45Kq8jHXBcIoB4wNu0N1HmDjokFz4MsazWAMUswWYpZtVQTBot7X7hx2cLVfvvmUYWv/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="490" data-original-width="986" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz8y4XeaxcaAvpHR1c0P84h7MrZ5IsI_C4flOiq5Q0qh3dZthfm5fPkVvirdUEiZBr-yqsLzyqA45Kq8jHXBcIoB4wNu0N1HmDjokFz4MsazWAMUswWYpZtVQTBot7X7hx2cLVfvvmUYWv/s400/Picture1.png" width="400" /></a></div>
<br />
Now, we can easily estimate the remaining time from the expected time in the shaded region. As time passes, we just need to slide the red line continuously and recalculate the expected time; we don't need to re-run the prediction models unless the input features have changed.<br />
<br />
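<div>
<span style="font-family: "times" , "times new roman" , serif;">Continuing the hypothetical helpers above, here is a rough sketch of that conditional expectation as the elapsed time slides forward; the bucket edges (and the use of bucket midpoints) are illustrative assumptions.</span></div>
<pre>
def expected_remaining(x_new, elapsed, bucket_edges=(12, 13, 14, 15, 16)):
    """E[t | t greater than elapsed] minus elapsed, using bucket midpoints."""
    pdf = density(x_new)                       # from the sketch above
    mids = np.array([(bucket_edges[i] + bucket_edges[i + 1]) / 2
                     for i in range(len(bucket_edges) - 1)])
    mask = mids > elapsed                      # keep only the shaded region t > s
    if pdf[mask].sum() == 0:
        return 0.0
    return (mids[mask] * pdf[mask]).sum() / pdf[mask].sum() - elapsed
</pre>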
<h3>
<span style="font-family: "times" , "times new roman" , serif;">Predict cancellation before commitment </span><span style="font-family: "times" , "times new roman" , serif;"><br /></span></h3>
<span style="font-family: "times" , "times new roman" , serif;">As an illustrative example, lets say a customer of restaurant has reserved a table at 8:00pm. Time now is 7:55pm and the customer still hasn't arrive, what is the chance of no-show ?</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">Now, given a person (with feature x), and current time is S - t (still hasn't bought the ticket yet), predict the probability of this person watching the movie.</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">Lets say our raw data schema: </span><span style="font-family: "times" , "times new roman" , serif; font-size: 16px;">[feature, arrival]</span><br />
<div style="font-size: 16px;">
</div>
<ul>
<li><span style="font-family: "times" , "times new roman" , serif;">f1, -15.42</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f2, -15.35</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f3, -14.15</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f4, -13.30</span></li>
<div style="font-size: 16px;">
</div>
<li><span style="color: red;">f5, infinity</span></li>
<li><span style="color: red; font-family: "times" , "times new roman" , serif;">f6, infinity</span></li>
</ul>
<div style="font-size: 16px;">
<span style="font-family: "times" , "times new roman" , serif;">We transform into the training data of the following schema:</span><br />
<div style="font-size: 16px;">
<span style="font-family: "times" , "times new roman" , serif;">[feature, arr<-16, arr<-15, arr<-14, arr<-13]</span></div>
<ul style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<li><span style="font-family: "times" , "times new roman" , serif;">f1, 0, 1, 1, 1</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f2, 0, 1, 1, 1</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f3, 0, 0, 1, 1</span></li>
<li><span style="font-family: "times" , "times new roman" , serif;">f4, 0, 0, 0, 1</span></li>
<li><span style="color: red;">f5, 0, 0, 0, 0</span></li>
<li><span style="color: red; font-family: "times" , "times new roman" , serif;">f6, 0, 0, 0, 0</span></li>
<div style="font-size: 16px;">
</div>
</ul>
</div>
<div style="font-size: 16px;">
<span style="font-family: "times" , "times new roman" , serif;">After that, we train 4 classification models.</span></div>
<ul style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<li><span style="font-family: "times" , "times new roman" , serif;">feature, </span>arr<-16</li>
<li><span style="font-family: "times" , "times new roman" , serif;">feature, </span>arr<-15</li>
<li><span style="font-family: "times" , "times new roman" , serif;">feature, </span>arr<-14</li>
<li><span style="font-family: "times" , "times new roman" , serif;">feature, </span>arr<-13</li>
</ul>
<span style="font-family: "times" , "times new roman" , serif;">Notice that P(arr<0) can be smaller than 1 because the customer can be no show.</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUwi6OkEogPgEryhl_MVlFdD_Uh78ULP1y5I8o4n1eGe5WbqOP1i3SkxBtNczABeZOkEfeHnFrE1bkhO8ZkCcCnQ6QuCpcolxbBKAl-JZfgooj1y1OuQWHXhHTDNOBxbyAP2xjfklPCuk_/s1600/Picture1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1015" data-original-width="1216" height="332" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUwi6OkEogPgEryhl_MVlFdD_Uh78ULP1y5I8o4n1eGe5WbqOP1i3SkxBtNczABeZOkEfeHnFrE1bkhO8ZkCcCnQ6QuCpcolxbBKAl-JZfgooj1y1OuQWHXhHTDNOBxbyAP2xjfklPCuk_/s400/Picture1.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">In this post, we discussed use cases where a regression model needs to output not just a point prediction but also a probability distribution, and we illustrated how to build such a prediction model.</span><br />
Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-3391812564655124022017-07-02T00:43:00.002-07:002017-07-02T00:44:28.448-07:00How AI differs from MLAI is not a new term; it is multiple decades old, dating back to the early 80s when computer scientists designed algorithms that can "learn" and "mimic human behavior".<br />
<br />
On the "learning" side, the most significant algorithm is Neural Network, which is not very successful due to overfitting (the model is too powerful but not enough data). Nevertheless, in some more specific tasks, the idea of "using data to fit a function" has gained significant success and this form the foundation of "machine learning" today.<br />
<br />
On the "mimic" side, people have focus in "image recognition", "speech recognition", "natural language processing", experts have been spending tremendous amount of time to create features like "edge detection", "color profile", "N-grams", "Syntax tree" ... etc. Nevertheless, the success is moderate.<br />
<br />
<h3>
Traditional Machine Learning</h3>
Machine Learning (ML) techniques have played a significant role in prediction, and ML has undergone multiple generations with a rich set of model structures, such as<br />
<br />
<ul>
<li>Linear regression</li>
<li>Logistic regression</li>
<li>Decision tree</li>
<li>Support Vector Machine</li>
<li>Bayesian model</li>
<li>Regularization model</li>
<li>Ensemble model</li>
<li>Neural network</li>
</ul>
<br />
Each of these predictive models is based on a certain algorithmic structure, with parameters as tunable knobs. Training a predictive model involves the following<br />
<br />
<ol>
<li>Choose a model structure (e.g. Logistic regression, or Random forest, or ...)</li>
<li>Feed the model with training data (with both input and output)</li>
<li>The learning algorithm will output the optimal model (i.e. the model with the specific parameter values that minimize the training error)</li>
</ol>
<br />
Each model has its own characteristics and will perform well in some tasks and badly in others. But generally, we can group them into low-power (simple) models and high-power (complex) models. Choosing between different models is a very tricky question.<br />
<br />
Traditionally, using a low power / simple model is preferred over the use of a high power / complex model for the following reasons<br />
<br />
<ul>
<li>Until we have massive processing power, training the high power model will take too long</li>
<li>Until we have a massive amount of data, training the high power model will cause the overfitting problem (since the high power model has rich parameters and can fit a wide range of data shapes, we may end up training a model that fits too specifically to the current training data and is not generalized enough to predict well on future data).</li>
</ul>
<br />
However, choosing a low power model suffers from the so-called "under-fit" problem, where the model structure is too simple to fit the training data when the underlying relationship is more complex. (Imagine the underlying data has a quadratic relationship y = 5 * x^2: there is no way to fit it with a linear regression y = a*x + b, no matter what a and b we pick).<br />
<br />
To mitigate the "under-fit problem", data scientist will typically apply their "domain knowledge" to come up with "input features", which has a more direct relationship with the output. (e.g. Going back to the quadratic relationship: y = 5 * square(x), if you create a feature z = x^2, then you can fit a linear regression: y = a*z + b, by picking a = 5 and b = 0)<br />
<br />
The major obstacle of "Machine Learning" is this "Feature Engineering" step which requires deep "domain experts" to identify important signals before feeding into training process. The feature engineering step is very manual and demands a lot of scarce domain expertise and therefore become the major bottleneck of most machine learning tasks today.<br />
<br />
In other words, if we don't have enough processing power and enough data, then we have to use the low-power / simpler model, which requires us to spend significant time and effort to create appropriate input features. This is where most data scientists spend their time today.<br />
<br />
<h3>
Return of Neural Network</h3>
In the early 2000s, machine processing power increased tremendously, with the advancement of cloud computing and massively parallel processing infrastructure, together with the big data era where massive amounts of fine-grained event data are being collected. We are no longer restricted to the low-power / simple model. For example, two of the most popular, mainstream machine learning models today are Random Forest and Gradient Boosting Trees. Nevertheless, although both of them are very powerful and provide non-linear model fitting to the training data, data scientists still need to carefully create features in order to achieve good performance.<br />
<br />
At the same time, computer scientists revisited the use of many-layer Neural Networks for these human-mimicking tasks. This gave a new birth to DNN (Deep Neural Network) and provided a significant breakthrough in image classification and speech recognition tasks. The major difference of DNN is that you can feed the raw signals (e.g. the RGB pixel values) directly into the DNN without creating any domain-specific input features. Through many layers of neurons (hence it is called a "deep" neural network), the DNN can "automatically" generate the appropriate features through each layer and finally provide a very good prediction. This significantly saves the "feature engineering" effort, a major bottleneck for data scientists.<br />
<br />
DNN has also evolved into many different network topologies, so we have CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short Term Memory), GAN (Generative Adversarial Network), Transfer Learning, the Attention Model ... etc. The whole spectrum is called Deep Learning, which is catching the whole machine learning community's attention today.<br />
<br />
<h3>
Reinforcement Learning</h3>
Another key component is about mimicking how a person (or animal) learns. Imagine the very natural animal behavior of the perceive/act/reward cycle. A person or animal first understands the environment by sensing what "state" he is in. Based on that, he picks an "action" which brings him to another "state". Then he receives a "reward". The cycle repeats until he dies. This way of learning (called "Reinforcement Learning") is quite different from the "curve fitting" approach of traditional supervised machine learning. In particular, learning in RL is very fast because every new piece of feedback (such as performing an action and receiving a reward) is used immediately to influence subsequent decisions. Reinforcement Learning has gained tremendous success in self-driving cars as well as AlphaGo (the Go-playing program).<br />
<br />
Reinforcement Learning also provides a smooth integration between "Prediction" and "Optimization", because it maintains a belief of the current state and the possible transition probabilities when taking different actions, and then decides which action can lead to the best outcome.<br />
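<br />
The perceive/act/reward cycle can be written down very compactly. Below is a minimal tabular Q-learning sketch (my own illustration, not from any specific library); the state/action counts and the learning parameters are arbitrary placeholders.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: tabular Q-learning for the perceive / act / reward cycle
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))         # belief about the value of each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.1       # learning rate, discount factor, exploration rate

def choose_action(state):
    # occasionally explore, otherwise act greedily on the current belief
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def learn(state, action, reward, next_state):
    # every new piece of feedback immediately updates the value estimate
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
</code></pre>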
<br />
<h3>
AI = DL + RL</h3>
<div>
Compared to classical ML techniques, DL provides a more powerful prediction model that usually produces good prediction accuracy. Compared to the classical optimization model using LP, RL provides a much faster learning mechanism and is also more adaptive to changes in the environment.<br />
<br /></div>
Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-73229494558648831442017-04-29T23:58:00.000-07:002017-04-30T00:02:25.446-07:00An output of a truly random process<span style="font-family: Times, Times New Roman, serif;">Recently I had a discussion with my data science team about whether we can tell if a set of observations follows a random process. Basically, data science is all about learning hidden patterns that affect the observations. If the observations follow a random process, then there is nothing we can learn. Let me walk through an example to illustrate.</span><br />
<span style="font-family: Times, Times New Roman, serif;"><span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">Let's say someone claims that he is throwing a fair die (with numbers 1 to 6) sequentially.</span></span><br />
<span style="background-color: white; font-size: 16px;"><span style="font-family: Times, Times New Roman, serif;">In other words, he claims the output of his dice throws is uniformly random, i.e. with an equal chance of getting each number from 1 to 6.</span></span><br />
<div style="background-color: white; font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="background-color: white; font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">He then throws the die 12 times and shows you the output sequence. From the output, can you judge whether this really is a sequence of fair die throws? In other words, does the output really follow a random process as expected?</span></div>
<div style="background-color: white; font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="background-color: white; font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">Let's look at 3 situations:</span></div>
<div style="background-color: white;">
<ul style="font-size: 16px;">
<li><span style="font-family: Times, Times New Roman, serif;">Situation 1 output is [4, 1, 3, 1, 2, 6, 3, 5, 5, 1, 2, 4]</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Situation 2 output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Situation 3 output is [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]</span></li>
</ul>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">At first glance, the output of situation 1 looks like it results from a random process. Situation 2 definitely doesn't look like it. Situation 3 is harder to judge. If you look at the proportions of the output numbers, the frequency of each output number in situation 3 definitely follows the uniform distribution of a fair die. But if you look at the number ordering, situation 3 follows a well-defined ordering that doesn't seem random at all. Therefore, I don't think the output of situation 3 follows a random process.</span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">However, this seems to be a very arbitrary choice. Why would I look at the number ordering at all? Should I look at more properties, such as ...</span></div>
<div style="font-size: 16px;">
<ul>
<li><span style="font-family: Times, Times New Roman, serif;">Whether the numbers at the even positions are even</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">The average gap between consecutive throws</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">Whether the number in the 3rd position is always smaller than the one in the 10th position</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">...</span></li>
</ul>
<div>
<span style="font-family: Times, Times New Roman, serif;">As you can see, depending on my imagination, the list can go on and on. How can I tell whether situation 3 follows a random process or not?</span></div>
</div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<h3 style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">
Method 1: Randomization Test</span></h3>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">This is based on the hypothesis testing methodology. We establish the null hypothesis H0 that situation 3 follows a random process.</span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">First, I define an arbitrary list of statistics of my choice:</span></div>
<div>
<ul>
<li><span style="font-family: Times, Times New Roman, serif;">statisticA = proportion of even numbers in even positions</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">statisticB = average gap between consecutive output numbers</span></li>
<li><span style="font-family: Times, Times New Roman, serif;">statisticC = ...</span></li>
</ul>
</div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">Second, I run a simulation to generate 12 numbers from a truly random process, and calculate the corresponding statistics defined above.</span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">Third, repeat the simulation N times, and output the mean and standard deviation of each statistic.</span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">If statistic A, B, or C of situation 3 is too far away from the simulated mean of that statistic (measured in standard deviations, i.e. based on the p-value), then we conclude that situation 3 does not follow a random process. Otherwise, we don't have enough evidence to show that our null hypothesis is violated, and so we accept that situation 3 follows a random process.</span></div>
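<br />
Below is a small sketch (my own illustration) of this randomization test using just one statistic, the proportion of even numbers at even positions; the number of simulations N is arbitrary.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: randomization test for situation 3 using one statistic
import numpy as np

observed = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]        # situation 3

def statistic_a(seq):
    evens = np.array(seq)[1::2]                         # numbers at the even positions (2nd, 4th, ...)
    return np.mean(evens % 2 == 0)

# simulate N sequences from the claimed random process (a fair die)
N = 10000
rng = np.random.default_rng(0)
sims = [statistic_a(rng.integers(1, 7, size=12)) for _ in range(N)]
mean, std = np.mean(sims), np.std(sims)

z = (statistic_a(observed) - mean) / std
print("observed statistic:", statistic_a(observed), "z-score:", z)
# a large |z| (small p-value) means we reject the null hypothesis of a random process
</code></pre>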
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<h3 style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">
Method 2: Predictability Test</span></h3>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">This is based on the theory of predictive analytics.</span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">First, I pick a particular machine learning algorithm, let's say a time-series forecast using ARIMA.</span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">Notice that I can also choose to use RandomForest and create some arbitrary input features (such as the previous output number, the maximum of the last 3 numbers ... etc.)</span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">Second, I train my selected predictive model on the output data of situation 3 (in this example, situation 3 has only 12 data points, but imagine we have many more).</span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
<div style="font-size: 16px;">
<span style="font-family: Times, Times New Roman, serif;">Third, I evaluate my model on the test set and see whether the prediction is much better than a random guess. For example, I can measure the lift of my model by comparing the RMSE (root mean square error) of my prediction against the standard deviation of the testing data. If the lift is insignificant, then I conclude that situation 3 results from a random process, because my predictive model didn't learn any pattern.</span></div>
</div>
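<br />
Here is a rough sketch (my own illustration) of the predictability test, using lag features and a RandomForest; the long repeating sequence stands in for a larger sample of situation 3.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: predictability test -- compare the model's RMSE to the spread of the test data
import numpy as np
from sklearn.ensemble import RandomForestRegressor

seq = np.tile([1, 2, 3, 4, 5, 6], 50)                  # pretend we observed 300 throws

# input features = the previous 3 numbers, target = the next number
X = np.array([seq[i - 3:i] for i in range(3, len(seq))])
y = seq[3:]

split = int(0.8 * len(X))
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[:split], y[:split])
pred = model.predict(X[split:])

rmse = np.sqrt(np.mean((pred - y[split:]) ** 2))
baseline = np.std(y[split:])
print("RMSE:", rmse, "baseline std:", baseline)
# an RMSE far below the baseline means the sequence is predictable, i.e. not random
</code></pre>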
<div style="background-color: white; font-size: 16px;">
<div>
<span style="font-family: Times, Times New Roman, serif;"><br /></span></div>
</div>
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-41851482872369293572015-08-25T23:11:00.002-07:002015-08-25T23:12:02.197-07:00Common techniques in optimizationOptimization is a frequently encountered problem in real life. We need to make a decision to achieve something within a set of constraints, and we want to maximize or minimize our objective based on some measurement. For example, a restaurant may need to decide how many workers (of each position) to hire to serve its customers, with the constraint that workers cannot work overtime, and with the objective of minimizing cost. A car manufacturer may need to decide how many units of each model to produce, within the constraint of the storage capacity of its warehouse, while maximizing the profit of its sales.<br />
<br />
<h3>
Exhaustive Search</h3>
If there aren't a lot of choices, an exhaustive search (e.g. breadth-first, depth-first, best-first, A* ... etc.) can be employed to evaluate each option, see if it meets all the constraints (i.e. is a feasible solution), and then compute its objective value. Then we sort all feasible solutions by their objective values and pick the solution that has the max (or min) objective value as our decision. Unfortunately, real world problems usually involve a large number (exponentially large, due to combinatorial explosion) of choices, making exhaustive search impractical in many cases.<br />
<br />
When this happens, two other solution approaches are commonly used.<br />
1) Mathematical Programming<br />
2) Greedy Local Search.<br />
<br />
<h3>
Mathematical Programming </h3>
Mathematical programming is a classical way to solve optimization problems. It is a family of approaches including linear programming, integer programming, quadratic programming and even non-linear programming. The development process usually goes through the following steps ...<br />
<br />
From a problem description, the modeler will express the problem into a mathematical structure containing 3 parts.<br />
<ul>
<li><b>Variables</b>: there are two types of variables. <i><b>Data variables</b></i> contain the current values of your business environment (e.g. cost of hiring a waiter, price of a car), and <i><b>decision variables</b></i> hold the decisions you make to optimize your objective (e.g. how many staff to hire, how many cars to make).</li>
<li><b>Constraints</b>: a set of rules that you cannot break. Effectively, constraints disallow certain combinations of your decision variables and are mainly used to filter out infeasible solutions. In a typical setting, constraints are expressed as a system of linear inequalities, where a linear expression of decision variables is specified on the left hand side and a value is specified on the right hand side of an inequality comparison.</li>
<li><b>Objective function</b>: it encapsulates the quantitative measure of how well our goal has been achieved. In a typical setting, the objective function is expressed as a single linear (or quadratic) combination of decision variables.</li>
</ul>
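As a concrete (and deliberately tiny) sketch of this three-part structure, here is a made-up staffing problem expressed with scipy's LP solver; the costs and limits are invented purely for illustration.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: decision variables, constraints and objective for a toy staffing problem
from scipy.optimize import linprog

# decision variables: x = [number of waiters, number of cooks]
# objective: minimize 100*waiters + 150*cooks (daily cost)
c = [100, 150]

# constraints, expressed as A_ub @ x <= b_ub:
#   waiters >= 4        ->  -waiters <= -4
#   cooks   >= 2        ->  -cooks   <= -2
#   waiters + cooks <= 10   (total headcount limit)
A_ub = [[-1, 0],
        [0, -1],
        [1, 1]]
b_ub = [-4, -2, 10]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, result.fun)    # the optimal staffing and its minimized cost
</code></pre>
<br />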
After the mathematical structure is defined, the modeler will submit it to a solver, which will output the best solution. The process can be depicted as follows.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgXie97e812Hnuxo_wiezdLlgzSQNL8z8NtEzfOdP2i1_MMFOhF8dRW2VqitNzKg-biVnd_Xb6n69gExjdotYEt5uJBrykvq4-ROP6GTBHE95NLOZrlp_LjoqkE2iaTplKRhltGG0TJzHS/s1600/P1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="286" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgXie97e812Hnuxo_wiezdLlgzSQNL8z8NtEzfOdP2i1_MMFOhF8dRW2VqitNzKg-biVnd_Xb6n69gExjdotYEt5uJBrykvq4-ROP6GTBHE95NLOZrlp_LjoqkE2iaTplKRhltGG0TJzHS/s400/P1.png" width="400" /></a></div>
<br />
Expressing the problem in this mathematical structure is the key design of the solution. There are many elements to be considered, which we describe below.<br />
<br />
The first consideration is how to represent your decision, especially whether the decision is a quantity (real number), a number of units (integer) or a binary decision (integer 0, 1).<br />
<br />
The next step is to represent the constraints as inequalities over linear combinations of decision variables. You need to think about whether each constraint is a hard or a soft constraint. Hard constraints should be expressed in the constraint part of the problem. Notice that the solver will not consider any solution once it violates any constraint. Therefore, if the solver cannot find a feasible solution that fulfills all constraints, it will simply abort. In other words, it won't return a solution that violates the smallest number of constraints. If you want the solver to tell you that, because you have room to relax them, you should model these soft constraints in the objective function rather than in the constraint section. Typically, you define an objective function that quantifies the degree of violation. The solver will then give you the optimal solution (violating the least number of constraints) rather than just telling you that no solution was found.<br />
<br />
Finally you define the objective function. In most real-life problems, you have multiple goals in mind (e.g. you want to minimize your customers' wait time in the queue, but you also want to minimize the cost of hiring staff). First you express each goal as a linear expression of decision variables and then take the weighted average among the different goals (which is also a linear combination of all decision variables) to form the final objective function. How to choose the weights among the different goals is a business question, based on how much you are willing to trade off between conflicting goals.<br />
<br />
There are some objectives that cannot be expressed as a linear expression of decision variables. One frequently encountered example is minimizing the absolute deviation from a target value (i.e. regardless of whether the deviation is positive or negative). A common way is to minimize the sum of squares of the differences. But after we square it, the objective is no longer a linear expression. To address this requirement, there is a more powerful class of solvers called "quadratic programming" which relaxes the objective function to allow degree-2 polynomial expressions.<br />
<br />
After you have expressed the problem in the mathematical structure, you can pass it to a solver (there are many open source and commercial solvers available) which you can treat as a magical black box. The solver will output an optimal solution (with a value assigned to each decision variable) that fulfills all constraints and maximizes (or minimizes) the objective function.<br />
<br />
Once you receive the solution from the solver, you need to evaluate how "reliable" this optimal solution is. There may be fluctuations in the data variables due to collection errors, the data may have a volatile value that changes rapidly, or the data may be an estimate of another unknown quantity.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsbfawtObejLU5Z0IACY8SbRaHd1NNfLmCs0dfnRUhsWIucCfYp3dQTqgYSBRYeYmMe-xgjq4JRitaQe_Frk2AG52iHqJ06UTN9odweNw_At1MBM367OZ0_j2tFXc8UgDuCBMQsF4LCOzE/s1600/P1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="312" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsbfawtObejLU5Z0IACY8SbRaHd1NNfLmCs0dfnRUhsWIucCfYp3dQTqgYSBRYeYmMe-xgjq4JRitaQe_Frk2AG52iHqJ06UTN9odweNw_At1MBM367OZ0_j2tFXc8UgDuCBMQsF4LCOzE/s400/P1.png" width="400" /></a></div>
<br />
<br />
Ideally, the optimal solution doesn't change much when we perturb the data values within their error bounds. In this case, our optimal solution is stable against errors in our data variables and therefore is a good one. However, if the optimal solution changes drastically when the data variables fluctuate, we say the optimal solution is unreliable and cannot be used. In that case, we usually modify each data variable one at a time to figure out which one is causing a big swing in the optimal solution, and try to reduce our estimation error of that data variable.<br />
<br />
The <i><b>sensitivity analysis</b></i> is an important step to evaluate the stability and hence the quality of our optimal solution. It also provides guidance on which area we need to invest effort to make the estimation more accurate.<br />
<br />
Mathematical Programming allows you to specify your optimization problem in a very declarative manner and also outputs an optimal solution if one exists. It should be the first approach to try. The downside of Mathematical Programming is that it requires linear constraints and linear (or quadratic) objectives. It also has limits in terms of the number of decision variables and constraints that it can handle (and this limitation varies among different implementations). Although there are non-linear solvers, the number of variables they can take is even smaller.<br />
<br />
Usually the first thing to do is to test whether your problem is small enough to fit in a mathematical programming solver. If not, you may need to use another approach.<br />
<br />
<h3>
Greedy Local Search</h3>
The key words here are "local" and "greedy". Greedy local search starts at a particular selection of decision variables, and then it look around surrounding neighborhood (hence the term "local") and within that find the best solution (hence the term "greedy"). Then it moves the current point to this best neighbor and then the process repeats until the new solution stays at the same place (ie. there is no better neighbor found).<br />
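<br />
The loop just described is tiny in code. Below is a bare-bones sketch (my own illustration) of greedy local search on an integer decision vector; the toy objective and the +/-1 neighborhood are arbitrary placeholders for a real problem.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: greedy local search (hill climbing) over a toy objective
def objective(x):
    return -((x[0] - 3) ** 2 + (x[1] + 1) ** 2)       # toy objective to maximize

def neighbors(x):
    # the "neighborhood": move one decision variable by +/- 1
    for i in range(len(x)):
        for delta in (-1, 1):
            y = list(x)
            y[i] += delta
            yield tuple(y)

def greedy_local_search(start):
    current = start
    while True:
        best = max(neighbors(current), key=objective)
        if objective(best) <= objective(current):     # no better neighbor: stop (possibly at a local optimum)
            return current
        current = best

print(greedy_local_search((0, 0)))                    # converges to (3, -1) for this toy objective
</code></pre>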
<br />
If the initial point is a non-feasible solution, the greedy search will first try to find a feasible solution by looking for neighbors that have fewer constraint violations. After a feasible solution is found, the greedy search will only look for neighbors that fulfill all constraints, and among those it finds the neighbor with the best objective value. Another good initialization strategy is to choose the initial point to be a feasible (of course not optimal) solution and then start the local search from there.<br />
<br />
Because local search limits its search within a neighborhood, it can control the degree of complexity by simply limiting the scope of the neighborhood. Greedy local search can evaluate a large number of variables by walking across a chain of neighborhoods. However, because local search only navigates towards the best neighbor within a limited scope, it loses the overall picture and relies on the path being convex. If there are valleys along the path, the local search will stop there and never reach the global optimum. This is called a local optimum trap, because the search is trapped within a local maximum and not able to escape. There are <a href="http://horicky.blogspot.com/2013/12/escape-local-optimum-trap.html" target="_blank">some techniques to escape from a local optimum</a> that I describe in my <a href="http://horicky.blogspot.com/2013/12/escape-local-optimum-trap.html" target="_blank">previous blog post</a>.<br />
<br />
When performing greedy local search, it is important to pick the right greedy function (also called heuristic function) as well as the right neighborhood. Although it is common to choose the greedy function to be the same objective function itself, they don't have to be the same. In many good practices, the greedy function is chosen to be the one that has high correlation with the objective function, but can be computed much faster. On the other hand, evaluating objective function of every point in the neighborhood can also be a very expensive operation, especially when the data has a high dimension and the number of neighbors can be exponentially large. A good neighbor function can be combined with a good greedy function such that we don't need to evaluate each neighbor to figure out which one has the best objective value.<br />
<br />
<h3>
Combining the two approaches</h3>
In my experience, combining the two approaches can be a very powerful solution in itself. At the outer layer, we use greedy local search to navigate towards better solutions. However, when we search for the best neighbor within the neighborhood, we use mathematical programming, taking the greedy function as the objective function and using the constraints directly. This way, we use a very effective approach to search for the best solution within the neighborhood, and we use the flexible neighborhood scoping of local search to control the complexity of the mathematical programming solver.<br />
<br />Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-87876607481970357492015-06-27T09:28:00.002-07:002015-06-27T09:53:43.084-07:00When machine replace human<div id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<span id="yui_3_16_0_1_1435385723542_38876">Recently, a good friend sent me an article from Harvard Business Review called "Beyond Automation", written by Thomas H. Davenport and Julia Kirby. The article talks about how automation affects our workforce and displaces value from human workers. It proposes 5 strategies for how we can prepare to retain our competitiveness in the automation era. This is a very good article and it triggered a lot of thoughts for me.</span><br />
<br />
I want to explore a fundamental question: "Can machines replace humans in the future?"</div>
<div id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<br /></div>
<div id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
Let's start by looking at what machines are doing and not doing today. Machines operate under a human's program, and therefore they can only solve those problems that we humans can express or codify in a structured form. Don't underestimate the power underneath, though. With good abstract thinking, the smartest humans in the world have partitioned a large number of problems (by their nature) into different problem categories. Each category is expressed in the form of a "generic problem", and subsequently a "general solution" is developed. Notice that computer scientists have been doing this for many decades, and have come up with powerful algorithms such as "sorting", "finding the shortest path", "heuristic search" ... etc.</div>
<div id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<br /></div>
<div id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
By grouping concrete problems by their nature into a "generic, abstract problem", we can significantly reduce the volume of cases/scenarios while still covering a large area of ground. The "generic solution" we developed can also be specialized for each concrete problem scenario. After that we can develop a software program which can be executed on a large cluster of machines equipped with fast CPUs and a lot of memory. Compare this automated solution with what a human can do in a manual fashion. In these areas, once problems are well-defined and solutions are automated by software programs, computers with much more powerful CPUs and memory will always beat humans by many orders of magnitude. There is no question that human jobs in these areas will be eliminated.</div>
<div id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<br /></div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
In terms of capturing our experience using abstract data structures and algorithms, computer scientists are very far from done. There is still a very large body of problems that even the smartest humans haven't completely figured out how to put in a structured form yet. Things that involve "perception", "intuition", "decision making", "estimation", and "creativity" are primarily done today by humans. I believe these types of jobs will continue to be done by human workers over the next decade. On the other hand, with our latest technology research, we continuously push the boundary of automation into some of these areas. "Face recognition" and "voice recognition", which involve a high degree of perception, can now be done very accurately by software programs. With "machine learning" technology, we can do "prediction" and make judgements in a more objective way than a human. Together with "planning" and "optimization" algorithms, a large percentage of decision making can be automated, and the result is usually better because it is made in a less biased and more data-driven manner.</div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<br /></div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
However, in these forefront areas where the latest software technology is unable to automate every step, the human is needed in the loop to make the final decision, or to intervene in exceptional situations that the software is not programmed to handle. There are jobs where a human and a machine can work together to produce a better outcome. This is what is called "augmentation" in the article. Some examples are artists using advanced software to touch up their photos, using computer graphics to create movies, using machine learning for genome sequence processing, using robots to perform surgery, driverless vehicles ... etc.</div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<br /></div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
Whether computer programs can replace humans completely remains to be seen, but I don't think this will happen in the next 2 decades. We humans are unique and good at perceiving things with multiple levels of abstraction from different angles. We are good at connecting the dots between unrelated areas. We can invent new things. These are things that machines will find very hard to do, or at least will take a long time to do, if they are possible at all.</div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="-webkit-text-stroke-width: 0px; color: black; font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; margin-bottom: 0.1em; margin-top: 0.1em; orphans: auto; padding: 0px; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
</div>
<br />
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="-webkit-text-stroke-width: 0px; color: black; font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; margin-bottom: 0.1em; margin-top: 0.1em; orphans: auto; padding: 0px; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<div style="margin: 0px;">
"When can we program a machine that can write program ?"</div>
</div>
<br /></div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
The HBR article suggests a person can consider five strategies (step up, step aside, step in, step narrowly and step forward) to retain value in the automation era. I favor the "step forward" strategy because the person is driving the trend rather than passively reacting to it. Looking back at our history, the human value system has shifted across the industrial revolution, the internet revolution, etc. At the end of the day, it is more-sophisticated humans who take away jobs (and value) from other, less-sophisticated humans. And it is always the people who drive the movement who are the winners of this value shift. It happened in the past and will continue into the future.</div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<br /></div>
<div data-setdir="false" dir="ltr" id="yui_3_16_0_1_1435385723542_22548" style="font-family: HelveticaNeue, 'Helvetica Neue', Helvetica, Arial, 'Lucida Grande', sans-serif; font-size: 13px; margin-bottom: 0.1em; margin-top: 0.1em; padding: 0px;">
<br /></div>
Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-47580673662838189782015-02-22T23:42:00.001-08:002015-02-22T23:43:43.437-08:00Big Data Processing in SparkIn the traditional 3-tier architecture, data processing is performed by the application server while the data itself is stored in the database server. The application server and the database server are typically two different machines. Therefore, the processing cycle proceeds as follows<br />
<ol>
<li>The application server sends a query to the database server to retrieve the necessary data</li>
<li>The application server performs processing on the received data</li>
<li>The application server saves the changed data back to the database server</li>
</ol>
In the traditional data processing paradigm, <i><b>we move data to the code</b></i>.<br />
It can be depicted as follows ...<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2ykcNxSifUH7n3Xcvcch499xgDPIkIL_kBlTqcvj-_TPHKDg1cVnD8008SUkRNACYiPL1Lznpx10exlISCpni26mEj4_gz6nWxd_MHFWkgPZOt5tHjiGIeI6ORPXre_fKcgVqxBmAA7Ap/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2ykcNxSifUH7n3Xcvcch499xgDPIkIL_kBlTqcvj-_TPHKDg1cVnD8008SUkRNACYiPL1Lznpx10exlISCpni26mEj4_gz6nWxd_MHFWkgPZOt5tHjiGIeI6ORPXre_fKcgVqxBmAA7Ap/s1600/p1.png" height="165" width="320" /></a></div>
<br />
<br />
Then the big data phenomenon arrives. Because the data volume is huge, it cannot be held by a single database server. Big data is typically partitioned and stored across many physical DB server machines. On the other hand, application servers need to be added to increase the processing power for big data.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKqIjyAmPrBOQmxx0hiV4O6UprzgAg4DDN-pyUPWP9dWUAR6ToeGI9CdSB19EE7ZqDwD5z_81rues1wlM93-ANJkO2BfdfUF_VXGyoP_Iv1Mwmi2HW2KoevaW13bEamGB8QSvYb-V9dHd9/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKqIjyAmPrBOQmxx0hiV4O6UprzgAg4DDN-pyUPWP9dWUAR6ToeGI9CdSB19EE7ZqDwD5z_81rues1wlM93-ANJkO2BfdfUF_VXGyoP_Iv1Mwmi2HW2KoevaW13bEamGB8QSvYb-V9dHd9/s1600/p1.png" height="235" width="400" /></a></div>
<br />
However, as we increase the number of app servers and DB servers for storing and processing the big data, more data needs to be transferred back and forth across the network during the processing cycle, up to a point where the network becomes a major bottleneck.<br />
<br />
<h3>
Moving code to data</h3>
To overcome the network bottleneck, we need a new computing paradigm. Instead of moving the data to the code, we move the code to the data and perform the processing where the data is stored.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgKAuQkCO99qDxVchMkTabG72NUQ8XYmdNZkMz_Yj_rLOYLOCNhD4pRjZHTgzAbL5Y9wwKIPEbsEDa0W_SepWF2dNhwfvBSS4FPXt3eCb4B8dzF0V7Do101QHjt4rXjfemeIvbDzIOXegy/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgKAuQkCO99qDxVchMkTabG72NUQ8XYmdNZkMz_Yj_rLOYLOCNhD4pRjZHTgzAbL5Y9wwKIPEbsEDa0W_SepWF2dNhwfvBSS4FPXt3eCb4B8dzF0V7Do101QHjt4rXjfemeIvbDzIOXegy/s1600/p1.png" height="243" width="400" /></a></div>
<br />
<br />
Notice the change of the program structure<br />
<ul>
<li>The program execution starts at a driver, which orchestrates the execution happening remotely across many worker servers within a cluster.</li>
<li>Data is no longer transferred to the driver program; the driver program holds a data reference in its variable rather than the data itself. The data reference is basically an id to locate the corresponding data residing in the database server</li>
<li>Code is shipped from the program to the database server, where the execution happens, and data is modified at the database server without leaving the server machine.</li>
<li>Finally the program requests a save of the modified data. Since the modified data resides in the database server, no data transfer happens over the network.</li>
</ul>
By moving the code to the data, the volume of data transferred over the network is significantly reduced. This is an important paradigm shift for big data processing.<br />
<br />
In the following section, I will use Apache Spark to illustrate how this big data processing paradigm is implemented.<br />
<br />
<h3>
RDD</h3>
Resilient Distributed Dataset (RDD) is how Spark implements the data reference concept. RDD is a logical reference of a dataset which is partitioned across many server machines in the cluster.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigbQVz_uA6lqk56q8Rdgsvms5fMa21iK3Ry5mxATpUpkjjQKo1Djyw98Gh1IGVMTz35rLb9kQbb825OSKd-SGnj8JvfwTGftmm_g3_mwRtgLVuhy2U1w2YTn0TRHWd4c-Eh-IxB51DXoou/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigbQVz_uA6lqk56q8Rdgsvms5fMa21iK3Ry5mxATpUpkjjQKo1Djyw98Gh1IGVMTz35rLb9kQbb825OSKd-SGnj8JvfwTGftmm_g3_mwRtgLVuhy2U1w2YTn0TRHWd4c-Eh-IxB51DXoou/s1600/p1.png" height="194" width="200" /></a></div>
<br />
To make a clear distinction between data reference and data itself, a Spark program is organized as a sequence of execution steps, which can either be a "transformation" or an "action".<br />
<br />
<h3>
Programming Model </h3>
A typical program is organized as follows<br />
<ol>
<li>From an environment variable "context", create some initial data reference RDD objects</li>
<li>Transform the initial RDD objects to create more RDD objects. Transformation is expressed in terms of functional programming, where a code block is shipped from the driver program to multiple remote worker servers, each of which holds a partition of the RDD. A variable appearing inside the code block can either be an item of the RDD or a local variable inside the driver program which gets serialized over to the worker machine. After the code (and the copies of the serialized variables) is received by the remote worker server, it will be executed there by feeding it the items of the RDD residing in that partition. Notice that the result of a transformation is a brand new RDD (the original RDD is not mutated)</li>
<li>Finally, the RDD object (the data reference) will need to be materialized. This is achieved through an "action", which will dump the RDD into a storage, or return its value data to the driver program.</li>
</ol>
Here is a word count example <br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Get the initial RDD from the context
file = spark.textFile("hdfs://...")
# Three consecutive transformations of the RDD
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
# Materialize the RDD using an action
counts.saveAsTextFile("hdfs://...")
</code></pre>
<br />
When the driver program starts its execution, it builds up a graph where nodes are RDDs and edges are transformation steps. However, no execution happens on the cluster until an action is encountered. At that point, the driver program will ship the execution graph as well as the code block to the cluster, where every worker server will get a copy.<br />
<br />
The execution graph is a DAG.<br />
<ul>
<li>Each DAG is an atomic unit of execution. </li>
<li>Each source node (no incoming edge) is an external data source or driver memory</li>
<li>Each intermediate node is an RDD</li>
<li>Each sink node (no outgoing edge) is an external data source or driver memory</li>
<li>Green edge connecting to RDD represents a transformation. Red edge connecting to a sink node represents an action</li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzjurS4A9q4CBYDTmFx0BhvVbhgVZI0Qx2xm7NIaYueBtRpZH724r8_zeNo1TuznT1L4-Y7Fnk2-lFOhQk8VUi7LgV8IRxh8o77_2kntbHm69pPI8FSNp59pRrQvgr0ZM39pDYpUd8AdPJ/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzjurS4A9q4CBYDTmFx0BhvVbhgVZI0Qx2xm7NIaYueBtRpZH724r8_zeNo1TuznT1L4-Y7Fnk2-lFOhQk8VUi7LgV8IRxh8o77_2kntbHm69pPI8FSNp59pRrQvgr0ZM39pDYpUd8AdPJ/s1600/p1.png" height="320" width="227" /></a></div>
<br />
<br />
<h3>
Data Shuffling</h3>
Although we ship the code to the worker servers where the data processing happens, data movement cannot be completely eliminated. For example, if the processing requires data residing in different partitions to be grouped first, then we need to shuffle data among the worker servers.<br />
<br />
Spark carefully distinguishes between two types of "transformation" operations.<br />
<ul>
<li>"Narrow transformation" refers to the processing where the processing logic depends only on data that is already residing in the partition and data shuffling is unnecessary. Examples of narrow transformation includes filter(), sample(), map(), flatMap() .... etc.</li>
<li>"Wide transformation" refers to the processing where the processing logic depends on data residing in multiple partitions and therefore data shuffling is needed to bring them together in one place. Example of wide transformation includes groupByKey(), reduceByKey() ... etc.</li>
</ul>
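For illustration, here is a hedged sketch of the word count example again, annotated with which steps are narrow and which are wide (the input path is a placeholder).<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: narrow vs. wide transformations in the word count example
from pyspark import SparkContext

sc = SparkContext(appName="narrow-vs-wide")
lines = sc.textFile("hdfs://...")                      # the path is a placeholder

words = lines.flatMap(lambda line: line.split(" "))    # narrow: stays within each partition
pairs = words.map(lambda word: (word, 1))              # narrow: stays within each partition
counts = pairs.reduceByKey(lambda a, b: a + b)         # wide: shuffles data across partitions
counts.saveAsTextFile("hdfs://...")                    # action that triggers the execution
</code></pre>
<br />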
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQAPEo78CFomtw2GGwjS0L8Muaq_5u4AJzKIOj4aUIOJ6WER4UuiP2JmRYWokk6rIhe_icQ0LDKJ7FTwj0azQOt5FVxWg1T6qhpiG0wmB83wuwxPmEtZdSKEU96sMucZbaJsTNKMG6f6NU/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQAPEo78CFomtw2GGwjS0L8Muaq_5u4AJzKIOj4aUIOJ6WER4UuiP2JmRYWokk6rIhe_icQ0LDKJ7FTwj0azQOt5FVxWg1T6qhpiG0wmB83wuwxPmEtZdSKEU96sMucZbaJsTNKMG6f6NU/s1600/p1.png" height="255" width="400" /></a></div>
<br />
<br />
Joining two RDDs can also affect the amount of data being shuffled. Spark provides two ways to join data. In a shuffle join implementation, data of the two RDDs with the same key will be redistributed to the same partition. In other words, each of the items in each RDD will be shuffled across worker servers.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfjn6DPCMxSUVpmc89rkiFQ1l1MROXj14X-2-XUWNJXXuKzgRg8GbZK6xUO7Ll_S-KThg06NS7xuB3Dc0jdvta02DLgu8_SXfgUEA3A5Fqt35VMFUPy9eqHbOHgTUe38DvGvV9lTAcgib2/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfjn6DPCMxSUVpmc89rkiFQ1l1MROXj14X-2-XUWNJXXuKzgRg8GbZK6xUO7Ll_S-KThg06NS7xuB3Dc0jdvta02DLgu8_SXfgUEA3A5Fqt35VMFUPy9eqHbOHgTUe38DvGvV9lTAcgib2/s1600/p1.png" height="242" width="400" /></a></div>
<br />
Besides shuffle join, Spark provides another alternative called broadcast join. In this case, one of the RDDs will be broadcast and copied over to every partition. Imagine the situation where one of the RDDs is significantly smaller than the other; then broadcast join will reduce the network traffic because only the small RDD needs to be copied to all worker servers, while the large RDD doesn't need to be shuffled at all.<br />
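<br />
A rough sketch of the broadcast (map-side) join pattern is shown below; small_rdd and large_rdd are hypothetical (key, value) RDDs, and sc is an existing SparkContext.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: broadcast (map-side) join -- only the small RDD travels over the network
small = dict(small_rdd.collect())           # pull the small RDD back to the driver as a dict
small_bc = sc.broadcast(small)              # broadcast the dict to every worker

# the large RDD is joined with a simple map(); it never needs to be shuffled
joined = large_rdd.map(lambda kv: (kv[0], (kv[1], small_bc.value.get(kv[0]))))
</code></pre>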
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBzpNpAI3jXhhljUJHRn42ZACuDkm8oOKOPjzBlv5m_HQDC55J5BISPrRKT3WJIhZQF1WhB7VlxIw-ZtzmquSrTnIJIuqLM3nB7qLzaWgNgGJCg47Z8Mz0QFIkkudFm3BBq-GwLh5sTd2m/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBzpNpAI3jXhhljUJHRn42ZACuDkm8oOKOPjzBlv5m_HQDC55J5BISPrRKT3WJIhZQF1WhB7VlxIw-ZtzmquSrTnIJIuqLM3nB7qLzaWgNgGJCg47Z8Mz0QFIkkudFm3BBq-GwLh5sTd2m/s1600/p1.png" height="242" width="400" /></a></div>
<br />
<br />
In some cases, transformations can be re-ordered to reduce the amount of data shuffling. Below is an example of a JOIN between two huge RDDs followed by filtering.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ0cGMHrQ1aKkTN_w56RLiNP9viUPY2EU6WA01x_leSPhmetDnTbyNNEyvd0ykGdxwrIgE3eSfUtbOjp4-30Vi4XEkkfgV8S5UnotT4WRHcj8-nDbxGAmY6XzwsDub2DEMogfjzI-MHZOM/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ0cGMHrQ1aKkTN_w56RLiNP9viUPY2EU6WA01x_leSPhmetDnTbyNNEyvd0ykGdxwrIgE3eSfUtbOjp4-30Vi4XEkkfgV8S5UnotT4WRHcj8-nDbxGAmY6XzwsDub2DEMogfjzI-MHZOM/s1600/p1.png" height="260" width="400" /></a></div>
<br />
<br />
Plan 1 is a naive implementation which follows the given order. It first joins the two huge RDDs and then applies the filter on the join result. This ends up causing a big data shuffle, because the two RDDs are huge, even though the result after filtering is small.<br />
<br />
Plan 2 offers a smarter way by using the "predicate push-down" technique, where we first apply the filtering to both RDDs before joining them. Since the filtering reduces the number of items in each RDD significantly, the join processing will be much cheaper.<br />
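<br />
In RDD code, the two plans look roughly like this (a hedged sketch; rdd_a, rdd_b and wanted_keys are placeholders, and the filter here depends only on the key, so both plans return the same result).<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: re-ordering a filter before a join to reduce shuffling
wanted_keys = {1, 2, 3}

# Plan 1: join first, then filter -- both huge RDDs are shuffled for the join
plan1 = rdd_a.join(rdd_b).filter(lambda kv: kv[0] in wanted_keys)

# Plan 2: push the predicate down -- filter each RDD first, then join the much smaller RDDs
plan2 = rdd_a.filter(lambda kv: kv[0] in wanted_keys) \
             .join(rdd_b.filter(lambda kv: kv[0] in wanted_keys))
</code></pre>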
<br />
<h3>
Execution planning </h3>
As explained above, data shuffling incurs the most significant cost in the overall data processing flow. Spark provides a mechanism that generates an execution plan from the DAG that minimizes the amount of data shuffling.<br />
<ol>
<li>Analyze the DAG to determine the order of transformations. Notice that we start from the action (terminal node) and trace back to all dependent RDDs.</li>
<li>To minimize data shuffling, we group the narrow transformations together in a "stage" where all transformation tasks can be performed within the partition and no data shuffling is needed. The transformations become tasks that are chained together within a stage</li>
<li>Wide transformations sit at the boundary of two stages, which requires data to be shuffled to a different worker server. When a stage finishes its execution, it persists the data into different files (one per partition) on the local disks. Worker nodes of the subsequent stage will come to pick up these files, and this is where data shuffling happens </li>
</ol>
Below is an example of how the execution planning turns the DAG into an execution plan involving stages and tasks.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIQm-Qffx6ZNmkYF4F-aaRNoHLYxOFkguvCtjg8zTq4Gp0hEV-LogsPXt69zWt3iw24AcBiu35FhfUSvqDj1pztZimzx3PrWWFb65hfv0VTxiWT2rXhL5FL8f57std7xk1DZHe6untxUve/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIQm-Qffx6ZNmkYF4F-aaRNoHLYxOFkguvCtjg8zTq4Gp0hEV-LogsPXt69zWt3iw24AcBiu35FhfUSvqDj1pztZimzx3PrWWFb65hfv0VTxiWT2rXhL5FL8f57std7xk1DZHe6untxUve/s1600/p1.png" height="320" width="274" /></a></div>
<br />
<h3>
Reliability and Fault Resiliency </h3>
Since the DAG defines deterministic transformation steps between the different partitions of data within each RDD, fault recovery is very straightforward. Whenever a worker server crashes during the execution of a stage, another worker server can simply re-execute the stage from the beginning by pulling the input data from its parent stage, which has its output data stored in local files. In case the result of the parent stage is not accessible (e.g. the worker server lost the file), the parent stage needs to be re-executed as well. Imagine this as a lineage of transformation steps: any failure of a step triggers a restart of execution from its last completed step.<br />
<br />
Since the DAG itself is an atomic unit of execution, all the RDD values will be forgotten after the DAG finishes its execution. Therefore, after the driver program finishes an action (which executes a DAG to its completion), all the RDD values will be forgotten, and if the program accesses the RDD again in a subsequent statement, the RDD needs to be recomputed from its dependents. To reduce this repetitive processing, Spark provides a caching mechanism to remember RDDs in worker server memory (or on local disk). Once the execution planner finds that an RDD is already cached in memory, it will use the RDD right away without tracing back to its parent RDDs. This way, we prune the DAG once we reach an RDD that is in the cache.<br />
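<br />
A small sketch of the caching mechanism (assuming an existing SparkContext named sc; the path is a placeholder):<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Sketch: cache() keeps the RDD in worker memory so later actions don't recompute it
words = sc.textFile("hdfs://...").flatMap(lambda line: line.split(" "))
words.cache()                       # mark the RDD to be kept in memory after its first computation

print(words.count())                # first action: computes the RDD and populates the cache
print(words.distinct().count())     # second action: reuses the cached RDD instead of re-reading the file
</code></pre>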
<br />
<br />
Overall, Apache Spark provides a powerful framework for big data processing. With the caching mechanism that holds previous computation results in memory, Spark outperforms Hadoop significantly because it doesn't need to persist all the data to disk for each round of parallel processing. Although it is still very new, I think Spark will take off as the mainstream approach to process big data.<br />
<br />Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-892426961423521792014-11-28T22:11:00.000-08:002014-11-28T22:12:07.068-08:00Spark StreamingIn this post, we'll discuss another important topic of big data processing: real-time stream processing. This is an area where Hadoop falls short because of its high latency, and another open source framework, Storm, was developed to cover the need for real-time processing. Unfortunately, Hadoop and Storm provide quite different programming models, resulting in high development and maintenance costs.<br />
<br />
Continuing from <a href="http://horicky.blogspot.com/2013/12/spark-low-latency-massively-parallel.html" target="_blank">my previous post on Spark</a>, which provides a highly efficient parallel processing framework, Spark Streaming is a natural extension of the core programming paradigm to provide large-scale, real-time data processing. The biggest benefit of using Spark Streaming is that it is based on a similar programming paradigm to the core, and there is no need to develop and maintain a completely different programming paradigm for batch and real-time processing.<br />
<br />
<br />
<h3>
Spark Core Programming Paradigm Recap</h3>
The core Spark programming paradigm consists of the following steps ...<br />
<ol>
<li>Take input data from an external data source and create an RDD (a distributed data set across many servers)</li>
<li>Transform the RDD into another RDD (these transformations define a directed acyclic graph of dependencies between RDDs)</li>
<li>Output the final RDD to an external data source</li>
</ol>
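A minimal PySpark word-count sketch of these three steps, assuming an existing SparkContext sc and hypothetical HDFS paths:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>lines  = sc.textFile("hdfs:///input/events")            # 1. create an RDD from an external source
counts = (lines.flatMap(lambda l: l.split())            # 2. transformations define the DAG
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///output/word_counts")     # 3. output the final RDD (action)
</code></pre>
<br />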
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVX-orY2Ag8_fRAItH6Pafi-jVob0G0qvGgrKbEbz5pxvSOZgjhvqmmWXlrEhgLEzsUD7Y3FrOxoBAX5jdeeyEyrTN9MmNN6-2SndcRJuP9r3BbbU7UnuQFmFiTB9d2_PVTSvsTOWbTdem/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVX-orY2Ag8_fRAItH6Pafi-jVob0G0qvGgrKbEbz5pxvSOZgjhvqmmWXlrEhgLEzsUD7Y3FrOxoBAX5jdeeyEyrTN9MmNN6-2SndcRJuP9r3BbbU7UnuQFmFiTB9d2_PVTSvsTOWbTdem/s1600/p1.png" height="200" width="133" /></a></div>
<br />
Notice that the RDD is immutable, so the sequence of transformations is deterministic; recovery from an intermediate processing failure is simply a matter of tracing back to the parent of the failed node (in the DAG) and redoing the processing from there.<br />
<br />
<br />
<h3>
Spark Streaming</h3>
Spark Streaming introduces a data structure called DStream, which is basically a sequence of RDDs where each RDD contains the data associated with a time interval. A DStream is created with a frequency parameter which defines how often a new RDD is appended to the sequence.<br />
<br />
Transformation of a DStream boils down to transformation of each RDD (within the sequence of RDDs that the DStream contains). Within the transformation, the RDDs inside the DStream can "join" with another RDD (outside the DStream), hence providing a mixed processing paradigm between the DStream and other RDDs. Also, since each transformation produces an output RDD, transforming a DStream results in another sequence of RDDs that defines an output DStream.<br />
<br />
Here is the basic transformation, where each RDD in the output DStream has a one-to-one correspondence with an RDD in the input DStream. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0EFOJ7MFyHyqbED3jym-fPg7SwppyEKeYNLCCr_JlCcwgHqWxgw6rss_ofENFrHiYIqOrQp_nsgbjOaRCISn8UziIrcRApXFlN8nJi36ol7M-n0MaLWmcfkqIsYOBs7EuUT7TDhX8BjEg/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0EFOJ7MFyHyqbED3jym-fPg7SwppyEKeYNLCCr_JlCcwgHqWxgw6rss_ofENFrHiYIqOrQp_nsgbjOaRCISn8UziIrcRApXFlN8nJi36ol7M-n0MaLWmcfkqIsYOBs7EuUT7TDhX8BjEg/s1600/p1.png" height="192" width="400" /></a></div>
<br />
Instead of performing a one-to-one transformation of each RDD in the DStream, Spark Streaming also enables a sliding window operation by defining a WINDOW which groups consecutive RDDs along the time dimension. A window is defined by two parameters, as sketched in the code after the list ...<br />
<ol>
<li>Window length: defines how many consecutive RDDs will be combined for performing the transformation. </li>
<li>Slide interval: defines how many RDDs will be skipped before the next transformation executes.</li>
</ol>
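Here is a hedged PySpark Streaming sketch of a windowed word count, assuming an existing SparkContext sc, a 10-second batch interval and a hypothetical socket source; the window length (60s) and slide interval (20s) correspond to 6 and 2 RDDs respectively.<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=10)       # one new RDD every 10 seconds
ssc.checkpoint("/tmp/streaming-ckpt")               # required when using an inverse reduce function

lines  = ssc.socketTextStream("localhost", 9999)    # hypothetical source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b,   # fold in RDDs entering the window
                                     lambda a, b: a - b,   # remove RDDs sliding out of it
                                     windowDuration=60,    # window length
                                     slideDuration=20))    # slide interval
counts.pprint()
ssc.start()
ssc.awaitTermination()
</code></pre>
<br />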
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUXdOIR1raPlrK0u5VcQQadIv8WqCusGuEEf2hknLL_QaX2-btR9qTOIppKdw1zc_y7sEZRDCgHMXLUSIk6okV3z2DxO20mOsBdUPzi56xeLeOgqyDni8i9xJykheRv3zCtxXl4VM0Ecoe/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUXdOIR1raPlrK0u5VcQQadIv8WqCusGuEEf2hknLL_QaX2-btR9qTOIppKdw1zc_y7sEZRDCgHMXLUSIk6okV3z2DxO20mOsBdUPzi56xeLeOgqyDni8i9xJykheRv3zCtxXl4VM0Ecoe/s1600/p1.png" height="247" width="400" /></a></div>
<br />
<br />
By providing a similar set of transformation operations for both RDDs and DStreams, Spark enables a unified programming paradigm across both batch and real-time processing, and hence reduces the corresponding development and maintenance cost.Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-41498432746133521252014-11-05T16:01:00.000-08:002014-11-05T16:04:42.159-08:00Common data science project flowWhile working across multiple data science projects, I observed a similar pattern across a group of strategic data science projects where a common methodology can be used. In this post, I want to sketch this methodology at a high level.<br />
<br />
First of all, "data science" itself is a very generic term that means different things to different people. For the projects I involved, many of them target to solve a very tactical and specific problem. However, over the last few years more and more enterprises start to realize the strategic value of data. I observed a growing number of strategic data science projects were started from a very broad scope and took a top-down approach to look at the overall business operation. Along the way, the enterprise prioritize the most critical areas within their operation cycle and build sophisticated models to guide and automate the decision process.<br />
<br />
Usually, my engagement starts as a data scientist / consultant with very little (or even no) domain knowledge. Being unfamiliar with the domain is nothing to be proud of and often slows down my initial discussions. Therefore, within a squeezed time period I need to quickly learn enough "basic domain knowledge" to keep the discussion smooth. On the other hand, lacking a pre-conceived model enables me (or you can say forces me) to look from a fresh-eye view, from which I can trim off unnecessary details from the legacy and focus only on those essential elements that contribute to the core part of the data model. It is also fun to go through the concept blending process between a data scientist and a domain expert: I force them to think in my way and they force me to think in their way. This is by far the most effective way for me to learn any new concepts.<br />
<br />
Recently I had a discussion with a company that has a small, but very sophisticated, data science team that builds pricing models and demand forecasts for their product line. I am by no means an expert in their domain, but their problem (how to predict demand, and how to set price) is general enough across many industries. Therefore, I will use this problem as an example to illustrate the major steps in the common pattern described above.<br />
<br />
<h3>
Problem Settings</h3>
Let's say a car manufacturer starts its quarterly planning process. Here are some key decisions that need to be made by the management.<br />
<ul>
<li>How many cars should the company produce for next year?</li>
<li>What should the new price of the cars be?</li>
</ul>
First of all, we need to identify the ultimate "goal" of these decisions. Such a goal is usually easy to find, as it is usually in the company's mission statement.<br />
<br />
In this problem, the goal is to ...<br />
<span style="color: red;">maximize: "Profit_2015"</span> <br />
<br />
In general, I find it a good starting point to look at the problem from an "optimization" angle, in which we define our goal in terms of an objective function as well as a set of constraints.<br />
<br />
<h3>
Step 1: Identify variables and define its dependency graph</h3>
Build the dependency graph between the different variables, starting from the objective function. Separate the decision variables (which you control) from the environment variables (which you do not control).<br />
<br />
As an illustrative example, we start from our objective function "Profit_2015" and define the dependency relationships below. Decision variables are highlighted in blue.<br />
<br />
<span style="color: red;">Profit_2015 = F(UnitSold_2015, <span style="color: blue;">UnitProduced_2015, Price_2015</span>, Cost_2015)</span><br />
<span style="color: red;">UnitSold_2015 = G(Supply_2015, Demand_2015, <span style="color: blue;">Price_2015</span>, CompetitorPrice_2015)</span><br />
<span style="color: red;">Demand_2015 = H(GDP_2014, PhoneSold_2014)</span><br />
<span style="color: red;">GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...) </span><br />
<span style="color: red;">...</span><br />
<br />
Identifying these variables and their potential dependencies typically relies on well-studied theory or on domain experts in the industry. At this stage, we don't need to know the exact formula of the functions F/G/H; we only need to capture the links between the variables. It is also OK to include a link that shouldn't exist (ie: there is no relationship between the 2 variables in reality). However, it is not good if we miss a link (ie: fail to capture a strong, existing dependency).<br />
<br />
This round usually involves 4 to 5 half-day brainstorming sessions with the domain experts, facilitated by the data scientist/consultant who is familiar with the model-building process. There may be additional investigation and background study if subject matter experts are not available. Starting from scratch, this round can take somewhere between a couple of weeks and a couple of months.<br />
<br />
<h3>
Step 2: Define the dependency function </h3>
In this round, we want to identify the relationships between variables by specifying the formulas of F(), G(), H().<br />
<br />
<br />
<u><b>Well-Known Function </b></u><br />
For relationships that are well studied, we can use a known mathematical model.<br />
<br />
For example, in the relationship <br />
<span style="color: red;">Profit_2015 = F(UnitSold_2015, <span style="color: blue;">UnitProduced_2015, Price_2015</span>, Cost_2015)</span><br />
<br />
We can use the following Mathematical formula in a very straightforward manner<br />
<span style="color: red;">Profit = (UnitSold * <span style="color: blue;">Price</span>) - (<span style="color: blue;">UnitProduced</span> * Cost)</span><br />
<br />
<br />
<u><b>Semi-Known Function</b></u><br />
However, some of the relationships are not as straightforward as that. For those relationships where we don't know the exact formula but can make a reasonable assumption about its shape, we can assume the relationship follows a family of models (e.g. linear, quadratic ... etc.), and then figure out the parameters that best fit the historical data.<br />
<br />
For example, in the relationship<br />
<span style="color: red;">Demand_2015 = H(GDP_2014, PhoneSold_2014)</span><br />
<br />
Let's assume the "demand" is a linear combination of "GDP" and "phones sold", which seems to be a reasonable assumption.<br />
<br />
For the linear model we assume<br />
<span style="color: red;">Demand = w0 + (w1 * GDP) + (w2 * PhoneSold)</span><br />
<br />
Then we feed the historical training data into a linear regression model and figure out the best-fit values of w0, w1, w2.<br />
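<br />
As a minimal sketch (the numbers below are made up purely for illustration), fitting this linear model with scikit-learn recovers w0, w1 and w2 from the historical data:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.5, 120], [1.7, 135], [1.6, 150], [2.0, 160], [2.1, 180]])  # [GDP, PhoneSold]
y = np.array([310, 330, 345, 390, 410])                                     # Demand (made up)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)      # w0, (w1, w2)
print(model.predict([[2.2, 190]]))        # predicted Demand for a new (GDP, PhoneSold) pair
</code></pre>
<br />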
<br />
<br />
<u><b>Time-Series Function</b></u><br />
In some cases, a variable depends only on its own past values and not on other variables; here we can train a time series model to predict the variable based on its own history. Typically, the model is decomposed into 3 components: noise, trend and seasonality. One popular approach is to use <a href="http://en.wikipedia.org/wiki/Exponential_smoothing" target="_blank">exponential smoothing techniques</a> such as the Holt/Winters model. Another popular approach is to use <a href="http://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average" target="_blank">the ARIMA model</a>, which decomposes the value into "auto-regression" and "moving-average" components.<br />
<br />
For example, in the relationship<br />
<span style="color: red;">GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...) </span><br />
<br />
We can use a time series model to learn the relationship between the historical values and the future value. <br />
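<br />
A hedged sketch with made-up annual GDP figures, using Holt's exponential smoothing (additive trend, no seasonality since the data is yearly) from statsmodels:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from statsmodels.tsa.holtwinters import ExponentialSmoothing

gdp_history = [1.20, 1.28, 1.33, 1.41, 1.46, 1.55, 1.61]    # made-up GDP values, oldest first

fit = ExponentialSmoothing(gdp_history, trend="add").fit()   # Holt's linear-trend model
print(fit.forecast(1))                                       # one-step-ahead GDP estimate
</code></pre>
<br />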
<br />
<br />
<u><b>Completely Unknown Function</b></u><br />
But if we cannot even assume the model family, we can consider using a "k nearest neighbor" approach to interpolate the output from its input. We need to define the "distance function" between data points based on domain knowledge and also figure out what the optimal value of k should be. In many cases, using a weighted average of the k nearest neighbors is a good interpolation.<br />
<br />
For example, in the relationship<br />
<span style="color: red;">UnitSold_2015 = G(Supply_2015, Demand_2015, <span style="color: blue;">Price_2015</span>, CompetitorPrice_2015)</span><br />
It is unclear what model should be used to represent UnitSold as a function of Supply, Demand, Price and CompetitorPrice, so we go with a nearest neighbor approach.<br />
<br />
Based on the monthly sales of the past 3 years, we can use "Euclidean distance" (we can also consider scaling the data to a comparable range by subtracting the mean and dividing by the standard deviation) to find the closest 5 neighbors, and then use their weighted average to predict the units sold.<br />
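<br />
A minimal scikit-learn sketch of this interpolation, with made-up monthly history; the pipeline standardizes the inputs and uses a distance-weighted average of the 5 nearest neighbors:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# columns: [Supply, Demand, Price, CompetitorPrice]; target: UnitSold (all values made up)
X = np.array([[900, 850, 21.0, 20.5], [950, 900, 20.0, 21.0], [800, 880, 22.5, 21.5],
              [870, 910, 21.5, 22.0], [920, 940, 20.5, 20.0], [860, 890, 22.0, 21.0]])
y = np.array([780, 860, 760, 820, 900, 800])

knn = make_pipeline(StandardScaler(),
                    KNeighborsRegressor(n_neighbors=5, weights="distance"))
knn.fit(X, y)
print(knn.predict([[880, 920, 21.0, 21.5]]))   # weighted average of the 5 closest months
</code></pre>
<br />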
<br />
<h3>
Step 3: Optimization</h3>
At this point, we have the following defined<br />
<ul>
<li>A goal defined by maximizing (or minimizing) an objective function</li>
<li>A set of variables (including the decision and environment variables)</li>
<li>A set of functions that define how these variables are inter-related to each other. Some of them are defined by a mathematical formula and some of them are defined as a black box (based on a predictive model)</li>
</ul>
Our goal is to figure out what the decision variables (which we control) should be set to such that the objective function is optimized (maximized or minimized).<br />
<br />
<u><b>Determine the value of environment variables</b></u><br />
For those environment variables that have no dependencies on other variables, we can acquire their values from external data sources. For those environment variables that depend on other environment variables (but not on decision variables), we can estimate their values using the corresponding dependency function (of course, we need to estimate all the variables they depend on first). For those environment variables that depend (directly or indirectly) on decision variables, leave them undefined.<br />
<br />
<u><b>Determine the best value of decision variables</b></u><br />
Once we formulate the dependency functions, depending on the form of these functions, we can employ different optimization methods. Here is how I choose the appropriate method based on the formulation of the dependency functions.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF-6b74rQFBonEUbUrjgMVZB-vJakSPCOaVuGpFziEKH4rB02P9e-AGkmwyZfY8_wF2eVWBJElxtUUZnbgoj2mDpqe4UEJkNFyuZ1uxc6CT3OGrJtYAOEi_o3KhukF8-82ebtJw0na6S0A/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF-6b74rQFBonEUbUrjgMVZB-vJakSPCOaVuGpFziEKH4rB02P9e-AGkmwyZfY8_wF2eVWBJElxtUUZnbgoj2mDpqe4UEJkNFyuZ1uxc6CT3OGrJtYAOEi_o3KhukF8-82ebtJw0na6S0A/s1600/p1.png" height="372" width="400" /></a></div>
<br />
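As a hedged sketch of this last step, the snippet below optimizes the two decision variables with scipy; the environment variable values and the stand-in dependency function are made up, and in practice G() would be the model learned in step 2:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from scipy.optimize import minimize

COST, DEMAND, COMPETITOR_PRICE = 18.0, 950.0, 22.0         # environment variables (assumed estimated)

def unit_sold(produced, demand, price, competitor_price):  # made-up stand-in for the learned G()
    share = competitor_price / (price + competitor_price)
    return min(produced, demand * 2 * share)

def neg_profit(x):                                          # scipy minimizes, so negate the profit
    produced, price = x
    sold = unit_sold(produced, DEMAND, price, COMPETITOR_PRICE)
    return -(sold * price - produced * COST)

result = minimize(neg_profit, x0=[800.0, 20.0],
                  bounds=[(0, 2000), (1.0, 50.0)], method="Powell")
print(result.x, -result.fun)                                # best (UnitProduced, Price) and the profit
</code></pre>
<br />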
<br />
<h3>
Additional Challenges</h3>
To summarize, we have followed the process below <br />
<ol>
<li>Define an objective function, constraints, decision variables and environment variables</li>
<li>Identify the relationship between different variables</li>
<li>Collect or predict those environment variables</li>
<li>Optimize those decision variables based on the objective functions</li>
<li>Return the optimal value of decision variables as the answer</li>
</ol>
So far, our dependency graph is acyclic, meaning our decisions won't affect the underlying environment variables. Although this is reasonably true if the enterprise is an insignificant, small player in the market, it is no longer true if the enterprise is one of the few major players. For example, its pricing strategy may cause competitors to change their own pricing strategies as well. How the competitors would react is less predictable, and historical data plays a less important role here. At some point, human judgement will get involved to fill the gaps.<br />
<br />
<br />
<br />Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com3tag:blogger.com,1999:blog-7994087232040033267.post-3241974056233231022014-08-17T22:36:00.000-07:002014-08-17T23:20:43.030-07:00Lambda Architecture Principles"Lambda Architecture" (introduced by Nathan Marz) has gained a lot of traction recently. Fundamentally, it is a set of design patterns for dealing with batch and real-time data processing workflows that fuel many organizations' business operations. Although I don't see any novel ideas being introduced, it is the first time these principles have been outlined in such a clear and unambiguous manner.<br />
<br />
In this post, I'd like to summarize the key principles of the Lambda architecture, focusing more on the underlying design principles and less on the choice of implementation technologies, where my preferences may differ from Nathan's.<br />
<br />
One important distinction of the Lambda architecture is that it has a clear separation between the batch processing pipeline (ie: Batch Layer) and the real-time processing pipeline (ie: Real-time Layer). Such separation provides a means to localize and isolate the complexity of handling data updates. To handle real-time queries, the Lambda architecture provides a mechanism (ie: Serving Layer) to merge/combine data from the Batch Layer and Real-time Layer and return the latest information to the user.<br />
<br />
<h3>
Data Source Entry</h3>
At the very beginning, data flows in Lambda architecture as follows ...<br />
<ul>
<li>Transaction data starts streaming in from OLTP systems during business operations. Transaction data ingestion can be materialized in the form of records in OLTP systems, text lines in app log files, incoming API calls, or an event queue (e.g. Kafka)</li>
<li>This transaction data stream is replicated and fed into both the Batch Layer and Realtime Layer</li>
</ul>
Here is an overall architecture diagram for Lambda. <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG1Iz8jjvzNmHVcO5bcQZAINdLl2XytP0M9JOmjLvtrg2VnjzE_QeEVdWRpAE9URMT1x5HMNQoY0Un24pp3ModJyo8FQM3YnCVFyZH7TnTGVF8hB3Sr_ppC9Oa8A8DG7REI7yPV38Fi5b8/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhG1Iz8jjvzNmHVcO5bcQZAINdLl2XytP0M9JOmjLvtrg2VnjzE_QeEVdWRpAE9URMT1x5HMNQoY0Un24pp3ModJyo8FQM3YnCVFyZH7TnTGVF8hB3Sr_ppC9Oa8A8DG7REI7yPV38Fi5b8/s1600/p1.png" height="291" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
<h3>
Batch Layer </h3>
For storing the ground truth, the "master dataset" is the most fundamental DB that captures all the basic events that happen. It stores data in the most "raw" form (and hence the finest granularity) that can be used to compute any perspective at any given point in time. As long as we can maintain the correctness of the master dataset, every perspective of the data derived from it will be automatically correct.<br />
<br />
Given that maintaining the correctness of the master dataset is crucial, to avoid the complexity of maintenance, the master dataset is "immutable". Specifically, data can only be appended, while updates and deletes are disallowed. By disallowing changes to existing data, it avoids the complexity of handling conflicting concurrent updates completely.<br />
<br />
Here is a conceptual schema of how the master dataset can be structured. The center green table represents the old, traditional way of storing data in an RDBMS. The surrounding blue tables illustrate the schema of how the master dataset can be structured, with some key highlights<br />
<ul>
<li>Data are partitioned by columns and stored in different tables. Columns that are closely related can be stored in the same table</li>
<li>NULL values are not stored</li>
<li>Each data record is associated with a time stamp indicating since when the record is valid</li>
</ul>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9XWcGrfihqAJPPe1UTa6jeA_-QQAiBJDT44tPl5HBdli0kZazKome8cEk1nETCFgSfHDBGRrIf9kZv2Apjj56YqMqZ0gsY9n7k-0ACXWQ9OnivkIU3as5W0kpH9f-UyhK_tQPzhD7OIM9/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9XWcGrfihqAJPPe1UTa6jeA_-QQAiBJDT44tPl5HBdli0kZazKome8cEk1nETCFgSfHDBGRrIf9kZv2Apjj56YqMqZ0gsY9n7k-0ACXWQ9OnivkIU3as5W0kpH9f-UyhK_tQPzhD7OIM9/s1600/p1.png" height="290" width="400" /></a></div>
<br />
<br />
Notice that every piece of data is tagged with a time stamp at which
the data is changed (or more precisely, a change record that represents
the data modification is created). The latest state of an object can be retrieved by extracting the version of the object with the largest time stamp.<br />
<br />
Although the master dataset stores data at the finest granularity and can therefore be used to compute the result of any query, it usually takes a long time to perform such computation if the processing starts from such a raw form. To speed up query processing, data in various intermediate forms (called batch views) that align more closely with the queries will be generated periodically. These batch views (instead of the original master dataset) will be used to serve real-time query processing. <br />
<br />
To generate these batch views, the "Batch Layer" uses a massively parallel, brute-force approach to process the original master dataset. Notice that since data in the master dataset is timestamped, the candidate data can be identified simply as those records whose timestamp is later than the last round of batch processing. Although less efficient, the Lambda architecture advocates that at each round of batch view generation, the previous batch view should simply be discarded and the new batch view computed from the master dataset. This simple-minded, compute-from-scratch approach has some good properties in stopping error propagation (since errors cannot accumulate), but the processing may not be optimized and may take a longer time to finish. This can increase the "staleness" of the batch view.<br />
<br />
<h3>
Real time Layer</h3>
As discussed above, generating the batch view requires scanning a large volume of the master dataset, which can take a few hours. The batch view will therefore be stale for at least the processing time duration (ie: between the start and end of the batch processing). But the maximum staleness can be up to the time period between the end of this batch processing and the end of the next batch processing (ie: the batch cycle). The following diagram illustrates this staleness.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_aUWE7wOtTtCDDcsgj7GVs6ollDf21lpqcGtOjYw-iJ5YNgm8a1n24MwjRvazA4Z4-sZC-URFr-RlHlJ_SUrurQQx2QsyDQ0et_svnEr00G5AIXpPwpTZFr8ablwgR6QvM3WXzXtc3tiq/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_aUWE7wOtTtCDDcsgj7GVs6ollDf21lpqcGtOjYw-iJ5YNgm8a1n24MwjRvazA4Z4-sZC-URFr-RlHlJ_SUrurQQx2QsyDQ0et_svnEr00G5AIXpPwpTZFr8ablwgR6QvM3WXzXtc3tiq/s1600/p1.png" height="185" width="320" /></a></div>
<br />
Even during the period when the batch view is stale, business operates as usual and transaction data will be streamed in continuously. To answer the user's query with the latest, up-to-date information, the business transaction records need to be captured and merged into the real-time view. This is the responsibility of the Real-time Layer. To reduce the latency of latest-information availability close to zero, the merge has to be done in an incremental manner such that no batching delay is introduced into the processing. This requires the real-time view update to be very different from the batch view update, which can tolerate a high latency. The end goal is that the latest information that is not captured in the batch view will be made available in the real-time view.<br />
<br />
The logic of doing the incremental merge on the real-time view is application specific. As a common use case, let's say we want to compute a set of summary statistics (e.g. mean, count, max, min, sum, standard deviation, percentile) of the transaction data since the last batch view update. To compute the sum, we can simply add the new transaction values to the existing sum and then write the new sum back to the real-time view. To compute the mean, we can multiply the existing count by the existing mean, add the new transaction values, and then divide by the existing count plus the number of new transactions. To implement this logic, we need to READ data from the real-time view, perform the merge and WRITE the data back to the real-time view. This requires the real-time serving DB (which hosts the real-time view) to support both random READ and WRITE. Fortunately, since the real-time view only needs to store the data accumulated over at most one batch cycle, its scale is limited to some degree.<br />
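<br />
A minimal sketch of this incremental merge, using a plain dictionary as a stand-in for the real-time view:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>def merge_transaction(view, amount):
    """Fold one new transaction amount into the running statistics (read-modify-write)."""
    view["count"] += 1
    view["sum"]   += amount
    view["max"]    = max(view["max"], amount)
    view["min"]    = min(view["min"], amount)
    view["mean"]   = view["sum"] / view["count"]
    return view

realtime_view = {"count": 0, "sum": 0.0, "max": float("-inf"), "min": float("inf"), "mean": 0.0}
for amount in [120.0, 75.5, 210.0]:          # transactions arriving since the last batch view
    realtime_view = merge_transaction(realtime_view, amount)
print(realtime_view)
</code></pre>
<br />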
Once the batch view update is completed, the real-time layer will discard the data in the real-time serving DB whose timestamp is earlier than the batch processing. This not only limits the data volume of the real-time serving DB, but also allows any data inconsistency (of the real-time view) to be cleaned up eventually. This drastically reduces the requirement for a sophisticated multi-user, large-scale DB. Many DB systems support multi-user random read/write and can be used for this purpose.<br />
<br />
<h3>
Serving Layer</h3>
The serving layer is responsible for hosting the batch view (in the batch serving database) as well as hosting the real-time view (in the real-time serving database). Due to very different access patterns, the batch serving DB has quite different characteristics from the real-time serving DB.<br />
<br />
As mentioned above, while it is required to support efficient random reads at large data volumes, the batch serving DB doesn't need to support random writes because data will only be bulk-loaded into it. On the other hand, the real-time serving DB will be incrementally (and continuously) updated by the real-time layer, and therefore needs to support both random reads and random writes. <br />
<br />
To keep the batch serving DB updated, the serving layer needs to periodically check the batch layer's progress to determine whether a later round of batch view generation has finished. If so, it bulk-loads the batch view into the batch serving DB. After completing the bulk load, the batch serving DB contains the latest version of the batch view, and some data in the real-time view has expired and can therefore be deleted. The serving layer orchestrates these processes. This purge action is especially important to keep the size of the real-time serving DB small and hence limit the complexity of handling real-time, concurrent read/write.<br />
<br />
To process a real-time query, the serving layer splits the incoming query into 2 sub-queries and forwards them to both the batch serving DB and the real-time serving DB, applies application-specific logic to combine/merge their corresponding results, and forms a single response to the query. Since the data in the real-time view and the batch view differ from a timestamp perspective, the combine/merge is typically done by concatenating the results together. In case of any conflict (same timestamp), the one from the batch view will overwrite the one from the real-time view.<br />
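<br />
A hedged sketch of the merge rule just described, with the two views represented as dictionaries keyed by timestamp so that the batch result wins on a conflict:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>def serve_query(batch_view, realtime_view):
    merged = dict(realtime_view)     # start with the latest, possibly tentative data
    merged.update(batch_view)        # batch results overwrite real-time results on the same key
    return merged

batch_view    = {"2014-08-15": 1040, "2014-08-16": 998}
realtime_view = {"2014-08-16": 990, "2014-08-17": 412}   # 08-16 conflicts; 08-17 not yet in batch
print(serve_query(batch_view, realtime_view))
</code></pre>
<br />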
<br />
<h3>
Final Thoughts</h3>
By separating different responsibilities into different layers, the Lambda architecture can leverage different optimization techniques specifically designed for different constraints. For example, the Batch Layer focuses on large-scale data processing using a simple, start-from-scratch approach without worrying about processing latency. On the other hand, the Real-time Layer covers where the Batch Layer leaves off, focusing on low-latency merging of the latest information without needing to worry about large scale. Finally, the Serving Layer is responsible for stitching together the batch view and real-time view to provide the final, complete picture.<br />
<br />
The clear demarcation of responsibility also enables different technology stacks to be utilized at each layer, and hence each can be tailored more closely to the organization's specific business needs. Nevertheless, using very different mechanisms to update the batch view (ie: start-from-scratch) and the real-time view (ie: incremental merge) requires two different algorithm implementations and code bases to handle the same type of data. This can increase the code maintenance effort, and can be considered the price to pay for bridging the fundamental gap between the "scalability" and "low latency" needs.<br />
<br />
Nathan's Lambda architecture also introduces a set of candidate technologies which he has developed and used in his past projects (e.g. Hadoop for storing the master dataset, Hadoop for generating the batch view, ElephantDB for the batch serving DB, Cassandra for the real-time serving DB, Storm for generating the real-time view). The beauty of the Lambda architecture is that the choice of technologies is completely decoupled, so I intentionally do not describe any of their details in this post. On the other hand, I have my own favorites, which are different and will be covered in my future posts.Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-42141588120277810702014-07-27T15:44:00.001-07:002014-07-27T15:45:05.998-07:00Incorporate domain knowledge into predictive modelAs a data scientist / consultant, in many cases we are called in to work with domain experts who have in-depth business knowledge of their industry settings. The main objective is to help our clients validate and quantify the intuition of existing domain knowledge based on empirical data, and remove any judgement bias. In many cases, customers will also want to build a predictive model to automate their business decision-making process.<br />
<br />
To create a predictive model, <b>feature engineering</b> (defining the set of inputs) is a key part, if not the most important one. In this post, I'd like to share my experience in how to come up with the initial set of features and how to evolve it as we learn more.<br />
<br />
Firstly, we need to acknowledge two forces in this setting<br />
<ol>
<li>Domain experts tend to be narrowly focused (and potentially biased) towards their prior experience. Their domain knowledge can usually be encoded in terms of <b>"business rules"</b> and tends to be simple and obvious (if it is too complex and hidden, the human brain is not good at picking it up).</li>
<li>Data scientists tend to be less biased and good at mining through a large set of signals to determine how relevant they are in an objective and quantitative manner. Unfortunately, raw data rarely gives strong signals. And lacking the domain expertise, a data scientist alone will not even be able to come up with a good set of features (which usually require derivation from combinations of raw data). Notice that trying out all combinations is impractical because there is an infinite number of ways to combine raw data. Also, when you have too many features in the input, the training data will not be enough, resulting in a model with high variance.</li>
</ol>
Maintaining a balance between these forces is a critical success factor in many data science projects.<br />
<br />
The best project setting (in my opinion) is to let the data scientist take control of the whole exercise (as less bias is an advantage) while being guided by input from the domain experts.<br />
<br />
<h3>
Indicator Feature</h3>
This is a binary variable based on a very specific boolean condition (ie: true or false) that the domain expert believes to be highly indicative of the output. For example, for predicting a stock, one indicator feature is whether the stock has dropped more than 15% in a day.<br />
<br />
Notice that indicator features can be added at any time once a new boolean condition is discovered by the domain expert. Indicator features don't need to be independent of each other, and in fact most of the time they are highly inter-correlated.<br />
<br />
After fitting these indicator features into the predictive model, we can see how much influence each of these features exerts on the final prediction, hence providing feedback to the domain experts about the strength of these signals.<br />
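<br />
As a minimal sketch (features, data and labels are all made up), a simple logistic regression over indicator features exposes each feature's influence through its coefficient:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: dropped more than 15% in a day, earnings miss, new product launch (hypothetical indicators)
X = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])                 # hypothetical label: price fell further next week

model = LogisticRegression().fit(X, y)
for name, coef in zip(["drop_15pct", "earnings_miss", "new_product"], model.coef_[0]):
    print(name, round(coef, 3))                  # sign and magnitude feed back to the domain experts
</code></pre>
<br />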
<br />
<h3>
Derived Feature</h3>
This is a numeric variable (ie: a quantity) that the domain expert believes to be important for predicting the output. The idea is the same as the indicator feature except that it is numeric in nature.<br />
<br />
<h3>
Expert Stacking</h3>
Here we build a predictive model whose input features are taken from each expert's prediction output. For example, to predict the stock, our model takes 20 analysts' predictions as its input.<br />
<br />
The strength of this approach is that it can incorporate domain expertise very easily because it treats the experts as black boxes (without needing to understand their logic). The model we train will take into account the relative accuracy of each expert's prediction and adjust its weighting accordingly. On the other hand, one weakness is the reliance on domain expertise at prediction time, which may or may not be available on an ongoing basis.<br />
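<br />
A minimal sketch of expert stacking with made-up numbers: each column is one analyst's predicted return, and the stacked model learns how much weight to give each analyst:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>import numpy as np
from sklearn.linear_model import LinearRegression

analyst_preds = np.array([[0.02, 0.01, 0.03],    # rows: past weeks, columns: 3 analysts
                          [0.00, 0.02, 0.01],
                          [0.05, 0.03, 0.04],
                          [-0.01, 0.00, -0.02],
                          [0.03, 0.02, 0.05]])
actual_return = np.array([0.025, 0.010, 0.045, -0.015, 0.040])

stacker = LinearRegression().fit(analyst_preds, actual_return)
print(stacker.coef_)                              # relative weight given to each analyst
print(stacker.predict([[0.02, 0.02, 0.03]]))      # combined prediction for a new week
</code></pre>
<br />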
<br />Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-28091916856156827722014-06-28T09:02:00.001-07:002014-06-28T09:03:11.828-07:00Interactive Data VisualizationRecently, "interactive report" is becoming a hot topic in data visualization. I believe it is becoming the next generation UI paradigm for KPI reports.<br />
<br />
Interactive reports sit somewhere in between static reports and BI tools …<br />
<br />
<h3>
Executive KPI report today</h3>
<h3>
</h3>
Today most executive reports are "static reports" provided by financial experts who pull data from various ERP systems on a regular basis, summarize the raw data in a highly condensed and simplified form, and then generate a static report for the execs. When the exec gets the report, it is already in a summarized form that is customized based on his/her prior requirements. There is no way to ask any question that the report is not already answering. Of course, the exec can ask for a separate report, which requires additional development time and effort from his/her staff, and also means waiting for the new report to be developed.<br />
<br />
This is a suboptimal situation. In order to survive or maintain leadership in today's highly competitive business environment, execs not only need a much broader perspective (from a wide variety of operational data) to make their decisions, they also have to make those decisions fast. Static reports cannot fulfill this need.<br />
<br />
<br />
<br />
<h3>
Business Intelligence Tools </h3>
<br />
On the other hand, BI tools (such as Tableau) or OLAP tools can do very detailed analysis on a wide range of data sources. However, using these tools to perform more detailed analysis (such as slice/dice/rollup/drilldown) typically requires specially trained data analysis skills. In reality, very few execs use these tools directly. What they do is ask their data analysts to prepare a static report for them using these BI tools. The exec still gets a "static report", although it is produced by the BI tools. Whenever they need to ask a different question, they need to go back to the data analyst and ask them to prepare a separate report.<br />
<br />
<br />
<br />
There is a gap between static reports and BI tools.<br />
<br />
<h3>
Interactive Report </h3>
"Interactive Report" provides a new paradigm to fill this gap. It has the following characteristics …<br />
<ul>
<li>Like a static report, "Interactive Report" is still based on "static data", which is a fixed set of data generated in a periodic batch fashion.</li>
<li>Unlike a static report, this pre-generated "static data" is much larger and wider, covering a broader scope of questions that the execs may ask.</li>
<li>Because the "static data" is large and wide, it is impossible to visualize all aspects in the report. Therefore, only one perspective of the static data (based on the exec's pre-specified requirement) is shown in the report.</li>
<li>However, if the exec wants to ask a different question, he/she can switch to a different perspective of the same "static data".</li>
</ul>
<br />
By providing a much larger volume of static data, an "interactive report" provides a more dynamic data navigation experience for the execs to find the answers to their ad-hoc, unplanned questions.<br />
<br />
<br />
<br />
There are many open source technologies (such as <a href="http://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html" target="_blank">Googlevis</a>...) to support interactive data visualization, from which the "interactive report" can be built. Many of them provide a programmatic interface with R, so now data scientists without much JavaScript experience can produce highly interactive web pages.Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-18440916210937282592014-03-12T00:06:00.001-07:002014-03-12T00:06:44.896-07:00Common Text Mining workflowIn this post, I want to summarize a common pattern that I have used in my previous text mining projects.<br />
<br />
Text mining is different in that it uses vocabulary terms as the key elements in feature engineering; otherwise it is quite similar to a statistical data mining project. Following are the key steps ...<br />
<ol>
<li>Determine the "object" that we are interested to analyze. In some cases, the text document itself is the object (e.g. an email). In other cases, the text document is providing information about the object (e.g. user comment of a product, tweaks about a company)</li>
<li>Determine the features of the object we are interested in, and create the corresponding feature vector of the object.</li>
<li>Feed the data (each object and its corresponding set of features) to standard descriptive analytics and predictive analytics techniques.</li>
</ol>
The overall process of text mining can be described in the following flow ...<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwSeKu07YIC7qaPEt7be585o6A7-oyHwvayhZy850naPlpMqjKYqiXMkge1U-FLcdIV2i_3HFav7eGEIYGmMPfmS2BrR1AfJPG-0jD9xH1umJU22-y5clX6FKhx61yHj77Yf5L-p4wgl0X/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwSeKu07YIC7qaPEt7be585o6A7-oyHwvayhZy850naPlpMqjKYqiXMkge1U-FLcdIV2i_3HFav7eGEIYGmMPfmS2BrR1AfJPG-0jD9xH1umJU22-y5clX6FKhx61yHj77Yf5L-p4wgl0X/s1600/p1.png" height="293" width="400" /></a></div>
<br />
<h3>
Extract docs</h3>
In this phase, we extract text documents from various types of external sources into a text index (for subsequent search) as well as a text corpus (for text mining).<br />
<br />
The document source can be a public web site, an internal file system, or a SaaS offering. Extracting documents typically involves one of the following ...<br />
<ul>
<li>Perform a Google search or crawl a predefined list of web sites, then download the web pages from the list of URLs, parse the DOM to extract text data from its sub-elements, and eventually create one or multiple documents, storing them into the text index as well as the text corpus.</li>
<li>Invoke the Twitter API to search for tweets (or monitor a particular topic stream of tweets), store them into the text index and text Corpus.</li>
<li>There is no limit on where the text data can be downloaded from. In an intranet environment, this can be downloading text documents from a shared drive. On the other hand, on a compromised computer, the user's email or IM can also be downloaded by the virus agent.</li>
<li>If the text is in a different language, we may also invoke some machine translation service (e.g. Google translate) to convert the language to English. </li>
</ul>
Once the document is stored in the text index (e.g. Lucene index), it is available for search. Also, once the document is stored in the text corpus, further text processing will be involved.<br />
<br />
<h3>
Transformation</h3>
After the document is stored in the Corpus, here are some typical transformations ...<br />
<ul>
<li>If we want to extract information about some entities mentioned in the document, we need to conduct sentence segmentation and paragraph segmentation in order to provide some local context from which we can analyze the entity with respect to its relationships with other entities.</li>
<li>Attach Part-Of-Speech tags or entity tags (person, place, company) to each word.</li>
<li>Apply standard text processing such as lower case, removing punctuation, removing numbers, removing stopword, stemming.</li>
<li>Perform domain-specific conversions such as replacing dddd-dd-dd with &lt;date&gt; and (ddd)ddd-dddd with &lt;phone&gt;, removing header and footer template text, and removing terms according to a domain-specific stop-word dictionary.</li>
<li>Optionally, normalize words to their synonyms using Wordnet or a domain-specific dictionary.</li>
</ul>
<br />
<h3>
Extract Features</h3>
For text mining, the "bag-of-words model" is commonly used as the feature set. In this model, each document is represented as a word vector (a high dimensional vector whose magnitudes represent the importance of each word in the document). Hence all documents within the corpus are represented as a giant document/term matrix. The "term" can be generalized as a uni-gram, bi-gram, tri-gram or n-gram, while the cell value in the matrix represents the frequency with which the term appears in the document. We can also use TF/IDF as the cell value to dampen the importance of terms that appear in many documents. If we just want to represent whether the term appears in the document, we can binarize the cell value into 0 or 1. <br />
<br />
After this phase, the Corpus will turn into a large and sparse document term matrix.<br />
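<br />
A minimal scikit-learn sketch (with a tiny made-up corpus) that builds such a TF/IDF-weighted document/term matrix from uni-grams and bi-grams:<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["my cat chased the dog",                 # tiny made-up corpus
        "the dog sleeps all day",
        "a pet needs food and care"]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
dtm = vectorizer.fit_transform(docs)             # sparse document/term matrix
print(dtm.shape)                                 # (documents, terms)
print(sorted(vectorizer.vocabulary_))            # the terms that form the columns
</code></pre>
<br />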
<br />
<h3>
Reduce Dimensions</h3>
Since each row in the document/term matrix represents a document as a high dimensional vector (with each dimension representing the occurrence of a term), there are two reasons why we want to reduce its dimensionality ...<br />
<ol>
<li>For efficiency reason, we want to reduce the memory footprint for storing the corpus</li>
<li>We want to transform the vector from the "term" space to a "topic" space, which allows documents of similar topics to be situated close to each other even when they use different terms. (e.g. documents using the words "pet" and "cat" are mapped to the same topic based on their co-occurrence)</li>
</ol>
SVD (Singular Value Decomposition) is a common matrix factorization technique to convert a "term" vector into a "concept" vector. SVD can be used to factor a large sparse M by N matrix into the product of three smaller matrices of sizes M*K, K*K and K*N. Latent Semantic Indexing (LSI) applies the SVD to the document/term matrix.<br />
<br />
Another popular technique, called topic modeling and based on LDA (Latent Dirichlet Allocation), is also commonly used to transform documents into a smaller set of topic dimensions.<br />
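<br />
A hedged sketch of LSI on a tiny made-up corpus, using scikit-learn's TruncatedSVD (which works directly on the sparse TF/IDF matrix):<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["my cat chased the dog", "the dog sleeps all day",
        "a pet needs food and care", "cats and dogs are pets"]
dtm = TfidfVectorizer(stop_words="english").fit_transform(docs)

svd = TruncatedSVD(n_components=2)            # K latent "concepts"
doc_concepts = svd.fit_transform(dtm)         # documents mapped from term space to concept space
print(doc_concepts.shape)                     # (4 documents, 2 concepts)
</code></pre>
<br />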
<br />
<h3>
Apply standard data mining</h3>
At this point, each document is represented as a topic vector. We can also add more domain-specific features (such as, for spam detection, whether the document contains certain word or character patterns such as '$', '!'). After that we can feed each vector into the regular machine learning process.<br />
<br />
<h3>
Tools and Library</h3>
I have used Python's NLTK as well as R's tm and topicmodels libraries for performing the text mining work described above. Both of these libraries provide a good set of features for mining text documents. Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com1tag:blogger.com,1999:blog-7994087232040033267.post-10985909913772792072014-03-03T23:14:00.003-08:002014-03-03T23:21:20.041-08:00Estimating statistics via Bootstrapping and Monte Carlo simulationWe want to estimate some "statistics" (e.g. average income, 95th percentile height, variance of weight ... etc.) of a population.<br />
<br />
It would be too tedious to enumerate all members of the whole population. For efficiency reasons, we randomly pick a number of samples from the population and compute the statistics of the sample set to estimate the corresponding statistics of the population. We understand that the estimation done this way (via random sampling) can deviate from the population. Therefore, in addition to our estimated statistics, we also include a "standard error" (how much our estimation may deviate from the actual population statistics) or a "confidence interval" (a lower and upper bound of the statistics which we are confident contains the true statistics).<br />
<br />
The challenge is how we estimate the "standard error" or the "confidence interval". A straightforward way is to repeat the sampling exercise many times, each time creating a different sample set from which we compute one estimate. Then we look across all the estimates from the different sample sets to determine the standard error and confidence interval of the estimation.<br />
<br />
But what if collecting data for a different sample set is expensive, or for any reason the population is no longer accessible after we collected our first sample set? Bootstrapping provides a way to address this ...<br />
<br />
<h3>
Bootstrapping </h3>
Instead of creating additional sample sets from the population, we create additional sample sets by re-sampling data (with replacement) from the original sample set. Each of the created sample sets will follow the same data distribution as the original sample set, which in turn follows the population.<br />
<br />
R provides a nice "boot" library to do this.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>> library(boot)
> # Generate a population
> population.weight <- rnorm(100000, 160, 60)
> # Lets say we care about the ninety percentile
> quantile(population.weight, 0.9)
90%
236.8105
> # We create our first sample set of 500 samples
> sample_set1 <- sample(population.weight, 500)
> # Here is our sample statistic of ninety percentile
> quantile(sample_set1, 0.9)
90%
232.3641
> # Notice that the sample statistics deviates from the population statistics
> # We want to estimate how big is this deviation by using bootstrapping
> # I need to define my function to compute the statistics
> ninety_percentile <- function(x, idx) {return(quantile(x[idx], 0.9))}
> # Bootstrapping will call this function many times with different idx
> boot_result <- boot(data=sample_set1, statistic=ninety_percentile, R=1000)
> boot_result
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = sample_set1, statistic = ninety_percentile, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 232.3641 2.379859 5.43342
> plot(boot_result)
> boot.ci(boot_result, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = boot_result, type = "bca")
Intervals :
Level BCa
95% (227.2, 248.1 )
Calculations and Intervals on Original Scale
</code></pre>
<br />
Here is the visual output of the bootstrap plot<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTP0ReoaW8gCM0w8rIcMxEXrZEm6dNBnLqXtwNseia9oGiUHYwi7GBj1Ln4QpjgzIYe91kkImvSEaujBfcNNuNEp7_DDr3zdY82Pg6z7KYpkZGT03O7d1LxHNxy7B-AyxdinntWtbVywkb/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTP0ReoaW8gCM0w8rIcMxEXrZEm6dNBnLqXtwNseia9oGiUHYwi7GBj1Ln4QpjgzIYe91kkImvSEaujBfcNNuNEp7_DDr3zdY82Pg6z7KYpkZGT03O7d1LxHNxy7B-AyxdinntWtbVywkb/s1600/p1.png" height="225" width="400" /></a></div>
<br />
Bootstrapping is a powerful simulation technique for estimating any statistic in an empirical way. It is also non-parametric because it doesn't assume any model or parameters; it just uses the original sample set to estimate the statistic. <br />
<br />
If we assume a certain distribution model and want to see the distribution of a certain statistic, Monte Carlo simulation provides a powerful way to do this.<br />
<br />
<h3>
Monte Carlo Simulation</h3>
The idea is pretty simple: based on a particular distribution function (defined by specific model parameters), we generate many sets of samples. We compute the statistic of each sample set and see how that statistic is distributed across the different sample sets.<br />
<br />
For example, given a normally distributed population, what is the probability distribution of the max value of 5 randomly chosen samples?<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>> sample_stats <- rep(0, 1000)
> for (i in 1:1000) {
+ sample_stats[i] <- max(rnorm(5))
+ }
> mean(sample_stats)
[1] 1.153008
> sd(sample_stats)
[1] 0.6584022
> par(mfrow=c(1,2))
> hist(sample_stats, breaks=30)
> qqnorm(sample_stats)
> qqline(sample_stats)
</code></pre>
<br />
Here is the distribution of the "max(5)" statistic, which shows some right skewness<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF2x7xwwoQXKrjzcVR3K6OnErK8p_p6-Ft9I3jqHM-4dVyqpa3bjzEzixeOnPkQKJeaOgDlR-rpNF_vBvmS-06LvpyQ9eJ8RzxBs5gKms6WGwBlveKWXGbvKpfAf_7N93jTkmHeykoN7w-/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF2x7xwwoQXKrjzcVR3K6OnErK8p_p6-Ft9I3jqHM-4dVyqpa3bjzEzixeOnPkQKJeaOgDlR-rpNF_vBvmS-06LvpyQ9eJ8RzxBs5gKms6WGwBlveKWXGbvKpfAf_7N93jTkmHeykoN7w-/s1600/p1.png" height="180" width="320" /></a></div>
<br />
Bootstrapping and Monte Carlo simulation are powerful tools for estimating statistics in an empirical manner, especially when we don't have an analytic form of the solution.Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-91774815067716755552013-12-27T14:55:00.000-08:002014-02-02T23:19:13.660-08:00Spark: Low latency, massively parallel processing frameworkWhile Hadoop fits well in most batch processing workloads, and is the primary choice for big data processing today, it is not optimized for other types of workload due to the following limitations<br />
<ul>
<li>Lack of iteration support</li>
<li>High latency due to persisting intermediate data onto disk</li>
</ul>
For a more detailed elaboration of the <a href="http://horicky.blogspot.com/2009/11/what-hadoop-is-good-at.html" target="_blank">Hadoop limitations</a>, refer to my <a href="http://horicky.blogspot.com/2009/11/what-hadoop-is-good-at.html" target="_blank">previous post</a>.<br />
<br />
Nevertheless, the Map/Reduce processing paradigm is a proven mechanism for dealing with large scale data. On the other hand, many of Hadoop's infrastructure pieces, such as HDFS and HBase, have matured over time.<br />
<br />
In this blog post, we'll look at a different architecture called Spark, which has taken the strengths of Hadoop, made improvements on a number of Hadoop's weaknesses, and provides a more efficient batch processing framework with much lower latency (from the benchmark results, Spark claims to be 100x faster than Hadoop when using the RAM cache, and 10x faster when running on disk). Although it competes with Hadoop MapReduce, Spark integrates well with other parts of the Hadoop ecosystem (such as HDFS, HBase ... etc.). Spark has generated a lot of excitement in the big data community and represents a very promising parallel execution stack for big data analytics. <br />
<br />
<h3>
Berkeley Spark </h3>
Within the Spark cluster, there is a driver program where the application logic execution starts, along with multiple workers which process data in parallel. Although this is not mandated, data is typically collocated with the workers and partitioned across the same set of machines within the cluster. During execution, the driver program passes code/closures to the worker machines, where the corresponding partitions of data will be processed. The data will undergo different steps of transformation while staying in the same partition as much as possible (to avoid data shuffling across machines). At the end of the execution, actions will be executed at the workers and the results will be returned to the driver program.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjx-lrJxNv6NNQmjryFkFhCujYhKBgenuyY-cCfLRKzlluFIPyytw-pimfQ7Sf1Jji86A18HoTB3PKhUM6UocCjaA_nj2P2A_kIk7ALy-oUeGbfU7DHdBGwpIVcN2x-EVENwL6ZGCmj817q/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjx-lrJxNv6NNQmjryFkFhCujYhKBgenuyY-cCfLRKzlluFIPyytw-pimfQ7Sf1Jji86A18HoTB3PKhUM6UocCjaA_nj2P2A_kIk7ALy-oUeGbfU7DHdBGwpIVcN2x-EVENwL6ZGCmj817q/s320/p1.png" height="320" width="304" /></a></div>
<br />
Underlying the cluster, there is an important distributed data structure called RDD (Resilient Distributed Dataset), which is a logically centralized entity but physically partitioned across multiple machines inside a cluster based on some notion of key. Controlling how different RDDs are co-partitioned (with the same keys) across machines can reduce inter-machine data shuffling within a cluster. Spark provides a "partition-by" operator which creates a new RDD by redistributing the data of the original RDD across machines within the cluster.<br />
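<br />
A minimal sketch of the "partition-by" idea (assuming an existing SparkContext sc): hash-partitioning two pair RDDs with the same number of partitions and the same default partitioner co-locates matching keys, so the subsequent join avoids a full shuffle.<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>num_parts = 8

users  = sc.parallelize([(i, "user-%d" % i) for i in range(1000)]) \
           .partitionBy(num_parts)          # hash-partition by key
orders = sc.parallelize([(i % 1000, 25.0) for i in range(5000)]) \
           .partitionBy(num_parts)          # same partitioning scheme as users

joined = users.join(orders)                 # matching keys already live in the same partition
print(joined.getNumPartitions())
</code></pre>
<br />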
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTp3QKXkSACuXf_5cthLE-MCHD91csRCkVJAeVQC2qIj1H9sYWOqhXbrSRyrYDeMB9PUW5AxUfiUtDm0BUH3dUaYswOjY6DzBFRQGIZ_HgoEx3M286LMXIEoPPQZg193CvJqKaP-JZ_LhP/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTp3QKXkSACuXf_5cthLE-MCHD91csRCkVJAeVQC2qIj1H9sYWOqhXbrSRyrYDeMB9PUW5AxUfiUtDm0BUH3dUaYswOjY6DzBFRQGIZ_HgoEx3M286LMXIEoPPQZg193CvJqKaP-JZ_LhP/s1600/p1.png" height="320" width="301" /></a></div>
<br />
<br />
An RDD can optionally be cached in RAM, which provides fast access. Currently caching is done at the granularity of a whole RDD: either all of the RDD or none of it is cached. Caching is a hint, not a guarantee. Spark will try to cache the RDD if sufficient memory is available in the cluster, evicting cached data with an LRU (Least Recently Used) algorithm.<br />
<br />
The RDD provides an abstract data structure on which application logic can be expressed as a sequence of transformations, without worrying about the underlying distributed nature of the data.<br />
<br />
Typically application logic is expressed as a sequence of TRANSFORMATIONs and ACTIONs. A "transformation" specifies the processing dependency DAG among RDDs, and an "action" specifies what the output will be (i.e. the sink node of the DAG with no outgoing edge). The scheduler performs a topological sort of the DAG to determine the execution sequence, tracing all the way back to the source nodes, or to nodes that represent a cached RDD.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiu7c1pyWplZ86oCMe73dNL4EMvRa5MCr6sU0nrwiNL9w2jrzLgaiWyyDnYD4d1Ryx8MpHYe3lh0BOjBZbzqHRQ9kZAUc3W3pUqJPj6_fg6e-xfMPY0_GhjsucOEtuK2SRPCuxTQBngUDT7/s1600/P2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiu7c1pyWplZ86oCMe73dNL4EMvRa5MCr6sU0nrwiNL9w2jrzLgaiWyyDnYD4d1Ryx8MpHYe3lh0BOjBZbzqHRQ9kZAUc3W3pUqJPj6_fg6e-xfMPY0_GhjsucOEtuK2SRPCuxTQBngUDT7/s1600/P2.png" height="320" width="214" /></a></div>
<br />
Notice that dependencies in Spark come in two forms. A "narrow dependency" means all partitions of an RDD will be consumed by a single child RDD (but a child RDD is allowed to have multiple parent RDDs). A "wide dependency" (e.g. group-by-key, reduce-by-key, sort-by-key) means a parent RDD will be split, with elements going to different child RDDs based on their keys. Notice that RDDs with narrow dependencies preserve the key partitioning between parent and child RDD. Therefore RDDs can be co-partitioned with the same keys (the parent key range being a subset of the child key range) such that the processing (generating the child RDD from the parent RDD) can be done within a machine with no data shuffling across the network. On the other hand, RDDs with wide dependencies involve data shuffling.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgE8hBs2qN8jpNyjlYwpVMt0QtcfHyoCAFoH3lEfLAcyobzMQhulxHIDUbOXy84NCHKr8yRhAadBjvukSs4_oakvjjbVOF0Jfg854PfoqukALsqTs_-4oS-0pHur79Hf5l6HsSt0aTvjo6w/s1600/P3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgE8hBs2qN8jpNyjlYwpVMt0QtcfHyoCAFoH3lEfLAcyobzMQhulxHIDUbOXy84NCHKr8yRhAadBjvukSs4_oakvjjbVOF0Jfg854PfoqukALsqTs_-4oS-0pHur79Hf5l6HsSt0aTvjo6w/s1600/P3.png" height="311" width="400" /></a></div>
<br />
Narrow transformations (involving no data shuffling) include the following operators<br />
<ul>
<li>Map</li>
<li>FlatMap</li>
<li>Filter</li>
<li>Sample </li>
</ul>
Wide transformations (involving data shuffling) include the following operators<br />
<ul>
<li> SortByKey</li>
<li>ReduceByKey</li>
<li>GroupByKey</li>
<li>CogroupByKey</li>
<li>Join</li>
<li>Cartesian </li>
</ul>
Actions output the RDD to the external world and include the following operators<br />
<ul>
<li>Collect</li>
<li>Take(n)</li>
<li>Reduce</li>
<li>ForEach</li>
<li>Sample</li>
<li>Count</li>
<li>Save </li>
</ul>
The scheduler examines the type of dependencies and groups RDDs connected by narrow dependencies into a unit of processing called a stage. Wide dependencies span consecutive stages within the execution and require the number of partitions of the child RDD to be explicitly specified.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkvVrePgWM7IOdKLOCehmnqdZ1LRb3rQVaSjkwZsiNplSndGVcg7Or5C3IEw9amt0Yp1UJDSrIN01fqnQ4HC2CnNCwee6w9b3A1clJcZ6Cfeiw5e5Q3tZ9pH1A4eFIwk2xxxTohVAaM0RC/s1600/p2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkvVrePgWM7IOdKLOCehmnqdZ1LRb3rQVaSjkwZsiNplSndGVcg7Or5C3IEw9amt0Yp1UJDSrIN01fqnQ4HC2CnNCwee6w9b3A1clJcZ6Cfeiw5e5Q3tZ9pH1A4eFIwk2xxxTohVAaM0RC/s400/p2.png" height="301" width="400" /></a></div>
<br />
A typical execution sequence is as follows ...<br />
<ol>
<li>An RDD is originally created from external data sources (e.g. HDFS, local files ... etc)</li>
<li>The RDD undergoes a sequence of TRANSFORMATIONs (e.g. map, flatMap, filter, groupBy, join), each producing a different RDD that feeds into the next transformation.</li>
<li>Finally the last step is an ACTION (e.g. count, collect, save, take), which converts the last RDD into an output to an external data source</li>
</ol>
The above sequence of processing is called a lineage (the outcome of the topological sort of the DAG). Each RDD produced within the lineage is immutable. In fact, unless it is cached, it is used only once to feed the next transformation to produce the next RDD and finally to produce some action output.<br />
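<br />
To make this lazy execution flow concrete, here is a minimal PySpark sketch (the input path and the word-count logic are illustrative choices of mine, not taken from this post) that chains a few narrow and wide transformations and then triggers execution with an action:<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from pyspark import SparkContext

sc = SparkContext("local", "word-count-demo")

# TRANSFORMATIONs: these only build up the DAG lazily, nothing executes yet
lines  = sc.textFile("hdfs:///tmp/input.txt")       # source RDD (hypothetical path)
words  = lines.flatMap(lambda line: line.split())   # narrow: FlatMap
pairs  = words.map(lambda w: (w.lower(), 1))        # narrow: Map
counts = pairs.reduceByKey(lambda a, b: a + b)      # wide: ReduceByKey (shuffle)

# ACTION: triggers the scheduler to topologically sort the DAG and execute it
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
print(top10)

sc.stop()
</code></pre>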
<br />
In a classical distributed system, fault resilience is achieved by replicating data across different machines, together with an active monitoring system. In case a machine crashes, there is always another copy of the data residing on a different machine from which recovery can take place. <br />
<br />
Fault resiliency in Spark takes a different approach. First of all, as a large-scale compute cluster, Spark is not meant to be a large-scale data cluster at all. Spark makes two assumptions about its workload.<br />
<ul>
<li>The processing time is finite (although the longer it takes, the cost of recovery after fault will be higher)</li>
<li>Data persistence is the responsibility of external data sources, which keeps the data stable within the duration of processing.</li>
</ul>
Spark has made a tradeoff decision that, in case any data is lost during the execution, it will re-execute the previous steps to recover the lost data. However, this doesn't mean everything done so far is discarded and we need to start from scratch. We just need to re-execute the corresponding partitions in the parent RDD which are responsible for generating the lost partitions; in the case of narrow dependencies, this resolves to the same machine. <br />
<br />
Notice that the re-execution of a lost partition is exactly the same as the lazy evaluation of the DAG, which starts from the leaf node of the DAG, traces back the dependencies to determine which parent RDDs are needed, and eventually tracks all the way to the source nodes. Recomputing a lost partition is done in a similar way, but takes the partition as an extra piece of information to determine which parent RDD partitions are needed.<br />
<br />
However, re-execution across wide dependencies can touch many parent RDDs across multiple machines and may cause re-execution of everything. To mitigate this, Spark persists the intermediate data output from a map phase before shuffling it to the different machines executing the reduce phase. In case of a machine crash, the re-execution (on another surviving machine) just needs to trace back and fetch the intermediate data from the corresponding partition of the mapper's persisted output. Spark also provides a checkpoint API to explicitly persist an intermediate RDD, so re-execution (after a crash) doesn't need to trace all the way back to the beginning. In the future, Spark will perform checkpointing automatically by figuring out a good balance between the latency of recovery and the overhead of checkpointing based on statistical results.<br />
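<br />
As a small illustration of the caching and checkpoint APIs mentioned above, here is a PySpark sketch (the checkpoint directory, storage level and toy computation are illustrative assumptions of mine):<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "checkpoint-demo")
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   # hypothetical directory

data = sc.parallelize(range(1000000))
squares = data.map(lambda x: x * x)

squares.persist(StorageLevel.MEMORY_ONLY)   # hint: cache in RAM if memory allows
squares.checkpoint()                        # truncate the lineage at this RDD

# The first action materializes, caches and checkpoints the RDD;
# recovery after a crash can restart from the checkpoint instead of the source.
print(squares.sum())
sc.stop()
</code></pre>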
<br />
Spark provides a powerful framework for building low-latency, massively parallel processing applications for big data analytics. It supports an API around the RDD abstraction, with a set of operations for transformation and action, in a number of popular programming languages like Scala, Java and Python.<br />
<br />
In future posts, I'll cover other technologies in the Spark stack including real-time analytics using streaming as well as machine learning frameworks.Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com1tag:blogger.com,1999:blog-7994087232040033267.post-7219286423939692132013-12-12T13:23:00.000-08:002013-12-12T13:23:24.387-08:00Escape local optimum trapOptimization is a very common technique in computer science and machine learning to search for the best (or good enough) solution. Optimization itself is a big topic and involves a wide range of mathematical techniques in different scenarios.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-HJGaxfiUktxM5d1g_WCJyZqTR4jMQ4fl18cQp-tVst0_xwF75dSNLw6z0-qBVNdOCsDwzhzx1r_c0Hk6tO67TDGTSDGxiHxuFsqjBMyFIxAJjkFbxmSbTfCC6t6g_RBatZ-tihaokrZO/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="333" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-HJGaxfiUktxM5d1g_WCJyZqTR4jMQ4fl18cQp-tVst0_xwF75dSNLw6z0-qBVNdOCsDwzhzx1r_c0Hk6tO67TDGTSDGxiHxuFsqjBMyFIxAJjkFbxmSbTfCC6t6g_RBatZ-tihaokrZO/s400/p1.png" width="400" /></a></div>
<br />
In this post, I will focus on local search, which is a very popular technique for finding an optimal solution through a series of greedy local moves. The general setting of local search is as follows ...<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>1. Define an objective function
2. Pick an initial starting point
3. Repeat
3.1 Find a neighborhood
3.2 Locate a subset of neighbors that is a candidate move
3.3 Select a candidate from the candidate set
3.4 Move to the candidate
</code></pre>
<br />
One requirement is that the optimal solution needs to be reachable by a sequence of moves. Usually this requires a proof that any arbitrary state is reachable from any other arbitrary state through a sequence of moves.<br />
<br />
Notice that there are many possible strategies for each of steps 3.1, 3.2, 3.3. For example, one strategy can examine all members within the neighborhood, pick the best one (in terms of the objective function) and move to it. Another strategy can randomly pick a member within the neighborhood and move to that member if it is better than the current state.<br />
<br />
Regardless of the strategy, the general theme is to move towards members that improve the objective function, hence the greedy nature of this algorithm.<br />
<br />
One downside of this algorithm is that it is possible to be trapped in a local optimum, which is the best candidate within its neighborhood, but not the best candidate in a global sense.<br />
<br />
In the following, we'll explore a couple of meta-heuristic techniques that can mitigate the local optimum trap.<br />
<br />
<h3>
Multiple rounds </h3>
We basically conduct k rounds of local search, with each round yielding a local optimum, and then pick the best one out of these k local optima.<br />
<br />
<h3>
Simulated Annealing</h3>
This strategy involves a dynamic combination of exploitation (better neighbor) and exploration (random walk to worse neighbor). The algorithm works in the following way ...<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>1. Pick an initial starting point
2. Repeat until terminate condition
2.1 Within neighborhood, pick a random member
2.2 If neighbor is better than me
move to the neighbor
else
With chance exp(-(myObj - neighborObj)/Temp)
move to the worse neighbor
2.3 Temp = alpha * Temp
</code></pre>
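<br />
Below is a minimal runnable sketch of the simulated annealing loop above, using a toy one-dimensional objective with many local maxima (the objective function, neighborhood size and cooling parameters are illustrative choices of mine, not part of the original pseudocode):<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>import math
import random

def objective(x):
    # Toy function with many local maxima; global maximum near x = 0
    return -x * x + 10 * math.cos(2 * math.pi * x)

def simulated_annealing(start, temp=10.0, alpha=0.95, steps=2000):
    current = start
    for _ in range(steps):
        neighbor = current + random.uniform(-0.5, 0.5)  # random member of the neighborhood
        delta = objective(neighbor) - objective(current)
        if delta > 0:
            current = neighbor                           # exploit: better neighbor
        elif random.random() < math.exp(delta / temp):   # explore: sometimes accept a worse one
            current = neighbor
        temp *= alpha                                    # cool down
    return current

best = simulated_annealing(start=random.uniform(-5, 5))
print(best, objective(best))
</code></pre>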
<br />
<h3>
Tabu Search</h3>
This strategy maintains a list of previously visited states (called the Tabu list) and makes sure these states are not re-visited in subsequent exploration. The search keeps exploring the best move while skipping the previously visited nodes. This way the algorithm explores paths that haven't been visited before. The search also remembers the best state obtained so far.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>1. Initialization
1.1 Pick an initial starting point S
1.2 Initial an empty Tabu list
1.3 Set the best state to S
1.4 Put S into the Tabu list
2. Repeat until terminate condition
2.1 Find a neighborhood
2.2 Locate a smaller subset that is a candidate move
2.3 Remove elements that is already in Tabu list
2.4 Select the best candidate and move there
2.5 Add the new state to the Tabu list
2.6 If the new state is better than the best state
2.6.1 Set the best state to this state
2.7 If the Tabu list is too large
2.7.1 Trim Tabu list by removing old items
</code></pre>
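<br />
And here is a similarly minimal sketch of tabu search over a discrete toy state space (the integer states, the +/-1 neighborhood, the tabu-list size and the termination condition are illustrative assumptions of mine):<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from collections import deque

def objective(x):
    # Toy objective over integers; best value at x = 17
    return -(x - 17) ** 2

def tabu_search(start, max_iter=200, tabu_size=10):
    current = start
    best = start
    tabu = deque([start], maxlen=tabu_size)   # old items fall off automatically
    for _ in range(max_iter):
        neighbors = [current - 1, current + 1]
        candidates = [n for n in neighbors if n not in tabu]
        if not candidates:
            break
        current = max(candidates, key=objective)   # best non-tabu move (may be worse)
        tabu.append(current)
        if objective(current) > objective(best):
            best = current
    return best

print(tabu_search(start=0))
</code></pre>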
<br />Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-59854654324442415262013-11-09T23:12:00.002-08:002013-11-09T23:18:55.613-08:00Diverse recommenderThis is a continuation of <a href="http://horicky.blogspot.com/2011/09/recommendation-engine.html" target="_blank">my previous blog on recommendation systems</a>, which describes some basic algorithms for building recommendation systems. These techniques evaluate each item against the user's interests independently and pick the top-K items to construct the recommendation. However, they suffer from a lack of diversity. For example, the list may contain the same book in soft cover, hard cover, and Kindle versions. Since human interests are usually diverse, a better recommendation list should contain items that cover a broad spectrum of the user's interests, even though each element by itself may not be the most aligned with the user's interests.<br />
<br />
In this post, I will discuss a recommendation algorithm that considers diversity in its list of recommendations.<br />
<br />
<h3>
Topic Space </h3>
First of all, let's define a "topic space" to which both the content and the user will be mapped. Having a "topic space" is a common approach in recommendation because it reduces dimensionality, resulting in better system performance and improved generality.<br />
<br />
The set of topics in the topic space can be extracted algorithmically using text mining techniques such as LDA, but for simplicity here we use a manual approach to define the topic space (topics should be orthogonal to each other, as highly correlated topics can distort the measures). Let's say we have the following topics defined ...<br />
<ul>
<li>Romance</li>
<li>Sports</li>
<li>Politics</li>
<li>Weather</li>
<li>Horror</li>
<li>...</li>
</ul>
<br />
<h3>
Content as Vector of topic weights </h3>
Once the topic space is defined, content authors can assign topic weights to each piece of content. For example, a movie can be assigned genres and a web page can be assigned topics as well. Notice that a single piece of content can be assigned multiple topics with different weights. In other words, each piece of content can be described as a vector of topic weights.<br />
<br />
<h3>
User as Vector of topic weights</h3>
On the other hand, a user can also be represented as a vector of topic weights based on their interactions with content, such as viewing a movie, visiting a web page, buying a product ... etc. Such an interaction can have a positive or negative effect depending on whether the user likes or dislikes the content. If the user likes the content, the corresponding topic weights in the user vector (those associated with the content) are increased by multiplying by alpha (alpha > 1). If the user dislikes the content, the corresponding topic weights are divided by alpha. After each update, the user vector is normalized to a unit vector.<br />
<br />
<h3>
Diversifying the recommendation</h3>
We use a utility function to model the diversity of the recommended set of documents, and then maximize that utility function.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ0ajI_yYCD30n9EyAGqcWZjUWAkO9LBc1207VTxOTCg_EJv20jE7AmfKG9nuB4J6QFz1KkCJgcwJikk9WOioCjFSpkST3tKbMmtFFXKRSTZcEVNhKzS5u6ATrAGT5hQ5BoiVeGDn1wDbP/s1600/p1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="207" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJ0ajI_yYCD30n9EyAGqcWZjUWAkO9LBc1207VTxOTCg_EJv20jE7AmfKG9nuB4J6QFz1KkCJgcwJikk9WOioCjFSpkST3tKbMmtFFXKRSTZcEVNhKzS5u6ATrAGT5hQ5BoiVeGDn1wDbP/s400/p1.png" width="400" /></a></div>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
In practice, A is not computed from the full set of documents, which is usually huge. The full set of documents is typically indexed using some kind of <a href="http://horicky.blogspot.com/2013/02/text-processing-part-2-inverted-index.html" target="_blank">inverted index technology</a>, using the set of topics as keywords, with each c[j,k] represented as a tf-idf value.<br />
<br />
The user is represented as a "query" and is sent to the inverted index as a search request. Relevant documents (based on a cosine distance measure w.r.t. the user vector) are returned as the candidate set D (e.g. the top 200 relevant pieces of content). <br />
<br />
To pick the optimal set A out of D, we use a greedy approach as follows (a small code sketch follows this list) ...<br />
<ol>
<li>Start with an empty set A</li>
<li>Repeat the following until |A| reaches H </li>
</ol>
<ul>
<li>Pick a doc i from D such that by adding it to the set A will maximize the utility function</li>
<li>Add doc i into A and remove doc i from D</li>
</ul>
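The greedy loop can be sketched as follows. The actual utility function of this post is the one defined in the figure above; here I substitute a simple relevance-minus-redundancy scoring (in the spirit of maximal marginal relevance), so treat the scoring details and the trade-off parameter as assumptions of mine rather than the post's formula.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def diversified_topk(user_vec, candidates, H, trade_off=0.7):
    """candidates: dict of doc_id -> topic-weight vector (the candidate set D)."""
    A = []                     # the recommendation list being built
    D = dict(candidates)       # remaining candidates
    while len(A) < H and D:
        def marginal_utility(doc_id):
            relevance  = cosine(user_vec, D[doc_id])
            redundancy = max((cosine(D[doc_id], candidates[a]) for a in A), default=0.0)
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(D, key=marginal_utility)   # doc that adds the most utility to A
        A.append(best)
        del D[best]
    return A
</code></pre>
<br />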
Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-45916525391463924462013-09-07T22:11:00.002-07:002013-09-07T22:12:36.750-07:00Exploration and Exploitation"Cold start" is a common problem that happens quite frequently in recommendation systems. When a new item enters, there is no prior history that the recommendation system can use. Recommendation is an optimization engine which recommends the items that best match the user's interests, so without prior statistics the new item will hardly ever be picked as a recommendation, and hence it continuously lacks the statistics that the recommendation system needs.<br />
<br />
One example is movie recommendation, where a movie site recommends movies to users based on their past viewing history. When a new movie arrives on the market, there aren't enough viewing statistics about it; therefore the new movie will not have a strong match score and won't be picked as a recommendation. Because we never learn from items we haven't recommended, the new movie will continue to have no statistics and therefore will never be picked in future recommendations.<br />
<br />
Another cold start example is online Ad serving when a new Ad enters the Ad repository.<br />
<br />
<h3>
Multilevel Granularity Prediction</h3>
One solution to the cold-start problem is to leverage existing items that are "SIMILAR" to the new item, where "similarity" is based on content attributes (e.g. actors, genres). Notice that here we are using a coarser level of granularity (a group of similar items) for prediction, which can be less accurate than a fine-grain model that uses viewing history statistics for prediction.<br />
<br />
In other words, we can make recommendations based on two models of different granularity. The fine-grain model based on instance-specific history data is preferred because it usually has higher accuracy. For the cold-start problem, when new items don't have history data available, we fall back to the coarse-grain model based on other similar items to predict the user's interest in the new items.<br />
<br />
A common approach is to combine both models of different granularity using different weights, where the weights depend on the confidence level of the fine-grain model. For new items, the fine-grain model will have low confidence and therefore more weight is given to the coarse-grain model.<br />
<br />
However, in some cases we don't have a coarser level of granularity, or the coarse level is too coarse and doesn't give good predictions, and we have to use the fine-grain model. But how can we build up the instance-specific history for the fine-grain model when we are not sure whether the new items are good recommendations for the user ?<br />
<br />
<h3>
Optimization under Uncertainty</h3>
The core of our problem is that we need to optimize under uncertainty. We have two approaches<br />
<ol>
<li>Exploitation: Make the optimal choice based on current data. Because of uncertainty (high variance), the current data may deviate from its true expected value, so we may end up picking a non-optimal choice.</li>
<li>Exploration: Make a random choice, or choices that we haven't made before. The goal is to gather more data points and reduce the uncertainty. This may waste cycles that could have been spent picking the optimal choice. </li>
</ol>
Let's start with a simple multi-armed bandit problem. There are multiple bandits in a casino, and each bandit has a different probability of winning. If you knew the true underlying winning probability of each bandit, you would pick the one with the highest winning probability and keep playing that one.<br />
Unfortunately, you don't know the underlying probabilities and have only a limited number of rounds to play. How would you choose which bandit to play to maximize the total number of rounds you win?<br />
<br />
Our strategy should strike a good balance between exploiting and exploring. To measure how good a strategy is, there is a concept of "regret", which compares the following two quantities<br />
<ul>
<li>The value you would obtain by following the batch-optimal strategy (after you have done batch analysis and have a clear picture of the underlying probability distribution)</li>
<li>The value you obtain by following your strategy</li>
</ul>
We'll pick a strategy that does more exploration initially, when there is a lot of uncertainty, and gradually tunes down the proportion of exploration (leveraging exploitation more) as we collect more statistics.<br />
<br />
<h3>
Epsilon-Greedy Strategy </h3>
In the <b>"epsilon-greedy"</b> strategy, at every play we throw a die to decide between exploring and exploiting.<br />
With probability p(t) = k/t (where k is a constant and t is the number of tries so far), we pick a bandit at random with equal chance (regardless of whether the bandit has been picked in the past). And with probability 1 - p(t), we pick the bandit that has the highest probability of winning based on past statistics.<br />
<br />
Epsilon-greedy has the desirable property of spending more time exploring initially and gradually reducing that portion as time passes. However, it doesn't have a smooth transition between exploring and exploiting. Also, while it explores, it picks each bandit uniformly without giving more weight to unexplored bandits; while it exploits, it doesn't consider the confidence of the probability estimates.<br />
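<br />
As a small simulation of the epsilon-greedy strategy described above (the true winning probabilities, the constant k and the number of plays are made-up values for illustration):<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>import random

true_prob = [0.3, 0.5, 0.7]          # unknown to the player
wins  = [0, 0, 0]
plays = [0, 0, 0]
k = 5.0

for t in range(1, 10001):
    p_explore = min(1.0, k / t)                       # p(t) = k / t
    if random.random() < p_explore:
        i = random.randrange(len(true_prob))          # explore: uniform random bandit
    else:
        rates = [wins[j] / plays[j] if plays[j] else 0.0 for j in range(len(true_prob))]
        i = max(range(len(true_prob)), key=lambda j: rates[j])   # exploit: best so far
    plays[i] += 1
    wins[i] += random.random() < true_prob[i]

print(plays, wins)
</code></pre>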
<br />
<h3>
Upper Confidence Bound: UCB</h3>
In the more sophisticated UCB strategy, each bandit is associated with an estimated mean with a confidence interval. In every play, we choose the bandit whose upper confidence bound (ie: mean + standard deviation) is the largest.<br />
<br />
Initially each bandit has a zero mean and a large confidence interval. As time goes by, we estimate the mean p[i] of bandit i based on how many times it has won since we started playing it. We also adjust the confidence interval (reducing the deviation) as we play the bandit,<br />
e.g. the standard deviation is (p*(1-p)/n)^0.5<br />
<br />
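Here is a minimal sketch of this UCB variant, following the post's formulation of the upper bound as the estimated mean plus one standard deviation (p*(1-p)/n)^0.5 (forcing each bandit to be tried at least once is my own initialization choice):<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>import math
import random

true_prob = [0.3, 0.5, 0.7]          # unknown to the player
wins  = [0] * 3
plays = [0] * 3

def upper_bound(i):
    if plays[i] == 0:
        return float("inf")          # force each bandit to be tried at least once
    p = wins[i] / plays[i]
    return p + math.sqrt(p * (1 - p) / plays[i])

for t in range(10000):
    i = max(range(3), key=upper_bound)       # largest mean + standard deviation
    plays[i] += 1
    wins[i] += random.random() < true_prob[i]

print(plays, wins)
</code></pre>
<br />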
Notice that the UCB model can be used in a more general online machine learning setting. We require the machine learning model to be able to output its estimate together with a confidence measure. As a concrete example, let's say a user is visiting our movie site and we want to recommend a movie to the user based on a bunch of input features (e.g. user features, query features ... etc.).<br />
<br />
We can do a first-round selection (based on information retrieval techniques) to identify movie candidates based on relevancy (i.e. the user's viewing history or search query). For each movie candidate, we can invoke the ML model to estimate the interest level, as well as the 68% confidence boundary (the confidence level is arbitrary and needs to be hand-tuned; 68% is roughly one standard deviation of a Gaussian distribution). We then combine them by adding the 68% confidence range as an offset to the estimate, and recommend the movie that has the highest resulting value.<br />
<br />
After the recommendation, we monitor whether the user clicks on it, views it ... etc., and the response is fed back to our ML model as new training data. Our ML model operates in an online learning setting and will update itself with this new training data. Over time, the 68% confidence range will shrink as more and more data is gathered.<br />
<br />
<h3>
Relationship with A/B Testing</h3>
For most web sites, we run experiments continuously to improve the user experience by trying out different layouts, or to improve user engagement by recommending different types of content. In general, we have an objective function that defines what aspect we are trying to optimize, and we run different experiments through A/B testing to try out different combinations of configuration and see which one maximizes our objective function.<br />
<br />
When the number of experiments (combinations of different configurations) is small, A/B testing is mainly exploration. In a typical setting, we use the old user experience as the control and the new user experience as the treatment. The goal is to test whether the treatment causes any significant improvement over the control. A certain percentage of production users (typically 5 - 10%) is directed to the new experience, and we measure whether the user engagement level (say this is our objective function) has increased significantly in a statistical sense. Such splitting is typically done by hashing the user id (or browser cookie), and the range that the hash code falls into determines whether the user gets the new experience. This hashing is consistent (the same user will hash into the same bucket in subsequent requests), so the user gets the whole new user experience when visiting the web site.<br />
<br />
When the number of experiments is large and new experiments come out dynamically and unpredictably, the traditional A/B testing model described above will not be able to keep track of all pairs of control and treatment combinations. In this case, we need to use a dynamic exploration/exploitation mechanism to find the best user experience.<br />
<br />
Using the UCB approach, we can treat each user experience as a bandit that the A/B test framework can choose from. Throughout the process, the A/B test framework explores/exploits among the different user experiences to optimize the objective function. At any time, we can query the A/B testing framework to find out the latest statistics of each user experience. This provides a much better way to look at a large number of experiment results at the same time. Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com1tag:blogger.com,1999:blog-7994087232040033267.post-89931775918068556902013-08-17T23:37:00.003-07:002013-08-17T23:41:49.515-07:00Six steps in data science"Data science" and "big data" have become very hot terms in the last two years. Since the price of data storage has dropped significantly, enterprises and web companies have collected huge amounts of their customers' behavioral data, even before figuring out how to use it. Enterprises are also starting to realize that this data, if used appropriately, can be turned into useful insight that can guide the business along a more successful path.<br />
<br />
From my observation, the effort of data science can be categorized roughly into the following areas.<br />
<ul>
<li>Data acquisition</li>
<li>Data visualization</li>
<li>OLAP, Report generation</li>
<li>Response automation</li>
<li>Predictive analytics</li>
<li>Decision optimization</li>
</ul>
<br />
<h3>
Data Acquisition</h3>
Data acquisition is about collecting data into the system. This is an important step before any meaningful processing or analysis can begin. Most companies start with collecting business transaction records from their OLTP system. Typically there is an ETL (Extract/Transform/Load) process involved to ingest the raw data, clean the data and transform it appropriately. Finally, the data is loaded into a data warehouse where the subsequent data analytics exercises are performed. In today's big data world, the destination has shifted from the traditional data warehouse to the Hadoop distributed file system (HDFS).<br />
<br />
<h3>
Data Visualization</h3>
Data visualization is usually the first step of analyzing the data. It is typically done by plotting data in different ways to get a quick sense of its shape, in order to guide the data scientist in determining what subsequent analysis should be conducted. This is also where a more experienced data scientist can be distinguished from a less experienced one, based on how fast they can spot common patterns or anomalies in data. Most plotting packages work only with data that fits on a single machine. Therefore, in the big data world, data sampling is typically conducted first to reduce the data size, and the sample is then imported to a single machine where R, SAS, or SPSS can be used to visualize it.<br />
<br />
<h3>
OLAP / Reporting </h3>
OLAP is about aggregating transaction data (e.g. revenue) along different dimensions (e.g. month, location, product) where the enterprise defines its KPIs/business metrics that measure the company's performance. This can be done either in an ad-hoc manner (OLAP) or in a predefined manner (report templates). Report writers (e.g. Tableau, MicroStrategy) are used to produce the reports. The data is typically stored in a regular RDBMS or a multidimensional cube which is optimized for OLAP processing (i.e. slice, dice, rollup, drilldown). In the big data world, Hive provides a SQL-like access mechanism and is commonly used to access data stored in HDFS. Most popular report writers have integrated with Hive (or declared plans to integrate with it) to access big data stored in HDFS.<br />
<br />
<h3>
Response Automation</h3>
Response automation is about leveraging domain-specific knowledge to encode a set of "rules" which include event/condition/action. The system monitors all observed events, matches them against the conditions (which can be a boolean expression of event attributes, or a sequence of event occurrences), and triggers appropriate actions. In the big data world, automating such responses is typically done by a stream processing mechanism (such as Flume or Storm). Notice that the "rules" need to be well-defined and unambiguous.<br />
<br />
<h3>
Predictive Analytics</h3>
Prediction is about estimating unknown data based on observed data through a statistical/probabilistic approach. Depending on the data type of the output, "prediction" can be subdivided into "classification" (when the output is a category) or "regression" (when the output is a number).<br />
<br />
Prediction is typically done by first "training" a predictive model using historical data (where all input and output values are known). This training is done via an iterative process where the performance of the model is measured at the end of each iteration. Additional input data or different model parameters are used in the next iteration. When the predictive performance is good enough and no significant improvement is made between subsequent iterations, the process stops and the best model created during the process is used.<br />
<br />
Once we have the predictive model, we can use it to predict information we haven't observed, whether this is information that is hidden or that hasn't happened yet (i.e. predicting the future). <br />
<br />
For a more detailed description of what is involved in performing predictive analytics, please refer to the following posts of mine.<br />
<ul>
<li><a href="http://horicky.blogspot.com.au/2012/05/predictive-analytics-overview-and-data.html" target="_blank">Overview and Data visualization</a></li>
<li><a href="http://horicky.blogspot.com.au/2012/05/predictive-analytics-data-preparation.html" target="_blank">Data Preparation</a></li>
<li><a href="http://horicky.blogspot.com.au/2012/05/predictive-analytics-generalized-linear.html" target="_blank">Generalized Linear Regression</a></li>
<li><a href="http://horicky.blogspot.com.au/2012/06/predictive-analytics-neuralnet-bayesian.html" target="_blank">NeuralNet, Bayesian, SVM, KNN</a></li>
<li><a href="http://horicky.blogspot.com.au/2012/06/predictive-analytics-decision-tree-and.html" target="_blank">Decision Tree and Ensembles</a></li>
<li><a href="http://horicky.blogspot.com.au/2012/06/predictive-analytics-evaluate-model.html" target="_blank">Evaluate Model Performance</a> </li>
</ul>
<br />
<h3>
Decision Optimization</h3>
Decision optimization is about making the best decision after carefully evaluating possible options against some measure of business objectives. The business objective is defined by some "objective function" which is expressed as a mathematical formula of some "decision variables". Through various optimization techniques, the system will figure out what the decision variables should be in order to maximize (or minimize) the value of the objective function. Optimization for discrete decision variables is typically done using exhaustive search, greedy/heuristic search, or integer programming techniques, while optimization for continuous decision variables is done using linear/quadratic programming.<br />
<br />
From my observation, "decision optimization" is the end goal of most data science effort. But "decision optimization" relies on previous effort. To obtain the overall picture (not just observed data, but also hidden or future data) at the optimization phase, we need to make use of the output of the prediction. And in order to train a good predictive model, we need to have the data acquisition to provide clean data. All these efforts are inter-linked with each other in some sense. <br />
<br />Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-65756978278964439222013-07-29T13:02:00.003-07:002013-07-29T13:11:19.023-07:00OLAP operation in ROLAP (Online Analytical Processing) is a very common way to analyze raw transaction data by aggregating along different combinations of dimensions. This is a well-established field in Business Intelligence / Reporting. In this post, I will highlight the key ideas in OLAP operation and illustrate how to do this in R.<br />
<br />
<h3>
Facts and Dimensions</h3>
The core part of OLAP is a so-called "multi-dimensional data model", which contains two types of tables: the "Fact" table and the "Dimension" table<br />
<br />
A Fact table contains records that each describe an instance of a transaction. Each transaction record contains categorical attributes (which describe contextual aspects of the transaction, such as space, time, user) as well as numeric attributes (called "measures", which describe quantitative aspects of the transaction, such as number of items sold or dollar amount).<br />
<br />
A Dimension table contains records that further elaborate the contextual attributes, such as user profile data, location details ... etc.<br />
<br />
In a typical setting of Multi-dimensional model ...<br />
<ul>
<li>Each Fact table contains foreign keys that reference the primary keys of multiple Dimension tables. In its simplest form, this is called a STAR schema.</li>
<li>Dimension tables can contain foreign keys that reference other Dimension tables. This provides a more sophisticated, detailed breakdown of the contextual aspects. This is also called a SNOWFLAKE schema.</li>
<li>Although this is not a hard rule, Fact tables tend to be independent of other Fact tables and usually don't contain reference pointers to each other.</li>
<li>However, different Fact tables usually share the same set of Dimension tables. This is also called a GALAXY schema.</li>
<li>But it is a hard rule that a Dimension table NEVER points to / references a Fact table</li>
</ul>
A simple STAR schema is shown in following diagram.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaJzT_nbzOS4qj44oAvgiPCPFO21MKSnht0_XI5nilhR60vbdHGoow2LtQnBMpNf-yji6iIbHhc9SbotJ1dvCfBB19rhCYGM8sZ3Klt8juJkFg1GKjMc_lAAAg71RpWxEKMe42Sg5N95Ih/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaJzT_nbzOS4qj44oAvgiPCPFO21MKSnht0_XI5nilhR60vbdHGoow2LtQnBMpNf-yji6iIbHhc9SbotJ1dvCfBB19rhCYGM8sZ3Klt8juJkFg1GKjMc_lAAAg71RpWxEKMe42Sg5N95Ih/s320/p1.png" width="320" /></a></div>
<br />
Each dimension can also be hierarchical so that the analysis can be done at different degrees of granularity. For example, the time dimension can be broken down into days, weeks, months, quarters and years; similarly, the location dimension can be broken down into countries, states, cities ... etc.<br />
<br />
Here we first create a sales fact table that records each sales transaction.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Setup the dimension tables
state_table <- 
  data.frame(key=c("CA", "NY", "WA", "ON", "QU"),
             name=c("California", "new York", "Washington", "Ontario", "Quebec"),
             country=c("USA", "USA", "USA", "Canada", "Canada"))
month_table <- 
  data.frame(key=1:12,
             desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
             quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4"))
prod_table <- 
  data.frame(key=c("Printer", "Tablet", "Laptop"),
             price=c(225, 570, 1120))
# Function to generate the Sales table
gen_sales <- function(no_of_recs) {
  # Generate transaction data randomly
  loc <- sample(state_table$key, no_of_recs, 
                replace=T, prob=c(2,2,1,1,1))
  time_month <- sample(month_table$key, no_of_recs, replace=T)
  time_year <- sample(c(2012, 2013), no_of_recs, replace=T)
  prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2))
  unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
  amount <- unit*prod_table[prod,]$price
  sales <- data.frame(month=time_month,
                      year=time_year,
                      loc=loc,
                      prod=prod,
                      unit=unit,
                      amount=amount)
  # Sort the records by time order
  sales <- sales[order(sales$year, sales$month),]
  row.names(sales) <- NULL
  return(sales)
}
# Now create the sales fact table
sales_fact <- gen_sales(500)
# Look at a few records
head(sales_fact)
month year loc prod unit amount
1 1 2012 NY Laptop 1 225
2 1 2012 CA Laptop 2 450
3 1 2012 ON Tablet 2 2240
4 1 2012 NY Tablet 1 1120
5 1 2012 NY Tablet 2 2240
6 1 2012 CA Laptop 1 225
</code></pre>
<br />
<h3>
Multi-dimensional Cube</h3>
Now, we turn this fact table into a hypercube with multiple dimensions. Each cell in the cube represents an aggregate value for a unique combination of each dimension. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBkg9UlMQfpUXHGuz8yyw4wdBKRFdKUljcMNFd1ziid8ngH9la0vE2HXrRIrcxrqk_X3gsb3ySwhQ4_Iy6U7ET2hD8IKFnUTXSsn7Uu67T_vjNhxbsnhIyk2436lVcLUxKv9A3sT2CUjhZ/s1600/p2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="365" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBkg9UlMQfpUXHGuz8yyw4wdBKRFdKUljcMNFd1ziid8ngH9la0vE2HXrRIrcxrqk_X3gsb3ySwhQ4_Iy6U7ET2hD8IKFnUTXSsn7Uu67T_vjNhxbsnhIyk2436lVcLUxKv9A3sT2CUjhZ/s400/p2.png" width="400" /></a></div>
<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Build up a cube
revenue_cube <- 
  tapply(sales_fact$amount, 
         sales_fact[,c("prod", "month", "year", "loc")], 
         FUN=function(x){return(sum(x))})
# Showing the cells of the cube
revenue_cube
, , year = 2012, loc = CA
month
prod 1 2 3 4 5 6 7 8 9 10 11 12
Laptop 1350 225 900 675 675 NA 675 1350 NA 1575 900 1350
Printer NA 2280 NA NA 1140 570 570 570 NA 570 1710 NA
Tablet 2240 4480 12320 3360 2240 4480 3360 3360 5600 2240 2240 3360
, , year = 2013, loc = CA
month
prod 1 2 3 4 5 6 7 8 9 10 11 12
Laptop 225 225 450 675 225 900 900 450 675 225 675 1125
Printer NA 1140 NA 1140 570 NA NA 570 NA 1140 1710 1710
Tablet 3360 3360 1120 4480 2240 1120 7840 3360 3360 1120 5600 4480
, , year = 2012, loc = NY
month
prod 1 2 3 4 5 6 7 8 9 10 11 12
Laptop 450 450 NA NA 675 450 675 NA 225 225 NA 450
Printer NA 2280 NA 2850 570 NA NA 1710 1140 NA 570 NA
Tablet 3360 13440 2240 2240 2240 5600 5600 3360 4480 3360 4480 3360
, , year = 2013, loc = NY
.....
dimnames(revenue_cube)
$prod
[1] "Laptop" "Printer" "Tablet"
$month
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
$year
[1] "2012" "2013"
$loc
[1] "CA" "NY" "ON" "QU" "WA"
</code></pre>
<br />
<h3>
OLAP Operations</h3>
Here are some common operations of OLAP<br />
<ul>
<li>Slice</li>
<li>Dice</li>
<li>Rollup</li>
<li>Drilldown</li>
<li>Pivot </li>
</ul>
<b>"Slice"</b> is about fixing certain dimensions to analyze the remaining dimensions. For example, we can focus in the sales happening in "2012", "Jan", or we can focus in the sales happening in "2012", "Jan", "Tablet".<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Slice
# cube data in Jan, 2012
revenue_cube[, "1", "2012",]
loc
prod CA NY ON QU WA
Laptop 1350 450 NA 225 225
Printer NA NA NA 1140 NA
Tablet 2240 3360 5600 1120 2240
# Tablet sales in Jan, 2012
revenue_cube["Tablet", "1", "2012",]
CA NY ON QU WA
2240 3360 5600 1120 2240
</code></pre>
<br />
<b> "Dice"</b> is about limited each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube. For example, we can focus in sales happening in [Jan/ Feb/Mar, Laptop/Tablet, CA/NY].<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>revenue_cube[c("Tablet","Laptop"),
c("1","2","3"),
,
c("CA","NY")]
, , year = 2012, loc = CA
month
prod 1 2 3
Tablet 2240 4480 12320
Laptop 1350 225 900
, , year = 2013, loc = CA
month
prod 1 2 3
Tablet 3360 3360 1120
Laptop 225 225 450
, , year = 2012, loc = NY
month
prod 1 2 3
Tablet 3360 13440 2240
Laptop 450 450 NA
, , year = 2013, loc = NY
month
prod 1 2 3
Tablet 3360 4480 6720
Laptop 450 NA 225
</code></pre>
<br />
<b>"Rollup"</b> is about applying an aggregation function to collapse a number of dimensions. For example, we want to focus in the annual revenue for each product and collapse the location dimension (ie: we don't care where we sold our product). <br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>apply(revenue_cube, c("year", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
prod
year Laptop Printer Tablet
2012 22275 31350 179200
2013 25200 33060 166880
</code></pre>
<br />
<b>"Drilldown"</b> is the reverse of "rollup" and applying an aggregation function to a finer level of granularity. For example, we want to focus in the annual and monthly
revenue for each product and collapse the location dimension (ie: we
don't care where we sold our product).<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>apply(revenue_cube, c("year", "month", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
, , prod = Laptop
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2012 2250 2475 1575 1575 2250 1800 1575 1800 900 2250 1350 2475
2013 2250 900 1575 1575 2250 2475 2025 1800 2025 2250 3825 2250
, , prod = Printer
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2012 1140 5700 570 3990 4560 2850 1140 2850 2850 1710 3420 570
2013 1140 4560 3420 4560 2850 1140 570 3420 1140 3420 3990 2850
, , prod = Tablet
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2012 14560 23520 17920 12320 10080 14560 13440 15680 25760 12320 11200 7840
2013 8960 11200 10080 7840 14560 10080 29120 15680 15680 8960 12320 22400
</code></pre>
<br />
<b>"Pivot"</b> is about analyzing the combination of a pair of selected dimensions. For example, we want to analyze the revenue by year and month. Or we want to analyze the revenue by product and location.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>apply(revenue_cube, c("year", "month"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2012 17950 31695 20065 17885 16890 19210 16155 20330 29510 16280 15970 10885
2013 12350 16660 15075 13975 19660 13695 31715 20900 18845 14630 20135 27500
apply(revenue_cube, c("prod", "loc"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
loc
prod CA NY ON QU WA
Laptop 16425 9450 7650 7425 6525
Printer 15390 19950 7980 10830 10260
Tablet 90720 117600 45920 34720 57120
</code></pre>
<br />
I hope you got a taste of the richness of the data processing model in R.<br />
<br />
However, since R does all the processing in RAM, your data needs to be small enough to fit into the local memory of a single machine. <br />
<br />Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-26134218537402767422013-02-23T00:08:00.002-08:002013-02-23T00:11:44.849-08:00Text processing (part 2): Inverted IndexThis is the second part of my text processing series. In this blog, we'll look into how text documents can be stored in a form that can be easily retrieved by a query. I'll use the popular open source Apache Lucene index for illustration.<br />
<br />
There are two main processing flows in the system ...<br />
<ul>
<li>Document indexing: Given a document, add it into the index</li>
<li>Document retrieval: Given a query, retrieve the most relevant documents from the index.</li>
</ul>
The following diagram illustrates how this is done in Lucene.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYqO_2WjKcKemjaECzKxOJyQNXm4Ua63PvN7mnSjWyog-GFcLUoFV5PvW_u8ou69WopgdaKFXH3m04eg9p_Rl2C3UrWAorA842Ah4NEo77QFE-S7-P7pWrbktThzZAUIfO7CzpG0HBoHht/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="257" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYqO_2WjKcKemjaECzKxOJyQNXm4Ua63PvN7mnSjWyog-GFcLUoFV5PvW_u8ou69WopgdaKFXH3m04eg9p_Rl2C3UrWAorA842Ah4NEo77QFE-S7-P7pWrbktThzZAUIfO7CzpG0HBoHht/s400/p1.png" width="400" /></a></div>
<br />
<h3>
Index Structure</h3>
Both documents and queries are represented as bags of words. In Apache Lucene, a "Document" is the basic unit for storage and retrieval. A "Document" contains multiple "Fields" (also called zones). Each "Field" contains multiple "Terms" (equivalent to words).<br />
<br />
To control how the document will be indexed across its containing fields, a Field can be declared in multiple ways to specify whether it should be analyzed (a pre-processing step during indexing), indexed (participating in the index) or stored (in case it needs to be returned in the query result). <br />
<ul>
<li>Keyword (Not analyzed, Indexed, Stored)</li>
<li>Unindexed (Not analyzed, Not indexed, Stored)</li>
<li>Unstored (Analyzed, Indexed, Not stored)</li>
<li>Text (Analyzed, Indexed, Stored) </li>
</ul>
The inverted index is the core data structure of the storage. It is organized in an inverted manner, mapping from terms to the list of documents that contain the term. The list (known as the posting list) is ordered by a global ordering (typically by document id). To enable faster retrieval, the list is not just a single list but a hierarchy of skip lists. For simplicity, we ignore the skip lists in the subsequent discussion.<br />
<br />
This data structure is illustrated below based on Lucene's implementation. It is stored on disk as segment files which are brought into memory during processing.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioeu5kyAgfty28HHX2QHNDvLZu0BSdQK1oAmSj6xIXnC2xnlt-I4uhE_X4zVGsCclKolPDKpkUL7BlsJRFE7p-S_f2x-bd0VPCHurNfdB1ojoiPKPyhmIttgNC30rOboyyJ3q3e9Vqx9SJ/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioeu5kyAgfty28HHX2QHNDvLZu0BSdQK1oAmSj6xIXnC2xnlt-I4uhE_X4zVGsCclKolPDKpkUL7BlsJRFE7p-S_f2x-bd0VPCHurNfdB1ojoiPKPyhmIttgNC30rOboyyJ3q3e9Vqx9SJ/s400/p1.png" width="400" /></a></div>
<br />
The above diagram only shows the inverted index. The whole index contains an additional forward index as follows.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjp0b1UTkGjXDrYnL8yVseceuJXDCJlcVpc9G-4PyFnEz2vGgERGrvv2AIDVpZufg-yQFdHj2UqvDZzRJQVP4cUHlVfn95qMSEFa9DeXGxDinqfCgSaO5BVLjHrmxIcnDovNFrnnCaywem/s1600/p2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjp0b1UTkGjXDrYnL8yVseceuJXDCJlcVpc9G-4PyFnEz2vGgERGrvv2AIDVpZufg-yQFdHj2UqvDZzRJQVP4cUHlVfn95qMSEFa9DeXGxDinqfCgSaO5BVLjHrmxIcnDovNFrnnCaywem/s320/p2.png" width="206" /></a></div>
<br />
<br />
<h3>
Document indexing</h3>
A document in its raw form is extracted from a data adaptor (this can be calling a Web API to retrieve some text output, crawling a web page, or receiving an HTTP document upload). This can be done in a batch or online manner.<br />
<br />
When the indexing process starts, it parses each raw document and analyzes its text content. The typical steps include ...<br />
<ul>
<li>Tokenize the document (breakdown into words)</li>
<li>Lowercase each word (to make it non-case-sensitive, but need to be careful with names or abbreviations)</li>
<li>Remove stop words (take out high frequency words like "the", "a", but need to be careful with phrases)</li>
<li>Stemming (normalize different form of the same word, e.g. reduce "run", "running", "ran" into "run")</li>
<li>Synonym handling. This can be done in two ways. Either expand the term to include its synonyms (ie: if the term is "huge", add "gigantic" and "big"), or reduce the term to a normalized synonym (ie: if the term is "gigantic" or "huge", change it to "big")</li>
</ul>
At this point, the document is composed of multiple terms: doc = [term1, term2 ...]. Optionally, terms can be further combined into n-grams. After that we count the term frequencies of this document. For example, in a bi-gram expansion, the document will become ...<br />
doc1 -> {term1: 5, term2: 8, term3: 4, term1_2: 3, term2_3:1}<br />
<br />
We may also compute a "static score" based on some measure of the quality of the document. After that, we insert the document into the posting list (if it exists; otherwise we create a new posting list) for each term (including all n-grams). This creates the inverted list structure shown in the previous diagram.<br />
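<br />
To make the indexing flow concrete, here is a minimal in-memory sketch in Python (uni-grams only, no stemming or stop-word removal, and doc ids instead of full doc objects; all simplifications relative to Lucene):<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>from collections import defaultdict, Counter

# inverted index: term -> {doc_id: term_frequency}
inverted_index = defaultdict(dict)
# forward index: doc_id -> analyzed content
forward_index = {}

def index_document(doc_id, text):
    terms = [t.lower() for t in text.split()]     # tokenize + lowercase
    forward_index[doc_id] = terms
    for term, tf in Counter(terms).items():        # term frequencies for this doc
        inverted_index[term][doc_id] = tf

index_document(1, "the quick brown fox")
index_document(2, "the lazy dog")
print(sorted(inverted_index["the"].items()))   # -> [(1, 1), (2, 1)]
</code></pre>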
<br />
There is a boost factor that can be set on the document or field. The boost factor effectively multiplies the term frequency, which in turn affects the importance of the document or field.<br />
<br />
A document can be changed in the index in one of the following ways: inserted, modified or deleted.<br />
Typically the document is first added to a memory buffer, which is organized as an inverted index in RAM.<br />
<ul>
<li>When this is a document insertion, it goes through the normal indexing process (as I described above) to analyze the document and build an inverted list in RAM.</li>
<li>When this is a document deletion (the client request only contains the doc id), it fetches the forward index to extract the document content, then goes through the normal indexing process to analyze the document and build the inverted list. But in this case the doc object in the inverted list is labeled as "deleted".</li>
<li>When this is a document update (the client request contains the modified document), it is handled as a deletion followed by an insertion, which means the system first fetch the old document from the forward index to build an inverted list with nodes marked "deleted", and then build a new inverted list from the modified document. (e.g. If doc1 = "A B" is update to "A C", then the posting list will be {A:doc1(deleted) -> doc1, B:doc1(deleted), C:doc1}. After collapsing A, the posting list will be {A:doc1, B:doc1(deleted), C:doc1}</li>
</ul>
<br />
As more and more documents are inserted into the memory buffer, it becomes full and is flushed to a segment file on disk. In the background, when M segment files have accumulated, Lucene merges them into bigger segment files. Notice that the size of the segment files at each level increases exponentially (M, M^2, M^3). This keeps the number of segment files that need to be searched per query at O(logN) complexity, where N is the number of documents in the index. Lucene also provides an explicit "optimize" call that merges all the segment files into one.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIGSlA8f7ZqSvoSKY3QdzCI6eR4OtwMYazNboblp8CYtfwEoe2yjWC0i8g9ofuaMkc0TlAsgkdYlsnUNX-RCO2AKt1NM1gHlVb8cCAzg9ASAOBpUstO1k7-0ihKxER5i0pTjqKdfFasb2j/s1600/p3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIGSlA8f7ZqSvoSKY3QdzCI6eR4OtwMYazNboblp8CYtfwEoe2yjWC0i8g9ofuaMkc0TlAsgkdYlsnUNX-RCO2AKt1NM1gHlVb8cCAzg9ASAOBpUstO1k7-0ihKxER5i0pTjqKdfFasb2j/s400/p3.png" width="400" /></a></div>
<br />
<br />
Here let's detail the merging process a bit. Since the posting lists are already vertically ordered by term and horizontally ordered by doc id, merging two segment files S1 and S2 basically works as follows (a small code sketch follows this list)<br />
<ul>
<li>Walk the posting lists from both S1 and S2 together in sorted term order. For the non-common terms (terms that appear in one of S1 or S2 but not both), write out the posting list to a new segment S3.</li>
<li>When we find a common term T, we merge the corresponding posting lists from these 2 segments. Since both lists are sorted by doc id, we just walk down both posting lists to write out the doc objects to a new posting list. When both posting lists have the same doc (which is the case when the document is updated or deleted), we pick the latest doc based on time order.</li>
<li>Finally, the doc frequency of each posting list (of the corresponding term) will be computed.</li>
</ul>
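Below is a minimal R sketch of this merge, under the assumption that a segment is represented as a named list mapping each term to a data frame of (doc, deleted, version) rows sorted by doc id. This toy in-memory representation is mine, not Lucene's on-disk format.<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Merge the posting lists of one common term from two segments.
merge_postings <- function(p1, p2) {
  all <- rbind(p1, p2)
  # Sort by doc id, then by version, so the latest copy of a doc comes last
  all <- all[order(all$doc, all$version), ]
  # For a doc present in both segments, keep only the latest version
  all[!duplicated(all$doc, fromLast = TRUE), ]
}

# Merge two whole segments by walking their terms in sorted order.
merge_segments <- function(s1, s2) {
  terms <- sort(union(names(s1), names(s2)))
  merged <- lapply(terms, function(t) {
    if (is.null(s1[[t]])) s2[[t]]              # non-common term: copy as-is
    else if (is.null(s2[[t]])) s1[[t]]
    else merge_postings(s1[[t]], s2[[t]])      # common term: merge by doc id
  })
  names(merged) <- terms
  merged      # document frequency of term t is then nrow(merged[[t]])
}
</code></pre>
<br />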
<br />
<h3>
Document retrieval</h3>
Consider a document as a vector (one dimension per term, with the tf-idf value as the coordinate) and the query as a vector as well. The document retrieval problem can then be defined as finding the top-k documents most similar to the query, where similarity is the dot product or cosine similarity between the document vector and the query vector.<br />
<br />
tf-idf is a normalized frequency. TF (term frequency) represents how many times the term appears in the document (usually a dampening function such as square root or logarithm is applied). IDF is the inverse of the document frequency; it discounts the significance of a term that appears in many other documents. There are many variants of TF-IDF, but in general it reflects the strength of association between the document (or query) and each term.<br />
<br />
Given a query Q containing terms [t1, t2], here is how we fetch the corresponding documents. A common approach is the "document at a time" approach, where we traverse the posting lists of t1 and t2 concurrently (as opposed to the "term at a time" approach, where we traverse the whole posting list of t1 before starting on the posting list of t2). The traversal process is described as follows (a small R sketch of the scoring follows the list) ...<br />
<ul>
<li>For each query term t1, t2, we locate the corresponding posting list.</li>
<li>We walk the posting lists concurrently, which yields a sequence of documents ordered by doc id. Notice that each returned document contains at least one of the query terms but may contain several.</li>
<li>We compute the dynamic score, which is the dot product of the query and document vectors. Notice that we typically ignore the TF/IDF of the query itself (the query is short and we don't care about the frequency of each of its terms). Therefore we can simply sum the TF contributions of every posting list with a matching term, each weighted by the term's IDF (the document frequency is stored at the head of each posting list). Lucene also supports query-level boosting, where a boost factor can be attached to query terms; the boost factor multiplies the corresponding term's contribution.</li>
<li>We also look up the static score, which is based purely on the document (and not the query). The total score is a linear combination of the static and dynamic scores.</li>
<li>Although the score used in the above calculation is based on the cosine similarity between the query and the document, we are not restricted to that. We can plug in any similarity function that makes sense for the domain (e.g. we can use machine learning to train a model that scores the similarity between a query and a document).</li>
<li>After computing the total score, we insert the document into a heap data structure that maintains the top-K scored documents.</li>
</ul>
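The sketch below illustrates this scoring in R under some simplifying assumptions of mine: each posting list is a named numeric vector (doc id to term frequency), <code>doc_freq</code> holds the document frequency stored at the head of each list, and a plain named-vector accumulation stands in for the streaming doc-id walk and bounded heap a real engine would use.<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>score_query <- function(query_terms, postings, doc_freq, boost = NULL, k = 10) {
  scores <- numeric(0)
  for (t in query_terms) {
    pl <- postings[[t]]                    # named vector: doc id -> term frequency
    if (is.null(pl)) next
    b <- if (is.null(boost)) 1 else boost[[t]]
    contrib <- b * pl / doc_freq[[t]]      # tf weighted by idf (1 / document frequency)
    for (d in names(contrib)) {
      prev <- if (is.na(scores[d])) 0 else scores[d]
      scores[d] <- prev + contrib[[d]]     # accumulate the dynamic score per document
    }
  }
  # Keep the top-k dynamic scores; a static score could be blended in here
  head(sort(scores, decreasing = TRUE), k)
}
</code></pre>
<br />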
In this scheme the whole posting list is traversed. If the posting list is very long, the response latency will be long. Is there a way to avoid traversing the whole list and still find the approximate top-K documents? There are a couple of strategies we can consider.<br />
<ol>
<li>Static-score posting order: Notice that the posting list is sorted by a global order. This global ordering yields monotonically increasing document ids during traversal, which is important for the "document at a time" traversal because it guarantees we never visit the same document twice. This global ordering, however, can be fairly arbitrary and doesn't have to be the document id. We can therefore choose the order to be the static score (e.g. a quality indicator of the document), which is also global. The idea is that by traversing the posting list in decreasing static score, we are more likely to visit the documents with the highest total score (static + dynamic) early.</li>
<li>Cut frequent terms: We do not traverse the posting list of a term with a low IDF value (i.e. a term that appears in many documents, whose posting list therefore tends to be long). This way we avoid traversing the longest posting lists.</li>
<li>Top-R list: For each posting list, we create an extra posting list containing the R documents with the highest TF (term frequency) in the original list. When we perform a search, we search this top-R list instead of the original posting list.</li>
</ol>
<br />
Since we have multiple inverted indexes (the in-memory buffer as well as segment files at different levels), we need to combine their results. If termX appears in both segmentA and segmentB, the fresher version is picked. Freshness is determined as follows: the segment at a lower level (smaller size) is considered fresher; if two segment files are at the same level, the one with the higher sequence number is fresher. The global document frequency is taken as the sum of the corresponding document frequencies across the segment files (the value will be slightly off if the same document has been updated, but such a discrepancy is negligible). However, consolidating multiple segment files adds processing overhead to document retrieval. Lucene provides an explicit "optimize" call that merges all segment files into a single file, so only one file needs to be consulted during document retrieval.<br />
<br />
<h3>
Distributed Index</h3>
For a large corpus (like web documents), the index is typically distributed across multiple machines. There are two models of distribution: term partitioning and document partitioning.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOfChY0wxXRqnhC3o8-W5lif3wVoRqUu1KAmm5_El8aEBqwzNSlZYof1U31M2XY7bpVpWjx4CxB9PIvtHPpnZP3KYITmmZSF9MlAEL9iuyeOaaPMVPrZbdmtyCSrcsOyf8YEzuSORrZrlt/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOfChY0wxXRqnhC3o8-W5lif3wVoRqUu1KAmm5_El8aEBqwzNSlZYof1U31M2XY7bpVpWjx4CxB9PIvtHPpnZP3KYITmmZSF9MlAEL9iuyeOaaPMVPrZbdmtyCSrcsOyf8YEzuSORrZrlt/s400/p1.png" width="400" /></a></div>
<br />
In document partitioning, documents are randomly spread across the partitions where the index is built. In term partitioning, the terms are spread across the partitions. We'll discuss document partitioning, as it is more commonly used. The distributed index is provided by technologies built on top of Lucene, such as ElasticSearch. A typical setting is as follows ...<br />
<br />
In this setting, machines are organized as columns and rows. Each column represents a partition of documents, while each row represents a replica of the whole corpus.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGxitmbgZyLX8YIQuZK3uZ5vMFL0q6RvUcYeb2b4Q_XuYWmmfGIpOj2ulEXe_L0btNCSbffuwy9HkshwC7OeyQbGT3yCTzxbluXziM9qbcoxtm3hxbyaWWiRn7WjJa6UYem7PK-QEl49RS/s1600/p2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="327" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGxitmbgZyLX8YIQuZK3uZ5vMFL0q6RvUcYeb2b4Q_XuYWmmfGIpOj2ulEXe_L0btNCSbffuwy9HkshwC7OeyQbGT3yCTzxbluXziM9qbcoxtm3hxbyaWWiRn7WjJa6UYem7PK-QEl49RS/s400/p2.png" width="400" /></a></div>
<br />
<br />
During document indexing, a row of machines is first selected and allocated for building the index. When a new document is crawled, a column machine from the selected row is randomly picked to host the document. The document is sent to this machine, where the index is built. The updated index is later propagated to the other rows of replicas.<br />
<br />
During document retrieval, a row of replica machines is first selected. The client query is then broadcast to every column machine of the selected row. Each machine performs the search in its local index and returns its top-M elements to the query processor, which consolidates the results before sending them back to the client. Notice that K/P < M < K, where K is the number of top documents the client expects and P is the number of column machines; M is a parameter that needs to be tuned.<br />
<br />
One caveat of this distributed index is that because the posting list is split horizontally across partitions, we lose the global view of the IDF value, without which a machine cannot compute the TF-IDF score. There are two ways to mitigate this ...<br />
<ol>
<li>Do nothing: here we assume the documents are spread evenly across the partitions, so the local IDF is a good approximation of the actual IDF.</li>
<li>Extra round trip: In the first round, the query is broadcast to every column, and each column returns its local document frequency for the query terms. The query processor collects all the responses and computes the global sum. In the second round, it broadcasts the query along with this sum to each column machine, which computes its local score based on the global figure (a small sketch follows the list).</li>
</ol>
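Here is a tiny R illustration of the second option, with made-up partition counts: each partition reports its local document frequency for the query terms in round one, and the summed (global) value is what gets broadcast back in round two.<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Local document frequencies reported by each partition (hypothetical numbers)
local_df <- list(
  part1 = c(apple = 120, banana = 40),
  part2 = c(apple = 95,  banana = 60),
  part3 = c(apple = 110, banana = 55)
)
query_terms <- c("apple", "banana")

# Round 1: collect and sum the local counts into a global document frequency
global_df <- Reduce(`+`, lapply(local_df, function(p) p[query_terms]))
global_df
#  apple banana
#    325    155

# Round 2: broadcast the query together with the global figure (or its inverse)
global_idf <- 1 / global_df
</code></pre>
<br />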
Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com3tag:blogger.com,1999:blog-7994087232040033267.post-6466834849512449602013-02-15T22:59:00.000-08:002013-02-16T17:52:01.045-08:00Text Processing (part 1) : Entity RecognitionEntity recognition is commonly used to parse unstructured text documents and extract useful entity information (like locations, persons, brands) to construct a more useful structured representation. It is one of the most common text processing tasks for understanding a text document.<br />
<br />
I am planning to write a blog series on text processing. In this first post, I will introduce some basic algorithms for entity recognition. <br />
<br />
Given the sentence: Ricky is living in CA USA.<br />
Output: Ricky/SP is/NA living/NA in/NA CA/SL USA/CL<br />
<br />
Basically we want to tag each word with an entity type, whose definition is domain specific. In this example, we define the following tags<br />
<ul>
<li>NA - Not Applicable</li>
<li>SP - Start of a person name</li>
<li>CP - Continuation of a person name</li>
<li>SL - Start of a location name</li>
<li>CL - Continuation of a location name</li>
</ul>
<br />
<h3>
Hidden Markov Model</h3>
Let's start with a sequence of states; the diagram below shows some probabilistic graphical models over such a sequence, where each state is directly observed.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwwG22YaazzVhq1WQRXJfNYZKWSrh2UqqcmlQ2HDL3srArT_y_uP-CNCpeQmhIrPi1mDycyzVEcS6aEwjhN9xP6KRwCqYNPCFXyQRKGiwkIEbw83ATjHggdZAPbaKJvVMftyMbr5iop2VG/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwwG22YaazzVhq1WQRXJfNYZKWSrh2UqqcmlQ2HDL3srArT_y_uP-CNCpeQmhIrPi1mDycyzVEcS6aEwjhN9xP6KRwCqYNPCFXyQRKGiwkIEbw83ATjHggdZAPbaKJvVMftyMbr5iop2VG/s400/p1.png" width="400" /></a></div>
<br />
<br />
However, in our tagging example, we don't directly observe the tags. Instead, we only observe the words. In this case, we can use a hidden Markov model (i.e. HMM).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKOf0ZOYMevROIpvideQJoJdb7XkSkidQKOd87RMAzgsPPW5nd5B8HjbautyPA_nl432H8jYxr31ZZfqFYpNDXdjsPURL0oC1_bheq_0IoKp3BFF-MfcyTTvazSWhBRmFAppAuOGvddeOf/s1600/p2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKOf0ZOYMevROIpvideQJoJdb7XkSkidQKOd87RMAzgsPPW5nd5B8HjbautyPA_nl432H8jYxr31ZZfqFYpNDXdjsPURL0oC1_bheq_0IoKp3BFF-MfcyTTvazSWhBRmFAppAuOGvddeOf/s400/p2.png" width="400" /></a></div>
<br />
<br />
Now the tagging problem can be structured as follows.<br />
<br />
Given a sequence of words, we want to predict the most likely tag sequence.<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Find a tag sequence t1, t2, ..., tn that maximizes the probability P(t1, t2, ..., tn | w1, w2, ..., wn)<br />
<br />
Using Bayes rules,<br />
P(t1, t2, .... | w1, w2 ...) = P(t1, t2, ... tn, w1, w2, ... wn) / P(w1, w2, ... wn)<br />
<br />
Since the word sequence w1, w2, ..., wn is observed and is the same for every candidate tag sequence, this is equivalent to maximizing the joint probability P(t1, t2, ..., tn, w1, w2, ..., wn), which (under the HMM's independence assumptions) equals P(t1|S)*P(t2|t1)*…*P(E|tn) * P(w1|t1)*P(w2|t2)*…<br />
<br />
Now, P(t1|S), P(t2|t1), ... can be estimated by counting occurrences in the training data.<br />
P(t2|t1) = count(t1, t2) / count(t1, *)<br />
<br />
Similarly, P(w1|t1) = count(w1, t1) / count(t1, *)<br />
<br />
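As a small illustration, the R sketch below collects these counts from labeled sentences. The data frame layout and the "S"/"E" boundary tags are my assumptions, and no smoothing is applied.<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># Each training sentence is a data.frame with columns `word` and `tag`.
estimate_hmm_counts <- function(sentences) {
  trans_pairs <- character(0)
  emit_pairs  <- character(0)
  for (s in sentences) {
    tags <- c("S", s$tag, "E")          # add artificial start / end tags
    trans_pairs <- c(trans_pairs,
                     paste(head(tags, -1), tail(tags, -1), sep = "->"))
    emit_pairs  <- c(emit_pairs, paste(s$tag, s$word, sep = "->"))
  }
  list(transition = table(trans_pairs), emission = table(emit_pairs))
}

sent <- data.frame(word = c("Ricky", "is", "living", "in", "CA", "USA"),
                   tag  = c("SP",    "NA", "NA",     "NA", "SL", "CL"),
                   stringsAsFactors = FALSE)
counts <- estimate_hmm_counts(list(sent))
# P(t2 | t1)    ~ counts$transition["t1->t2"] / (total transitions out of t1)
# P(word | tag) ~ counts$emission["tag->word"] / (total count of tag)
</code></pre>
<br />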
<h3>
Viterbi Algorithm</h3>
Now the problem is to find a tag sequence t1, ..., tn that maximizes<br />
P(t1|S)*P(t2|t1)…P(E|tn)*P(w1|t1)*P(w2|t2)…<br />
<br />
A naive method is to enumerate all possible tag sequences and evaluate the above probability for each. The complexity is O(|T|^n), where |T| is the number of possible tag values. Notice that this is exponential in the length of the sentence.<br />
<br />
However, the more efficient Viterbi algorithm leverages the Markov property together with dynamic programming.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxqBmiyAH3GkrLJ1YrH6WqyzKaEWtowwAkL0g7QTJYCBFj2PJL6n5JJ_ucxEW3O6alOHObp2zgjUbnxwsL6PvMBmGyS2_Ra9uPQGv_kBprUhrjPMRxMUxZ5Ae0w0_BnUIjTMahgMOWtWmG/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="368" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxqBmiyAH3GkrLJ1YrH6WqyzKaEWtowwAkL0g7QTJYCBFj2PJL6n5JJ_ucxEW3O6alOHObp2zgjUbnxwsL6PvMBmGyS2_Ra9uPQGv_kBprUhrjPMRxMUxZ5Ae0w0_BnUIjTMahgMOWtWmG/s400/p1.png" width="400" /></a></div>
<br />
<br />
The key element is M(k, L), the maximum probability of any length-k sequence that ends with tk = L. M(k, L) is computed by looking back at every choice S for the tag at position k-1 and picking the one that maximizes M(k-1, S) * P(tk=L | tk-1=S) * P(wk | tk=L). The complexity of this algorithm is O(n*|T|^2).<br />
<br />
To recover the actual tag sequence, we also maintain a back pointer from every cell to the choice of S that led to it. We can then trace the path back from the maximizing cell M(n, STOP), where STOP marks the end of the sentence. <br />
<br />
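Here is a compact R sketch of the algorithm, computed in log space. The interface is my assumption: <code>trans</code> is a matrix of transition probabilities whose rows and columns include the start tag "S" and end tag "E", and <code>emit(word, tag)</code> returns P(word | tag).<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>viterbi <- function(words, tags, trans, emit) {
  n  <- length(words)
  M  <- matrix(-Inf,          nrow = n, ncol = length(tags), dimnames = list(NULL, tags))
  bp <- matrix(NA_character_, nrow = n, ncol = length(tags), dimnames = list(NULL, tags))
  # Base case: length-1 sequences starting from the start state S
  M[1, ] <- log(trans["S", tags]) + log(sapply(tags, function(L) emit(words[1], L)))
  for (k in seq_len(n)[-1]) {
    for (L in tags) {
      # Look back at every previous tag S and keep the best extension
      cand <- M[k - 1, ] + log(trans[tags, L]) + log(emit(words[k], L))
      M[k, L]  <- max(cand)
      bp[k, L] <- tags[which.max(cand)]
    }
  }
  # Terminate with the end state E and follow the back pointers
  path <- character(n)
  path[n] <- tags[which.max(M[n, ] + log(trans[tags, "E"]))]
  for (k in rev(seq_len(n)[-1])) path[k - 1] <- bp[k, path[k]]
  path
}
</code></pre>
<br />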
Notice that for rare words not observed in the training data, P(wk|tk=L) will be zero, which drives the probability of the whole sequence to zero. Such words are often numbers or dates. One way to address this is to group rare words into patterns (e.g. 3-digit number, a year like 2012, etc.) and compute P(group(wk) | tk=L) instead. However, such grouping is domain specific and has to be hand-tuned.<br />
<br />
<h3>
Reference</h3>
<a href="https://class.coursera.org/nlangp-001/class/index" target="_blank">NLP course from Michael Collins of Columbia Unversity</a> Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com1tag:blogger.com,1999:blog-7994087232040033267.post-79677165415870048282013-02-12T07:57:00.000-08:002013-02-12T07:57:21.987-08:00Basic Planning AlgorithmYou can think of planning as a graph search problem where each node
in the graph represents a possible "state" of the reality. A directed
edge from nodeA to nodeB representing an "action" is available to
transition stateA to stateB.<br /><br />Planning can be thought of as another form of a <a href="http://horicky.blogspot.com/2013/01/optimization-in-r.html" target="_blank">constraint optimization problem which is quite different from the one I described in my last blog</a>.
In this case, the constraint is the goal state we want to achieve,
where a sequence of actions needs to be found to meet the constraint.
The sequence of actions will incur cost and our objective is to minimize
the cost associated with our chosen actions<br /><br />
<h3>
Basic Concepts </h3>
A "domain" defined the structure of the problem.<br /><ul>
<li>A set of object types. e.g. ObjectTypeX, ObjectTypeY ... etc.</li>
<li>A set of relation types e.g. [ObjectTypeX RelationTypeA ObjectTypeY] or [ObjectTypeX RelationTypeA ValueTypeY]</li>
</ul>
A "state" is composed of a set of relation instances, It can either be a "reality state" or a "required state".<br /><br />A
reality state contains tuples of +ve atoms. e.g. [(personX in
locationA), (personX is male)]. Notice that -ve atoms will not exist in
reality state. e.g. If personX is NOT in locationB, such tuple will
just not show up in the state.<br /><br />A required state contains both +ve
and -ve atoms. e.g. [(personX in locationA), NOT(personX is male)]
The required state is used to check against the reality state. The
required state is reached if all of the following is true.<br /><ul>
<li>All +ve atoms in the required state is contained in the +ve atoms of the reality state</li>
<li>None of the -ve atoms in the required state is contained in the +ve atoms of the reality state</li>
</ul>
Notice that there can be a huge (or even infinite) number of nodes and edges in the graph if we were to expand it fully (with all possible states and all possible actions). Normally we express only a subset of nodes and edges analytically. Instead of enumerating all possible states, we describe a state as the set of relations we care about; in particular, we describe the initial state of the environment with everything we have observed, and the goal state as what we want to reach. Similarly, we do not enumerate every possible edge; instead we describe actions with variables, so that one action description covers transitions between many concrete states.<br /><br /><br />An "action" causes a transition from one state to another. It is defined as action(variable1, variable2, ...) and contains the following components.<br /><ul>
<li>Pre-conditions: a required state containing a set of tuples (expressed with variables). The action is feasible if the current reality state contains all the +ve atoms and none of the -ve atoms specified in the pre-conditions.</li>
<li>Effects: a set of +ve atoms and -ve atoms (also expressed with variables). After the action is taken, it removes all the -ve atoms from the current reality state and then inserts all the +ve atoms into the current reality state.</li>
<li>The cost of executing this action.</li>
</ul>
Notice that since actions contain variables but the reality state does not, before an action can be executed we need to bind the variables in its pre-conditions to specific values such that they match the current reality state. This binding propagates to the variables in the action's effects, and the corresponding atoms are then inserted into / removed from the reality state.<br /><br /><h3>
Planning Algorithm</h3>
This can be thought of as a search problem. Given an initial state and a goal state, our objective is to search for a sequence of actions such that
the goal state is reached.<br /><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha46fRW72bcQkMtCnRYVY-XmZX2278YTkEgS_7KNtjXgYH1U_mbHqm-L9iK431tbnsjaxn4uMXioYd-iV3l5nO8QC2DCpHHKHjfAtryBzZopkL0V2FX6mRfftAn61DWUYdXHV87uyiel_i/s1600/p1.png"><img border="0" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha46fRW72bcQkMtCnRYVY-XmZX2278YTkEgS_7KNtjXgYH1U_mbHqm-L9iK431tbnsjaxn4uMXioYd-iV3l5nO8QC2DCpHHKHjfAtryBzZopkL0V2FX6mRfftAn61DWUYdXHV87uyiel_i/s400/p1.png" width="400" /></a><br />We
can perform the search from the initial state, expanding all the states reachable by taking some action, and checking during this process whether the goal state has been reached. If so, we terminate and return the path.<br /><br />Forward planning builds the plan from the initial state. It works as follows (a minimal R sketch appears after the list) ...<br /><ol>
<li>Put the initial state into the exploration queue, with an empty path.</li>
<li>Pick a state (together with its path from the initial state) from the exploration queue as the current state, according to some heuristic.</li>
<li>If the current state is the goal state, return its path (the sequence of actions) and we are done. Otherwise move on.</li>
<li>For the current state, determine which actions are feasible by checking whether the current state meets their pre-conditions (i.e. contains all the +ve atoms and none of the -ve atoms specified in the pre-conditions).</li>
<li>If an action is feasible, compute the next reachable state and its path (by appending this action to the current path), and insert the next state into the exploration queue.</li>
<li>Repeat step 5 for all feasible actions of the current state, then continue from step 2.</li>
</ol>
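Below is a minimal R sketch of this forward search over grounded actions (i.e. with variables already bound); the state and action representations are my own simplification. The queue here is FIFO, so this is effectively the breadth-first variant.<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># A state is a character vector of +ve atoms; an action is a list with
# pre_pos, pre_neg, add, del fields; the goal has pos and neg atom sets.
satisfies <- function(state, pos, neg) {
  all(pos %in% state) && !any(neg %in% state)
}

forward_plan <- function(init, goal, actions) {
  queue <- list(list(state = init, path = character(0)))
  seen  <- character(0)
  while (length(queue) > 0) {
    node  <- queue[[1]]
    queue <- queue[-1]                                     # pick the oldest element (FIFO)
    if (satisfies(node$state, goal$pos, goal$neg)) return(node$path)
    key <- paste(sort(node$state), collapse = "|")
    if (key %in% seen) next                                # skip states already explored
    seen <- c(seen, key)
    for (name in names(actions)) {
      a <- actions[[name]]
      if (satisfies(node$state, a$pre_pos, a$pre_neg)) {   # action is feasible
        nxt <- union(setdiff(node$state, a$del), a$add)    # apply the effects
        queue[[length(queue) + 1]] <- list(state = nxt,
                                           path  = c(node$path, name))
      }
    }
  }
  NULL   # no plan found
}
</code></pre>
Replacing the FIFO pick in step 2 with a priority based on g(state) + h(state) gives the A* flavour discussed in the heuristic section below.<br />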
<br />Alternatively, we can perform the search from the goal state. We look at what needs to be accomplished and identify which actions could accomplish it (i.e. the effects of the action meet the goal state). Then we check whether those actions are feasible (i.e. the initial state meets the action's pre-conditions). If so we can execute the action; otherwise we take the action's pre-conditions as a sub-goal and expand our overall goal state.<br /><br />Backward planning builds the plan from the goal state. It works as follows ...<br /><ol>
<li>Put the goal state into the exploration queue, with an empty path.</li>
<li>Pick a regression state (a state that can reach the goal state, i.e. a sub-goal) from the exploration queue according to some heuristic.</li>
<li>If the regression state is contained in the initial state, we are done and return the path as the plan. Otherwise move on.</li>
<li>From this regression state, identify all "relevant actions": actions with at least one +ve effect contained in the regression state, whose +ve effects do not overlap the -ve atoms of the regression state, and whose -ve effects do not overlap the +ve atoms of the regression state.</li>
<li>If an action is relevant, compute the next regression state by removing the action's effects from the current regression state and adding the action's pre-conditions to it, then insert the next regression state into the exploration queue.</li>
<li>Repeat step 5 for all relevant actions of the current regression state.</li>
</ol>
<br /><h3>
Heuristic Function</h3>
In the above algorithms, we can employ many strategies to pick the next candidate from the exploration queue.<br /><ul>
<li>If we pick the oldest element in the queue, this is a breadth-first search</li>
<li>If we pick the youngest element in the queue, this is a depth-first search</li>
<li>We can pick the best element in the queue based on some value function.</li>
</ul>
Notice that what is "best" is quite subjective and also domain specific. A very popular approach is A* search, whose value function = g(thisState) + h(thisState).<br /><br />Notice that g(thisState) is the accumulated cost of moving from the initial state to "thisState", while h(thisState) is a domain-specific function that estimates the cost from "thisState" to the goal state. It can be proved that for A* search to return an optimal solution (i.e. the least-cost path), the chosen h(state) must never over-estimate (i.e. it must under-estimate) the actual cost of moving from "thisState" to the goal
state.<br /><br /><a href="http://horicky.blogspot.com/2008/01/search.html" target="_blank">Here is some detail of A* search.</a>Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com0tag:blogger.com,1999:blog-7994087232040033267.post-88539166778078121152013-01-14T00:10:00.000-08:002013-01-14T00:10:31.165-08:00Optimization in ROptimization is a very common problem in data analytics. Given a set of variables (over which one has control), the question is how to pick the right values so that the benefit is maximized. More formally, optimization is about determining a set of variables x1, x2, … that maximize or minimize an objective function f(x1, x2, …).<br />
<br />
<h3>
Unconstrained optimization</h3>
In the unconstrained case, the variables can be freely selected within their full ranges.<br />
<br />
A typical solution is to compute the gradient vector of the objective function [∂f/∂x1, ∂f/∂x2, …] and set it to [0, 0, …]. Solving this equation for x1, x2, … gives a local optimum.<br />
<br />
In R, this can be done by a numerical analysis method.<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>> f <- function(x){(x[1] - 5)^2 + (x[2] - 6)^2}
> initial_x <- c(10, 11)
> x_optimal <- optim(initial_x, f, method="CG")
> x_min <- x_optimal$par
> x_min
[1] 5 6
</code></pre>
<br />
<h3>
Equality constraint optimization</h3>
Moving on to the constrained case, let's say x1, x2, … are not independent but have to relate to each other in some particular way: g1(x1, x2, …) = 0, g2(x1, x2, …) = 0.<br />
<br />
The optimization problem can be expressed as …<br />
Maximize objective function: <i><b>f(x1, x2, …)</b></i><br />
Subjected to equality constraints:<br />
<ul>
<li><i><b>g1(x1, x2, …) = 0</b></i></li>
<li><i><b>g2(x1, x2, …) = 0</b></i></li>
</ul>
A typical solution is to turn the constraint optimization problem into an unconstrained optimization problem using Lagrange multipliers.<br />
<br />
Define a new function F as follows ...<br />
F(x1, x2, …, λ1, λ2, …) = f(x1, x2, …) + λ1.g1(x1, x2, …) + λ2.g2(x1, x2, …) + …<br />
<br />
Then solve for ...<br />
[∂F/∂x1, ∂F/∂x2, …, ∂F/∂λ1, ∂F/∂λ2, …] = [0, 0, ….]<br />
<br />
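As a small numeric illustration (not from the original post), suppose we maximize f(x1, x2) = x1*x2 subject to g(x1, x2) = x1 + x2 - 10 = 0. We can build F and solve grad(F) = 0 numerically by minimizing the squared norm of the gradient with optim():<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code># F(x1, x2, lambda) = x1*x2 + lambda*(x1 + x2 - 10)
grad_F <- function(p) {
  x1 <- p[1]; x2 <- p[2]; lambda <- p[3]
  c(x2 + lambda,        # dF/dx1
    x1 + lambda,        # dF/dx2
    x1 + x2 - 10)       # dF/dlambda (the constraint itself)
}
obj <- function(p) sum(grad_F(p)^2)   # zero exactly where grad(F) = 0
sol <- optim(c(1, 1, 0), obj, method = "BFGS")
round(sol$par, 3)
# approximately 5, 5, -5 : x1 = x2 = 5 at the constrained optimum, lambda = -5
</code></pre>
<br />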
<h3>
Inequality constraint optimization</h3>
In this case, the constraints are inequalities. We cannot use the Lagrange multiplier technique directly because it requires equality constraints, and there is no general closed-form solution for arbitrary inequality constraints.<br />
<br />
However, we can restrict the form of the constraints. In the following, we study two models where each constraint is restricted to be a linear function of the variables:<br />
w1.x1 + w2.x2 + … >= 0<br />
<br />
<h3>
Linear Programming</h3>
Linear programming is a model where both the objective function and the constraint functions are restricted to linear combinations of the variables. The linear programming problem can be defined as follows ...<br />
<br />
Maximize objective function: <i><b>f(x1, x2) = c1.x1 + c2.x2</b></i><br />
<br />
Subjected to inequality constraints:<br />
<ul>
<li><i><b>a11.x1 + a12.x2 <= b1</b></i></li>
<li><i><b>a21.x1 + a22.x2 <= b2</b></i></li>
<li><i><b>a31.x1 + a32.x2 <= b3</b></i></li>
<li><i><b>x1 >= 0, x2 >=0</b></i></li>
</ul>
As an illustrative example, let's walk through a portfolio investment problem. In this example, we want to find an optimal way to allocate the proportions of assets in our investment portfolio.<br />
<ul>
<li>StockA gives 5% return on average</li>
<li>StockB gives 4% return on average</li>
<li>StockC gives 6% return on average</li>
</ul>
To set some constraints, let's say my investment in C must be less than the sum of A and B. Also, the investment in A cannot be more than twice that of B. Finally, at least 10% must be invested in each stock.<br />
<br />
To formulate this problem:<br />
<br />
Variable: x1 = % investment in A, x2 = % in B, x3 = % in C<br />
<br />
Maximize expected return: <i><b>f(x1, x2, x3) = x1*5% + x2*4% + x3*6%</b></i><br />
<br />
Subjected to constraints:<br />
<ul>
<li><i><b>10% < x1, x2, x3 < 100%</b></i></li>
<li><i><b>x1 + x2 + x3 = 1</b></i></li>
<li><i><b>x3 < x1 + x2</b></i></li>
<li><i><b>x1 < 2 * x2</b></i></li>
</ul>
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>> library(lpSolve)
> library(lpSolveAPI)
> # Set the number of vars
> model <- make.lp(0, 3)
> # Define the objective function (negate the returns, since the model minimizes by default)
> set.objfn(model, c(-0.05, -0.04, -0.06))
> # Add the constraints
> add.constraint(model, c(1, 1, 1), "=", 1)
> add.constraint(model, c(1, 1, -1), ">", 0)
> add.constraint(model, c(1, -2, 0), "<", 0)
> # Set the upper and lower bounds
> set.bounds(model, lower=c(0.1, 0.1, 0.1), upper=c(1, 1, 1))
> # Compute the optimized model
> solve(model)
[1] 0
> # Get the value of the optimized parameters
> get.variables(model)
[1] 0.3333333 0.1666667 0.5000000
> # Get the value of the objective function
> get.objective(model)
[1] -0.05333333
> # Get the value of the constraint
> get.constraints(model)
[1] 1 0 0
</code></pre>
<br />
<h3>
Quadratic Programming</h3>
Quadratic programming is a model where the objective function is quadratic (it can contain products of up to two variables) while the constraint functions are restricted to linear combinations of the variables.<br />
<br />
The Quadratic Programming problem can be defined as follows ...<br />
<br />
Minimize quadratic objective function:<br />
<i><b>f(x1, x2) = c1.x1<sup>2</sup> + c2.x1.x2 + c3.x2<sup>2</sup> - (d1.x1 + d2.x2) </b></i><br />
Subject to constraints<br />
<ul>
<li><i><b>a11.x1 + a12.x2 == b1</b></i></li>
<li><i><b>a21.x1 + a22.x2 == b2</b></i></li>
<li><i><b>a31.x1 + a32.x2 >= b3</b></i></li>
<li><i><b>a41.x1 + a42.x2 >= b4</b></i></li>
<li><i><b>a51.x1 + a52.x2 >= b5</b></i></li>
</ul>
Express the problem in Matrix form:<br />
<br />
Minimize objective: <i><b>½ x<sup>T</sup>Dx - d<sup>T</sup>x</b></i><br />
Subject to constraint: <b><i>A<sup>T</sup>x >= b</i></b><br />
The first k columns of A are the equality constraints<br />
<br />
As an illustrative example, lets continue on the portfolio investment
problem. In the example, we want to find an optimal way to allocate the
proportion of asset in our investment portfolio.<br />
<ul>
<li>StockA gives 5% return on average</li>
<li>StockB gives 4% return on average</li>
<li>StockC gives 6% return on average</li>
</ul>
We also look into the variance of each stock (known as risk) as well as the covariance between stocks.<br />
<br />
For investment, we not only want to have a high expected return, but also a low variance. In other words, stocks with high return but also high variance is not very attractive. Therefore, maximize the expected return and minimize the variance is the typical investment strategy.<br />
<br />
One way to minimize variance is to pick multiple stocks (in a portfolio) to diversify the investment. Diversification happens when the stock components within the portfolio moves from their average in different direction (hence the variance is reduced).<br />
<br />
The covariance matrix ∑ (between each pair of stocks) is given as follows:<br />
1%, 0.2%, 0.5%<br />
0.2%, 0.8%, 0.6%<br />
0.5%, 0.6%, 1.2%<br />
<br />
In this example, we want to achieve an overall return of at least 5.2% while minimizing the variance of the return.<br />
<br />
To formulate the problem:<br />
<br />
Variable: x1 = % investment in A, x2 = % in B, x3 = % in C<br />
<br />
Minimize variance: <i><b>x<sup>T</sup>∑x</b></i><br />
<br />
Subjected to constraint:<br />
<ul>
<li><i><b>x1 + x2 + x3 == 1</b></i></li>
<li><i><b>x1*5% + x2*4% + x3*6% >= 5.2%</b></i></li>
<li><i><b>0% < x1, x2, x3 < 100%</b></i></li>
</ul>
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>> library(quadprog)
> mu_return_vector <- c(0.05, 0.04, 0.06)
> sigma <- matrix(c(0.01, 0.002, 0.005,
+ 0.002, 0.008, 0.006,
+ 0.005, 0.006, 0.012),
+ nrow=3, ncol=3)
> D.Matrix <- 2*sigma
> d.Vector <- rep(0, 3)
> A.Equality <- matrix(c(1,1,1), ncol=1)
> A.Matrix <- cbind(A.Equality, mu_return_vector,
diag(3))
> b.Vector <- c(1, 0.052, rep(0, 3))
> out <- solve.QP(Dmat=D.Matrix, dvec=d.Vector,
Amat=A.Matrix, bvec=b.Vector,
meq=1)
> out$solution
[1] 0.4 0.2 0.4
> out$value
[1] 0.00672
>
</code></pre>
<br />
<h3>
Integration with Predictive Analytics</h3>
Optimization is usually integrated with predictive analytics, which is another important part of data analytics. For an overview of <a href="https://www.kaggle.com/wiki/Tutorials" target="_blank">predictive analytics</a>, here is <a href="http://horicky.blogspot.com.au/2012/05/predictive-analytics-overview-and-data.html" target="_blank">my previous blog</a> about it.<br />
<br />
The overall processing can be depicted as follows:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgr0atak4BvkFS6muvBxJX7e4491Z3I5BRid5W9rnDD9shavoIYL1lyPbFPL4Gg8UP2PjzQOitpYYwtwzLniKIc9sdL8aVNaGC9MS0C2G_EIqcYEeIKV9HJRKO129tR3r_zyITJsLGBwnJ9/s1600/p1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgr0atak4BvkFS6muvBxJX7e4491Z3I5BRid5W9rnDD9shavoIYL1lyPbFPL4Gg8UP2PjzQOitpYYwtwzLniKIc9sdL8aVNaGC9MS0C2G_EIqcYEeIKV9HJRKO129tR3r_zyITJsLGBwnJ9/s400/p1.png" width="400" /></a></div>
<br />
In this diagram, we use machine learning to train a predictive model in batch mode. Once the predictive model is available, observation data is fed into it in real time and a set of output variables is predicted. These output variables are fed into the optimization model as environment parameters (e.g. the return of each stock, the covariances, etc.), from which a set of optimal decision variables is recommended. <br />
<br />Ricky Hohttp://www.blogger.com/profile/03793674536997651667noreply@blogger.com2