
Saturday, July 15, 2017

Regression model outputting probability density distribution

For a classification problem (let's say the output is one of the labels R, G, B), how do we make a prediction?

There are two formats in which we can report our prediction:
  1. Output the single most probable value.  e.g. output "B" if P(B) > P(R) and P(B) > P(G)
  2. Output the probability estimate of each label.  (e.g. R=0.2, G=0.3, B=0.4)
But if we look at a regression problem (let's say we output a numeric value v), most regression models only output a single value (the one that minimizes the RMSE).  In this article, we will look at some use cases where outputting a probability density function is much preferred.

Predict the event occurrence time

As an illustrative example, we want to predict when a student will finish her work given that she has already spent some time s.  In other words, we want to estimate E[t | t > s], where t is a random variable representing the total duration and s is the elapsed time so far.

Estimating the time t is generally hard if the model only outputs an expectation.  Notice that the model sees the same set of features, except that the elapsed time changes continuously as time passes.

Let's look at how we can train a prediction model that outputs a density distribution.

Let's say our raw data has the schema [feature, duration]:
  • f1, 13.30
  • f2, 14.15
  • f3, 15.35
  • f4, 15.42
Take a look at the range (ie. min and max) of the output value.  We transform the raw data into training data with the following schema:
[feature, dur<13, dur<14, dur<15, dur<16]
  • f1, 0, 1, 1, 1
  • f2, 0, 0, 1, 1
  • f3, 0, 0, 0, 1
  • f4, 0, 0, 0, 1
After that, we train 4 classification models:
  • feature, dur<13
  • feature, dur<14
  • feature, dur<15
  • feature, dur<16


Now, given a new observation with its corresponding features, we can invoke these 4 models to output the probability of each binary classification, which together form a cumulative probability.  If we want the probability density, we simply take the differences between adjacent cumulative probabilities (ie: a discrete differentiation of the cumulative probability).

At this point, we can output a probability distribution given the input features.
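The threshold transformation and the differencing step can be sketched as follows.  This is a minimal Python sketch using the example values above; the cumulative probabilities for the new observation are made-up stand-ins for real classifier outputs:

```python
# Transform raw [feature, duration] rows into per-threshold binary labels,
# then recover a density from cumulative probabilities by differencing.
raw = [("f1", 13.30), ("f2", 14.15), ("f3", 15.35), ("f4", 15.42)]
thresholds = [13, 14, 15, 16]

def to_binary_labels(duration, thresholds):
    # label k is 1 when duration < threshold k (the cumulative event "dur < t")
    return [1 if duration < t else 0 for t in thresholds]

training = [(f, to_binary_labels(d, thresholds)) for f, d in raw]
# training[0] == ("f1", [0, 1, 1, 1])

# Suppose the 4 trained classifiers return these cumulative probabilities
# P(dur < t | feature) for a new observation (made-up numbers):
cdf = [0.1, 0.4, 0.8, 1.0]

def density_from_cdf(cdf):
    # probability mass in each bucket is the difference of adjacent CDF values
    prev, out = 0.0, []
    for p in cdf:
        out.append(p - prev)
        prev = p
    return out

density = density_from_cdf(cdf)  # ≈ [0.1, 0.3, 0.4, 0.2]
```

In practice each of the 4 models would be a real binary classifier trained on the corresponding label column.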



Now we can easily estimate the remaining time from the expected time in the shaded region.  As time passes, we just slide the red line continuously and recalculate the expected time; we don't need to re-execute the prediction model unless the input features have changed.

Predict cancellation before commitment 

As an illustrative example, let's say a customer of a restaurant has reserved a table at 8:00pm.  It is now 7:55pm and the customer still hasn't arrived; what is the chance of a no-show?

In other words, given a person (with feature x) who still hasn't arrived t minutes before the reserved time S, predict the probability that this person shows up at all.

Let's say our raw data has the schema [feature, arrival], where arrival is measured relative to the reserved time:
  • f1, -15.42
  • f2, -15.35
  • f3, -14.15
  • f4, -13.30
  • f5, infinity
  • f6, infinity
We transform it into training data with the following schema:
[feature, arr<-16, arr<-15, arr<-14, arr<-13]
  • f1, 0, 1, 1, 1
  • f2, 0, 1, 1, 1
  • f3, 0, 0, 1, 1
  • f4, 0, 0, 0, 1
  • f5, 0, 0, 0, 0
  • f6, 0, 0, 0, 0
After that, we train 4 classification models.
  • feature, arr<-16
  • feature, arr<-15
  • feature, arr<-14
  • feature, arr<-13
Notice that P(arr<0) can be smaller than 1 because the customer can be a no-show.
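The no-show probability conditioned on the elapsed wait follows directly from these cumulative probabilities: among customers who haven't arrived yet, the no-shows are exactly those who will never arrive.  A minimal sketch, where the two probabilities are made-up stand-ins for the classifiers' outputs:

```python
# P(no-show | not arrived by time s)
#   = (1 - P(ever arrives)) / (1 - P(arrived by s))
def p_no_show_given_not_arrived(p_arrive_total, p_arrived_by_now):
    return (1.0 - p_arrive_total) / (1.0 - p_arrived_by_now)

# e.g. 85% of such customers eventually arrive; 70% have arrived by now
p = p_no_show_given_not_arrived(0.85, 0.70)  # 0.15 / 0.30 = 0.5
```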





In this post, we discussed some use cases where we need the regression model to output not just a value prediction but also the probability density distribution, and we illustrated how to build such a model.

Wednesday, November 5, 2014

Common data science project flow

Working across multiple data science projects, I have observed a similar pattern across a group of strategic data science projects where a common methodology can be used.  In this post, I want to sketch this methodology at a high level.

First of all, "data science" itself is a very generic term that means different things to different people.  Many of the projects I have been involved in target a very tactical and specific problem.  However, over the last few years more and more enterprises have started to realize the strategic value of data.  I have observed a growing number of strategic data science projects that start from a very broad scope and take a top-down approach to look at the overall business operation.  Along the way, the enterprise prioritizes the most critical areas within its operation cycle and builds sophisticated models to guide and automate the decision process.

Usually, my engagement starts as a data scientist / consultant with very little (or even no) domain knowledge.  Being unfamiliar with the domain is nothing to be proud of and often slows down my initial discussions.  Therefore, within a compressed time period I need to quickly learn enough basic domain knowledge to keep the discussion flowing.  On the other hand, lacking a preconceived model enables me (or you can say forces me) to look with a fresh eye, from which I can trim off unnecessary details from the legacies and focus only on those essential elements that contribute to the core of the data model.  It is also fun to go through the concept-blending process between a data scientist and a domain expert.  I force them to think in my way and they force me to think in their way.  This is by far the most effective way for me to learn any new concept.

Recently I had a discussion with a company that has a small but very sophisticated data science team building pricing models and demand forecasts for their product line.  I am by no means an expert in their domain, but their problem (how to predict demand, and how to set price) is general enough across many industries.  Therefore, I will use this problem as an example to illustrate the major steps in the common pattern described above.

Problem Settings

Let's say a car manufacturer starts its quarterly planning process.  Here are some key decisions that need to be made by the management.
  • How many cars should the company produce next year?
  • What should the price of the cars be?
First of all, we need to identify the ultimate "goal" of these decisions.  Such a goal is usually easy to find as it is usually stated in the company's mission statement.

In this problem, the goal is to ...
maximize: "Profit_2015"

In general, I find it is a good start to look at the problem from an "optimization" angle, from which we define our goal in terms of an objective function as well as a set of constraints.

Step 1:  Identify variables and define its dependency graph

Build the dependency graph between different variables, starting from the objective function.  Separate the decision variables (which you control) from the environment variables (which you don't).

As an illustrative example, we start from our objective function "Profit_2015" and define the dependency relationships below. The decision variables are highlighted in blue.

Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
Demand_2015 = H(GDP_2014, PhoneSold_2014)
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)
...

Identifying these variables and their potential dependencies typically comes from well-studied theory or from domain experts in the industry.  At this stage, we don't need to know the exact formulas of the functions F/G/H.  We only need to capture the links between the variables.  It is also ok to include a link that shouldn't exist (ie: there is no relationship between the 2 variables in reality).  However, it is not good to miss a link (ie: fail to capture a strong, existing dependency).

This round usually involves 4 to 5 half-day brainstorming sessions with the domain experts, facilitated by the data scientist/consultant who is familiar with the model building process.  There may be additional investigation and background study if subject matter experts aren't available.  Starting from scratch, this round can take anywhere from a couple of weeks to a couple of months.

Step 2:  Define the dependency function

In this round, we want to identify the relationships between variables by specifying the formulas of F(), G(), H().


Well-Known Function
For some relationship that is well-studied, we can use a known mathematical model.

For example, in the relationship
Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)

We can use the following mathematical formula in a very straightforward manner:
Profit = (UnitSold * Price) - (UnitProduced * Cost)


Semi-Known Function
However, some relationships are not as straightforward.  For a relationship where we don't know the exact formula but can make a reasonable assumption about its shape, we can assume it follows a family of models (e.g. Linear, Quadratic ... etc.), and then figure out the optimal parameters that best fit the historical data.

For example, in the relationship
Demand_2015 = H(GDP_2014, PhoneSold_2014)

Let's assume the "demand" is a linear combination of "GDP" and "PhoneSold", which seems to be a reasonable assumption.

For the linear model we assume
Demand = w0 + (w1 * GDP) + (w2 * PhoneSold)

Then we feed the historical training data into a linear regression model and figure out what the best-fitting values of w0, w1, w2 should be.
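As a sketch of this fitting step, here is a minimal Python example that recovers w0, w1, w2 by solving a linear system directly; with exactly as many noise-free points as parameters the least-squares fit reduces to a 3x3 solve (all numbers are made up for illustration):

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting on an n x n system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# historical rows (GDP, PhoneSold, Demand), generated from w = (5, 2, 0.5)
rows = [(1.0, 2.0, 8.0), (2.0, 2.0, 10.0), (2.0, 4.0, 11.0)]
A = [[1.0, g, p] for g, p, _ in rows]   # column of 1s carries the bias w0
b = [d for _, _, d in rows]
w0, w1, w2 = solve(A, b)                # recovers 5.0, 2.0, 0.5
```

With more data points than parameters (the usual case), an ordinary least-squares routine would replace the exact solve.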


Time-Series Function
In some cases, a variable depends only on its own past values and not on other variables; here we can train a time series model to predict the variable from its own history.  Typically, the series is decomposed into 3 components: noise, trend and seasonality.  One popular approach is to use exponential smoothing techniques such as the Holt-Winters model.  Another popular approach is the ARIMA model, which decomposes the value into "auto-regression" and "moving-average" components.

For example, in the relationship
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

We can use a time series model to learn the relationship between the historical data and its future value.
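As a sketch of the exponential smoothing family mentioned above, here is a minimal Python implementation of Holt's linear-trend method (the smoothing constants and the GDP history are illustrative assumptions):

```python
def holt_forecast(series, alpha=0.5, beta=0.5):
    # level = smoothed value, trend = smoothed slope
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend  # one-step-ahead forecast

gdp = [100.0, 102.0, 104.0, 106.0, 108.0]  # made-up GDP history
forecast = holt_forecast(gdp)              # continues the linear trend: 110.0
```

A full Holt-Winters model would add a third seasonality component; ARIMA would model the series through auto-regression and moving-average terms instead.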


Completely Unknown Function
But if we cannot even assume a model family, we can consider using a "k nearest neighbor" approach to interpolate the output from the input.  We need to define a "distance function" between data points based on domain knowledge and also figure out what the optimal value of k should be.  In many cases, a weighted average of the k nearest neighbors is a good interpolation.

For example, in the relationship
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
It is unclear what model best represents UnitSold as a function of Supply, Demand, Price and CompetitorPrice, so we go with a nearest neighbor approach.

Based on monthly sales of the past 3 years, we can use Euclidean distance (after scaling the data to a comparable range by subtracting the mean and dividing by the standard deviation) to find the closest 5 neighbors, and then use their weighted average to predict the unit sold.
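A minimal Python sketch of the distance-weighted nearest-neighbor interpolation (the data, the value of k, and the inverse-distance weighting are illustrative assumptions; standardization is omitted for brevity):

```python
import math

def knn_predict(X, y, q, k=5):
    # rank training points by Euclidean distance to the query q
    neighbors = sorted((math.dist(x, q), yi) for x, yi in zip(X, y))[:k]
    # inverse-distance weights (small epsilon guards against zero distance)
    weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
    return sum(w * yi for w, (_, yi) in zip(weights, neighbors)) / sum(weights)

# made-up rows: (Supply, Demand, Price, CompetitorPrice) -> UnitSold
X = [(10.0, 8.0, 5.0, 5.5), (12.0, 8.0, 5.0, 5.5), (30.0, 20.0, 9.0, 8.0)]
y = [100.0, 120.0, 400.0]
pred = knn_predict(X, y, (11.0, 8.0, 5.0, 5.5), k=2)  # between 100 and 120
```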
 

Step 3: Optimization

At this point, we have the following defined
  • A goal defined by maximizing (or minimizing) an objective function
  • A set of variables (including the decision and environment variables)
  • A set of functions that define how these variables are inter-related to each other.  Some are defined by a mathematical formula and some are defined as a black box (based on a predictive model)
Our goal is to figure out how the decision variables (which we control) should be set such that the objective function is optimized (maximized or minimized).

Determine the value of environment variables
For those environment variables that have no dependencies on other variables, we can acquire their values from external data sources.  For those that depend on other environment variables (but not decision variables), we can estimate their values using the corresponding dependency function (of course, we need to estimate all the variables they depend on first).  For those that depend (directly or indirectly) on decision variables, we leave them undefined.

Determine the best value of decision variables
Once we have formulated the dependency functions, we can employ different optimization methods depending on the form of these functions.  Here is how I choose the appropriate method based on the formulation of the dependency functions.
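When some dependency functions are black-box predictive models, one simple (if crude) option is a direct search over the decision variable.  A minimal Python sketch with a made-up demand model, where Price is the decision variable (a real project would use an LP/NLP solver when the functional forms allow):

```python
def unit_sold(price):
    # stand-in for the predictive model G(...): demand falls as price rises
    return max(0.0, 1000.0 - 30.0 * price)

def profit(price, cost=10.0):
    # simplification: produce exactly what we sell
    sold = unit_sold(price)
    return sold * price - sold * cost

# grid search over candidate prices from 10.0 to 34.9 in steps of 0.1
best_price = max((p / 10.0 for p in range(100, 350)), key=profit)
```

For this made-up linear demand curve, the grid search lands at 21.7, close to the analytic optimum of 1300/60 ≈ 21.67.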



Additional Challenges

To summarize, we have followed the process below
  1. Define an objective function, constraints, decision variables and environment variables
  2. Identify the relationship between different variables
  3. Collect or predict those environment variables
  4. Optimize those decision variables based on the objective functions
  5. Return the optimal value of decision variables as the answer
So far, our dependency graph is acyclic and our decisions won't affect the underlying variables.  Although this is reasonably true if the enterprise is an insignificantly small player in the market, it is no longer true if the enterprise is one of the few major players.  For example, their pricing strategy may cause competitors to change their own pricing strategies as well.  How the competitors would react is less predictable, and historical data plays a less important role here.  At some point, human judgement gets involved to fill the gaps.



Sunday, July 27, 2014

Incorporate domain knowledge into predictive model

As data scientists / consultants, in many cases we are called in to work with domain experts who have in-depth business knowledge of the industry.  The main objective is to help our clients validate and quantify the intuition behind existing domain knowledge based on empirical data, and to remove any judgement bias.  In many cases, customers also want to build a predictive model to automate their business decision-making process.

To create a predictive model, feature engineering (defining the set of inputs) is a key part, if not the most important one.  In this post, I'd like to share my experience in how to come up with the initial set of features and how to evolve it as we learn more.

Firstly, we need to acknowledge two forces in this setting
  1. Domain experts tend to be narrowly focused (and potentially biased) towards their prior experience.  Their domain knowledge can usually be encoded in terms of "business rules" and tends to be simple and obvious (if it is too complex and hidden, the human brain is not good at picking it up).
  2. Data scientists tend to be less biased and good at mining through a large set of signals to determine how relevant they are in an objective and quantitative manner.  Unfortunately, raw data rarely gives strong signals.  And lacking the domain expertise, data scientists alone will not even be able to come up with a good set of features (which usually requires derivation from combinations of raw data).  Notice that trying out all combinations is impractical because there are an infinite number of ways to combine raw data.  Also, when you have too many input features, the training data will not be enough, resulting in a model with high variance.
Maintaining a balance between these forces is a critical success factor of many data science projects.

The best project setting (in my opinion) is to let the data scientist take control of the whole exercise (as less bias is an advantage) while being guided by input from domain experts.

Indicator Feature

This is a binary variable based on a very specific boolean condition (ie: true or false) that the domain expert believes to be highly indicative of the output.  For example, for predicting a stock, one indicator feature is whether the stock has dropped more than 15% in a day.

Notice that indicator features can be added at any time once a new boolean condition is discovered by the domain expert.  Indicator features don't need to be independent of each other, and in fact most of the time they are highly inter-correlated.

After fitting these indicator features into the predictive model, we can see how much influence each of these features exerts on the final prediction, providing feedback to the domain experts about the strength of these signals.
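As a sketch of this feedback loop, here is a minimal Python example that turns a domain rule into an indicator feature and measures its signal strength with a crude outcome-rate comparison (the rule, the data and the "lift" measure are all made up for illustration; a real project would read the feature's fitted weight out of the model):

```python
# made-up history: (daily_return_pct, next_day_up)
rows = [
    (-20.0, True), (-17.0, True), (-16.0, False),
    (1.0, False), (2.0, False), (-3.0, True), (0.5, False),
]

def big_drop(ret):
    # the domain expert's boolean condition: dropped more than 15% in a day
    return ret < -15.0

on  = [up for r, up in rows if big_drop(r)]      # outcomes when flag is on
off = [up for r, up in rows if not big_drop(r)]  # outcomes when flag is off
lift = sum(on) / len(on) - sum(off) / len(off)   # crude signal strength
```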

Derived Feature

This is a numeric variable (ie: a quantity) that the domain expert believes to be important to predicting the output.  The idea is the same as the indicator feature except that it is numeric in nature.

Expert Stacking

Here we build a predictive model whose input features are taken from each expert's prediction output.  For example, to predict a stock, our model takes 20 analysts' predictions as its input.

The strength of this approach is that it can incorporate domain expertise very easily because it treats each expert as a black box (without needing to understand their logic).  The model we train will take into account the relative accuracy of each expert's prediction and adjust its weighting accordingly.  On the other hand, one weakness is the reliance on domain expertise during prediction, which may or may not be available on an ongoing basis.
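A minimal Python sketch of the stacking idea, using an accuracy-based weighting as a stand-in for a trained stacking model (all predictions and error histories are illustrative assumptions):

```python
# expert -> (past predictions, actual outcomes), made-up history
history = {
    "a": ([10.0, 12.0], [11.0, 13.0]),   # off by 1 each time -> MSE 1
    "b": ([8.0, 14.0], [10.0, 12.0]),    # off by 2 each time -> MSE 4
}

def weight(preds, actuals):
    # weight each expert by the inverse of its historical squared error
    mse = sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds)
    return 1.0 / (mse + 1e-9)

weights = {name: weight(p, a) for name, (p, a) in history.items()}
new_preds = {"a": 100.0, "b": 110.0}     # the experts' fresh predictions
blend = sum(weights[n] * new_preds[n] for n in weights) / sum(weights.values())
```

A trained stacking model (e.g. a linear regression over the experts' outputs) would learn these weights from data instead of computing them directly.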

Monday, June 11, 2012

Predictive Analytics: Evaluate Model Performance

In previous posts, we have discussed various machine learning techniques including linear regression with regularization, SVM, neural networks, nearest neighbor, Naive Bayes, decision trees and ensemble models.  How do we pick which model is best?  Or even tell whether the model we pick is better than a random guess?  In this post, we will cover how to evaluate the performance of a model and what we can do next to improve it.

Best guess with no model 

First of all, we need to understand the goal of our evaluation.  Are we trying to pick the best model?  Are we trying to quantify the improvement of each model?  Regardless of our goal, I find it always useful to think about what the baseline should be.  Usually the baseline is your best guess if you don't have a model.

For a classification problem, one approach is to make a random guess (with uniform probability), but a better approach is to guess the output class that has the largest proportion in the training samples.  For a regression problem, the best guess is the mean of the output of the training samples.

Prepare data for evaluation

In a typical setting, the data is divided into 3 disjoint groups: 20% is set aside as testing data to evaluate the model we've trained.  The remaining 80% is divided into k partitions, of which k-1 are used as training data to train a model with a particular parameter value setting and 1 is used as cross-validation data to pick the parameter value that minimizes the cross-validation error.

As a concrete example, let's say we have 100 records available.  We set aside 20% (20 records) for testing purposes and use the remaining 80 records to train the model.  Let's say the model has some tunable parameters (e.g. k in KNN, λ in linear model regularization).  For each particular parameter value, we conduct 10 rounds of training (ie: k = 10).  Within each round, we select 90% (72 records) to train a model and compute the prediction error against the 8 unselected records.  Then we take the average error of these 10 rounds and pick the parameter value that gives the minimal average error.  After picking the optimal tuning parameter, we retrain the model using the whole 80 records.

To evaluate the predictive performance of the model, we'll test it against the 20 testing records we set aside at the beginning.  The details will be described below.
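The splitting scheme above can be sketched as follows (Python here for illustration, while the post's own examples are in R):

```python
import random

records = list(range(100))          # stand-in IDs for the 100 records
random.seed(0)
random.shuffle(records)

# 20 records held out for final testing, 80 kept for model building
test_set, rest = records[:20], records[20:]

# 10 cross-validation folds of 8 records each over the remaining 80
folds = [rest[i::10] for i in range(10)]

# each round trains on 72 records and validates on the other 8
sizes = [(len([r for r in rest if r not in v]), len(v)) for v in folds]
```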

Measuring Regression Performance

For a regression problem, the distance between the estimated output and the actual output is used to quantify the model's performance.  Three measures are commonly used: root mean square error, relative square error and coefficient of determination.  Typically, root mean square error is used to measure the absolute accuracy.


Mean Square Error: MSE = (1/N) * ∑(y_est – y_actual)²
Root Mean Square Error: RMSE = (MSE)^(1/2)

To measure the accuracy with respect to the baseline, we use the ratio of MSE


Relative Square Error: RSE = MSE / MSE_baseline
RSE = ∑(y_est – y_actual)² / ∑(y_mean – y_actual)²

The coefficient of determination (also called R square) measures the variance that is explained by the model, ie: the reduction of variance when using the model.  R square ranges from 0 to 1; the model has strong predictive power when it is close to 1 and explains nothing when it is close to 0.

R² = (MSE_baseline – MSE) / MSE_baseline
R² = 1 – RSE

Here is some R code to compute these measures

> library(car)
> Prestige_clean <- Prestige[!is.na(Prestige$type),]
> model <- lm(prestige~., data=Prestige_clean)
> score <- predict(model, newdata=Prestige_clean)
> actual <- Prestige_clean$prestige
> rmse <- (mean((score - actual)^2))^0.5
> rmse
[1] 6.780719
> mu <- mean(actual)
> rse <- mean((score - actual)^2) / mean((mu - actual)^2) 
> rse
[1] 0.1589543
> rsquare <- 1 - rse
> rsquare
[1] 0.8410457
> 

The mean square error penalizes bigger differences more because of the squaring effect.  If we want to reduce the penalty of bigger differences, we can log-transform the numeric quantity first.

Mean Square Log Error: MSLE = (1/N) * ∑(log(y_est) – log(y_actual))²
Root Mean Square Log Error: RMSLE = (MSLE)^(1/2)

Measuring Classification Performance

For a classification problem, there are several common measures.
  • TP = Predict +ve when Actual +ve
  • TN = Predict -ve when Actual -ve
  • FP = Predict +ve when Actual -ve
  • FN = Predict -ve when Actual +ve
Precision = TP / Predict +ve = TP / (TP + FP)
Recall or Sensitivity = TP / Actual +ve = TP / (TP + FN)
Specificity = TN / Actual -ve = TN / (FP + TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy alone is not sufficient to represent the quality of a prediction because the cost of making a FP may be different from the cost of making a FN.  The F measure provides a tunable weight for computing a final score and is commonly used to measure the quality of a classification model.

1/Fmeasure = α/recall + (1-α)/precision
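A quick numeric check of these formulas (Python here for illustration; the confusion-matrix counts are made up):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    # weighted harmonic mean; alpha = 0.5 reduces to the usual F1 score
    return 1.0 / (alpha / r + (1 - alpha) / p)

p = precision(8, 2)   # TP=8, FP=2 -> 0.8
r = recall(8, 8)      # TP=8, FN=8 -> 0.5
f = f_measure(p, r)   # F1 = 2pr/(p+r) = 8/13
```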

Notice that most classification models are based on estimating a numeric score for each output class.  By choosing the cutoff point of this score, we can control the tradeoff between precision and recall.  We can plot the relationship between precision and recall at various cutoff points as follows ...

> library(ROCR)
> library(e1071)
> nb_model <- naiveBayes(Species~., data=iristrain)
> nb_prediction <- predict(nb_model,
                           iristest[,-5], 
                           type='raw') 
> score <- nb_prediction[, c("virginica")]
> actual_class <- iristest$Species == 'virginica'
> # Mix some noise to the score
> # Make the score less precise for illustration
> score <- (score + runif(length(score))) / 2
> pred <- prediction(score, actual_class)
> perf <- performance(pred, "prec", "rec")
> plot(perf)
> 



Another common plot is the ROC curve, which plots the sensitivity (true positive rate) against 1 - specificity (false positive rate).  The area under the curve (AUC) is used to compare the quality of different models with varying cutoff points.  Here is how we produce the ROC curve.

> library(ROCR)
> library(e1071)
> nb_model <- naiveBayes(Species~., 
                         data=iristrain)
> nb_prediction <- predict(nb_model, 
                           iristest[,-5], 
                           type='raw') 
> score <- nb_prediction[, c("virginica")]
> actual_class <- iristest$Species == 'virginica'
> # Mix some noise to the score
> # Make the score less precise for illustration
> score <- (score + runif(length(score))) / 2
> pred <- prediction(score, actual_class)
> perf <- performance(pred, "tpr", "fpr")
> auc <- performance(pred, "auc")
> auc <- unlist(slot(auc, "y.values"))
> plot(perf)
> legend(0.6,0.3,c(c(paste('AUC is', auc)),"\n"),
         border="white",cex=1.0,
         box.col = "white")
> 



We can also assign the relative costs of making a false +ve and a false -ve decision to find the best cutoff threshold.  Here is how we plot the cost curve.

> # Plot the cost curve to find the best cutoff
> # Assign the cost for False +ve and False -ve
> perf <- performance(pred, "cost", cost.fp=4, cost.fn=1)
> plot(perf)
> 

From the curve, the best cutoff point is 0.6, where the cost is minimal.

Source of error: Bias and Variance

In model-based machine learning, we assume the underlying data follows some mathematical model, and during training we fit the training data to this assumed model and determine the model parameters that give the minimal error.

One source of error is when our assumed model is fundamentally wrong (e.g. the output has a non-linear relationship with the input but we assume a linear model).  This is known as the high bias problem, in which we use an over-simplified model to represent the underlying data.

Another source of error is when the model parameters fit too specifically to the training data and do not generalize to the underlying data pattern.  This is known as the high variance problem and usually happens when there is insufficient training data compared to the number of model parameters.

The high bias problem has the symptom that both the training and cross-validation error rates are high, and both drop as the model complexity increases.  While the training error keeps decreasing as the model complexity increases, the cross-validation error will increase beyond a certain model complexity, which indicates the beginning of a high variance problem.  When the size of the training data is fixed and the only thing we can choose is the model complexity, we should choose the complexity at the point where the cross-validation error is minimal.


Will getting more data help ?

When collecting more data costs both time and money, we need to carefully assess the situation before we spend the effort.  There is a very pragmatic technique suggested by Andrew Ng of Stanford: plot the error against the size of the data.  In this approach, we sample different sizes of training data to train different models and plot both the cross-validation error and the training error with respect to the training sample size.
If the problem is a high-bias problem, the error curve will look like the following.

In this case, collecting more data would not help.  We should spend our effort on the following instead.
  • Add more input features by talking to domain experts to identify more input signals and collect them.
  • Add more input features by combining existing input features in a non-linear way.
  • Try more complex, machine learning models. (e.g. increase the number of hidden layers in Neural Network, or increase the number of Neurons at each layer)
If the problem is a high-variance problem because we over-estimated the model complexity, then we should reduce the model complexity by throwing away attributes that have low influence on the output.  We don't have to gather more data.

The only situation where having more data helps is when the underlying data model is in fact complex, so we cannot simply reduce the complexity without immediately creating a high-bias problem.  In this case, the error curve will have the following shape.


And the only way forward is to collect more training data so that overfitting is less likely to happen.

Evaluating the performance of a model is very important in the overall cycle of predictive analytics.  Hopefully this introductory post gives a basic idea of how this can be done.

Tuesday, May 29, 2012

Predictive Analytics: Generalized Linear Regression

In the previous 2 posts, we covered how to visualize input data to explore strong signals and how to prepare input data in a form that is suitable for learning.  In this and subsequent posts, I'll go through various machine learning techniques for building our predictive model.
  1. Linear regression
  2. Logistic regression
  3. Linear and Logistic regression with regularization
  4. Neural network
  5. Support Vector Machine
  6. Naive Bayes
  7. Nearest Neighbor
  8. Decision Tree
  9. Random Forest
  10. Gradient Boosted Trees
There are two general types of problems that we are interested in for this discussion: classification is about predicting a category (a value that is discrete and finite with no ordering implied) while regression is about predicting a numeric quantity (a value that is continuous and infinite with ordering).

For the classification problem, we use the "iris" data set and predict its "species" from the "width" and "length" measures of its sepals and petals.  Here is how we set up our training and testing data.

> summary(iris)
  Sepal.Length       Sepal.Width        Petal.Length    Petal.Width      
 Min.   :4.300000   Min.   :2.000000   Min.   :1.000   Min.   :0.100000  
 1st Qu.:5.100000   1st Qu.:2.800000   1st Qu.:1.600   1st Qu.:0.300000  
 Median :5.800000   Median :3.000000   Median :4.350   Median :1.300000  
 Mean   :5.843333   Mean   :3.057333   Mean   :3.758   Mean   :1.199333  
 3rd Qu.:6.400000   3rd Qu.:3.300000   3rd Qu.:5.100   3rd Qu.:1.800000  
 Max.   :7.900000   Max.   :4.400000   Max.   :6.900   Max.   :2.500000  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> 
> # Prepare training and testing data
> testidx <- which(1:length(iris[,1])%%5 == 0)
> iristrain <- iris[-testidx,]
> iristest <- iris[testidx,]


For the regression problem, we use the "Prestige" data set (imported from the "car" R package) and predict the degree of "prestige" from a set of input variables such as "income", "education", "job types" ... etc.  Here is how we set up our training and testing data.

> library(car)
> summary(Prestige)
   education            income              women         
 Min.   : 6.38000   Min.   :  611.000   Min.   : 0.00000  
 1st Qu.: 8.44500   1st Qu.: 4106.000   1st Qu.: 3.59250  
 Median :10.54000   Median : 5930.500   Median :13.60000  
 Mean   :10.73804   Mean   : 6797.902   Mean   :28.97902  
 3rd Qu.:12.64750   3rd Qu.: 8187.250   3rd Qu.:52.20250  
 Max.   :15.97000   Max.   :25879.000   Max.   :97.51000  
    prestige            census           type   
 Min.   :14.80000   Min.   :1113.000   bc  :44  
 1st Qu.:35.22500   1st Qu.:3120.500   prof:31  
 Median :43.60000   Median :5135.000   wc  :23  
 Mean   :46.83333   Mean   :5401.775   NA's: 4  
 3rd Qu.:59.27500   3rd Qu.:8312.500            
 Max.   :87.20000   Max.   :9517.000            
> head(Prestige)
                    education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof
> testidx <- which(1:nrow(Prestige)%%4==0)
> prestige_train <- Prestige[-testidx,]
> prestige_test <- Prestige[testidx,]


To measure the performance of our prediction, we use some simple metrics to illustrate the basic ideas.  For regression problems, we'll use the correlation between our prediction and the actual test values.  For classification problems, we'll use a contingency table to count the true/false positives/negatives in our prediction.

Now that we have set the stage, let's get started.

Linear Regression

Linear regression has the longest history in statistics; it is well understood and is the most popular machine learning model.  It is based on the assumption that a linear relationship exists between the input variables x1, x2, ... and the numeric output variable y.

y = Ɵ0 + Ɵ1x1 + Ɵ2x2 + …


The learning algorithm will learn the set of parameters Ɵi such that the objective function, the sum of squared errors ∑(yactual - yestimate)², is minimized.

Here is the code example in R.

> model <- lm(prestige~., data=prestige_train)
> summary(model)
Call:
lm(formula = prestige ~ ., data = prestige_train)

Residuals:
        Min          1Q      Median          3Q         Max 
-13.9078951  -5.0335742   0.3158978   5.3830764  17.8851752 

Coefficients:
                  Estimate     Std. Error  t value     Pr(>|t|)    
(Intercept) -20.7073113585  11.4213272697 -1.81304    0.0743733 .  
education     4.2010288017   0.8290800388  5.06710 0.0000034862 ***
income        0.0011503739   0.0003510866  3.27661    0.0016769 ** 
women         0.0363017610   0.0400627159  0.90612    0.3681668    
census        0.0018644881   0.0009913473  1.88076    0.0644172 .  
typeprof     11.3129416488   7.3932217287  1.53018    0.1307520    
typewc        1.9873305448   4.9579992452  0.40083    0.6898376    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 7.41604 on 66 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.820444,   Adjusted R-squared: 0.8041207 
F-statistic: 50.26222 on 6 and 66 DF,  p-value: < 0.00000000000000022204 

> # Use the model to predict the output of test data
> prediction <- predict(model, newdata=prestige_test)
> # Check for the correlation with actual result
> cor(prediction, prestige_test$prestige)
[1] 0.9376719009


The Estimate column gives an estimate of each Ɵi, and the associated p-value indicates the confidence of each estimate.  For example, features not marked with at least one * can usually be safely ignored.

In the above model, education and income have a strong influence on prestige.

Linear regression makes certain assumptions about the underlying distribution of data which we need to validate, including that the residuals (errors) are normally distributed with zero mean and constant variance.

> # verify if the residuals are normally distributed
> rs <- residuals(model)
> qqnorm(rs)
> qqline(rs)
> shapiro.test(rs)
        Shapiro-Wilk normality test
data:  rs 
W = 0.9744, p-value = 0.1424


Here is a plot of how the residuals align with the normal distribution.

Logistic Regression

Logistic regression applies a transformation to the output of a linear regression so that it falls between 0 and 1, and hence mimics the probability of a binary output.  Logistic regression is the most popular machine learning technique for solving classification problems.

In a classification problem, the output is binary rather than numeric, so we can imagine doing a linear regression and then compressing the numeric output into the 0..1 range using the logistic function
1/(1 + e^-t)

y = 1/(1 + e^-(Ɵ0 + Ɵ1x1 + Ɵ2x2 + …))


The objective function in logistic regression is different.  It minimizes the cross entropy
-∑ (yactual * log yestimate + (1 - yactual) * log(1 - yestimate))

Here is the example R code to classify iris data.

> newcol = data.frame(isSetosa=(iristrain$Species == 'setosa'))
> traindata <- cbind(iristrain, newcol)
> head(traindata)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species isSetosa
1          5.1         3.5          1.4         0.2  setosa     TRUE
2          4.9         3.0          1.4         0.2  setosa     TRUE
3          4.7         3.2          1.3         0.2  setosa     TRUE
4          4.6         3.1          1.5         0.2  setosa     TRUE
6          5.4         3.9          1.7         0.4  setosa     TRUE
7          4.6         3.4          1.4         0.3  setosa     TRUE
> formula <- isSetosa ~ Sepal.Length + Sepal.Width +
                        Petal.Length + Petal.Width
> logisticModel <- glm(formula, data=traindata, 
                       family="binomial")
> # Predict the probability for test data
> prob <- predict(logisticModel, newdata=iristest, type='response')
> round(prob, 3)
  5  10  15  20  25  30  35  40  45  50  55  60  65  70  75 
  1   1   1   1   1   1   1   1   1   1   0   0   0   0   0 
 80  85  90  95 100 105 110 115 120 125 130 135 140 145 150 
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 

Regression with Regularization

Notice that our assumption above of linearity between the input and output variables is not necessarily true.  One technique is to combine existing features by multiplying them together and hope we are lucky enough to hit some useful combination.  We may therefore end up with a large set of input features in our machine learning.

When we have a large number of input variables but a moderate amount of training data, we are subject to the overfitting problem, where our model fits the training data too specifically and does not generalize well to data we haven't seen.  Regularization is a technique for preventing the model from fitting the training data too closely.

In linear regression, overfitting tends to happen when the Ɵi have large values.  So we can add a penalty that is proportional to the magnitude of Ɵ.  In L2 regularization (also known as Ridge regression), Σ (Ɵi)² is added to the cost function, while in L1 regularization (also known as Lasso regression), Σ |Ɵi| is added to the cost function.

Both L1 and L2 shrink the magnitude of the Ɵi.  L2 tends to give correlated input variables similar coefficients, while L1 tends to keep one variable's coefficient non-zero and drive the others to zero.  In other words, L1 regularization penalizes the coefficients of redundant variables that are linearly dependent and is frequently used to remove redundant features.

Combining L1 and L2, the general form of the cost function becomes
Cost = Non-regularization-cost + λ (α.Σ |Ɵi| + (1 - α).Σ (Ɵi)²)

Notice there are 2 tunable parameters, lambda λ and alpha α.  Lambda controls the degree of regularization (0 means no regularization; infinity drives all coefficients to zero, effectively ignoring all input variables).  Alpha controls the mix between L1 and L2 (0 means pure L2, 1 means pure L1).

Glmnet is a popular regularization package.  The alpha parameter needs to be supplied based on the application's needs (for selecting a reduced set of variables, alpha=1 is preferred).  The library provides a cross-validation routine to automatically figure out a good lambda value.

Let's repeat the linear regression on the "Prestige" data set, this time with regularization; we pick alpha = 0.7 to favor L1 regularization.

> library(glmnet)
> cv.fit <- cv.glmnet(as.matrix(prestige_train[,c(-4, -6)]), 
                      as.vector(prestige_train[,4]), 
                      nlambda=100, alpha=0.7, 
                      family="gaussian")
> plot(cv.fit)
> coef(cv.fit)
5 x 1 sparse Matrix of class "dgCMatrix"
                          1
(Intercept) 6.3876684930151
education   3.2111461944976
income      0.0009473793366
women       0.0000000000000
census      0.0000000000000
> prediction <- predict(cv.fit, 
                        newx=as.matrix(prestige_test[,c(-4, -6)]))
> cor(prediction, as.vector(prestige_test[,4]))
          [,1]
1 0.9291181193


Here is the cross-validation plot, which shows the best lambda with minimal mean square error.


We can also repeat the logistic regression on the "iris" data set with regularization; this time we pick alpha = 0.8 to favor L1 regularization even more.

> library(glmnet)
> cv.fit <- cv.glmnet(as.matrix(iristrain[,-5]), 
                      iristrain[,5], alpha=0.8, 
                      family="multinomial")
> prediction <- predict(cv.fit, 
                        newx=as.matrix(iristest[,-5]),
                        type="class")
> table(prediction, iristest$Species)
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         2
  virginica       0          0         8


Instead of picking lambda (the degree of regularization) based on cross-validation, we can also pick it based on the number of input variables we want to retain.  We can plot the "regularization path", which shows how the coefficient of each input variable changes as lambda changes, and pick the lambda that filters out the desired number of input variables for us.

> # try alpha = 0, Ridge regression
> fit <- glmnet(as.matrix(iristrain[,-5]), 
                iristrain[,5], alpha=0, 
                family="multinomial")
> plot(fit)
> # try alpha = 0.8, Elastic net
> fit <- glmnet(as.matrix(iristrain[,-5]), 
                iristrain[,5], alpha=0.8, 
                family="multinomial")
> plot(fit)
> # try alpha = 1, Lasso regression
> fit <- glmnet(as.matrix(iristrain[,-5]), 
                iristrain[,5], alpha=1, 
                family="multinomial")
> plot(fit)


Here is the output of these plots.  Notice that as alpha gets closer to 1 (L1 regularization), it filters out input variables more aggressively.

In my next post, I will cover the other machine learning techniques.

Thursday, May 17, 2012

Predictive Analytics: Data Preparation

As a continuation of my last post on predictive analytics, in this post I will focus on how to prepare data for training the predictive model.  I will cover how to perform the sampling necessary to ensure the training data is representative and fits within the machine's processing capacity.  Then we validate the input data, perform the necessary cleanup of format errors, fill in missing values, and finally transform the collected data into our defined set of input features.

Each machine learning model has its own requirements for its input and output data types.  Therefore, we may need to perform additional transformations to fit the model's requirements.

Sample the data

If the amount of raw data is huge, processing all of it may require an extensive amount of processing power, which may not be practical.  In this case it is quite common to sample the input data to reduce the size of the data that needs to be processed.

There are many sampling models.  Random sampling is by far the most common; it assigns a uniform probability to each record of being picked.  Stratified sampling, on the other hand, allows different probabilities to be assigned to different record types.  It is very useful in the case of highly imbalanced classes, where the high-frequency class can be down-sampled.  Cluster sampling samples at a higher level of granularity (the cluster) and then picks all members within the selected clusters.  An example of cluster sampling is to select families (rather than individuals) to conduct a survey.

Sampling can also be done without replacement (each record can be picked at most once) or with replacement (the same record can be picked more than once).

Here is a simple example of sampling with replacement (notice record 36 has been selected twice).

> # select 10 records out from iris with replacement
> index <- sample(1:nrow(iris), 10, replace=T)
> index
 [1] 133  36 107 140  66  67  36   3  97  37
> irissample <- iris[index,]
> irissample
     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
133           6.4         2.8          5.6         2.2  virginica
36            5.0         3.2          1.2         0.2     setosa
107           4.9         2.5          4.5         1.7  virginica
140           6.9         3.1          5.4         2.1  virginica
66            6.7         3.1          4.4         1.4 versicolor
67            5.6         3.0          4.5         1.5 versicolor
36.1          5.0         3.2          1.2         0.2     setosa
3             4.7         3.2          1.3         0.2     setosa
97            5.7         2.9          4.2         1.3 versicolor
37            5.5         3.5          1.3         0.2     setosa
> 
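The stratified sampling described earlier can be sketched as follows.  This is a minimal illustration (the quota of 5 records per class is an arbitrary choice, and set.seed is used only to make the draw reproducible):

```r
# Stratified sampling sketch: draw a fixed quota of records per class,
# which down-samples any over-represented class.
set.seed(42)                  # for reproducibility (illustrative only)
per.class <- 5                # arbitrary quota per class
strata <- split(iris, iris$Species)
stratified <- do.call(rbind, lapply(strata, function(grp) {
  grp[sample(1:nrow(grp), per.class, replace=FALSE), ]
}))
table(stratified$Species)     # every class now has exactly 5 records
```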

 

Impute missing data

It is quite common that some of the input records are incomplete, in the sense that certain fields are missing or contain input errors.  In a typical tabular data format, we need to validate that each record contains the same number of fields and each field contains the data type we expect.

In case the record has some fields missing, we have the following choices
  • Discard the whole record if it is incomplete
  • Infer the missing value based on the data from other records.  A common approach is to fill the missing data with the average, or the median.

> # Create some missing data
> irissample[10, 1] <- NA
> irissample
     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
133           6.4         2.8          5.6         2.2  virginica
36            5.0         3.2          1.2         0.2     setosa
107           4.9         2.5          4.5         1.7  virginica
140           6.9         3.1          5.4         2.1  virginica
66            6.7         3.1          4.4         1.4 versicolor
67            5.6         3.0          4.5         1.5 versicolor
36.1          5.0         3.2          1.2         0.2     setosa
3             4.7         3.2          1.3         0.2     setosa
97            5.7         2.9          4.2         1.3 versicolor
37             NA         3.5          1.3         0.2     setosa
> library(e1071)
Loading required package: class
Warning message:
package ‘e1071’ was built under R version 2.14.2 
> fixIris1 <- impute(irissample[,1:4], what='mean')
> fixIris1
     Sepal.Length Sepal.Width Petal.Length Petal.Width
133      6.400000         2.8          5.6         2.2
36       5.000000         3.2          1.2         0.2
107      4.900000         2.5          4.5         1.7
140      6.900000         3.1          5.4         2.1
66       6.700000         3.1          4.4         1.4
67       5.600000         3.0          4.5         1.5
36.1     5.000000         3.2          1.2         0.2
3        4.700000         3.2          1.3         0.2
97       5.700000         2.9          4.2         1.3
37       5.655556         3.5          1.3         0.2
> fixIris2 <- impute(irissample[,1:4], what='median')
> fixIris2
     Sepal.Length Sepal.Width Petal.Length Petal.Width
133           6.4         2.8          5.6         2.2
36            5.0         3.2          1.2         0.2
107           4.9         2.5          4.5         1.7
140           6.9         3.1          5.4         2.1
66            6.7         3.1          4.4         1.4
67            5.6         3.0          4.5         1.5
36.1          5.0         3.2          1.2         0.2
3             4.7         3.2          1.3         0.2
97            5.7         2.9          4.2         1.3
37            5.6         3.5          1.3         0.2
> 

 

Normalize numeric value

Normalizing data means transforming numeric data into a uniform range.  Numeric attributes can have different magnitudes depending on their measurement units.  To compare numeric attributes on the same scale, we normalize the data by subtracting the mean and then dividing by the standard deviation.

> # scale the columns
> # (x - mean(x)) / sd(x)
> scaleiris <- scale(iris[, 1:4])
> head(scaleiris)
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]   -0.8976739  1.01560199    -1.335752   -1.311052
[2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
[3,]   -1.3807271  0.32731751    -1.392399   -1.311052
[4,]   -1.5014904  0.09788935    -1.279104   -1.311052
[5,]   -1.0184372  1.24503015    -1.335752   -1.311052
[6,]   -0.5353840  1.93331463    -1.165809   -1.048667
>

 

Reduce dimensionality

High dimensionality is a problem for machine learning tasks.  There are two ways to reduce the number of input attributes: removing irrelevant input variables and removing redundant input variables.

Looking at it from the other angle, removing irrelevant features is the same as selecting relevant features.  The general approach is to try different subsets of features to see which combination performs best (i.e., how well it predicts hold-out test data).  One approach, forward selection, is to start with zero features and add one feature at a time, keeping whichever gives the best prediction; then add a second feature to the best single feature to find the best pair, and so on.  Another approach, backward elimination, goes in the opposite direction: start with the full set of features and remove one feature at a time to see which removal retains the best performance.

In my experience, having irrelevant features sitting around adds processing overhead but usually won't affect the accuracy of the prediction.  This is a lesser problem than redundant features, which give unequal weight to information that is duplicated.  Removing redundant features is the higher priority in my opinion.
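The greedy search over feature subsets described above can be sketched as follows.  This is a minimal illustration of forward selection, not a production feature selector; it reuses the iris data (predicting Sepal.Length from the other measurements, an arbitrary choice) and scores each candidate set by the hold-out correlation of a linear model's predictions:

```r
# Greedy forward-selection sketch: start with no features and repeatedly
# add the candidate that most improves hold-out performance.
testidx <- which(1:nrow(iris) %% 5 == 0)
train <- iris[-testidx, ]
test  <- iris[testidx, ]
score <- function(features) {           # hold-out correlation as the metric
  f <- as.formula(paste("Sepal.Length ~", paste(features, collapse=" + ")))
  m <- lm(f, data=train)
  cor(predict(m, newdata=test), test$Sepal.Length)
}
candidates <- c("Sepal.Width", "Petal.Length", "Petal.Width")
selected <- c()
best.so.far <- -Inf
while (length(candidates) > 0) {
  scores <- sapply(candidates, function(f) score(c(selected, f)))
  if (max(scores) <= best.so.far) break # stop when no candidate helps
  best <- names(which.max(scores))
  best.so.far <- max(scores)
  selected <- c(selected, best)
  candidates <- setdiff(candidates, best)
}
selected                                # features kept, in selection order
```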

Principal Component Analysis is a common technique that looks only at the numeric input features to measure the linear dependency among them.  It transforms the input feature set into a lower-dimensional space while retaining most of the fidelity of the data.

> # Use iris data set
> cor(iris[, -5])
              Sepal.Length   Sepal.Width  Petal.Length   Petal.Width
Sepal.Length  1.0000000000 -0.1175697841  0.8717537759  0.8179411263
Sepal.Width  -0.1175697841  1.0000000000 -0.4284401043 -0.3661259325
Petal.Length  0.8717537759 -0.4284401043  1.0000000000  0.9628654314
Petal.Width   0.8179411263 -0.3661259325  0.9628654314  1.0000000000
> # Some attributes shows high correlation, compute PCA
> pca <- prcomp(iris[,-5], scale=T)
> summary(pca)
Importance of components:
                            PC1       PC2       PC3       PC4
Standard deviation     1.708361 0.9560494 0.3830886 0.1439265
Proportion of Variance 0.729620 0.2285100 0.0366900 0.0051800
Cumulative Proportion  0.729620 0.9581300 0.9948200 1.0000000
> # Notice PC1 and PC2 covers most variation
> plot(pca)
> pca$rotation
                       PC1            PC2           PC3           PC4
Sepal.Length  0.5210659147 -0.37741761556  0.7195663527  0.2612862800
Sepal.Width  -0.2693474425 -0.92329565954 -0.2443817795 -0.1235096196
Petal.Length  0.5804130958 -0.02449160909 -0.1421263693 -0.8014492463
Petal.Width   0.5648565358 -0.06694198697 -0.6342727371  0.5235971346
> # Project first 2 records in PCA direction
> predict(pca)[1:2,]
              PC1           PC2          PC3           PC4
[1,] -2.257141176 -0.4784238321 0.1272796237 0.02408750846
[2,] -2.074013015  0.6718826870 0.2338255167 0.10266284468
> # plot all points in top 2 PCA direction
> biplot(pca)




Add derived attributes

In some cases, we may need to compute additional attributes from existing attributes.

> iris2 <- transform(iris, ratio=round(Sepal.Length/Sepal.Width, 2))
> head(iris2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species ratio
1          5.1         3.5          1.4         0.2  setosa  1.46
2          4.9         3.0          1.4         0.2  setosa  1.63
3          4.7         3.2          1.3         0.2  setosa  1.47
4          4.6         3.1          1.5         0.2  setosa  1.48
5          5.0         3.6          1.4         0.2  setosa  1.39
6          5.4         3.9          1.7         0.4  setosa  1.38


Another common use of derived attributes is to generalize an attribute to a coarser grain, such as converting a geo-location to a zip code, or converting an age to an age group.
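For instance, converting an age to an age group can be sketched with cut(); the ages, break points and labels here are arbitrary illustrative choices:

```r
# Generalize a numeric age into a coarser age-group attribute
age <- c(5, 13, 25, 41, 67, 80)        # illustrative data
agegroup <- cut(age, breaks=c(0, 12, 19, 65, Inf),
                labels=c('child', 'teen', 'adult', 'senior'))
data.frame(age, agegroup)
```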

Discretize numeric value into categories

Discretizing data means cutting a continuous value into ranges and assigning each numeric value to the bucket of the range it falls into.  For a numeric attribute, a common way to generalize it is to discretize it into ranges, which can have either constant width (variable height/frequency) or variable width (constant height).

> # Equal width cuts
> segments <- 10
> maxL <- max(iris$Petal.Length)
> minL <- min(iris$Petal.Length)
> theBreaks <- seq(minL, maxL, 
                   by=(maxL-minL)/segments)
> cutPetalLength <- cut(iris$Petal.Length, 
                        breaks=theBreaks, 
                        include.lowest=T)
> newdata <- data.frame(orig.Petal.Len=iris$Petal.Length, 
                        cut.Petal.Len=cutPetalLength)
> head(newdata)
  orig.Petal.Len cut.Petal.Len
1            1.4      [1,1.59]
2            1.4      [1,1.59]
3            1.3      [1,1.59]
4            1.5      [1,1.59]
5            1.4      [1,1.59]
6            1.7   (1.59,2.18]
>
> # Constant frequency / height
> myBreaks <- quantile(iris$Petal.Length, 
                       probs=seq(0,1,1/segments))
> cutPetalLength2 <- cut(iris$Petal.Length, 
                         breaks=myBreaks,
                         include.lowest=T)
> newdata2 <- data.frame(orig.Petal.Len=iris$Petal.Length, 
                         cut.Petal.Len=cutPetalLength2)
> head(newdata2)
  orig.Petal.Len cut.Petal.Len
1            1.4       [1,1.4]
2            1.4       [1,1.4]
3            1.3       [1,1.4]
4            1.5     (1.4,1.5]
5            1.4       [1,1.4]
6            1.7     (1.7,3.9]
> 

 

Binarize categorical attributes

Certain machine learning models only take binary (or numeric) input.  In this case, we need to convert a categorical attribute into multiple binary attributes, where each binary attribute corresponds to a particular value of the category.  (e.g. sunny/rainy/cloudy can be encoded as sunny == 100, rainy == 010 and cloudy == 001)

> cat <- levels(iris$Species)
> cat
[1] "setosa"     "versicolor" "virginica" 
> binarize <- function(x) {return(iris$Species == x)}
> newcols <- sapply(cat, binarize)
> colnames(newcols) <- cat
> data <- cbind(iris[,c('Species')], newcols)
> data[45:55,]
        setosa versicolor virginica
 [1,] 1      1          0         0
 [2,] 1      1          0         0
 [3,] 1      1          0         0
 [4,] 1      1          0         0
 [5,] 1      1          0         0
 [6,] 1      1          0         0
 [7,] 2      0          1         0
 [8,] 2      0          1         0
 [9,] 2      0          1         0
[10,] 2      0          1         0
[11,] 2      0          1         0

 

Select, combine, aggregate data

Designing the form of the training data is, in my opinion, the most important part of the whole predictive modeling exercise, because the accuracy largely depends on whether the input features are structured in an appropriate form that provides strong signals to the learning algorithm.

Rather than using the raw data as is, it is quite common that multiple pieces of raw data need to be combined, or that multiple raw data records need to be aggregated along some dimensions.

In this section, let's use a different data set, CO2, which records the carbon dioxide uptake of grass plants.

> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
>

To select the records that meet certain criteria:

> data <- CO2[CO2$conc>400 & CO2$uptake>40,]
> head(data)
   Plant   Type  Treatment conc uptake
12   Qn2 Quebec nonchilled  500   40.6
13   Qn2 Quebec nonchilled  675   41.4
14   Qn2 Quebec nonchilled 1000   44.3
19   Qn3 Quebec nonchilled  500   42.9
20   Qn3 Quebec nonchilled  675   43.9
21   Qn3 Quebec nonchilled 1000   45.5

To sort the records, let's say we want to sort by conc (in ascending order) and then by uptake (in descending order).

> # ascend sort on conc, descend sort on uptake
> CO2[order(CO2$conc, -CO2$uptake),][1:20,]
   Plant        Type  Treatment conc uptake
15   Qn3      Quebec nonchilled   95   16.2
1    Qn1      Quebec nonchilled   95   16.0
36   Qc3      Quebec    chilled   95   15.1
22   Qc1      Quebec    chilled   95   14.2
8    Qn2      Quebec nonchilled   95   13.6
50   Mn2 Mississippi nonchilled   95   12.0
57   Mn3 Mississippi nonchilled   95   11.3
43   Mn1 Mississippi nonchilled   95   10.6
78   Mc3 Mississippi    chilled   95   10.6
64   Mc1 Mississippi    chilled   95   10.5
29   Qc2      Quebec    chilled   95    9.3
71   Mc2 Mississippi    chilled   95    7.7
16   Qn3      Quebec nonchilled  175   32.4
2    Qn1      Quebec nonchilled  175   30.4
9    Qn2      Quebec nonchilled  175   27.3
30   Qc2      Quebec    chilled  175   27.3
23   Qc1      Quebec    chilled  175   24.1
51   Mn2 Mississippi nonchilled  175   22.0
37   Qc3      Quebec    chilled  175   21.0
58   Mn3 Mississippi nonchilled  175   19.4
> 


To look at each plant rather than each raw record, let's compute the average uptake per plant.

> aggregate(CO2[,c('uptake')], data.frame(CO2$Plant), mean)
   CO2.Plant        x
1        Qn1 33.22857
2        Qn2 35.15714
3        Qn3 37.61429
4        Qc1 29.97143
5        Qc3 32.58571
6        Qc2 32.70000
7        Mn3 24.11429
8        Mn2 27.34286
9        Mn1 26.40000
10       Mc2 12.14286
11       Mc3 17.30000
12       Mc1 18.00000
> 

We can also group by the combination of type and treatment

> aggregate(CO2[,c('conc', 'uptake')],
            data.frame(CO2$Type, CO2$Treatment), 
            mean)

     CO2.Type CO2.Treatment conc   uptake
1      Quebec    nonchilled  435 35.33333
2 Mississippi    nonchilled  435 25.95238
3      Quebec       chilled  435 31.75238
4 Mississippi       chilled  435 15.81429


To join multiple data sources by a common key, we can use the merge() function.

> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
> # Let's create some artificial data
> state <- c('California', 'Mississippi', 'Fujian',
                   'Shandong', 'Quebec', 'Ontario')
> country <- c('USA', 'USA', 'China', 'China', 
                     'Canada', 'Canada')
> geomap <- data.frame(country=country, state=state)
> geomap
  country       state
1     USA  California
2     USA Mississippi
3   China      Fujian
4   China    Shandong
5  Canada      Quebec
6  Canada     Ontario
> # Need to match the column name in join
> colnames(geomap) <- c('country', 'Type')
> joinCO2 <- merge(CO2, geomap, by=c('Type'))
> head(joinCO2)
         Type Plant  Treatment conc uptake country
1 Mississippi   Mn1 nonchilled   95   10.6     USA
2 Mississippi   Mn1 nonchilled  175   19.2     USA
3 Mississippi   Mn1 nonchilled  250   26.2     USA
4 Mississippi   Mn1 nonchilled  350   30.0     USA
5 Mississippi   Mn1 nonchilled  500   30.9     USA
6 Mississippi   Mn1 nonchilled  675   32.4     USA
> 

 

Power and Log transformation

A large portion of machine learning models are based on the assumption of a linear relationship (i.e. the output is linearly dependent on the input), as well as normally distributed errors with constant standard deviation.  However, in many cases reality does not exactly align with these assumptions.

In order to use these machine learning models effectively, it is common practice to transform the input or output variables so that they approximate a normal distribution (and, in the multivariate case, a multivariate normal distribution).

A commonly used transformation is the Box-Cox transformation, which transforms a variable x to (x^k - 1) / k, where k is a parameter that we need to determine.  (When k = 2, this is a square transform; when k = 0.5, a square root transform; as k approaches 0, it becomes log x; when k = 1, there is no transform apart from a shift.)

Here is how we determine what k should be for input variable x and then transform the variable.

> # Use the data set Prestige
> library(car)
> head(Prestige)
                    education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof
> plot(density(Prestige$income))
> qqnorm(Prestige$income)
> qqline(Prestige$income)
> summary(box.cox.powers(cbind(Prestige$income)))
bcPower Transformation to Normality 

   Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Y1    0.1793   0.1108          -0.0379           0.3965

Likelihood ratio tests about transformation parameters
                            LRT df         pval
LR test, lambda = (0)  2.710304  1 9.970200e-02
LR test, lambda = (1) 47.261001  1 6.213585e-12
> transformdata <- box.cox(Prestige$income, 0.18)
> plot(density(transformdata))
> qqnorm(transformdata)
> qqline(transformdata)



Hopefully I have covered the primary data transformation tasks.  This sets the stage for my next post, which is about training the predictive model.

Monday, May 14, 2012

Predictive Analytics: Overview and Data visualization

I plan to start a series of blog posts on predictive analytics, as there is increasing demand for applying machine learning techniques to analyze large amounts of raw data.  This set of techniques is very useful to me, and I think it should be useful to other people as well.  I will also go through some coding examples in R.  R is a statistical programming language that is very useful for performing predictive analytics tasks.  In case you are not familiar with R, here is a very useful link to get some familiarity in R.

Predictive analytics is a specialized set of data processing techniques focused on predicting future outcomes by analyzing previously collected data.  The processing cycle typically involves two phases:
  1. Training phase: Learn a model from training data
  2. Predicting phase: Deploy the model to production and use it to predict unknown or future outcomes
The whole lifecycle of training involves the following steps.

Determine the input and output

At this step, we define the output (what we predict) as well as the input data (what we use) in this exercise.  If the output is a continuous numeric quantity, we call the exercise "regression".  If the output is a discrete category, we call it "classification".  Similarly, the input can be numeric, categorical, or a mix of both.

Determining the ultimate output is largely a business requirement, and it is usually well-defined (e.g. predicting the quarterly revenue).  However, there are many intermediate outputs that are related to (in fact, they are inputs of) the final output.  In my experience, determining this set of intermediate outputs usually goes through a back-tracking exploratory process as follows.
  • Starting at the final output that we aim to predict (e.g. quarterly revenue).
  • Identify all input variables that are useful for predicting the output.  Look into the sources for obtaining this input data.  If there is no data source corresponding to an input variable, that variable becomes a candidate intermediate output.
  • We repeat this process to learn about these intermediate outputs.  Effectively, we build multiple layers of predictive analytics so that we can move from the raw input data to the intermediate outputs and eventually to the final output.

Feature engineering

At this step, we determine how to extract useful input information from the raw data that will be influential to the output.  This is an exploratory exercise guided by domain experts.  Finally, a set of input features (derived from the raw input data) will be defined.

Visualizing existing data is a very useful way to come up with ideas about which features should be included.  A "dataframe" in R organizes data samples in a tabular structure, and we'll be using some dataframes that come bundled with R.  Specifically, the dataframe "iris" records sepal and petal measurements for three species of iris.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(iris)
[1] 150
> table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
>


Single Variable Frequency Plot

For numeric data, it is good to get some idea of the frequency distribution.  A histogram, overlaid with a smoothed density plot, gives a good picture.

> # Plot the histogram
> hist(iris$Sepal.Length, breaks=10, prob=T)
> # Plot the density curve
> lines(density(iris$Sepal.Length))
>

For categorical data, a bar plot is a good choice.

> categories <- table(iris$Species)
> barplot(categories, col=c('red', 'green', 'blue'))
>


Two variable Plot

Box plot can be used to visualize the distribution of a numeric value across different categories.

> boxplot(Sepal.Length~Species, data=iris)
>
When there are multiple numeric input variables, it is useful to get an idea of their correlations to identify any redundancy in our input.  Certain models are sensitive to such redundancy and won't give accurate predictions if their inputs are highly redundant.  This is known as the multicollinearity problem.

To get an idea of the correlation, we can use a scatter plot matrix for any two pairs of input variables.  We can also use a correlation matrix for the same purpose.

> # Scatter plot for all pairs
> pairs(iris[,c(1,2,3,4)])
> # Compute the correlation matrix
> cor(iris[,c(1,2,3,4)])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1170695    0.8716902   0.8179410
Sepal.Width    -0.1170695   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8716902  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179410  -0.3661259    0.9628654   1.0000000
> 

From the matrix, we can see that Petal.Length is highly correlated with Petal.Width, and there is also some correlation between Sepal.Length and Petal.Width.

We can drill down further into the relationship between two numeric variables by fitting a regression line or a regression curve (via locally weighted regression).

> # First plot the 2 variables
> plot(Petal.Width~Sepal.Length, data=iris)
> # Learn the regression model
> model <- lm(Petal.Width~Sepal.Length, data=iris)
> # Plot the regression line
> abline(model)
> # Now learn the local linear model
> model2 <- lowess(iris$Sepal.Length, iris$Petal.Width)
> lines(model2, col="red")
>


OLAP processing

If the data is in multi-dimensional form (has multiple categorical attributes), OLAP type manipulation can give a very good summary.

In this section, let's use a different dataset, CO2, which records the carbon dioxide uptake of grass plants.

> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
>

Let's look at the counts summarized along different dimensions.

> cube <- xtabs(~Plant+Type+Treatment, data=CO2)
> cube
, , Treatment = nonchilled

     Type
Plant Quebec Mississippi
  Qn1      7           0
  Qn2      7           0
  Qn3      7           0
  Qc1      0           0
  Qc3      0           0
  Qc2      0           0
  Mn3      0           7
  Mn2      0           7
  Mn1      0           7
  Mc2      0           0
  Mc3      0           0
  Mc1      0           0

, , Treatment = chilled

     Type
Plant Quebec Mississippi
  Qn1      0           0
  Qn2      0           0
  Qn3      0           0
  Qc1      7           0
  Qc3      7           0
  Qc2      7           0
  Mn3      0           0
  Mn2      0           0
  Mn1      0           0
  Mc2      0           7
  Mc3      0           7
  Mc1      0           7

> apply(cube, c("Plant"), sum)
Qn1 Qn2 Qn3 Qc1 Qc3 Qc2 Mn3 Mn2 Mn1 Mc2 Mc3 Mc1 
  7   7   7   7   7   7   7   7   7   7   7   7 
> apply(cube, c("Type"), sum)
     Quebec Mississippi 
         42          42 
> apply(cube, c("Plant", "Type"), mean)
     Type
Plant Quebec Mississippi
  Qn1    3.5         0.0
  Qn2    3.5         0.0
  Qn3    3.5         0.0
  Qc1    3.5         0.0
  Qc3    3.5         0.0
  Qc2    3.5         0.0
  Mn3    0.0         3.5
  Mn2    0.0         3.5
  Mn1    0.0         3.5
  Mc2    0.0         3.5
  Mc3    0.0         3.5
  Mc1    0.0         3.5
>

Prepare training data

At this step, the purpose is to transform the raw data into a form that fits the machine learning model.  Some common steps include:
  • Data sampling
  • Data validation and handle missing data
  • Normalize numeric value into a uniform range
  • Compute aggregated value (a special case is to compute frequency counts)
  • Expand categorical field to binary fields
  • Discretize numeric value into categories
  • Create derived fields from existing fields
  • Reduce dimensionality
  • Power and Log transformation
Data preparation is the bulk of the effort for most predictive analytics tasks and usually consumes 60 to 80% of the time.  This is an important area that deserves a dedicated post; I'll cover data preparation in my next blog post.
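As a small illustration, a few of these preparation steps can be sketched in R.  The data frame and column names below are hypothetical, made up just for this example:

```r
# Hypothetical raw data with a missing value and a categorical field
raw <- data.frame(
  amount = c(120, 85, 300, NA, 150),
  region = c("east", "west", "east", "south", "west")
)

# Handle missing data: impute with the column mean
raw$amount[is.na(raw$amount)] <- mean(raw$amount, na.rm=TRUE)

# Normalize numeric value into a uniform range [0, 1]
raw$amount.norm <- (raw$amount - min(raw$amount)) /
                   (max(raw$amount) - min(raw$amount))

# Expand categorical field to binary fields (one column per level)
binary <- model.matrix(~ region - 1, data=raw)

# Discretize numeric value into categories
raw$amount.bucket <- cut(raw$amount, breaks=3,
                         labels=c("low", "mid", "high"))
```

Which of these steps apply, and in what order, depends on the model being trained; mean imputation and equal-width binning above are just two of the simpler choices.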

Train a model

At this step, we pick a machine learning model based on the characteristics of the input and output features.  Then we feed the training data into the learning algorithm, which produces a set of parameters that explains the training data.

There are many machine learning models to choose from, some common ones being …
Each of the above models has its own characteristics and fits certain types of problems better.  I'll cover each of them in more detail in future posts.
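As one concrete example of this step, here is a plain linear regression trained in R on the iris data from earlier (any of the models could be substituted; the predicted observation is made up for illustration):

```r
# Fit a linear regression predicting Petal.Width from the other measures
model <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
            data=iris)

# The learned parameters (coefficients) summarize the training data
coef(model)

# Use the model to predict the output for a new observation
predict(model, newdata=data.frame(Sepal.Length=5.0,
                                  Sepal.Width=3.4,
                                  Petal.Length=1.5))
```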

Validate the model

After we train the model, how do we validate that it does a good job of prediction?  Typically, we withhold a subset of the training data for testing.  Through this testing, we quantitatively measure the model's predictive performance.  I'll cover this topic in more detail in a future post.
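A minimal sketch of this hold-out approach in R, again using the iris data (the 80/20 split ratio and RMSE metric here are just illustrative choices):

```r
set.seed(1)

# Withhold 20% of the rows for testing
test.idx  <- sample(nrow(iris), size=0.2 * nrow(iris))
train.set <- iris[-test.idx, ]
test.set  <- iris[test.idx, ]

# Train on the remaining 80%
model <- lm(Petal.Width ~ Sepal.Length + Petal.Length, data=train.set)

# Measure prediction quality on the withheld data (RMSE)
pred <- predict(model, newdata=test.set)
rmse <- sqrt(mean((pred - test.set$Petal.Width)^2))
rmse
```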

Deploy the model

When we are satisfied with the model, we deploy it to the production environment and use it to predict real-life events.  We should also closely monitor whether the real-life accuracy aligns with our testing results.

Usually, the quality of the model's predictions degrades over time because the stationarity assumption (that the future data pattern is similar to the data pattern at training time) no longer holds as time passes.  On the other hand, we may acquire additional input signals that can improve prediction accuracy.  Therefore, a model should have an expiration time and be re-trained after that.

Hopefully, this post gave a high-level overview of the predictive analytics process.  I'll drill down into each item in more detail in my future posts.