Predictive analytics is a specialized set of data processing techniques focused on predicting future outcomes by analyzing previously collected data. The processing cycle typically involves two phases:
- Training phase: Learn a model from training data
- Predicting phase: Deploy the model to production and use it to predict unknown or future outcomes
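As a minimal sketch of the two phases, assuming the built-in iris data frame and a plain linear model as stand-ins for a real data set and a real model (both are illustrative choices, not a production setup), training and predicting could look like this:

```r
# Training phase: learn a model from historical (training) data.
# The first 100 shuffled rows of iris stand in for "past" data.
set.seed(42)
shuffled <- iris[sample(nrow(iris)), ]
training <- shuffled[1:100, ]
model <- lm(Petal.Width ~ Petal.Length, data = training)

# Predicting phase: apply the learned model to inputs it has not seen.
unseen <- shuffled[101:150, ]
predicted <- predict(model, newdata = unseen)
head(predicted)
```

In a real deployment the "unseen" rows would arrive after the model was trained, rather than being split off from the same data frame.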
Determine the input and output
At this step, we define the output (what we predict) as well as the input data (what we use) in this exercise. If the output is a continuous numeric quantity, we call the exercise a “regression”. If the output is a discrete category, we call it a “classification”. Similarly, the input can be a number, a category, or a mix of both.
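A rough way to see the distinction in R is to look at the type of the output column. The helper `task_type` below is a hypothetical name made up for this illustration, and the rule is deliberately crude: a category stored as integer codes would be misread as numeric.

```r
# Hypothetical helper: the type of the output column suggests the exercise.
task_type <- function(output) {
  if (is.numeric(output)) "regression" else "classification"
}

task_type(iris$Petal.Width)  # numeric output
task_type(iris$Species)      # factor (categorical) output
```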
Determining the ultimate output is largely a business requirement and is usually well-defined (e.g. predicting the quarterly revenue). However, there are many intermediate outputs related to the final output (in fact, they are inputs to it). In my experience, determining this set of intermediate outputs usually involves a backtracking, exploratory process, as follows.
- Start at the final output that we aim to predict (e.g. quarterly revenue).
- Identify all input variables that are useful for predicting the output. Look into the sources of these input data. If there is no data source corresponding to an input variable, that input variable becomes a candidate intermediate output.
- Repeat this process for these intermediate outputs. Effectively, we build multiple layers of predictive analytics so that we can move from raw input data to intermediate outputs and eventually to the final output.
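The layering described above can be sketched with two chained models. This is only an illustration on the built-in iris data: it pretends that Petal.Length has no data source at prediction time, so it is treated as an intermediate output. The choice of variables and of linear models is an assumption made for the example.

```r
# Final output: Petal.Width. Suppose, purely for illustration, that
# Petal.Length is a useful input but is not available at prediction time,
# so it becomes an intermediate output predicted from Sepal.Length.
intermediate_model <- lm(Petal.Length ~ Sepal.Length, data = iris)
final_model <- lm(Petal.Width ~ Petal.Length, data = iris)

# Prediction chains the layers: raw input -> intermediate -> final output.
raw_input <- data.frame(Sepal.Length = c(5.0, 6.5))
intermediate <- data.frame(
  Petal.Length = predict(intermediate_model, newdata = raw_input)
)
final_prediction <- predict(final_model, newdata = intermediate)
final_prediction
```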
Feature engineering
At this step, we determine how to extract useful input information from the raw data that will be influential to the output. This is an exploratory exercise guided by domain experts. At the end, a set of input features (derived from the raw input data) is defined.
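As a small sketch of deriving a feature from raw data, the example below adds a made-up "Petal.Area" column to the built-in iris data frame and checks how it varies across species. The feature itself is a hypothetical choice for illustration, not a recommendation from a domain expert.

```r
# Derive a candidate feature from the raw measurements.
features <- transform(iris, Petal.Area = Petal.Length * Petal.Width)

# Compare the derived feature across the output categories.
aggregate(Petal.Area ~ Species, data = features, FUN = mean)
```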
Visualizing existing data is a very useful way to come up with ideas about which features should be included. A "data frame" in R is a common structure in which data samples are organized in tabular form. We'll be using some data frames that come bundled with R. Specifically, the data frame "iris" records sepal and petal measurements for different species of iris.
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(iris)
[1] 150
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50
Single-variable frequency plot
For numeric data, it is good to get some idea of the frequency distribution. A histogram and a smoothed density plot give a good picture.
> # Plot the histogram
> hist(iris$Sepal.Length, breaks=10, prob=TRUE)
> # Plot the density curve
> lines(density(iris$Sepal.Length))
For categorical data, a bar plot is a good choice.
> categories <- table(iris$Species)
> barplot(categories, col=c('red', 'green', 'blue'))
Two-variable plot
A box plot can be used to visualize the distribution of a numeric value across different categories.
> boxplot(Sepal.Length~Species, data=iris)
To get an idea of the correlation, we can use a scatter plot matrix covering every pair of input variables. We can also use a correlation matrix for the same purpose.
> # Scatter plot for all pairs
> pairs(iris[,c(1,2,3,4)])
> # Compute the correlation matrix
> cor(iris[,c(1,2,3,4)])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1170695    0.8716902   0.8179410
Sepal.Width    -0.1170695   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8716902  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179410  -0.3661259    0.9628654   1.0000000
We can drill down further into the relationship between two numeric values by fitting a regression line or a regression curve (the latter by performing local linear regression over neighboring points).
> # First plot the 2 variables
> plot(Petal.Width~Sepal.Length, data=iris)
> # Learn the regression model
> model <- lm(Petal.Width~Sepal.Length, data=iris)
> # Plot the regression line
> abline(model)
> # Now learn the local linear model
> model2 <- lowess(iris$Petal.Width~iris$Sepal.Length)
> lines(model2, col="red")
OLAP processing
If the data is in multi-dimensional form (it has multiple categorical attributes), OLAP-type manipulation can give a very good summary.
In this section, let's use a different data source, CO2, which provides the carbon dioxide uptake in grass plants.
> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
Let's look at the counts summarized along different dimensions.
> cube <- xtabs(~Plant+Type+Treatment, data=CO2)
> cube
, , Treatment = nonchilled

     Type
Plant Quebec Mississippi
  Qn1      7           0
  Qn2      7           0
  Qn3      7           0
  Qc1      0           0
  Qc3      0           0
  Qc2      0           0
  Mn3      0           7
  Mn2      0           7
  Mn1      0           7
  Mc2      0           0
  Mc3      0           0
  Mc1      0           0

, , Treatment = chilled

     Type
Plant Quebec Mississippi
  Qn1      0           0
  Qn2      0           0
  Qn3      0           0
  Qc1      7           0
  Qc3      7           0
  Qc2      7           0
  Mn3      0           0
  Mn2      0           0
  Mn1      0           0
  Mc2      0           7
  Mc3      0           7
  Mc1      0           7

> apply(cube, c("Plant"), sum)
Qn1 Qn2 Qn3 Qc1 Qc3 Qc2 Mn3 Mn2 Mn1 Mc2 Mc3 Mc1
  7   7   7   7   7   7   7   7   7   7   7   7
> apply(cube, c("Type"), sum)
     Quebec Mississippi
         42          42
> apply(cube, c("Plant", "Type"), mean)
     Type
Plant Quebec Mississippi
  Qn1    3.5         0.0
  Qn2    3.5         0.0
  Qn3    3.5         0.0
  Qc1    3.5         0.0
  Qc3    3.5         0.0
  Qc2    3.5         0.0
  Mn3    0.0         3.5
  Mn2    0.0         3.5
  Mn1    0.0         3.5
  Mc2    0.0         3.5
  Mc3    0.0         3.5
  Mc1    0.0         3.5
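The cube above only counts samples. As a hedged extension of the same idea, the sketch below aggregates a measure (the uptake column) instead of a count, using base R's tapply. The variable name `uptake_cube` is made up for this example.

```r
# Mean uptake for every combination of Type and Treatment (a 2-D "cube").
uptake_cube <- tapply(CO2$uptake,
                      list(Type = CO2$Type, Treatment = CO2$Treatment),
                      mean)
uptake_cube

# Roll up along one dimension: mean uptake per Type.
# Every cell holds the same number of samples, so averaging the cell
# means gives the same result as averaging the raw values per Type.
apply(uptake_cube, "Type", mean)
```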
Prepare training data
At this step, the purpose is to transform the raw data into a form that can be fed into the machine learning model. Some common steps include:
- Data sampling
- Data validation and handling of missing data
- Normalize numeric values into a uniform range
- Compute aggregated values (a special case is to compute frequency counts)
- Expand categorical fields into binary fields
- Discretize numeric values into categories
- Create derived fields from existing fields
- Reduce dimensionality
- Power and Log transformation
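A few of the steps listed above can be sketched in base R as follows. This is an illustration on the built-in iris and CO2 data frames, not a complete preparation pipeline; the helper `rescale01`, the bucket labels, and the choice of two principal components are assumptions made for the example.

```r
# Normalize a numeric value into the [0, 1] range (min-max scaling).
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
sepal_scaled <- rescale01(iris$Sepal.Length)

# Expand a categorical field into binary (0/1) fields, one per category.
species_binary <- model.matrix(~ Species - 1, data = iris)

# Discretize a numeric value into three equal-width categories.
petal_bucket <- cut(iris$Petal.Length, breaks = 3,
                    labels = c("short", "medium", "long"))

# Log transformation of a numeric field.
log_uptake <- log(CO2$uptake)

# Reduce dimensionality: keep the first two principal components.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
reduced <- pca$x[, 1:2]
```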
Train a model
At this step, we pick a machine learning model based on the characteristics of the input and output features. Then we feed the training data into the learning algorithm, which produces a set of parameters that explain the occurrence of the training data.
There are many machine learning models that we can choose from. Some common ones include:
- Linear regression
- Logistic regression
- Regression with regularization
- Neural Network
- Support Vector Machine
- Naïve Bayes
- Nearest Neighbors
- Decision Tree
- Ensemble Methods (Random Forest, Gradient Boosted Tree)
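To make "a set of parameters" concrete, here is a minimal sketch that trains one of the models listed above, a logistic regression, with base R's glm. The binary target (virginica versus versicolor) and the two sepal inputs are arbitrary choices for illustration.

```r
# Binary output: is the flower virginica (1) or versicolor (0)?
two_species <- droplevels(subset(iris, Species != "setosa"))
two_species$is_virginica <- as.integer(two_species$Species == "virginica")

# Logistic regression: the learned parameters are the coefficients.
classifier <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
                  data = two_species, family = binomial)
coef(classifier)

# Predicted probability of virginica for each training sample.
probability <- predict(classifier, type = "response")
head(probability)
```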
Validate the model
After we train the model, how do we validate that it does a good job of prediction? Typically, we withhold a subset of the training data for testing purposes. Through testing, we quantitatively measure the predictive performance of our model. I'll cover this topic in more detail in a future post.
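A minimal sketch of the withholding idea, assuming a 70/30 split of the built-in iris data and root mean squared error as the quality measure (both are illustrative choices):

```r
set.seed(1)
# Withhold 30% of the samples for testing.
test_rows <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train_set <- iris[-test_rows, ]
test_set <- iris[test_rows, ]

# Train on the training portion only.
fit <- lm(Petal.Width ~ Petal.Length, data = train_set)

# Measure prediction error on the withheld portion (root mean squared error).
errors <- predict(fit, newdata = test_set) - test_set$Petal.Width
rmse <- sqrt(mean(errors^2))
rmse
```

Because the test rows play no part in fitting, the error measured on them is a fairer estimate of how the model behaves on new data than the error on the training rows.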
Deploy the model
When we are satisfied with the model, we deploy it to the production environment and use it to predict real-life events. We should also closely monitor whether the real-life accuracy aligns with our testing results.
Usually, the quality of the model's predictions degrades over time because the stationarity assumption (that the data pattern in the future is similar to the data pattern at training time) no longer holds as time passes. On the other hand, we may acquire additional input signals that can improve the prediction accuracy. Therefore, a model should have an expiration time and needs to be retrained after that.
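One way to act on this is a simple retraining check. The function below is a hypothetical sketch: the name `needs_retraining`, the 90-day expiration, and the 0.05 accuracy tolerance are all assumptions picked for illustration, and real thresholds depend on the application.

```r
# Hypothetical helper: flag a model for retraining when it is older than
# max_age_days or when live accuracy falls well below the test accuracy.
needs_retraining <- function(trained_on, test_accuracy, live_accuracy,
                             today = Sys.Date(),
                             max_age_days = 90, tolerance = 0.05) {
  expired <- as.numeric(today - trained_on) > max_age_days
  degraded <- (test_accuracy - live_accuracy) > tolerance
  expired || degraded
}

# A one-month-old model whose live accuracy is close to its test accuracy.
needs_retraining(as.Date("2024-01-01"), test_accuracy = 0.92,
                 live_accuracy = 0.90, today = as.Date("2024-02-01"))
```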
Hopefully, this post gave a high-level overview of the predictive analytics process. I'll drill down into each item in more detail in future posts.