Pragmatic Programming Techniques: March 2011

ROC Curve is a common method to compare performance between different models. It can also be used to pick trade-off decisions between "false positives" and "false negatives". ROC curve is defined as a plot of "false positive rate" against "false negative rate". However, I don't find the ROC concept is intuitive and has been struggled for a while to grasp the concept.

Here is my attempt to explain ROC curve from a different angle. We use a binary classification example to illustrate the idea. (ie: predicting whether a patient has cancer or not)

First of all, all predictive model is not 100% correct. The desirable state is that a person who actually has cancer got a positive test result, and a person who actually has no cancer got a negative test result. Since the test is imperfect, it is possible that a person who actually has cancer was tested negative (ie: Fail to detect) or a person who actually has no cancer was tested positive (ie: False alarm).

In reality, there is always a tradeoff between the false negative rate and the false positive rate. People can tune the decision threshold to adjust them (e.g. In "random forest", we can set the threshold of predicting positive when more than 30% decision trees predicting positive). Usually, the threshold is set based on the consequence or cost of mis-classification. (e.g. in this example, fail to detect has a much higher cost than a false alarm)

This can also be used to compare model performance. A good model is one that has both low false positive rate and low false negative rate, which is indicated in the size of the gray area below (the smaller the better).

"Random guess" is the worst prediction model and is used as a baseline for comparison. The decision threshold of a random guess is a number between 0 to 1 in order to determine between positive and negative prediction.

ROC Curve is basically what I have described above with one transformation, which is transforming the y-axis from "fail to detect" to 1 - "fail to detect", which now become "success to detect". Honestly I don't understand why this representation is better though.

Now, the ROC curve will look as follows ...

I attended the San Francisco Predictive Analytic conference this week and got a chance to chat with some best data mining practitioners of the country. Here summarizes my key takeaways.

How is the division of labor between human and machine?

Another way to ask this question is how “machine learning” and “domain expertise” work together and complement each other, since each has different strength and weakness.

Machine learning is very good at processing large amount of data in an unbiased way while human is unable to process the same data volume and the judgment is usually biased. However, machine cannot look beyond the data being given. For example, if the prediction power is low, machine learning methods cannot distinguish whether it is because the data is not clean, or the wrong model is being chosen, or because some important input feature is not captured. Domain expertise must be brought in to figure out the problem.

So the consensus is data mining / machine learning is simply a toolbox that can be used to augment human’s domain expertise, but can never replace it. For example, the domain expert can throw in a large number of input features to the machine learning model, which can determine a subset that are most influential. But if the domain expert doesn’t recognize an important input feature (and not capturing it), there is no way the machine learning model can figure out what is missing, not even recognizing that something is missing.

On the other hand, human is also very good in visualizing data patterns. “Data visualization” technique can be a powerful means to get a good sense and quickly identify the area where drilldown analysis should be conducted. Of course, visualization is limited to low dimension data as human cannot comprehend more than a handful of dimensions. Human is also easily biased so they may find patterns where are actually coincidence. By having human and machine working together, they complement each other very well.

What are some of the key design decisions in data mining?

Balance between false +ve and false –ve based on cost / consequence of making a wrong decision.
We don’t have to use a method from beginning to end. We can use different methods at different stage of the analysis. For example, in a multi-class (A, B, C) problem, we can use decision tree to distinguish A from notA (ie: B, C) and then use support vector machine to separate B and C. As another example, we can use decision tree to determine the best input attributes to be used by the neural network.

What is the most powerful / most commonly used supervised machine learning modeling technique?

The general answer is that each modeling technique has its strength and weakness and none of them wins in all situations. So understand their corresponding strength and weakness is important to pick the right one.

Generalized Linear Regression
Linear and Logistic regression are based on fitting a linear plane into a set of data points such that the root mean square of error (distance between predicted output and actual output) is minimized. It is by far the most commonly used technique, one for numeric output and the other for categorical output. They have a long history in statistics. It is supported in pretty much all commercial and open source data mining tools.

Linear and Logistic regression model requires certain amount of data preparation such as missing data handling. It also assuming that the output (or logit output) is a linear combination of input features, error is expected to be normally distribution. However, real-life scenarios are not always linear. To deal with non-linearity, input terms will be mixed (usually by cross-multiplication) in different ways to generate additional input terms called “interactions”. This process is like trial and error and can generate huge number of combination. Nevertheless, they do a reasonably good job in a wide spectrum of business problems and are well-understood by statisticians and data miners. And they are commonly used as a baseline comparison with other models.

Neural Network
Neural Network is based on multiple layer of perceptrons (each is like a logistic regression with binary input and output). There is typically a hidden layer (so the number of layers is 3) with N perceptrons (where N is trial and error). Because of the extra layer and the logit() function in the neural network, it can handle non-linearity very well. If it has good predictor in its input data, Neural network can achieve very high performance in prediction.

Similar to linear regression, Neural network requires careful data preparation to remove noisy data as well as redundant input attributes (those that are highly correlated). Neural network also take much longer time to train as compared to other methods. Also the model that Neural network has learned is not explainable or make good sense out of it.

Support Vector Machine
Support Vector Machine is a binary classifier (input feature is numeric). It is based on finding a linear plane that can separate the binary output class such that the margin is maximized. The optimal solution is expressed in terms of the dot product of vectors. If the points are not linearly separable, we can use a function to transform the points to a higher dimension space such that it is linearly separable. The Math shows that the dot product (after transforming to a hi-dim space) can be generalized into a Kernel function (Radial basis function being the most common one). Although the underlying math is not easy for everyone to understand, SVM has demonstrated outstanding performance in a wide spectrum of problems and recently become one of the most effective methods.

Despite of its powerful capability, SVM is not broadly implemented in commercial products as there are some patent issue as AT&T holds the patent of SVM. On the other hand, the non-linear kernel function (such as the most common Radial Basis function) is difficult to implement in parallel programming model such as Map/Reduce. SVM is undergoing active research and a derivative Support Vector Regression can be used to predict numeric output.

Tree Ensembles
This is combining “ensemble methods” with “decision tree”.

Decision tree is the first generation machine learning algorithm based on a greedy approach. For a classification problem, decision tree try to split a branch where the combined “purity” (either by the Gini index or Entropy) after split is maximized. For a regression problem, decision tree try to split where the combined “between-class-variance” divided by “within-class-variance” can be maximized. This is equivalent to maximizing the F-value after split. The splitting continues until reaching the terminating condition such as there are too few member remains in the branch, or the gain of further split is insignificant.

Decision tree are very good at dealing with missing value (simply not using that value in learning and go own both path in scoring). Using a decision tree to capture the decision model is also very comprehensible and explainable. However, decision tree is relatively sensitive to noise and can easily overfit the data. Although the learning mechanism is easy to understand, Decision tree doesn’t perform very well in general and is rarely used in real system. However, when decision trees are used together with Ensemble methods, it becomes extraordinary powerful as all its weakness now disappears.

The idea of ensemble is simple. Instead of learning one model, we learning multiple models and combine the estimation of each individual learner (e.g. we let them vote on categorical output and compute the average for numeric output).

There are two main models for creating different learners. One is called “bagging”, which is basically drawing samples (with replacement) from the training set and then have the same Tree algorithm to learn on different sample data set. Another model is called “boosting”, which has a sequence of iterations where samples are drawn from the training set based on the probability distribution where the wrongly predicted items in last round will have a higher chance to be selected. In other words, the algorithm places more attention to learn from wrongly-classified examples.

It turns out Ensemble tree is the most popular method at this moment as it achieve very good prediction across the board, easy to understand and can be implemented in Map/reduce. Google recently published a good paper on their PLANET project which implements ensemble tree on map/reduce.

Pragmatic Programming Techniques

Saturday, March 19, 2011

Compare Machine Learning models with ROC Curve

Thursday, March 17, 2011

Predictive Analytics Conference 2011