Decision Tree

The decision tree is one of the oldest machine learning models and is often used to illustrate the basic idea of machine learning. Based on a tree of decision nodes, the learning approach is to recursively divide the training data into buckets of homogeneous members through the most discriminative dividing criteria. The measurement of "homogeneity" is based on the output label: when the output is a numeric value, the measurement is the variance of the bucket; when it is a category, the measurement is the entropy or Gini index of the bucket.
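These three homogeneity measures are simple to compute. As an illustrative sketch (in Python rather than this article's R, and not part of the original text):

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy (in bits) of a bucket of categorical labels; 0 means pure."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a bucket of categorical labels; 0 means pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(values):
    """Variance of a bucket of numeric outputs; 0 means all members agree."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

A pure bucket scores 0 under all three measures, and a maximally mixed two-class bucket has entropy 1.0 and Gini index 0.5.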
During training, various dividing criteria based on the input are tried (in a greedy manner): when an input is a category (Mon, Tue, Wed, ...), it is first turned into binary indicators (isMon, isTue, isWed, ...) and the true/false value is used as the decision boundary to evaluate homogeneity; when an input is a numeric or ordinal value, a lessThan/greaterThan test at each input value observed in the training data is used as the decision boundary.
The training process stops when there is no significant gain in homogeneity from splitting the tree further. The members of the bucket at each leaf node vote for the prediction: the majority wins when the output is a category, and the members' average is taken when the output is numeric.
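The greedy search over numeric decision boundaries described above can be sketched as follows (an illustrative Python translation, not the article's code; the impurity function is passed in, e.g. a Gini implementation):

```python
from collections import Counter

def gini(labels):
    """Gini index of a bucket of categorical labels; 0 means pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(xs, ys, impurity):
    """Try lessThan at each observed value of a numeric input and keep the
    threshold that most reduces the size-weighted impurity of the two buckets."""
    best = (None, impurity(ys))  # (threshold, weighted impurity after split)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue  # degenerate split, nothing to evaluate
        w = (len(left) * impurity(left) + len(right) * impurity(right)) / len(ys)
        if w < best[1]:
            best = (t, w)
    return best
```

On a cleanly separable input such as `xs = [1, 2, 3, 10, 11, 12]` with labels `['a', 'a', 'a', 'b', 'b', 'b']`, the search finds the threshold 10 with zero remaining impurity.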
Here is an example in R
> install.packages('rpart')
> library(rpart)
> #Train the decision tree
> treemodel <- rpart(Species~., data=iristrain)
> plot(treemodel)
> text(treemodel, use.n=T)
> #Predict using the decision tree
> prediction <- predict(treemodel, newdata=iristest, type='class')
> #Use contingency table to see how accurate it is
> table(prediction, iristest$Species)
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         3
  virginica       0          0         7
Here is the decision tree that we have learned.
The strength of decision trees is that they can take input and output variables of different data types: categorical, binary, and numeric. They also handle missing attributes and outliers well. A decision tree is good at explaining the reasoning behind its predictions and therefore gives good insight into the underlying data.
The limitation of decision trees is that each decision boundary at each split point is a hard binary decision, and each split criterion considers only one input attribute at a time rather than a combination of input variables. Another weakness is that a learned tree cannot be updated incrementally: when new training data arrives, you have to throw away the old tree and retrain from scratch on all the data. In practice, standalone decision trees are rarely used because their predictive accuracy is relatively low; tree ensembles (described below) are the common way to use decision trees.
Tree Ensembles

Instead of picking a single model, an ensemble method combines multiple models in a certain way to fit the training data. The two primary approaches are “bagging” and “boosting”. In “bagging”, we take a subset of the training data (n random samples drawn out of the N training records, with replacement) to train each model. After multiple models are trained, we use a voting scheme to predict future data.
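The bagging recipe is independent of the base learner. As a minimal sketch (in Python, not the article's R; `train_one` is a hypothetical stand-in for any learner that returns a predict function):

```python
import random
from collections import Counter

def bagging_fit(data, labels, train_one, n_models=10, seed=0):
    """Train each model on a bootstrap sample: N draws with replacement
    from the N training records."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(data)) for _ in range(len(data))]
        sample_x = [data[i] for i in idx]
        sample_y = [labels[i] for i in idx]
        models.append(train_one(sample_x, sample_y))
    return models

def bagging_predict(models, x):
    """Predict by majority vote across the ensemble."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

Any base learner can be plugged in; for example, a trivial "majority label" learner `lambda xs, ys: (lambda x, ys=ys: Counter(ys).most_common(1)[0][0])` already demonstrates the bootstrap-then-vote mechanics.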
“Boosting” is another ensemble approach. Instead of drawing independent bootstrap samples, it trains models sequentially and puts more emphasis on the training records that were wrongly predicted in previous iterations. Initially each training record is equally weighted. At each iteration, the records that are wrongly classified have their weight increased. The final prediction is a vote of the trees learned at each iteration, weighted by the accuracy of each tree.
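The text does not name a specific weighting scheme; the classic one is AdaBoost's, sketched below in Python (an assumption for illustration, not the article's code):

```python
import math

def reweight(weights, correct, error_rate):
    """AdaBoost-style update: increase the weight of wrongly classified
    records, decrease the weight of correct ones, then renormalize.
    Returns the new weights and alpha, the tree's voting weight
    (higher alpha for more accurate trees)."""
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha
```

Starting from equal weights, one round with a 25% error rate leaves the single misclassified record carrying half of the total weight, which is exactly what forces the next tree to focus on it.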
Random Forest

Random Forest is one of the most popular bagging models. In addition to selecting n training records out of N, at each decision node of the tree it randomly selects m input features from the total M input features (m ~ M^0.5) and learns the split from those. Finally, each tree in the forest votes for the result.
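The per-node feature subsampling is the only piece Random Forest adds on top of bagging; a minimal sketch (in Python, for illustration only):

```python
import math
import random

def candidate_features(n_features, rng=None):
    """At each decision node, consider only m ~ sqrt(M) randomly chosen
    features out of the M available, instead of all of them."""
    rng = rng or random.Random()
    m = max(1, round(math.sqrt(n_features)))
    return rng.sample(range(n_features), m)
```

For example, with M = 16 input features, each node evaluates splits on only 4 distinct randomly chosen features, which decorrelates the trees in the forest.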
Here is the R code to use Random Forest.
> install.packages('randomForest')
> library(randomForest)
> #Train 500 trees, randomly selected attributes at each split
> model <- randomForest(Species~., data=iristrain, ntree=500)
> #Predict using the forest
> prediction <- predict(model, newdata=iristest, type='class')
> table(prediction, iristest$Species)
> importance(model)
             MeanDecreaseGini
Sepal.Length         7.807602
Sepal.Width          1.677239
Petal.Length        31.145822
Petal.Width         38.617223
Gradient Boosted Trees

The gradient boosting method is one of the most popular boosting methods. It is based on incrementally adding a function that fits the negative gradient of a loss function evaluated on the residuals:
- Set i = 0 at the beginning; repeat until convergence:
- Learn a function F(X) to predict Y; that is, find the F that minimizes expected(L(F(X) – Y)), where L is the loss function of the residual
- Learn another function g(X) to predict the negative gradient of the loss with respect to F
- Update F = F + a·g(X), where a is the learning rate
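The loop above can be sketched end to end for the squared loss, where the negative gradient is simply the residual y − F(x) (an illustrative Python sketch using one-split regression stumps as the base learner, not the gbm package's implementation):

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the current residuals:
    predict the mean residual on each side of the best threshold."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x < t else rm

def gradient_boost(xs, ys, n_iter=50, rate=0.5):
    """Start from the mean, then repeatedly fit a stump g to the residuals
    (the negative gradient of squared loss) and update F = F + a*g."""
    f0 = sum(ys) / len(ys)
    stumps = []
    def F(x):
        return f0 + sum(rate * g(x) for g in stumps)
    for _ in range(n_iter):
        residuals = [y - F(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return F
```

On a toy two-level target the boosted model converges to the group means, and a smaller learning rate a simply takes more iterations to get there.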
Here is the sample code in R.
> library(gbm)
> iris2 <- iris
> newcol = data.frame(isVersicolor=(iris2$Species=='versicolor'))
> iris2 <- cbind(iris2, newcol)
> iris2[45:55,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species isVersicolor
45          5.1         3.8          1.9         0.4     setosa        FALSE
46          4.8         3.0          1.4         0.3     setosa        FALSE
47          5.1         3.8          1.6         0.2     setosa        FALSE
48          4.6         3.2          1.4         0.2     setosa        FALSE
49          5.3         3.7          1.5         0.2     setosa        FALSE
50          5.0         3.3          1.4         0.2     setosa        FALSE
51          7.0         3.2          4.7         1.4 versicolor         TRUE
52          6.4         3.2          4.5         1.5 versicolor         TRUE
53          6.9         3.1          4.9         1.5 versicolor         TRUE
54          5.5         2.3          4.0         1.3 versicolor         TRUE
55          6.5         2.8          4.6         1.5 versicolor         TRUE
> formula <- isVersicolor ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
> model <- gbm(formula, data=iris2, n.trees=1000, interaction.depth=2, distribution="bernoulli")
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
   1          1.2714         -1.#IND     0.0010    0.0008
   2          1.2705         -1.#IND     0.0010    0.0004
   3          1.2688         -1.#IND     0.0010    0.0007
   4          1.2671         -1.#IND     0.0010    0.0008
   5          1.2655         -1.#IND     0.0010    0.0008
   6          1.2639         -1.#IND     0.0010    0.0007
   7          1.2621         -1.#IND     0.0010    0.0008
   8          1.2614         -1.#IND     0.0010    0.0003
   9          1.2597         -1.#IND     0.0010    0.0008
  10          1.2580         -1.#IND     0.0010    0.0008
 100          1.1295         -1.#IND     0.0010    0.0008
 200          1.0090         -1.#IND     0.0010    0.0005
 300          0.9089         -1.#IND     0.0010    0.0005
 400          0.8241         -1.#IND     0.0010    0.0004
 500          0.7513         -1.#IND     0.0010    0.0004
 600          0.6853         -1.#IND     0.0010    0.0003
 700          0.6266         -1.#IND     0.0010    0.0003
 800          0.5755         -1.#IND     0.0010    0.0002
 900          0.5302         -1.#IND     0.0010    0.0002
1000          0.4901         -1.#IND     0.0010    0.0002
> prediction <- predict.gbm(model, iris2[45:55,], type="response", n.trees=1000)
> round(prediction, 3)
 [1] 0.127 0.131 0.127 0.127 0.127 0.127 0.687 0.688 0.572 0.734 0.722
> summary(model)
           var       rel.inf
1 Petal.Length 61.4203761582
2  Petal.Width 34.7557511871
3  Sepal.Width  3.5407662531
4 Sepal.Length  0.2831064016
The gbm R package also reports the relative importance of the input features, shown in the rel.inf column.