## Decision Tree

The decision tree is one of the oldest machine learning models and is often used to illustrate the basic idea of machine learning. Built from a tree of decision nodes, the learning approach recursively divides the training data into buckets of homogeneous members using the most discriminative splitting criterion. The measure of "homogeneity" is based on the output label: when the output is numeric, the measure is the variance of the bucket; when it is categorical, the measure is the entropy or Gini index of the bucket. During training, various splitting criteria based on the input are tried (in a greedy manner). When an input is categorical (Mon, Tue, Wed ...), it is first turned into binary indicators (isMon, isTue, isWed ...) and the true/false values are used as decision boundaries to evaluate homogeneity. When an input is numeric or ordinal, lessThan/greaterThan comparisons at each training input value are used as candidate decision boundaries.
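The homogeneity measures are easy to state concretely. The article's examples are in R, but here is a minimal Python sketch of the three measures mentioned above (the function names are just for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a bucket of categorical labels (0 = perfectly pure)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a bucket of categorical labels (0 = perfectly pure)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(values):
    """Variance of a bucket of numeric outputs (lower = more homogeneous)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# A pure bucket scores 0; a 50/50 split scores the maximum.
print(entropy(["a", "a", "b", "b"]))  # 1.0
print(gini(["a", "a", "a", "a"]))     # 0.0
```

A split is chosen to maximize the drop in these measures from the parent bucket to the child buckets.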

The training process stops when there is no significant gain in homogeneity from splitting the tree further. The members of the bucket represented by a leaf node vote for the prediction: the majority wins when the output is categorical, and the members' average is taken when the output is numeric.

Here is an example in R:

```
> install.packages('rpart')
> library(rpart)
> #Train the decision tree
> treemodel <- rpart(Species~., data=iristrain)
> plot(treemodel)
> text(treemodel, use.n=T)
> #Predict using the decision tree
> prediction <- predict(treemodel, newdata=iristest, type='class')
> #Use contingency table to see how accurate it is
> table(prediction, iristest$Species)
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         3
  virginica       0          0         7
```

The plot() and text() calls above display the decision tree that we have learned.

The good part of decision trees is that they can take input and output variables of different types: categorical, binary, and numeric. They handle missing attributes and outliers well. A decision tree is also good at explaining the reasoning behind its predictions, and therefore gives good insight into the underlying data.

The limitation of the decision tree is that each decision boundary at each split point is a concrete binary decision. Also, each decision criterion considers only one input attribute at a time, not a combination of multiple input variables. Another weakness is that once learned, a tree cannot be updated incrementally: when new training data arrives, you have to throw away the old tree and retrain from scratch. In practice, standalone decision trees are rarely used, as their predictive accuracy is relatively low. Tree ensembles (described below) are the common way to use decision trees.

## Tree Ensembles

Instead of picking a single model, the ensemble method combines multiple models in a certain way to fit the training data. There are two primary ways: "bagging" and "boosting". In "bagging", we take a subset of the training data (pick n random samples out of the N training records, with replacement) to train each model. After multiple models are trained, we use a voting scheme to predict future data.

"Boosting" is the other approach. Rather than training each model on an independent random sample, it trains models sequentially, putting more emphasis on the training records that were wrongly predicted in previous iterations. Initially, each training record is equally weighted. At each iteration, the data that is wrongly classified will have its weight increased. The final prediction is a vote by the trees learned at each iteration, weighted by the accuracy of each tree.
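The bagging procedure can be sketched in a few lines. This is an illustrative Python sketch, not any library's API; `stump_learner` is a toy base learner invented here just to make the example self-contained:

```python
import random

def majority(labels):
    """Deterministic majority vote (ties broken by sorted order)."""
    return max(sorted(set(labels)), key=labels.count)

def stump_learner(sample):
    """Toy base learner: a one-feature decision stump split at the sample mean."""
    threshold = sum(x for x, _ in sample) / len(sample)
    hi = [y for x, y in sample if x > threshold] or [y for _, y in sample]
    lo = [y for x, y in sample if x <= threshold] or [y for _, y in sample]
    hi_label, lo_label = majority(hi), majority(lo)
    return lambda x: hi_label if x > threshold else lo_label

def bagging_predict(train, x, learner, n_models=25, seed=0):
    """Train each model on a bootstrap sample (with replacement), then vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in range(len(train))]
        votes.append(learner(sample)(x))
    return majority(votes)

train = [(0.1, "no"), (0.2, "no"), (0.3, "no"),
         (0.8, "yes"), (0.9, "yes"), (1.0, "yes")]
print(bagging_predict(train, 0.95, stump_learner))  # yes
```

Individual bootstrap stumps can be noisy, but the vote across many of them smooths the prediction out.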

### Random Forest

Random Forest is one of the most popular bagging models. In addition to selecting n training records out of N for each tree, at each decision node it randomly selects m input features from the total M input features (m ~ M^0.5) and picks the best split from among them. Finally, each tree in the forest votes for the result. Here is the R code to use Random Forest.

```
> install.packages('randomForest')
> library(randomForest)
> #Train 500 trees, randomly selected attributes at each split
> model <- randomForest(Species~., data=iristrain, ntree=500)
> #Predict using the forest
> prediction <- predict(model, newdata=iristest, type='class')
> table(prediction, iristest$Species)
> importance(model)
             MeanDecreaseGini
Sepal.Length         7.807602
Sepal.Width          1.677239
Petal.Length        31.145822
Petal.Width         38.617223
```
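The per-node feature subsampling is what makes the forest "random" beyond plain bagging. As a rough Python sketch of the m ~ M^0.5 rule (the function name is hypothetical, not an API):

```python
import math
import random

def candidate_features(n_features, rng):
    """At each split node, consider only m ~ sqrt(M) randomly chosen features
    out of the M available, the usual random forest heuristic."""
    m = max(1, round(math.sqrt(n_features)))
    return rng.sample(range(n_features), m)

# For the 4 iris features, each node would evaluate splits on just 2 of them.
print(candidate_features(4, random.Random(42)))
```

Because different nodes see different feature subsets, the trees decorrelate, which is what makes their averaged vote better than any single tree.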

### Gradient Boosted Trees

The Gradient Boosting Method is one of the most popular boosting methods. It is based on incrementally adding a function that fits the gradient of a loss function of the residuals:

- Set i = 0 at the beginning, and repeat until convergence
- Learn a function F(X) to predict Y; basically, find F that minimizes expected(L(F(X) – Y)), where L is the loss function of the residual
- Learn another function g(X) to predict the gradient of the above loss
- Update F = F + a.g(X), where a is the learning rate
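The steps above can be sketched concretely for squared loss, where the negative gradient is simply the residual. This is an illustrative Python sketch (a one-split regression stump plays the role of g(X); all names are invented here):

```python
def fit_stump(xs, rs):
    """One-split regression stump fit to the residuals (best threshold by squared error)."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gbm_fit(xs, ys, n_iter=200, rate=0.1):
    """Gradient boosting for squared loss: each round fits a stump to the
    residuals (the negative gradient) and adds it with learning rate a."""
    f0 = sum(ys) / len(ys)          # start from the constant model F0 = mean(y)
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_iter):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gbm_fit(xs, ys)
print(round(model(2), 2), round(model(5), 2))  # 1.0 5.0
```

Each iteration shrinks the residual by a factor of (1 − a), which is why a small learning rate needs many trees, as in the `n.trees=1000` run below.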

Here is the sample code in R.

```
> library(gbm)
> iris2 <- iris
> newcol = data.frame(isVersicolor=(iris2$Species=='versicolor'))
> iris2 <- cbind(iris2, newcol)
> iris2[45:55,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species isVersicolor
45          5.1         3.8          1.9         0.4     setosa        FALSE
46          4.8         3.0          1.4         0.3     setosa        FALSE
47          5.1         3.8          1.6         0.2     setosa        FALSE
48          4.6         3.2          1.4         0.2     setosa        FALSE
49          5.3         3.7          1.5         0.2     setosa        FALSE
50          5.0         3.3          1.4         0.2     setosa        FALSE
51          7.0         3.2          4.7         1.4 versicolor         TRUE
52          6.4         3.2          4.5         1.5 versicolor         TRUE
53          6.9         3.1          4.9         1.5 versicolor         TRUE
54          5.5         2.3          4.0         1.3 versicolor         TRUE
55          6.5         2.8          4.6         1.5 versicolor         TRUE
> formula <- isVersicolor ~ Sepal.Length +
+                           Sepal.Width +
+                           Petal.Length +
+                           Petal.Width
> model <- gbm(formula, data=iris2,
+              n.trees=1000,
+              interaction.depth=2,
+              distribution="bernoulli")
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.2714 -1.#IND 0.0010 0.0008
2 1.2705 -1.#IND 0.0010 0.0004
3 1.2688 -1.#IND 0.0010 0.0007
4 1.2671 -1.#IND 0.0010 0.0008
5 1.2655 -1.#IND 0.0010 0.0008
6 1.2639 -1.#IND 0.0010 0.0007
7 1.2621 -1.#IND 0.0010 0.0008
8 1.2614 -1.#IND 0.0010 0.0003
9 1.2597 -1.#IND 0.0010 0.0008
10 1.2580 -1.#IND 0.0010 0.0008
100 1.1295 -1.#IND 0.0010 0.0008
200 1.0090 -1.#IND 0.0010 0.0005
300 0.9089 -1.#IND 0.0010 0.0005
400 0.8241 -1.#IND 0.0010 0.0004
500 0.7513 -1.#IND 0.0010 0.0004
600 0.6853 -1.#IND 0.0010 0.0003
700 0.6266 -1.#IND 0.0010 0.0003
800 0.5755 -1.#IND 0.0010 0.0002
900 0.5302 -1.#IND 0.0010 0.0002
1000 0.4901 -1.#IND 0.0010 0.0002
> prediction <- predict.gbm(model, iris2[45:55,],
+                           type="response",
+                           n.trees=1000)
> round(prediction, 3)
[1] 0.127 0.131 0.127 0.127 0.127 0.127 0.687 0.688
0.572 0.734 0.722
> summary(model)
           var       rel.inf
1 Petal.Length 61.4203761582
2  Petal.Width 34.7557511871
3  Sepal.Width  3.5407662531
4 Sepal.Length  0.2831064016
>
```

The gbm R package also gives the relative importance of the input features, as shown by summary(model).

I have gone through most of the common machine learning techniques. In the next post, I will cover how to evaluate the predictive performance of these models.
