My R Codes Archive: Decision Tree

There is a number of decision tree algorithms available. In this blog, I am describing the rpart algorithm which stands for recursive partitioning and regression tree. This algorithm allows for both regression and classification, and handles the data relatively well when there are many categorical variables.

Decision trees may lack accuracies compared to other more sophisticated techniques, but its main advantages include less restrictions during analysis and easier interpretation of its outputs even by non-technical experts such as managements or operational staff.

This algorithm requires 'rpart' package in R, and rpart() function is used to build a tree as seen in the below examples.

Classification Tree

When you have a categorical variable as y value (or target), the tree is a classification tree and you can write the function as below.

RP1 <- rpart(Species ~ ., iris)

This would produce a tree that looks like below.

## n= 150 ## ## node), split, n, loss, yval, (yprob) ## * denotes terminal node ## ## 1) root 150 100 setosa (0.33333 0.33333 0.33333) ## 2) Petal.Length< 2.45 50 0 setosa (1.00000 0.00000 0.00000) * ## 3) Petal.Length>=2.45 100 50 versicolor (0.00000 0.50000 0.50000) ## 6) Petal.Width< 1.75 54 5 versicolor (0.00000 0.90741 0.09259) * ## 7) Petal.Width>=1.75 46 1 virginica (0.00000 0.02174 0.97826) *

In this example, where the target variable is a categorical variable, the decision tree will look at probabilities of the categorical values of the target variable at each node and leaf, and display the one with the highest probability.

In this representation of the tree, leaf is the terminal node where no further splits are made, and is denoted with an asterisk (*) sign. Node is a point in the branch where a split is made.

At each leaf and node, the following output values are displayed:

condition of the split (e.g. Petal.Length< 2.45)
number of data points belonging to the node/leaf
number of loss (i.e. number of misclassification) at the node/leaf
target value of the highest probability at the node/leaf
in brackets, probability scores of each target value at the node/leaf

To view the tree in a graphical representation, use the following functions.

plot(RP1) text(RP1, cex = 0.8)

In the above graphical representation, the data points that satisfy the splitting condition at the node moves to the branch on the left.

Regression Tree

When you set a numerical variable as y value (or target), the tree is a regression tree and you can write the function as below.

RP2 <- rpart(Sepal.Length ~ ., iris)

This would produce a tree that looks like below.

## n= 150
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 150 102.2000 5.843
## 2) Petal.Length< 4.25 73 13.1400 5.179
## 4) Petal.Length< 3.4 53 6.1080 5.006
## 8) Sepal.Width< 3.25 20 1.0850 4.735 *
## 9) Sepal.Width>=3.25 33 2.6700 5.170 *
## 5) Petal.Length>=3.4 20 1.1880 5.640 *
## 3) Petal.Length>=4.25 77 26.3500 6.473
## 6) Petal.Length< 6.05 68 13.4900 6.326
## 12) Petal.Length< 5.15 43 8.2580 6.165
## 24) Sepal.Width< 3.05 33 5.2220 6.055 *
## 25) Sepal.Width>=3.05 10 1.3010 6.530 *
## 13) Petal.Length>=5.15 25 2.1900 6.604 *

## 7) Petal.Length>=6.05 9 0.4156 7.578 *

In this example, where the y value is a numerical variable, the leaf of a branch produces a value similar to the value calculated by a regression technique using the variables used in the branch.

The below outputs are displayed at each node and leaf in this representation of the tree.

condition of split
number of data points belonging to the node/leaf
deviance of the calculated estimation at the node/leaf
estimated value of the target at the node/leaf

The below is a graphical representation of the tree. The output values are rounded to the whole number.

plot(RP2) text(RP2, cex = 0.8)

plot of chunk unnamed-chunk-6

If a selection of the variables in the dataset are used as predictor variables, write the function as below.

RP3 <- rpart(Species ~ Sepal.Length + Sepal.Width, iris)

You can notice the distinct difference from the above tree where other variables are used as predictors.

## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 100 setosa (0.33333 0.33333 0.33333)
## 2) Sepal.Length< 5.45 52 7 setosa (0.86538 0.11538 0.01923)
## 4) Sepal.Width>=2.8 45 1 setosa (0.97778 0.02222 0.00000) *
## 5) Sepal.Width< 2.8 7 2 versicolor (0.14286 0.71429 0.14286) *
## 3) Sepal.Length>=5.45 98 49 virginica (0.05102 0.44898 0.50000)
## 6) Sepal.Length< 6.15 43 15 versicolor (0.11628 0.65116 0.23256)
## 12) Sepal.Width>=3.1 7 2 setosa (0.71429 0.28571 0.00000) *
## 13) Sepal.Width< 3.1 36 10 versicolor (0.00000 0.72222 0.27778) *

## 7) Sepal.Length>=6.15 55 16 virginica (0.00000 0.29091 0.70909) *

plot(RP3) text(RP3, cex = 0.8)

plot of chunk unnamed-chunk-8

Pruning

Size of a tree often changes the accuracy of the prediction. In general, bigger tree means higher accuracy, but if the tree is too big, it overfits the data and results in decreased accuracy and robustness. By overfitting, it means the tree may be good at analysing the training data to a high degree of accuracy, but it may fail to correctly predict test data because it is grown with too much features of the training set it is not stable to any minor variations in those features. Robustness means the consistency in the performance of the tree, in this context. Hence, the objective would be to grow a tree to an optimal size that is accurate and robust enough.

Pruning is a technique used in determining the size of the tree. Often, you would grow a very large tree and starts to trim in down while measuring the change in accuracy and robustness. rpart() offers a parameter 'minsplit' which is used in this pruning process. It refers to minimum number of data points required before a split is made at a node. The below example compares the effect of changing 'minsplit' parameter.

RP4_1 <- rpart(Kyphosis ~ ., kyphosis, control = rpart.control(minsplit = 3)) RP4_2 <- rpart(Kyphosis ~ ., kyphosis, control = rpart.control(minsplit = 5)) RP4_3 <- rpart(Kyphosis ~ ., kyphosis, control = rpart.control(minsplit = 10))

par(mfcol = c(1, 3)) plot(RP4_1, main = "minsplit=3") text(RP4_1, cex = 0.7) plot(RP4_2, main = "minsplit=5") text(RP4_2, cex = 0.7) plot(RP4_3, main = "minsplit=10") text(RP4_3, cex = 0.7)

plot of chunk unnamed-chunk-10

Another parameter that can be used for pruning is 'complexity parameter'. This is originally created for faster growth of a tree by preventing unnecessary attempts of splits, rather than to determine the size of a tree. But, it nevertheless has impact on the size of the tree.

RP5_1 <- rpart(Kyphosis ~ ., kyphosis, control = rpart.control(cp = 0.1)) RP5_2 <- rpart(Kyphosis ~ ., kyphosis, control = rpart.control(cp = 0.001)) RP5_3 <- rpart(Kyphosis ~ ., kyphosis, control = rpart.control(cp = 1e-04))

par(mfcol = c(1, 3)) plot(RP5_1, main = "cp=0.1") text(RP5_1, cex = 0.7) plot(RP5_2, main = "cp=0.001") text(RP5_2, cex = 0.7) plot(RP5_3, main = "cp=0.0001") text(RP5_3, cex = 0.7)

plot of chunk unnamed-chunk-12

Extracting rules from the tree

In some occasions, you may wish to extract rules to derive insights or to assist in presenting the findings to general audience/business. Also, you may use the extracted rules to configure profile engines as a means of operationalisation.

One options is to use the package 'rattle', and use the function asRules().

require(rattle)

asRules(RP5_2)

Rule number: 3 [Kyphosis=present cover=19 (23%) prob=0.58]

Start< 8.5

Rule number: 23 [Kyphosis=present cover=7 (9%) prob=0.57]
Start>=8.5
Start< 14.5
Age>=55
Age< 111

Rule number: 22 [Kyphosis=absent cover=14 (17%) prob=0.14]
Start>=8.5
Start< 14.5
Age>=55
Age>=111

Rule number: 10 [Kyphosis=absent cover=12 (15%) prob=0.00]
Start>=8.5
Start< 14.5
Age< 55

Rule number: 4 [Kyphosis=absent cover=29 (36%) prob=0.00]
Start>=8.5
Start>=14.5

Alternatively, 'rpart' package offers path.rpart() function to extract the branch of the tree to the specified node. Note that the node does not have to be final node.

For RP5_2 from the above examples that looks like the following,

RP5_2

n= 81

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 81 17 absent (0.79012346 0.20987654)
2) Start>=8.5 62 6 absent (0.90322581 0.09677419)
4) Start>=14.5 29 0 absent (1.00000000 0.00000000) *
5) Start< 14.5 33 6 absent (0.81818182 0.18181818)
10) Age< 55 12 0 absent (1.00000000 0.00000000) *
11) Age>=55 21 6 absent (0.71428571 0.28571429)
22) Age>=111 14 2 absent (0.85714286 0.14285714) *
23) Age< 111 7 3 present (0.42857143 0.57142857) *
3) Start< 8.5 19 8 present (0.42105263 0.57894737) *

we can extract node 5 as below:

path.rpart(RP5_2,node=5)

node number: 5
root
Start>=8.5
Start< 14.5

or we can extract node 11 as below:

path.rpart(RP5_2,node=11)

node number: 11
root
Start>=8.5
Start< 14.5
Age>=55

To identify leaf nodes

FRAME<-RP5_2$frame

FRAME[which(FRAME$var == "<leaf>"), ]

var n wt dev yval complexity ncompete nsurrogate yval2.V1 yval2.V2

4 <leaf> 29 29 0 1 0.001 0 0 1.00000000 29.00000000

10 <leaf> 12 12 0 1 0.001 0 0 1.00000000 12.00000000

22 <leaf> 14 14 2 1 0.001 0 0 1.00000000 12.00000000

23 <leaf> 7 7 3 2 0.001 0 0 2.00000000 3.00000000

3 <leaf> 19 19 8 2 0.001 0 0 2.00000000 8.00000000

yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob

4 0.00000000 1.00000000 0.00000000 0.35802469

10 0.00000000 1.00000000 0.00000000 0.14814815

22 2.00000000 0.85714286 0.14285714 0.17283951

23 4.00000000 0.42857143 0.57142857 0.08641975

3 11.00000000 0.42105263 0.57894737 0.23456790

To find path to the leaf node that show higher chance for the outcome 'absent' and has more than 20 cases.

sel<-FRAME[which(FRAME$var=="<leaf>" & FRAME$yval==1 & FRAME$n>20),]

sel

var n wt dev yval complexity ncompete nsurrogate yval2.V1 yval2.V2

4 <leaf> 29 29 0 1 0.001 0 0 1.0000000 29.0000000

yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob

4 0.0000000 1.0000000 0.0000000 0.3580247

path.rpart(RP5_2,rownames(sel))

node number: 4

root

Start>=8.5

Start>=14.5

Application of a decision tree in the machine learning

The below example randomly splits the data set into training and test sets by 7:3 ratio while ensuring that both sets are mutually exclusive.

a <- sample(c(1:nrow(kyphosis)), size = nrow(kyphosis) * 0.7, replace = FALSE) Train <- kyphosis[a, ] Test <- kyphosis[-a, ]

The training set is used to build a tree. In this example, the output variable is 'Kyphosis'. As this is a categorical variable, the tree that is built here will be a classification tree.

RP <- rpart(Kyphosis ~ ., Train, control = rpart.control(minsplit = 3))

The tree may now be considered a statistical model. This model is now used to predict the output values of the test set. In this example, the output value is selected by observing the maximum probability of the output values at the node. In the real application, there will be many cross validations to determine the optimal settings and appropriate threshold point.

Pred <- predict(RP, Test[, c("Age", "Number", "Start")]) Test$Prediction <- paste("predicted", colnames(Pred)[apply(Pred, 1, which.max)], sep = "_") table(Test$Kyphosis, Test$Prediction)

## predicted_absent predicted_present
## absent 14 6
## present 1 4

The above table shows the errors of this model. In this example, the actual outputs consisted of 20 'absent' and 5 'present', and the tree predicted 15 'absent' and 10 'present'. Of the 15 predicted 'absent' 14 were correct determinations, and of the 10 predicted 'present' only 4 were correct.

If the target of this prediction is identifying 'present', then we would call the the one 'present' that was misclassified as 'absent' a False Negative, and 6 'absent' misclassified as 'present' False Positives. These error rates are used in evaluating the model performance.

My R Codes Archive

Tuesday, 10 December 2013

Decision Tree - rpart