Trees

Announcement
Acknowledgement
Air pollution data
Fit a regression tree
Recursive partitioning regression algorithm
Missing values
Diagonostics
Prediction
Tree pruning
Random Forests
- Fit a random forest
Classification trees
- Kangaroo data
- Fit a classification tree
Classification using forests

Announcement

Course evaluations

Acknowledgement

Extending the linear model with R chapter 16

Air pollution data

We apply regression tree methodology to study the relationship between atmospheric ozone concentration and meteorology in Los Angeles area in 1976. The response is Ozone (\(O_3\)), the atmospheric ozone concentration on a particular day. First, take a look over the data:

library(faraway)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

ozone %>%
  ggplot(mapping = aes(x=temp, y=O3)) + 
  geom_point(size=1) +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ozone %>%
  ggplot(aes(x=ibh, y=O3)) +
  geom_point(size=1) +
  geom_smooth() +
  theme(axis.text.x = element_text(angle = 90))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ozone %>%
  ggplot(aes(x=ibt, y=O3)) +
  geom_point(size=1) +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

An examination of the data reveals several nonlinear relationships indicating that a linear regression might not be appropriate without the addition of some transformation.

Fit a regression tree

library(rpart)

## 
## Attaching package: 'rpart'

## The following object is masked from 'package:faraway':
## 
##     solder

(tmod <- rpart(O3 ~ ., ozone))

## n= 330 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 330 21115.4100 11.775760  
##    2) temp< 67.5 214  4114.3040  7.425234  
##      4) ibh>=3573.5 108   689.6296  5.148148 *
##      5) ibh< 3573.5 106  2294.1230  9.745283  
##       10) dpg< -9.5 35   362.6857  6.457143 *
##       11) dpg>=-9.5 71  1366.4790 11.366200  
##         22) ibt< 159 40   287.9000  9.050000 *
##         23) ibt>=159 31   587.0968 14.354840 *
##    3) temp>=67.5 116  5478.4400 19.801720  
##      6) ibt< 226.5 55  1276.8360 15.945450  
##       12) humidity< 59.5 10   167.6000 10.800000 *
##       13) humidity>=59.5 45   785.6444 17.088890 *
##      7) ibt>=226.5 61  2646.2620 23.278690  
##       14) doy>=306.5 8   398.0000 16.000000 *
##       15) doy< 306.5 53  1760.4530 24.377360  
##         30) vis>=55 36  1149.8890 22.944440 *
##         31) vis< 55 17   380.1176 27.411760 *

The first split (nodes 2 and 3) is on temperature:
- 214 observations have temperatures less than 67.5 with a mean response value of \(7.4\).
- 116 observations have temperatures greater than 67.5 with a mean response value of \(20\).
The total RSS has been reduced from \(21115.00\) to \(4114.30 + 5478.40 = 9592.70\).

More substantial output

summary(tmod)

Graphical displays

plot(tmod)
text(tmod)

plot(tmod,compress=T,uniform=T,branch=0.4)
text(tmod)

library(rpart.plot)
rpart.plot(tmod, type = 2)

Recursive partitioning regression algorithm

Consider all partitions of the region of the predictors into two regions parallel to one of the axes. \(N-1\) possible partitions for each predictor with a total of \((N-1)p\) partitions.
For each partition, take the mean of the responses in the partition and compute residual sum of squares (RSS). Pick the partition that minimizes the RSS among all available partitions. \[ RSS(partition) = RSS(part_1) + RSS(part_2) = \sum_{y_i \in part_1} (y_i - \bar y_{part_1})^2 + \sum_{y_j \in part_2} (y_j - \bar y_{part_2})^2 \]

Sub-partition the partitions in a recursive manner.

For categorical predictors with \(L\) levels:
- \(L-1\) possible splits for an ordered factor
- \(2^{L-1} - 1\) possible splits for an unordered factor
Monotonely transforming a quantitative predictor makes no difference.
Transforming response will make a difference because of the computation of RSS.
Easy interpretation

Missing values

When training, we may simply exclude points with missing values provided we weight appropriately.

If we believe being missing expresses some information, then

for categorical predictor, assign an additional factor for missing values
for continuous predictors, could discretize data into ranges so that missing value becomes an additional factor

When predict the response for a new value with missing values

drop the prediction down through the tree until the missing values prevent us from going further
use the mean value for the internal node to proceed

Diagonostics

plot(jitter(predict(tmod)),residuals(tmod),xlab="Fitted",ylab="Residuals")
abline(h=0)

qqnorm(residuals(tmod))
qqline(residuals(tmod))

Prediction

Let’s predict the response for a new value, using the median value as an example

(x0 <- apply(ozone[,-1], 2, median))

##       vh     wind humidity     temp      ibh      dpg      ibt      vis 
##   5760.0      5.0     64.0     62.0   2112.5     24.0    167.5    120.0 
##      doy 
##    205.5

rpart.plot(tmod, type = 2)

predict(tmod,data.frame(t(x0)))

##        1 
## 14.35484

Tree pruning

The recursive partitioning algorithm describes how to grow the tree.
How to decide the optimal size for the tree?
Greedy strategy
- keep partitioning until the reduction in overall cost (RSS) is not reduced by more than \(\epsilon\).
- can be difficult to set \(\epsilon\) value.
Cross-validation for more general model selection
- leave one out: For a given tree, leave out one observation, recalculate the tree and predict the left-over observation. Repeat for all observations
- \(\sum_{j=1}^n (y_j - \hat f_{(j)}(x_j))^2\) where \(\hat f_{(j)}\) denotes the predicted value of input \(x_j\) given the tree given built without \(x_j\)
- k-fold cross-validation randomly divide the data into k equal parts and use \(k-1\) parts to predict the cases in the remaining part. k = 10 is typical choice.
Cost-complexity function for trees further narrow down the set of trees worth validating

\[ CC(Tree) = \sum_{\mbox{terminal nodes: i}} RSS_i + \lambda(\mbox{number of terminal nodes}) \]
- large \(\lambda\) value favous small trees

Pruning the tree

use cp (ratio of \(\lambda\) to \(RSS\) of the root tree) to get a large tree

set.seed(7360)
tmode <- rpart(O3 ~ ., ozone, cp = 0.001)
printcp(tmode)

## 
## Regression tree:
## rpart(formula = O3 ~ ., data = ozone, cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] doy      dpg      humidity ibh      ibt      temp     vh       vis     
## 
## Root node error: 21115/330 = 63.986
## 
## n= 330 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.5456993      0   1.00000 1.00649 0.076959
## 2  0.0736591      1   0.45430 0.47756 0.041641
## 3  0.0535415      2   0.38064 0.43818 0.041663
## 4  0.0267557      3   0.32710 0.37264 0.035315
## 5  0.0232760      4   0.30034 0.37593 0.036334
## 6  0.0231021      5   0.27707 0.36501 0.035956
## 7  0.0153249      6   0.25397 0.35679 0.036236
## 8  0.0109137      7   0.23864 0.33960 0.033392
## 9  0.0070746      8   0.22773 0.32130 0.031940
## 10 0.0059918      9   0.22065 0.33369 0.034103
## 11 0.0059317     10   0.21466 0.33474 0.034161
## 12 0.0049709     12   0.20280 0.33863 0.034415
## 13 0.0047996     15   0.18789 0.34231 0.034722
## 14 0.0044712     16   0.18309 0.34201 0.034777
## 15 0.0031921     17   0.17861 0.34212 0.034971
## 16 0.0022152     19   0.17223 0.34343 0.035302
## 17 0.0020733     20   0.17002 0.34959 0.035899
## 18 0.0020297     22   0.16587 0.35026 0.035914
## 19 0.0014432     23   0.16384 0.34650 0.035938
## 20 0.0011322     24   0.16240 0.34717 0.035987
## 21 0.0011035     25   0.16126 0.34622 0.036007
## 22 0.0010000     26   0.16016 0.34534 0.036015

Can select the size of the tree by minimizing the value of xerror with corresponding value of cp
- take the nine-split tree
can select the smallest tree with a CV error within one standard error of the minimum.
- \(0.350 + 0.035 = 0.385\), take the seven-split tree.
visulize the CV error

plotcp(tmod)

library(rpart.plot)
rpart.plot(tmod, type = 2)

Learn more about rpart.plot from Stephen Milborrow’s pdf

Select a tree with five splits, effectively six parameters

tmodr <- prune.rpart(tmod, 0.0153249)
1-sum(residuals(tmodr)^2)/sum((ozone$O3-mean(ozone$O3))^2)

## [1] 0.7613586

Compare to a linear regression with 6 parameters

alm <- lm(O3 ~ vis + doy + ibt + humidity + temp, data = ozone)
summary(alm)

## 
## Call:
## lm(formula = O3 ~ vis + doy + ibt + humidity + temp, data = ozone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2499  -3.0600  -0.1662   2.8773  13.2690 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.017861   1.653058  -6.060 3.78e-09 ***
## vis          -0.008201   0.003692  -2.221    0.027 *  
## doy          -0.010197   0.002445  -4.170 3.91e-05 ***
## ibt           0.034913   0.006707   5.205 3.45e-07 ***
## humidity      0.085101   0.014347   5.931 7.70e-09 ***
## temp          0.232806   0.036074   6.454 3.98e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.43 on 324 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.6942 
## F-statistic: 150.3 on 5 and 324 DF,  p-value: < 2.2e-16

Tree models are not optimal for prediction or explanation purposes.

Random Forests

A method that builds on trees to form a forest
Random forest method uses bootstrap aggregating (known as bagging). For \(b=1, \dots, B\)
- Draw a sample with replacement from (\(X\), \(Y\)) to generate (\(X_b\), \(Y_b\))
- fit a regression tree to (\(X_b\), \(Y_b\))
- For cases not drawn in bootstrap sample, compute the mean squared error of prediction
The \(B\) trees form the forest. New predictions can be made by feeding predictor value into each tree and average the predictions.
Subsample predictors at each node to reduce the correlations among the trees.
- Default choice of subsample size is \(\sqrt{p}\)

Fit a random forest

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

fmod <- randomForest(O3 ~ ., ozone)

plot the mean squared error (MSE) of prediction as number of trees \(B\) increases

plot(fmod, main = "")

Use cross-validation
- start with full sample \(p = 9\)
- move on to \(step \times p\) recursively, rounding at each stage
- pick up best nvars for mtry

cvr <- rfcv(ozone[,-1],ozone[,1],step=0.9)
cbind(nvars=cvr$n.var,MSE=cvr$error.cv)

##   nvars      MSE
## 9     9 16.11308
## 8     8 15.69783
## 7     7 15.90903
## 6     6 16.63753
## 5     5 17.39111
## 4     4 19.38641
## 3     3 20.93230
## 2     2 22.45007
## 1     1 25.87572

choose mtry = 9, compute \(R^2\) value

fmod <- randomForest(O3 ~ ., ozone, mtry=9)
1-sum((fmod$predict-ozone$O3)^2)/sum((ozone$O3-mean(ozone$O3))^2)

## [1] 0.7410042

partial dependence plot compute the predicted value for every case in the dataset excluding the predictor of interest and average over the predicted values

partialPlot(fmod, ozone, "temp", main="")

A measure of importance over predictors

importance(fmod)

##          IncNodePurity
## vh            362.1073
## wind          184.8538
## humidity      865.7341
## temp        13023.4729
## ibh          1402.9185
## dpg           998.6239
## ibt          2566.1016
## vis           717.1797
## doy           806.5832

Classification trees

Responses are categorical
A split should divide the observations within a node so that class types within a split are mostly of one kind
Let \(n_{ik}\) be the number of observations of type \(k\) within terminal node \(i\) and \(p_{ik}\) be the observed proportion of type \(k\) within node \(i\).
Several measure of purity of the node:
- Deviance:
\[ D_i = -2 \sum_{k}n_{ik}\log p_{ik} \]
- Entropy:
\[ D_i = - \sum_{k} p_{ik} \log p_{ik} \]
- Gini index:
\[ D_i = 1 - \sum_k p_{ik}^2 \]
All measures are minimized when all members of the node are of the same type

Kangaroo data

Identification of the sex and species of an historical specimen of kangaroo (kanga dataset in faraway package).

Three possible species: Giganteus, Melanops and Fuliginosus
The sex of the animal
18 skull measurements

We form separate trees for identifying the sex and species.

read in data

data(kanga, package="faraway")
x0 <- c(1115,NA,748,182,NA,NA,178,311,756,226,NA,NA,NA,48,1009,NA,204,593)

exclude all variables that are missing in the test case

kanga <- kanga[,c(T,F,!is.na(x0))]
kanga[1:2,]

##     species basilar.length palate.length palate.width squamosal.depth
## 1 giganteus           1312           882           NA             180
## 2 giganteus           1439           985          230             150
##   lacrymal.width zygomatic.width orbital.width foramina.length mandible.length
## 1            394             782           249              88            1086
## 2            416             824           233             100            1158
##   mandible.depth ramus.height
## 1            179          591
## 2            181          643

First, look at missing values in the training set

apply(kanga,2,function(x) sum(is.na(x)))

##         species  basilar.length   palate.length    palate.width squamosal.depth 
##               0               1               1              24               1 
##  lacrymal.width zygomatic.width   orbital.width foramina.length mandible.length 
##               0               1               0               0              12 
##  mandible.depth    ramus.height 
##               0               0

Compute the pairwise correlation of these variables with other variables

round(cor(kanga[,-1],use="pairwise.complete.obs")[,c(3,9)],2)

##                 palate.width mandible.length
## basilar.length          0.77            0.98
## palate.length           0.81            0.98
## palate.width            1.00            0.81
## squamosal.depth         0.69            0.80
## lacrymal.width          0.77            0.92
## zygomatic.width         0.78            0.92
## orbital.width           0.12            0.25
## foramina.length         0.19            0.23
## mandible.length         0.81            1.00
## mandible.depth          0.62            0.85
## ramus.height            0.73            0.94

Remove the two variables and remaining missing cases

newko <- na.omit(kanga[,-c(4,10)])
dim(newko)

## [1] 144  10

visualize the class separation over the two variables

ggplot(newko, aes(x=zygomatic.width,y=foramina.length,shape=species, color = species)) +
  geom_point() +
  theme(legend.position = "top", legend.direction ="horizontal", legend.title=element_blank())

Fit a classification tree

Gini’s index is the default choice of criterion

set.seed(123)  # try with 7360
kt <- rpart(species ~ ., data=newko,control = rpart.control(cp = 0.001))
printcp(kt)

## 
## Classification tree:
## rpart(formula = species ~ ., data = newko, control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] basilar.length  foramina.length lacrymal.width  ramus.height   
## [5] squamosal.depth zygomatic.width
## 
## Root node error: 95/144 = 0.65972
## 
## n= 144 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.178947      0   1.00000 1.10526 0.056133
## 2 0.105263      1   0.82105 0.92632 0.061579
## 3 0.050000      2   0.71579 0.90526 0.061952
## 4 0.021053      6   0.51579 0.82105 0.062938
## 5 0.010526      7   0.49474 0.83158 0.062859
## 6 0.001000      8   0.48421 0.89474 0.062120

Select the six-split tree and plot

(ktp <- prune(kt,cp=0.0211))

## n= 144 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 144 95 fuliginosus (0.34027778 0.33333333 0.32638889)  
##    2) zygomatic.width>=923 37 13 fuliginosus (0.64864865 0.16216216 0.18918919) *
##    3) zygomatic.width< 923 107 65 giganteus (0.23364486 0.39252336 0.37383178)  
##      6) zygomatic.width>=901 16  3 giganteus (0.12500000 0.81250000 0.06250000) *
##      7) zygomatic.width< 901 91 52 melanops (0.25274725 0.31868132 0.42857143)  
##       14) foramina.length< 98.5 58 33 melanops (0.36206897 0.20689655 0.43103448)  
##         28) lacrymal.width< 448.5 50 29 fuliginosus (0.42000000 0.24000000 0.34000000)  
##           56) ramus.height>=628.5 33 14 fuliginosus (0.57575758 0.18181818 0.24242424) *
##           57) ramus.height< 628.5 17  8 melanops (0.11764706 0.35294118 0.52941176) *
##         29) lacrymal.width>=448.5 8  0 melanops (0.00000000 0.00000000 1.00000000) *
##       15) foramina.length>=98.5 33 16 giganteus (0.06060606 0.51515152 0.42424242)  
##         30) squamosal.depth< 182.5 26 10 giganteus (0.07692308 0.61538462 0.30769231) *
##         31) squamosal.depth>=182.5 7  1 melanops (0.00000000 0.14285714 0.85714286) *

rpart.plot(ktp, type = 2)

Compute the misclassification error

 (tt <- table(actual=newko$species, predicted=predict(ktp, type="class")))

##              predicted
## actual        fuliginosus giganteus melanops
##   fuliginosus          43         4        2
##   giganteus            12        29        7
##   melanops             15         9       23

1-sum(diag(tt))/sum(tt)

## [1] 0.3402778

May instead use the Principle components for constructing the tree for lower error rate (0.25)

Classification using forests

A natural extension to classification trees

cvr <- rfcv(newko[,-1],newko[,1],step=0.9)
cbind(nvars=cvr$n.var,error.rate=cvr$error.cv)

##   nvars error.rate
## 9     9  0.5138889
## 8     8  0.4652778
## 7     7  0.4722222
## 6     6  0.4722222
## 5     5  0.4791667
## 4     4  0.5208333
## 3     3  0.5763889
## 2     2  0.5694444
## 1     1  0.5833333

choose 6 as sub-sample size

fmod <- randomForest(species ~ ., newko, mtry=6)
(tt <- table(actual=newko$species,predicted=predict(fmod)))

##              predicted
## actual        fuliginosus giganteus melanops
##   fuliginosus          37         5        7
##   giganteus             5        21       22
##   melanops             11        20       16

1-sum(diag(tt))/sum(tt)

## [1] 0.4861111

Use principal components on the predictors for rescue

pck <- princomp(newko[,-1])
pcdf <- data.frame(species = newko$species, pck$scores)
fmod <- randomForest(species ~ ., pcdf, mtry=6)
tt <- table(actual = newko$species, predicted=predict(fmod))
1-sum(diag(tt))/sum(tt)

## [1] 0.3541667