Extending the linear model with R chapter 16
We apply regression tree methodology to study the relationship between atmospheric ozone concentration and meteorology in Los Angeles area in 1976. The response is Ozone (\(O_3\)), the atmospheric ozone concentration on a particular day. First, take a look over the data:
library(faraway)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ozone %>%
ggplot(mapping = aes(x=temp, y=O3)) +
geom_point(size=1) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ozone %>%
ggplot(aes(x=ibh, y=O3)) +
geom_point(size=1) +
geom_smooth() +
theme(axis.text.x = element_text(angle = 90))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ozone %>%
ggplot(aes(x=ibt, y=O3)) +
geom_point(size=1) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
library(rpart)
##
## Attaching package: 'rpart'
## The following object is masked from 'package:faraway':
##
## solder
(tmod <- rpart(O3 ~ ., ozone))
## n= 330
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 330 21115.4100 11.775760
## 2) temp< 67.5 214 4114.3040 7.425234
## 4) ibh>=3573.5 108 689.6296 5.148148 *
## 5) ibh< 3573.5 106 2294.1230 9.745283
## 10) dpg< -9.5 35 362.6857 6.457143 *
## 11) dpg>=-9.5 71 1366.4790 11.366200
## 22) ibt< 159 40 287.9000 9.050000 *
## 23) ibt>=159 31 587.0968 14.354840 *
## 3) temp>=67.5 116 5478.4400 19.801720
## 6) ibt< 226.5 55 1276.8360 15.945450
## 12) humidity< 59.5 10 167.6000 10.800000 *
## 13) humidity>=59.5 45 785.6444 17.088890 *
## 7) ibt>=226.5 61 2646.2620 23.278690
## 14) doy>=306.5 8 398.0000 16.000000 *
## 15) doy< 306.5 53 1760.4530 24.377360
## 30) vis>=55 36 1149.8890 22.944440 *
## 31) vis< 55 17 380.1176 27.411760 *
The first split (nodes 2 and 3) is on temperature:
214 observations have temperatures less than 67.5 with a mean response value of \(7.4\).
116 observations have temperatures greater than 67.5 with a mean response value of \(20\).
The total RSS has been reduced from \(21115.00\) to \(4114.30 + 5478.40 = 9592.70\).
More substantial output
summary(tmod)
Graphical displays
plot(tmod)
text(tmod)
plot(tmod,compress=T,uniform=T,branch=0.4)
text(tmod)
library(rpart.plot)
rpart.plot(tmod, type = 2)
Consider all partitions of the region of the predictors into two regions parallel to one of the axes. \(N-1\) possible partitions for each predictor with a total of \((N-1)p\) partitions.
For each partition, take the mean of the responses in the partition and compute residual sum of squares (RSS). Pick the partition that minimizes the RSS among all available partitions. \[ RSS(partition) = RSS(part_1) + RSS(part_2) = \sum_{y_i \in part_1} (y_i - \bar y_{part_1})^2 + \sum_{y_j \in part_2} (y_j - \bar y_{part_2})^2 \]
For categorical predictors with \(L\) levels:
\(L-1\) possible splits for an ordered factor
\(2^{L-1} - 1\) possible splits for an unordered factor
Monotonely transforming a quantitative predictor makes no difference.
Transforming response will make a difference because of the computation of RSS.
Easy interpretation
When training, we may simply exclude points with missing values provided we weight appropriately.
If we believe being missing expresses some information, then
When predict the response for a new value with missing values
drop the prediction down through the tree until the missing values prevent us from going further
use the mean value for the internal node to proceed
plot(jitter(predict(tmod)),residuals(tmod),xlab="Fitted",ylab="Residuals")
abline(h=0)
qqnorm(residuals(tmod))
qqline(residuals(tmod))
Let’s predict the response for a new value, using the median value as an example
(x0 <- apply(ozone[,-1], 2, median))
## vh wind humidity temp ibh dpg ibt vis
## 5760.0 5.0 64.0 62.0 2112.5 24.0 167.5 120.0
## doy
## 205.5
rpart.plot(tmod, type = 2)
predict(tmod,data.frame(t(x0)))
## 1
## 14.35484
The recursive partitioning algorithm describes how to grow the tree.
How to decide the optimal size for the tree?
Greedy strategy
keep partitioning until the reduction in overall cost (RSS) is not reduced by more than \(\epsilon\).
can be difficult to set \(\epsilon\) value.
Cross-validation for more general model selection
leave one out: For a given tree, leave out one observation, recalculate the tree and predict the left-over observation. Repeat for all observations
\(\sum_{j=1}^n (y_j - \hat f_{(j)}(x_j))^2\) where \(\hat f_{(j)}\) denotes the predicted value of input \(x_j\) given the tree given built without \(x_j\)
k-fold cross-validation randomly divide the data into k equal parts and use \(k-1\) parts to predict the cases in the remaining part. k = 10 is typical choice.
Cost-complexity function for trees further narrow down the set of trees worth validating
\[ CC(Tree) = \sum_{\mbox{terminal nodes: i}} RSS_i + \lambda(\mbox{number of terminal nodes}) \]
Pruning the tree
cp
(ratio of \(\lambda\) to \(RSS\) of the root tree) to get a large
treeset.seed(7360)
tmode <- rpart(O3 ~ ., ozone, cp = 0.001)
printcp(tmode)
##
## Regression tree:
## rpart(formula = O3 ~ ., data = ozone, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] doy dpg humidity ibh ibt temp vh vis
##
## Root node error: 21115/330 = 63.986
##
## n= 330
##
## CP nsplit rel error xerror xstd
## 1 0.5456993 0 1.00000 1.00649 0.076959
## 2 0.0736591 1 0.45430 0.47756 0.041641
## 3 0.0535415 2 0.38064 0.43818 0.041663
## 4 0.0267557 3 0.32710 0.37264 0.035315
## 5 0.0232760 4 0.30034 0.37593 0.036334
## 6 0.0231021 5 0.27707 0.36501 0.035956
## 7 0.0153249 6 0.25397 0.35679 0.036236
## 8 0.0109137 7 0.23864 0.33960 0.033392
## 9 0.0070746 8 0.22773 0.32130 0.031940
## 10 0.0059918 9 0.22065 0.33369 0.034103
## 11 0.0059317 10 0.21466 0.33474 0.034161
## 12 0.0049709 12 0.20280 0.33863 0.034415
## 13 0.0047996 15 0.18789 0.34231 0.034722
## 14 0.0044712 16 0.18309 0.34201 0.034777
## 15 0.0031921 17 0.17861 0.34212 0.034971
## 16 0.0022152 19 0.17223 0.34343 0.035302
## 17 0.0020733 20 0.17002 0.34959 0.035899
## 18 0.0020297 22 0.16587 0.35026 0.035914
## 19 0.0014432 23 0.16384 0.34650 0.035938
## 20 0.0011322 24 0.16240 0.34717 0.035987
## 21 0.0011035 25 0.16126 0.34622 0.036007
## 22 0.0010000 26 0.16016 0.34534 0.036015
Can select the size of the tree by minimizing the value of
xerror
with corresponding value of cp
can select the smallest tree with a CV error within one standard error of the minimum.
visulize the CV error
plotcp(tmod)
library(rpart.plot)
rpart.plot(tmod, type = 2)
Learn more about rpart.plot
from Stephen Milborrow’s
pdf
tmodr <- prune.rpart(tmod, 0.0153249)
1-sum(residuals(tmodr)^2)/sum((ozone$O3-mean(ozone$O3))^2)
## [1] 0.7613586
alm <- lm(O3 ~ vis + doy + ibt + humidity + temp, data = ozone)
summary(alm)
##
## Call:
## lm(formula = O3 ~ vis + doy + ibt + humidity + temp, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2499 -3.0600 -0.1662 2.8773 13.2690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.017861 1.653058 -6.060 3.78e-09 ***
## vis -0.008201 0.003692 -2.221 0.027 *
## doy -0.010197 0.002445 -4.170 3.91e-05 ***
## ibt 0.034913 0.006707 5.205 3.45e-07 ***
## humidity 0.085101 0.014347 5.931 7.70e-09 ***
## temp 0.232806 0.036074 6.454 3.98e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.43 on 324 degrees of freedom
## Multiple R-squared: 0.6988, Adjusted R-squared: 0.6942
## F-statistic: 150.3 on 5 and 324 DF, p-value: < 2.2e-16
Tree models are not optimal for prediction or explanation purposes.
A method that builds on trees to form a forest
Random forest method uses bootstrap aggregating (known as bagging). For \(b=1, \dots, B\)
Draw a sample with replacement from (\(X\), \(Y\)) to generate (\(X_b\), \(Y_b\))
fit a regression tree to (\(X_b\), \(Y_b\))
For cases not drawn in bootstrap sample, compute the mean squared error of prediction
The \(B\) trees form the forest. New predictions can be made by feeding predictor value into each tree and average the predictions.
Subsample predictors at each node to reduce the correlations among the trees.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
fmod <- randomForest(O3 ~ ., ozone)
plot(fmod, main = "")
Use cross-validation
start with full sample \(p = 9\)
move on to \(step \times p\) recursively, rounding at each stage
pick up best nvars
for mtry
cvr <- rfcv(ozone[,-1],ozone[,1],step=0.9)
cbind(nvars=cvr$n.var,MSE=cvr$error.cv)
## nvars MSE
## 9 9 16.11308
## 8 8 15.69783
## 7 7 15.90903
## 6 6 16.63753
## 5 5 17.39111
## 4 4 19.38641
## 3 3 20.93230
## 2 2 22.45007
## 1 1 25.87572
mtry
= 9, compute \(R^2\) valuefmod <- randomForest(O3 ~ ., ozone, mtry=9)
1-sum((fmod$predict-ozone$O3)^2)/sum((ozone$O3-mean(ozone$O3))^2)
## [1] 0.7410042
partialPlot(fmod, ozone, "temp", main="")
importance(fmod)
## IncNodePurity
## vh 362.1073
## wind 184.8538
## humidity 865.7341
## temp 13023.4729
## ibh 1402.9185
## dpg 998.6239
## ibt 2566.1016
## vis 717.1797
## doy 806.5832
Responses are categorical
A split should divide the observations within a node so that class types within a split are mostly of one kind
Let \(n_{ik}\) be the number of observations of type \(k\) within terminal node \(i\) and \(p_{ik}\) be the observed proportion of type \(k\) within node \(i\).
Several measure of purity of the node:
\[ D_i = -2 \sum_{k}n_{ik}\log p_{ik} \]
\[ D_i = - \sum_{k} p_{ik} \log p_{ik} \]
\[ D_i = 1 - \sum_k p_{ik}^2 \]
All measures are minimized when all members of the node are of the same type
Identification of the sex and species of an historical specimen of
kangaroo (kanga
dataset in faraway package).
Three possible species: Giganteus, Melanops and Fuliginosus
The sex of the animal
18 skull measurements
We form separate trees for identifying the sex and species.
data(kanga, package="faraway")
x0 <- c(1115,NA,748,182,NA,NA,178,311,756,226,NA,NA,NA,48,1009,NA,204,593)
kanga <- kanga[,c(T,F,!is.na(x0))]
kanga[1:2,]
## species basilar.length palate.length palate.width squamosal.depth
## 1 giganteus 1312 882 NA 180
## 2 giganteus 1439 985 230 150
## lacrymal.width zygomatic.width orbital.width foramina.length mandible.length
## 1 394 782 249 88 1086
## 2 416 824 233 100 1158
## mandible.depth ramus.height
## 1 179 591
## 2 181 643
apply(kanga,2,function(x) sum(is.na(x)))
## species basilar.length palate.length palate.width squamosal.depth
## 0 1 1 24 1
## lacrymal.width zygomatic.width orbital.width foramina.length mandible.length
## 0 1 0 0 12
## mandible.depth ramus.height
## 0 0
round(cor(kanga[,-1],use="pairwise.complete.obs")[,c(3,9)],2)
## palate.width mandible.length
## basilar.length 0.77 0.98
## palate.length 0.81 0.98
## palate.width 1.00 0.81
## squamosal.depth 0.69 0.80
## lacrymal.width 0.77 0.92
## zygomatic.width 0.78 0.92
## orbital.width 0.12 0.25
## foramina.length 0.19 0.23
## mandible.length 0.81 1.00
## mandible.depth 0.62 0.85
## ramus.height 0.73 0.94
newko <- na.omit(kanga[,-c(4,10)])
dim(newko)
## [1] 144 10
ggplot(newko, aes(x=zygomatic.width,y=foramina.length,shape=species, color = species)) +
geom_point() +
theme(legend.position = "top", legend.direction ="horizontal", legend.title=element_blank())
set.seed(123) # try with 7360
kt <- rpart(species ~ ., data=newko,control = rpart.control(cp = 0.001))
printcp(kt)
##
## Classification tree:
## rpart(formula = species ~ ., data = newko, control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] basilar.length foramina.length lacrymal.width ramus.height
## [5] squamosal.depth zygomatic.width
##
## Root node error: 95/144 = 0.65972
##
## n= 144
##
## CP nsplit rel error xerror xstd
## 1 0.178947 0 1.00000 1.10526 0.056133
## 2 0.105263 1 0.82105 0.92632 0.061579
## 3 0.050000 2 0.71579 0.90526 0.061952
## 4 0.021053 6 0.51579 0.82105 0.062938
## 5 0.010526 7 0.49474 0.83158 0.062859
## 6 0.001000 8 0.48421 0.89474 0.062120
(ktp <- prune(kt,cp=0.0211))
## n= 144
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 144 95 fuliginosus (0.34027778 0.33333333 0.32638889)
## 2) zygomatic.width>=923 37 13 fuliginosus (0.64864865 0.16216216 0.18918919) *
## 3) zygomatic.width< 923 107 65 giganteus (0.23364486 0.39252336 0.37383178)
## 6) zygomatic.width>=901 16 3 giganteus (0.12500000 0.81250000 0.06250000) *
## 7) zygomatic.width< 901 91 52 melanops (0.25274725 0.31868132 0.42857143)
## 14) foramina.length< 98.5 58 33 melanops (0.36206897 0.20689655 0.43103448)
## 28) lacrymal.width< 448.5 50 29 fuliginosus (0.42000000 0.24000000 0.34000000)
## 56) ramus.height>=628.5 33 14 fuliginosus (0.57575758 0.18181818 0.24242424) *
## 57) ramus.height< 628.5 17 8 melanops (0.11764706 0.35294118 0.52941176) *
## 29) lacrymal.width>=448.5 8 0 melanops (0.00000000 0.00000000 1.00000000) *
## 15) foramina.length>=98.5 33 16 giganteus (0.06060606 0.51515152 0.42424242)
## 30) squamosal.depth< 182.5 26 10 giganteus (0.07692308 0.61538462 0.30769231) *
## 31) squamosal.depth>=182.5 7 1 melanops (0.00000000 0.14285714 0.85714286) *
rpart.plot(ktp, type = 2)
(tt <- table(actual=newko$species, predicted=predict(ktp, type="class")))
## predicted
## actual fuliginosus giganteus melanops
## fuliginosus 43 4 2
## giganteus 12 29 7
## melanops 15 9 23
1-sum(diag(tt))/sum(tt)
## [1] 0.3402778
cvr <- rfcv(newko[,-1],newko[,1],step=0.9)
cbind(nvars=cvr$n.var,error.rate=cvr$error.cv)
## nvars error.rate
## 9 9 0.5138889
## 8 8 0.4652778
## 7 7 0.4722222
## 6 6 0.4722222
## 5 5 0.4791667
## 4 4 0.5208333
## 3 3 0.5763889
## 2 2 0.5694444
## 1 1 0.5833333
fmod <- randomForest(species ~ ., newko, mtry=6)
(tt <- table(actual=newko$species,predicted=predict(fmod)))
## predicted
## actual fuliginosus giganteus melanops
## fuliginosus 37 5 7
## giganteus 5 21 22
## melanops 11 20 16
1-sum(diag(tt))/sum(tt)
## [1] 0.4861111
Use principal components on the predictors for rescue
pck <- princomp(newko[,-1])
pcdf <- data.frame(species = newko$species, pck$scores)
fmod <- randomForest(species ~ ., pcdf, mtry=6)
tt <- table(actual = newko$species, predicted=predict(fmod))
1-sum(diag(tt))/sum(tt)
## [1] 0.3541667