Announcement

Acknowledgement

Extending the linear model with R chapter 16

Air pollution data

We apply regression tree methodology to study the relationship between atmospheric ozone concentration and meteorology in Los Angeles area in 1976. The response is Ozone (\(O_3\)), the atmospheric ozone concentration on a particular day. First, take a look over the data:

library(faraway)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ozone %>%
  ggplot(mapping = aes(x=temp, y=O3)) + 
  geom_point(size=1) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ozone %>%
  ggplot(aes(x=ibh, y=O3)) +
  geom_point(size=1) +
  geom_smooth() +
  theme(axis.text.x = element_text(angle = 90))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ozone %>%
  ggplot(aes(x=ibt, y=O3)) +
  geom_point(size=1) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Fit a regression tree

library(rpart)
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:faraway':
## 
##     solder
(tmod <- rpart(O3 ~ ., ozone))
## n= 330 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 330 21115.4100 11.775760  
##    2) temp< 67.5 214  4114.3040  7.425234  
##      4) ibh>=3573.5 108   689.6296  5.148148 *
##      5) ibh< 3573.5 106  2294.1230  9.745283  
##       10) dpg< -9.5 35   362.6857  6.457143 *
##       11) dpg>=-9.5 71  1366.4790 11.366200  
##         22) ibt< 159 40   287.9000  9.050000 *
##         23) ibt>=159 31   587.0968 14.354840 *
##    3) temp>=67.5 116  5478.4400 19.801720  
##      6) ibt< 226.5 55  1276.8360 15.945450  
##       12) humidity< 59.5 10   167.6000 10.800000 *
##       13) humidity>=59.5 45   785.6444 17.088890 *
##      7) ibt>=226.5 61  2646.2620 23.278690  
##       14) doy>=306.5 8   398.0000 16.000000 *
##       15) doy< 306.5 53  1760.4530 24.377360  
##         30) vis>=55 36  1149.8890 22.944440 *
##         31) vis< 55 17   380.1176 27.411760 *

More substantial output

summary(tmod)

Graphical displays

plot(tmod)
text(tmod)

plot(tmod,compress=T,uniform=T,branch=0.4)
text(tmod)

library(rpart.plot)
rpart.plot(tmod, type = 2)

Recursive partitioning regression algorithm

  1. Consider all partitions of the region of the predictors into two regions parallel to one of the axes. \(N-1\) possible partitions for each predictor with a total of \((N-1)p\) partitions.

  2. For each partition, take the mean of the responses in the partition and compute residual sum of squares (RSS). Pick the partition that minimizes the RSS among all available partitions. \[ RSS(partition) = RSS(part_1) + RSS(part_2) = \sum_{y_i \in part_1} (y_i - \bar y_{part_1})^2 + \sum_{y_j \in part_2} (y_j - \bar y_{part_2})^2 \]

  1. Sub-partition the partitions in a recursive manner.

Missing values

When training, we may simply exclude points with missing values provided we weight appropriately.

If we believe being missing expresses some information, then

When predict the response for a new value with missing values

Diagonostics

plot(jitter(predict(tmod)),residuals(tmod),xlab="Fitted",ylab="Residuals")
abline(h=0)

qqnorm(residuals(tmod))
qqline(residuals(tmod))

Prediction

Let’s predict the response for a new value, using the median value as an example

(x0 <- apply(ozone[,-1], 2, median))
##       vh     wind humidity     temp      ibh      dpg      ibt      vis 
##   5760.0      5.0     64.0     62.0   2112.5     24.0    167.5    120.0 
##      doy 
##    205.5
rpart.plot(tmod, type = 2)

predict(tmod,data.frame(t(x0)))
##        1 
## 14.35484

Tree pruning

Random Forests

Fit a random forest

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
fmod <- randomForest(O3 ~ ., ozone)
  • plot the mean squared error (MSE) of prediction as number of trees \(B\) increases
plot(fmod, main = "")

  • Use cross-validation

    • start with full sample \(p = 9\)

    • move on to \(step \times p\) recursively, rounding at each stage

    • pick up best nvars for mtry

cvr <- rfcv(ozone[,-1],ozone[,1],step=0.9)
cbind(nvars=cvr$n.var,MSE=cvr$error.cv)
##   nvars      MSE
## 9     9 16.11308
## 8     8 15.69783
## 7     7 15.90903
## 6     6 16.63753
## 5     5 17.39111
## 4     4 19.38641
## 3     3 20.93230
## 2     2 22.45007
## 1     1 25.87572
  • choose mtry = 9, compute \(R^2\) value
fmod <- randomForest(O3 ~ ., ozone, mtry=9)
1-sum((fmod$predict-ozone$O3)^2)/sum((ozone$O3-mean(ozone$O3))^2)
## [1] 0.7410042
  • partial dependence plot compute the predicted value for every case in the dataset excluding the predictor of interest and average over the predicted values
partialPlot(fmod, ozone, "temp", main="")

  • A measure of importance over predictors
importance(fmod)
##          IncNodePurity
## vh            362.1073
## wind          184.8538
## humidity      865.7341
## temp        13023.4729
## ibh          1402.9185
## dpg           998.6239
## ibt          2566.1016
## vis           717.1797
## doy           806.5832

Classification trees

Kangaroo data

Identification of the sex and species of an historical specimen of kangaroo (kanga dataset in faraway package).

  • Three possible species: Giganteus, Melanops and Fuliginosus

  • The sex of the animal

  • 18 skull measurements

We form separate trees for identifying the sex and species.

  • read in data
data(kanga, package="faraway")
x0 <- c(1115,NA,748,182,NA,NA,178,311,756,226,NA,NA,NA,48,1009,NA,204,593)
  • exclude all variables that are missing in the test case
kanga <- kanga[,c(T,F,!is.na(x0))]
kanga[1:2,]
##     species basilar.length palate.length palate.width squamosal.depth
## 1 giganteus           1312           882           NA             180
## 2 giganteus           1439           985          230             150
##   lacrymal.width zygomatic.width orbital.width foramina.length mandible.length
## 1            394             782           249              88            1086
## 2            416             824           233             100            1158
##   mandible.depth ramus.height
## 1            179          591
## 2            181          643
  • First, look at missing values in the training set
apply(kanga,2,function(x) sum(is.na(x)))
##         species  basilar.length   palate.length    palate.width squamosal.depth 
##               0               1               1              24               1 
##  lacrymal.width zygomatic.width   orbital.width foramina.length mandible.length 
##               0               1               0               0              12 
##  mandible.depth    ramus.height 
##               0               0
  • Compute the pairwise correlation of these variables with other variables
round(cor(kanga[,-1],use="pairwise.complete.obs")[,c(3,9)],2)
##                 palate.width mandible.length
## basilar.length          0.77            0.98
## palate.length           0.81            0.98
## palate.width            1.00            0.81
## squamosal.depth         0.69            0.80
## lacrymal.width          0.77            0.92
## zygomatic.width         0.78            0.92
## orbital.width           0.12            0.25
## foramina.length         0.19            0.23
## mandible.length         0.81            1.00
## mandible.depth          0.62            0.85
## ramus.height            0.73            0.94
  • Remove the two variables and remaining missing cases
newko <- na.omit(kanga[,-c(4,10)])
dim(newko)
## [1] 144  10
  • visualize the class separation over the two variables
ggplot(newko, aes(x=zygomatic.width,y=foramina.length,shape=species, color = species)) +
  geom_point() +
  theme(legend.position = "top", legend.direction ="horizontal", legend.title=element_blank())

Fit a classification tree

  • Gini’s index is the default choice of criterion
set.seed(123)  # try with 7360
kt <- rpart(species ~ ., data=newko,control = rpart.control(cp = 0.001))
printcp(kt)
## 
## Classification tree:
## rpart(formula = species ~ ., data = newko, control = rpart.control(cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] basilar.length  foramina.length lacrymal.width  ramus.height   
## [5] squamosal.depth zygomatic.width
## 
## Root node error: 95/144 = 0.65972
## 
## n= 144 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.178947      0   1.00000 1.10526 0.056133
## 2 0.105263      1   0.82105 0.92632 0.061579
## 3 0.050000      2   0.71579 0.90526 0.061952
## 4 0.021053      6   0.51579 0.82105 0.062938
## 5 0.010526      7   0.49474 0.83158 0.062859
## 6 0.001000      8   0.48421 0.89474 0.062120
  • Select the six-split tree and plot
(ktp <- prune(kt,cp=0.0211))
## n= 144 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 144 95 fuliginosus (0.34027778 0.33333333 0.32638889)  
##    2) zygomatic.width>=923 37 13 fuliginosus (0.64864865 0.16216216 0.18918919) *
##    3) zygomatic.width< 923 107 65 giganteus (0.23364486 0.39252336 0.37383178)  
##      6) zygomatic.width>=901 16  3 giganteus (0.12500000 0.81250000 0.06250000) *
##      7) zygomatic.width< 901 91 52 melanops (0.25274725 0.31868132 0.42857143)  
##       14) foramina.length< 98.5 58 33 melanops (0.36206897 0.20689655 0.43103448)  
##         28) lacrymal.width< 448.5 50 29 fuliginosus (0.42000000 0.24000000 0.34000000)  
##           56) ramus.height>=628.5 33 14 fuliginosus (0.57575758 0.18181818 0.24242424) *
##           57) ramus.height< 628.5 17  8 melanops (0.11764706 0.35294118 0.52941176) *
##         29) lacrymal.width>=448.5 8  0 melanops (0.00000000 0.00000000 1.00000000) *
##       15) foramina.length>=98.5 33 16 giganteus (0.06060606 0.51515152 0.42424242)  
##         30) squamosal.depth< 182.5 26 10 giganteus (0.07692308 0.61538462 0.30769231) *
##         31) squamosal.depth>=182.5 7  1 melanops (0.00000000 0.14285714 0.85714286) *
rpart.plot(ktp, type = 2)

  • Compute the misclassification error
 (tt <- table(actual=newko$species, predicted=predict(ktp, type="class")))
##              predicted
## actual        fuliginosus giganteus melanops
##   fuliginosus          43         4        2
##   giganteus            12        29        7
##   melanops             15         9       23
1-sum(diag(tt))/sum(tt)
## [1] 0.3402778
  • May instead use the Principle components for constructing the tree for lower error rate (0.25)

Classification using forests

cvr <- rfcv(newko[,-1],newko[,1],step=0.9)
cbind(nvars=cvr$n.var,error.rate=cvr$error.cv)
##   nvars error.rate
## 9     9  0.5138889
## 8     8  0.4652778
## 7     7  0.4722222
## 6     6  0.4722222
## 5     5  0.4791667
## 4     4  0.5208333
## 3     3  0.5763889
## 2     2  0.5694444
## 1     1  0.5833333
fmod <- randomForest(species ~ ., newko, mtry=6)
(tt <- table(actual=newko$species,predicted=predict(fmod)))
##              predicted
## actual        fuliginosus giganteus melanops
##   fuliginosus          37         5        7
##   giganteus             5        21       22
##   melanops             11        20       16
1-sum(diag(tt))/sum(tt)
## [1] 0.4861111

Use principal components on the predictors for rescue

pck <- princomp(newko[,-1])
pcdf <- data.frame(species = newko$species, pck$scores)
fmod <- randomForest(species ~ ., pcdf, mtry=6)
tt <- table(actual = newko$species, predicted=predict(fmod))
1-sum(diag(tt))/sum(tt)
## [1] 0.3541667