Binomial Regression

Announcement
Acknowledgement
O-ring example
Descriptive statistics
Binomial model
Logistic regression
Prediction
Goodness of fit
Pearson \(X^2\)
Diagnostics
Overdispersion
Quasi-binomial
Beta regression

sessionInfo()

## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33   R6_2.5.1        fastmap_1.1.1   xfun_0.39      
##  [5] cachem_1.0.8    knitr_1.42      htmltools_0.5.5 rmarkdown_2.21 
##  [9] cli_3.6.1       sass_0.4.6      jquerylib_0.1.4 compiler_4.3.0 
## [13] rstudioapi_0.14 tools_4.3.0     evaluate_0.20   bslib_0.4.2    
## [17] yaml_2.3.7      rlang_1.1.1     jsonlite_1.8.4

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(faraway)

Announcement

Will post presentation order on Friday lab

Acknowledgement

Dr. Hua Zhou’s slides

O-ring example

In January 1986, the space shuttle Challenger exploded 73 seconds after launch.
The culprit is the O-ring seals in the rocket boosters. At lower temperatures, rubber becomes more brittle and is a less effective sealant. At the time of the launch, the temperature was 31°F.
Could the failure of the O-rings have been predicted?
Data: 23 previous shuttle missions. Each shuttle had 2 boosters, each with 3 O-rings. We know the number of O-rings out of 6 showing some damage and the launch temperature.

orings <- orings %>%
  as_tibble(rownames = "mission") %>%
  print(n = Inf)

## # A tibble: 23 × 3
##    mission  temp damage
##    <chr>   <dbl>  <dbl>
##  1 1          53      5
##  2 2          57      1
##  3 3          58      1
##  4 4          63      1
##  5 5          66      0
##  6 6          67      0
##  7 7          67      0
##  8 8          67      0
##  9 9          68      0
## 10 10         69      0
## 11 11         70      1
## 12 12         70      0
## 13 13         70      1
## 14 14         70      0
## 15 15         72      0
## 16 16         73      0
## 17 17         75      0
## 18 18         75      1
## 19 19         76      0
## 20 20         76      0
## 21 21         78      0
## 22 22         79      0
## 23 23         81      0

Descriptive statistics

There seems a pattern: lower temperature, more damages.

oring_plot <- orings %>%
  mutate(failrate = damage / 6) %>%
  ggplot(mapping = aes(x = temp, y = failrate)) + 
  geom_point() + 
  labs(x = "Temperature (F)", y = "Proportion of damages", title = "O-ring damage data")
plot(oring_plot)

Binomial model

We model \(Y_i\) as a Binomial random variable with batch size \(m_i\) and “success” probability \(p_i\) \[ \mathbb{P}(Y_i = y_i) = \binom{m_i}{y_i} p_i^{y_i} (1 - p_i)^{m_i - y_i}. \]
The parameter \(p_i\) is linked to the predictors \(X_1, \ldots, X_{q}\) via an inverse link function \[ p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}, \] where \(\eta_i\) is the linear predictor or systematic component \[ \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{q} x_{iq} = \mathbf{x}_i^T \boldsymbol{\beta} \]
The log-likelihood is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \left[ y_i \log p_i + (m_i - y_i) \log (1 - p_i) + \log \binom{m_i}{y_i} \right] \\ &=& \sum_{i=1}^n \left[ y_i \eta_i - m_i \log ( 1 + e^{\eta_i}) + \log \binom{m_i}{y_i} \right] \\ &=& \sum_{i=1}^n \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - m_i \log ( 1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) + \log \binom{m_i}{y_i} \right]. \end{eqnarray*}\]
The Bernoulli model in ELMR 2 is a special case with all batch sizes \(m_i = 1\).
Conversely, the Binomial model is equivalent to a Bernoulli model with \(\sum_i m_i\) observations, or a Bernoulli model with observation weights \((y_i, m_i - y_i)\).

Logistic regression

For Binomial model, we input \((\text{successes}_i, \text{failures}_i)\) as responses.

library(gtsummary)
lmod <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial, data = orings)
lmod %>%
  tbl_regression(intercept = TRUE)

Characteristic	log(OR)¹	95% CI¹	p-value
(Intercept)	12	5.6, 19	<0.001
temp	-0.22	-0.33, -0.12	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Let’s fit an equivalent (weighted) Bernoulli model. Not surprisingly, we get identical estimate.

obs_wt = c(rbind(orings$damage, 6 - orings$damage))
orings %>%
  slice(rep(1:n(), each = 2)) %>% # replicate each row twice
  mutate(damage = rep(c(1, 0), 23)) %>%
  mutate(obs_wt = obs_wt) %>%
  glm(damage ~ temp, weights = obs_wt, family = binomial, data = .) %>%
  tbl_regression(intercept = TRUE)

Characteristic	log(OR)¹	95% CI¹	p-value
(Intercept)	12	5.6, 19	<0.001
temp	-0.22	-0.33, -0.12	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Prediction

Now we can predict the failure probability. The failure probability at 31F is nearly 1.

tibble(temp     = seq(25, 85, 0.1), 
       predprob = predict(lmod, newdata = tibble(temp = temp), type = "response")) %>%
  ggplot() + 
  geom_line(mapping = aes(x = temp, y = predprob)) + 
  geom_vline(xintercept = 31) +
  labs(x = "Temperature (F)", y = "Predicted failure probability")

Goodness of fit

To evaluate the goodness of fit of model, we compare it to the full model with number of parameters equal to the number of observations. That is the full/saturated model estimates \(p_i\) by \(y_i/m_i\). Therefore the deviance is \[\begin{eqnarray*} D &=& 2 \sum_i y_i \log(y_i/m_i) + (m_i - y_i) \log(1 - y_i / m_i) \\ & & - 2 \sum_i y_i \log(\widehat{p}_i) + (m_i - y_i) \log(1 - \widehat{p}_i) \\ &=& 2 \sum_i y_i \log(y_i / \widehat{y}_i) + (m_i - y_i) \log(m_i - y_i)/(m_i - \widehat{y}_i), \end{eqnarray*}\] where \(\widehat{y}_i\) are the fitted values from the model.
When \(Y\) is truely binomial and \(m_i\) are relatively large, the deviance \(D\) is approximately \(\chi_{n-q-1}^2\) if the model is correct. A rule of thumb to use this asymptotic approximation is \(m_i \ge 5\). The large p-value indicates that the model has an adequate fit.

pchisq(lmod$deviance, lmod$df.residual, lower = FALSE)

## [1] 0.7164099

Significance of the predictor temp can be evaluated using a similar analysis of deviance test.

drop1(lmod, test = "Chi")

## Single term deletions
## 
## Model:
## cbind(damage, 6 - damage) ~ temp
##        Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>      16.912 33.675                     
## temp    1   38.898 53.660 21.985 2.747e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pearson \(X^2\)

The Pearson \(X^2\) statistic takes the form \[ X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}. \] In the case of binomial model, we get \[\begin{eqnarray*} X^2 &=& \sum_i \frac{(y_i - n_i \widehat{p}_i)^2}{n_i \widehat{p}_i} + \frac{[(n_i - y_i) - n_i (1 - \widehat{p}_i)]^2}{n_i (1 - \widehat{p}_i)} \\ &=& \sum_i \frac{(y_i - n_i \widehat{p}_i)^2}{n_i \widehat{p}_i (1 - \widehat{p}_i)}. \end{eqnarray*}\] Comparing Pearson \(X^2\) statistic to the asymptotic null distribution \(\chi_{n - q - 1}^2\) gives a goodness of fit test.

predprob <- predict(lmod, type = "response")
# Pearson X2 statistic
(px2 <- sum((orings$damage - 6 * predprob)^2 / (6 * predprob * (1 - predprob))))

## [1] 28.06738

pchisq(px2, lmod$df.residual, lower.tail = FALSE)

## [1] 0.1382507

This large p-value indicates that this model fits data well.

We can define the Perason residuals as \[ r_i^{\text{P}} = \frac{y_i - n_i \widehat{p}_i}{\sqrt{\operatorname{Var} \widehat{y}_i}}. \] Then \[ X^2 = \sum_i \left(r_i^{\text{P}} \right)^2 \] in analogy to \(\text{RSS} = \sum_i r_i^2\) in linear regression.

(presid <- residuals(lmod, type = "pearson"))

##          1          2          3          4          5          6          7 
##  1.3928147 -0.8972668 -0.6821441  0.3214099 -0.6647553 -0.5966330 -0.5966330 
##          8          9         10         11         12         13         14 
## -0.5966330 -0.5354916 -0.4806159  1.9587595 -0.4313637  1.9587595 -0.4313637 
##         15         16         17         18         19         20         21 
## -0.3474838 -0.3118746 -0.2512296  3.7710639 -0.2254843 -0.2254843 -0.1816382 
##         22         23 
## -0.1630244 -0.1313239

Diagnostics

Plot deviance residuals against linear predictors to identify potential high residual observations.

orings %>%
  mutate(devres  = residuals(lmod, type = "deviance"),
         linpred = predict(lmod, type = "link")) %>%
  ggplot + 
  geom_point(mapping = aes(x = linpred, y = devres)) +
  labs(x = "Linear predictor", y = "Deviance residual")

Plot sorted hat values against half-normal quantiles to identify potential high leverage observations.

halfnorm(hatvalues(lmod))

orings %>%
  slice(c(1, 2))

## # A tibble: 2 × 3
##   mission  temp damage
##   <chr>   <dbl>  <dbl>
## 1 1          53      5
## 2 2          57      1

Plot sorted Cook distances against the half-normal quantiles to identify potential high influential observations.

halfnorm(cooks.distance(lmod))

orings %>%
  slice(c(1, 18))

## # A tibble: 2 × 3
##   mission  temp damage
##   <chr>   <dbl>  <dbl>
## 1 1          53      5
## 2 18         75      1

Overdispersion

If we detect a lack of fit (unusual large deviance or Pearson \(X^2\)), we consider following causes.
1. Wrong systematic component. Most common explanation is we have the wrong structural form for the model. We have not included the right predictors or we have not transformed or combined them in the correct way.
2. Outliers. A few outliers with large deviance. If there are many, then we may have assumed the wrong error distribution for \(Y\).
3. Sparse data. If binomial batch sizes are all small, then the \(\chi^2\) approximation is bad. Rule of thumb for using the \(\chi^2\) approximation is \(m_i \ge 5\) for all batches.
4. Overdisperson or underdisperson.
Binomial distribution arises when we assume that the trials in a batch are independent and share the same “success” probability.
- Are the failures of those 6 O-rings in a space shuttle really independent of each other?
- Are the failure probability of those 6 O-rings in a space shuttle same?
Violation of the iid (identically and independently distributed) assumption of Bernoulli trials can lead to inflation of variance.
We can relax the assumption of binomial distribution by allowing an overdispersion parameter \[ \operatorname{Var} Y = \sigma^2 m p (1 - p). \] The dispersion parameter \(\sigma^2\) can be estimated by \[ \widehat{\sigma}^2 = \frac{X^2}{n - q - 1}, \] where \(q\) is the number of parameters consumed by the logistic regression. Estimation of \(\boldsymbol{\beta}\) is unchanged but the inference is changed \[ \operatorname{Var} \widehat{\boldsymbol{\beta}} = \widehat{\sigma}^2 (\mathbf{X}^T \widehat{\mathbf{W}} \mathbf{X})^{-1}. \]
Analysis of deviance by differences in deviances cannot be used because now the test statistic is distributed as \(\sigma^2 \chi^2\). Since \(\sigma^2\) is estimated, we can use \(F\) test by comparing \[ F = \frac{(D_{\text{small}} - D_{\text{large}})/ (\text{df}_{\text{small}} - \text{df}_{\text{large}})}{\widehat{\sigma}^2} \] to the (asymptotic) F distribution with degrees of freedom \(\text{df}_{\text{small}} - \text{df}_{\text{large}}\) and \(n-p\).
The troutegg data set records the number of surviving trout eggs at 5 stream locations and retrieved at 4 different times.

troutegg

##    survive total location period
## 1       89    94        1      4
## 2      106   108        2      4
## 3      119   123        3      4
## 4      104   104        4      4
## 5       49    93        5      4
## 6       94    98        1      7
## 7       91   106        2      7
## 8      100   130        3      7
## 9       80    97        4      7
## 10      11   113        5      7
## 11      77    86        1      8
## 12      87    96        2      8
## 13      88   119        3      8
## 14      67    99        4      8
## 15      18    88        5      8
## 16     141   155        1     11
## 17     104   122        2     11
## 18      91   125        3     11
## 19     111   132        4     11
## 20       0   138        5     11

We fit a binomial model for the two main effects.

library(gtsummary)
bmod <- glm(cbind(survive, total - survive) ~ location + period,
            family = "binomial", data = troutegg)
#bmod %>%
#  tbl_regression(exponentiate = TRUE) %>%
#  bold_labels()
summary(bmod)

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = "binomial", data = troutegg)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
## location2    -0.4168     0.2461  -1.694   0.0903 .  
## location3    -1.2421     0.2194  -5.660 1.51e-08 ***
## location4    -0.9509     0.2288  -4.157 3.23e-05 ***
## location5    -4.6138     0.2502 -18.439  < 2e-16 ***
## period7      -2.1702     0.2384  -9.103  < 2e-16 ***
## period8      -2.3256     0.2429  -9.573  < 2e-16 ***
## period11     -2.4500     0.2341 -10.466  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5

Analysis of deviance for the significance of two factors

drop1(bmod, test = "Chisq")

## Single term deletions
## 
## Model:
## cbind(survive, total - survive) ~ location + period
##          Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>         64.50 157.03                     
## location  4   913.56 998.09 849.06 < 2.2e-16 ***
## period    3   228.57 315.11 164.08 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A deviance value of 64.4951215 on 12 degrees of freedom suggests a lack of fit. Diagnostics (not done here) does not reveal outliers. We estimate the dispersion parameter as

(sigma2 <- sum(residuals(bmod, type = "pearson")^2) / 12)

## [1] 5.330322

which is substantially larger than 1. We can now see the scaled up standard errors

summary(bmod, dispersion = sigma2)

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = "binomial", data = troutegg)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.6495   7.138 9.49e-13 ***
## location2    -0.4168     0.5682  -0.734   0.4632    
## location3    -1.2421     0.5066  -2.452   0.0142 *  
## location4    -0.9509     0.5281  -1.800   0.0718 .  
## location5    -4.6138     0.5777  -7.987 1.39e-15 ***
## period7      -2.1702     0.5504  -3.943 8.05e-05 ***
## period8      -2.3256     0.5609  -4.146 3.38e-05 ***
## period11     -2.4500     0.5405  -4.533 5.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 5.330322)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5

and make F tests on predictors

drop1(bmod, scale = sigma2, test = "F")

## Warning in drop1.glm(bmod, scale = sigma2, test = "F"): F test assumes
## 'quasibinomial' family

## Single term deletions
## 
## Model:
## cbind(survive, total - survive) ~ location + period
## 
## scale:  5.330322 
## 
##          Df Deviance    AIC F value    Pr(>F)    
## <none>         64.50 157.03                      
## location  4   913.56 308.32  39.494 8.142e-07 ***
## period    3   228.57 181.81  10.176  0.001288 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Quasi-binomial

Another way to deal with overdispersion or underdispsersion is quasi-binomial model.
Assume \[\begin{eqnarray*} \mathbb{E} (Y_i) = \mu_i, \quad \operatorname{Var}(Y_i) = \phi V(\mu_i). \end{eqnarray*}\] Define a score \[ U_i = \frac{Y_i - \mu_i}{\phi V(\mu_i)}, \] with \[\begin{eqnarray*} \mathbb{E}(U_i) &=& 0, \quad \operatorname{Var}(U_i) = \frac{1}{\phi V(\mu_i)} \\ - \mathbb{E} \frac{\partial U_i}{\partial \mu_i} &=& - \mathbb{E} \frac{- \phi V(\mu_i) - (Y_i - \mu_i) \phi V'(\mu_i)}{[\phi V(\mu_i)]^2} = \frac{1}{\phi V(\mu_i)}. \end{eqnarray*}\]
\(U_i\) acts like the derivative of a log-likelihood (score function). Thus we define \[ Q_i = \int_{y_i}^{\mu_i} \frac{y_i - t}{\phi V(t)} \, dt \] and treat \[ Q = \sum_{i=1}^n Q_i \] as a log quasi-likelihood.
\(\widehat{\beta}\) is obtained by maximizing \(Q\) and \[ \widehat{\phi} = \frac{X^2}{n - p}. \]
Let’s revisit the troutegg example.

modl <- glm(survive / total ~ location + period, 
            family = "quasibinomial", data = troutegg)
summary(modl)

## 
## Call:
## glm(formula = survive/total ~ location + period, family = "quasibinomial", 
##     data = troutegg)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.4915     0.6133   7.324 9.17e-06 ***
## location2    -0.3677     0.5574  -0.660  0.52191    
## location3    -1.2027     0.5055  -2.379  0.03481 *  
## location4    -0.9569     0.5166  -1.852  0.08874 .  
## location5    -4.4356     0.5529  -8.022 3.66e-06 ***
## period7      -2.0660     0.5137  -4.022  0.00169 ** 
## period8      -2.2046     0.5141  -4.288  0.00105 ** 
## period11     -2.3426     0.5143  -4.555  0.00066 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.04670695)
## 
##     Null deviance: 8.88362  on 19  degrees of freedom
## Residual deviance: 0.58247  on 12  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Individual predictor significance is

drop1(modl, test = "F")

## Single term deletions
## 
## Model:
## survive/total ~ location + period
##          Df Deviance F value    Pr(>F)    
## <none>        0.5825                      
## location  4   7.9985  38.196 9.788e-07 ***
## period    3   2.0545  10.109  0.001325 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see similar p-values as the overdispersion model.

Beta regression

We may also directly model the proportions as a beta distribution \(\text{Beta}(\alpha, \beta)\) with density \[ f(y \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} y^{\alpha - 1}(1 - y)^{\beta - 1}. \] If \(Y \sim \text{Beta}(\alpha, \beta)\), then \[ \mathbb{E}(Y) = \mu = \frac{\alpha}{\alpha + \beta}, \quad \operatorname{Var}(Y) = \frac{\alpha \beta}{(\alpha + \beta)^2(1 + \alpha + \beta)}. \] Then we can link mean \(\mu\) to the linear predictor \(\eta\) using any link function for binomial model.

library(mgcv)

## Loading required package: nlme

## 
## Attaching package: 'nlme'

## The following object is masked from 'package:dplyr':
## 
##     collapse

## This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.

modb <- gam(survive / total ~ location + period,
            family = betar(), data = troutegg)

## Warning in family$saturated.ll(y, prior.weights, theta): saturated likelihood
## may be inaccurate

summary(modb)

## 
## Family: Beta regression(7.992) 
## Link function: logit 
## 
## Formula:
## survive/total ~ location + period
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.8507     0.5757   8.426  < 2e-16 ***
## location2    -0.3353     0.5961  -0.563   0.5738    
## location3    -0.9843     0.5739  -1.715   0.0863 .  
## location4    -0.3002     0.5975  -0.502   0.6153    
## location5    -5.3264     0.6241  -8.534  < 2e-16 ***
## period7      -2.7315     0.5614  -4.865 1.14e-06 ***
## period8      -2.9455     0.5551  -5.306 1.12e-07 ***
## period11     -3.4724     0.5410  -6.419 1.37e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.891   Deviance explained =  110%
## -REML = -67.01  Scale est. = 1         n = 20