Chapter 7: Generalized Linear Models

Logistic regression, Poisson regression, odds ratios, and model selection.

7.1 Why GLMs?

When the response variable is binary (yes/no), a count, or otherwise non-normal, ordinary least squares is inappropriate. Generalized linear models extend lm() by specifying a family (distribution + link function).

Response Type          Family     Link      Use Case
Binary (0/1)           binomial   logit     Classification, pass/fail
Count                  poisson    log       Event counts, arrivals
Continuous, positive   Gamma      inverse   Insurance claims, wait times
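The family argument is what distinguishes these models in code; the formula interface is identical to lm(). A minimal sketch using mtcars (the response choices here are illustrative, not substantive models):

```r
# Same modeling interface, different family/link pairs
fit_binom <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)   # binary
fit_pois  <- glm(carb ~ wt, family = poisson(link = "log"), data = mtcars)    # count
fit_gamma <- glm(mpg ~ wt, family = Gamma(link = "inverse"), data = mtcars)   # positive continuous

# Each fit records its family and link
fit_binom$family
```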

7.2 Logistic Regression

# Create a binary outcome: above-median mpg
mtcars2 <- mtcars |>
  dplyr::mutate(high_mpg = ifelse(mpg > median(mpg), 1, 0))

# Fit logistic regression
logit_fit <- glm(high_mpg ~ wt + hp + am,
                  family = binomial(link = "logit"),
                  data = mtcars2)
summary(logit_fit)

Interpreting Coefficients as Odds Ratios

# Log-odds coefficients
coef(logit_fit)

# Convert to odds ratios
exp(coef(logit_fit))

# Odds ratios with 95% CI
exp(cbind(OR = coef(logit_fit), confint(logit_fit)))
Reading odds ratios: An OR of 0.75 for wt means that for each 1-unit increase in weight, the odds of high mpg are multiplied by 0.75 (i.e., they decrease by 25%), holding other variables constant.
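Odds ratios are multiplicative on the odds scale, not the probability scale. To move between scales, plogis() is the inverse logit in base R. A quick illustration with a made-up log-odds value (not model output):

```r
# Log-odds -> odds -> probability (illustrative value)
log_odds <- 0.4
odds <- exp(log_odds)     # ~1.49
prob <- plogis(log_odds)  # equivalently odds / (1 + odds)

all.equal(prob, odds / (1 + odds))  # TRUE
```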

7.3 Predicted Probabilities

# Predicted probabilities on original data
mtcars2$pred_prob <- predict(logit_fit, type = "response")

# For new observations
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150), am = c(1, 0))
predict(logit_fit, newdata = new_cars, type = "response")

7.4 Marginal Effects

Since logistic coefficients are on the log-odds scale, they are hard to interpret directly. Marginal effects translate them into changes in predicted probability: the margins package reports average marginal effects (AMEs), the effect of a one-unit change in each predictor on the probability, averaged across all observations.

library(margins)

# Average marginal effects
mfx <- margins(logit_fit)
summary(mfx)

# Plot marginal effects
plot(mfx)

7.5 Poisson Regression and Incidence Rate Ratios

For count data (non-negative integers), Poisson regression models the log of the expected count. The exponentiated coefficients are incidence rate ratios (IRRs).

# Simulate count data
set.seed(123)
n <- 200
count_df <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n)
)
count_df$y <- rpois(n, lambda = exp(1 + 0.5 * count_df$x1 - 0.3 * count_df$x2))

# Fit Poisson model
pois_fit <- glm(y ~ x1 + x2, family = poisson(), data = count_df)
summary(pois_fit)

# Incidence rate ratios with 95% CI
exp(cbind(IRR = coef(pois_fit), confint(pois_fit)))

# Tidy output with IRRs directly
broom::tidy(pois_fit, exponentiate = TRUE, conf.int = TRUE)
Interpreting incidence rate ratios (IRRs): An IRR of 1.65 for x1 means a one-unit increase in x1 is associated with a 65% increase in the expected count, holding other variables constant. An IRR of 0.74 means a 26% decrease. IRRs below 1 indicate a reducing (protective) effect; IRRs above 1 indicate an increasing effect.

7.6 Overdispersion: Quasi-Poisson and Negative Binomial

The Poisson distribution assumes that the variance equals the mean (equidispersion). Real count data often exhibit overdispersion, where the variance exceeds the mean. Two common solutions are quasi-Poisson (which adjusts standard errors) and the negative binomial model (which adds a dispersion parameter).

# Check for overdispersion: ratio of residual deviance to df
with(pois_fit, deviance / df.residual)
# Values much greater than 1 suggest overdispersion

# Formal test: use AER package
library(AER)
dispersiontest(pois_fit)

# Quasi-Poisson: same coefficients, inflated SEs
quasi_fit <- glm(y ~ x1 + x2, family = quasipoisson(), data = count_df)
summary(quasi_fit)
# Note: dispersion parameter is estimated (not fixed at 1)

# Negative binomial regression with MASS
library(MASS)  # note: MASS masks dplyr::select()
nb_fit <- glm.nb(y ~ x1 + x2, data = count_df)
summary(nb_fit)

# Compare Poisson vs NB via AIC
AIC(pois_fit, nb_fit)

# IRRs from the negative binomial
exp(cbind(IRR = coef(nb_fit), confint(nb_fit)))
When to use which count model: Use Poisson when the variance approximately equals the mean. Use quasi-Poisson for mild overdispersion (it adjusts SEs without changing coefficients). Use negative binomial for substantial overdispersion (it estimates a separate dispersion parameter). If you have excess zeros beyond what Poisson or NB predicts, consider zero-inflated models (Section 7.7).

7.7 Zero-Inflated Models

When count data contain more zeros than a Poisson or negative binomial distribution predicts, a zero-inflated model splits the process into two parts: (1) a logistic model for whether the count is a "structural zero" vs. coming from the count process, and (2) a Poisson or NB model for the count part.

library(pscl)

# Simulate data with excess zeros
set.seed(42)
n <- 300
zi_df <- data.frame(x = rnorm(n), group = sample(c("A", "B"), n, replace = TRUE))
zi_df$y <- ifelse(runif(n) < 0.3, 0,
                  rpois(n, exp(1 + 0.5 * zi_df$x)))

# Check zero proportion
mean(zi_df$y == 0)

# Zero-inflated Poisson
zip_fit <- zeroinfl(y ~ x + group | x, data = zi_df)
summary(zip_fit)
# Left of | = count model predictors
# Right of | = zero-inflation model predictors

# Zero-inflated negative binomial
zinb_fit <- zeroinfl(y ~ x + group | x, data = zi_df, dist = "negbin")
summary(zinb_fit)

# Hurdle model (alternative: zeros from one process, positives from another)
hurdle_fit <- hurdle(y ~ x + group | x, data = zi_df)
summary(hurdle_fit)

# Compare models -- AIC is only comparable across models fit to the
# same data, so first fit a standard Poisson on zi_df
pois_zi <- glm(y ~ x + group, family = poisson(), data = zi_df)
AIC(pois_zi, zip_fit, zinb_fit, hurdle_fit)
Zero-inflated vs. hurdle models: Zero-inflated models assume zeros come from two sources: "structural" zeros (the event cannot occur) and "sampling" zeros (it could occur but did not). Hurdle models treat all zeros as coming from a single binary process and model positive counts separately. Choose based on your research context: zero-inflated when some units structurally cannot have counts, hurdle when the decision to participate differs from the intensity.
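One way to see the difference in practice is to compare each model's predicted probability of a zero count. A sketch assuming the zip_fit and hurdle_fit objects from above (pscl's predict methods accept type = "prob", returning a matrix whose first column is P(y = 0)):

```r
# Predicted P(y = 0) under each model, first few observations
head(predict(zip_fit, type = "prob")[, 1])
head(predict(hurdle_fit, type = "prob")[, 1])
```

The two sets of probabilities will usually be close, but they decompose the zeros differently, which matters for interpretation.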

7.8 Ordinal Logistic Regression

When the response is an ordered categorical variable (e.g., satisfaction: low/medium/high), ordinal logistic regression (proportional odds model) is appropriate.

library(MASS)

# Create an ordered factor
mtcars2$mpg_cat <- cut(mtcars2$mpg,
                       breaks = c(-Inf, 15, 22, Inf),
                       labels = c("Low", "Medium", "High"),
                       ordered_result = TRUE)

# Fit proportional odds model
ord_fit <- polr(mpg_cat ~ wt + hp + am, data = mtcars2, Hess = TRUE)
summary(ord_fit)

# Proportional odds ratios
exp(coef(ord_fit))
exp(confint(ord_fit))

# Predicted probabilities for each category
predict(ord_fit, type = "probs") |> head()
Testing the proportional odds assumption: The proportional odds model assumes the effect of each predictor is the same across all cutpoints. Test this with brant::brant(ord_fit). If the assumption is violated, consider a multinomial logistic model instead.
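The Brant test mentioned above can be run as follows (a sketch assuming the brant package is installed):

```r
# install.packages("brant")  # if needed
library(brant)

# Tests proportional odds for each predictor and overall;
# small p-values indicate the assumption is violated
brant(ord_fit)
```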

7.9 Multinomial Logistic Regression

When the response has three or more unordered categories, multinomial logistic regression estimates separate log-odds for each category relative to a baseline.

library(nnet)

# Multinomial logit: predict cylinder category
mtcars2$cyl_f <- factor(mtcars2$cyl)
mnom_fit <- multinom(cyl_f ~ wt + hp + am, data = mtcars2)
summary(mnom_fit)

# Exponentiate for relative risk ratios (vs. baseline category)
exp(coef(mnom_fit))

# Predicted probabilities
predict(mnom_fit, type = "probs") |> head()

# Predicted class
predict(mnom_fit, type = "class") |> table()

7.10 ROC Curves and Calibration

For binary classification models, the Receiver Operating Characteristic (ROC) curve plots sensitivity (true positive rate) against 1 − specificity (false positive rate) across all possible classification thresholds. The area under the curve (AUC) summarizes discriminative ability.

library(pROC)

# ROC curve for the logistic model
roc_obj <- roc(mtcars2$high_mpg, predict(logit_fit, type = "response"))

# AUC
auc(roc_obj)

# Plot the ROC curve
plot(roc_obj, main = "ROC Curve: High MPG Prediction",
     col = "#276DC3", lwd = 2, print.auc = TRUE)

# Optimal threshold (Youden's J)
best <- coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))
best

# Comparing two models' ROC curves
m_simple <- glm(high_mpg ~ wt, family = binomial(), data = mtcars2)
roc1 <- roc(mtcars2$high_mpg, predict(m_simple, type = "response"))
roc2 <- roc(mtcars2$high_mpg, predict(logit_fit, type = "response"))
roc.test(roc1, roc2)  # DeLong test for AUC difference

# Calibration: predicted vs. observed proportions
mtcars2$pred_prob <- predict(logit_fit, type = "response")
mtcars2$prob_bin <- cut(mtcars2$pred_prob, breaks = seq(0, 1, 0.2))
cal_df <- mtcars2 |>
  dplyr::group_by(prob_bin) |>
  dplyr::summarize(observed = mean(high_mpg),
                    predicted = mean(pred_prob),
                    n = dplyr::n())
cal_df
AUC interpretation guidelines: AUC = 0.5 is no better than random guessing; 0.7-0.8 is acceptable discrimination; 0.8-0.9 is excellent; above 0.9 is outstanding. A well-calibrated model should have predicted probabilities that closely match observed proportions within each bin.
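The calibration table above can be turned into a simple plot; points near the 45-degree line indicate good calibration. A sketch assuming the cal_df object computed above and ggplot2 installed:

```r
library(ggplot2)

# Predicted vs. observed proportions per bin, with point size showing bin counts
ggplot(cal_df, aes(x = predicted, y = observed)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_point(aes(size = n)) +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(x = "Mean predicted probability",
       y = "Observed proportion",
       title = "Calibration plot")
```

With only 32 observations in mtcars2, expect the bins to be noisy; the technique is more informative on larger datasets.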

7.11 Model Comparison: AIC and BIC

# Fit competing models
m1 <- glm(high_mpg ~ wt, family = binomial(), data = mtcars2)
m2 <- glm(high_mpg ~ wt + hp, family = binomial(), data = mtcars2)
m3 <- glm(high_mpg ~ wt + hp + am, family = binomial(), data = mtcars2)

# Compare AIC (lower is better)
AIC(m1, m2, m3)
BIC(m1, m2, m3)

# Likelihood ratio test for nested models
anova(m1, m3, test = "Chisq")

7.12 Tidy GLM Output

library(broom)

tidy(logit_fit, conf.int = TRUE, exponentiate = TRUE)
# Gives odds ratios directly in the estimate column

glance(logit_fit)
# AIC, BIC, deviance, df.residual, nobs

Exercises

Exercise 7.1

Using the built-in Titanic dataset (convert it to a data frame with as.data.frame(Titanic)), fit a logistic regression predicting Survived from Class, Sex, and Age (use the Freq column as weights). Report the odds ratios and identify the strongest predictor.

Exercise 7.2

Simulate 500 Poisson-distributed observations with lambda = exp(0.5 + 0.8*x) where x ~ N(0,1). Fit a Poisson GLM and verify that the recovered coefficients are close to 0.5 (intercept) and 0.8 (slope). Check for overdispersion.

Exercise 7.3

Simulate 400 observations where 30% are structural zeros and the remaining follow a Poisson distribution with lambda = exp(1 + 0.5*x). Fit both a standard Poisson model and a zero-inflated Poisson model using pscl::zeroinfl(). Compare the models with AIC. Does the zero-inflated model fit better? Interpret the coefficients from both the count and zero-inflation parts.

Exercise 7.4

Using the iris dataset, create an ordered factor from Sepal.Length with categories "Short" (< 5.5), "Medium" (5.5-6.5), and "Long" (> 6.5). Fit an ordinal logistic regression predicting this variable from Petal.Length and Species. Report the proportional odds ratios. Then fit a multinomial logistic model with nnet::multinom() and compare the predicted classifications from both models.
