Logistic regression, Poisson regression, odds ratios, and model selection.
When the response variable is binary (yes/no), a count, or otherwise non-normal, ordinary least squares is inappropriate. Generalized linear models (GLMs) extend lm() by specifying a family: a response distribution paired with a link function that connects the linear predictor to the mean of the response.
| Response Type | Family | Link | Use Case |
|---|---|---|---|
| Binary (0/1) | binomial | logit | Classification, pass/fail |
| Count | poisson | log | Event counts, arrivals |
| Continuous, positive | Gamma | inverse | Insurance claims, wait times |
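Because the Gaussian family with an identity link reduces to ordinary least squares, a quick sanity check (a sketch using the built-in mtcars data) is that glm() reproduces lm() exactly:

```r
# A Gaussian GLM with the identity link is the same model lm() fits,
# so the estimated coefficients agree to numerical tolerance.
fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(mpg ~ wt, family = gaussian(link = "identity"), data = mtcars)
all.equal(coef(fit_lm), coef(fit_glm))
```

Choosing a non-Gaussian family is what changes the model; the fitting interface stays the same.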
```r
# Create a binary outcome: above-median mpg
mtcars2 <- mtcars |>
  dplyr::mutate(high_mpg = ifelse(mpg > median(mpg), 1, 0))

# Fit logistic regression
logit_fit <- glm(high_mpg ~ wt + hp + am,
                 family = binomial(link = "logit"),
                 data = mtcars2)
summary(logit_fit)
```
```r
# Log-odds coefficients
coef(logit_fit)

# Convert to odds ratios
exp(coef(logit_fit))

# Odds ratios with 95% CI
exp(cbind(OR = coef(logit_fit), confint(logit_fit)))
```
Interpretation example: an odds ratio of 0.75 for wt would mean that for each 1-unit increase in weight, the odds of high mpg are multiplied by 0.75 (i.e., decrease by 25%), holding the other variables constant.
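The arithmetic behind that reading is worth making explicit. A small sketch, using 0.75 as the illustrative odds ratio from above:

```r
# Moving between the log-odds and odds-ratio scales
b_wt <- log(0.75)        # the coefficient as it appears in coef(logit_fit)
exp(b_wt)                # back to the odds-ratio scale: 0.75
100 * (exp(b_wt) - 1)    # percent change in the odds: -25
```

An odds ratio below 1 always corresponds to a negative log-odds coefficient, and vice versa.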
```r
# Predicted probabilities on original data
mtcars2$pred_prob <- predict(logit_fit, type = "response")

# For new observations
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150), am = c(1, 0))
predict(logit_fit, newdata = new_cars, type = "response")
```
Since logistic coefficients are on the log-odds scale, marginal effects re-express them on the probability scale: the change in predicted probability for a one-unit change in X. The average marginal effects computed below average this effect over all observations; an alternative convention evaluates it at the means of the other variables.
```r
library(margins)

# Average marginal effects
mfx <- margins(logit_fit)
summary(mfx)

# Plot marginal effects
plot(mfx)
```
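To see what an average marginal effect is, it can be computed by hand for a continuous predictor in a logit model: the derivative of the probability with respect to X at observation i is beta * p_i * (1 - p_i), and the AME is its mean. A self-contained sketch with a small model on the built-in mtcars data:

```r
# AME of wt computed by hand for a simple logit model
fit <- glm(am ~ wt, family = binomial(), data = mtcars)
p <- predict(fit, type = "response")          # fitted probabilities
ame_wt <- mean(coef(fit)["wt"] * p * (1 - p)) # average slope on the probability scale
ame_wt
```

The result is the average change in P(am = 1) per unit of wt, which is what margins() reports for continuous predictors.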
For count data (non-negative integers), Poisson regression models the log of the expected count. The exponentiated coefficients are incidence rate ratios (IRRs).
```r
# Simulate count data
set.seed(123)
n <- 200
count_df <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n)
)
count_df$y <- rpois(n, lambda = exp(1 + 0.5 * count_df$x1 - 0.3 * count_df$x2))

# Fit Poisson model
pois_fit <- glm(y ~ x1 + x2, family = poisson(), data = count_df)
summary(pois_fit)

# Incidence rate ratios with 95% CI
exp(cbind(IRR = coef(pois_fit), confint(pois_fit)))

# Tidy output with IRRs directly
broom::tidy(pois_fit, exponentiate = TRUE, conf.int = TRUE)
```
The Poisson distribution assumes that the variance equals the mean (equidispersion). Real count data often exhibit overdispersion, where the variance exceeds the mean. Two common solutions are quasi-Poisson (which adjusts standard errors) and the negative binomial model (which adds a dispersion parameter).
```r
# Check for overdispersion: ratio of residual deviance to df
with(pois_fit, deviance / df.residual)
# Values much greater than 1 suggest overdispersion

# Formal test: use AER package
library(AER)
dispersiontest(pois_fit)

# Quasi-Poisson: same coefficients, inflated SEs
quasi_fit <- glm(y ~ x1 + x2, family = quasipoisson(), data = count_df)
summary(quasi_fit)
# Note: dispersion parameter is estimated (not fixed at 1)

# Negative binomial regression with MASS
library(MASS)
nb_fit <- glm.nb(y ~ x1 + x2, data = count_df)
summary(nb_fit)

# Compare Poisson vs NB via AIC
AIC(pois_fit, nb_fit)

# IRRs from the negative binomial
exp(cbind(IRR = coef(nb_fit), confint(nb_fit)))
```
When count data contain more zeros than a Poisson or negative binomial distribution predicts, a zero-inflated model splits the process into two parts: (1) a logistic model for whether the count is a "structural zero" vs. coming from the count process, and (2) a Poisson or NB model for the count part.
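The two-part structure implies a simple formula for the zero probability: under a zero-inflated Poisson, P(y = 0) = pi + (1 - pi) * exp(-lambda), versus just exp(-lambda) for a plain Poisson. A quick numeric sketch with assumed values of pi and lambda:

```r
# How much zero inflation raises the zero probability
pi_zero <- 0.3   # assumed probability of a structural zero
lambda  <- 2     # assumed Poisson mean for the count process
p0_zip  <- pi_zero + (1 - pi_zero) * exp(-lambda)
p0_pois <- exp(-lambda)
c(zip = p0_zip, poisson = p0_pois)
```

Whenever pi_zero > 0, the zero-inflated model predicts strictly more zeros than the Poisson with the same lambda, which is exactly the pattern that motivates it.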
```r
library(pscl)

# Simulate data with excess zeros
set.seed(42)
n <- 300
zi_df <- data.frame(x = rnorm(n),
                    group = sample(c("A", "B"), n, replace = TRUE))
zi_df$y <- ifelse(runif(n) < 0.3, 0, rpois(n, exp(1 + 0.5 * zi_df$x)))

# Check zero proportion
mean(zi_df$y == 0)

# Zero-inflated Poisson
# Left of | = count model predictors
# Right of | = zero-inflation model predictors
zip_fit <- zeroinfl(y ~ x + group | x, data = zi_df)
summary(zip_fit)

# Zero-inflated negative binomial
zinb_fit <- zeroinfl(y ~ x + group | x, data = zi_df, dist = "negbin")
summary(zinb_fit)

# Hurdle model (alternative: zeros from one process, positives from another)
hurdle_fit <- hurdle(y ~ x + group | x, data = zi_df)
summary(hurdle_fit)

# Compare against a plain Poisson fit on the same data
# (AIC is only comparable across models fit to the same dataset)
pois_zi <- glm(y ~ x + group, family = poisson(), data = zi_df)
AIC(pois_zi, zip_fit, zinb_fit, hurdle_fit)
```
When the response is an ordered categorical variable (e.g., satisfaction: low/medium/high), ordinal logistic regression (proportional odds model) is appropriate.
```r
library(MASS)

# Create an ordered factor
# (note: cut()'s argument is ordered_result, not ordered)
mtcars2$mpg_cat <- cut(mtcars2$mpg,
                       breaks = c(-Inf, 15, 22, Inf),
                       labels = c("Low", "Medium", "High"),
                       ordered_result = TRUE)

# Fit proportional odds model
ord_fit <- polr(mpg_cat ~ wt + hp + am, data = mtcars2, Hess = TRUE)
summary(ord_fit)

# Proportional odds ratios
exp(coef(ord_fit))
exp(confint(ord_fit))

# Predicted probabilities for each category
predict(ord_fit, type = "probs") |> head()
```
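The mechanics of the proportional odds model can be sketched with hypothetical numbers: cumulative probabilities come from plogis(zeta_j - eta), where the zeta_j are the cutpoints polr() reports and eta is the linear predictor, with the same slopes at every cutpoint (the values below are illustrative, not fitted):

```r
# Category probabilities from hypothetical cutpoints and a linear predictor
eta   <- 0.8                  # hypothetical linear predictor for one observation
zetas <- c(-1, 1)             # hypothetical cutpoints: Low|Medium, Medium|High
p_cum <- plogis(zetas - eta)  # P(Y <= Low), P(Y <= Medium)
p_cat <- c(p_cum[1], diff(p_cum), 1 - p_cum[2])  # per-category probabilities
sum(p_cat)                    # the three probabilities sum to 1
```

Raising eta shifts probability mass toward the higher categories, which is why a single coefficient per predictor suffices.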
The proportional odds assumption (each predictor has the same effect at every category threshold) can be tested with brant::brant(ord_fit). If the assumption is violated, consider a multinomial logistic model instead.
When the response has three or more unordered categories, multinomial logistic regression estimates separate log-odds for each category relative to a baseline.
```r
library(nnet)

# Multinomial logit: predict cylinder category
mtcars2$cyl_f <- factor(mtcars2$cyl)
mnom_fit <- multinom(cyl_f ~ wt + hp + am, data = mtcars2)
summary(mnom_fit)

# Exponentiate for relative risk ratios (vs. baseline category)
exp(coef(mnom_fit))

# Predicted probabilities
predict(mnom_fit, type = "probs") |> head()

# Predicted class
predict(mnom_fit, type = "class") |> table()
```
For binary classification models, the Receiver Operating Characteristic (ROC) curve plots sensitivity against 1-specificity across all possible thresholds. The area under the curve (AUC) summarizes discriminative ability.
```r
library(pROC)

# ROC curve for the logistic model
roc_obj <- roc(mtcars2$high_mpg, predict(logit_fit, type = "response"))

# AUC
auc(roc_obj)

# Plot the ROC curve
plot(roc_obj, main = "ROC Curve: High MPG Prediction",
     col = "#276DC3", lwd = 2, print.auc = TRUE)

# Optimal threshold (Youden's J)
best <- coords(roc_obj, "best",
               ret = c("threshold", "sensitivity", "specificity"))
best

# Comparing two models' ROC curves
m_simple <- glm(high_mpg ~ wt, family = binomial(), data = mtcars2)
roc1 <- roc(mtcars2$high_mpg, predict(m_simple, type = "response"))
roc2 <- roc(mtcars2$high_mpg, predict(logit_fit, type = "response"))
roc.test(roc1, roc2)  # DeLong test for AUC difference

# Calibration: predicted vs. observed proportions
mtcars2$pred_prob <- predict(logit_fit, type = "response")
mtcars2$prob_bin <- cut(mtcars2$pred_prob, breaks = seq(0, 1, 0.2))
cal_df <- mtcars2 |>
  dplyr::group_by(prob_bin) |>
  dplyr::summarize(observed = mean(high_mpg),
                   predicted = mean(pred_prob),
                   n = dplyr::n())
cal_df
```
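A useful way to internalize the AUC is its probabilistic reading: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A brute-force sketch on simulated scores (ties ignored):

```r
# AUC as a pairwise ranking probability
set.seed(1)
score <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # positives score higher on average
label <- rep(c(1, 0), each = 50)

# Fraction of (positive, negative) pairs where the positive outranks the negative
auc_hand <- mean(outer(score[label == 1], score[label == 0], ">"))
auc_hand
```

This pairwise estimate converges to the same quantity pROC's auc() reports for a trapezoidal ROC curve.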
```r
# Fit competing models
m1 <- glm(high_mpg ~ wt, family = binomial(), data = mtcars2)
m2 <- glm(high_mpg ~ wt + hp, family = binomial(), data = mtcars2)
m3 <- glm(high_mpg ~ wt + hp + am, family = binomial(), data = mtcars2)

# Compare AIC (lower is better)
AIC(m1, m2, m3)
BIC(m1, m2, m3)

# Likelihood ratio test for nested models
anova(m1, m3, test = "Chisq")
```
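As a sketch of what anova(..., test = "Chisq") computes under the hood: the test statistic is the drop in residual deviance between the nested models, referred to a chi-square distribution with df equal to the number of added parameters. Illustrated with small nested logit models on the built-in mtcars data:

```r
# Likelihood ratio test by hand for nested GLMs
m_small <- glm(am ~ wt, family = binomial(), data = mtcars)
m_big   <- glm(am ~ wt + hp, family = binomial(), data = mtcars)

lr_stat <- deviance(m_small) - deviance(m_big)      # drop in deviance
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
c(statistic = lr_stat, p = p_value)
```

The deviance of the larger model can never exceed that of the nested smaller one, so the statistic is always non-negative.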
```r
library(broom)

# Odds ratios directly in the estimate column
tidy(logit_fit, conf.int = TRUE, exponentiate = TRUE)

# One-row model summary: AIC, BIC, deviance, df.residual, nobs
glance(logit_fit)
```
Using the built-in Titanic dataset (convert it to a data frame with as.data.frame(Titanic)), fit a logistic regression predicting Survived from Class, Sex, and Age (use the Freq column as weights). Report the odds ratios and identify the strongest predictor.
Simulate 500 Poisson-distributed observations with lambda = exp(0.5 + 0.8*x) where x ~ N(0,1). Fit a Poisson GLM and verify that the recovered coefficients are close to 0.5 (intercept) and 0.8 (slope). Check for overdispersion.
Simulate 400 observations where 30% are structural zeros and the remaining follow a Poisson distribution with lambda = exp(1 + 0.5*x). Fit both a standard Poisson model and a zero-inflated Poisson model using pscl::zeroinfl(). Compare the models with AIC. Does the zero-inflated model fit better? Interpret the coefficients from both the count and zero-inflation parts.
Using the iris dataset, create an ordered factor from Sepal.Length with categories "Short" (< 5.5), "Medium" (5.5-6.5), and "Long" (> 6.5). Fit an ordinal logistic regression predicting this variable from Petal.Length and Species. Report the proportional odds ratios. Then fit a multinomial logistic model with nnet::multinom() and compare the predicted classifications from both models.
- glm(y ~ x, family = binomial()) fits logistic regression for binary outcomes.
- Use exp(coef()) to get odds ratios (logistic) or incidence rate ratios (Poisson).
- Use margins() to compute average marginal effects on the probability scale.
- For overdispersed counts, use quasipoisson() or MASS::glm.nb() as alternatives.
- Zero-inflated models (pscl::zeroinfl()) handle excess zeros; hurdle models separate the zero/positive processes.
- Use MASS::polr() for ordinal outcomes and nnet::multinom() for unordered multi-category outcomes.
- ROC curves and AUC (pROC) assess discrimination; calibration plots assess probability accuracy.
- Compare models with AIC/BIC and anova(..., test = "Chisq") for nested models.