Chapter 6: Linear Models — lm() & Diagnostics

Fit OLS regressions, interpret output, check assumptions, and produce publication tables.

6.1 Fitting a Linear Model

The lm() function fits an ordinary least squares regression. The formula interface uses ~ to separate the dependent variable from predictors.

# Simple regression: mpg predicted by weight
fit1 <- lm(mpg ~ wt, data = mtcars)
summary(fit1)

# Multiple regression
fit2 <- lm(mpg ~ wt + hp + am, data = mtcars)
summary(fit2)
Formula shortcuts: y ~ . uses all other columns as predictors; y ~ x1 * x2 includes both main effects and their interaction; y ~ x + I(x^2) adds a quadratic term.
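These shortcuts can be verified by inspecting the terms each formula expands to; a small sketch on mtcars (the fit_* object names are illustrative):

```r
# Each shortcut, spelled out by the model's expanded terms
fit_all  <- lm(mpg ~ ., data = mtcars)             # all other columns as predictors
fit_ia   <- lm(mpg ~ wt * am, data = mtcars)       # expands to wt + am + wt:am
fit_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)  # linear plus quadratic term

attr(terms(fit_ia), "term.labels")    # "wt" "am" "wt:am"
attr(terms(fit_quad), "term.labels")  # "wt" "I(wt^2)"
```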

6.2 Reading the summary() Output

# Key components of summary(fit2):
# Coefficients table: Estimate, Std. Error, t value, Pr(>|t|)
# Multiple R-squared, Adjusted R-squared
# F-statistic and its p-value

# Extract individual pieces
coef(fit2)                   # coefficients vector
confint(fit2)                # 95% confidence intervals
summary(fit2)$r.squared     # R-squared
summary(fit2)$adj.r.squared # Adjusted R-squared

6.3 Interaction Terms and Polynomial Regression

Interactions let the effect of one predictor depend on the value of another. Polynomial terms capture non-linear relationships within the linear model framework.

# Interaction between weight and transmission type
fit_int <- lm(mpg ~ wt * am, data = mtcars)
summary(fit_int)
# wt:am is the interaction term
# Interpretation: the slope of wt differs for manual vs. automatic

# Equivalent explicit syntax
fit_int2 <- lm(mpg ~ wt + am + wt:am, data = mtcars)

# Polynomial regression: quadratic term for wt
fit_poly <- lm(mpg ~ wt + I(wt^2), data = mtcars)
summary(fit_poly)

# Using poly() for orthogonal polynomials (preferred for stability)
fit_poly2 <- lm(mpg ~ poly(wt, 2), data = mtcars)
summary(fit_poly2)

# Three-way interaction (moderator analysis)
fit_3way <- lm(mpg ~ wt * hp * am, data = mtcars)
summary(fit_3way)
When to use interactions: Include an interaction when theory suggests the effect of X1 on Y changes depending on X2. For example, the weight penalty on fuel economy may differ between manual and automatic transmissions. Always include the main effects (wt and am) when modeling an interaction (wt:am); the * operator does this automatically.
Centering predictors before interaction: When two continuous variables interact, center them first (subtract the mean) to reduce multicollinearity and improve interpretability. The main effects then represent the effect at the mean of the other variable: mtcars$wt_c <- scale(mtcars$wt, scale = FALSE).
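A minimal sketch of this advice with two continuous predictors (wt_c and hp_c are illustrative names; wrapping scale() in as.numeric() drops the matrix attributes it returns):

```r
# Center both continuous predictors, then form the interaction
mtcars$wt_c <- as.numeric(scale(mtcars$wt, scale = FALSE))
mtcars$hp_c <- as.numeric(scale(mtcars$hp, scale = FALSE))

fit_centered <- lm(mpg ~ wt_c * hp_c, data = mtcars)
coef(fit_centered)
# The wt_c coefficient is now the slope of weight at mean horsepower
```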

6.4 Estimated Marginal Means with emmeans

When a model contains interactions or factors, raw coefficients are hard to interpret. The emmeans package computes estimated marginal means (predicted values averaged over other factors) and pairwise contrasts.

library(emmeans)

# Fit a model with a factor and interaction
mtcars$cyl_f <- factor(mtcars$cyl)
fit_emm <- lm(mpg ~ wt * cyl_f, data = mtcars)

# Estimated marginal means for each cylinder group (at mean wt)
emm <- emmeans(fit_emm, ~ cyl_f)
emm

# Pairwise contrasts with Tukey adjustment
pairs(emm)

# Estimated marginal means at specific values of wt
emmeans(fit_emm, ~ cyl_f, at = list(wt = c(2.5, 3.5)))

# Visualize the interaction (labs() comes from ggplot2)
library(ggplot2)
emmip(fit_emm, cyl_f ~ wt, at = list(wt = seq(1.5, 5.5, 0.5)),
      CIs = TRUE) +
  labs(y = "Predicted MPG", x = "Weight (1000 lbs)")

6.5 ANOVA Table and Nested Model Comparison

# Test whether hp and am contribute beyond wt alone
anova(fit1, fit2)

# Sequential (Type I) ANOVA table for each predictor
anova(fit2)

# Compare a sequence of nested models
m0 <- lm(mpg ~ 1, data = mtcars)        # intercept only
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + am, data = mtcars)
anova(m0, m1, m2, m3)

# Type III SS (useful with interactions; requires car package)
library(car)
Anova(fit_int, type = "III")
Type I vs. Type II vs. Type III sums of squares: Base R's anova() uses sequential (Type I) SS, which are order-dependent. For models with interactions, use car::Anova(model, type = "III") to get marginal SS that do not depend on predictor ordering. Type II is appropriate when there are no significant interactions.
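The order dependence is easy to demonstrate; a small sketch comparing hp's sequential sum of squares under two predictor orderings:

```r
# Type I (sequential) SS change with the order predictors enter
a1 <- anova(lm(mpg ~ wt + hp, data = mtcars))  # hp entered after wt
a2 <- anova(lm(mpg ~ hp + wt, data = mtcars))  # hp entered first
a1["hp", "Sum Sq"]
a2["hp", "Sum Sq"]   # a different, much larger value

# Type II SS are invariant to predictor ordering (car package)
library(car)
Anova(lm(mpg ~ wt + hp, data = mtcars), type = "II")
```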

6.6 Prediction

# Point predictions for new observations
new_data <- data.frame(wt = c(2.5, 3.0, 3.5),
                       hp = c(100, 120, 150),
                       am = c(1, 0, 0))

predict(fit2, newdata = new_data)

# Confidence interval: uncertainty about the mean response
predict(fit2, newdata = new_data, interval = "confidence")
# Prediction interval: uncertainty about a single new observation (wider)
predict(fit2, newdata = new_data, interval = "prediction")

6.7 Diagnostic Plots

Four standard diagnostic plots help assess OLS assumptions:

# Base R diagnostics (4 plots in one window)
par(mfrow = c(2, 2))
plot(fit2)
par(mfrow = c(1, 1))
Plot                  | Checks                       | Look For
Residuals vs Fitted   | Linearity, homoscedasticity  | Random scatter around 0
Q-Q Plot              | Normality of residuals       | Points on the diagonal line
Scale-Location        | Homoscedasticity             | Flat trend line
Residuals vs Leverage | Influential observations     | Points inside Cook's distance contours
Heteroscedasticity? If residuals fan out, use heteroscedasticity-robust standard errors: lmtest::coeftest(fit2, vcov = sandwich::vcovHC(fit2, type = "HC1")).
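The visual check can be backed by a formal test; a sketch using the Breusch-Pagan test from lmtest (fit2 as fitted above):

```r
library(lmtest)

# Breusch-Pagan test: H0 = constant error variance
bptest(fit2)
# A small p-value is evidence of heteroscedasticity -> switch to robust SEs
```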

6.8 Robust Standard Errors with sandwich and lmtest

OLS standard errors assume homoscedastic, uncorrelated errors. When that assumption fails, robust standard errors provide valid inference without changing the point estimates.

library(sandwich)
library(lmtest)

# HC1 robust SEs (equivalent to Stata's ", robust")
coeftest(fit2, vcov = vcovHC(fit2, type = "HC1"))

# HC3 (better for small samples, the default in many textbooks)
coeftest(fit2, vcov = vcovHC(fit2, type = "HC3"))

# Compare OLS vs. robust SEs side by side
se_ols    <- coeftest(fit2)
se_robust <- coeftest(fit2, vcov = vcovHC(fit2, type = "HC1"))
cbind(OLS_SE = se_ols[, 2], Robust_SE = se_robust[, 2])

# Robust confidence intervals
coefci(fit2, vcov = vcovHC(fit2, type = "HC1"))

# Cluster-robust SEs (e.g., clustered by cylinder count)
coeftest(fit2, vcov = vcovCL(fit2, cluster = mtcars$cyl))
Type      | Function              | When to Use
HC0       | vcovHC(fit, "HC0")    | White's original estimator (biased in small samples)
HC1       | vcovHC(fit, "HC1")    | Stata default; applies df correction N/(N-k)
HC3       | vcovHC(fit, "HC3")    | Recommended for small samples; conservative
Clustered | vcovCL(fit, cluster)  | Observations grouped in clusters (firms, schools)

6.9 Influential Observations: Cook's Distance and Leverage

Individual data points can disproportionately influence regression estimates. Cook's distance combines leverage (how far an observation's predictors are from the mean) with the size of the residual.

# Compute influence diagnostics
infl <- influence.measures(fit2)
summary(infl)

# Cook's distance: flag observations where D > 4/n
cooks_d <- cooks.distance(fit2)
threshold <- 4 / nrow(mtcars)
influential <- which(cooks_d > threshold)
cat("Influential obs:", names(influential), "\n")

# Plot Cook's distance
plot(cooks_d, type = "h", ylab = "Cook's Distance",
     main = "Cook's Distance by Observation")
abline(h = threshold, col = "red", lty = 2)

# Leverage (hat values)
hat_vals <- hatvalues(fit2)
high_lev <- which(hat_vals > 2 * mean(hat_vals))
cat("High leverage:", names(high_lev), "\n")

# DFBETAS: how much each coefficient changes when obs i is dropped
dfb <- dfbetas(fit2)
head(dfb)

# Sensitivity check: refit without influential observations
# (note: if `influential` were empty, mtcars[-influential, ] would drop ALL rows)
fit2_clean <- lm(mpg ~ wt + hp + am, data = mtcars[-influential, ])
cbind(Full = coef(fit2), Trimmed = coef(fit2_clean))
Rules of thumb for influence diagnostics: Cook's distance: D > 4/n or D > 1 warrants investigation. Leverage: hat values > 2(k+1)/n are high-leverage. DFBETAS: |DFBETA| > 2/sqrt(n) suggests the observation unduly affects that particular coefficient. These are screening tools; always examine flagged observations substantively before removing them.
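All three rules of thumb can be computed in a few lines (fit2 as fitted above; the flag_* names are illustrative):

```r
n <- nrow(mtcars)
k <- length(coef(fit2)) - 1                       # number of predictors

flag_cooks <- cooks.distance(fit2) > 4 / n        # Cook's distance rule
flag_lev   <- hatvalues(fit2) > 2 * (k + 1) / n   # leverage rule
flag_dfb   <- abs(dfbetas(fit2)) > 2 / sqrt(n)    # matrix: one column per coefficient

# Observations flagged by at least one screening rule
rownames(mtcars)[flag_cooks | flag_lev | rowSums(flag_dfb) > 0]
```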

6.10 Model Selection: Theory-Driven vs. Data-Driven

In research, model specification should be guided by theory. Data-driven approaches like stepwise selection are useful for prediction but problematic for causal inference because they inflate Type I error rates and produce biased coefficient estimates.

# Stepwise selection (use cautiously; for illustration only)
full_model <- lm(mpg ~ wt + hp + am + cyl + disp + drat, data = mtcars)
step_model <- step(full_model, direction = "both", trace = 0)
summary(step_model)

# Compare AIC and BIC across candidate models
models <- list(m1 = m1, m2 = m2, m3 = m3, step = step_model)
sapply(models, AIC)
sapply(models, BIC)

# Adjusted R-squared comparison
sapply(models, function(m) summary(m)$adj.r.squared)
Why stepwise selection is discouraged in research: Stepwise procedures (forward, backward, both) select variables based on statistical significance, which depends on sample size and ignores the causal structure. The resulting models tend to overfit, yield biased standard errors, and produce p-values that are too small. In academic research, let your theoretical framework guide which variables to include and report the pre-specified models alongside robustness checks.
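A quick simulation makes the overfitting concrete: with ten pure-noise predictors and an unrelated outcome, stepwise selection typically still retains some variables (a sketch; the exact count depends on the seed):

```r
set.seed(1)
n <- 100
noise <- as.data.frame(matrix(rnorm(n * 10), n, 10))
noise$y <- rnorm(n)                    # y is independent of every predictor

sel <- step(lm(y ~ ., data = noise), direction = "both", trace = 0)
length(coef(sel)) - 1                  # noise variables retained anyway
```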

6.11 Tidy Output with broom

The broom package converts model objects into tidy data frames, perfect for further analysis or visualization.

library(broom)

tidy(fit2)       # coefficient table as a tibble
glance(fit2)     # model-level statistics (R2, AIC, etc.)
augment(fit2)    # original data + fitted values + residuals

# Tidy with confidence intervals
tidy(fit2, conf.int = TRUE)

6.12 Publication Tables with stargazer and modelsummary

library(stargazer)

# Side-by-side regression table
stargazer(fit1, fit2,
          type = "text",          # "latex" or "html" for documents
          title = "Regression Results",
          dep.var.labels = "Miles per Gallon",
          covariate.labels = c("Weight", "Horsepower", "Manual Trans."),
          out = "regression_table.html")  # save to file

modelsummary: The Modern Alternative

The modelsummary package produces publication-quality tables with far more flexibility than stargazer, including output to Word, LaTeX, HTML, and gt objects.

library(modelsummary)

# Basic side-by-side table
models <- list(
  "(1) Bivariate" = fit1,
  "(2) Controls"  = fit2,
  "(3) Interaction" = fit_int
)

modelsummary(models,
             stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
             gof_omit = "IC|Log|RMSE",
             coef_rename = c(wt = "Weight", hp = "Horsepower",
                             am = "Manual Trans."),
             title = "OLS Regression: Miles per Gallon",
             notes = "Standard errors in parentheses.")

# Output to Word (useful for journal submissions)
modelsummary(models,
             output = "regression_table.docx",
             stars = TRUE)

# With robust SEs using the vcov argument
modelsummary(models,
             vcov = "HC1",
             stars = TRUE,
             title = "OLS with Robust Standard Errors")

# Coefficient plot (geom_vline comes from ggplot2)
library(ggplot2)
modelplot(models, coef_omit = "Intercept") +
  geom_vline(xintercept = 0, linetype = "dashed")
modelsummary vs. stargazer: modelsummary is actively maintained, supports gt/kableExtra/flextable backends, handles robust and clustered SEs via the vcov argument, and can output directly to .docx or .tex files. It is the recommended choice for new projects.

Exercises

Exercise 6.1

Using the mtcars dataset, fit two models: (1) mpg ~ wt, and (2) mpg ~ wt + hp + cyl. Compare them with anova(). Use broom::glance() to compare R-squared and AIC side-by-side. Which model fits better?

Exercise 6.2

Fit mpg ~ wt + hp and produce the four diagnostic plots. Do the residuals appear roughly normal? Is there evidence of heteroscedasticity? If so, compute robust standard errors using lmtest and sandwich.

Exercise 6.3

Fit a model mpg ~ wt * am (with the interaction). Use emmeans to compute the predicted mpg for manual and automatic cars at wt = 2.5, 3.0, and 3.5. Plot the interaction using emmip(). Does the effect of weight on fuel economy differ by transmission type?

Exercise 6.4

Using your model from Exercise 6.1, compute Cook's distance for all observations. How many observations exceed the 4/n threshold? Refit the model after removing those influential points. Do the coefficients change substantially? Use modelsummary to display both models (full sample and trimmed sample) side by side with robust standard errors.

