Fit OLS regressions, interpret output, check assumptions, and produce publication tables.
The lm() function fits an ordinary least squares regression. The formula interface uses ~ to separate the dependent variable from predictors.
```r
# Simple regression: mpg predicted by weight
fit1 <- lm(mpg ~ wt, data = mtcars)
summary(fit1)

# Multiple regression
fit2 <- lm(mpg ~ wt + hp + am, data = mtcars)
summary(fit2)
```
y ~ . uses all other columns as predictors. y ~ x1 * x2 includes main effects and the interaction. y ~ x + I(x^2) adds a quadratic term.
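These shortcuts can be sketched on mtcars (the `fit_*` names here are illustrative, not part of the running example):

```r
# Formula shortcuts illustrated on mtcars
fit_all  <- lm(mpg ~ ., data = mtcars)             # all other columns as predictors
fit_ix   <- lm(mpg ~ wt * hp, data = mtcars)       # expands to wt + hp + wt:hp
fit_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)  # quadratic in wt

names(coef(fit_ix))  # "(Intercept)" "wt" "hp" "wt:hp"
```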
```r
# Key components of summary(fit2):
#   Coefficients table: Estimate, Std. Error, t value, Pr(>|t|)
#   Multiple R-squared, Adjusted R-squared
#   F-statistic and its p-value

# Extract individual pieces
coef(fit2)                    # coefficients vector
confint(fit2)                 # 95% confidence intervals
summary(fit2)$r.squared       # R-squared
summary(fit2)$adj.r.squared   # Adjusted R-squared
```
Interactions let the effect of one predictor depend on the value of another. Polynomial terms capture non-linear relationships within the linear model framework.
```r
# Interaction between weight and transmission type
fit_int <- lm(mpg ~ wt * am, data = mtcars)
summary(fit_int)  # wt:am is the interaction term
# Interpretation: the slope of wt differs for manual vs. automatic

# Equivalent explicit syntax
fit_int2 <- lm(mpg ~ wt + am + wt:am, data = mtcars)

# Polynomial regression: quadratic term for wt
fit_poly <- lm(mpg ~ wt + I(wt^2), data = mtcars)
summary(fit_poly)

# Using poly() for orthogonal polynomials (preferred for numerical stability)
fit_poly2 <- lm(mpg ~ poly(wt, 2), data = mtcars)
summary(fit_poly2)

# Three-way interaction (moderator analysis)
fit_3way <- lm(mpg ~ wt * hp * am, data = mtcars)
summary(fit_3way)
```
When a model contains an interaction, the corresponding main effects should be included as well; the * operator does this automatically.
Centering continuous predictors before forming interaction or polynomial terms makes lower-order coefficients interpretable at the mean (rather than at zero) and reduces collinearity with the higher-order terms, e.g. mtcars$wt_c <- scale(mtcars$wt, scale = FALSE).
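A minimal sketch of centering before fitting the interaction (the names wt_c and fit_centered are illustrative):

```r
# Center weight at its sample mean; scale() returns a matrix, so drop to a vector
mtcars$wt_c <- scale(mtcars$wt, scale = FALSE)[, 1]

# The am coefficient is now the predicted manual/automatic gap at average weight,
# not at the impossible wt = 0; the interaction coefficient itself is unchanged
fit_centered <- lm(mpg ~ wt_c * am, data = mtcars)
coef(fit_centered)
```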
When a model contains interactions or factors, raw coefficients are hard to interpret. The emmeans package computes estimated marginal means (predicted values averaged over other factors) and pairwise contrasts.
```r
library(emmeans)
library(ggplot2)  # needed for labs() on the emmip plot

# Fit a model with a factor and interaction
mtcars$cyl_f <- factor(mtcars$cyl)
fit_emm <- lm(mpg ~ wt * cyl_f, data = mtcars)

# Estimated marginal means for each cylinder group (at mean wt)
emm <- emmeans(fit_emm, ~ cyl_f)
emm

# Pairwise contrasts with Tukey adjustment
pairs(emm)

# Estimated marginal means at specific values of wt
emmeans(fit_emm, ~ cyl_f, at = list(wt = c(2.5, 3.5)))

# Visualize the interaction
emmip(fit_emm, cyl_f ~ wt, at = list(wt = seq(1.5, 5.5, 0.5)), CIs = TRUE) +
  labs(y = "Predicted MPG", x = "Weight (1000 lbs)")
```
```r
# Test whether hp and am contribute beyond wt alone
anova(fit1, fit2)

# Sequential (Type I) ANOVA table for each predictor
anova(fit2)

# Compare a sequence of nested models
m0 <- lm(mpg ~ 1, data = mtcars)  # intercept only
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + am, data = mtcars)
anova(m0, m1, m2, m3)

# Type III SS (useful with interactions; requires car package)
library(car)
Anova(fit_int, type = "III")
```
anova() uses sequential (Type I) SS, which are order-dependent. For models with interactions, use car::Anova(model, type = "III") to get marginal SS that do not depend on predictor ordering. Type II is appropriate when there are no significant interactions.
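The order dependence of Type I sums of squares can be seen directly by entering the same two predictors in different orders (a sketch; the object names are illustrative):

```r
# Same model, two entry orders: the Type I SS attributed to hp differs
ss_hp_last  <- anova(lm(mpg ~ wt + hp, data = mtcars))["hp", "Sum Sq"]  # hp after wt
ss_hp_first <- anova(lm(mpg ~ hp + wt, data = mtcars))["hp", "Sum Sq"]  # hp entered first
c(hp_after_wt = ss_hp_last, hp_first = ss_hp_first)  # noticeably different
```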
```r
# Point predictions
new_data <- data.frame(wt = c(2.5, 3.0, 3.5),
                       hp = c(100, 120, 150),
                       am = c(1, 0, 0))
predict(fit2, newdata = new_data)

# With confidence and prediction intervals
predict(fit2, newdata = new_data, interval = "confidence")
predict(fit2, newdata = new_data, interval = "prediction")  # wider: adds residual error
```
Four standard diagnostic plots help assess OLS assumptions:
```r
# Base R diagnostics (4 plots in one window)
par(mfrow = c(2, 2))
plot(fit2)
par(mfrow = c(1, 1))
```
| Plot | Checks | Look For |
|---|---|---|
| Residuals vs Fitted | Linearity, homoscedasticity | Random scatter around 0 |
| Q-Q Plot | Normality of residuals | Points on the diagonal line |
| Scale-Location | Homoscedasticity | Flat trend line |
| Residuals vs Leverage | Influential observations | Points inside Cook's distance contours |
OLS standard errors assume homoscedastic, uncorrelated errors. When that assumption fails, robust standard errors provide valid inference without changing the point estimates. The one-line recipe is lmtest::coeftest(fit2, vcov = sandwich::vcovHC(fit2, type = "HC1")).
```r
library(sandwich)
library(lmtest)

# HC1 robust SEs (equivalent to Stata's ", robust")
coeftest(fit2, vcov = vcovHC(fit2, type = "HC1"))

# HC3 (better for small samples; the vcovHC() default)
coeftest(fit2, vcov = vcovHC(fit2, type = "HC3"))

# Compare OLS vs. robust SEs side by side
se_ols    <- coeftest(fit2)
se_robust <- coeftest(fit2, vcov = vcovHC(fit2, type = "HC1"))
cbind(OLS_SE = se_ols[, 2], Robust_SE = se_robust[, 2])

# Robust confidence intervals
coefci(fit2, vcov = vcovHC(fit2, type = "HC1"))

# Cluster-robust SEs (e.g., clustered by cylinder count)
coeftest(fit2, vcov = vcovCL(fit2, cluster = mtcars$cyl))
```
| Type | Function | When to Use |
|---|---|---|
| HC0 | vcovHC(fit, "HC0") | White's original estimator (biased in small samples) |
| HC1 | vcovHC(fit, "HC1") | Stata default; applies df correction N/(N-k) |
| HC3 | vcovHC(fit, "HC3") | Recommended for small samples; conservative |
| Clustered | vcovCL(fit, cluster) | Observations grouped in clusters (firms, schools) |
Individual data points can disproportionately influence regression estimates. Cook's distance combines leverage (how far an observation's predictors are from the mean) with the size of the residual.
```r
# Compute influence diagnostics
infl <- influence.measures(fit2)
summary(infl)

# Cook's distance: flag observations where D > 4/n
cooks_d <- cooks.distance(fit2)
threshold <- 4 / nrow(mtcars)
influential <- which(cooks_d > threshold)
cat("Influential obs:", names(influential), "\n")

# Plot Cook's distance
plot(cooks_d, type = "h", ylab = "Cook's Distance",
     main = "Cook's Distance by Observation")
abline(h = threshold, col = "red", lty = 2)

# Leverage (hat values)
hat_vals <- hatvalues(fit2)
high_lev <- which(hat_vals > 2 * mean(hat_vals))
cat("High leverage:", names(high_lev), "\n")

# DFBETAS: how much each coefficient changes when obs i is dropped
dfb <- dfbetas(fit2)
head(dfb)

# Sensitivity check: refit without influential observations
fit2_clean <- lm(mpg ~ wt + hp + am, data = mtcars[-influential, ])
cbind(Full = coef(fit2), Trimmed = coef(fit2_clean))
```
In research, model specification should be guided by theory. Data-driven approaches like stepwise selection are useful for prediction but problematic for causal inference because they inflate Type I error rates and produce biased coefficient estimates.
```r
# Stepwise selection (use cautiously; for illustration only)
full_model <- lm(mpg ~ wt + hp + am + cyl + disp + drat, data = mtcars)
step_model <- step(full_model, direction = "both", trace = 0)
summary(step_model)

# Compare AIC and BIC across candidate models
models <- list(m1 = m1, m2 = m2, m3 = m3, step = step_model)
sapply(models, AIC)
sapply(models, BIC)

# Adjusted R-squared comparison
sapply(models, function(m) summary(m)$adj.r.squared)
```
The broom package converts model objects into tidy data frames, perfect for further analysis or visualization.
```r
library(broom)

tidy(fit2)     # coefficient table as a tibble
glance(fit2)   # model-level statistics (R-squared, AIC, etc.)
augment(fit2)  # original data + fitted values + residuals

# Tidy with confidence intervals
tidy(fit2, conf.int = TRUE)
```
```r
library(stargazer)

# Side-by-side regression table
stargazer(fit1, fit2,
          type = "text",  # "latex" or "html" for documents
          title = "Regression Results",
          dep.var.labels = "Miles per Gallon",
          covariate.labels = c("Weight", "Horsepower", "Manual Trans."),
          out = "regression_table.html")  # save to file
```
The modelsummary package produces publication-quality tables with far more flexibility than stargazer, including output to Word, LaTeX, HTML, and gt objects.
```r
library(modelsummary)
library(ggplot2)  # needed for geom_vline() on the coefficient plot

# Basic side-by-side table
models <- list(
  "(1) Bivariate"   = fit1,
  "(2) Controls"    = fit2,
  "(3) Interaction" = fit_int
)
modelsummary(models,
             stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
             gof_omit = "IC|Log|RMSE",
             coef_rename = c(wt = "Weight", hp = "Horsepower",
                             am = "Manual Trans."),
             title = "OLS Regression: Miles per Gallon",
             notes = "Standard errors in parentheses.")

# Output to Word (useful for journal submissions)
modelsummary(models, output = "regression_table.docx", stars = TRUE)

# With robust SEs using the vcov argument
modelsummary(models, vcov = "HC1", stars = TRUE,
             title = "OLS with Robust Standard Errors")

# Coefficient plot
modelplot(models, coef_omit = "Intercept") +
  geom_vline(xintercept = 0, linetype = "dashed")
```
modelsummary is actively maintained, supports gt/kableExtra/flextable backends, handles robust and clustered SEs via the vcov argument, and can output directly to .docx or .tex files. It is the recommended choice for new projects.
Using the mtcars dataset, fit two models: (1) mpg ~ wt, and (2) mpg ~ wt + hp + cyl. Compare them with anova(). Use broom::glance() to compare R-squared and AIC side-by-side. Which model fits better?
Fit mpg ~ wt + hp and produce the four diagnostic plots. Do the residuals appear roughly normal? Is there evidence of heteroscedasticity? If so, compute robust standard errors using lmtest and sandwich.
Fit a model mpg ~ wt * am (with the interaction). Use emmeans to compute the predicted mpg for manual and automatic cars at wt = 2.5, 3.0, and 3.5. Plot the interaction using emmip(). Does the effect of weight on fuel economy differ by transmission type?
Using your model from Exercise 6.1, compute Cook's distance for all observations. How many observations exceed the 4/n threshold? Refit the model after removing those influential points. Do the coefficients change substantially? Use modelsummary to display both models (full sample and trimmed sample) side by side with robust standard errors.
- lm(y ~ x1 + x2, data) fits OLS; summary() gives coefficients, SEs, p-values, and R-squared.
- y ~ x1 * x2 for interactions; I(x^2) or poly(x, 2) for polynomial terms.
- emmeans computes estimated marginal means and pairwise contrasts for models with factors or interactions.
- anova(model1, model2) for nested model comparison; car::Anova(, type = "III") when interactions are present.
- sandwich::vcovHC() with lmtest::coeftest() for robust standard errors; vcovCL() for clustered SEs.
- broom::tidy() converts output to tidy data frames; modelsummary produces publication-quality tables.