Chapter 4: Linear Regression

OLS Estimation and Diagnostic Testing

4.1 Basic OLS Regression

The regress command is the workhorse of applied econometrics. It estimates parameters by ordinary least squares and reports coefficients, standard errors, t-statistics, and R-squared.

sysuse auto, clear

* Simple regression
regress price mpg

* Multiple regression
regress price mpg weight length foreign

* Factor variables (categorical dummies)
regress price mpg weight i.rep78 i.foreign
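After any regress call, coefficients and fit statistics remain in memory and can be referenced directly via _b[], _se[], and the e() results. A minimal sketch, using the same auto variables as above:

* Accessing stored results after estimation
regress price mpg weight
display "Coefficient on mpg: " _b[mpg]
display "Std. error:         " _se[mpg]
display "R-squared:          " e(r2)
display "Observations:       " e(N)

These stored results are the building blocks for the test, lincom, and diagnostic commands used throughout this chapter.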

4.2 Robust vs. Clustered Standard Errors

OLS standard errors assume homoskedasticity and independence. In practice, both assumptions often fail. Understanding when to use robust, clustered, or two-way clustered standard errors is essential for valid inference.

* --- HC1 Robust (Huber-White) standard errors ---
* Correct for heteroskedasticity, assume independence
regress price mpg weight foreign, vce(robust)

* --- One-way clustered standard errors ---
* Correct for heteroskedasticity AND within-cluster correlation
* (Note: rep78 has only 5 categories -- far too few clusters for
*  reliable inference in practice; used here purely for illustration)
regress price mpg weight, vce(cluster rep78)

* --- Two-way clustering (user-written: reghdfe or ivreg2) ---
* Official regress accepts only one cluster variable; to cluster
* along two dimensions (e.g., firm and year), use reghdfe
* ssc install reghdfe
reghdfe price mpg weight, noabsorb vce(cluster rep78 foreign)

* --- Bootstrap standard errors (alternative) ---
bootstrap, reps(1000) seed(12345) cluster(rep78): ///
    regress price mpg weight
When to Cluster: A Decision Framework

Cluster standard errors at the level where treatment varies or where observations are correlated. In panel data, you typically cluster at the unit level (e.g., firm, facility). Clustering at too fine a level gives standard errors that are too small; too coarse a level gives conservative inference but may have too few clusters. A rule of thumb: you need at least 30-50 clusters for asymptotic justification. With fewer clusters, consider the wild cluster bootstrap (boottest package).
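With only a handful of clusters (as with rep78 here), the wild cluster bootstrap is the usual remedy. A sketch using the user-written boottest package (syntax per its help file; the options shown are illustrative defaults):

* ssc install boottest
regress price mpg weight, vce(cluster rep78)
boottest mpg, reps(9999) seed(12345)
* Reports a wild-cluster-bootstrap p-value and confidence interval
* for H0: _b[mpg] = 0, valid with few clusters

By default boottest picks up the cluster variable from the preceding estimation, so no cluster() option is needed here.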
Robust vs. Cluster: A Common Confusion

vce(robust) and vce(cluster id) are not interchangeable. Robust SEs handle heteroskedasticity but assume independence across observations. Clustered SEs allow arbitrary correlation within clusters. If your data has a panel or group structure, vce(robust) will typically understate standard errors because it ignores within-group correlation. When in doubt, cluster.

4.3 Post-Estimation: Testing Hypotheses

After running a regression, use test for joint F-tests and lincom for linear combinations of coefficients.

regress price mpg weight length foreign, vce(robust)

* Test if mpg and weight are jointly significant
test mpg weight

* Test a linear restriction: beta_mpg = beta_weight
test mpg = weight

* Linear combination: effect of mpg + weight
lincom mpg + weight
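For factor variables, typing out every dummy in a test statement is tedious; testparm accepts factor-variable notation directly. A short sketch:

* Joint test that all rep78 dummies are zero
regress price mpg weight i.rep78
testparm i.rep78
* Equivalent to an F-test of joint significance of the
* categorical variable as a whole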

4.4 Diagnostics: Heteroskedasticity

Testing for heteroskedasticity helps you decide whether robust standard errors are necessary (they usually are). Two tests are standard.

regress price mpg weight foreign

* Breusch-Pagan / Cook-Weisberg test
estat hettest
* H0: constant variance. Reject => heteroskedasticity present
* This test assumes normality and tests against linear heteroskedasticity

* White's test (more general: no normality assumption, tests all cross-products)
estat imtest, white
* H0: homoskedastic AND correctly specified.
* Rejection could indicate heteroskedasticity OR misspecification.

* Visual check: residuals vs. fitted values
predict yhat_diag, xb
predict resid_diag, residuals
twoway scatter resid_diag yhat_diag, ///
    mcolor(navy%30) msize(small) ///
    yline(0, lcolor(cranberry)) ///
    title("Residuals vs. Fitted Values") ///
    xtitle("Fitted Values") ytitle("Residuals")
Interpreting Heteroskedasticity Tests

The Breusch-Pagan test checks whether the error variance is a linear function of the regressors. White's test is more general, but a rejection may reflect functional form misspecification rather than heteroskedasticity per se. In practice, most applied researchers simply use robust standard errors as a default rather than testing first; the tests are more relevant for understanding your model than for making a binary decision.
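The mechanics behind these tests can be reproduced by hand: regress the squared residuals on the regressors and form an LM statistic as N times the R-squared of that auxiliary regression. A sketch (this is the Koenker/robust LM variant; Stata's default estat hettest uses fitted values and scaled residuals, so the numbers will not match exactly):

* Breusch-Pagan by hand (auxiliary-regression LM form)
regress price mpg weight foreign
predict bp_resid, residuals
gen bp_resid2 = bp_resid^2
regress bp_resid2 mpg weight foreign
display "LM statistic = " e(N)*e(r2)
display "p-value      = " chi2tail(e(df_m), e(N)*e(r2))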

4.5 Diagnostics: Omitted Variables (RESET Test)

The Ramsey RESET test checks whether nonlinear functions of the fitted values have explanatory power beyond the linear model. It is a general test for functional form misspecification.

* Ramsey RESET test
regress price mpg weight foreign
estat ovtest
* H0: no omitted variables. Reject => model may be misspecified
* Uses powers of fitted values (y-hat^2, y-hat^3, y-hat^4)

* If RESET rejects, consider:
* 1. Adding squared terms (c.mpg#c.mpg)
* 2. Using log transformations
* 3. Adding interaction terms
* 4. Including omitted controls

* Example: fix misspecification by adding a quadratic term
regress price c.mpg##c.mpg weight foreign
estat ovtest              // re-test after correction

4.6 Diagnostics: Multicollinearity

The Variance Inflation Factor (VIF) measures how much the variance of a coefficient is inflated due to collinearity with other regressors. A VIF above 10 is a common (though somewhat arbitrary) rule of thumb for concern.

regress price mpg weight length
vif

* Typical output:
*   Variable |       VIF       1/VIF
* -----------+-----------------------
*     weight |      6.34    0.157729
*     length |      5.89    0.169779
*        mpg |      3.01    0.332226
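The VIF for regressor j is 1/(1 - R²_j), where R²_j comes from regressing that variable on all the other regressors. You can verify Stata's output by hand, which makes clear that VIF is purely a property of the X matrix:

* VIF for weight, computed manually
regress weight mpg length
display "VIF for weight = " 1/(1 - e(r2))
* Should match the vif output for weight above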

4.7 Predicted Values and Residuals

Residual analysis is critical for identifying influential observations, outliers, and model violations.

regress price mpg weight foreign

* Predicted values (fitted y-hat)
predict yhat, xb

* Residuals
predict resid, residuals

* Standardized residuals
predict rstandard, rstandard

* Cook's distance (influential observations)
predict cooksd, cooksd
list make price cooksd if cooksd > 4/74   // 4/N rule of thumb

4.7a Leverage and Influence Diagnostics

Influential observations can distort your regression results. Beyond Cook's distance, Stata offers DFBETA (how much each coefficient changes when an observation is dropped), leverage values, and DFITS. These diagnostics are particularly important in applied research where a few outliers can change the sign of a coefficient.

regress price mpg weight foreign

* DFBETA: change in each coefficient when obs i is dropped
predict dfb_mpg, dfbeta(mpg)
predict dfb_weight, dfbeta(weight)

* Rule of thumb: |DFBETA| > 2/sqrt(N) flags influential obs
local cutoff = 2 / sqrt(_N)
list make price mpg dfb_mpg if abs(dfb_mpg) > `cutoff'

* Leverage (hat values): how unusual is observation i in X-space?
predict lev, leverage
* Rule of thumb: leverage > 2k/N (k = number of predictors + 1)
local lev_cutoff = 2 * (e(df_m) + 1) / e(N)
list make price lev if lev > `lev_cutoff'

* DFITS: overall influence (combines leverage and residual)
predict dfits, dfits
list make price dfits if abs(dfits) > 2 * sqrt(e(df_m) / e(N))

* Leverage-vs-residual-squared plot
lvr2plot, mlabel(make) msize(small) ///
    title("Leverage vs. Residual Squared")
When Influential Observations Matter

Influence diagnostics are not about mechanically deleting outliers. They help you understand whether your results depend on a handful of observations. If removing 2-3 observations changes the sign or significance of a key coefficient, you should investigate those observations and report the sensitivity in your paper. In panel data contexts, DFBETA analysis on key variables can reveal whether a few noisy units drive your results.
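A sensitivity check of the kind described above can be run side by side with estimates table. A sketch (variable names are illustrative; the cutoff is the 4/N rule of thumb from Section 4.7):

* Compare full-sample and trimmed-sample estimates
regress price mpg weight foreign
estimates store full
predict cooks_chk, cooksd
local cut = 4/e(N)
regress price mpg weight foreign if cooks_chk < `cut'
estimates store trimmed
estimates table full trimmed, b(%9.2f) se stats(N r2)

If the two columns tell the same story, you can report the full-sample results with confidence; if not, the divergence itself belongs in the paper.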

4.8 Added-Variable Plots

Added-variable (partial regression) plots show the relationship between the outcome and one regressor after partialling out all other regressors. They are useful for detecting nonlinearity and influential points.

regress price mpg weight foreign
avplot mpg, title("Added-Variable Plot: MPG")

* All added-variable plots at once
avplots

4.9 Interaction Terms

Interactions allow the effect of one variable to depend on the level of another. Stata's factor variable notation makes this straightforward.

* Continuous x continuous interaction
regress price c.mpg##c.weight, vce(robust)

* Categorical x continuous interaction
regress price i.foreign##c.mpg, vce(robust)

* Marginal effects at different levels
margins foreign, dydx(mpg)
marginsplot, title("Effect of MPG by Origin")

4.10 Margins and Marginal Effects

The margins command computes predicted values or marginal effects at specified values of covariates. Combined with marginsplot, it is the standard way to interpret interaction terms, nonlinear models, and factor variables in applied work.

regress price c.mpg##c.mpg##i.foreign c.weight, vce(robust)

* Average marginal effect of mpg (accounts for quadratic + interaction)
margins, dydx(mpg)

* Marginal effect of mpg at specific values
margins, dydx(mpg) at(mpg=(15(5)40))
marginsplot, ///
    title("Marginal Effect of MPG on Price") ///
    ytitle("Effect on Price (USD)") ///
    xtitle("MPG") ///
    recast(line) recastci(rarea) ///
    ciopt(fcolor(navy%20))

* Predicted values by group at different covariate levels
margins foreign, at(mpg=(15(5)40))
marginsplot, ///
    title("Predicted Price by Origin") ///
    legend(order(1 "Domestic" 2 "Foreign"))

* Contrast: difference between groups
margins r.foreign, at(mpg=(15(5)40))
marginsplot, yline(0) ///
    title("Foreign - Domestic Price Difference")
Tip: Margins are Essential for Interactions

When your model includes interactions (especially continuous-by-continuous), the raw coefficients do not directly tell you the marginal effect of a variable. The effect depends on the level of the other variable in the interaction. Always use margins, dydx() and marginsplot to interpret and visualize interaction effects. Presenting a marginsplot alongside the coefficient table is increasingly expected in top journals.
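To demystify what margins computes, the marginal effect in a quadratic model can be recovered by hand. For price = b0 + b1*mpg + b2*mpg^2 + ..., the derivative with respect to mpg is b1 + 2*b2*mpg. A sketch, evaluated at mpg = 20 in a simplified model:

* Marginal effect of mpg at mpg = 20, by hand and via margins
regress price c.mpg##c.mpg weight
lincom _b[mpg] + 2*20*_b[c.mpg#c.mpg]
margins, dydx(mpg) at(mpg=20)
* The lincom point estimate should match the margins output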

4.11 Publishing Regression Tables with esttab and outreg2

Producing publication-ready regression tables directly from Stata eliminates transcription errors and speeds up revision rounds. Two popular packages exist: esttab (part of estout) and outreg2.

* --- esttab approach (recommended) ---
sysuse auto, clear

* Run multiple specifications and store each
regress price mpg, vce(robust)
estimates store m1
regress price mpg weight, vce(robust)
estimates store m2
regress price mpg weight foreign i.rep78, vce(robust)
estimates store m3

* Export to LaTeX
esttab m1 m2 m3 using "tables/regression.tex", ///
    star(* 0.10 ** 0.05 *** 0.01) ///
    se(3) b(3) ///
    r2(3) ar2(3) ///
    scalars("F F-statistic") ///
    mtitles("(1)" "(2)" "(3)") ///
    label replace booktabs ///
    addnotes("Robust standard errors in parentheses.")

* Export to CSV (for Excel)
esttab m1 m2 m3 using "tables/regression.csv", ///
    star(* 0.10 ** 0.05 *** 0.01) ///
    se(3) b(3) label replace csv
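When a model includes a long block of factor dummies (like i.rep78 in model 3), the indicate() option can collapse them into a single Yes/No row instead of printing every coefficient. A sketch, with a hypothetical output filename:

* Collapse rep78 dummies into one indicator row
esttab m1 m2 m3 using "tables/regression_short.tex", ///
    indicate("Repair record FEs = *.rep78") ///
    star(* 0.10 ** 0.05 *** 0.01) ///
    se(3) b(3) label replace booktabs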

* --- outreg2 approach ---
* ssc install outreg2, replace
regress price mpg, vce(robust)
outreg2 using "tables/regression_or2", ///
    tex replace ctitle("(1)") dec(3) label
regress price mpg weight, vce(robust)
outreg2 using "tables/regression_or2", ///
    tex append ctitle("(2)") dec(3) label
esttab vs. outreg2

esttab is more flexible and produces cleaner LaTeX output (it supports booktabs, custom cell formatting, and multiple stored models per table). outreg2 has a simpler syntax and is popular in some fields. For journal submissions, esttab with booktabs is the most common choice in economics and OM.

4.12 Stepwise Regression (and Why to Avoid It)

Stata provides stepwise for automated variable selection. However, stepwise regression is widely criticized in modern applied econometrics and should generally be avoided in research papers.

* Forward stepwise (adds variables one at a time)
stepwise, pe(0.05): regress price mpg weight length headroom trunk

* Backward stepwise (starts with all, removes insignificant)
stepwise, pr(0.10): regress price mpg weight length headroom trunk
Why Stepwise is Problematic

Stepwise regression inflates R-squared, produces biased coefficients, and gives incorrect standard errors and p-values (because the search process is not accounted for in inference). The selected model depends on the order of variable entry. In causal research, variable selection should be guided by theory and DAGs, not by automated statistical procedures. If you use stepwise, reviewers will likely ask you to justify it, and the answer is rarely satisfactory. For prediction problems, consider LASSO or elastic net instead (the lasso2 package, or the official lasso command in Stata 16+).
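For prediction tasks, Stata 16+ ships an official lasso command that selects the penalty by cross-validation, which handles the selection problem in a principled way. A sketch:

* Built-in lasso (Stata 16+); lasso2 from lassopack is an alternative
lasso linear price mpg weight length headroom trunk, selection(cv) rseed(12345)
lassocoef          // which variables were selected
lassoknots         // penalty path and selection steps

Unlike stepwise, the cross-validated penalty is chosen for out-of-sample prediction, not in-sample significance; for inference after selection, see Stata's dsregress family.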

Exercise 4.1

Using sysuse auto, regress price on mpg, weight, length, and foreign with robust standard errors. Run the RESET test and the Breusch-Pagan test. Based on the results, is there evidence of misspecification or heteroskedasticity?

Exercise 4.2

Estimate a model with the interaction c.mpg##i.foreign. Use margins to compute the marginal effect of mpg separately for domestic and foreign cars. Create a marginsplot. Interpret the interaction.

Exercise 4.3

Run the regression regress price mpg weight length foreign. Compute DFBETA values for mpg. How many observations exceed the 2/sqrt(N) threshold? List them. Re-estimate the regression after dropping these influential observations. Do the coefficients change meaningfully?

Exercise 4.4

Estimate three nested specifications of the price model: (1) mpg only, (2) mpg weight, (3) mpg weight foreign i.rep78. Use esttab to export all three models in a single LaTeX table with robust standard errors, stars at the 10/5/1% levels, and R-squared reported at the bottom. Include a table note about the standard errors.
