OLS Estimation and Diagnostic Testing
The regress command is the workhorse of applied econometrics. It estimates parameters by ordinary least squares and reports coefficients, standard errors, t-statistics, and R-squared.
sysuse auto, clear

* Simple regression
regress price mpg

* Multiple regression
regress price mpg weight length foreign

* Factor variables (categorical dummies)
regress price mpg weight i.rep78 i.foreign
OLS standard errors assume homoskedasticity and independence. In practice, both assumptions often fail. Understanding when to use robust, clustered, or two-way clustered standard errors is essential for valid inference.
* --- HC1 Robust (Huber-White) standard errors ---
* Correct for heteroskedasticity, assume independence
regress price mpg weight foreign, vce(robust)

* --- One-way clustered standard errors ---
* Correct for heteroskedasticity AND within-cluster correlation
regress price mpg weight, vce(cluster rep78)

* --- Two-way clustering (Stata 17+ or ivreg2) ---
* Cluster along two dimensions (e.g., firm and year)
regress price mpg weight, vce(cluster rep78 foreign)

* --- Bootstrap standard errors (alternative) ---
bootstrap, reps(1000) seed(12345) cluster(rep78): ///
    regress price mpg weight
With a small number of clusters, cluster-robust standard errors tend to over-reject; the wild cluster bootstrap is the standard remedy (boottest package).
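A minimal sketch of the wild cluster bootstrap workflow, assuming the community-contributed boottest package has been installed (ssc install boottest); the reps and seed values here are illustrative:

```stata
* Install once: ssc install boottest
sysuse auto, clear

* Cluster-robust regression with few clusters (rep78 has only 5 levels)
regress price mpg weight, vce(cluster rep78)

* Wild cluster bootstrap test of H0: coefficient on mpg = 0
boottest mpg, reps(9999) seed(12345)
```

The bootstrap p-value and confidence interval from boottest are generally more reliable than the cluster-robust t-test when the cluster count is small.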
vce(robust) and vce(cluster id) are not interchangeable. Robust SEs handle heteroskedasticity but assume independence across observations. Clustered SEs allow arbitrary correlation within clusters. If your data has a panel or group structure, vce(robust) will typically understate standard errors because it ignores within-group correlation. When in doubt, cluster.
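The distinction is easiest to see by estimating the same model under each option and comparing the standard errors side by side. A sketch, assuming esttab (from the estout package) is installed; the point estimates are identical across columns, only the standard errors differ:

```stata
sysuse auto, clear

* Same model, three variance estimators
quietly regress price mpg weight                      // classical SEs
estimates store m_ols
quietly regress price mpg weight, vce(robust)         // heteroskedasticity-robust
estimates store m_rob
quietly regress price mpg weight, vce(cluster rep78)  // clustered by rep78
estimates store m_clu

* Compare: coefficients match, standard errors do not
esttab m_ols m_rob m_clu, se mtitles("Classical" "Robust" "Clustered")
```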
After running a regression, use test for joint F-tests and lincom for linear combinations of coefficients.
regress price mpg weight length foreign, vce(robust)

* Test if mpg and weight are jointly significant
test mpg weight

* Test a linear restriction: beta_mpg = beta_weight
test mpg = weight

* Linear combination: effect of mpg + weight
lincom mpg + weight
Testing for heteroskedasticity helps you decide whether robust standard errors are necessary (they usually are). Two tests are standard.
regress price mpg weight foreign

* Breusch-Pagan / Cook-Weisberg test
estat hettest
* H0: constant variance. Reject => heteroskedasticity present
* This test assumes normality and tests against linear heteroskedasticity

* White's test (more general: no normality assumption, tests all cross-products)
estat imtest, white
* H0: homoskedastic AND correctly specified.
* Rejection could indicate heteroskedasticity OR misspecification.

* Visual check: residuals vs. fitted values
predict yhat_diag, xb
predict resid_diag, residuals
twoway scatter resid_diag yhat_diag, ///
    mcolor(navy%30) msize(small) ///
    yline(0, lcolor(cranberry)) ///
    title("Residuals vs. Fitted Values") ///
    xtitle("Fitted Values") ytitle("Residuals")
The Ramsey RESET test checks whether nonlinear functions of the fitted values have explanatory power beyond the linear model. It is a general test for functional form misspecification.
* Ramsey RESET test
regress price mpg weight foreign
estat ovtest
* H0: no omitted variables. Reject => model may be misspecified
* Uses powers of fitted values (y-hat^2, y-hat^3, y-hat^4)

* If RESET rejects, consider:
*   1. Adding squared terms (c.mpg#c.mpg)
*   2. Using log transformations
*   3. Adding interaction terms
*   4. Including omitted controls

* Example: fix misspecification by adding a quadratic term
regress price c.mpg##c.mpg weight foreign
estat ovtest    // re-test after correction
The Variance Inflation Factor (VIF) measures how much the variance of a coefficient is inflated due to collinearity with other regressors. A VIF above 10 is a common (though somewhat arbitrary) rule of thumb for concern.
regress price mpg weight length
vif

* Typical output:
*     Variable |      VIF       1/VIF
* -------------+----------------------
*       weight |     6.34    0.157729
*       length |     5.89    0.169779
*          mpg |     3.01    0.332226
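The numbers vif reports can be reproduced by hand from the definition VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared from regressing regressor j on all the other regressors. A sketch for the weight coefficient:

```stata
sysuse auto, clear
quietly regress price mpg weight length   // main model

* Auxiliary regression: weight on the remaining regressors
quietly regress weight mpg length
display "VIF(weight) = " 1 / (1 - e(r2))
```

A high VIF therefore means the regressor is well predicted by the others, which inflates the variance of its estimated coefficient.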
Residual analysis is critical for identifying influential observations, outliers, and model violations.
regress price mpg weight foreign

* Predicted values (fitted y-hat)
predict yhat, xb

* Residuals
predict resid, residuals

* Standardized residuals
predict rstd, rstandard

* Cook's distance (influential observations)
predict cooksd, cooksd
list make price cooksd if cooksd > 4/74    // 4/N rule of thumb
Influential observations can distort your regression results. Beyond Cook's distance, Stata offers DFBETA (how much each coefficient changes when an observation is dropped), leverage values, and DFITS. These diagnostics are particularly important in applied research where a few outliers can change the sign of a coefficient.
regress price mpg weight foreign

* DFBETA: change in each coefficient when obs i is dropped
predict dfb_mpg, dfbeta(mpg)
predict dfb_weight, dfbeta(weight)

* Rule of thumb: |DFBETA| > 2/sqrt(N) flags influential obs
local cutoff = 2 / sqrt(_N)
list make price mpg dfb_mpg if abs(dfb_mpg) > `cutoff'

* Leverage (hat values): how unusual is observation i in X-space?
predict lev, leverage
* Rule of thumb: leverage > 2k/N (k = number of predictors + 1)
local lev_cutoff = 2 * (e(df_m) + 1) / e(N)
list make price lev if lev > `lev_cutoff'

* DFITS: overall influence (combines leverage and residual)
predict dfits, dfits
list make price dfits if abs(dfits) > 2 * sqrt(e(df_m) / e(N))

* Leverage-vs-residual-squared plot
lvr2plot, mlabel(make) msize(small) ///
    title("Leverage vs. Residual Squared")
Added-variable (partial regression) plots show the relationship between the outcome and one regressor after partialling out all other regressors. They are useful for detecting nonlinearity and influential points.
regress price mpg weight foreign
avplot mpg, title("Added-Variable Plot: MPG")

* All added-variable plots at once
avplots
Interactions allow the effect of one variable to depend on the level of another. Stata's factor variable notation makes this straightforward.
* Continuous x continuous interaction
regress price c.mpg##c.weight, vce(robust)

* Categorical x continuous interaction
regress price i.foreign##c.mpg, vce(robust)

* Marginal effects at different levels
margins foreign, dydx(mpg)
marginsplot, title("Effect of MPG by Origin")
The margins command computes predicted values or marginal effects at specified values of covariates. Combined with marginsplot, it is the standard way to interpret interaction terms, nonlinear models, and factor variables in applied work.
regress price c.mpg##c.mpg##i.foreign c.weight, vce(robust)

* Average marginal effect of mpg (accounts for quadratic + interaction)
margins, dydx(mpg)

* Marginal effect of mpg at specific values
margins, dydx(mpg) at(mpg=(15(5)40))
marginsplot, ///
    title("Marginal Effect of MPG on Price") ///
    ytitle("Effect on Price (USD)") ///
    xtitle("MPG") ///
    recast(line) recastci(rarea) ///
    ciopt(fcolor(navy%20))

* Predicted values by group at different covariate levels
margins foreign, at(mpg=(15(5)40))
marginsplot, ///
    title("Predicted Price by Origin") ///
    legend(order(1 "Domestic" 2 "Foreign"))

* Contrast: difference between groups
margins r.foreign, at(mpg=(15(5)40))
marginsplot, yline(0) ///
    title("Foreign - Domestic Price Difference")
Use margins, dydx() and marginsplot to interpret and visualize interaction effects. Presenting a marginsplot alongside the coefficient table is increasingly expected in top journals.
Producing publication-ready regression tables directly from Stata eliminates transcription errors and speeds up revision rounds. Two popular packages exist: esttab (part of estout) and outreg2.
* --- esttab approach (recommended) ---
sysuse auto, clear

* Run multiple specifications and store each
regress price mpg, vce(robust)
estimates store m1
regress price mpg weight, vce(robust)
estimates store m2
regress price mpg weight foreign i.rep78, vce(robust)
estimates store m3

* Export to LaTeX
esttab m1 m2 m3 using "tables/regression.tex", ///
    star(* 0.10 ** 0.05 *** 0.01) ///
    se(3) b(3) ///
    r2(3) ar2(3) ///
    scalars("F F-statistic") ///
    mtitles("(1)" "(2)" "(3)") ///
    label replace booktabs ///
    addnotes("Robust standard errors in parentheses.")

* Export to CSV (for Excel)
esttab m1 m2 m3 using "tables/regression.csv", ///
    star(* 0.10 ** 0.05 *** 0.01) ///
    se(3) b(3) label replace csv

* --- outreg2 approach ---
* ssc install outreg2, replace
regress price mpg, vce(robust)
outreg2 using "tables/regression_or2", ///
    tex replace ctitle("(1)") dec(3) label
regress price mpg weight, vce(robust)
outreg2 using "tables/regression_or2", ///
    tex append ctitle("(2)") dec(3) label
esttab is more flexible and produces cleaner LaTeX output (it supports booktabs, custom cell formatting, and multiple stored models per table). outreg2 has a simpler syntax and is popular in some fields. For journal submissions, esttab with booktabs is the most common choice in economics and OM.
Stata provides stepwise for automated variable selection. However, stepwise regression is widely criticized in modern applied econometrics and should generally be avoided in research papers.
* Forward stepwise (adds variables one at a time)
stepwise, pe(0.05): regress price mpg weight length headroom trunk

* Backward stepwise (starts with all, removes insignificant)
stepwise, pr(0.10): regress price mpg weight length headroom trunk
Stepwise selection invalidates the reported standard errors and p-values because they ignore the search over specifications. If data-driven variable selection is genuinely needed, penalized methods such as the lasso are preferred (Stata 16+'s built-in lasso command or the community-contributed lasso2 package).
Using sysuse auto, regress price on mpg, weight, length, and foreign with robust standard errors. Run the RESET test and the Breusch-Pagan test. Based on the results, is there evidence of misspecification or heteroskedasticity?
Estimate a model with the interaction c.mpg##i.foreign. Use margins to compute the marginal effect of mpg separately for domestic and foreign cars. Create a marginsplot. Interpret the interaction.
Run the regression regress price mpg weight length foreign. Compute DFBETA values for mpg. How many observations exceed the 2/sqrt(N) threshold? List them. Re-estimate the regression after dropping these influential observations. Do the coefficients change meaningfully?
Estimate three nested specifications of the price model: (1) mpg only, (2) mpg weight, (3) mpg weight foreign i.rep78. Use esttab to export all three models in a single LaTeX table with robust standard errors, stars at the 10/5/1% levels, and R-squared reported at the bottom. Include a table note about the standard errors.
- Use vce(robust) or vce(cluster) in applied work; choose the clustering level based on where treatment varies or where correlation occurs.
- Use factor-variable notation (c., i., ##) for interactions.
- Interpret results with margins and marginsplot, not raw coefficients alone.
- Use esttab to produce publication-ready tables; avoid stepwise regression in causal research.