Fixed Effects, Random Effects, and the Hausman Test
Before using any panel estimator, you must tell Stata which variable identifies units and which identifies time periods. The xtset command does this.
    * Declare panel structure
    xtset facility_id year
    * Output: panel variable: facility_id (strongly balanced)
    *         time variable: year, 2015 to 2019

    * Check panel summary
    xtdescribe

    * Panel-level summary statistics
    xtsum qip rn_pp sdi
xtsum decomposes each variable into overall, between-unit, and within-unit variation. Fixed effects exploit only within-unit variation; random effects use both. If a variable has zero within variation (e.g., a time-invariant characteristic), FE cannot estimate its coefficient. Understanding this decomposition is fundamental: a variable with large between-variation but tiny within-variation may be statistically significant in RE but not in FE, because FE discards all cross-sectional variation.
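The decomposition that xtsum reports can be written out explicitly. The following is a sketch in generic notation, where x_{it} is the value for unit i in period t, \bar{x}_i is the unit mean, and \bar{x} is the grand mean; these symbols are illustrative and do not come from the text above.

```latex
% Overall deviation = between component + within component
x_{it} - \bar{x}
  = \underbrace{(\bar{x}_i - \bar{x})}_{\text{between}}
  + \underbrace{(x_{it} - \bar{x}_i)}_{\text{within}},
\qquad
\bar{x}_i = \frac{1}{T_i}\sum_{t=1}^{T_i} x_{it}
```

For a time-invariant variable, x_{it} = \bar{x}_i in every period, so the within component is identically zero. That is the case in which FE has no variation left to estimate a coefficient from.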
Panel data supports three distinct estimators that use different sources of variation. Understanding their relationship clarifies why FE and RE give different answers.
    * Load panel data
    webuse nlswork, clear
    xtset idcode year

    * --- Within estimator (equivalent to FE) ---
    * Uses only variation within each unit over time
    xtreg ln_wage age tenure hours, fe
    estimates store Within

    * --- Between estimator ---
    * Uses only cross-sectional variation (unit means)
    xtreg ln_wage age tenure hours, be
    estimates store Between

    * --- Random effects = weighted average of within and between ---
    xtreg ln_wage age tenure hours, re
    estimates store RE_model

    * Compare all three
    esttab Within Between RE_model, ///
        mtitles("Within (FE)" "Between" "RE") ///
        se star(* 0.10 ** 0.05 *** 0.01) ///
        scalars("r2_w R-sq within" "r2_b R-sq between" "r2_o R-sq overall")
xtreg reports three R-squared values: within (how well the model explains variation over time for each unit), between (how well it explains variation across units), and overall. FE maximizes the within R-squared. If your within R-squared is very low but between R-squared is high, your model mostly captures cross-sectional differences, and the time variation may be too noisy for FE to detect effects.
The FE estimator removes all time-invariant unobserved heterogeneity by demeaning each variable at the unit level. This is the default choice for causal inference with panel data in most OM and economics applications.
    * Fixed effects regression
    xtreg qip rn_pp sdi i.year, fe vce(cluster facility_id)

    * Store results for later comparison
    estimates store FE

    * Test joint significance of time dummies
    testparm i.year
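To make the demeaning step concrete, here is a minimal sketch that reproduces the FE slope coefficients by hand. It uses the public nlswork data rather than the facility panel, and the m_ and w_ variable prefixes and the scalar names are illustrative choices, not part of any Stata convention.

```stata
* Sketch: FE slopes equal OLS slopes on unit-demeaned data
webuse nlswork, clear
xtset idcode year

* Keep complete cases so the unit means match the estimation sample
keep if !missing(ln_wage, age, tenure, hours)

* Subtract each unit's own mean from every model variable
foreach v in ln_wage age tenure hours {
    egen double m_`v' = mean(`v'), by(idcode)
    gen double w_`v' = `v' - m_`v'
}

* OLS on the demeaned data (no constant: demeaned variables have mean zero)
regress w_ln_wage w_age w_tenure w_hours, noconstant
scalar b_manual = _b[w_tenure]

* Built-in within estimator for comparison
xtreg ln_wage age tenure hours, fe
scalar b_fe = _b[tenure]

display "manual: " b_manual "   xtreg, fe: " b_fe
```

The point estimates should agree up to rounding. The standard errors from the manual regression are too small, because regress does not know that one mean per unit was estimated and removed; xtreg, fe applies the correct degrees-of-freedom adjustment.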
The RE estimator assumes that the unobserved unit-level heterogeneity is uncorrelated with the regressors. It is more efficient than FE when this assumption holds, and it can estimate coefficients on time-invariant variables.
    * Random effects regression
    xtreg qip rn_pp sdi i.year, re vce(cluster facility_id)

    * Store results
    estimates store RE

    * Breusch-Pagan LM test: RE vs. pooled OLS
    xtreg qip rn_pp sdi i.year, re
    xttest0
    * H0: sigma_u^2 = 0 (no panel effects). Reject => use RE over pooled OLS
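One way to see how RE sits between pooled OLS and FE is the quasi-demeaning form of the GLS estimator. The sketch below assumes a balanced panel with T periods and uses u_i for the unit effect and e_{it} for the idiosyncratic error, following the sigma_u notation in the code comments; \alpha is the intercept and \theta is notation introduced here.

```latex
% RE as partial (quasi-) demeaning
y_{it} - \theta\,\bar{y}_i
  = (1-\theta)\,\alpha
  + (x_{it} - \theta\,\bar{x}_i)'\beta
  + \bigl[(1-\theta)\,u_i + (e_{it} - \theta\,\bar{e}_i)\bigr],
\qquad
\theta = 1 - \sqrt{\frac{\sigma_e^2}{\sigma_e^2 + T\,\sigma_u^2}}
```

When \sigma_u^2 = 0 (the null of the Breusch-Pagan test), \theta = 0 and RE collapses to pooled OLS. As T\sigma_u^2 grows relative to \sigma_e^2, \theta approaches 1 and RE approaches the within (FE) estimator.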
The Hausman test compares FE and RE estimates. If they differ systematically, it means the RE assumption (no correlation between unit effects and regressors) is violated, and FE should be preferred.
    * Run FE without cluster (Hausman requires default VCE)
    xtreg qip rn_pp sdi i.year, fe
    estimates store FE_h

    * Run RE without cluster
    xtreg qip rn_pp sdi i.year, re
    estimates store RE_h

    * Hausman test
    hausman FE_h RE_h
    * H0: RE is consistent and efficient
    * Reject => use FE
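In finite samples the Hausman statistic can come out negative, or Stata may warn that the difference of the variance matrices is not positive definite. A common remedy is to base both variance estimates on the same disturbance variance: hausman FE_h RE_h, sigmamore. Better yet, many applied researchers simply default to FE in causal designs, since the RE assumption is difficult to defend.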
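For reference, the statistic behind the hausman command has the standard textbook form below. The notation is introduced here for illustration: \hat{\beta} are the coefficient vectors on the regressors common to both models, \widehat{V} their estimated variance matrices, and k the number of coefficients compared.

```latex
% Hausman statistic comparing FE and RE
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \bigl[\widehat{V}_{FE} - \widehat{V}_{RE}\bigr]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
  \;\overset{a}{\sim}\; \chi^2_k
  \quad \text{under } H_0
```

Under the null, \widehat{V}_{FE} - \widehat{V}_{RE} is positive semidefinite in large samples because RE is efficient. In a given sample the estimated difference need not be, which is how a negative statistic can arise and why options such as sigmamore exist.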
A common source of confusion in panel data is why certain variables "disappear" in FE estimation. The reason is that FE demeans all variables at the unit level, so any variable that is constant within a unit over time is perfectly collinear with the unit fixed effect and is absorbed.
    * Load panel data
    webuse nlswork, clear
    xtset idcode year

    * race is time-invariant => dropped by FE
    xtreg ln_wage age tenure hours race, fe
    * Note: race is omitted (collinear with unit FE)

    * RE can estimate time-invariant coefficients
    xtreg ln_wage age tenure hours race, re
    * race coefficient is estimated

    * Check within-variation for a variable
    xtsum race
    * If "within" std dev = 0, the variable is time-invariant
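The Mundlak (1978) approach, also called correlated random effects (CRE), augments the RE model with the group means of all time-varying regressors. Testing whether the coefficients on these group means are jointly zero is a regression-based version of the Hausman test, and unlike the standard test it can be run with cluster-robust standard errors. The augmented model also lets you estimate coefficients on time-invariant variables while controlling for correlation between the unit effect and the time-varying regressors.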
    webuse nlswork, clear
    xtset idcode year

    * Step 1: compute group means of time-varying regressors
    * Use only rows that enter the regression, so the means match the estimation sample
    gen byte insample = !missing(ln_wage, age, tenure, hours, race)
    foreach var in age tenure hours {
        egen mean_`var' = mean(`var') if insample, by(idcode)
    }

    * Step 2: RE model augmented with group means (Mundlak approach)
    xtreg ln_wage age tenure hours race ///
        mean_age mean_tenure mean_hours, re vce(cluster idcode)

    * Step 3: Test joint significance of group means
    * H0: group means are jointly zero => RE is consistent
    * Reject => FE is needed (same conclusion as Hausman test)
    test mean_age mean_tenure mean_hours

    * Key insight: the time-varying coefficients in CRE reproduce the FE
    * coefficients (exactly in a balanced panel, and very closely in an
    * unbalanced one such as nlswork), and race is now estimable.
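The model behind the Mundlak regression can be sketched as follows. The notation is introduced here for illustration: x_{it} collects the time-varying regressors, z_i the time-invariant ones (race in the example), u_i is the unit effect, and a_i is the part of it left over after projecting on the unit means.

```latex
\begin{aligned}
y_{it} &= x_{it}'\beta + z_i'\delta + u_i + e_{it}, \\
u_i    &= \bar{x}_i'\gamma + a_i,
          \qquad E[a_i \mid x_{i1},\dots,x_{iT}, z_i] = 0, \\
\Longrightarrow\quad
y_{it} &= x_{it}'\beta + \bar{x}_i'\gamma + z_i'\delta + a_i + e_{it},
          \qquad H_0:\ \gamma = 0 .
\end{aligned}
```

If \gamma = 0, the unit effect is unrelated to the regressors and plain RE is consistent. Note that \delta is still only consistently estimated if z_i is uncorrelated with a_i, an assumption the group means do not relax.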
After estimating a model with fixed effects, you should test whether the fixed effects are jointly significant. This confirms that the panel structure matters and that pooled OLS would be inappropriate.
    * F-test for joint significance of unit FE
    xtreg ln_wage age tenure hours, fe
    * The F-test at the bottom of xtreg,fe output tests:
    * H0: all unit fixed effects = 0
    * This is reported as "F test that all u_i=0"

    * Test joint significance of time dummies
    xtreg ln_wage age tenure hours i.year, fe
    testparm i.year
    * H0: all year dummies are jointly zero
    * Reject => time effects matter, include them

    * Test subsets of regressors
    test age tenure    // joint significance of age and tenure
When your model has multiple sets of fixed effects (e.g., facility FE and year FE, or facility-year and state-year FE), reghdfe is far more efficient than xtreg or dummy-variable approaches.
    * Install: ssc install reghdfe, replace
    *          ssc install ftools, replace

    * Two-way fixed effects (facility + year)
    reghdfe qip rn_pp sdi, absorb(facility_id year) vce(cluster facility_id)

    * Three-way: facility + year + state-specific trends
    reghdfe qip rn_pp sdi, absorb(facility_id year state#c.year) ///
        vce(cluster facility_id)

    * Store results
    estimates store HDFE
First differencing (FD) is an alternative to FE that also removes time-invariant unobserved heterogeneity. With T=2 periods, FE and FD are numerically identical. With T>2, FE is more efficient when the idiosyncratic errors are serially uncorrelated, while FD is the better choice when the errors are highly persistent (close to a random walk).
    * Generate first differences manually
    xtset facility_id year
    gen d_qip = D.qip
    gen d_rn = D.rn_pp
    gen d_sdi = D.sdi

    * First-differenced regression
    regress d_qip d_rn d_sdi i.year, vce(cluster facility_id)
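The reason differencing removes the unit effect can be shown in one line. The sketch uses u_i for the unit effect and e_{it} for the idiosyncratic error; the notation is introduced here for illustration.

```latex
% The unit effect u_i cancels when consecutive periods are subtracted
\Delta y_{it}
  = y_{it} - y_{i,t-1}
  = (x_{it} - x_{i,t-1})'\beta + (u_i - u_i) + (e_{it} - e_{i,t-1})
  = \Delta x_{it}'\beta + \Delta e_{it},
\qquad t = 2,\dots,T
```

If e_{it} is serially uncorrelated, \Delta e_{it} and \Delta e_{i,t-1} share the term e_{i,t-1} with opposite signs and are therefore negatively correlated, which is one reason to cluster the standard errors by unit in the differenced regression.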
Panel models with interactions require careful thought about what variation identifies the interaction effect. In FE models, the interaction coefficient is identified from within-unit variation in the product of the two variables, so at least one of the interacted variables must change over time within units.
    * --- xtreg with interactions ---
    xtreg qip c.rn_pp##c.sdi i.year, fe vce(cluster facility_id)

    * Marginal effect of RN staffing at different SDI levels
    margins, dydx(rn_pp) at(sdi=(0(20)100))
    marginsplot, ///
        title("Effect of RN Staffing on QIP by SDI") ///
        ytitle("Marginal Effect of RN/100 Patients") ///
        xtitle("Social Deprivation Index") ///
        yline(0, lpattern(dash) lcolor(red))

    * --- Three-way interaction ---
    xtreg qip c.rn_pp##c.sdi##i.emp_policy i.year, ///
        fe vce(cluster facility_id)

    * --- reghdfe with interactions ---
    reghdfe qip c.rn_pp##c.sdi i.emp_policy, ///
        absorb(facility_id year) vce(cluster facility_id)

    * reghdfe with interacted fixed effects (state-specific trends)
    reghdfe qip rn_pp sdi, ///
        absorb(facility_id year state#c.year) ///
        vce(cluster facility_id)
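A time-invariant variable can still enter an FE model through an interaction. When region is constant within units, xtreg y c.x##i.region, fe drops the main effect of region but still estimates c.x#i.region (how the within-unit effect of x differs across regions). This is valid and commonly used.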
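A runnable version of this pattern can be sketched with the nlswork data, assuming (as in the earlier example) that race is constant within idcode. The choice of tenure as the time-varying regressor is illustrative.

```stata
* Sketch: interacting a time-varying regressor with a time-invariant factor
webuse nlswork, clear
xtset idcode year

* The main effect of race is absorbed by the person fixed effect and omitted;
* the tenure-by-race slopes are identified from within-person changes in tenure
xtreg ln_wage c.tenure##i.race age, fe vce(cluster idcode)

* Joint test that the tenure slope is the same across race categories
testparm i.race#c.tenure
```

The output should list the race main effects as omitted while reporting a coefficient for each non-base race category interacted with tenure.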
When the lagged dependent variable is a regressor, FE is biased (Nickell bias). For short panels (small T, large N), this bias can be substantial. The Arellano-Bond GMM estimator addresses this by using lagged levels as instruments for first-differenced equations.
    * Dynamic panel model: lagged dependent variable
    * FE is biased with lagged DV (Nickell bias)
    xtreg ln_wage L.ln_wage age tenure, fe
    * WARNING: biased for small T (T < 20-30)

    * Arellano-Bond GMM (difference GMM)
    xtabond ln_wage age tenure, lags(1) ///
        twostep vce(robust)
    * AR(1) and AR(2) tests on the first-differenced errors
    estat abond

    * System GMM (more efficient, uses both levels and differences)
    * ssc install xtabond2, replace
    xtabond2 ln_wage L.ln_wage age tenure i.year, ///
        gmm(L.ln_wage, lag(2 4)) ///
        iv(age tenure i.year) ///
        twostep robust small

    * Key diagnostic tests for GMM:
    * 1. AR(1) should be significant (expected)
    * 2. AR(2) should NOT be significant (no second-order autocorrelation)
    * 3. Hansen J test should NOT reject (instruments are valid)
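The source and rough size of the Nickell bias can be sketched as follows. The notation is introduced here for illustration, and the approximation is the standard large-N result for a pure AR(1) panel model without additional regressors at moderately large T; treat it as a guide to magnitude, not an exact correction.

```latex
\begin{aligned}
& y_{it} = \rho\,y_{i,t-1} + x_{it}'\beta + u_i + e_{it}
  && \text{(dynamic model)} \\
& \operatorname*{plim}_{N\to\infty}\,(\hat{\rho}_{FE} - \rho)
  \;\approx\; -\,\frac{1+\rho}{T-1}
  && \text{(Nickell bias, no } x_{it}\text{)} \\
& E\bigl[\,y_{i,t-s}\,\Delta e_{it}\,\bigr] = 0, \quad s \ge 2
  && \text{(Arellano-Bond moment conditions)}
\end{aligned}
```

The bias arises because the demeaned lag y_{i,t-1} - \bar{y}_{i,-1} contains every e_{is} through the unit mean and is therefore correlated with the demeaned error. With \rho = 0.5 and T = 5, the approximation gives a bias of about -0.375, which illustrates why short panels are the problem case. The moment conditions hold only if e_{it} is serially uncorrelated, which is what the AR(2) test checks.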
    * Compare FE, RE, and HDFE side by side
    esttab FE RE HDFE using "tables/panel_results.tex", ///
        star(* 0.10 ** 0.05 *** 0.01) ///
        se(3) b(3) ///
        scalars("r2_w R-sq (within)" "N_g Groups") ///
        mtitles("FE" "RE" "HDFE") ///
        label replace booktabs
Load the nlswork dataset (webuse nlswork, clear). Declare the panel with xtset idcode year. Estimate FE and RE models of ln_wage on age, tenure, and hours. Run the Hausman test. Which model is preferred?
Using the same data, estimate the wage equation with reghdfe absorbing both idcode and year fixed effects. Cluster standard errors at the individual level. Compare the coefficients with those from xtreg, fe.
Implement the Mundlak (Correlated Random Effects) approach on the nlswork data. Compute group means for age, tenure, and hours. Include race as a time-invariant regressor. Run the augmented RE model and test the joint significance of the group means. Compare the time-varying coefficients to those from FE. Are they numerically identical (or very close)?
Using webuse nlswork, estimate a dynamic panel model of ln_wage with its first lag as a regressor, plus age and tenure as controls. Compare: (a) FE with lagged DV, (b) Arellano-Bond GMM via xtabond. Report the AR(2) test p-value from the GMM estimation. Does the coefficient on the lagged DV differ meaningfully between the two approaches?
Declare the panel with xtset before panel estimation and inspect balance with xtdescribe.
Use reghdfe for models with multiple high-dimensional fixed effects.
Interpret interaction effects with margins and marginsplot.