Chapter 5: Panel Data Methods

Fixed Effects, Random Effects, and the Hausman Test

5.1 Declaring Panel Data

Before using any panel estimator, you must tell Stata which variable identifies units and which identifies time periods. The xtset command does this.

* Declare panel structure
xtset facility_id year
* Output: panel variable: facility_id (strongly balanced)
*         time variable: year, 2015 to 2019

* Check panel summary
xtdescribe

* Panel-level summary statistics
xtsum qip rn_pp sdi

Between vs. Within Variation
xtsum decomposes each variable into overall, between-unit, and within-unit variation. Fixed effects exploit only within-unit variation; random effects use both. If a variable has zero within variation (e.g., a time-invariant characteristic), FE cannot estimate its coefficient. Understanding this decomposition is fundamental: a variable with large between-variation but tiny within-variation may be statistically significant in RE but not in FE, because FE discards all cross-sectional variation.
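To see what xtsum reports, the decomposition can be reproduced by hand. The following is a minimal sketch on the nlswork data used later in this chapter; the names ubar, within_lw, and first are illustrative and do not appear elsewhere in the chapter.

```stata
* Sketch: rebuild the between and within pieces that xtsum reports
webuse nlswork, clear
xtset idcode year

* Between component: each unit's own mean of ln_wage
egen double ubar = mean(ln_wage), by(idcode)

* Within component: deviation from the unit mean, recentered at the grand mean
summarize ln_wage, meanonly
gen double within_lw = ln_wage - ubar + r(mean)

* Between variation is measured across units, so keep one mean per unit
egen byte first = tag(idcode)
summarize ubar if first

* Within variation uses every observation
summarize within_lw

* Compare with the "between" and "within" rows here
xtsum ln_wage
```

The standard deviations from the two summarize calls should closely track the between and within rows of xtsum. A variable whose within_lw has no spread is exactly the kind of regressor that FE cannot use.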

5.1a Understanding Within vs. Between Estimators

Panel data supports three distinct estimators that use different sources of variation. Understanding their relationship clarifies why FE and RE give different answers.

* Load panel data
webuse nlswork, clear
xtset idcode year

* --- Within estimator (equivalent to FE) ---
* Uses only variation within each unit over time
xtreg ln_wage age tenure hours, fe
estimates store Within

* --- Between estimator ---
* Uses only cross-sectional variation (unit means)
xtreg ln_wage age tenure hours, be
estimates store Between

* --- Random effects = weighted average of within and between ---
xtreg ln_wage age tenure hours, re
estimates store RE_model

* Compare all three
esttab Within Between RE_model, ///
    mtitles("Within (FE)" "Between" "RE") ///
    se star(* 0.10 ** 0.05 *** 0.01) ///
    scalars("r2_w R-sq within" "r2_b R-sq between" "r2_o R-sq overall")

Tip: Interpreting R-squared in Panel Models
xtreg reports three R-squared values: within (how well the model explains variation over time for each unit), between (how well it explains variation across units), and overall. FE maximizes the within R-squared. If your within R-squared is very low but between R-squared is high, your model mostly captures cross-sectional differences, and the time variation may be too noisy for FE to detect effects.

5.2 Fixed Effects (FE)

The FE estimator removes all time-invariant unobserved heterogeneity by demeaning each variable at the unit level. This is the default choice for causal inference with panel data in most OM and economics applications.

* Fixed effects regression
xtreg qip rn_pp sdi i.year, fe vce(cluster facility_id)

* Store results for later comparison
estimates store FE

* Test joint significance of time dummies
testparm i.year
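The demeaning described at the start of this section can be checked directly. The sketch below uses the nlswork data rather than the facility panel, and the m_ and dm_ variable prefixes are illustrative names introduced here. It restricts the sample to complete cases first so that the unit means are computed on the same observations that xtreg uses.

```stata
* Sketch: FE coefficients equal OLS on unit-demeaned data
webuse nlswork, clear
xtset idcode year

* Keep complete cases so unit means match the estimation sample
keep if !missing(ln_wage, age, tenure, hours)

* Demean every variable at the unit level
foreach v in ln_wage age tenure hours {
    egen double m_`v' = mean(`v'), by(idcode)
    gen double dm_`v' = `v' - m_`v'
}

* OLS on demeaned data (no constant: demeaned variables have mean zero)
regress dm_ln_wage dm_age dm_tenure dm_hours, noconstant

* Built-in within estimator for comparison
xtreg ln_wage age tenure hours, fe
```

The slope coefficients from the two commands should agree. The standard errors from regress will be too small, because regress does not know that one degree of freedom per unit was used up by the demeaning; this is one reason to rely on xtreg, fe or reghdfe for inference.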

5.3 Random Effects (RE)

The RE estimator assumes that the unobserved unit-level heterogeneity is uncorrelated with the regressors. It is more efficient than FE when this assumption holds, and it can estimate coefficients on time-invariant variables.

* Random effects regression
xtreg qip rn_pp sdi i.year, re vce(cluster facility_id)

* Store results
estimates store RE

* Breusch-Pagan LM test: RE vs. pooled OLS
xtreg qip rn_pp sdi i.year, re
xttest0
* H0: sigma_u^2 = 0 (no panel effects). Reject => use RE over pooled OLS

5.4 The Hausman Test

The Hausman test compares FE and RE estimates. If they differ systematically, it means the RE assumption (no correlation between unit effects and regressors) is violated, and FE should be preferred.

* Run FE without cluster (Hausman requires default VCE)
xtreg qip rn_pp sdi i.year, fe
estimates store FE_h

* Run RE without cluster
xtreg qip rn_pp sdi i.year, re
estimates store RE_h

* Hausman test
hausman FE_h RE_h
* H0: RE is consistent and efficient
* Reject => use FE
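For reference, the statistic behind this comparison is conventionally written as below. The notation is introduced here for illustration and is not taken from the Stata output; k denotes the number of coefficients being compared.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \left[ \widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE}) \right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
    \;\sim\; \chi^2_{k} \quad \text{under } H_0
```

A large value of H means the two sets of estimates are further apart than sampling error alone would explain. The form of the variance term relies on RE being efficient under the null, which is why the test is run with the default VCE.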

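Hausman Test Limitations
The classic Hausman test assumes that RE is fully efficient under the null, which requires homoskedastic, serially uncorrelated idiosyncratic errors. The sigmamore option (hausman FE_h RE_h, sigmamore) bases both covariance matrices on the same error-variance estimate, which helps avoid a variance difference that is not positive definite, but it does not make the test robust to heteroskedasticity or clustering. For a cluster-robust comparison, use the Mundlak test in Section 5.4b or the user-written xtoverid command. Better yet, many applied researchers simply default to FE in causal designs, since the RE assumption is difficult to defend.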

5.4a Time-Varying vs. Time-Invariant Variables

A common source of confusion in panel data is why certain variables "disappear" in FE estimation. The reason is that FE demeans all variables at the unit level, so any variable that is constant within a unit over time is perfectly collinear with the unit fixed effect and is absorbed.

* Load panel data
webuse nlswork, clear
xtset idcode year

* race is time-invariant => dropped by FE
* (i.race enters the coded categories as indicators, not as a linear scale)
xtreg ln_wage age tenure hours i.race, fe
* Note: the race indicators are omitted (collinear with unit FE)

* RE can estimate time-invariant coefficients
xtreg ln_wage age tenure hours i.race, re
* race coefficients are estimated

* Check within-variation for a variable
xtsum race
* If "within" std dev = 0, the variable is time-invariant

What If You Need Time-Invariant Coefficients?
If you need to estimate the effect of a time-invariant variable (like gender, race, or geographic region) while controlling for unobserved heterogeneity, you have several options: (1) Correlated Random Effects (Mundlak approach), (2) Hausman-Taylor estimator, (3) Between estimator (though this loses the causal credibility of FE). The Mundlak approach is the most common in applied research.

5.4b The Mundlak Test and Correlated Random Effects

The Mundlak (1978) approach augments the RE model with the group means of all time-varying regressors. This is equivalent to testing whether FE and RE give different results, and it allows you to estimate time-invariant coefficients while controlling for correlation between the unit effect and the regressors.

webuse nlswork, clear
xtset idcode year

* Step 1: compute group means of time-varying regressors
* Use complete cases only, so the means match the estimation sample
gen byte touse = !missing(ln_wage, age, tenure, hours, race)
foreach var in age tenure hours {
    egen mean_`var' = mean(`var') if touse, by(idcode)
}

* Step 2: RE model augmented with group means (Mundlak approach)
xtreg ln_wage age tenure hours i.race ///
    mean_age mean_tenure mean_hours, re vce(cluster idcode)

* Step 3: Test joint significance of group means
* H0: group means are jointly zero => RE is consistent
* Reject => FE is needed (same conclusion as Hausman test)
test mean_age mean_tenure mean_hours

* Key insight: when the group means are computed on the estimation
* sample, the time-varying coefficients in CRE reproduce the FE
* coefficients (up to numerical precision), and race is now estimable.

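Tip: CRE as a Better Alternative to Hausman
The Correlated Random Effects (Mundlak) approach has two advantages over the classic Hausman test. First, it works with clustered standard errors (the classic Hausman test requires the default VCE). Second, it lets you include time-invariant regressors while the coefficients on the time-varying regressors keep their FE interpretation. The coefficients on the time-invariant variables themselves are not protected by the fixed effects: they are consistent only if those variables are uncorrelated with the unit effect after conditioning on the group means. Many applied researchers now prefer CRE to the standard Hausman test.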

5.4c Testing Joint Significance of Fixed Effects

After estimating a model with fixed effects, you should test whether the fixed effects are jointly significant. This confirms that the panel structure matters and that pooled OLS would be inappropriate.

* F-test for joint significance of unit FE
xtreg ln_wage age tenure hours, fe
* The F-test at the bottom of xtreg,fe output tests:
* H0: all unit fixed effects = 0
* This is reported as "F test that all u_i=0"

* Test joint significance of time dummies
xtreg ln_wage age tenure hours i.year, fe
testparm i.year
* H0: all year dummies are jointly zero
* Reject => time effects matter, include them

* Test subsets of regressors
test age tenure              // joint significance of age and tenure

5.5 High-Dimensional Fixed Effects with reghdfe

When your model has multiple sets of fixed effects (e.g., facility FE and year FE, or facility-year and state-year FE), reghdfe is far more efficient than xtreg or dummy-variable approaches.

* Install: ssc install reghdfe, replace
*          ssc install ftools, replace

* Two-way fixed effects (facility + year)
reghdfe qip rn_pp sdi, absorb(facility_id year) vce(cluster facility_id)

* Facility + year FE plus state-specific linear trends (state#c.year absorbs one slope per state)
reghdfe qip rn_pp sdi, absorb(facility_id year state#c.year) ///
    vce(cluster facility_id)

* Store results
estimates store HDFE

5.6 First-Differencing

First differencing is an alternative to FE that also removes time-invariant unobserved heterogeneity. With T=2 periods, FE and FD are numerically identical. With T>2, FE is more efficient when the idiosyncratic errors are serially uncorrelated, while FD is more efficient when they are highly persistent (close to a random walk).

* Generate first differences manually
xtset facility_id year
gen d_qip = D.qip
gen d_rn = D.rn_pp
gen d_sdi = D.sdi

* First-differenced regression
regress d_qip d_rn d_sdi i.year, vce(cluster facility_id)
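The T=2 equivalence between FE and FD can be verified on a two-period slice of nlswork. This is a sketch under stated assumptions: the choice of the 1987 and 1988 waves (coded 87 and 88 in the data) and of tenure and hours as regressors is illustrative and not taken from the facility example.

```stata
* Sketch: with two periods, FE and FD give the same slope estimates
webuse nlswork, clear
keep if inlist(year, 87, 88)
xtset idcode year

* FE with a year dummy
xtreg ln_wage tenure hours i.year, fe

* FD with a constant (the constant plays the role of the year dummy)
regress D.(ln_wage tenure hours)
```

The coefficients on tenure and hours should match across the two commands. Units observed in only one of the two years drop out of the FD regression and contribute nothing to the FE slopes, so both estimators rest on the same observations.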

5.7 Interactions in Panel Models

Panel models with interactions require careful thought about what variation identifies the interaction effect. In FE models, the interaction effect is identified from within-unit variation in the product of the interacted variables, so at least one of them must change over time within units.

* --- xtreg with interactions ---
xtreg qip c.rn_pp##c.sdi i.year, fe vce(cluster facility_id)

* Marginal effect of RN staffing at different SDI levels
margins, dydx(rn_pp) at(sdi=(0(20)100))
marginsplot, ///
    title("Effect of RN Staffing on QIP by SDI") ///
    ytitle("Marginal Effect of RN/100 Patients") ///
    xtitle("Social Deprivation Index") ///
    yline(0, lpattern(dash) lcolor(red))

* --- Three-way interaction ---
xtreg qip c.rn_pp##c.sdi##i.emp_policy i.year, ///
    fe vce(cluster facility_id)

* --- reghdfe with interactions ---
reghdfe qip c.rn_pp##c.sdi i.emp_policy, ///
    absorb(facility_id year) vce(cluster facility_id)

* reghdfe with interacted fixed effects (state-specific trends)
reghdfe qip rn_pp sdi, ///
    absorb(facility_id year state#c.year) ///
    vce(cluster facility_id)

Interactions and Time-Invariant Variables in FE
If one variable in an interaction is time-invariant (e.g., region), FE absorbs its main effect but can still estimate the interaction with a time-varying variable. For example, xtreg y c.x##i.region, fe drops the main effect of region but estimates c.x#i.region (how the within-unit effect of x differs across regions). This is valid and commonly used.
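A concrete version of this pattern on nlswork might look as follows. It is a minimal sketch: pairing tenure with race is an illustrative choice, and race is assumed to be coded 1, 2, 3 with 1 as the base category.

```stata
* Sketch: time-varying regressor interacted with a time-invariant factor
webuse nlswork, clear
xtset idcode year

* Main effects of race are omitted; the tenure-by-race slopes are estimated
xtreg ln_wage c.tenure##i.race, fe vce(cluster idcode)

* Effect of tenure in the base category (race == 1)
lincom tenure

* Effect of tenure for race == 2: base slope plus the interaction term
lincom tenure + 2.race#c.tenure
```

Using lincom keeps the calculation explicit: each group's tenure effect is the base slope plus that group's interaction coefficient, with a standard error that accounts for their covariance.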

5.8 Dynamic Panel Models

When the lagged dependent variable is a regressor, FE is biased (Nickell bias). For short panels (small T, large N), this bias can be substantial. The Arellano-Bond GMM estimator addresses this by using lagged levels as instruments for first-differenced equations.

* Dynamic panel model: lagged dependent variable
* FE is biased with lagged DV (Nickell bias)
xtreg ln_wage L.ln_wage age tenure, fe
* WARNING: biased for small T (T < 20-30)

* Arellano-Bond GMM (difference GMM)
xtabond ln_wage age tenure, lags(1) ///
    twostep vce(robust)

* System GMM (more efficient, uses both levels and differences)
* ssc install xtabond2, replace
xtabond2 ln_wage L.ln_wage age tenure i.year, ///
    gmm(L.ln_wage, lag(2 4)) ///
    iv(age tenure i.year) ///
    twostep robust small

* Key diagnostic tests for GMM:
* 1. AR(1) should be significant (expected)
* 2. AR(2) should NOT be significant (no second-order autocorrelation)
* 3. Hansen J test should NOT reject (instruments are valid)
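For the built-in xtabond command, the serial-correlation and overidentification diagnostics come from postestimation commands rather than from the main output. The sketch below uses Stata's abdata example dataset, with n, w, and k chosen purely for illustration; note that xtabond reports a Sargan test, while the Hansen J statistic is part of the xtabond2 output.

```stata
* Sketch: GMM diagnostics after xtabond
webuse abdata, clear
xtset id year

* Two-step difference GMM with robust standard errors
xtabond n w k, lags(1) twostep vce(robust)

* Arellano-Bond tests for AR(1) and AR(2) in the first-differenced errors
estat abond

* The Sargan test is not available after vce(robust), so refit without it
xtabond n w k, lags(1) twostep
estat sargan
```

In the estat abond output, the order-2 row is the one to report: a small p-value there would indicate that the lagged levels are not valid instruments.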

When to Use Dynamic Panels
Dynamic panel models are appropriate when you believe the outcome persists over time (state dependence). Common examples include wages (wage at t depends on wage at t-1), firm performance, and health outcomes. If the autoregressive coefficient is close to 1, it suggests strong persistence. If it is close to 0, a static FE model may be adequate. Always report the AR(2) and Hansen tests to validate the GMM specification.

5.9 Exporting Panel Results

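* Compare FE, RE, and HDFE side by side
* Note: reghdfe stores its within R-squared as e(r2_within), not e(r2_w),
* so the R-sq (within) row may be blank in the HDFE column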
esttab FE RE HDFE using "tables/panel_results.tex", ///
    star(* 0.10 ** 0.05 *** 0.01) ///
    se(3) b(3) ///
    scalars("r2_w R-sq (within)" "N_g Groups") ///
    mtitles("FE" "RE" "HDFE") ///
    label replace booktabs

Exercise 5.1

Load the nlswork dataset (webuse nlswork, clear). Declare the panel with xtset idcode year. Estimate FE and RE models of ln_wage on age, tenure, and hours. Run the Hausman test. Which model is preferred?

Exercise 5.2

Using the same data, estimate the wage equation with reghdfe absorbing both idcode and year fixed effects. Cluster standard errors at the individual level. Compare the coefficients with those from xtreg, fe.

Exercise 5.3

Implement the Mundlak (Correlated Random Effects) approach on the nlswork data. Compute group means for age, tenure, and hours. Include race as a time-invariant regressor. Run the augmented RE model and test the joint significance of the group means. Compare the time-varying coefficients to those from FE. Are they numerically identical (or very close)?

Exercise 5.4

Using webuse nlswork, estimate a dynamic panel model of ln_wage with its first lag as a regressor, plus age and tenure as controls. Compare: (a) FE with lagged DV, (b) Arellano-Bond GMM via xtabond. Report the AR(2) test p-value from the GMM estimation. Does the coefficient on the lagged DV differ meaningfully between the two approaches?

External Resources

Stata Manual: xtreg
UCLA OARC: Panel Data Using Stata
reghdfe Documentation (Correia)

Key Takeaways

Always use xtset before panel estimation and inspect balance with xtdescribe.
Understand the within-between decomposition: FE uses only within-unit variation, so time-invariant variables are not estimable.
FE is the safe default for causal inference; RE is consistent only if the unit effect is uncorrelated with the regressors.
Use the Mundlak (CRE) approach to formally test FE vs. RE and to estimate time-invariant coefficients.
Use reghdfe for models with multiple high-dimensional fixed effects.
Always interpret interactions in panel models with margins and marginsplot.
For dynamic panels (lagged DV), use Arellano-Bond GMM to avoid Nickell bias; always report AR(2) and Hansen tests.
Cluster standard errors at the unit level to account for within-unit serial correlation.
