Chapter 7: Limited Dependent Variables

Logit, Probit, Tobit, Count Models, Selection Models, and Multinomial Models

7.1 Binary Outcome Models

When the dependent variable is binary (0/1), linear probability models (LPM) are simple but can produce predictions outside [0,1]. Logit and probit address this by using nonlinear link functions.

webuse lbw, clear
describe

* Linear Probability Model (for comparison)
regress low age lwt i.race smoke, vce(robust)

* Logit model
logit low age lwt i.race smoke, vce(robust)

* Probit model
probit low age lwt i.race smoke, vce(robust)
Logit vs. Probit
Logit uses the logistic CDF; probit uses the standard normal CDF. In practice, they produce very similar results. Logit coefficients can be exponentiated to get odds ratios, which some audiences find more intuitive. Use whichever is conventional in your field.
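To see how close the two links are in practice, compare average marginal effects from both fits (a quick sketch reusing the lbw data loaded above):

* Logit and probit AMEs are typically nearly identical
quietly logit low age lwt i.race smoke
margins, dydx(smoke)
quietly probit low age lwt i.race smoke
margins, dydx(smoke)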
When LPM Is Acceptable
Despite its theoretical drawbacks, the LPM is often preferred in applied economics and causal inference work. Angrist and Pischke (2009) argue that for estimating average partial effects, the LPM with robust SEs performs well. The LPM is particularly useful when: (1) you need to include fixed effects that would create incidental parameters bias in logit/probit; (2) your main interest is in the average marginal effect rather than individual predictions; (3) predicted probabilities cluster well within [0.2, 0.8]. Reviewers at economics journals generally accept LPM, while management and health journals may insist on logit/probit.
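The claim that the LPM approximates average partial effects can be checked directly on the same data (a sketch; the stored-estimate names are arbitrary):

* Compare LPM coefficients with logit average marginal effects
quietly regress low age lwt i.race smoke, vce(robust)
estimates store lpm
quietly logit low age lwt i.race smoke, vce(robust)
quietly margins, dydx(*) post
estimates store logit_ame
estimates table lpm logit_ame, se keep(age lwt smoke)
* The two columns are usually close when predicted probabilities
* stay in the interior of [0,1]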

7.2 Interpreting Coefficients

Raw logit/probit coefficients give the change in log-odds (logit) or the z-score (probit) for a unit change in X. These are difficult to interpret directly. Use marginal effects instead.

* Odds ratios (logit only)
logit low age lwt i.race smoke, or

* Average marginal effects (AME)
logit low age lwt i.race smoke
margins, dydx(*) post

* Marginal effects at the mean (MEM)
logit low age lwt i.race smoke
margins, dydx(*) atmeans
AME vs. MEM: A Deeper Look
Average marginal effects (AME) compute the marginal effect for each observation and then average. Marginal effects at the mean (MEM) evaluate at the sample means of all variables. AME is generally preferred because MEM evaluates at a potentially nonexistent "average" observation. With categorical variables, the MEM evaluates at the mean of the dummy (e.g., a race indicator held at 0.37), which is nonsensical. AME avoids this problem. However, AME depends on the distribution of all covariates in the sample. If you want the effect for a specific subgroup (e.g., smokers age 30), use margins, dydx(*) at(smoke=1 age=30) instead.
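The subgroup syntax mentioned above looks like this in practice (a sketch; smoke=1 and age=30 are illustrative values):

* AME, MEM, and a subgroup-specific effect, side by side
quietly logit low age lwt i.race smoke
margins, dydx(smoke)                     // AME
margins, dydx(smoke) atmeans             // MEM
margins, dydx(smoke) at(smoke=1 age=30)  // effect for smokers aged 30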

7.2.1 Marginal Effects for Interaction Terms

Interpreting interactions in nonlinear models is more complex than in linear models. The cross-partial derivative is not simply the coefficient on the interaction term. Use margins to compute the correct interaction effect.

* Logit with interaction
logit low c.age##i.smoke lwt i.race

* Wrong: just looking at the interaction coefficient
* Right: compute the AME of smoke at different ages
margins, dydx(smoke) at(age=(20(5)40))
marginsplot, title("Marginal Effect of Smoking by Age") ///
    ytitle("dPr(Low Birth Weight)/dSmoke") ///
    yline(0, lcolor(red) lpattern(dash))

* Second differences: the true "interaction effect" in nonlinear models
margins, dydx(smoke) at(age=(20 40)) contrast(atcontrast(r))
Interactions in Nonlinear Models
Ai and Norton (2003) showed that the interaction effect in logit/probit is not the coefficient on the interaction term. In nonlinear models, the magnitude and even the sign of the interaction effect can differ from the interaction coefficient. Always compute interaction effects through margins rather than interpreting interaction term coefficients directly. This remains one of the most common errors in applied health and management research.

7.3 Predicted Probabilities and Margins Plots

You can compute predicted probabilities at specific covariate values to make results more tangible for your audience. Margins plots are the standard visualization for nonlinear model results in published papers.

logit low age lwt i.race smoke

* Predicted probability for each observation
predict phat, pr

* Predicted probability at specific values
margins, at(smoke=(0 1) age=(20 30 40)) vsquish

* Plot predicted probabilities
margins, at(age=(15(5)45)) over(smoke)
marginsplot, title("Predicted Probability of Low Birth Weight") ///
    ytitle("Pr(Low Birth Weight)") xtitle("Mother's Age")
* ─── Publication-Quality Margins Plot ───
logit low c.age##i.race lwt smoke

* Predicted probabilities across age for each race category
margins race, at(age=(18(2)42)) vsquish

* Customized marginsplot
marginsplot, ///
    title("Predicted Probability of Low Birth Weight", size(medium)) ///
    ytitle("Pr(Low Birth Weight)") xtitle("Mother's Age") ///
    legend(order(1 "White" 2 "Black" 3 "Other") rows(1) pos(6)) ///
    plot1opts(lcolor(navy) mcolor(navy)) ///
    plot2opts(lcolor(cranberry) mcolor(cranberry)) ///
    plot3opts(lcolor(forest_green) mcolor(forest_green)) ///
    ci1opts(color(navy%20)) ci2opts(color(cranberry%20)) ci3opts(color(forest_green%20)) ///
    graphregion(color(white)) scheme(s2color)

* Export for manuscript
graph export "margins_plot.png", replace width(1200)

7.4 Tobit: Censored Outcomes

When the dependent variable is censored (e.g., hours worked cannot be negative, expenditure data piles up at zero), the tobit model accounts for this censoring. (Censored observations are observed but piled at the bound; truncated observations would be missing from the sample entirely.)

webuse womenwk, clear

* Tobit with left-censoring at zero
tobit hours age education married children, ll(0)

* Marginal effects on the latent variable y* (the linear index)
margins, dydx(*)

* Marginal effects on the expected observed (censored) outcome, E[max(0, y)]
margins, dydx(*) predict(ystar(0,.))

* Marginal effects on E[y | y > 0] (conditional on being uncensored)
margins, dydx(*) predict(e(0,.))
Three Types of Tobit Marginal Effects
The tobit model supports several marginal-effect scales: (1) the default linear prediction gives the effect on the latent (uncensored) variable y*; (2) predict(ystar(0,.)) gives the effect on the expected observed outcome E[max(0, y)], which folds in the censoring; (3) predict(e(0,.)) gives the effect on E[y | y > 0], the expected value conditional on being uncensored; (4) predict(pr(0,.)) gives the effect on the probability of being uncensored. In most applications, the conditional expectation e(0,.) or the censored expectation ystar(0,.) is the quantity of interest. Reporting the wrong one is a common error.
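The probability-scale effect can be computed the same way (a sketch, continuing from the tobit fit above):

* Marginal effects on Pr(y > 0), the probability of being uncensored
margins, dydx(*) predict(pr(0,.))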

7.5 Count Models: Poisson and Negative Binomial

When the dependent variable is a non-negative count (e.g., number of hospital admissions, number of defects, number of patents), use Poisson or negative binomial regression rather than OLS. These models use a log link function and ensure predicted counts are non-negative.

* ─── Poisson Regression ───
webuse dollhill3, clear

* Basic Poisson
poisson deaths smokes i.agecat, exposure(pyears) irr

* Incidence Rate Ratios (IRR): exponentiated coefficients
* IRR of 1.5 means the rate is 50% higher for a unit increase in X

* Robust SEs (Poisson pseudo-ML): SEs remain valid when the
* mean=variance assumption fails
poisson deaths smokes i.agecat, exposure(pyears) irr vce(robust)

* ─── Negative Binomial (for overdispersion) ───
* When Var(Y) > E(Y), Poisson is too restrictive
nbreg deaths smokes i.agecat, exposure(pyears) irr

* Test for overdispersion (alpha = 0 => Poisson is adequate)
* The LR test at the bottom of nbreg output tests alpha = 0

* ─── Marginal Effects for Count Models ───
poisson deaths smokes i.agecat, exposure(pyears)
margins, dydx(*) predict(n)  // marginal effect on predicted count

7.5.1 Zero-Inflated Models

When the data has excess zeros beyond what Poisson or negative binomial can explain (e.g., many people report zero doctor visits, but among those who visit, the count follows a standard distribution), zero-inflated models are appropriate.

* Zero-inflated Poisson
zip doctor_visits age income chronic, ///
    inflate(age income distance_to_clinic) vuong

* Zero-inflated Negative Binomial
zinb doctor_visits age income chronic, ///
    inflate(age income distance_to_clinic)

* Vuong test: ZIP vs. standard Poisson
* Significant positive z => ZIP preferred over Poisson

* Marginal effects
zip doctor_visits age income chronic, inflate(age income)
margins, dydx(*) predict(n)
margins, dydx(*) predict(pr(0))  // effect on Pr(Y=0)
Choosing Among Count Models
Start with Poisson with robust SEs. If the dispersion parameter from nbreg is significantly different from zero, consider negative binomial. If there are excess zeros driven by a qualitatively different process (e.g., "never users" vs. "potential users who happen to have zero counts"), consider zero-inflated models. Wooldridge (2010) recommends Poisson with robust SEs as a default because it only requires correct specification of the conditional mean (not the full distribution).
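That decision sequence can be sketched as a short script (illustrative, reusing the dollhill3 data from above):

* 1. Poisson with robust SEs as the default
quietly poisson deaths smokes i.agecat, exposure(pyears) vce(robust)
estat ic
* 2. Negative binomial; check the LR test of alpha = 0 in the output
nbreg deaths smokes i.agecat, exposure(pyears)
estat ic
* 3. If excess zeros reflect a distinct process, move to zip/zinb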

7.6 Ordered Logit and Probit

When the dependent variable has ordered categories (e.g., satisfaction: low/medium/high), use ordered logit or probit.

webuse fullauto, clear

* Ordered logit
ologit rep77 foreign length mpg

* Ordered probit
oprobit rep77 foreign length mpg

* Predicted probabilities for each category
margins, predict(outcome(1)) predict(outcome(2)) predict(outcome(3)) ///
    predict(outcome(4)) predict(outcome(5))
Proportional Odds Assumption
Ordered logit (the proportional odds model) assumes that the effect of each predictor is the same across all outcome thresholds: the odds ratio for moving from "poor" to "fair" is the same as for "good" to "excellent." Test this assumption with the Brant test, available in Long and Freese's SPost package (findit spost13_ado), by running brant after ologit. If the assumption is violated, consider generalized ordered logit (gologit2, available from SSC) or multinomial logit instead.
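A minimal sketch of the test (assumes the SPost package is installed):

quietly ologit rep77 foreign length mpg
brant, detail
* A significant chi2 for an individual variable indicates its effect
* varies across thresholds; gologit2 (ssc install gologit2) relaxes
* the assumption for only those variables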

7.7 Multinomial Logit

When the dependent variable is categorical without a natural ordering (e.g., mode of transportation: car, bus, train), use multinomial logit. The model estimates separate coefficients for each category relative to a base category.

webuse sysdsn1, clear

* Multinomial logit
mlogit insure age male nonwhite, baseoutcome(1)

* Relative risk ratios (exponentiated coefficients)
mlogit insure age male nonwhite, baseoutcome(1) rrr

* Average marginal effects
margins, dydx(*) predict(outcome(1))
margins, dydx(*) predict(outcome(2))
margins, dydx(*) predict(outcome(3))

7.7.1 Interpreting Multinomial Logit with Margins

Raw multinomial logit coefficients are relative to a base category and difficult to interpret. Margins provide the change in predicted probability for each category, which is what readers actually want to know.

* ─── Complete Marginal Effects Table ───
mlogit insure age male nonwhite, baseoutcome(1)

* Marginal effects for all outcomes in one table
margins, dydx(*) predict(outcome(1)) post
estimates store me_1
quietly mlogit insure age male nonwhite, baseoutcome(1)
margins, dydx(*) predict(outcome(2)) post
estimates store me_2
quietly mlogit insure age male nonwhite, baseoutcome(1)
margins, dydx(*) predict(outcome(3)) post
estimates store me_3

estimates table me_1 me_2 me_3, se

* Note: Marginal effects across all categories sum to zero
* (a unit increase in X redistributes probability among categories)

* ─── IIA Test (Hausman-McFadden) ───
* Tests whether removing a category changes the remaining coefficients
mlogit insure age male nonwhite, baseoutcome(1)
estimates store full_model
* Re-estimate excluding one category
mlogit insure age male nonwhite if insure != 3, baseoutcome(1)
estimates store restricted
hausman restricted full_model, alleqs constant

7.8 Selection Models (Heckman)

When the outcome variable is observed only for a non-random subsample, standard regression suffers from selection bias. The Heckman selection model (also called the Heckit) corrects this by modeling the selection process jointly with the outcome.

webuse womenwk, clear

* Two-step Heckman (Heckit)
heckman wage education age, ///
    select(married children education age) twostep

* Full MLE estimation (more efficient if correctly specified)
heckman wage education age, ///
    select(married children education age)

* Key output: lambda (inverse Mills ratio)
* If lambda is significant => selection bias present
* rho = correlation between selection and outcome errors
* sigma = SD of the outcome error
* lambda = rho * sigma
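* The identity above can be checked from the stored results
* (a sketch; heckman saves e(rho), e(sigma), and e(lambda)):
display "rho*sigma = " e(rho)*e(sigma)
display "lambda    = " e(lambda)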

* Predicted wages (corrected for selection)
predict wage_hat, ycond     // E[wage | selected]
predict wage_uncon, yexpected // E[wage] (unconditional)
predict psel, psel           // Pr(selected)
predict mills, mills         // inverse Mills ratio
Exclusion Restriction in Heckman Models
For identification beyond functional form, the selection equation should contain at least one variable that affects selection but not the outcome. Without this exclusion restriction, the Heckman model is identified only through the nonlinearity of the inverse Mills ratio, which is nearly linear over much of its range. In the example above, married and children affect labor force participation but are excluded from the wage equation. Top journals require you to justify this exclusion economically.
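One way to see why the exclusion restriction matters is to refit the model with identical regressors in both equations (illustrative; identification then rests only on the nonlinearity of the Mills ratio):

* No excluded instrument: same regressors in both equations
heckman wage education age, select(education age) twostep
* Compare the standard error on lambda with the specification above,
* which excludes married and children from the wage equation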

7.9 Goodness of Fit for Nonlinear Models

* Reload the birth weight data (the previous section used womenwk)
webuse lbw, clear
logit low age lwt i.race smoke

* Classification table
estat classification

* ROC curve and AUC
lroc, title("ROC Curve")

* Goodness-of-fit (Hosmer-Lemeshow)
estat gof, group(10)

* Information criteria for model comparison
estat ic

7.9.1 Comparing Nested and Non-Nested Models

* ─── Pseudo R-squared ───
* McFadden's R2 = 1 - (ll_full / ll_null)
* Reported automatically by logit/probit
* Typical values are lower than OLS R2 (0.2-0.4 is often "good")
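* A by-hand check (a sketch; logit stores the null log likelihood
* in e(ll_0)):
quietly logit low age lwt i.race smoke
display "McFadden R2 = " 1 - e(ll)/e(ll_0)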

* ─── Likelihood Ratio Test (nested models) ───
quietly logit low age lwt smoke
estimates store m_restricted
quietly logit low age lwt i.race smoke
estimates store m_full
lrtest m_restricted m_full

* ─── AIC/BIC for Non-Nested Comparison ───
quietly logit low age lwt i.race smoke
estimates store logit_model
quietly probit low age lwt i.race smoke
estimates store probit_model
estimates table logit_model probit_model, stats(aic bic ll N)

* ─── Percent Correctly Predicted ───
logit low age lwt i.race smoke
predict phat, pr
gen y_hat = (phat >= 0.5) if !missing(phat)  // guard: missing counts as >= 0.5 in Stata
tab low y_hat
* Be careful: with unbalanced data (e.g., 90% zeros),
* always predicting 0 gives 90% accuracy but is useless
Goodness of Fit: What Reviewers Care About
For causal inference papers, reviewers care less about R-squared or classification accuracy and more about identification strategy. For prediction-oriented papers, report AUC (ideally above 0.7), calibration (Hosmer-Lemeshow), and out-of-sample performance. For model selection among competing specifications, AIC/BIC with likelihood ratio tests for nested comparisons is standard. Never report only "percent correctly predicted" without the ROC curve, as it depends on the arbitrary 0.5 threshold.
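A simple out-of-sample check along these lines (a sketch; the 70/30 split and the seed are arbitrary):

* Split-sample AUC
set seed 12345
gen train = runiform() < 0.7
quietly logit low age lwt i.race smoke if train
lroc if !train, nograph
display "Out-of-sample AUC = " r(area)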

Exercise 7.1

Using webuse lbw, estimate a logit model of low on age, lwt, smoke, and i.race. Compute average marginal effects with margins, dydx(*). What is the marginal effect of smoking on the probability of low birth weight?

Exercise 7.2

Using webuse fullauto, estimate an ordered logit of rep77 on foreign, length, and mpg. Use margins to predict the probability of each outcome category for foreign vs. domestic cars. Create a marginsplot.

Exercise 7.3

Using webuse lbw, estimate a logit model of low on c.age##i.smoke, lwt, and i.race. Use margins, dydx(smoke) at(age=(20(5)40)) to compute the marginal effect of smoking at different ages. Create a marginsplot. Does the effect of smoking change with age? Now verify your interpretation by computing the "second difference" using margins, dydx(smoke) at(age=(20 40)) contrast(atcontrast(r)).

Exercise 7.4

Using webuse womenwk, estimate a Heckman selection model of wage on education and age, with the selection equation including married, children, education, and age. Compare the two-step (twostep) and MLE estimates. Is lambda (the inverse Mills ratio) significant? What does the sign of rho tell you about the selection process?
