Chapter 3: Descriptive Statistics & Visualization

Summary Tables and Publication-Quality Graphs

3.1 Summary Statistics

The first table in nearly every empirical paper is a descriptive statistics table. Stata offers several commands for producing these summaries.

sysuse auto, clear

* Basic summary
summarize price mpg weight length

* Detailed summary (percentiles, skewness, kurtosis)
summarize price, detail

* Flexible tabstat: choose your own statistics
tabstat price mpg weight, stats(n mean sd min p25 p50 p75 max) columns(statistics)

* Summary by group
tabstat price mpg weight, by(foreign) stats(n mean sd) nototal
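
These commands also leave their results behind in r(), which is handy when a statistic feeds a later calculation. A minimal sketch, assuming the auto data; the scalar name cv_price is a hypothetical choice for illustration:

```stata
sysuse auto, clear

* summarize stores its results in r(); return list shows them all
quietly summarize price
return list

* Coefficient of variation from the stored mean and SD
* (cv_price is an illustrative name, not part of the chapter)
scalar cv_price = r(sd) / r(mean)
display "CV of price = " %5.3f cv_price
```

Note that r() is overwritten by the next r-class command, so copy anything you need into a scalar or local right away.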

3.2 Frequency Tables and Cross-Tabulations

For categorical variables, use tabulate. Adding the chi2 option to a two-way table reports Pearson's chi-squared test of independence between the row and column variables.

* One-way frequency table
tabulate rep78

* Two-way table with chi-squared test
tabulate foreign rep78, chi2 row

* Tabulate with summary statistics
tabulate foreign, summarize(price)
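
The foreign-by-rep78 table spreads fewer than 74 cars over ten cells, so some expected counts are small and the chi-squared approximation may be unreliable. One possible check, sketched with tabulate's expected and exact options:

```stata
sysuse auto, clear

* Show expected counts under independence below the observed counts
tabulate foreign rep78, chi2 expected

* Fisher's exact test does not rely on the large-sample approximation
tabulate foreign rep78, exact
```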

3.3 Correlation Matrices

Pairwise correlations give a quick first check for multicollinearity and help motivate your modeling decisions, although they can only reveal collinearity between two variables at a time. Report them in your paper alongside the descriptive statistics.

* Pearson correlation matrix (listwise: uses only obs complete on all variables)
correlate price mpg weight length

* Pairwise (uses all available obs per pair)
pwcorr price mpg weight length, star(0.05) sig

* Spearman rank correlations (robust to outliers)
spearman price mpg weight
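
Collinearity that involves three or more variables jointly will not show up in a pairwise matrix. A complementary check, sketched here with an illustrative regression that is not part of the chapter's analysis, is the variance inflation factor:

```stata
sysuse auto, clear

* Fit an illustrative model, then report variance inflation factors
regress price mpg weight length
estat vif
```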

3.4 Using estpost with tabstat

The estpost tabstat command (part of the estout package; install it with ssc install estout) gives more flexibility than estpost summarize for constructing descriptive statistics tables. It supports arbitrary statistics and grouping in a single call.

* estpost tabstat for flexible summary statistics
estpost tabstat price mpg weight length, ///
    stats(n mean sd p25 p50 p75) columns(statistics)
esttab using "tables/desc_stats_v2.tex", ///
    cells("count(fmt(0)) mean(fmt(2)) sd(fmt(2)) p25(fmt(1)) p50(fmt(1)) p75(fmt(1))") ///
    noobs label replace booktabs ///
    title("Summary Statistics")

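* estpost tabstat by group (separate panels)
* columns(statistics) stores results as e(mean), e(sd), ... so main() and aux() can find them
estpost tabstat price mpg weight, by(foreign) ///
    stats(n mean sd) columns(statistics) nototal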
esttab using "tables/desc_by_origin.tex", ///
    main(mean %9.2f) aux(sd %9.2f) ///
    nostar unstack noobs label replace booktabs

estpost summarize vs. estpost tabstat

estpost summarize stores results in e() matrices named after each statistic (mean, sd, etc.) and works well for simple tables. estpost tabstat is more flexible: it lets you choose exactly which statistics to compute, including individual percentiles, and it supports the by() option for grouped tables. For a publication-quality Table 1, estpost tabstat is usually the better choice.
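
To see which names are available for cells(), main(), and aux(), it can help to inspect what estpost left behind. A short sketch, assuming the estout package is installed:

```stata
sysuse auto, clear

* Post summary statistics, one e() matrix per statistic
quietly estpost tabstat price mpg weight, ///
    stats(n mean sd) columns(statistics)

* List everything stored in e(), then look at one matrix
ereturn list
matrix list e(mean)
```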

3.5 Exporting Tables with estout

The estout / esttab package lets you export summary statistics and regression tables to LaTeX, CSV, or RTF.

* Install if needed: ssc install estout, replace

* Export summary statistics to LaTeX
estpost summarize price mpg weight length, detail
esttab using "tables/desc_stats.tex", ///
    cells("count mean(fmt(2)) sd(fmt(2)) min max") ///
    noobs label replace booktabs
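
The same posted results can be written to other formats by changing the file extension and the format option. A sketch, assuming the estout package is installed; the tables folder is created first in case it does not exist yet:

```stata
sysuse auto, clear
capture mkdir tables

estpost summarize price mpg weight length

* CSV for spreadsheets
esttab using "tables/desc_stats.csv", ///
    cells("count mean(fmt(2)) sd(fmt(2)) min max") ///
    noobs label replace csv

* RTF for Word
esttab using "tables/desc_stats.rtf", ///
    cells("count mean(fmt(2)) sd(fmt(2)) min max") ///
    noobs label replace rtf
```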

3.5 Histograms

Histograms reveal the distributional shape of a variable, which informs decisions about transformations (e.g., logging a right-skewed variable).

* Basic histogram
histogram price, frequency ///
    title("Distribution of Price") ///
    xtitle("Price (USD)") ytitle("Frequency") ///
    color(navy%60) lcolor(white)

* Histogram with normal curve overlay
histogram mpg, normal ///
    title("Miles per Gallon") ///
    note("Source: auto.dta")
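
As an illustration of the transformation decision, price in the auto data is right-skewed, and taking logs reduces that skew. A minimal sketch (ln_price is an illustrative variable name):

```stata
sysuse auto, clear

* Compare skewness before and after taking logs
generate ln_price = ln(price)
summarize price ln_price, detail

* Histogram of the logged variable with a normal overlay
histogram ln_price, normal ///
    title("Distribution of Log Price") ///
    xtitle("ln(Price)")
```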

3.6 Scatter Plots

Scatter plots are essential for visualizing bivariate relationships and checking for nonlinearity before running regressions.

* Basic scatter with linear fit
twoway (scatter mpg weight, mcolor(navy%40) msize(small)) ///
       (lfit mpg weight, lcolor(cranberry) lwidth(medium)), ///
    title("MPG vs. Weight") ///
    xtitle("Weight (lbs)") ytitle("Miles per Gallon") ///
    legend(order(1 "Observations" 2 "Linear fit"))

* Scatter with lowess smoother
twoway (scatter mpg weight, mcolor(navy%30)) ///
       (lowess mpg weight, lcolor(orange) lwidth(thick)), ///
    title("MPG vs. Weight (Nonparametric)")

3.7 Combining and Exporting Graphs

Use graph combine to place multiple graphs in a single figure, and graph export to save in publication formats.

* Create and name individual graphs
histogram price, name(g1, replace) title("Price") nodraw
histogram mpg, name(g2, replace) title("MPG") nodraw

* Combine into one figure
graph combine g1 g2, rows(1) ///
    title("Variable Distributions") ///
    note("Source: 1978 Automobile Data")

* Export to PDF or PNG
graph export "figures/distributions.pdf", replace
graph export "figures/distributions.png", replace width(1200)
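
Tip: Graph Schemes

In Stata 17 and earlier, the default graphs have a blue-tinted background that journals dislike (Stata 18 changed the default scheme to stcolor, which has a white background). On older versions, use set scheme s1mono or set scheme plotplain (install via ssc install blindschemes) for a clean white background.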

3.8 Box Plots and Bar Charts

* Box plot by group
graph box price, over(foreign) ///
    title("Price by Origin") ///
    ytitle("Price (USD)")

* Bar chart of means
graph bar (mean) price mpg, over(foreign) ///
    title("Average Price and MPG") ///
    legend(order(1 "Price" 2 "MPG")) ///
    blabel(bar, format(%9.1f))

3.9 Advanced Scatter Plots

Beyond basic scatters with linear fits, you can overlay quadratic fits, confidence intervals, connected lines, and area fills. These are essential for visualizing nonlinear relationships and time trends in research papers.

* Scatter with quadratic fit line
twoway (scatter price mpg, mcolor(navy%30) msize(small)) ///
       (qfit price mpg, lcolor(cranberry) lwidth(medthick)), ///
    title("Price vs. MPG with Quadratic Fit") ///
    xtitle("Miles per Gallon") ytitle("Price (USD)") ///
    legend(order(1 "Observations" 2 "Quadratic fit"))

* Scatter with confidence interval band
* lfitci contributes two legend keys: 1 = CI band, 2 = fitted line
twoway (lfitci price mpg, fcolor(navy%15) alcolor(navy%30)) ///
       (scatter price mpg, mcolor(navy%50) msize(vsmall)), ///
    title("Price vs. MPG with 95% CI") ///
    legend(order(1 "95% CI" 2 "Linear fit" 3 "Observations"))

* Connected line plot (for time series / trends)
* Note: collapse replaces the data in memory with the group means
drop if missing(rep78)
collapse (mean) mean_price = price, by(rep78)
twoway connected mean_price rep78, ///
    mcolor(navy) lcolor(navy) msymbol(circle) ///
    title("Mean Price by Repair Record") ///
    xtitle("Repair Record (1-5)") ytitle("Mean Price (USD)")

* Area plot (useful for cumulative distributions or stacked compositions)
sysuse auto, clear
twoway area mpg weight, sort ///
    color(navy%30) lcolor(navy) ///
    title("MPG Area Plot")
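
collapse replaces the dataset in memory, which is why the area plot reloads the data first. An alternative, sketched here, wraps the collapse in preserve and restore so the original data come back automatically:

```stata
sysuse auto, clear

preserve
    * Drop missing repair records, then reduce to one row per category
    drop if missing(rep78)
    collapse (mean) mean_price = price, by(rep78)
    twoway connected mean_price rep78, ///
        title("Mean Price by Repair Record")
restore

* The full dataset is back in memory
count
```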

3.10 Graph Bar with over() and Subgroups

Bar charts are the standard way to present categorical comparisons. The over() option lets you group bars by one or more categorical variables.

sysuse auto, clear

* Bar chart with one grouping variable
graph bar (mean) price, over(rep78, label(angle(45))) ///
    title("Mean Price by Repair Record") ///
    ytitle("Mean Price (USD)") ///
    blabel(bar, format(%9.0f))

* Grouped bars: two over() levels
* asyvars gives each origin its own color and a legend entry
graph bar (mean) price, over(foreign) over(rep78) asyvars ///
    title("Mean Price by Origin and Repair") ///
    legend(rows(1)) ///
    blabel(bar, format(%9.0f) size(vsmall))

* Horizontal bar chart
graph hbar (mean) price mpg, over(foreign) ///
    title("Price and MPG by Origin") ///
    legend(order(1 "Mean Price" 2 "Mean MPG"))

3.11 Scatter Plot Matrix

A scatter plot matrix shows all pairwise scatter plots for a set of variables. It gives a quick overview of bivariate relationships and potential collinearity, which is useful for the exploratory phase of analysis before running regressions.

* Scatter plot matrix for key variables
graph matrix price mpg weight length, ///
    half msize(vsmall) mcolor(navy%40) ///
    title("Pairwise Scatter Plots")

* Hollow markers reduce overplotting (the diagonal shows variable labels by default)
graph matrix price mpg weight, ///
    half msymbol(circle_hollow) mcolor(cranberry)

3.12 Customizing Graph Schemes

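In Stata 17 and earlier, the default graph scheme (s2color) uses a blue-tinted background that most journals do not accept; Stata 18 changed the default to stcolor, which already has a white background. On older versions, choosing a cleaner scheme is one of the biggest improvements you can make to your graphs with minimal effort.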

* Built-in clean schemes
set scheme s1mono                    // white background, grayscale
set scheme s1color                   // white background, color

* Install popular community schemes
ssc install blindschemes, replace
set scheme plotplain                  // minimal white background
set scheme plotplainblind             // colorblind-friendly

* Set a permanent default
set scheme plotplain, permanently

* Or specify per-graph
histogram price, scheme(plotplain) ///
    title("Price Distribution")

Tip: Colorblind-Friendly Graphs

Approximately 8% of men have color vision deficiency. Use plotplainblind or manually choose colorblind-safe palettes, and avoid relying solely on red vs. green to distinguish groups. Some journals now ask for colorblind-accessible figures.
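
If changing the scheme is not an option, the background can also be set graph by graph. A sketch using the standard graphregion() and plotregion() options; c(scheme) reports the scheme currently in effect:

```stata
sysuse auto, clear

* Show which scheme is active
display "`c(scheme)'"

* Force a white background regardless of the scheme
histogram price, ///
    graphregion(color(white)) plotregion(color(white)) ///
    title("Price Distribution")
```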

3.13 Automated Reports with putdocx and putexcel

For projects that require frequent report updates (e.g., during revision rounds), Stata can generate Word documents and Excel files programmatically. This eliminates copy-paste errors between Stata output and your manuscript.

* --- putdocx: generate a Word document ---
putdocx begin
putdocx paragraph, style(Heading1)
putdocx text ("Summary Statistics")

* Add a table from stored results
sysuse auto, clear
tabstat price mpg weight, stats(n mean sd min max) save
putdocx table tbl1 = matrix(r(StatTotal)), rownames colnames

* Add a graph
histogram price, name(hist1, replace) nodraw
graph export "temp_hist.png", name(hist1) replace width(800)
putdocx paragraph
putdocx image "temp_hist.png"

putdocx save "reports/summary_report.docx", replace

* --- putexcel: write results to Excel ---
putexcel set "tables/results.xlsx", sheet("DescStats") replace
putexcel A1 = "Variable" B1 = "N" C1 = "Mean" D1 = "SD"

local row = 2
foreach var in price mpg weight {
    quietly summarize `var'
    putexcel A`row' = "`var'" B`row' = (r(N)) ///
             C`row' = (r(mean)) D`row' = (r(sd))
    local ++row
}
putexcel close

When to Use putdocx vs. LaTeX

If your manuscript is in LaTeX, export tables with esttab to .tex files and include them with \input{}. Use putdocx when your co-authors or journal require Word format, or for interim reports shared with non-technical collaborators. putexcel is best for sharing raw results with co-authors who want to inspect numbers in a spreadsheet.
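
The same idea extends to numbers quoted in running text. A minimal sketch that writes a sentence with values taken from stored results; the file name inline_example.docx and the local names m and n are illustrative assumptions:

```stata
sysuse auto, clear
capture mkdir reports

* Copy the results into locals before any other command overwrites r()
quietly summarize price
local m = strtrim(string(r(mean), "%9.2f"))
local n = r(N)

capture putdocx clear        // discard any document left open earlier
putdocx begin
putdocx paragraph
putdocx text ("The mean price is `m' USD across `n' cars.")
putdocx save "reports/inline_example.docx", replace
```

Because the sentence is rebuilt from r() each time the do-file runs, the text stays in sync with the data after every revision.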

Exercise 3.1

Using sysuse auto, create a publication-ready scatter plot of price (y-axis) vs. mpg (x-axis) with separate markers for domestic and foreign cars. Add a quadratic fit line. Export it as a PNG file.

Exercise 3.2

Use tabstat to produce a table with mean, standard deviation, and the 5th and 95th percentiles for price, mpg, and weight, separately for domestic and foreign cars. Export it using esttab.

Exercise 3.3

Using sysuse auto, create a graph matrix of price, mpg, weight, and length using the plotplain scheme (install blindschemes first). Then use graph export to save it as both PDF and PNG. Which pair of variables shows the strongest visual correlation?

Exercise 3.4

Write a do-file that uses putexcel to create an Excel file with two sheets. Sheet 1 should contain summary statistics (N, Mean, SD, Min, Max) for price, mpg, and weight. Sheet 2 should contain the pairwise correlation matrix for the same variables. Use a foreach loop for the summary statistics sheet.
