Summary Tables and Publication-Quality Graphs
The first table in nearly every empirical paper is a descriptive statistics table. Stata offers several commands for producing these summaries.
sysuse auto, clear

* Basic summary
summarize price mpg weight length

* Detailed summary (percentiles, skewness, kurtosis)
summarize price, detail

* Flexible tabstat: choose your own statistics
tabstat price mpg weight, stats(n mean sd min p25 p50 p75 max) columns(statistics)

* Summary by group
tabstat price mpg weight, by(foreign) stats(n mean sd) nototal
For categorical variables, use tabulate. In a two-way table, the chi2 option reports Pearson's chi-squared test of independence between the row and column variables, and the row option adds row percentages.
* One-way frequency table
tabulate rep78

* Two-way table with chi-squared test
tabulate foreign rep78, chi2 row

* Tabulate with summary statistics
tabulate foreign, summarize(price)
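The test results can also be read back from stored results, which helps when they need to be quoted in text or checked programmatically. The following is a minimal sketch, assuming the auto dataset; the Fisher's exact test at the end is an addition not covered above, shown because several cells in this table are sparse.

```stata
sysuse auto, clear

* Two-way table; observations with missing rep78 are dropped by default
tabulate foreign rep78, chi2

* Stored results from the tabulate call above
display "N            = " r(N)
display "Pearson chi2 = " %6.2f r(chi2)
display "p-value      = " %6.4f r(p)

* The chi-squared approximation is unreliable with small expected counts;
* Fisher's exact test is one alternative for sparse tables
tabulate foreign rep78, exact
```

Note that r(N) reflects only the observations with non-missing values on both variables, so it can be smaller than the dataset size.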
Pairwise correlations provide a quick check for multicollinearity and help motivate your modeling decisions. Report these in your paper alongside descriptive statistics.
* Pearson correlation matrix
correlate price mpg weight length

* Pairwise (uses all available obs per pair)
pwcorr price mpg weight length, star(0.05) sig

* Spearman rank correlations (robust to outliers)
spearman price mpg weight
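The difference between correlate and pwcorr only shows up when some variables have missing values. As a small illustration (a sketch using rep78, which has missing values in the auto dataset), compare the sample sizes each command uses:

```stata
sysuse auto, clear

* correlate drops an observation if ANY listed variable is missing (listwise deletion)
correlate price mpg rep78
display "Listwise N = " r(N)

* pwcorr uses every observation available for each pair;
* the obs option prints the pair-specific sample size under each coefficient
pwcorr price mpg rep78, obs
```

If the pair-specific sample sizes differ much, it is worth reporting which deletion rule the correlation table uses.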
The estpost tabstat command gives more flexibility than estpost summarize for constructing descriptive statistics tables. It supports arbitrary statistics and grouping in a single call.
* estpost tabstat for flexible summary statistics
estpost tabstat price mpg weight length, ///
    stats(n mean sd p25 p50 p75) columns(statistics)
esttab using "tables/desc_stats_v2.tex", ///
    cells("count(fmt(0)) mean(fmt(2)) sd(fmt(2)) p25(fmt(1)) p50(fmt(1)) p75(fmt(1))") ///
    noobs label replace booktabs ///
    title("Summary Statistics")

* estpost tabstat by group (separate panels)
* columns(statistics) stores one e() matrix per statistic, which main() and aux() expect
estpost tabstat price mpg weight, by(foreign) ///
    stats(n mean sd) columns(statistics) nototal
esttab using "tables/desc_by_origin.tex", ///
    main(mean %9.2f) aux(sd %9.2f) ///
    nostar unstack noobs label replace booktabs
estpost summarize stores results in e() matrices named after each statistic (mean, sd, etc.) and works well for simple tables. estpost tabstat is more flexible: it supports percentiles, custom statistic lists, and the by() option for grouped tables. For a publication-quality Table 1, estpost tabstat is usually the better choice.
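To see which names are available for the cells() option of esttab, it can help to inspect what estpost actually posted. A minimal sketch, assuming the estout package is installed from SSC:

```stata
sysuse auto, clear

* Post summary statistics to e() (requires: ssc install estout)
estpost summarize price mpg weight

* List everything stored in e(); each statistic is a separate matrix
ereturn list

* Each matrix is a row vector with one column per variable
matrix list e(mean)
matrix list e(sd)
```

The matrix names shown by ereturn list are the names to use inside cells().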
The estout / esttab package lets you export summary statistics and regression tables to LaTeX, CSV, or RTF.
* Install if needed: ssc install estout, replace

* Export summary statistics to LaTeX
estpost summarize price mpg weight length, detail
esttab using "tables/desc_stats.tex", ///
    cells("count mean(fmt(2)) sd(fmt(2)) min max") ///
    noobs label replace booktabs
Histograms reveal the distributional shape of a variable, which informs decisions about transformations (e.g., logging a right-skewed variable).
* Basic histogram
histogram price, frequency ///
    title("Distribution of Price") ///
    xtitle("Price (USD)") ytitle("Frequency") ///
    color(navy%60) lcolor(white)

* Histogram with normal curve overlay
histogram mpg, normal ///
    title("Miles per Gallon") ///
    note("Source: auto.dta")
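To illustrate the point about transformations, the sketch below logs price and compares the two distributions side by side. The variable name ln_price and the graph names h_raw and h_log are illustrative choices, not part of the examples above.

```stata
sysuse auto, clear

* Log-transform the right-skewed variable (ln_price is an illustrative name)
generate ln_price = ln(price)
label variable ln_price "Log price"

* Compare skewness before and after the transformation
quietly summarize price, detail
display "Skewness of price:     " %5.2f r(skewness)
quietly summarize ln_price, detail
display "Skewness of log price: " %5.2f r(skewness)

* Side-by-side histograms of the raw and logged variable
histogram price, name(h_raw, replace) title("Price") nodraw
histogram ln_price, name(h_log, replace) title("Log Price") nodraw
graph combine h_raw h_log, rows(1)
```

Whether the logged variable is the right choice still depends on the model; the histograms only show how the shape changes.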
Scatter plots are essential for visualizing bivariate relationships and checking for nonlinearity before running regressions.
* Basic scatter with linear fit
twoway (scatter mpg weight, mcolor(navy%40) msize(small)) ///
    (lfit mpg weight, lcolor(cranberry) lwidth(medium)), ///
    title("MPG vs. Weight") ///
    xtitle("Weight (lbs)") ytitle("Miles per Gallon") ///
    legend(order(1 "Observations" 2 "Linear fit"))

* Scatter with lowess smoother
twoway (scatter mpg weight, mcolor(navy%30)) ///
    (lowess mpg weight, lcolor(orange) lwidth(thick)), ///
    title("MPG vs. Weight (Nonparametric)")
Use graph combine to place multiple graphs in a single figure, and graph export to save in publication formats.
* Create and name individual graphs
histogram price, name(g1, replace) title("Price") nodraw
histogram mpg, name(g2, replace) title("MPG") nodraw

* Combine into one figure
graph combine g1 g2, rows(1) ///
    title("Variable Distributions") ///
    note("Source: 1978 Automobile Data")

* Export to PDF or PNG
graph export "figures/distributions.pdf", replace
graph export "figures/distributions.png", replace width(1200)
Use set scheme s2mono or set scheme plotplain (install via ssc install blindschemes) for a clean white background.
* Box plot by group
graph box price, over(foreign) ///
    title("Price by Origin") ///
    ytitle("Price (USD)")

* Bar chart of means
graph bar (mean) price mpg, over(foreign) ///
    title("Average Price and MPG") ///
    legend(order(1 "Price" 2 "MPG")) ///
    blabel(bar, format(%9.1f))
Beyond basic scatters with linear fits, you can overlay quadratic fits, confidence intervals, connected lines, and area fills. These are essential for visualizing nonlinear relationships and time trends in research papers.
* Scatter with quadratic fit line
twoway (scatter price mpg, mcolor(navy%30) msize(small)) ///
    (qfit price mpg, lcolor(cranberry) lwidth(medthick)), ///
    title("Price vs. MPG with Quadratic Fit") ///
    xtitle("Miles per Gallon") ytitle("Price (USD)") ///
    legend(order(1 "Observations" 2 "Quadratic fit"))

* Scatter with confidence interval band
* lfitci draws two plots (1 = CI band, 2 = fitted line), so the scatter is plot 3
twoway (lfitci price mpg, fcolor(navy%15) alcolor(navy%30)) ///
    (scatter price mpg, mcolor(navy%50) msize(vsmall)), ///
    title("Price vs. MPG with 95% CI") ///
    legend(order(1 "95% CI" 2 "Linear fit" 3 "Observations"))

* Connected line plot (for time series / trends)
* Note: collapse replaces the data in memory with group means
collapse (mean) mean_price = price, by(rep78)
drop if missing(rep78)
twoway connected mean_price rep78, ///
    mcolor(navy) lcolor(navy) msymbol(circle) ///
    title("Mean Price by Repair Record") ///
    xtitle("Repair Record (1-5)") ytitle("Mean Price (USD)")

* Area plot (useful for cumulative distributions or stacked compositions)
* Reload the data, since collapse overwrote it
sysuse auto, clear
twoway area mpg weight, sort ///
    color(navy%30) lcolor(navy) ///
    title("MPG Area Plot")
Bar charts are the standard way to present categorical comparisons. The over() option lets you group bars by one or more categorical variables.
sysuse auto, clear

* Bar chart with one grouping variable
graph bar (mean) price, over(rep78, label(angle(45))) ///
    title("Mean Price by Repair Record") ///
    ytitle("Mean Price (USD)") ///
    blabel(bar, format(%9.0f))

* Grouped bars: two over() levels
* asyvars colors the bars by the first over() variable, so the legend is shown
graph bar (mean) price, over(foreign) over(rep78) asyvars ///
    title("Mean Price by Origin and Repair") ///
    legend(rows(1)) ///
    blabel(bar, format(%9.0f) size(vsmall))

* Horizontal bar chart
graph hbar (mean) price mpg, over(foreign) ///
    title("Price and MPG by Origin") ///
    legend(order(1 "Mean Price" 2 "Mean MPG"))
A scatter plot matrix shows all pairwise scatter plots for a set of variables. It gives a quick overview of bivariate relationships and potential collinearity, which is useful for the exploratory phase of analysis before running regressions.
* Scatter plot matrix for key variables
* (variable names or labels appear on the diagonal by default)
graph matrix price mpg weight length, ///
    half msize(vsmall) mcolor(navy%40) ///
    title("Pairwise Scatter Plots")

* Hollow markers make overlapping points easier to see
graph matrix price mpg weight, ///
    half msymbol(circle_hollow) mcolor(cranberry)
In Stata 17 and earlier, the default graph scheme (s2color) uses a blue-tinted background that most journals do not accept; Stata 18 changed the default to stcolor, which has a white background. Choosing a suitable scheme is one of the biggest improvements you can make to your graphs with minimal effort.
* Built-in clean schemes
set scheme s2mono          // clean, grayscale
set scheme s1color         // white background, color

* Install popular community schemes
ssc install blindschemes, replace
set scheme plotplain       // minimal white background
set scheme plotplainblind  // colorblind-friendly

* Set a permanent default
set scheme plotplain, permanently

* Or specify per-graph
histogram price, scheme(plotplain) ///
    title("Price Distribution")
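When a do-file is shared with co-authors, changing the scheme globally can alter their other graphs. One possible pattern, sketched here with the built-in s1mono scheme and an illustrative local macro named old_scheme, is to save the current scheme and restore it afterwards:

```stata
sysuse auto, clear

* Remember the scheme that is active now (old_scheme is an illustrative name)
local old_scheme "`c(scheme)'"
display "Current scheme: `old_scheme'"

* Switch to a built-in monochrome scheme with a white background
set scheme s1mono
histogram price, title("Price Distribution")

* Restore the original scheme so later graphs are unaffected
set scheme `old_scheme'
```

Because old_scheme is a local macro, the save and restore lines need to run in the same do-file execution.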
Use plotplainblind or manually choose colorblind-safe palettes. Avoid relying solely on red vs. green to distinguish groups. Many journals now require colorblind accessibility.
For projects that require frequent report updates (e.g., during revision rounds), Stata can generate Word documents and Excel files programmatically. This eliminates copy-paste errors between Stata output and your manuscript.
* --- putdocx: generate a Word document ---
* (the reports/ and tables/ folders must already exist)
putdocx begin
putdocx paragraph, style(Heading1)
putdocx text ("Summary Statistics")

* Add a table from stored results
sysuse auto, clear
tabstat price mpg weight, stats(n mean sd min max) save
putdocx table tbl1 = matrix(r(StatTotal)), rownames colnames

* Add a graph
histogram price, name(hist1, replace) nodraw
graph export "temp_hist.png", name(hist1) replace width(800)
putdocx paragraph
putdocx image "temp_hist.png"
putdocx save "reports/summary_report.docx", replace

* --- putexcel: write results to Excel ---
* open keeps the workbook in memory until putexcel close writes it to disk
putexcel set "tables/results.xlsx", sheet("DescStats") replace open
putexcel A1 = "Variable" B1 = "N" C1 = "Mean" D1 = "SD"
local row = 2
foreach var in price mpg weight {
    quietly summarize `var'
    putexcel A`row' = "`var'" B`row' = (r(N)) ///
        C`row' = (r(mean)) D`row' = (r(sd))
    local ++row
}
putexcel close
For LaTeX manuscripts, export tables with esttab to .tex files and include them with \input{}. Use putdocx when your co-authors or journal require Word format, or for interim reports shared with non-technical collaborators. putexcel is best for sharing raw results with co-authors who want to inspect numbers in a spreadsheet.
Using sysuse auto, create a publication-ready scatter plot of price (y-axis) vs. mpg (x-axis) with separate markers for domestic and foreign cars. Add a quadratic fit line. Export it as a PNG file.
Use tabstat to produce a table with mean, standard deviation, and the 5th and 95th percentiles for price, mpg, and weight, separately for domestic and foreign cars. Export it using esttab.
Using sysuse auto, create a graph matrix of price, mpg, weight, and length using the plotplain scheme (install blindschemes first). Then use graph export to save it as both PDF and PNG. Which pair of variables shows the strongest visual correlation?
Write a do-file that uses putexcel to create an Excel file with two sheets. Sheet 1 should contain summary statistics (N, Mean, SD, Min, Max) for price, mpg, and weight. Sheet 2 should contain the pairwise correlation matrix for the same variables. Use a foreach loop for the summary statistics sheet.
Use estpost tabstat for maximum flexibility in summary statistics tables.
Use graph matrix for a quick overview of all pairwise relationships.
Use a clean scheme (e.g., plotplainblind) for publication-ready output.
Combine panels with graph combine and export at high resolution.
Use putdocx and putexcel to automate report generation and eliminate copy-paste errors.