Interface, Do-Files, and Log Files
When you open Stata, you see five windows. The Command window is where you type individual commands. The Results window shows output. The Review window keeps a history of commands you have run. The Variables window lists variables in the current dataset, and the Properties window shows metadata for selected variables.
Before doing anything, configure a few global settings that prevent common frustrations.
* --- Essential startup settings ---
clear all
set more off          // prevent "-- more --" pauses
set linesize 120      // wider output for log files
set maxvar 10000      // allow more variables (SE/MP)
set matsize 5000      // larger matrix for many regressors
set seed 12345        // reproducible random numbers
Any command that draws random numbers (e.g., sample, bootstrap, simulate, gen x = runiform()) depends on the random-number seed. Setting it explicitly ensures that you and your co-authors get identical results. Choose any integer; what matters is consistency across runs.
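A quick way to see this in action: re-setting the seed reproduces the exact same draw. (The particular value displayed depends on your Stata version's random-number generator, so no specific number is shown here.)

```stata
* Re-setting the seed reproduces identical random draws
set seed 12345
display runiform()    // some value between 0 and 1
set seed 12345
display runiform()    // the same value again
```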
Stata allocates memory for data and matrices. If you work with large datasets (millions of observations) or models with many regressors, you may need to adjust these limits.
* --- Memory and matrix settings ---
set matsize 11000      // max for Stata/IC is 800; SE/MP goes to 11,000
set maxvar 32767       // maximum for SE/MP
set max_memory 16g     // cap memory usage at 16 GB
set segmentsize 64m    // segment size for large datasets

* Check current memory usage
memory

* Timer for benchmarking code blocks
timer clear
timer on 1
* ... your computationally expensive code ...
timer off 1
timer list 1           // reports elapsed seconds
The maximum matsize in IC is 800, while SE/MP allows up to 11,000. If you encounter "matsize too small" errors, check which edition you are running with the about command. For large panel datasets with many fixed effects, SE or MP is strongly recommended.
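You can also query the limits programmatically. A small sketch, assuming a recent Stata release where these values appear among the c() system values (run creturn list to see them all):

```stata
about                     // reports edition (IC/SE/MP) and version
display c(matsize)        // current matsize setting
display c(max_matsize)    // the ceiling for your edition
```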
A do-file is a plain text script that contains Stata commands. It is the backbone of reproducible research. Every analysis you submit with a paper should be fully replicable from a do-file.
* ============================================
* Project: Panel analysis of dialysis quality
* Author:  Your Name
* Date:    2026-03-16
* ============================================
version 17            // lock Stata version for reproducibility
clear all
set more off

* Set working directory
cd "~/Documents/my_project/"

* Load data
use "data/analysis_panel.dta", clear

* Descriptive statistics
summarize qip rn_staffing sdi, detail

* Main regression
regress qip rn_staffing sdi i.year, vce(cluster facility_id)
The version 17 command at the top of your do-file ensures that even if you later upgrade Stata, the commands are interpreted using version 17 syntax. This prevents subtle breaking changes.
A log file captures all commands and output. Journals increasingly require replication logs, and your future self will thank you for keeping them.
* Start a log (text format for portability)
log using "logs/analysis_main.log", text replace

* ... all your analysis commands here ...

* Close the log
log close
Use replace to overwrite an existing log when you re-run the do-file. Use append if you want to add to an existing log. The text option produces a plain-text file; omit it for Stata's native SMCL format.
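If you do keep logs in SMCL and later need plain text, the translate command converts between the two formats. A sketch (file paths are illustrative):

```stata
* Log in SMCL (the default), then convert to plain text afterward
log using "logs/analysis_main.smcl", replace
* ... analysis ...
log close
translate "logs/analysis_main.smcl" "logs/analysis_main.log", replace
```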
Macros are Stata's variables for storing text, numbers, or lists. They are essential for writing flexible, maintainable do-files. Local macros exist only within the do-file or program that defines them. Global macros persist for the entire Stata session. Prefer locals unless you genuinely need session-wide persistence.
* --- Local macros ---
local depvar "qip"
local controls "rn_pp sdi facility_size chain_flag"
local fe_vars "i.year i.state"
local cluster "facility_id"

* Use locals with backtick-single-quote syntax
regress `depvar' `controls' `fe_vars', vce(cluster `cluster')

* Store numeric results
local n_obs = _N
quietly summarize `depvar'      // populate r(mean) before storing it
local mean_depvar = r(mean)
display "Sample size: `n_obs'"

* --- Global macros ---
global datadir "~/Documents/my_project/data"
global tabledir "~/Documents/my_project/tables"

* Use globals with a dollar sign
use "$datadir/analysis_panel.dta", clear
esttab using "$tabledir/results.tex", replace
Loops let you repeat operations without copying and pasting code. Stata provides two loop constructs: forvalues for numeric sequences and foreach for arbitrary lists.
* --- forvalues: loop over numeric ranges ---
forvalues y = 2015/2019 {
    display "Processing year `y'"
    use "data/year`y'.dta", clear
    * ... analysis for each year ...
    save "data/cleaned_`y'.dta", replace
}

* --- foreach: loop over a list of items ---
foreach var in price mpg weight length {
    summarize `var', detail
    histogram `var', name(h_`var', replace) nodraw
}

* foreach over variables matching a pattern
foreach var of varlist rn_* {
    gen ln_`var' = ln(`var')
    label variable ln_`var' "Log of `var'"
}

* Nested loops for multiple specifications
local outcomes "qip mortality readmission"
local treatments "rn_pp rn_ps"
foreach y of local outcomes {
    foreach x of local treatments {
        regress `y' `x' i.year, vce(cluster facility_id)
        estimates store m_`y'_`x'
    }
}
Stata lets you define your own commands with program define. This is useful for encapsulating repeated analysis steps, data-cleaning routines, or custom estimation procedures.
* Define a simple program
capture program drop my_summary
program define my_summary
    syntax varlist [if] [in]
    foreach var of varlist `varlist' {
        quietly summarize `var' `if' `in', detail
        display "`var': Mean = " %9.3f r(mean) ///
                "  SD = " %9.3f r(sd) ///
                "  N = " r(N)
    }
end

* Use the program
sysuse auto, clear
my_summary price mpg weight
my_summary price mpg if foreign == 1
The assert command checks a condition and stops execution with an error if it fails. Use it liberally to validate assumptions about your data. This is one of the most underused commands in applied work, yet it catches countless errors in data pipelines.
* Verify expected sample size
count
assert r(N) == 27077

* Verify no missing values in key variables
assert !missing(facility_id)
assert !missing(year)

* Verify value ranges
assert year >= 2015 & year <= 2019
assert qip >= 0 & qip <= 100 if !missing(qip)

* Verify uniqueness of panel identifiers
isid facility_id year    // errors if not uniquely identified

* Verify merge results
merge m:1 zcta using "sdi_data.dta"
assert _merge != 2       // no unmatched from using data
drop _merge
Think of assert as a contract between you and your data. Every time you make an assumption (the panel is balanced, an ID is unique, a variable is non-negative), write an assert. When the data changes or a co-author updates the cleaning code, the assert will catch silent breakage before it corrupts your results.
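For instance, the balanced-panel assumption can itself be written as an assert. A sketch using the chapter's running facility_id/year panel (the temporary variable name is arbitrary):

```stata
* Assert the panel is balanced: every facility has the same number of years
bysort facility_id: gen n_obs_per_unit = _N
quietly summarize n_obs_per_unit
assert r(min) == r(max)    // fails loudly if any facility is short a year
drop n_obs_per_unit
```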
Here are the commands you will use every single session.
* Display text or expressions
display "Hello, Stata!"
display 2 + 2
display sqrt(144)

* Load a built-in dataset
sysuse auto, clear

* Describe the dataset structure
describe

* Summarize numeric variables
summarize price mpg weight
summarize price, detail    // percentiles, skewness, kurtosis

* List observations
list make price mpg in 1/5

* Count observations
count
count if foreign == 1
Stata has excellent built-in documentation. Use help followed by any command name to open the reference page. Use search to find commands by keyword.
help regress       // full documentation for regress
search panel data  // find commands related to panel data
findit reghdfe     // search for user-written packages
Many commands used in empirical research (e.g., reghdfe, ivreg2, estout) are contributed by the community. Install them from SSC (the Statistical Software Components archive).
* Install from SSC
ssc install reghdfe, replace
ssc install estout, replace
ssc install ivreg2, replace

* Update all installed packages
adoupdate, update
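To confirm that an installation succeeded, or to see which copy of a command Stata will actually run, which and ado dir are handy:

```stata
which reghdfe    // shows the path of the installed ado-file
ado dir          // lists user-written packages you have installed
```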
Putting everything together, here is a template that incorporates all the best practices from this chapter. Use this as the starting point for every new project.
* ============================================
* Project:  [Your Project Title]
* File:     01_clean_data.do
* Author:   [Your Name]
* Created:  2026-03-16
* Modified: 2026-03-16
* Purpose:  Clean raw data and build analysis panel
* Input:    data/raw/facility_raw.csv
* Output:   data/analysis/facility_panel.dta
* ============================================
version 17
clear all
set more off
set linesize 120
set seed 20260316

* --- Directory globals ---
global root   "~/Documents/my_project"
global raw    "$root/data/raw"
global clean  "$root/data/analysis"
global tables "$root/tables"
global logs   "$root/logs"

* --- Log file ---
log using "$logs/01_clean_data.log", text replace

* --- Main code ---
timer clear
timer on 1

import delimited "$raw/facility_raw.csv", clear varnames(1)

* ... cleaning and transformation code ...

* --- Validation ---
isid facility_id year
assert !missing(facility_id)
assert year >= 2015 & year <= 2019

* --- Save and finish ---
save "$clean/facility_panel.dta", replace

timer off 1
timer list 1
log close
Create a do-file that: (1) sets version to your Stata version, (2) opens a log file, (3) loads the built-in auto dataset, (4) runs describe and summarize, and (5) closes the log. Run the do-file and verify the log was created.
Use help summarize to find the option that reports the 10th and 90th percentiles. Apply it to the price variable in the auto dataset. What is the interquartile range of price?
Write a foreach loop that iterates over the variables price, mpg, weight, and length in the auto dataset. For each variable, display the variable name, its mean, and its standard deviation on a single line. (Hint: use quietly summarize inside the loop and refer to r(mean) and r(sd).)
Define a program called check_panel that takes two arguments (a panel ID variable and a time variable), runs isid to verify uniqueness, then displays the number of unique units and the number of time periods. Test it on webuse nlswork with idcode and year. (Hint: distinct or codebook, compact can count unique values.)
- Start every do-file with version, clear all, set more off, and set seed.
- Use forvalues and foreach loops for repetitive operations.
- Use assert and isid to validate data assumptions throughout your pipeline.