Chapter 1: Getting Started

Interface, Do-Files, and Log Files

1.1 The Stata Interface

When you open Stata, you see five windows. The Command window is where you type individual commands. The Results window shows output. The Review window keeps a history of commands you have run. The Variables window lists variables in the current dataset, and the Properties window shows metadata for selected variables.

Tip: Interactive vs. Scripted Work
Use the Command window for quick exploration, but always save your final analysis in a do-file for reproducibility.

1.2 Essential Settings

Before doing anything, configure a few global settings that prevent common frustrations.

* --- Essential startup settings ---
clear all
set more off              // prevent "-- more --" pauses
set linesize 120          // wider output for log files
set maxvar 10000          // allow more variables (SE/MP)
set matsize 5000          // larger matrix for many regressors (SE/MP; IC/BE caps at 800)
set seed 12345            // reproducible random numbers

Tip: Setting the Seed for Reproducibility
Any command that involves random numbers (e.g., sample, bootstrap, simulate, or runiform()) depends on the random-number seed. Setting it explicitly ensures that you and your co-authors get identical results. Choose any integer; what matters is consistency across runs.
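
A quick check in the Command window confirms the behavior: re-setting the same seed reproduces the same draw, while a different seed starts a different (but equally reproducible) stream.

* Same seed, same draw: both display commands print the same number
set seed 12345
display runiform()
set seed 12345
display runiform()

* A different seed yields a different, but still reproducible, stream
set seed 54321
display runiform()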

1.2a Memory and Matrix Settings

Stata allocates memory for data and matrices. If you work with large datasets (millions of observations) or models with many regressors, you may need to adjust these limits.

* --- Memory and matrix settings ---
set matsize 11000         // max for Stata/IC is 800; SE/MP goes to 11000
set maxvar 32767          // maximum for SE; MP allows up to 120,000
set max_memory 16g        // cap memory usage at 16 GB
set segmentsize 64m       // segment size for large datasets

* Check current memory usage
memory

* Timer for benchmarking code blocks
timer clear
timer on 1
* ... your computationally expensive code ...
timer off 1
timer list 1             // reports elapsed seconds

Stata Flavors and Limits
Stata 17 renamed IC to BE (Basic), so current releases come in three editions: BE, SE, and MP. The maximum matsize in IC/BE is 800, while SE/MP allows up to 11,000. If you encounter a "matsize too small" error, check which edition you are running with about. For large panel datasets with many fixed effects, SE or MP is strongly recommended.

1.3 Writing Do-Files

A do-file is a plain text script that contains Stata commands. It is the backbone of reproducible research. Every analysis you submit with a paper should be fully replicable from a do-file.

* ============================================
* Project: Panel analysis of dialysis quality
* Author:  Your Name
* Date:    2026-03-16
* ============================================

version 17               // lock Stata version for reproducibility
clear all
set more off

* Set working directory
cd "~/Documents/my_project/"

* Load data
use "data/analysis_panel.dta", clear

* Descriptive statistics
summarize qip rn_staffing sdi, detail

* Main regression
regress qip rn_staffing sdi i.year, vce(cluster facility_id)

Version Control
The version 17 command at the top of your do-file ensures that even if you later upgrade Stata, the commands are interpreted using version 17 syntax. This prevents subtle breaking changes.

1.4 Log Files

A log file captures all commands and output. Journals increasingly require replication logs, and your future self will thank you for keeping them.

* Start a log (text format for portability)
log using "logs/analysis_main.log", text replace

* ... all your analysis commands here ...

* Close the log
log close

Use replace to overwrite an existing log when you re-run the do-file. Use append if you want to add to an existing log. The text option produces a plain-text file; omit it for Stata's native SMCL format.
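
For example, you can keep the richer SMCL log during the session and convert it to plain text afterward with Stata's translate command (file paths here are illustrative):

* SMCL log (Stata's default rich format)
log using "logs/analysis_main.smcl", replace

* ... analysis commands ...

log close

* Convert the SMCL log to plain text for sharing
translate "logs/analysis_main.smcl" "logs/analysis_main.log", replace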

1.5 Macros: Locals and Globals

Macros are Stata's variables for storing text, numbers, or lists. They are essential for writing flexible, maintainable do-files. Local macros exist only within the do-file or program that defines them. Global macros persist for the entire Stata session. Prefer locals unless you genuinely need session-wide persistence.

* --- Local macros ---
local depvar   "qip"
local controls "rn_pp sdi facility_size chain_flag"
local fe_vars  "i.year i.state"
local cluster  "facility_id"

* Use locals with backtick-single-quote syntax
regress `depvar' `controls' `fe_vars', vce(cluster `cluster')

* Store numeric results
local n_obs = _N
quietly summarize price
local mean_price = r(mean)    // r(mean) comes from the preceding summarize
display "Sample size: `n_obs'"

* --- Global macros ---
global datadir  "~/Documents/my_project/data"
global tabledir "~/Documents/my_project/tables"

* Use globals with a dollar sign
use "$datadir/analysis_panel.dta", clear
esttab using "$tabledir/results.tex", replace

Tip: Locals vs. Globals
Use locals for everything inside a single do-file. Locals vanish when the do-file ends, so they cannot accidentally carry stale values between runs. Reserve globals for directory paths that multiple do-files share (defined in a master do-file), or for interactive exploration at the command line.
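
A minimal master do-file along these lines (file names are illustrative) defines the shared path globals once and then runs each stage of the project in order:

* --- master.do: define shared paths, then run each stage ---
clear all
global root   "~/Documents/my_project"
global raw    "$root/data/raw"
global clean  "$root/data/analysis"
global tables "$root/tables"

do "01_clean_data.do"
do "02_descriptives.do"
do "03_main_analysis.do"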

1.6 Loops

Loops let you repeat operations without copying and pasting code. Stata provides two loop constructs: forvalues for numeric sequences and foreach for arbitrary lists.

* --- forvalues: loop over numeric ranges ---
forvalues y = 2015/2019 {
    display "Processing year `y'"
    use "data/year`y'.dta", clear
    * ... analysis for each year ...
    save "data/cleaned_`y'.dta", replace
}

* --- foreach: loop over a list of items ---
foreach var in price mpg weight length {
    summarize `var', detail
    histogram `var', name(h_`var', replace) nodraw
}

* foreach over variables matching a pattern
foreach var of varlist rn_* {
    gen ln_`var' = ln(`var')
    label variable ln_`var' "Log of `var'"
}

* Nested loops for multiple specifications
local outcomes "qip mortality readmission"
local treatments "rn_pp rn_ps"
foreach y of local outcomes {
    foreach x of local treatments {
        regress `y' `x' i.year, vce(cluster facility_id)
        estimates store m_`y'_`x'
    }
}

When Loops Become Fragile
Loops are powerful but can obscure your code if overused. For a paper with three or four specifications, it is often clearer to write each regression explicitly. Use loops when you have many repetitions (e.g., processing 50 state files) or when building robustness tables with systematic variations.
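
Continuing the nested-loop example above, the stored estimates (with the names generated by the loop) can be compared side by side using Stata's built-in estimates table:

* Compare stored specifications side by side
estimates table m_qip_rn_pp m_qip_rn_ps ///
                m_mortality_rn_pp m_mortality_rn_ps, ///
    b(%9.3f) se stats(N r2)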

1.7 Writing Programs

Stata lets you define your own commands with program define. This is useful for encapsulating repeated analysis steps, data-cleaning routines, or custom estimation procedures.

* Define a simple program
capture program drop my_summary
program define my_summary
    syntax varlist [if] [in]
    foreach var of varlist `varlist' {
        quietly summarize `var' `if' `in', detail
        display "`var': Mean = " %9.3f r(mean) ///
                "  SD = " %9.3f r(sd) ///
                "  N = " r(N)
    }
end

* Use the program
sysuse auto, clear
my_summary price mpg weight
my_summary price mpg if foreign == 1

1.8 Data Validation with assert

The assert command checks a condition and stops execution with an error if it fails. Use it liberally to validate assumptions about your data. This is one of the most underused commands in applied work, yet it catches countless errors in data pipelines.

* Verify expected sample size
count
assert r(N) == 27077

* Verify no missing values in key variables
assert !missing(facility_id)
assert !missing(year)

* Verify value ranges
assert year >= 2015 & year <= 2019
assert qip >= 0 & qip <= 100 if !missing(qip)

* Verify uniqueness of panel identifiers
isid facility_id year           // errors if not uniquely identified

* Verify merge results
merge m:1 zcta using "sdi_data.dta"
assert _merge != 2              // no unmatched from using data
drop _merge

Defensive Programming
Treat assert like a contract between you and your data. Every time you make an assumption (the panel is balanced, an ID is unique, a variable is non-negative), write an assert. When the data changes or a co-author updates the cleaning code, the assert will catch silent breakage before it corrupts your results.
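
For instance, the "balanced panel" assumption mentioned above can itself be written as an assert. This sketch assumes five consecutive years per facility, matching the 2015-2019 panel used in this chapter:

* Every facility should appear exactly once per year, 2015-2019
bysort facility_id (year): assert _N == 5
bysort facility_id (year): assert year == 2014 + _n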

1.9 Basic Commands

Here are the commands you will use every single session.

* Display text or expressions
display "Hello, Stata!"
display 2 + 2
display sqrt(144)

* Load a built-in dataset
sysuse auto, clear

* Describe the dataset structure
describe

* Summarize numeric variables
summarize price mpg weight
summarize price, detail       // percentiles, skewness, kurtosis

* List observations
list make price mpg in 1/5

* Count observations
count
count if foreign == 1

1.10 Getting Help

Stata has excellent built-in documentation. Use help followed by any command name to open the reference page. Use search to find commands by keyword.

help regress              // full documentation for regress
search panel data         // find commands related to panel data
findit reghdfe            // search for user-written packages

1.11 Installing User-Written Packages

Many commands used in empirical research (e.g., reghdfe, ivreg2, estout) are contributed by the community. Install them from SSC (the Statistical Software Components archive).

* Install from SSC
ssc install ftools, replace       // dependency of reghdfe
ssc install reghdfe, replace
ssc install estout, replace
ssc install ranktest, replace     // dependency of ivreg2
ssc install ivreg2, replace

* Update all installed packages
adoupdate, update

1.12 A Complete Do-File Template

Putting everything together, here is a template that incorporates all the best practices from this chapter. Use this as the starting point for every new project.

* ============================================
* Project:  [Your Project Title]
* File:     01_clean_data.do
* Author:   [Your Name]
* Created:  2026-03-16
* Modified: 2026-03-16
* Purpose:  Clean raw data and build analysis panel
* Input:    data/raw/facility_raw.csv
* Output:   data/analysis/facility_panel.dta
* ============================================

version 17
clear all
set more off
set linesize 120
set seed 20260316

* --- Directory globals ---
global root    "~/Documents/my_project"
global raw     "$root/data/raw"
global clean   "$root/data/analysis"
global tables  "$root/tables"
global logs    "$root/logs"

* --- Log file ---
log using "$logs/01_clean_data.log", text replace

* --- Main code ---
timer clear
timer on 1

import delimited "$raw/facility_raw.csv", clear varnames(1)

* ... cleaning and transformation code ...

* --- Validation ---
isid facility_id year
assert !missing(facility_id)
assert year >= 2015 & year <= 2019

* --- Save and finish ---
save "$clean/facility_panel.dta", replace

timer off 1
timer list 1
log close

Exercise 1.1

Create a do-file that: (1) sets version to your Stata version, (2) opens a log file, (3) loads the built-in auto dataset, (4) runs describe and summarize, and (5) closes the log. Run the do-file and verify the log was created.

Exercise 1.2

Use help summarize to find the option that reports the 10th and 90th percentiles. Apply it to the price variable in the auto dataset. What is the interquartile range of price?

Exercise 1.3

Write a foreach loop that iterates over the variables price, mpg, weight, and length in the auto dataset. For each variable, display the variable name, its mean, and its standard deviation on a single line. (Hint: use quietly summarize inside the loop and refer to r(mean) and r(sd).)

Exercise 1.4

Define a program called check_panel that takes two arguments (a panel ID variable and a time variable), runs isid to verify uniqueness, then displays the number of unique units and the number of time periods. Test it on webuse nlswork with idcode and year. (Hint: distinct or codebook, compact can count unique values.)
