Import, Clean, and Reshape
Research data arrives in many formats. Stata can read its native .dta files, delimited text (CSV/TSV), and Excel spreadsheets directly.
```stata
* Load a Stata dataset
use "data/facility_panel.dta", clear

* Import a CSV file
import delimited "data/raw_cms_data.csv", clear varnames(1)

* Import an Excel file (specify the sheet; firstrow reads variable names from row 1)
import excel "data/sdi_scores.xlsx", sheet("Sheet1") firstrow clear

* Save the current dataset
save "data/cleaned_data.dta", replace
```
Run describe after importing to verify variable types.
While generate operates observation by observation, egen (extended generate) provides aggregate functions that operate across observations or across variables. It is indispensable for creating group-level statistics, row-wise summaries, and panel identifiers.
```stata
sysuse auto, clear

* --- Group-level statistics ---
egen mean_price  = mean(price),   by(foreign)   // group mean
egen sd_price    = sd(price),     by(foreign)   // group std dev
egen n_obs       = count(price),  by(foreign)   // group count
egen total_price = total(price),  by(foreign)   // group sum
egen min_price   = min(price),    by(foreign)   // group minimum
egen max_price   = max(price),    by(foreign)   // group maximum
egen med_price   = median(price), by(foreign)   // group median

* --- Row-wise operations (across variables, within an observation) ---
egen avg_dims  = rowmean(weight length)               // mean across columns
egen n_nonmiss = rownonmiss(price mpg weight rep78)   // count non-missing

* --- Group identifiers ---
egen group_id = group(foreign rep78)   // sequential group ID
egen tag      = tag(foreign)           // 1 for first obs per group

* --- Standardize a variable (z-score) ---
egen z_price = std(price)   // (x - mean) / sd
```
Use gen for observation-level arithmetic (logs, powers, indicators). Use egen when you need to aggregate across observations (means, counts, ranks by group) or across variables (row means, row totals). A common mistake is trying gen group_mean = mean(x), which does not work; you need egen for that.
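A minimal sketch of the contrast, using the bundled auto data (the bysort prefix is equivalent to egen's by() option):

```stata
sysuse auto, clear

* gen works observation by observation:
gen ln_price = ln(price)

* This would fail -- mean() is not a valid gen expression function:
* gen mean_price = mean(price)        // error: unknown function mean()

* egen aggregates across observations, here within groups:
bysort foreign: egen mean_price = mean(price)
```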
The generate and replace commands create and modify variables. Use gen for new variables and replace to update existing ones.
```stata
* Create new variables
gen ln_price      = ln(price)
gen price_sq      = price^2
gen high_mpg      = (mpg > 25)     // indicator (note: missing mpg would be coded 1)
gen price_per_mpg = price / mpg

* Replace with conditions
replace price  = . if price < 0            // set negatives to missing
replace region = "Unknown" if region == ""

* Recode a variable
recode rep78 (1 2 = 1 "Poor") (3 = 2 "Average") (4 5 = 3 "Good"), gen(rep_cat)
```
Converting between string and numeric types is a frequent task when cleaning administrative data.
```stata
* Convert string to numeric
destring zip_code, replace
destring revenue, replace force   // force: set non-numeric values to missing

* Convert numeric to string
tostring fips_code, replace format(%05.0f)

* Encode string categories to numeric (with labels)
encode state_name, gen(state_num)

* Decode numeric labels back to string
decode state_num, gen(state_str)
```
Duplicate observations are a leading source of errors in empirical work, especially when merging multiple data sources. Always check for duplicates before and after merges.
```stata
* Report duplicates on key variables
duplicates report facility_id year

* List duplicate observations
duplicates list facility_id year

* Tag duplicates (creates a variable with the duplicate count)
duplicates tag facility_id year, gen(dup_flag)
tab dup_flag
list facility_id year if dup_flag > 0

* Drop exact duplicate rows (all variables identical)
duplicates drop

* Drop duplicates on specific keys (keeps first occurrence)
duplicates drop facility_id year, force

* Verify uniqueness
isid facility_id year
```
duplicates drop varlist, force silently deletes observations. Before doing this, always examine the duplicates to understand why they exist. Are they true duplicates (data entry errors) or do they reflect real observations (e.g., two records per facility-year for different reporting periods)? Document your decision in comments.
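A hedged sketch of that inspection workflow, assuming facility_id and year are the intended keys (the collapse line is illustrative, for the case where duplicates turn out to be real sub-records):

```stata
* Tag and examine duplicates before deciding what to do with them
duplicates tag facility_id year, gen(dup_flag)
sort facility_id year
list facility_id year if dup_flag > 0, sepby(facility_id)

* If duplicates are real sub-records (e.g., separate reporting
* periods), aggregate rather than drop -- illustrative only:
* collapse (sum) revenue, by(facility_id year)

* Only after confirming they are true duplicates:
duplicates drop facility_id year, force
isid facility_id year          // verify the keys are now unique
```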
Variable and value labels make output readable and are expected in any shared dataset.
```stata
* Variable labels
label variable qip   "Quality Incentive Program Score"
label variable rn_pp "RN FTE per 100 Patients"

* Value labels
label define yesno 0 "No" 1 "Yes"
label values high_mpg yesno

* Check labels
label list yesno
describe qip rn_pp
```
Run codebook, compact to verify that all variables are labeled and documented.
Panel data often needs to move between wide format (one row per unit, columns for each time period) and long format (one row per unit-period).
```stata
* --- Wide to Long ---
* Suppose variables: id revenue2015 revenue2016 revenue2017
reshape long revenue, i(id) j(year)

* --- Long to Wide ---
reshape wide revenue, i(id) j(year)

* Verify the reshape
describe
list in 1/5
```
Before reshaping: (1) verify that i() uniquely identifies units, (2) ensure j() values are consistent, (3) check for duplicates with duplicates report id year. reshape will exit with an error if these conditions are not met.
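Those three checks can be run explicitly. A sketch, assuming wide data with id and revenue2015-revenue2017 as in the example above:

```stata
* (1) i() must uniquely identify units in the wide data
isid id                          // exits with an error if id is not unique

* (2) j() suffixes must be consistent across the stub variables
describe revenue*                // expect revenue2015 revenue2016 revenue2017

* (3) after reshaping, each unit-period pair should appear once
reshape long revenue, i(id) j(year)
duplicates report id year        // expect zero surplus observations
```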
Combining datasets is central to empirical research. Stata's merge creates a _merge variable that reports match status.
```stata
* 1:1 merge (each key appears once in both files)
use "data/facility_chars.dta", clear
merge 1:1 facility_id using "data/quality_scores.dta"
tab _merge

* m:1 merge (many observations per key in master)
use "data/facility_year.dta", clear
merge m:1 zcta using "data/sdi_by_zcta.dta"

* Keep only matched observations
keep if _merge == 3
drop _merge

* Append datasets (stack vertically)
use "data/year2015.dta", clear
append using "data/year2016.dta" "data/year2017.dta"
```
Several commands help you restructure data in ways that go beyond simple reshaping. These are particularly useful for creating balanced panels, generating simulation data, or building lookup tables.
```stata
* --- fillin: fill in missing combinations to balance a panel ---
fillin facility_id year
* Creates a variable _fillin = 1 for newly created observations
tab _fillin

* --- expand: replicate each observation N times ---
sysuse auto, clear
expand 3     // each observation now appears 3 times
count        // 74 * 3 = 222

* Conditional expand (useful for simulation)
sysuse auto, clear
expand 2 if foreign == 1   // double only foreign cars

* --- cross: Cartesian product of two datasets ---
* Creates every combination of obs from current data and using data
clear
input str10 state
"Alabama"
"Alaska"
end
save "temp_states.dta", replace

clear
input int year
2015
2016
2017
end
cross using "temp_states.dta"
list   // 2 states x 3 years = 6 rows
```
Sometimes you need to modify the data temporarily (e.g., to compute a statistic on a subset) and then return to the original. The preserve/restore pair saves and reloads the current dataset in memory. Temporary files and variables provide scratch space without cluttering your workspace.
```stata
* --- preserve / restore ---
sysuse auto, clear
preserve
keep if foreign == 1
summarize price                  // statistics on foreign cars only
local foreign_mean = r(mean)
restore                          // full dataset is back
display "Foreign mean price: `foreign_mean'"

* --- tempfile: a temporary dataset on disk ---
tempfile subset
preserve
keep if rep78 >= 4               // note: >= also keeps missing rep78
save `subset'
restore

* Later, you can merge or append from `subset'
merge 1:1 make using `subset', nogen keep(master match)

* --- tempvar: a temporary variable (auto-deleted) ---
tempvar centered_price
quietly summarize price
gen `centered_price' = price - r(mean)
regress mpg `centered_price'
* `centered_price' vanishes when the program/do-file ends
```
preserve saves a copy of the entire dataset (to a temporary file), so it is fast for small-to-medium data. For very large datasets (100M+ observations), repeated preserve/restore cycles become expensive; saving your own tempfile once and reloading as needed is more efficient. Never nest preserve calls; Stata allows only one level of preservation at a time.
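For the large-data case, a sketch of the tempfile pattern as a substitute for preserve/restore (assuming the auto data are in memory):

```stata
sysuse auto, clear

* Save the full dataset once to a temporary file
tempfile fulldata
save `fulldata'

* Work destructively on the subset
keep if foreign == 1
summarize price
local foreign_mean = r(mean)

* Reload the full dataset when done
use `fulldata', clear
display "Foreign mean price: `foreign_mean'"
```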
Stata 16 introduced frames, which let you hold multiple datasets in memory simultaneously. This eliminates the need for constant saving, loading, and merging when working with several related datasets.
```stata
* --- Working with frames ---
* Create a new frame and load data into it
frame create sdi_data
frame sdi_data: use "data/sdi_by_zcta.dta", clear

* Check what frames exist
frames dir

* Switch to a frame
frame change sdi_data
describe
frame change default

* Link frames for lookups (like a SQL join)
frlink m:1 zcta, frame(sdi_data)
frget sdi_score, from(sdi_data)

* Drop a frame when done
frame drop sdi_data
```
Frames and preserve/restore solve different problems. Use preserve when you need to temporarily modify and then revert a single dataset. Use frames when you need to work with two or more distinct datasets at the same time, looking up values from one to populate another. Frames are the Stata equivalent of having multiple data tables open in R or Python.
Stata represents numeric missing as . , which compares as greater than any nonmissing number (the extended missing values .a through .z sort even higher). Always account for this in conditional statements.
```stata
* Count missing values per variable
misstable summarize

* Tabulate patterns of missingness
misstable patterns

* Drop observations with missing outcome
drop if missing(qip)

* CAREFUL: this keeps missing values too!
list if price > 10000        // missing > 10000 is TRUE in Stata

* Correct way
list if price > 10000 & !missing(price)
```
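Administrative extracts often encode missingness with sentinel codes such as -99 or -88. mvdecode maps such codes to Stata missing values; the variable names below are hypothetical:

```stata
* Map sentinel codes to (extended) missing values -- names are illustrative
mvdecode revenue beds, mv(-99=.a \ -88=.b)   // .a/.b preserve the reason for missingness
misstable summarize revenue beds             // confirm the conversion
```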
Load sysuse auto. Create a variable weight_kg that converts weight from pounds to kilograms (divide by 2.205). Label it "Weight (kg)". Save the modified dataset as auto_metric.dta.
Using the auto dataset, generate a new string variable origin that equals "Domestic" when foreign == 0 and "Foreign" when foreign == 1. Then encode it back to a numeric variable origin_num and verify the labels are correct.
Load sysuse auto. Use egen to create: (a) mean_price_group, the mean price by foreign status; (b) z_mpg, the standardized (z-score) version of mpg; (c) n_missing, a row-wise count of how many variables among price, mpg, rep78, weight are missing for each observation. How many observations have at least one missing value?
Create a do-file that demonstrates preserve/restore. Start with sysuse auto. Inside a preserve block, keep only cars with price > 10000, compute the mean mpg for this subset, and store it in a local macro. After restore, display both the subset mean and the overall mean of mpg. Verify that the full dataset has been restored by checking the observation count.
- Run describe after importing to check variable types.
- Use egen for group-level statistics (mean, count, total by group) and row-wise operations.
- Check for duplicates with duplicates report.
- Use reshape long for panel analysis and reshape wide for summary tables.
- Inspect _merge after every merge to catch unexpected non-matches.
- Use preserve/restore for temporary data modifications and frames (Stata 16+) for multi-dataset workflows.