Import, Clean, and Reshape
Research data arrives in many formats. Stata can read its native .dta files, delimited text (CSV/TSV), and Excel spreadsheets directly.
```stata
* Load a Stata dataset
use "data/facility_panel.dta", clear

* Import a CSV file
import delimited "data/raw_cms_data.csv", clear varnames(1)

* Import an Excel file (specify the sheet; firstrow reads variable names from row 1)
import excel "data/sdi_scores.xlsx", sheet("Sheet1") firstrow clear

* Save the current dataset
save "data/cleaned_data.dta", replace
```
Run describe after importing to verify variable types.
While generate operates observation by observation, egen (extended generate) provides aggregate functions that operate across observations or across variables. It is indispensable for creating group-level statistics, row-wise summaries, and panel identifiers.
```stata
sysuse auto, clear

* --- Group-level statistics ---
egen mean_price  = mean(price),   by(foreign)   // group mean
egen sd_price    = sd(price),     by(foreign)   // group std dev
egen n_obs       = count(price),  by(foreign)   // group count
egen total_price = total(price),  by(foreign)   // group sum
egen min_price   = min(price),    by(foreign)   // group minimum
egen max_price   = max(price),    by(foreign)   // group maximum
egen med_price   = median(price), by(foreign)   // group median

* --- Row-wise operations (across variables, within an observation) ---
egen avg_dims  = rowmean(weight length)               // mean across columns
egen n_nonmiss = rownonmiss(price mpg weight rep78)   // count non-missing

* --- Group identifiers ---
egen group_id = group(foreign rep78)   // sequential group ID
egen tag      = tag(foreign)           // 1 for first obs per group

* --- Standardize a variable (z-score) ---
egen z_price = std(price)   // (x - mean) / sd
```
Use gen for observation-level arithmetic (logs, powers, indicators). Use egen when you need to aggregate across observations (means, counts, ranks by group) or across variables (row means, row totals). A common mistake is trying gen group_mean = mean(x), which does not work; you need egen for that.
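A minimal sketch of the contrast, using the bundled auto data (the bysort prefix is equivalent to egen's by() option):

```stata
sysuse auto, clear

* gen works observation by observation:
gen ln_price = ln(price)

* This would fail -- mean() is not a valid gen expression function:
* gen mean_price = mean(price)        // error: unknown function mean()

* egen aggregates across observations, here within groups:
bysort foreign: egen mean_price = mean(price)
```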
The generate and replace commands create and modify variables. Use gen for new variables and replace to update existing ones.
```stata
* Create new variables
gen ln_price      = ln(price)
gen price_sq      = price^2
gen high_mpg      = (mpg > 25)     // indicator (note: missing mpg would be coded 1)
gen price_per_mpg = price / mpg

* Replace with conditions
replace price  = . if price < 0            // set negatives to missing
replace region = "Unknown" if region == ""

* Recode a variable
recode rep78 (1 2 = 1 "Poor") (3 = 2 "Average") (4 5 = 3 "Good"), gen(rep_cat)
```
Converting between string and numeric types is a frequent task when cleaning administrative data.
```stata
* Convert string to numeric
destring zip_code, replace
destring revenue, replace force   // force: set non-numeric values to missing

* Convert numeric to string
tostring fips_code, replace format(%05.0f)

* Encode string categories to numeric (with labels)
encode state_name, gen(state_num)

* Decode numeric labels back to string
decode state_num, gen(state_str)
```
Duplicate observations are a leading source of errors in empirical work, especially when merging multiple data sources. Always check for duplicates before and after merges.
```stata
* Report duplicates on key variables
duplicates report facility_id year

* List duplicate observations
duplicates list facility_id year

* Tag duplicates (creates a variable with the duplicate count)
duplicates tag facility_id year, gen(dup_flag)
tab dup_flag
list facility_id year if dup_flag > 0

* Drop exact duplicate rows (all variables identical)
duplicates drop

* Drop duplicates on specific keys (keeps first occurrence)
duplicates drop facility_id year, force

* Verify uniqueness
isid facility_id year
```
duplicates drop varlist, force silently deletes observations. Before doing this, always examine the duplicates to understand why they exist. Are they true duplicates (data entry errors) or do they reflect real observations (e.g., two records per facility-year for different reporting periods)? Document your decision in comments.
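A hedged sketch of that inspection workflow, assuming facility_id and year are the intended keys (the collapse line is illustrative, for the case where duplicates turn out to be real sub-records):

```stata
* Tag and examine duplicates before deciding what to do with them
duplicates tag facility_id year, gen(dup_flag)
sort facility_id year
list facility_id year if dup_flag > 0, sepby(facility_id)

* If duplicates are real sub-records (e.g., separate reporting
* periods), aggregate rather than drop -- illustrative only:
* collapse (sum) revenue, by(facility_id year)

* Only after confirming they are true duplicates:
duplicates drop facility_id year, force
isid facility_id year          // verify the keys are now unique
```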
Variable and value labels make output readable and are expected in any shared dataset.
```stata
* Variable labels
label variable qip   "Quality Incentive Program Score"
label variable rn_pp "RN FTE per 100 Patients"

* Value labels
label define yesno 0 "No" 1 "Yes"
label values high_mpg yesno

* Check labels
label list yesno
describe qip rn_pp
```
Run codebook, compact to verify that all variables are labeled and documented.
Panel data often needs to move between wide format (one row per unit, columns for each time period) and long format (one row per unit-period).
```stata
* --- Wide to Long ---
* Suppose variables: id revenue2015 revenue2016 revenue2017
reshape long revenue, i(id) j(year)

* --- Long to Wide ---
reshape wide revenue, i(id) j(year)

* Verify the reshape
describe
list in 1/5
```
Before reshaping: (1) verify that i() uniquely identifies units, (2) ensure j() values are consistent, (3) check for duplicates with duplicates report id year. reshape will exit with an error if these conditions are not met.
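Those three checks can be run explicitly. A sketch, assuming wide data with id and revenue2015-revenue2017 as in the example above:

```stata
* (1) i() must uniquely identify units in the wide data
isid id                          // exits with an error if id is not unique

* (2) j() suffixes must be consistent across the stub variables
describe revenue*                // expect revenue2015 revenue2016 revenue2017

* (3) after reshaping, each unit-period pair should appear once
reshape long revenue, i(id) j(year)
duplicates report id year        // expect zero surplus observations
```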
Combining datasets is central to empirical research. Stata's merge creates a _merge variable that reports match status.
```stata
* 1:1 merge (each key appears once in both files)
use "data/facility_chars.dta", clear
merge 1:1 facility_id using "data/quality_scores.dta"
tab _merge

* m:1 merge (many observations per key in master)
use "data/facility_year.dta", clear
merge m:1 zcta using "data/sdi_by_zcta.dta"

* Keep only matched observations
keep if _merge == 3
drop _merge

* Append datasets (stack vertically)
use "data/year2015.dta", clear
append using "data/year2016.dta" "data/year2017.dta"
```
Several commands help you restructure data in ways that go beyond simple reshaping. These are particularly useful for creating balanced panels, generating simulation data, or building lookup tables.
```stata
* --- fillin: fill in missing combinations to balance a panel ---
fillin facility_id year
* Creates a variable _fillin = 1 for newly created observations
tab _fillin

* --- expand: replicate each observation N times ---
sysuse auto, clear
expand 3     // each observation now appears 3 times
count        // 74 * 3 = 222

* Conditional expand (useful for simulation)
sysuse auto, clear
expand 2 if foreign == 1   // double only foreign cars

* --- cross: Cartesian product of two datasets ---
* Creates every combination of obs from current data and using data
clear
input str10 state
"Alabama"
"Alaska"
end
save "temp_states.dta", replace

clear
input int year
2015
2016
2017
end
cross using "temp_states.dta"
list   // 2 states x 3 years = 6 rows
```
Sometimes you need to modify the data temporarily (e.g., to compute a statistic on a subset) and then return to the original. The preserve/restore pair saves and reloads the current dataset in memory. Temporary files and variables provide scratch space without cluttering your workspace.
```stata
* --- preserve / restore ---
sysuse auto, clear
preserve
keep if foreign == 1
summarize price                  // statistics on foreign cars only
local foreign_mean = r(mean)
restore                          // full dataset is back
display "Foreign mean price: `foreign_mean'"

* --- tempfile: a temporary dataset on disk ---
tempfile subset
preserve
keep if rep78 >= 4               // note: >= also keeps missing rep78
save `subset'
restore

* Later, you can merge or append from `subset'
merge 1:1 make using `subset', nogen keep(master match)

* --- tempvar: a temporary variable (auto-deleted) ---
tempvar centered_price
quietly summarize price
gen `centered_price' = price - r(mean)
regress mpg `centered_price'
* `centered_price' vanishes when the program/do-file ends
```
preserve saves a copy of the entire dataset (to a temporary file), so it is fast for small-to-medium data. For very large datasets (100M+ observations), repeated preserve/restore cycles become expensive; saving your own tempfile once and reloading as needed is more efficient. Never nest preserve calls; Stata allows only one level of preservation at a time.
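For the large-data case, a sketch of the tempfile pattern as a substitute for preserve/restore (assuming the auto data are in memory):

```stata
sysuse auto, clear

* Save the full dataset once to a temporary file
tempfile fulldata
save `fulldata'

* Work destructively on the subset
keep if foreign == 1
summarize price
local foreign_mean = r(mean)

* Reload the full dataset when done
use `fulldata', clear
display "Foreign mean price: `foreign_mean'"
```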
Stata 16 introduced frames, which let you hold multiple datasets in memory simultaneously. This eliminates the need for constant saving, loading, and merging when working with several related datasets.
```stata
* --- Working with frames ---
* Create a new frame and load data into it
frame create sdi_data
frame sdi_data: use "data/sdi_by_zcta.dta", clear

* Check what frames exist
frames dir

* Switch to a frame
frame change sdi_data
describe
frame change default

* Link frames for lookups (like a SQL join)
frlink m:1 zcta, frame(sdi_data)
frget sdi_score, from(sdi_data)

* Drop a frame when done
frame drop sdi_data
```
Frames and preserve/restore solve different problems. Use preserve when you need to temporarily modify and then revert a single dataset. Use frames when you need to work with two or more distinct datasets at the same time, looking up values from one to populate another. Frames are the Stata equivalent of having multiple data tables open in R or Python.
Stata represents numeric missing as . , which compares as greater than any nonmissing number (the extended missing values .a through .z sort even higher). Always account for this in conditional statements.
```stata
* Count missing values per variable
misstable summarize

* Tabulate patterns of missingness
misstable patterns

* Drop observations with missing outcome
drop if missing(qip)

* CAREFUL: this keeps missing values too!
list if price > 10000        // missing > 10000 is TRUE in Stata

* Correct way
list if price > 10000 & !missing(price)
```
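Administrative extracts often encode missingness with sentinel codes such as -99 or -88. mvdecode maps such codes to Stata missing values; the variable names below are hypothetical:

```stata
* Map sentinel codes to (extended) missing values -- names are illustrative
mvdecode revenue beds, mv(-99=.a \ -88=.b)   // .a/.b preserve the reason for missingness
misstable summarize revenue beds             // confirm the conversion
```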
Load sysuse auto. Create a variable weight_kg that converts weight from pounds to kilograms (divide by 2.205). Label it "Weight (kg)". Save the modified dataset as auto_metric.dta.
Using the auto dataset, generate a new string variable origin that equals "Domestic" when foreign == 0 and "Foreign" when foreign == 1. Then encode it back to a numeric variable origin_num and verify the labels are correct.
Load sysuse auto. Use egen to create: (a) mean_price_group, the mean price by foreign status; (b) z_mpg, the standardized (z-score) version of mpg; (c) n_missing, a row-wise count of how many variables among price, mpg, rep78, weight are missing for each observation. How many observations have at least one missing value?
Create a do-file that demonstrates preserve/restore. Start with sysuse auto. Inside a preserve block, keep only cars with price > 10000, compute the mean mpg for this subset, and store it in a local macro. After restore, display both the subset mean and the overall mean of mpg. Verify that the full dataset has been restored by checking the observation count.
- Run describe after importing to check variable types.
- Use egen for group-level statistics (mean, count, total by group) and row-wise operations.
- Check for duplicates with duplicates report.
- Use reshape long for panel analysis and reshape wide for summary tables.
- Inspect _merge after every merge to catch unexpected non-matches.
- Use preserve/restore for temporary data modifications and frames (Stata 16+) for multi-dataset workflows.