Chapter 5: Pandas

DataFrames and Data Wrangling

5.1 Why Pandas?

Pandas provides the DataFrame, a two-dimensional labeled data structure that is the standard tool for tabular data manipulation in Python. If you have worked with Excel pivot tables or SQL queries, pandas will feel familiar, while being far more powerful and reproducible.

import pandas as pd
import numpy as np

5.2 Reading Data

# CSV (most common)
df = pd.read_csv("orders.csv")

# Excel
df_xl = pd.read_excel("inventory.xlsx", sheet_name="Sheet1")

# Quick inspection
print(df.shape)        # (rows, cols)
print(df.dtypes)       # column types
df.head()               # first 5 rows
df.describe()           # summary statistics
df.info()               # non-null counts and memory

5.3 Creating DataFrames

# From a dictionary
data = {
    "product":  ["Widget", "Gadget", "Bracket", "Bolt"],
    "price":    [29.99, 14.50, 7.25, 1.10],
    "quantity": [500, 320, 750, 4000],
    "category": ["A", "B", "C", "C"]
}
df = pd.DataFrame(data)
print(df)

5.4 Selecting Data: loc and iloc

# loc: label-based selection
df.loc[0, "product"]          # "Widget"
df.loc[:2, ["product", "price"]]  # first 3 rows, 2 columns

# iloc: integer-position based
df.iloc[0, 1]               # 29.99 (row 0, col 1)
df.iloc[:2, :3]             # first 2 rows, first 3 cols

# Boolean filtering
expensive = df[df["price"] > 10]
print(expensive)

loc vs iloc: Use loc when selecting by column names and row labels. Use iloc when selecting by integer position. When in doubt, prefer loc for clarity.
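The difference is easiest to see on a DataFrame whose index is not the default 0..n-1. A small sketch (the stock frame here is illustrative):

```python
import pandas as pd

# A non-default (string) index makes the loc/iloc distinction visible
stock = pd.DataFrame(
    {"price": [29.99, 14.50, 7.25]},
    index=["widget", "gadget", "bracket"],
)

by_label = stock.loc["gadget", "price"]      # label-based -> 14.50
by_position = stock.iloc[1, 0]               # position-based -> same cell

# Slicing also differs: loc includes both endpoints, iloc excludes the end
label_slice = stock.loc["widget":"gadget"]   # 2 rows (inclusive)
pos_slice = stock.iloc[0:1]                  # 1 row (end-exclusive)
```

Note the slicing asymmetry: it is a frequent source of off-by-one surprises when switching between the two accessors.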

5.5 Adding and Transforming Columns

# New computed column
df["revenue"] = df["price"] * df["quantity"]

# Apply a function
df["price_tier"] = df["price"].apply(
    lambda p: "Premium" if p > 20 else "Standard"
)

print(df)

5.6 GroupBy and Aggregation

# Group by category and compute summary stats
summary = df.groupby("category").agg(
    total_revenue=("revenue", "sum"),
    avg_price=("price", "mean"),
    n_products=("product", "count")
).reset_index()

print(summary)

5.7 Merging DataFrames

# Supplier table
suppliers = pd.DataFrame({
    "product":  ["Widget", "Gadget", "Bracket", "Bolt"],
    "supplier": ["Acme", "Globex", "Acme", "Initech"],
    "lead_days": [7, 14, 5, 3]
})

# Left join on product name (keeps every row of df)
merged = pd.merge(df, suppliers, on="product", how="left")
print(merged[["product", "revenue", "supplier", "lead_days"]])
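When a join silently drops or duplicates rows, two merge() parameters help you diagnose it: indicator=True tags each row with where it came from, and validate raises an error if the relationship is not what you expect. A sketch with a deliberately unmatched row (the orders and catalog frames here are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({"product": ["Widget", "Gadget", "Sprocket"]})
catalog = pd.DataFrame({
    "product": ["Widget", "Gadget"],
    "supplier": ["Acme", "Globex"],
})

# indicator=True adds a _merge column: "both", "left_only", or "right_only"
checked = pd.merge(orders, catalog, on="product", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

# validate raises MergeError if the key relationship is violated;
# here each product appears at most once in catalog, so this passes
pd.merge(orders, catalog, on="product", validate="many_to_one")
```

Checking `unmatched` after every important join is a cheap habit that catches misspelled keys early.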

5.8 Pivot Tables

# Pivot: rows=category, values=revenue
pivot = df.pivot_table(
    values="revenue",
    index="category",
    aggfunc=["sum", "mean"]
)
print(pivot)
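pivot_table can also spread a second categorical variable across the columns. A sketch that rebuilds the chapter's product data (revenue and price_tier as computed in 5.5):

```python
import pandas as pd

df = pd.DataFrame({
    "product":    ["Widget", "Gadget", "Bracket", "Bolt"],
    "category":   ["A", "B", "C", "C"],
    "price_tier": ["Premium", "Standard", "Standard", "Standard"],
    "revenue":    [14995.0, 4640.0, 5437.5, 4400.0],
})

# rows = category, columns = price_tier, cells = summed revenue
wide = df.pivot_table(
    values="revenue",
    index="category",
    columns="price_tier",
    aggfunc="sum",
    fill_value=0,        # empty combinations become 0 instead of NaN
)
```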

5.9 The apply() Method

apply() lets you run any function along an axis of a DataFrame. It is useful when vectorized operations are not sufficient, for example when you need row-wise logic that depends on multiple columns. Use it sparingly because it is slower than vectorized operations.

# apply a function to each row
# (cost modeled with assumed per-category cost rates -- illustrative values;
#  a single flat rate would give every row the same margin)
COST_RATE = {"A": 0.5, "B": 0.7, "C": 0.85}

def margin_tier(row):
    cost = row["price"] * row["quantity"] * COST_RATE[row["category"]]
    margin = (row["revenue"] - cost) / row["revenue"]
    if margin > 0.4:
        return "High"
    elif margin > 0.2:
        return "Medium"
    else:
        return "Low"

df["margin_tier"] = df.apply(margin_tier, axis=1)

# apply to a single column (same as .map())
df["price_usd"] = df["price"].apply(lambda x: round(x * 1.1, 2))

# element-wise over an entire DataFrame: DataFrame.map()
# (named applymap() before pandas 2.1, where applymap is deprecated)
numeric_df = df[["price", "quantity"]]
formatted = numeric_df.map(lambda x: f"{x:,.2f}")
print(formatted)

Performance note: apply() runs Python code row by row and is much slower than vectorized operations. Before using apply(), check whether np.where(), np.select(), or pandas built-in methods can accomplish the same task. Reserve apply() for complex logic that cannot be vectorized.
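To make that concrete, here is the price_tier logic from 5.5 rewritten without apply(), plus a three-way variant using np.select:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [29.99, 14.50, 7.25, 1.10]})

# Vectorized equivalent of the price_tier apply() from 5.5
df["price_tier"] = np.where(df["price"] > 20, "Premium", "Standard")

# np.select handles more than two branches; conditions are checked in order
conditions = [df["price"] > 20, df["price"] > 5]
choices = ["Premium", "Standard"]
df["tier3"] = np.select(conditions, choices, default="Budget")
```

On a few rows the difference is invisible, but on millions of rows the vectorized form is typically orders of magnitude faster.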

5.10 String Accessor (.str)

The .str accessor exposes string methods on an entire Series, eliminating the need to loop through rows. This is essential for cleaning text columns such as names, addresses, and product descriptions.

orders = pd.DataFrame({
    "customer": ["  John Smith ", "JANE DOE", "bob jones", "Alice Lee"],
    "sku": ["WDG-2024-A", "GDT-2024-B", "BRK-2023-C", "BLT-2024-A"],
    "email": ["john@co.com", "jane@co.com", "bob@co.com", None]
})

# Clean customer names
orders["customer"] = orders["customer"].str.strip().str.title()
print(orders["customer"])
# 0    John Smith
# 1      Jane Doe
# 2     Bob Jones
# 3     Alice Lee

# Extract parts from SKU
orders["product_code"] = orders["sku"].str.split("-").str[0]
orders["year"] = orders["sku"].str.split("-").str[1].astype(int)

# Boolean check — contains, startswith, endswith
print(orders["sku"].str.contains("2024"))
print(orders["email"].str.endswith("@co.com"))  # missing values propagate as NaN rather than raising

# Replace text
orders["sku_clean"] = orders["sku"].str.replace("-", "")

# String length
print(orders["customer"].str.len())
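When a column encodes several fields at once, str.extract() with a regular expression pulls them out in a single pass, as an alternative to the chained splits above. A sketch on the same SKU format:

```python
import pandas as pd

orders = pd.DataFrame({"sku": ["WDG-2024-A", "GDT-2024-B", "BRK-2023-C"]})

# Named capture groups become column names in the result
parts = orders["sku"].str.extract(r"(?P<code>\w+)-(?P<year>\d{4})-(?P<rev>\w)")
parts["year"] = parts["year"].astype(int)
```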

5.11 DateTime Accessor (.dt)

When a column is parsed as datetime, the .dt accessor exposes properties like year, month, day of week, and quarter. This is critical for time-series analysis, seasonal patterns, and cohort-based reporting.

# Parse string dates into datetime
sales = pd.DataFrame({
    "date": ["2024-01-15", "2024-03-22", "2024-07-04", "2024-11-28"],
    "amount": [1200, 3400, 2800, 4100]
})
sales["date"] = pd.to_datetime(sales["date"])

# Extract components
sales["year"]    = sales["date"].dt.year
sales["month"]   = sales["date"].dt.month
sales["quarter"] = sales["date"].dt.quarter
sales["weekday"] = sales["date"].dt.day_name()

print(sales)

# Filter by date range
h2 = sales[sales["date"] >= "2024-07-01"]
print(h2)

# Compute days since last sale
sales["days_gap"] = sales["date"].diff().dt.days

Always parse dates early. Call pd.to_datetime() right after reading your data. Keeping dates as strings prevents you from filtering by date ranges, extracting components, and computing differences. Many read_csv() calls benefit from the parse_dates=["col"] parameter.
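Once the column is a real datetime, resample() aggregates rows by calendar frequency (here via the on= parameter, so no index change is needed). A sketch with illustrative daily data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "amount": rng.integers(100, 500, size=90),
})

# Roll 90 daily rows up to monthly totals ("MS" = month start)
monthly_totals = daily.resample("MS", on="date")["amount"].sum()
```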

5.12 Window Functions: Rolling and Expanding

Window functions compute statistics over a sliding window of rows. rolling(n) uses a fixed window of n observations; expanding() uses all rows from the start up to the current row. These are essential for computing moving averages, cumulative sums, and trend indicators.

# Monthly revenue data
monthly = pd.DataFrame({
    "month": pd.date_range("2024-01", periods=12, freq="MS"),
    "revenue": [45, 52, 48, 61, 55, 70, 68, 72, 65, 80, 78, 90]
})

# 3-month rolling average (smooths out noise)
monthly["rolling_3m"] = monthly["revenue"].rolling(3).mean()

# Cumulative (expanding) sum — year-to-date total
monthly["ytd_total"] = monthly["revenue"].expanding().sum()

# Rolling standard deviation (volatility measure)
monthly["rolling_std"] = monthly["revenue"].rolling(3).std()

# Percent change from previous period
monthly["pct_change"] = monthly["revenue"].pct_change()

print(monthly.round(2))
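Note that rolling(3) produces NaN for the first two rows because the window is not yet full; the min_periods parameter relaxes this. A small sketch:

```python
import pandas as pd

revenue = pd.Series([45, 52, 48, 61, 55])

strict = revenue.rolling(3).mean()                  # NaN until 3 observations exist
relaxed = revenue.rolling(3, min_periods=1).mean()  # partial windows allowed
```

With min_periods=1 the first value is just the mean of one observation (45.0), so decide whether partial windows are statistically meaningful for your use case before enabling them.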

5.13 MultiIndex (Hierarchical Indexing)

A MultiIndex lets you have multiple levels of row or column labels. This is useful when your data has a natural hierarchy (e.g., region and store, or year and quarter). GroupBy operations often produce MultiIndex results.

# Create a MultiIndex from groupby
data = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "product": ["Widget", "Gadget", "Widget", "Gadget", "Bolt", "Bolt"],
    "revenue": [15000, 8000, 12000, 9500, 3000, 4500]
})

summary = data.groupby(["region", "product"])["revenue"].sum()
print(summary)
# region  product
# East    Bolt        3000
#         Gadget      8000
#         Widget     15000
# West    Bolt        4500
#         ...

# Access with .loc on MultiIndex
print(summary.loc["East"])          # All East products
print(summary.loc[("East", "Widget")])  # 15000

# Reset back to flat DataFrame
flat = summary.reset_index()
print(flat)
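A MultiIndex Series can also be reshaped into a wide table with unstack(), which moves the inner index level into the columns (stack() reverses it). A sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["Widget", "Gadget", "Widget", "Gadget", "Bolt", "Bolt"],
    "revenue": [15000, 8000, 12000, 9500, 3000, 4500],
})

summary = data.groupby(["region", "product"])["revenue"].sum()

# rows: region, columns: product -- a compact cross-tab
wide = summary.unstack()
```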

5.14 Reading and Writing: Advanced Options

Beyond basic read_csv(), pandas supports Excel files, SQL databases, JSON, and many other formats. Knowing the key parameters saves hours of manual data cleaning.

# read_csv — important parameters
df = pd.read_csv(
    "sales.csv",
    parse_dates=["order_date"],      # auto-parse date columns
    dtype={"zip_code": str},         # keep leading zeros
    na_values=["N/A", "missing", ""], # treat these as NaN
    usecols=["order_date", "product", "revenue"],  # read only needed cols
    nrows=1000                       # read first 1000 rows (for testing)
)

# read_excel — specify sheet and range
df_xl = pd.read_excel(
    "report.xlsx",
    sheet_name="Q4 Data",
    skiprows=2,                      # skip header rows
    usecols="A:F"                     # Excel-style column range
)

# read_sql — query a database directly
import sqlite3
conn = sqlite3.connect("warehouse.db")
df_sql = pd.read_sql("SELECT * FROM orders WHERE year = 2024", conn)

# to_csv — important parameters
df.to_csv(
    "output.csv",
    index=False,           # don't write row numbers
    encoding="utf-8-sig",   # BOM for Excel compatibility
    float_format="%.2f",    # limit decimal places
    columns=["product", "revenue"]  # write only selected columns
)

dtype parameter: Always specify dtype={"zip_code": str} for columns that look like numbers but should remain strings (zip codes, product IDs with leading zeros). Without this, pandas silently converts "07102" to the integer 7102.
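For files too large to load at once, read_csv() also accepts a chunksize, which turns the call into an iterator of DataFrames. A sketch using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

# Simulate a file on disk with an in-memory CSV buffer
buf = io.StringIO("id,revenue\n" + "\n".join(f"{i},{i * 10}" for i in range(10)))

# Process 4 rows at a time instead of loading everything into memory
total = 0
for chunk in pd.read_csv(buf, chunksize=4):
    total += chunk["revenue"].sum()
```

Each chunk is an ordinary DataFrame, so any aggregation you can express incrementally (sums, counts, running maxima) works with this pattern.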

5.15 Method Chaining

Method chaining is the practice of calling multiple DataFrame methods in sequence, where each method returns a new DataFrame. This produces clean, readable data pipelines that are easy to understand and modify. Think of it as building a recipe, step by step.

# Without chaining — hard to follow
df2 = df.dropna()
df3 = df2[df2["revenue"] > 0]
df4 = df3.assign(margin=df3["revenue"] - df3["cost"])
df5 = df4.sort_values("margin", ascending=False)

# With chaining — clean, readable pipeline
result = (
    df
    .dropna()
    .query("revenue > 0")
    .assign(margin=lambda x: x["revenue"] - x["cost"])
    .sort_values("margin", ascending=False)
    .head(10)
)

# Real-world example: monthly sales report
report = (
    pd.read_csv("orders.csv", parse_dates=["date"])
    .assign(
        month=lambda x: x["date"].dt.to_period("M"),
        total=lambda x: x["quantity"] * x["unit_price"]
    )
    .groupby("month")
    .agg(
        orders=("order_id", "count"),
        revenue=("total", "sum")
    )
    .reset_index()
)

Best practices for chaining: (1) Wrap the entire chain in parentheses so you can break across lines without backslashes. (2) Use .assign() instead of df["col"] = ... because assign() returns a new DataFrame and supports chaining. (3) Use .query() instead of boolean indexing inside chains for readability.
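Chains can also call your own functions with .pipe(), which passes the DataFrame as the first argument. A sketch (add_margin and its flat cost_rate are hypothetical helpers):

```python
import pandas as pd

def add_margin(frame, cost_rate):
    # Illustrative helper: margin assuming a flat cost rate
    return frame.assign(margin=frame["revenue"] * (1 - cost_rate))

df = pd.DataFrame({"revenue": [100.0, 250.0, 40.0]})

result = (
    df
    .pipe(add_margin, cost_rate=0.6)
    .query("margin > 20")
    .sort_values("margin", ascending=False)
)
```

.pipe() keeps custom steps inside the chain instead of forcing you to break it into intermediate variables.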

5.16 Handling Missing Data

# Detect missing values
print(df.isnull().sum())

# Fill missing values
df["price"] = df["price"].fillna(df["price"].median())  # assign back; inplace on a column selection is unreliable

# Drop rows with any missing value
df_clean = df.dropna()

# Forward fill (useful for time series)
df["stock"] = df["stock"].ffill()  # fillna(method="ffill") is deprecated
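dropna() also offers finer control: subset restricts the check to certain columns, and thresh keeps rows with at least a given number of non-null values. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["Widget", "Gadget", None, "Bolt"],
    "price":   [29.99, np.nan, 7.25, 1.10],
    "note":    [None, None, None, "rush"],
})

only_product = df.dropna(subset=["product"])  # drop rows missing product only
mostly_full = df.dropna(thresh=2)             # keep rows with >= 2 non-null values
```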

5.17 Sorting and Exporting

# Sort by revenue descending
df_sorted = df.sort_values("revenue", ascending=False)

# Export
df_sorted.to_csv("output.csv", index=False)
df_sorted.to_excel("output.xlsx", index=False)

Exercise 5.1

Create a DataFrame of 20 fictitious orders with columns: order_id, customer, product, quantity, unit_price, and region (East/West/Central). Compute total_price, then use groupby to find the region with the highest total revenue and the product with the highest average order size.

Exercise 5.2

Download any public CSV dataset (e.g., from Kaggle). Read it with pd.read_csv(), check for missing values with .isnull().sum(), fill or drop them, and create a pivot table summarizing a numeric column by a categorical column.

Exercise 5.3

Create a DataFrame with a "date" column spanning 365 days and a "daily_sales" column with random values. Parse dates with pd.to_datetime(), then: (a) extract month and day-of-week columns using .dt, (b) compute a 7-day rolling average, (c) find the month with the highest total sales using groupby.

Exercise 5.4

Given a DataFrame with messy customer names (mixed case, extra spaces), use the .str accessor to clean them. Then rewrite the entire cleaning pipeline using method chaining: read the CSV, strip and title-case names, filter out rows with missing emails, add a "domain" column extracted from the email using .str.split("@").str[1], and sort by domain.
