DataFrames and Data Wrangling
Pandas provides the DataFrame, a two-dimensional labeled data structure that is the standard tool for tabular data manipulation in Python. If you have worked with Excel pivot tables or SQL queries, pandas will feel familiar but far more powerful and reproducible.
import pandas as pd
import numpy as np
# CSV (most common)
df = pd.read_csv("orders.csv")

# Excel
df_xl = pd.read_excel("inventory.xlsx", sheet_name="Sheet1")

# Quick inspection
print(df.shape)    # (rows, cols)
print(df.dtypes)   # column types
df.head()          # first 5 rows
df.describe()      # summary statistics
df.info()          # non-null counts and memory
# From a dictionary
data = {
    "product": ["Widget", "Gadget", "Bracket", "Bolt"],
    "price": [29.99, 14.50, 7.25, 1.10],
    "quantity": [500, 320, 750, 4000],
    "category": ["A", "B", "C", "C"]
}
df = pd.DataFrame(data)
print(df)
# loc: label-based selection
df.loc[0, "product"]              # "Widget"
df.loc[:2, ["product", "price"]]  # first 3 rows, 2 columns

# iloc: integer-position based
df.iloc[0, 1]                     # 29.99 (row 0, col 1)
df.iloc[:2, :3]                   # first 2 rows, first 3 cols

# Boolean filtering
expensive = df[df["price"] > 10]
print(expensive)
Use loc when selecting by column names and row labels. Use iloc when selecting by integer position. When in doubt, prefer loc for clarity.
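The distinction matters most when the row labels are not the default 0..n integers. A small sketch (with a made-up SKU index) shows how the two accessors address the same cell differently:

```python
import pandas as pd

# Hypothetical table indexed by SKU rather than by position
stock = pd.DataFrame(
    {"price": [29.99, 14.50, 7.25]},
    index=["WDG", "GDT", "BRK"],
)

# loc uses the labels...
print(stock.loc["GDT", "price"])   # 14.5

# ...while iloc uses integer positions
print(stock.iloc[1, 0])            # 14.5 — same cell, different addressing

# Note: label slicing with loc includes BOTH endpoints
print(stock.loc["WDG":"GDT"])      # two rows: WDG and GDT
```

The inclusive endpoint on loc slices is a common surprise for readers coming from Python list slicing, where the stop index is excluded.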
# New computed column
df["revenue"] = df["price"] * df["quantity"]

# Apply a function
df["price_tier"] = df["price"].apply(
    lambda p: "Premium" if p > 20 else "Standard"
)
print(df)
# Group by category and compute summary stats
summary = df.groupby("category").agg(
    total_revenue=("revenue", "sum"),
    avg_price=("price", "mean"),
    n_products=("product", "count")
).reset_index()
print(summary)
# Supplier table
suppliers = pd.DataFrame({
    "product": ["Widget", "Gadget", "Bracket", "Bolt"],
    "supplier": ["Acme", "Globex", "Acme", "Initech"],
    "lead_days": [7, 14, 5, 3]
})

# Left join on product name (every row of df is kept)
merged = pd.merge(df, suppliers, on="product", how="left")
print(merged[["product", "revenue", "supplier", "lead_days"]])
# Pivot: rows=category, values=revenue
pivot = df.pivot_table(
    values="revenue",
    index="category",
    aggfunc=["sum", "mean"]
)
print(pivot)
apply() lets you run any function along an axis of a DataFrame. It is useful when vectorized operations are not sufficient, for example when you need row-wise logic that depends on multiple columns. Use it sparingly because it is slower than vectorized operations.
# Apply a function to each row
def margin_tier(row):
    margin = (row["revenue"] - row["price"] * row["quantity"] * 0.6) / row["revenue"]
    if margin > 0.4:
        return "High"
    elif margin > 0.2:
        return "Medium"
    else:
        return "Low"

df["margin_tier"] = df.apply(margin_tier, axis=1)

# Apply to a single column (same as .map())
df["price_usd"] = df["price"].apply(lambda x: round(x * 1.1, 2))

# applymap — apply element-wise to an entire DataFrame
# (renamed to DataFrame.map in pandas 2.1+)
numeric_df = df[["price", "quantity"]]
formatted = numeric_df.applymap(lambda x: f"{x:,.2f}")
print(formatted)
apply() runs Python code row by row and is much slower than vectorized operations. Before using apply(), check whether np.where(), np.select(), or pandas built-in methods can accomplish the same task. Reserve apply() for complex logic that cannot be vectorized.
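As a sketch of that advice, the price_tier logic from earlier can be expressed with np.where, and a multi-bucket version with np.select; both run vectorized over the whole column instead of row by row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [29.99, 14.50, 7.25, 1.10]})

# Vectorized equivalent of .apply(lambda p: "Premium" if p > 20 else "Standard")
df["price_tier"] = np.where(df["price"] > 20, "Premium", "Standard")

# np.select handles more than two buckets; conditions are checked in order,
# and the first match wins (tier names here are illustrative)
conditions = [df["price"] > 20, df["price"] > 5]
choices = ["Premium", "Mid"]
df["tier3"] = np.select(conditions, choices, default="Budget")

print(df)
```

The conditions list acts like an if/elif chain: rows matching no condition fall through to the default.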
The .str accessor exposes string methods on an entire Series, eliminating the need to loop through rows. This is essential for cleaning text columns such as names, addresses, and product descriptions.
orders = pd.DataFrame({
    "customer": [" John Smith ", "JANE DOE", "bob jones", "Alice Lee"],
    "sku": ["WDG-2024-A", "GDT-2024-B", "BRK-2023-C", "BLT-2024-A"],
    "email": ["john@co.com", "jane@co.com", "bob@co.com", None]
})

# Clean customer names
orders["customer"] = orders["customer"].str.strip().str.title()
print(orders["customer"])
# 0    John Smith
# 1      Jane Doe
# 2     Bob Jones
# 3     Alice Lee

# Extract parts from SKU
orders["product_code"] = orders["sku"].str.split("-").str[0]
orders["year"] = orders["sku"].str.split("-").str[1].astype(int)

# Boolean checks — contains, startswith, endswith
print(orders["sku"].str.contains("2024"))
print(orders["email"].str.endswith("@co.com"))  # handles NaN gracefully

# Replace text
orders["sku_clean"] = orders["sku"].str.replace("-", "")

# String length
print(orders["customer"].str.len())
When a column is parsed as datetime, the .dt accessor exposes properties like year, month, day of week, and quarter. This is critical for time-series analysis, seasonal patterns, and cohort-based reporting.
# Parse string dates into datetime
sales = pd.DataFrame({
    "date": ["2024-01-15", "2024-03-22", "2024-07-04", "2024-11-28"],
    "amount": [1200, 3400, 2800, 4100]
})
sales["date"] = pd.to_datetime(sales["date"])

# Extract components
sales["year"] = sales["date"].dt.year
sales["month"] = sales["date"].dt.month
sales["quarter"] = sales["date"].dt.quarter
sales["weekday"] = sales["date"].dt.day_name()
print(sales)

# Filter by date range
h2 = sales[sales["date"] >= "2024-07-01"]
print(h2)

# Compute days since last sale
sales["days_gap"] = sales["date"].diff().dt.days
Convert date columns with pd.to_datetime() right after reading your data. Keeping dates as strings prevents you from filtering by date ranges, extracting components, and computing differences. Many read_csv() calls benefit from the parse_dates=["col"] parameter.
Window functions compute statistics over a sliding window of rows. rolling(n) uses a fixed window of n observations; expanding() uses all rows from the start up to the current row. These are essential for computing moving averages, cumulative sums, and trend indicators.
# Monthly revenue data
monthly = pd.DataFrame({
    "month": pd.date_range("2024-01", periods=12, freq="MS"),
    "revenue": [45, 52, 48, 61, 55, 70, 68, 72, 65, 80, 78, 90]
})

# 3-month rolling average (smooths out noise)
monthly["rolling_3m"] = monthly["revenue"].rolling(3).mean()

# Cumulative (expanding) sum — year-to-date total
monthly["ytd_total"] = monthly["revenue"].expanding().sum()

# Rolling standard deviation (volatility measure)
monthly["rolling_std"] = monthly["revenue"].rolling(3).std()

# Percent change from previous period
monthly["pct_change"] = monthly["revenue"].pct_change()

print(monthly.round(2))
A MultiIndex lets you have multiple levels of row or column labels. This is useful when your data has a natural hierarchy (e.g., region and store, or year and quarter). GroupBy operations often produce MultiIndex results.
# Create a MultiIndex from groupby
data = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "product": ["Widget", "Gadget", "Widget", "Gadget", "Bolt", "Bolt"],
    "revenue": [15000, 8000, 12000, 9500, 3000, 4500]
})
summary = data.groupby(["region", "product"])["revenue"].sum()
print(summary)
# region  product
# East    Bolt        3000
#         Gadget      8000
#         Widget     15000
# West    Bolt        4500
# ...

# Access with .loc on a MultiIndex
print(summary.loc["East"])              # all East products
print(summary.loc[("East", "Widget")])  # 15000

# Reset back to a flat DataFrame
flat = summary.reset_index()
print(flat)
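Besides reset_index(), another common way to flatten a grouped result, sketched below, is .unstack(), which pivots the inner index level into columns and turns the Series into a region-by-product table:

```python
import pandas as pd

data = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "product": ["Widget", "Gadget", "Widget", "Gadget", "Bolt", "Bolt"],
    "revenue": [15000, 8000, 12000, 9500, 3000, 4500],
})
summary = data.groupby(["region", "product"])["revenue"].sum()

# Move the 'product' level from the row index into columns
table = summary.unstack("product")
print(table)
# product  Bolt  Gadget  Widget
# region
# East     3000    8000   15000
# West     4500    9500   12000
```

If a (region, product) combination had no rows, its cell would appear as NaN, which you can fill with .unstack(fill_value=0).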
Beyond basic read_csv(), pandas supports Excel files, SQL databases, JSON, and many other formats. Knowing the key parameters saves hours of manual data cleaning.
# read_csv — important parameters
df = pd.read_csv(
    "sales.csv",
    parse_dates=["order_date"],                    # auto-parse date columns
    dtype={"zip_code": str},                       # keep leading zeros
    na_values=["N/A", "missing", ""],              # treat these as NaN
    usecols=["order_date", "product", "revenue"],  # read only needed cols
    nrows=1000                                     # first 1000 rows (for testing)
)

# read_excel — specify sheet and range
df_xl = pd.read_excel(
    "report.xlsx",
    sheet_name="Q4 Data",
    skiprows=2,     # skip header rows
    usecols="A:F"   # Excel-style column range
)

# read_sql — query a database directly
import sqlite3
conn = sqlite3.connect("warehouse.db")
df_sql = pd.read_sql("SELECT * FROM orders WHERE year = 2024", conn)

# to_csv — important parameters
df.to_csv(
    "output.csv",
    index=False,                    # don't write row numbers
    encoding="utf-8-sig",           # BOM for Excel compatibility
    float_format="%.2f",            # limit decimal places
    columns=["product", "revenue"]  # write only selected columns
)
Use dtype={"zip_code": str} for columns that look like numbers but should remain strings (zip codes, product IDs with leading zeros). Without it, pandas silently converts "07102" to the integer 7102.
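A quick sketch of that failure mode, using an in-memory CSV via io.StringIO instead of a real file:

```python
import io
import pandas as pd

csv_text = "zip_code,orders\n07102,12\n90210,8\n"

# Default parsing: pandas infers an integer dtype and drops the leading zero
bad = pd.read_csv(io.StringIO(csv_text))
print(bad["zip_code"].tolist())   # [7102, 90210]

# Forcing str preserves the column exactly as written
good = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
print(good["zip_code"].tolist())  # ['07102', '90210']
```

The same trick applies to phone numbers, account codes, and any identifier where "looks numeric" does not mean "is a number."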
Method chaining is the practice of calling multiple DataFrame methods in sequence, where each method returns a new DataFrame. This produces clean, readable data pipelines that are easy to understand and modify. Think of it as building a recipe, step by step.
# Without chaining — hard to follow
df2 = df.dropna()
df3 = df2[df2["revenue"] > 0]
df4 = df3.assign(margin=df3["revenue"] - df3["cost"])
df5 = df4.sort_values("margin", ascending=False)

# With chaining — clean, readable pipeline
result = (
    df
    .dropna()
    .query("revenue > 0")
    .assign(margin=lambda x: x["revenue"] - x["cost"])
    .sort_values("margin", ascending=False)
    .head(10)
)

# Real-world example: monthly sales report
report = (
    pd.read_csv("orders.csv", parse_dates=["date"])
    .assign(
        month=lambda x: x["date"].dt.to_period("M"),
        total=lambda x: x["quantity"] * x["unit_price"]
    )
    .groupby("month")
    .agg(
        orders=("order_id", "count"),
        revenue=("total", "sum")
    )
    .reset_index()
)
Use .assign() instead of df["col"] = ... because assign() returns a new DataFrame and supports chaining. Use .query() instead of boolean indexing inside chains for readability.
# Detect missing values
print(df.isnull().sum())

# Fill missing values (assign back rather than inplace=True,
# which is unreliable on a column selection)
df["price"] = df["price"].fillna(df["price"].median())

# Drop rows with any missing value
df_clean = df.dropna()

# Forward fill (useful for time series);
# fillna(method="ffill") is deprecated in favor of .ffill()
df["stock"] = df["stock"].ffill()
# Sort by revenue descending
df_sorted = df.sort_values("revenue", ascending=False)

# Export
df_sorted.to_csv("output.csv", index=False)
df_sorted.to_excel("output.xlsx", index=False)
Create a DataFrame of 20 fictitious orders with columns: order_id, customer, product, quantity, unit_price, and region (East/West/Central). Compute total_price, then use groupby to find the region with the highest total revenue and the product with the highest average order size.
Download any public CSV dataset (e.g., from Kaggle). Read it with pd.read_csv(), check for missing values with .isnull().sum(), fill or drop them, and create a pivot table summarizing a numeric column by a categorical column.
Create a DataFrame with a "date" column spanning 365 days and a "daily_sales" column with random values. Parse dates with pd.to_datetime(), then: (a) extract month and day-of-week columns using .dt, (b) compute a 7-day rolling average, (c) find the month with the highest total sales using groupby.
Given a DataFrame with messy customer names (mixed case, extra spaces), use the .str accessor to clean them. Then rewrite the entire cleaning pipeline using method chaining: read the CSV, strip and title-case names, filter out rows with missing emails, add a "domain" column extracted from the email using .str.split("@").str[1], and sort by domain.
- Use loc for label-based selection and iloc for position-based selection.
- groupby + agg is the pandas equivalent of SQL GROUP BY; pivot_table adds a spreadsheet-style crosstab.
- The .str accessor cleans text columns without loops; the .dt accessor extracts date components.
- rolling() and expanding() compute moving averages, cumulative sums, and other window-based statistics.
- Method chaining with .assign(), .query(), and parentheses produces clean, readable data pipelines.
- Handle missing data with fillna() or dropna(), depending on your analysis needs.
- Specify dtype and parse_dates in read_csv() to avoid silent type-conversion errors.