Chapter 9: Query Optimization

Indexes, EXPLAIN, and Performance Tuning

9.1 Why Optimization Matters

On small tables, every query runs fast. On tables with millions of rows, the difference between a well-optimized query and a naive one can be seconds versus minutes. Understanding indexes and query plans lets you write queries that scale.

9.2 Indexes

An index is a data structure (typically a B-tree) that lets the database find rows without scanning the entire table. Think of it like the index at the back of a textbook: instead of reading every page, you look up the topic and jump directly to the right page.

-- Create an index on a frequently filtered column
CREATE INDEX idx_orders_customer
ON orders(customer_id);

-- Composite index (for queries that filter on both columns)
CREATE INDEX idx_orders_date_customer
ON orders(order_date, customer_id);

-- Unique index (enforces uniqueness + speeds lookups)
CREATE UNIQUE INDEX idx_customers_email
ON customers(email);

-- List all indexes on a table (SQLite)
.indexes orders

-- Drop an index
DROP INDEX idx_orders_customer;

Index trade-offs: Indexes speed up reads (SELECT, WHERE, JOIN, ORDER BY) but slow down writes (INSERT, UPDATE, DELETE), because the database must update the index alongside the table. Index columns you query often; do not index every column.

9.3 EXPLAIN and Query Plans

-- SQLite: EXPLAIN QUERY PLAN shows how the query will execute
EXPLAIN QUERY PLAN
SELECT * FROM orders
WHERE customer_id = 42;
-- Without index: SCAN orders (reads every row)
-- With index:    SEARCH orders USING INDEX idx_orders_customer

-- PostgreSQL: EXPLAIN ANALYZE (executes and reports actual timing)
-- EXPLAIN ANALYZE
-- SELECT * FROM orders WHERE customer_id = 42;

Reading EXPLAIN output: Look for "SCAN" (bad: full table scan) vs. "SEARCH" or "USING INDEX" (good: index lookup). If you see SCAN on a large table with a WHERE clause, you probably need an index on that column.

9.4 Index Selection Guidelines

Scenario	Index Recommendation
WHERE column = value	Single-column index on that column
WHERE a = x AND b = y	Composite index on (a, b)
JOIN ON a.id = b.foreign_id	Index on the foreign key column
ORDER BY column	Index on that column (avoids sorting)
WHERE a = x ORDER BY b	Composite index on (a, b)
Rarely queried column	No index (saves write overhead)

9.5 Common Anti-Patterns

-- BAD: function on indexed column disables the index
SELECT * FROM orders
WHERE STRFTIME('%Y', order_date) = '2024';
-- Database cannot use index on order_date

-- GOOD: rewrite to preserve index usage
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

-- BAD: SELECT * when you only need two columns
SELECT * FROM orders;  -- reads all columns from disk

-- GOOD: name only the columns you need
SELECT order_id, total FROM orders;

-- BAD: implicit type conversion
SELECT * FROM products
WHERE product_id = '42';  -- string compared to integer column

-- GOOD: match the column type
SELECT * FROM products
WHERE product_id = 42;

9.6 Views

A view is a saved query that acts like a virtual table. It does not store data; it re-executes the underlying query each time you reference it.

-- Create a view for frequently used report
CREATE VIEW monthly_revenue AS
SELECT
    STRFTIME('%Y-%m', order_date) AS month,
    COUNT(*) AS order_count,
    SUM(total) AS revenue
FROM orders
GROUP BY month;

-- Query the view like a table
SELECT * FROM monthly_revenue
WHERE revenue > 10000;

-- Drop a view
DROP VIEW monthly_revenue;

-- PostgreSQL: materialized views (cached, need manual refresh)
-- CREATE MATERIALIZED VIEW monthly_revenue AS ...
-- REFRESH MATERIALIZED VIEW monthly_revenue;

Views vs. materialized views: Regular views run the query every time, so they always return fresh data. Materialized views store the result on disk, making reads fast but requiring a manual REFRESH to update. Use materialized views for expensive aggregations that do not need real-time freshness.

9.7 Temporary Tables

-- Create a temp table (exists only for the current session)
CREATE TEMP TABLE high_value_customers AS
SELECT
    customer_id,
    SUM(total) AS lifetime_value
FROM orders
GROUP BY customer_id
HAVING SUM(total) > 1000;

-- Use it in subsequent queries
SELECT c.first_name, c.last_name, hvc.lifetime_value
FROM high_value_customers hvc
INNER JOIN customers c ON hvc.customer_id = c.customer_id;

9.8 Query Rewriting Tips

-- EXISTS is often faster than IN for large subqueries
-- Instead of:
SELECT * FROM products
WHERE product_id IN (SELECT product_id FROM order_items);

-- Use:
SELECT * FROM products p
WHERE EXISTS (
    SELECT 1 FROM order_items oi
    WHERE oi.product_id = p.product_id
);

-- Avoid correlated subqueries in SELECT when a JOIN works
-- Instead of:
SELECT c.first_name,
    (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS cnt
FROM customers c;

-- Use:
SELECT c.first_name, COUNT(o.order_id) AS cnt
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name;

Exercise 9.1

Create a products table with 1000+ rows (use a recursive CTE to generate data). Run EXPLAIN QUERY PLAN SELECT * FROM products WHERE category = 'Electronics' before and after adding an index on category. Compare the output.

Exercise 9.2

Create a view called customer_summary that shows each customer's name, total number of orders, total spend, and average order value. Query the view to find the top 10 customers by total spend.

Exercise 9.3

Identify a query you wrote in a previous chapter that could benefit from an index. Add the index, run EXPLAIN QUERY PLAN before and after, and document the difference.

Official Resources

Chapter 9 Takeaways

Indexes speed up reads by avoiding full table scans, but add overhead to writes.
Use EXPLAIN QUERY PLAN to see whether a query uses an index or scans the entire table.
Index columns that appear in WHERE, JOIN ON, and ORDER BY clauses.
Avoid applying functions to indexed columns in WHERE; rewrite the condition to preserve index usage.
Views provide reusable, named queries; materialized views cache results for expensive aggregations.
Prefer EXISTS over IN for large subqueries; prefer JOINs over correlated subqueries in SELECT.

← Chapter 8: Database Design Chapter 10: Project →