Chapter 4: Aggregation & GROUP BY

Summarizing Data

Raw detail rows are useful, but business decisions require summaries. How much revenue did we earn last quarter? Which product category sells best? Aggregation functions collapse many rows into a single summary value, and GROUP BY produces those summaries for each group independently.

4.1 Aggregate Functions

SQL provides five core aggregate functions. Each takes a set of values from multiple rows and returns one result.

-- Basic aggregate functions on the orders table
SELECT
    COUNT(*)     AS total_orders,
    SUM(total)   AS total_revenue,
    AVG(total)   AS avg_order_value,
    MIN(total)   AS smallest_order,
    MAX(total)   AS largest_order
FROM orders
WHERE order_date >= '2024-01-01';
Aggregates collapse rows: When you use an aggregate function without GROUP BY, the entire result set is treated as one group. The query above returns exactly one row, no matter how many orders exist.

4.2 COUNT Variations

COUNT behaves differently depending on its argument. Understanding the three forms prevents subtle bugs.

SELECT
    COUNT(*)                    AS total_rows,
    COUNT(email)                AS has_email,
    COUNT(DISTINCT city)        AS unique_cities
FROM customers;
COUNT(*) vs COUNT(column): COUNT(*) counts every row, including those with NULLs. COUNT(column) counts only non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values. If email is NULL for some customers, COUNT(email) excludes them.

4.3 GROUP BY Basics

GROUP BY splits the result set into groups. Each group gets its own aggregate calculation.

-- Revenue by product category
SELECT
    p.category,
    COUNT(*)                            AS num_orders,
    SUM(oi.quantity * oi.unit_price)   AS total_revenue,
    AVG(oi.quantity * oi.unit_price)   AS avg_line_total
FROM order_items oi
INNER JOIN products p ON oi.product_id = p.product_id
GROUP BY p.category
ORDER BY total_revenue DESC;

-- Grouping by multiple columns: orders per customer per month
SELECT
    customer_id,
    STRFTIME('%Y-%m', order_date) AS month,
    COUNT(*)                        AS order_count,
    SUM(total)                      AS monthly_spend
FROM orders
GROUP BY customer_id, STRFTIME('%Y-%m', order_date)
ORDER BY customer_id, month;

Every non-aggregated column in your SELECT list must appear in the GROUP BY clause. Violating this rule produces an error in standard SQL (though MySQL historically allowed it with unpredictable results).

4.4 HAVING Clause

WHERE filters individual rows before grouping. HAVING filters groups after aggregation. This distinction matters because you cannot reference aggregate results in a WHERE clause.

-- Categories with more than $500 in total sales
SELECT
    p.category,
    SUM(oi.quantity * oi.unit_price) AS total_sales,
    COUNT(*)                          AS items_sold
FROM order_items oi
INNER JOIN products p ON oi.product_id = p.product_id
WHERE oi.unit_price > 0             -- row-level filter
GROUP BY p.category
HAVING SUM(oi.quantity * oi.unit_price) > 500   -- group-level filter
ORDER BY total_sales DESC;

-- Customers who placed at least 3 orders
SELECT
    customer_id,
    COUNT(*)    AS order_count,
    SUM(total)  AS total_spend
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 3;
Execution order matters: The logical processing order is: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. WHERE cannot see aggregate results because aggregation has not happened yet. HAVING operates on the already-formed groups.

4.5 Aggregation with NULL

All aggregate functions except COUNT(*) ignore NULL values. This can produce surprising results if you are not careful.

-- Suppose some customers have NULL email
SELECT
    COUNT(*)            AS total_customers,
    COUNT(email)        AS with_email,
    COUNT(*) - COUNT(email) AS missing_email
FROM customers;

-- SUM of an all-NULL group returns NULL, not zero
-- Wrap in COALESCE to get zero instead
SELECT
    category,
    COALESCE(SUM(in_stock), 0) AS total_stock
FROM products
GROUP BY category;

4.6 GROUP BY with Expressions

You can group by computed expressions, not just raw columns. This is especially useful for date-based summaries and conditional aggregation.

-- Monthly revenue summary (SQLite date function)
SELECT
    STRFTIME('%Y-%m', order_date) AS month,
    SUM(total)                      AS monthly_revenue,
    COUNT(*)                        AS num_orders
FROM orders
GROUP BY STRFTIME('%Y-%m', order_date)
ORDER BY month;

-- Conditional aggregation with CASE inside SUM
SELECT
    STRFTIME('%Y-%m', order_date) AS month,
    COUNT(*) AS total_orders,
    SUM(CASE WHEN total >= 100 THEN 1 ELSE 0 END) AS large_orders,
    ROUND(
        100.0 * SUM(CASE WHEN total >= 100 THEN 1 ELSE 0 END)
        / COUNT(*), 1
    ) AS large_order_pct
FROM orders
GROUP BY STRFTIME('%Y-%m', order_date);
Conditional aggregation: CASE inside an aggregate is a powerful pattern. It lets you pivot data and compute multiple filtered metrics in a single pass, avoiding the need for repeated subqueries or separate queries.

4.7 ROLLUP and CUBE (PostgreSQL)

Standard GROUP BY produces one level of grouping. ROLLUP adds subtotals and a grand total. CUBE generates all possible subtotal combinations. These are supported in PostgreSQL, SQL Server, and Oracle (not SQLite).

-- PostgreSQL: ROLLUP for subtotals by category, then grand total
SELECT
    COALESCE(category, 'ALL CATEGORIES') AS category,
    SUM(in_stock)   AS total_stock,
    COUNT(*)        AS num_products
FROM products
GROUP BY ROLLUP(category)
ORDER BY category;

ROLLUP(A, B) produces grouping sets: (A, B), (A), and (). CUBE(A, B) produces: (A, B), (A), (B), and ().

4.8 GROUP_CONCAT (SQLite) / STRING_AGG (PostgreSQL)

Sometimes you need to concatenate text values within a group rather than count or sum them.

-- SQLite: list all product names per category
SELECT
    category,
    COUNT(*) AS num_products,
    GROUP_CONCAT(product_name, ', ') AS product_list
FROM products
GROUP BY category;

-- PostgreSQL equivalent: STRING_AGG
-- SELECT category, STRING_AGG(product_name, ', ' ORDER BY product_name)
-- FROM products GROUP BY category;

Exercise 4.1

Write a query that returns, for each category in the products table: the number of products, the average unit price, the most expensive product price, and the total inventory value (SUM of unit_price * in_stock). Order by total inventory value descending.

Exercise 4.2

Using the orders table, find all customers who have placed more than 2 orders and whose total spending exceeds $200. Return the customer_id, order count, and total spend.

Exercise 4.3

Write a single query that shows, for each month: total revenue, number of orders, and the percentage of orders with total above $100. Use CASE inside an aggregate for the percentage calculation.

Exercise 4.4

Use GROUP_CONCAT (SQLite) to create a query that shows each customer_id alongside a comma-separated list of their order dates.

Official Resources

Chapter 4 Takeaways

← Chapter 3: Joins Chapter 5: Subqueries & CTEs →