Chapter 3: Data Wrangling with tidyverse

Transform, reshape, and summarize data using dplyr and tidyr pipes.

3.1 The Pipe Operator

Pipes chain operations left-to-right, making code readable. R now has a native pipe (|>, R 4.1+) alongside the magrittr pipe (%>%).

library(tidyverse)

# Without pipes: nested and hard to read
round(mean(c(3.1, 4.7, 2.9)), 1)

# With pipe: reads left-to-right
c(3.1, 4.7, 2.9) |> mean() |> round(1)
# Both return 3.6
|> vs. %>% The native pipe |> is slightly faster and requires no package. The magrittr %>% has extra features like . placeholder. For most work, either is fine. This guide uses |>.

3.2 dplyr: Core Verbs

We will use the mpg dataset (bundled with ggplot2) to demonstrate each verb.

filter() — Select Rows

# Cars with highway mpg > 30 and 4 cylinders
mpg |>
  filter(hwy > 30, cyl == 4)

select() — Choose Columns

mpg |>
  select(manufacturer, model, year, hwy)

# Helper functions: starts_with, ends_with, contains
mpg |>
  select(starts_with("c"))  # cty, cyl, class

mutate() — Create/Modify Columns

mpg |>
  mutate(
    kpl_city = cty * 0.4251,     # mpg to km/L
    efficient = hwy > 30         # logical flag
  )

arrange() — Sort Rows

mpg |>
  arrange(desc(hwy)) |>        # highest hwy first
  head(5)

summarize() + group_by() — Aggregate

mpg |>
  group_by(class) |>
  summarize(
    n       = n(),
    avg_hwy = mean(hwy),
    sd_hwy  = sd(hwy)
  ) |>
  arrange(desc(avg_hwy))
Tip: .by argument (dplyr 1.1+) Instead of group_by() |> summarize(), you can write summarize(..., .by = class). It auto-ungroups the result.

3.3 Combining Verbs

The real power of dplyr is chaining multiple verbs together:

# Which manufacturer has the most fuel-efficient 6-cylinder cars?
mpg |>
  filter(cyl == 6) |>
  group_by(manufacturer) |>
  summarize(avg_hwy = mean(hwy), n = n()) |>
  filter(n >= 3) |>
  arrange(desc(avg_hwy))

3.4 across() for Column-wise Operations

When you need to apply the same transformation to multiple columns, across() eliminates repetitive code. It works inside mutate() and summarize().

# Compute the mean of all numeric columns by group
mpg |>
  group_by(drv) |>
  summarize(across(where(is.numeric), mean, .names = "avg_{.col}"))

# Round specific columns to 1 decimal place
mpg |>
  mutate(across(c(cty, hwy), \(x) round(x * 0.4251, 1),
         .names = "{.col}_kpl"))

# Apply multiple functions at once
mpg |>
  summarize(across(c(cty, hwy),
            list(mean = mean, sd = sd),
            .names = "{.col}_{.fn}"))
Tip: Selecting columns in across() You can use all the same helpers as select(): starts_with(), ends_with(), contains(), where(is.numeric), or simply a vector of column names.

3.5 case_when() for Complex Conditionals

case_when() is the tidyverse equivalent of a multi-branch if/else. It is far more readable than nested ifelse() calls.

# Categorize highway mpg into efficiency tiers
mpg |>
  mutate(
    efficiency = case_when(
      hwy >= 35 ~ "excellent",
      hwy >= 25 ~ "good",
      hwy >= 18 ~ "average",
      .default   = "poor"
    )
  ) |>
  count(efficiency)

# Create academic letter grades from numeric scores
students <- tibble(name = c("Ana","Ben","Cal"), score = c(92,78,65))
students |>
  mutate(grade = case_when(
    score >= 90 ~ "A",
    score >= 80 ~ "B",
    score >= 70 ~ "C",
    score >= 60 ~ "D",
    .default   = "F"
  ))

3.6 Counting and Slicing

Several dplyr helpers make quick counting and row selection painless.

# n() counts rows in the current group
# n_distinct() counts unique values
mpg |>
  group_by(manufacturer) |>
  summarize(
    total_models = n(),
    unique_models = n_distinct(model)
  )

# slice_max() / slice_min() — top/bottom rows by a variable
mpg |>
  slice_max(hwy, n = 5)        # top 5 by highway mpg

mpg |>
  group_by(class) |>
  slice_min(cty, n = 1)        # worst city mpg per class

# count() is a shortcut for group_by + summarize(n = n())
mpg |> count(class, sort = TRUE)

3.7 relocate() and rowwise()

# relocate() — move columns to a new position
mpg |>
  relocate(hwy, cty, .before = manufacturer)

mpg |>
  relocate(where(is.character), .after = last_col())

# rowwise() — operate across columns within each row
scores <- tibble(
  student = c("Ana", "Ben", "Cal"),
  exam1 = c(88, 75, 92),
  exam2 = c(91, 80, 85),
  exam3 = c(78, 88, 90)
)
scores |>
  rowwise() |>
  mutate(avg = mean(c(exam1, exam2, exam3))) |>
  ungroup()
rowwise() vs. across() Use across() when you want to apply the same function to each column independently (column-wise). Use rowwise() when you need to combine values across columns within a single row (row-wise). Always call ungroup() after rowwise() to return to normal behavior.

3.8 nest() / unnest() and Reading Multiple Files

Nested data frames store entire tibbles inside cells. Combined with purrr::map(), this pattern is powerful for batch operations.

library(tidyr)
library(purrr)

# Nest: create a list-column of data frames
nested <- mpg |>
  group_by(drv) |>
  nest()
nested    # 3 rows (f, 4, r), each with a 'data' list-column

# Apply a model to each nested group
nested |>
  mutate(model = map(data, \(df) lm(hwy ~ displ, data = df)))

# Unnest: flatten back to a regular tibble
nested |> unnest(data)
# Read multiple CSV files at once using purrr::map
# Suppose you have: data/sales_2020.csv, data/sales_2021.csv, etc.
files <- list.files("data/", pattern = "sales_.*\\.csv", full.names = TRUE)

all_sales <- files |>
  map(read_csv) |>
  list_rbind()    # stack all data frames into one

# Alternative: use purrr::map with an id column
all_sales <- files |>
  set_names() |>
  map(read_csv) |>
  list_rbind(names_to = "source_file")
Tip: janitor::clean_names() Real-world data often has messy column names with spaces, special characters, or inconsistent case. The janitor package fixes this in one step: df |> janitor::clean_names() converts all column names to consistent snake_case. Install it with install.packages("janitor").

3.9 Joining Data Frames

orders <- tibble(
  id = c(1, 2, 3),
  product = c("A", "B", "C"),
  amount = c(100, 200, 150)
)
customers <- tibble(
  id = c(1, 2, 4),
  name = c("Alice", "Bob", "Dana")
)

left_join(orders, customers, by = "id")   # keep all orders
inner_join(orders, customers, by = "id")  # only matching rows
full_join(orders, customers, by = "id")   # keep everything

3.10 tidyr: Reshaping Data

pivot_longer() — Wide to Long

wide <- tibble(
  country = c("US", "UK"),
  y2020 = c(331, 67),
  y2021 = c(333, 67)
)
wide |>
  pivot_longer(cols = starts_with("y"),
               names_to = "year",
               values_to = "population")

pivot_wider() — Long to Wide

long <- tibble(
  name = c("A","A","B","B"),
  metric = c("sales","profit","sales","profit"),
  value = c(100,20,150,35)
)
long |>
  pivot_wider(names_from = metric, values_from = value)

separate() and unite()

df <- tibble(date = c("2024-01-15", "2024-03-22"))

df |> separate(date, into = c("year","month","day"), sep = "-")

# unite() does the reverse
tibble(first = "John", last = "Doe") |>
  unite("full_name", first, last, sep = " ")

Exercises

Exercise 3.1

Using the mpg dataset, find the average city and highway mpg for each drv (drive type: f, r, 4). Which drive type is most fuel-efficient overall? Chain at least three dplyr verbs.

Exercise 3.2

Create a wide tibble with quarterly revenue (Q1 through Q4) for two companies. Pivot it to long format, then compute each company's total annual revenue using summarize().

Exercise 3.3

Using mpg, create a new column fuel_grade using case_when(): "excellent" if hwy >= 30, "good" if hwy >= 20, and "poor" otherwise. Then use count() to find how many cars fall into each grade. Finally, use across() inside summarize() to compute the mean and standard deviation of both cty and hwy for each fuel_grade.

Exercise 3.4

The storms dataset (part of dplyr) tracks tropical storms. For each named storm (name), find the maximum wind speed and the maximum category reached. Then use slice_max() to display the 10 strongest storms by maximum wind speed. How many of them reached category 5?

External Resources

Key Takeaways

← Chapter 2: Data Structures Chapter 4: Visualization →