Transform, reshape, and summarize data using dplyr and tidyr pipes.
Pipes chain operations left-to-right, making code readable. R now has a native pipe (|>, R 4.1+) alongside the magrittr pipe (%>%).
library(tidyverse) # Without pipes: nested and hard to read round(mean(c(3.1, 4.7, 2.9)), 1) # With pipe: reads left-to-right c(3.1, 4.7, 2.9) |> mean() |> round(1) # Both return 3.6
|> is slightly faster and requires no package. The magrittr %>% has extra features like . placeholder. For most work, either is fine. This guide uses |>.
We will use the mpg dataset (bundled with ggplot2) to demonstrate each verb.
# Cars with highway mpg > 30 and 4 cylinders mpg |> filter(hwy > 30, cyl == 4)
mpg |> select(manufacturer, model, year, hwy) # Helper functions: starts_with, ends_with, contains mpg |> select(starts_with("c")) # cty, cyl, class
mpg |> mutate( kpl_city = cty * 0.4251, # mpg to km/L efficient = hwy > 30 # logical flag )
mpg |> arrange(desc(hwy)) |> # highest hwy first head(5)
mpg |> group_by(class) |> summarize( n = n(), avg_hwy = mean(hwy), sd_hwy = sd(hwy) ) |> arrange(desc(avg_hwy))
group_by() |> summarize(), you can write summarize(..., .by = class). It auto-ungroups the result.
The real power of dplyr is chaining multiple verbs together:
# Which manufacturer has the most fuel-efficient 6-cylinder cars? mpg |> filter(cyl == 6) |> group_by(manufacturer) |> summarize(avg_hwy = mean(hwy), n = n()) |> filter(n >= 3) |> arrange(desc(avg_hwy))
When you need to apply the same transformation to multiple columns, across() eliminates repetitive code. It works inside mutate() and summarize().
# Compute the mean of all numeric columns by group mpg |> group_by(drv) |> summarize(across(where(is.numeric), mean, .names = "avg_{.col}")) # Round specific columns to 1 decimal place mpg |> mutate(across(c(cty, hwy), \(x) round(x * 0.4251, 1), .names = "{.col}_kpl")) # Apply multiple functions at once mpg |> summarize(across(c(cty, hwy), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))
select(): starts_with(), ends_with(), contains(), where(is.numeric), or simply a vector of column names.
case_when() is the tidyverse equivalent of a multi-branch if/else. It is far more readable than nested ifelse() calls.
# Categorize highway mpg into efficiency tiers mpg |> mutate( efficiency = case_when( hwy >= 35 ~ "excellent", hwy >= 25 ~ "good", hwy >= 18 ~ "average", .default = "poor" ) ) |> count(efficiency) # Create academic letter grades from numeric scores students <- tibble(name = c("Ana","Ben","Cal"), score = c(92,78,65)) students |> mutate(grade = case_when( score >= 90 ~ "A", score >= 80 ~ "B", score >= 70 ~ "C", score >= 60 ~ "D", .default = "F" ))
Several dplyr helpers make quick counting and row selection painless.
# n() counts rows in the current group # n_distinct() counts unique values mpg |> group_by(manufacturer) |> summarize( total_models = n(), unique_models = n_distinct(model) ) # slice_max() / slice_min() — top/bottom rows by a variable mpg |> slice_max(hwy, n = 5) # top 5 by highway mpg mpg |> group_by(class) |> slice_min(cty, n = 1) # worst city mpg per class # count() is a shortcut for group_by + summarize(n = n()) mpg |> count(class, sort = TRUE)
# relocate() — move columns to a new position mpg |> relocate(hwy, cty, .before = manufacturer) mpg |> relocate(where(is.character), .after = last_col()) # rowwise() — operate across columns within each row scores <- tibble( student = c("Ana", "Ben", "Cal"), exam1 = c(88, 75, 92), exam2 = c(91, 80, 85), exam3 = c(78, 88, 90) ) scores |> rowwise() |> mutate(avg = mean(c(exam1, exam2, exam3))) |> ungroup()
across() when you want to apply the same function to each column independently (column-wise). Use rowwise() when you need to combine values across columns within a single row (row-wise). Always call ungroup() after rowwise() to return to normal behavior.
Nested data frames store entire tibbles inside cells. Combined with purrr::map(), this pattern is powerful for batch operations.
library(tidyr) library(purrr) # Nest: create a list-column of data frames nested <- mpg |> group_by(drv) |> nest() nested # 3 rows (f, 4, r), each with a 'data' list-column # Apply a model to each nested group nested |> mutate(model = map(data, \(df) lm(hwy ~ displ, data = df))) # Unnest: flatten back to a regular tibble nested |> unnest(data)
# Read multiple CSV files at once using purrr::map # Suppose you have: data/sales_2020.csv, data/sales_2021.csv, etc. files <- list.files("data/", pattern = "sales_.*\\.csv", full.names = TRUE) all_sales <- files |> map(read_csv) |> list_rbind() # stack all data frames into one # Alternative: use purrr::map with an id column all_sales <- files |> set_names() |> map(read_csv) |> list_rbind(names_to = "source_file")
janitor package fixes this in one step: df |> janitor::clean_names() converts all column names to consistent snake_case. Install it with install.packages("janitor").
orders <- tibble( id = c(1, 2, 3), product = c("A", "B", "C"), amount = c(100, 200, 150) ) customers <- tibble( id = c(1, 2, 4), name = c("Alice", "Bob", "Dana") ) left_join(orders, customers, by = "id") # keep all orders inner_join(orders, customers, by = "id") # only matching rows full_join(orders, customers, by = "id") # keep everything
wide <- tibble( country = c("US", "UK"), y2020 = c(331, 67), y2021 = c(333, 67) ) wide |> pivot_longer(cols = starts_with("y"), names_to = "year", values_to = "population")
long <- tibble( name = c("A","A","B","B"), metric = c("sales","profit","sales","profit"), value = c(100,20,150,35) ) long |> pivot_wider(names_from = metric, values_from = value)
df <- tibble(date = c("2024-01-15", "2024-03-22")) df |> separate(date, into = c("year","month","day"), sep = "-") # unite() does the reverse tibble(first = "John", last = "Doe") |> unite("full_name", first, last, sep = " ")
Using the mpg dataset, find the average city and highway mpg for each drv (drive type: f, r, 4). Which drive type is most fuel-efficient overall? Chain at least three dplyr verbs.
Create a wide tibble with quarterly revenue (Q1 through Q4) for two companies. Pivot it to long format, then compute each company's total annual revenue using summarize().
Using mpg, create a new column fuel_grade using case_when(): "excellent" if hwy >= 30, "good" if hwy >= 20, and "poor" otherwise. Then use count() to find how many cars fall into each grade. Finally, use across() inside summarize() to compute the mean and standard deviation of both cty and hwy for each fuel_grade.
The storms dataset (part of dplyr) tracks tropical storms. For each named storm (name), find the maximum wind speed and the maximum category reached. Then use slice_max() to display the 10 strongest storms by maximum wind speed. How many of them reached category 5?
|>) chains verbs for readable, step-by-step transformations.filter, select, mutate, summarize, arrange.group_by() unlocks split-apply-combine for aggregation.pivot_longer and pivot_wider reshape between long and wide formats.left_join, inner_join, etc.) merge data frames on shared keys.