Chapter 4: Data Visualization with ggplot2

Build layered, publication-quality graphics using the grammar of graphics.

4.1 The Grammar of Graphics

Every ggplot2 plot follows a consistent template: data + aesthetic mappings + geometric objects. Additional layers control scales, facets, coordinate systems, and themes.

library(ggplot2)

# Template
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>()
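
As a rough illustration of the template, the sketch below fills in the three slots with the built-in mpg data and then uses layer_data() to look at what the point layer receives. The object names p_template and pts are placeholders chosen for this example.

```r
library(ggplot2)

# Fill in the template: data = mpg, mappings = displ/hwy, geom = points
p_template <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Mappings set in ggplot() are inherited by every layer.
# layer_data() returns the data frame the first layer draws from.
pts <- layer_data(p_template, 1)
head(pts[, c("x", "y")])
```

Each row of mpg becomes one row of layer data, with displ and hwy renamed to the aesthetics x and y.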

4.2 Scatterplot: geom_point()

# mpg dataset: engine displacement vs. highway mpg
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Map color and shape to categorical variables
ggplot(mpg, aes(x = displ, y = hwy, color = drv, shape = drv)) +
  geom_point(size = 2.5, alpha = 0.7)

# Add a smoothed trend line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE)

4.3 Line Chart: geom_line()

# Time series with economics dataset
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "#276DC3", linewidth = 0.8) +
  labs(title = "US Unemployment Over Time",
       x = "Date", y = "Unemployed (thousands)")

4.4 Bar Chart: geom_bar() and geom_col()

# geom_bar() counts observations
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "#276DC3")

# geom_col() plots precomputed values
avg_mpg <- mpg |>
  dplyr::group_by(class) |>
  dplyr::summarize(avg_hwy = mean(hwy))

ggplot(avg_mpg, aes(x = reorder(class, avg_hwy), y = avg_hwy)) +
  geom_col(fill = "#276DC3") +
  coord_flip() +
  labs(x = "Vehicle Class", y = "Mean Highway MPG")
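
If you want proportions rather than raw counts, one option is to let geom_bar() rescale its own computed statistic. The sketch below assumes ggplot2 3.3.0 or later, where after_stat() is available. Setting group = 1 asks the stat to compute proportions across all classes rather than within each bar.

```r
library(ggplot2)

# Bar heights are each class's share of all vehicles instead of a count
p_prop <- ggplot(mpg, aes(x = class, y = after_stat(prop), group = 1)) +
  geom_bar(fill = "#276DC3") +
  labs(x = "Vehicle Class", y = "Share of vehicles")

p_prop
```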

4.5 Histogram and Density: geom_histogram(), geom_density()

# Histogram
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "#276DC3", color = "white")

# Density overlay by group
ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.4)
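
Histograms and density curves can share one panel if the histogram is drawn on the density scale. This is a minimal sketch, again assuming after_stat() is available (ggplot2 3.3.0 or later). The object name p_hist is a placeholder.

```r
library(ggplot2)

# Rescale bar heights to densities so both layers use the same y-axis
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 fill = "#276DC3", color = "white") +
  geom_density(linewidth = 0.8)

p_hist
```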

4.6 Faceting

Facets split a plot into panels by one or two categorical variables.

# facet_wrap — one variable
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv, nrow = 1)

# facet_grid — two variables
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
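
By default every panel shares the same axes. As a sketch of how that can be relaxed, the example below uses the scales argument of facet_wrap() so that each panel gets its own y-axis range. Whether free scales help or mislead depends on the comparison you want readers to make.

```r
library(ggplot2)

# One panel per vehicle class, each with an independent y-axis
p_facets <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, ncol = 4, scales = "free_y")

p_facets
```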

4.7 Themes and Labels

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 2)

p +
  labs(
    title    = "Engine Size vs. Fuel Efficiency",
    subtitle = "Data from EPA, 1999-2008",
    x = "Displacement (L)",
    y = "Highway MPG",
    color = "Vehicle Class",
    caption = "Source: mpg dataset"
  ) +
  theme_minimal(base_size = 13)

Tip: Built-in themes. Try theme_classic(), theme_bw(), theme_light(), or install the ggthemes package for more options such as theme_economist().

4.8 stat_smooth(): Regression Lines on Plots

geom_smooth() and stat_smooth() are effectively interchangeable: both add fitted regression lines or loess curves to scatterplots. Controlling the method, formula, and confidence interval gives you publication-ready overlays.

# Linear regression line with 95% confidence band
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", formula = y ~ x, color = "red")

# Polynomial fit
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), color = "blue")

# Separate regression lines by group
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", se = FALSE)
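
The same function also covers non-linear smoothers and other confidence levels. The sketch below first draws a loess curve with a 99% band (the span value is an arbitrary choice for illustration), then checks that the "lm" smoother reproduces an ordinary lm() fit by comparing the layer's computed values with predict().

```r
library(ggplot2)

# Loess curve with a 99% confidence band
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "loess", formula = y ~ x, span = 0.9, level = 0.99)

# Compare the "lm" smoother with a model fitted by hand
p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  stat_smooth(method = "lm", formula = y ~ x)
smooth_df <- layer_data(p_lm, 1)   # grid of x values and fitted y values

fit <- lm(hwy ~ displ, data = mpg)
expected <- predict(fit, newdata = data.frame(displ = smooth_df$x))
all.equal(unname(expected), smooth_df$y)
```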

4.9 Controlling Scales

Scale functions control how data values map to visual properties. The naming pattern is scale_{aesthetic}_{type}().

# Continuous axis: custom breaks and labels
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_continuous(
    breaks = seq(1, 7, by = 1),
    limits = c(1, 7)
  ) +
  scale_y_continuous(
    breaks = seq(10, 50, by = 10),
    labels = function(x) paste0(x, " mpg")
  )

# Discrete scale: reorder (limits) and relabel (labels)
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  scale_x_discrete(
    limits = c("f", "r", "4"),
    labels = c("4" = "4WD", "f" = "Front", "r" = "Rear")
  )

# Log scale (useful for skewed data like income or counts)
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10()
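
Instead of writing a labelling function by hand, you can often borrow one from the scales package, which is installed alongside ggplot2. The sketch below formats the price axis as dollars. It assumes a scales version that provides label_dollar() (1.1.0 or later).

```r
library(ggplot2)

# Log-log plot with currency labels on the y-axis
p_price <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10(labels = scales::label_dollar())

p_price

# Label helpers return ordinary functions, so they can be tried on their own
dollar_labels <- scales::label_dollar()(c(500, 1000, 15000))
dollar_labels
```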

4.10 Box Plots and Jitter

Box plots show the distribution of a continuous variable by groups. Adding jittered points reveals the raw data underneath.

# Boxplot with jittered raw points (outliers hidden so they are not drawn twice)
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "#dbeafe", outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4, color = "#276DC3") +
  coord_flip() +
  labs(title = "Highway MPG by Vehicle Class",
       x = NULL, y = "Highway MPG") +
  theme_minimal()

Tip: coord_flip() for readability. When category labels are long, flip coordinates so bars or boxes run horizontally. In ggplot2 3.3.0 and later, you can also place the categorical variable on y directly instead of using coord_flip().
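
A minimal sketch of the alternative to coord_flip(), assuming ggplot2 3.3.0 or later: the categorical variable goes on y, and the jitter is applied along the category axis with height rather than width.

```r
library(ggplot2)

# Horizontal boxplots without coord_flip(): class is mapped to y directly
p_horiz <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot(fill = "#dbeafe", outlier.shape = NA) +
  geom_jitter(height = 0.2, width = 0, alpha = 0.4, color = "#276DC3") +
  labs(title = "Highway MPG by Vehicle Class",
       x = "Highway MPG", y = NULL) +
  theme_minimal()

p_horiz
```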

4.11 Heatmaps with geom_tile()

Heatmaps are excellent for visualizing correlation matrices, cross-tabulations, or any grid of values.

# Correlation heatmap
library(dplyr)
library(tidyr)

# Compute the correlation matrix once, then reshape it to long format
cor_data <- mtcars |>
  select(mpg, hp, wt, qsec, disp) |>
  cor() |>
  as.data.frame() |>
  tibble::rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "r")

ggplot(cor_data, aes(x = var1, y = var2, fill = r)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#d73027", mid = "white",
                       high = "#276DC3", midpoint = 0) +
  geom_text(aes(label = round(r, 2)), size = 3.5) +
  labs(title = "Correlation Matrix", x = NULL, y = NULL) +
  theme_minimal()

4.12 Annotations

Use annotate() to add text, rectangles, arrows, or other elements that are not mapped to data.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  annotate("text", x = 6, y = 40,
           label = "Outlier region", color = "red", size = 4) +
  annotate("rect", xmin = 5, xmax = 7, ymin = 35, ymax = 45,
           alpha = 0.1, fill = "red") +
  annotate("segment", x = 5.5, xend = 5, y = 44, yend = 44,
           arrow = arrow(length = unit(0.2, "cm")),
           color = "red")
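
Annotation coordinates do not have to be typed in by hand. As a sketch, the example below looks up the car with the highest highway mileage and places a marker and label at its position. The name top is a placeholder, and which.max() returns only the first row when several cars tie.

```r
library(ggplot2)

# Locate the most fuel-efficient car (first row if there are ties)
top <- mpg[which.max(mpg$hwy), ]

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  annotate("point", x = top$displ, y = top$hwy, color = "red", size = 3) +
  annotate("text", x = top$displ + 0.2, y = top$hwy,
           label = paste("Highest hwy:", top$model),
           hjust = 0, color = "red", size = 4)
```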

4.13 Combining Plots with patchwork

The patchwork package makes it simple to arrange multiple ggplot objects into a single figure, and for most layouts it replaces gridExtra::grid.arrange() or cowplot.

# install.packages("patchwork")
library(patchwork)

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() + ggtitle("Scatter")

p2 <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "#276DC3") + ggtitle("Histogram")

p3 <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "#276DC3") + ggtitle("Bar")

# Side by side
p1 | p2

# Stacked
p1 / p2

# Complex layout with annotation
(p1 | p2) / p3 +
  plot_annotation(
    title = "MPG Dataset Overview",
    tag_levels = "A"   # auto-tags panels A, B, C
  )

patchwork operators: | places plots side by side, / stacks them vertically, and + adds them to a layout grid. Use plot_layout(widths = c(2, 1)) to control relative sizes. plot_annotation() adds titles, subtitles, and panel tags.
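
To make plot_layout() concrete, here is a small sketch with two panels that share a color legend. The widths give the scatterplot twice the horizontal space of the boxplot, and guides = "collect" merges the duplicate legends into one. The object names are placeholders.

```r
library(ggplot2)
library(patchwork)

p_scatter <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

p_box <- ggplot(mpg, aes(x = drv, y = hwy, color = drv)) +
  geom_boxplot()

# 2:1 width ratio, with the shared drv legend drawn once
combined <- p_scatter + p_box +
  plot_layout(widths = c(2, 1), guides = "collect")

combined
```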

4.14 Color-Blind Friendly Palettes (viridis)

Approximately 8% of men have some form of color vision deficiency. Using a viridis scale helps keep your plots interpretable for all readers.

# Continuous viridis scale
ggplot(diamonds |> dplyr::sample_n(2000),
       aes(x = carat, y = price, color = depth)) +
  geom_point(alpha = 0.6) +
  scale_color_viridis_c() +
  theme_minimal()

# Discrete viridis scale for categorical data
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(size = 2) +
  scale_color_viridis_d(option = "D") +  # options: A-H
  theme_minimal()

# Other color-blind safe palettes
# scale_color_brewer(palette = "Set2")      # ColorBrewer
# scale_color_manual(values = c(...))        # fully custom

Avoid red-green only palettes. The most common color vision deficiencies are red-green ones (deuteranomaly and deuteranopia among them). The default ggplot2 palette spaces hues evenly around the color wheel, so some pairs can be hard to distinguish. Consider using viridis, ColorBrewer qualitative palettes, or adding shape or linetype as a redundant encoding.
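
A sketch of redundant encoding with a custom palette: drv is mapped to both color and shape, and the colors come from the Okabe-Ito palette that ships with base R (this assumes R 4.0 or later, where palette.colors() is available). The names are removed from the color vector because scale_color_manual() would otherwise try to match them against the drv levels.

```r
library(ggplot2)

# Okabe-Ito colors from base R; drop the leading black and strip the names
# so the colors are assigned to the drv levels in order
okabe_ito <- unname(palette.colors(4, palette = "Okabe-Ito"))[-1]

ggplot(mpg, aes(x = displ, y = hwy, color = drv, shape = drv)) +
  geom_point(size = 2) +
  scale_color_manual(values = okabe_ito) +
  theme_minimal()
```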

4.15 Interactive Plots with plotly

The plotly package can convert most ggplot objects into an interactive, zoomable HTML widget with a single function call. This is invaluable for exploratory analysis and presentations.

# install.packages("plotly")
library(plotly)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class,
                     text = paste("Model:", model))) +
  geom_point(size = 2) +
  theme_minimal()

# Convert to interactive — hover to see tooltips
ggplotly(p, tooltip = c("text", "x", "y"))

Tip: Custom tooltips. Map a text aesthetic in aes() with paste() to build informative tooltips, then pass tooltip = "text" to ggplotly(). This is especially useful when each point represents a specific entity (a company, a patient, a country).
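
As a sketch of the tip above, the example below builds a multi-line tooltip and shows only that text on hover. Lines are separated with the HTML tag <br>, since plotly renders tooltip text as HTML. The object names p_tip and widget are placeholders.

```r
library(ggplot2)
library(plotly)

# Build one tooltip string per point; <br> starts a new line in the tooltip
p_tip <- ggplot(mpg, aes(x = displ, y = hwy, color = class,
                         text = paste0("Model: ", model,
                                       "<br>Year: ", year,
                                       "<br>Highway MPG: ", hwy))) +
  geom_point(size = 2) +
  theme_minimal()

# Show only the custom text on hover; print `widget` to view it
widget <- ggplotly(p_tip, tooltip = "text")
```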

4.16 Saving Plots with ggsave()

# Save the last plot
ggsave("my_plot.png", width = 8, height = 5, dpi = 300)

# Save a specific plot object
ggsave("my_plot.pdf", plot = p, width = 8, height = 5)
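
ggsave() defaults to inches, but journals often specify sizes in centimetres. The sketch below sets units explicitly and forces a white background, which matters for themes such as theme_minimal() whose background can otherwise come out transparent in PNG output. It assumes a ggplot2 version where ggsave() accepts bg (3.3.4 or later, as far as I know), and the file name is hypothetical.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_minimal()

# Hypothetical output path in the session's temporary directory
out_file <- file.path(tempdir(), "mpg_scatter.png")

ggsave(out_file, plot = p_save,
       width = 16, height = 10, units = "cm",
       dpi = 300, bg = "white")

file.exists(out_file)
```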

Exercises

Exercise 4.1

Using the diamonds dataset, create a scatterplot of carat (x) vs. price (y) colored by cut. Add geom_smooth(method = "lm"), apply theme_minimal(), and save it as a PNG at 300 DPI.

Exercise 4.2

Build a bar chart showing the count of vehicles per manufacturer in the mpg dataset. Sort bars from most to least using fct_infreq() from the forcats package. Use coord_flip() for readability.

Exercise 4.3

Create a correlation heatmap using geom_tile() for the numeric columns of the mtcars dataset. Use scale_fill_gradient2() with a diverging palette centered at zero. Add correlation coefficient labels with geom_text(). Which pair of variables has the strongest positive correlation? The strongest negative?

Exercise 4.4

Using the patchwork package, create a three-panel figure from mpg: (A) a boxplot of hwy by drv with jittered points, (B) a density plot of hwy colored by drv, and (C) a scatterplot of displ vs. hwy with regression lines by drv. Arrange them as two on top, one on bottom, with automatic panel tags and a shared title. Use a color-blind friendly palette throughout. Save the result as a PDF at 10 x 7 inches.
