Chapter 10: Project — Statistical Report Pipeline

An end-to-end workflow tying together every chapter: import, clean, visualize, model, and render.

10.1 Project Overview

You will build a complete, reproducible statistical report using a publicly available dataset. The pipeline follows the same sequence a working analyst uses every day:

  1. Import data with readr::read_csv() (Ch 1, 3)
  2. Inspect structure and types (Ch 2)
  3. Clean and transform with dplyr/tidyr (Ch 3)
  4. Visualize patterns with ggplot2 (Ch 4)
  5. Test hypotheses (Ch 5)
  6. Model with lm/glm (Ch 6, 7)
  7. Render a polished report with R Markdown (Ch 9)
Dataset: Palmer Penguins
We use palmerpenguins::penguins, a dataset of 344 penguins from three species measured at Palmer Station, Antarctica. The data are well curated, but a handful of rows contain missing values, which we drop during cleaning. Install with install.packages("palmerpenguins").

10.2 Step 1: Setup and Import

# report.Rmd setup chunk
library(tidyverse)
library(palmerpenguins)
library(broom)
library(knitr)

# Load and inspect
data <- penguins
glimpse(data)
dim(data)  # 344 rows, 8 columns

10.3 Step 2: Clean and Transform

# Drop rows with missing values
df <- data |>
  drop_na() |>
  mutate(
    bill_ratio = bill_length_mm / bill_depth_mm,
    mass_kg    = body_mass_g / 1000,
    species    = fct_relevel(species, "Adelie", "Chinstrap", "Gentoo")
  )

cat("Clean sample:", nrow(df), "penguins\n")

# Summary by species
species_summary <- df |>
  group_by(species) |>
  summarize(
    n         = n(),
    avg_mass  = mean(mass_kg),
    avg_bill  = mean(bill_length_mm),
    avg_flip  = mean(flipper_length_mm)
  )

kable(species_summary, digits = 1, caption = "Summary by Species")

10.4 Step 3: Exploratory Visualization

# Scatterplot: bill length vs. depth, colored by species
p1 <- ggplot(df, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Bill Dimensions by Species",
       x = "Bill Length (mm)", y = "Bill Depth (mm)") +
  theme_minimal(base_size = 13)
p1

# Distribution of body mass by species
p2 <- ggplot(df, aes(x = mass_kg, fill = species)) +
  geom_histogram(binwidth = 0.25, alpha = 0.7, position = "identity") +
  facet_wrap(~ species, ncol = 1) +
  labs(x = "Body Mass (kg)", y = "Count") +
  theme_minimal(base_size = 13)
p2

10.5 Step 4: Hypothesis Testing

# Does body mass differ across species? (ANOVA)
aov_mass <- aov(mass_kg ~ species, data = df)
summary(aov_mass)

# Post-hoc pairwise comparisons
tukey <- TukeyHSD(aov_mass)
tukey

# Correlation between bill length and flipper length
cor.test(df$bill_length_mm, df$flipper_length_mm)

10.6 Step 5: Linear Modeling

# Model 1: body mass predicted by bill dimensions
fit1 <- lm(mass_kg ~ bill_length_mm + bill_depth_mm, data = df)

# Model 2: add species and flipper length
fit2 <- lm(mass_kg ~ bill_length_mm + bill_depth_mm +
             flipper_length_mm + species, data = df)

# Compare models
anova(fit1, fit2)

# Tidy coefficient table for the full model
tidy(fit2, conf.int = TRUE) |>
  kable(digits = 3, caption = "Regression Coefficients (Model 2)")

# Model fit statistics
glance(fit2) |>
  select(r.squared, adj.r.squared, AIC, BIC) |>
  kable(digits = 3)

10.7 Step 6: Logistic Regression Extension

# Predict species (Gentoo vs. others) from body measurements
df <- df |>
  mutate(is_gentoo = ifelse(species == "Gentoo", 1, 0))

# Note: Gentoo penguins are almost perfectly separable from the other
# species on these measurements, so glm() may warn that fitted
# probabilities are numerically 0 or 1; interpret coefficients with care
logit_fit <- glm(is_gentoo ~ bill_length_mm + bill_depth_mm +
                   flipper_length_mm + mass_kg,
                 family = binomial(), data = df)

# Odds ratios
tidy(logit_fit, exponentiate = TRUE, conf.int = TRUE) |>
  kable(digits = 3, caption = "Logistic Regression: Odds Ratios")

# Classification accuracy
df$pred_gentoo <- ifelse(predict(logit_fit, type = "response") > 0.5, 1, 0)
accuracy <- mean(df$pred_gentoo == df$is_gentoo)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")

10.8 Step 7: Render the Report

Combine all the code above into a single .Rmd or .qmd file with narrative sections between chunks. Render it to multiple formats:

# From the terminal or R console
rmarkdown::render("penguin_report.Rmd", output_format = "html_document")
rmarkdown::render("penguin_report.Rmd", output_format = "pdf_document")
rmarkdown::render("penguin_report.Rmd", output_format = "word_document")

# Or with Quarto
# quarto render penguin_report.qmd --to html,pdf
Project structure best practice
Organize your project folder with subfolders: data/ for raw files, R/ for helper functions, output/ for rendered reports, and the .Rmd at the root. Use an .Rproj file to set the working directory.
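For reference, a minimal skeleton for penguin_report.Rmd might look like the sketch below; the section title and chunk names are illustrative, and the chunks would hold the code from the steps above.

````markdown
---
title: "Palmer Penguins: A Statistical Report"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(palmerpenguins)
library(broom)
library(knitr)
```

## Data and Cleaning

Narrative text explaining the data source and cleaning decisions.

```{r clean}
df <- penguins |> drop_na()
```
````

Each subsequent step (visualization, testing, modeling) gets its own chunk with narrative text between chunks, as described above.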

10.9 Portable Paths with here::here()

Hardcoded paths like "C:/Users/me/project/data/file.csv" break when the project moves to another machine. The here package constructs paths relative to the project root, making code portable.

library(here)

# here() automatically finds the project root (.Rproj, .git, etc.)
here()  # prints the project root

# Build paths relative to the root
data_path   <- here("data", "raw", "penguins.csv")
output_path <- here("output", "report.html")

# Use in read/write operations
# df <- read_csv(here("data", "raw", "survey_responses.csv"))
# ggsave(here("output", "figures", "fig1.png"), p1)

# Works regardless of working directory or OS
# Never use setwd() - it creates fragile, non-portable code
Avoid setwd()
Calling setwd() makes your scripts dependent on a specific directory structure. Instead, use here::here() for all file paths. Combined with an .Rproj file, this ensures your project runs correctly on any machine.

10.10 Reproducible Environments with renv

Package versions matter. An analysis that works today may break next month when a package updates. The renv package creates project-specific libraries that lock package versions.

# Initialize renv in your project
renv::init()

# This creates:
# renv.lock     - snapshot of all package versions
# renv/         - project-local library
# .Rprofile     - auto-activates renv when you open the project

# Install packages as usual
install.packages("palmerpenguins")

# Snapshot: record current package versions to renv.lock
renv::snapshot()

# When a collaborator clones your project:
renv::restore()  # installs exact same package versions from renv.lock

# Check status: which packages differ from the lockfile?
renv::status()
When to use renv
Use renv for any project that will be shared with collaborators, submitted as a replication package, or that needs to run months or years later. The renv.lock file should be committed to version control, but the renv/library/ directory should be in your .gitignore.
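Following the tip above, a minimal .gitignore for an renv project might contain the entries below; the lines beyond renv/library/ are common R session artifacts included as suggestions, not requirements.

```gitignore
# renv project-local library (collaborators rebuild it via renv::restore())
renv/library/

# common R session artifacts
.Rhistory
.RData
.Rproj.user/
```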

10.11 Pipeline Automation with targets

For complex analyses with many interdependent steps, the targets package tracks which steps are up-to-date and only re-runs what has changed. This saves time and ensures correctness.

# _targets.R (place at project root)
library(targets)
library(tarchetypes)

tar_option_set(packages = c("tidyverse", "palmerpenguins", "broom"))

list(
  # Step 1: Load and clean data
  tar_target(raw_data, palmerpenguins::penguins),
  tar_target(clean_data, raw_data |> drop_na()),

  # Step 2: Fit models
  tar_target(model_1, lm(body_mass_g ~ bill_length_mm, data = clean_data)),
  tar_target(model_2, lm(body_mass_g ~ bill_length_mm + species, data = clean_data)),

  # Step 3: Summarize results
  tar_target(model_comparison, bind_rows(
    glance(model_1) |> mutate(model = "Bivariate"),
    glance(model_2) |> mutate(model = "With Species")
  )),

  # Step 4: Render report
  tar_render(report, "report.Rmd")
)

# Run from the console:
# targets::tar_make()       # run the pipeline
# targets::tar_visnetwork() # visualize the DAG
# targets::tar_read(model_comparison)  # load a result
targets vs. Make vs. scripts
Unlike running scripts sequentially, targets tracks dependencies automatically. If you change the cleaning step, it re-runs everything downstream. If you only change the report template, it skips data cleaning and modeling entirely. This is especially valuable for analyses that take minutes or hours to run.

10.12 Profiling with profvis

When your analysis is slow, profiling helps identify bottlenecks. The profvis package provides an interactive flame graph showing where R spends its time.

library(profvis)

# Profile a block of code
profvis({
  # Simulate a typical analysis pipeline
  df <- palmerpenguins::penguins |> tidyr::drop_na()

  # Many models in a loop (common bottleneck)
  results <- lapply(1:100, function(i) {
    sample_df <- df[sample(nrow(df), nrow(df), replace = TRUE), ]
    fit <- lm(body_mass_g ~ bill_length_mm + species, data = sample_df)
    broom::tidy(fit)
  })

  bind_rows(results)
})

# Tips for speed:
# 1. Use data.table or collapse for large data wrangling
# 2. Replace loops with vectorized operations or purrr::map()
# 3. Cache expensive computations with targets or knitr caching
# 4. Use furrr::future_map() for parallel processing
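Tip 4 can be sketched as follows. furrr mirrors purrr's interface, so future_map() is a drop-in replacement for map()/lapply(); the worker count here is illustrative.

```r
library(future)         # provides plan() for choosing a parallel backend
library(furrr)          # parallel drop-in replacements for purrr verbs
library(palmerpenguins)

plan(multisession, workers = 4)  # start 4 background R sessions

df <- tidyr::drop_na(penguins)

# Same bootstrap loop as above, now spread across workers;
# furrr_options(seed = TRUE) gives reproducible parallel RNG
boot_fits <- future_map(1:100, function(i) {
  sample_df <- df[sample(nrow(df), nrow(df), replace = TRUE), ]
  broom::tidy(lm(body_mass_g ~ bill_length_mm + species, data = sample_df))
}, .options = furrr_options(seed = TRUE))

plan(sequential)  # shut the workers down when finished
```

For a job this small the parallel overhead may outweigh the gain; the payoff comes with longer-running model fits.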

10.13 Interactive Dashboards with Shiny

Shiny turns R analyses into interactive web applications. Users can filter data, change parameters, and see results update in real time.

library(shiny)
library(ggplot2)  # needed for the scatterplot below

# Minimal Shiny app (save as app.R)
ui <- fluidPage(
  titlePanel("Penguin Explorer"),
  sidebarLayout(
    sidebarPanel(
      selectInput("species", "Species:",
                  choices = c("All", "Adelie", "Chinstrap", "Gentoo")),
      sliderInput("mass_range", "Body Mass (g):",
                  min = 2500, max = 6500,
                  value = c(2500, 6500))
    ),
    mainPanel(
      plotOutput("scatter"),
      tableOutput("summary_table")
    )
  )
)

server <- function(input, output) {
  filtered_data <- reactive({
    df <- palmerpenguins::penguins |> tidyr::drop_na()
    if (input$species != "All") df <- df |> dplyr::filter(species == input$species)
    df |> dplyr::filter(body_mass_g >= input$mass_range[1],
                        body_mass_g <= input$mass_range[2])
  })

  output$scatter <- renderPlot({
    ggplot(filtered_data(), aes(bill_length_mm, bill_depth_mm, color = species)) +
      geom_point(size = 2) + theme_minimal()
  })

  output$summary_table <- renderTable({
    filtered_data() |> dplyr::summarize(N = dplyr::n(), Avg_Mass = mean(body_mass_g))
  })
}

shinyApp(ui, server)

# Run: click "Run App" in RStudio, or source("app.R")
Deploying Shiny apps
Share your app by deploying to shinyapps.io (free tier available): rsconnect::deployApp(). For institutional use, Posit Connect or Shiny Server provide more control. You can also embed small Shiny apps in Quarto documents with server: shiny in the YAML header.
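To illustrate the last point, a minimal Quarto document with embedded Shiny might look like this sketch; the input names and plot are illustrative. UI elements go in ordinary chunks, while reactive code is marked with #| context: server.

````markdown
---
title: "Penguin Mass Explorer"
format: html
server: shiny
---

```{r}
# UI: rendered in the page
sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
plotOutput("hist")
```

```{r}
#| context: server
# Server: runs in the Shiny process
output$hist <- renderPlot({
  hist(tidyr::drop_na(palmerpenguins::penguins)$body_mass_g,
       breaks = input$bins, main = NULL, xlab = "Body Mass (g)")
})
```
````

Unlike a static render, this document requires a running Shiny server (e.g. quarto preview locally, or a deployment target that supports Shiny).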

10.14 GitHub Actions for Automated Rendering

GitHub Actions can automatically render your R Markdown or Quarto reports whenever you push changes, ensuring the output is always up to date.

# .github/workflows/render-report.yml
name: Render Report

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: r-lib/actions/setup-r@v2

      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: |
            rmarkdown
            palmerpenguins

      - name: Render Report
        run: Rscript -e 'rmarkdown::render("report.Rmd")'

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: report
          path: report.html

# For Quarto, replace the render step with:
# - uses: quarto-dev/quarto-actions/render@v2
CI/CD for research
Automated rendering catches errors early: if a package update breaks your code, the GitHub Action fails and alerts you. Combined with renv for pinned packages and targets for pipeline management, this creates a fully reproducible and automated research workflow.
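Tying this tip to the workflow above: if the project uses renv, the dependency step can restore the pinned versions from renv.lock instead of installing the latest releases. A sketch, assuming the r-lib/actions setup-renv action:

```yaml
      # Replace the setup-r-dependencies step with an renv restore,
      # so CI uses exactly the package versions in renv.lock
      - uses: r-lib/actions/setup-renv@v2
```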

10.15 Diagnostic Checks

# Residual diagnostics for the linear model
par(mfrow = c(2, 2))
plot(fit2)
par(mfrow = c(1, 1))

# Coefficient plot with ggplot2
tidy(fit2, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_point(size = 3, color = "#276DC3") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Coefficient Plot", x = "Estimate", y = NULL) +
  theme_minimal(base_size = 13)

Exercises

Exercise 10.1

Recreate this entire pipeline using the diamonds dataset (from ggplot2). Predict price from carat, cut, color, and clarity. Produce an R Markdown report with summary tables, at least two plots, a linear model, and a brief written interpretation of results.

Exercise 10.2

Extend the penguin report by adding a tidymodels section: split the data, build a random forest to classify all three species (not just Gentoo), evaluate with 5-fold CV, and report accuracy. Include the results in your rendered document.

Exercise 10.3

Set up a project using here and renv. Create the folder structure: data/, R/, output/. Write an R script in R/ that loads data using here("data", "raw", ...), runs a linear model, and saves a plot to output/. Initialize renv and snapshot. Verify that renv::status() shows no issues.

Exercise 10.4

Create a minimal Shiny app that lets users select a variable from the mtcars dataset (via a dropdown) and displays: (1) a histogram of the selected variable, and (2) a summary statistics table below it. Add a checkbox that toggles whether to show the data grouped by cylinder count. Deploy it to shinyapps.io or run it locally.
