An end-to-end workflow tying together every chapter: import, clean, visualize, model, and render.
You will build a complete, reproducible statistical report using a publicly available dataset. The pipeline follows the same sequence a working analyst uses every day:
Each stage draws on tools from earlier chapters, beginning with import via readr::read_csv() (Ch 1, 3). The dataset is palmerpenguins::penguins, a clean dataset of 344 penguins from three species measured at Palmer Station, Antarctica. Install it with install.packages("palmerpenguins").
```r
# report.Rmd setup chunk
library(tidyverse)
library(palmerpenguins)
library(broom)
library(knitr)

# Load and inspect
data <- penguins
glimpse(data)
dim(data)  # 344 rows, 8 columns
```
```r
# Drop rows with missing values
df <- data |>
  drop_na() |>
  mutate(
    bill_ratio = bill_length_mm / bill_depth_mm,
    mass_kg = body_mass_g / 1000,
    species = fct_relevel(species, "Adelie", "Chinstrap", "Gentoo")
  )
cat("Clean sample:", nrow(df), "penguins\n")

# Summary by species
species_summary <- df |>
  group_by(species) |>
  summarize(
    n = n(),
    avg_mass = mean(mass_kg),
    avg_bill = mean(bill_length_mm),
    avg_flip = mean(flipper_length_mm)
  )
kable(species_summary, digits = 1, caption = "Summary by Species")
```
```r
# Scatterplot: bill length vs. depth, colored by species
p1 <- ggplot(df, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Bill Dimensions by Species",
    x = "Bill Length (mm)", y = "Bill Depth (mm)"
  ) +
  theme_minimal(base_size = 13)
p1

# Distribution of body mass by species
p2 <- ggplot(df, aes(x = mass_kg, fill = species)) +
  geom_histogram(binwidth = 0.25, alpha = 0.7, position = "identity") +
  facet_wrap(~ species, ncol = 1) +
  labs(x = "Body Mass (kg)", y = "Count") +
  theme_minimal(base_size = 13)
p2
```
```r
# Does body mass differ across species? (ANOVA)
aov_mass <- aov(mass_kg ~ species, data = df)
summary(aov_mass)

# Post-hoc pairwise comparisons
tukey <- TukeyHSD(aov_mass)
tukey

# Correlation between bill length and flipper length
cor.test(df$bill_length_mm, df$flipper_length_mm)
```
```r
# Model 1: body mass predicted by bill dimensions
fit1 <- lm(mass_kg ~ bill_length_mm + bill_depth_mm, data = df)

# Model 2: add species and flipper length
fit2 <- lm(mass_kg ~ bill_length_mm + bill_depth_mm + flipper_length_mm + species,
           data = df)

# Compare models
anova(fit1, fit2)

# Tidy coefficient table for the full model
tidy(fit2, conf.int = TRUE) |>
  kable(digits = 3, caption = "Regression Coefficients (Model 2)")

# Model fit statistics
glance(fit2) |>
  select(r.squared, adj.r.squared, AIC, BIC) |>
  kable(digits = 3)
```
```r
# Predict species (Gentoo vs. others) from body measurements
df <- df |> mutate(is_gentoo = ifelse(species == "Gentoo", 1, 0))

logit_fit <- glm(is_gentoo ~ bill_length_mm + bill_depth_mm + flipper_length_mm + mass_kg,
                 family = binomial(), data = df)

# Odds ratios
tidy(logit_fit, exponentiate = TRUE, conf.int = TRUE) |>
  kable(digits = 3, caption = "Logistic Regression: Odds Ratios")

# Classification accuracy
df$pred_gentoo <- ifelse(predict(logit_fit, type = "response") > 0.5, 1, 0)
accuracy <- mean(df$pred_gentoo == df$is_gentoo)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")
```
Combine all the code above into a single .Rmd or .qmd file with narrative sections between chunks. Render it to multiple formats:
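A minimal skeleton for the combined document might look like this (the title, section names, and chunk labels are illustrative):

````markdown
---
title: "Penguin Report"
output: html_document
---

## Data

```{r setup, message=FALSE}
library(tidyverse)
library(palmerpenguins)
df <- penguins |> drop_na()
```

We analyze `r nrow(df)` penguins from three species.

## Models

```{r models}
fit <- lm(body_mass_g ~ bill_length_mm + species, data = df)
summary(fit)
```
````

Each narrative paragraph sits between chunks, and inline code like `r nrow(df)` keeps the prose in sync with the data.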
```r
# From the R console
rmarkdown::render("penguin_report.Rmd", output_format = "html_document")
rmarkdown::render("penguin_report.Rmd", output_format = "pdf_document")
rmarkdown::render("penguin_report.Rmd", output_format = "word_document")

# Or with Quarto, from the terminal:
# quarto render penguin_report.qmd --to html,pdf
```
Organize the project with a consistent structure: data/ for raw files, R/ for helper functions, output/ for rendered reports, and the .Rmd at the root. Use an .Rproj file to set the working directory.
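As a sketch, this layout can be created from the project root in a few lines of base R (the "raw" and "figures" subfolders are a suggested convention, not a requirement):

```r
# Run from the project root (the folder containing the .Rproj file).
# Create the standard folders if they don't already exist.
for (d in c("data/raw", "R", "output/figures")) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Confirm the layout
dir.exists(c("data/raw", "R", "output/figures"))
```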
Hardcoded paths like "C:/Users/me/project/data/file.csv" break when the project moves to another machine. The here package constructs paths relative to the project root, making code portable.
```r
library(here)

# here() automatically finds the project root (.Rproj, .git, etc.)
here()  # prints the project root

# Build paths relative to the root
data_path   <- here("data", "raw", "penguins.csv")
output_path <- here("output", "report.html")

# Use in read/write operations
# df <- read_csv(here("data", "raw", "survey_responses.csv"))
# ggsave(here("output", "figures", "fig1.png"), p1)

# Works regardless of working directory or OS
# Never use setwd() - it creates fragile, non-portable code
```
setwd() makes your scripts dependent on a specific directory structure. Instead, use here::here() for all file paths. Combined with an .Rproj file, this ensures your project runs correctly on any machine.
Package versions matter. An analysis that works today may break next month when a package updates. The renv package creates project-specific libraries that lock package versions.
```r
# Initialize renv in your project
renv::init()
# This creates:
#   renv.lock  - snapshot of all package versions
#   renv/      - project-local library
#   .Rprofile  - auto-activates renv when you open the project

# Install packages as usual
install.packages("palmerpenguins")

# Snapshot: record current package versions to renv.lock
renv::snapshot()

# When a collaborator clones your project:
renv::restore()  # installs the exact same package versions from renv.lock

# Check status: which packages differ from the lockfile?
renv::status()
```
Use renv for any project that will be shared with collaborators, submitted as a replication package, or that needs to run months or years later. Commit the renv.lock file to version control, but add the renv/library/ directory to your .gitignore.
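A typical .gitignore for an renv project might therefore look like this (renv::init() also writes a renv/.gitignore for you; these entries are a common convention, not the only valid setup):

```
# .gitignore (project root)

# project-local package library (large, machine-specific)
renv/library/

# RStudio and session state
.Rproj.user/
.Rhistory
.RData
```

Note that renv.lock is deliberately absent: it stays under version control.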
For complex analyses with many interdependent steps, the targets package tracks which steps are up-to-date and only re-runs what has changed. This saves time and ensures correctness.
```r
# _targets.R (place at the project root)
library(targets)
library(tarchetypes)

tar_option_set(packages = c("tidyverse", "palmerpenguins", "broom"))

list(
  # Step 1: Load and clean data
  tar_target(raw_data, palmerpenguins::penguins),
  tar_target(clean_data, raw_data |> drop_na()),

  # Step 2: Fit models
  tar_target(model_1, lm(body_mass_g ~ bill_length_mm, data = clean_data)),
  tar_target(model_2, lm(body_mass_g ~ bill_length_mm + species, data = clean_data)),

  # Step 3: Summarize results
  tar_target(model_comparison, bind_rows(
    glance(model_1) |> mutate(model = "Bivariate"),
    glance(model_2) |> mutate(model = "With Species")
  )),

  # Step 4: Render report
  tar_render(report, "report.Rmd")
)

# Run from the console:
# targets::tar_make()                  # run the pipeline
# targets::tar_visnetwork()            # visualize the DAG
# targets::tar_read(model_comparison)  # load a result
```
targets tracks dependencies automatically. If you change the cleaning step, it re-runs everything downstream. If you only change the report template, it skips data cleaning and modeling entirely. This is especially valuable for analyses that take minutes or hours to run.
When your analysis is slow, profiling helps identify bottlenecks. The profvis package provides an interactive flame graph showing where R spends its time.
```r
library(profvis)

# Profile a block of code
profvis({
  # Simulate a typical analysis pipeline
  df <- palmerpenguins::penguins |> tidyr::drop_na()

  # Many models in a loop (common bottleneck)
  results <- lapply(1:100, function(i) {
    sample_df <- df[sample(nrow(df), nrow(df), replace = TRUE), ]
    fit <- lm(body_mass_g ~ bill_length_mm + species, data = sample_df)
    broom::tidy(fit)
  })
  dplyr::bind_rows(results)
})

# Tips for speed:
# 1. Use data.table or collapse for large data wrangling
# 2. Replace loops with vectorized operations or purrr::map()
# 3. Cache expensive computations with targets or knitr caching
# 4. Use furrr::future_map() for parallel processing
```
Shiny turns R analyses into interactive web applications. Users can filter data, change parameters, and see results update in real time.
```r
library(shiny)
library(ggplot2)

# Minimal Shiny app (save as app.R)
ui <- fluidPage(
  titlePanel("Penguin Explorer"),
  sidebarLayout(
    sidebarPanel(
      selectInput("species", "Species:",
                  choices = c("All", "Adelie", "Chinstrap", "Gentoo")),
      sliderInput("mass_range", "Body Mass (g):",
                  min = 2500, max = 6500, value = c(2500, 6500))
    ),
    mainPanel(
      plotOutput("scatter"),
      tableOutput("summary_table")
    )
  )
)

server <- function(input, output) {
  filtered_data <- reactive({
    df <- palmerpenguins::penguins |> tidyr::drop_na()
    if (input$species != "All") df <- df |> dplyr::filter(species == input$species)
    df |> dplyr::filter(body_mass_g >= input$mass_range[1],
                        body_mass_g <= input$mass_range[2])
  })

  output$scatter <- renderPlot({
    ggplot(filtered_data(), aes(bill_length_mm, bill_depth_mm, color = species)) +
      geom_point(size = 2) +
      theme_minimal()
  })

  output$summary_table <- renderTable({
    filtered_data() |>
      dplyr::summarize(N = dplyr::n(), Avg_Mass = mean(body_mass_g))
  })
}

shinyApp(ui, server)
# Run: click "Run App" in RStudio, or source("app.R")
```
Deploy to shinyapps.io with rsconnect::deployApp(). For institutional use, Posit Connect or Shiny Server provide more control. You can also embed small Shiny apps in Quarto documents with server: shiny in the YAML header.
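For reference, a shinyapps.io deployment from the console looks roughly like this (shown commented out because it requires an account; the account name, token, and app name are placeholders you replace with values from your shinyapps.io dashboard):

```r
# One-time: register your account credentials (placeholders shown)
# rsconnect::setAccountInfo(
#   name   = "your-account",
#   token  = "YOUR_TOKEN",
#   secret = "YOUR_SECRET"
# )

# Deploy the folder containing app.R
# rsconnect::deployApp(appDir = ".", appName = "penguin-explorer")
```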
GitHub Actions can automatically render your R Markdown or Quarto reports whenever you push changes, ensuring the output is always up to date.
```yaml
# .github/workflows/render-report.yml
name: Render Report

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  render:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: |
            rmarkdown
            palmerpenguins
      - name: Render Report
        run: Rscript -e 'rmarkdown::render("report.Rmd")'
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: report
          path: report.html

# For Quarto, replace the render step with:
# - uses: quarto-dev/quarto-actions/render@v2
```
Combined with renv for pinned packages and targets for pipeline management, this creates a fully reproducible, automated research workflow.
```r
# Residual diagnostics for the linear model
par(mfrow = c(2, 2))
plot(fit2)
par(mfrow = c(1, 1))

# Coefficient plot with ggplot2
tidy(fit2, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_point(size = 3, color = "#276DC3") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Coefficient Plot", x = "Estimate", y = NULL) +
  theme_minimal(base_size = 13)
```
Recreate this entire pipeline using the diamonds dataset (from ggplot2). Predict price from carat, cut, color, and clarity. Produce an R Markdown report with summary tables, at least two plots, a linear model, and a brief written interpretation of results.
Extend the penguin report by adding a tidymodels section: split the data, build a random forest to classify all three species (not just Gentoo), evaluate with 5-fold CV, and report accuracy. Include the results in your rendered document.
Set up a project using here and renv. Create the folder structure: data/, R/, output/. Write an R script in R/ that loads data using here("data", "raw", ...), runs a linear model, and saves a plot to output/. Initialize renv and snapshot. Verify that renv::status() shows no issues.
Create a minimal Shiny app that lets users select a variable from the mtcars dataset (via a dropdown) and displays: (1) a histogram of the selected variable, and (2) a summary statistics table below it. Add a checkbox that toggles whether to show the data grouped by cylinder count. Deploy it to shinyapps.io or run it locally.
- Use here::here() for portable paths; never hardcode absolute paths or call setwd().
- Use renv to lock package versions for reproducibility across machines and over time.
- The targets package automates pipelines, tracking dependencies and skipping unchanged steps.
- Use profvis to identify and fix performance bottlenecks.