Chapter 8: Machine Learning with tidymodels

Recipes, workflows, cross-validation, tuning, and evaluation in a tidy framework.

8.1 The tidymodels Ecosystem

tidymodels provides a unified, tidy interface for the full ML pipeline. Its core packages are:

Package      Role
rsample      Train/test splits, cross-validation
recipes      Feature engineering and preprocessing
parsnip      Unified model specification (engine-agnostic)
workflows    Bundle recipe + model together
tune         Hyperparameter tuning
yardstick    Evaluation metrics
library(tidymodels)
tidymodels_prefer()  # resolve naming conflicts

8.2 Data Splitting with rsample

data(ames, package = "modeldata")

# 75/25 train-test split, stratified by Sale_Price
set.seed(2024)
split <- initial_split(ames, prop = 0.75, strata = Sale_Price)
train <- training(split)
test  <- testing(split)

nrow(train)  # ~2198
nrow(test)   # ~732

8.3 Feature Engineering with recipes

A recipe defines preprocessing steps before modeling. Steps are declared but not executed until the workflow runs.

ames_rec <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built +
                     Bldg_Type + Neighborhood,
                   data = train) |>
  step_log(Sale_Price, base = 10) |>         # log-transform DV
  step_normalize(all_numeric_predictors()) |>  # center and scale
  step_dummy(all_nominal_predictors()) |>     # dummy-encode categoricals
  step_zv(all_predictors())                   # remove zero-variance cols

ames_rec
Useful recipe steps: step_impute_median() fills missing numerics, step_interact(~ x1:x2) creates interaction terms, and step_pca() adds principal-component features.
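Because steps are only declared, you can estimate and apply them yourself to inspect the result: prep() fits the preprocessing parameters on the training data, and bake() applies them. This is a quick sketch for exploration; in normal use the workflow handles both for you.

```r
# Estimate preprocessing parameters (means, SDs, dummy levels) from training data
ames_prepped <- prep(ames_rec, training = train)

# Inspect the fully transformed training set
bake(ames_prepped, new_data = NULL) |> glimpse()

# Apply the same training-estimated transformations to new data
bake(ames_prepped, new_data = test) |> head()
```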

8.4 Model Specification with parsnip

# Linear regression
lm_spec <- linear_reg() |>
  set_engine("lm")

# Random forest
rf_spec <- rand_forest(mtry = 3, trees = 500, min_n = 10) |>
  set_engine("ranger") |>
  set_mode("regression")

# XGBoost (requires xgboost package)
xgb_spec <- boost_tree(trees = 500, learn_rate = 0.01) |>
  set_engine("xgboost") |>
  set_mode("regression")

8.5 Workflows: Bundle Recipe + Model

# Create and fit a workflow
lm_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(lm_spec)

lm_fit <- lm_wflow |> fit(data = train)

# Predict on test set
preds <- lm_fit |>
  predict(new_data = test) |>
  bind_cols(test |> select(Sale_Price))
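One subtlety before scoring these predictions with yardstick: the recipe log10-transforms Sale_Price, so .pred is on the log scale while the column bound in from test is not. A sketch that puts both on the same scale before comparing:

```r
preds |>
  mutate(Sale_Price = log10(Sale_Price)) |>      # match the recipe's log10 outcome
  metrics(truth = Sale_Price, estimate = .pred)  # default set: rmse, rsq, mae
```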

8.6 Cross-Validation with rsample

# 10-fold CV, stratified
folds <- vfold_cv(train, v = 10, strata = Sale_Price)

# Fit workflow across all folds
lm_cv <- lm_wflow |>
  fit_resamples(
    resamples = folds,
    metrics   = metric_set(rmse, rsq, mae)
  )

# Aggregate CV metrics
collect_metrics(lm_cv)

8.7 Hyperparameter Tuning with tune_grid

# Specify tunable parameters with tune()
rf_tune_spec <- rand_forest(
  mtry  = tune(),
  trees = 500,
  min_n = tune()
) |>
  set_engine("ranger", importance = "impurity") |>  # record importance scores for vip later
  set_mode("regression")

rf_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(rf_tune_spec)

# Grid search
rf_grid <- grid_regular(
  mtry(range = c(1, 4)),
  min_n(range = c(5, 20)),
  levels = 4
)

rf_tuned <- rf_wflow |>
  tune_grid(
    resamples = folds,
    grid      = rf_grid,
    metrics   = metric_set(rmse, rsq)
  )

# Best hyperparameters
show_best(rf_tuned, metric = "rmse", n = 3)

8.8 Finalizing and Evaluating

# Select the best hyperparameters
best_rf <- select_best(rf_tuned, metric = "rmse")

# Finalize workflow and fit on full training set
final_wflow <- rf_wflow |> finalize_workflow(best_rf)
final_fit   <- final_wflow |> last_fit(split)

# Test-set metrics
collect_metrics(final_fit)

# Test-set predictions
collect_predictions(final_fit) |>
  ggplot(aes(x = Sale_Price, y = .pred)) +
  geom_point(alpha = 0.4) +
  geom_abline(color = "#276DC3") +
  labs(title = "Predicted vs. Actual (Test Set)")

8.9 Handling Imbalanced Data with themis

When one class is rare (e.g., fraud detection, rare diseases), models tend to predict the majority class. The themis package provides recipe steps for resampling strategies.

library(themis)

# Example: imbalanced classification
data(credit_data, package = "modeldata")
table(credit_data$Status)  # likely imbalanced

# Recipe with SMOTE (Synthetic Minority Over-sampling)
smote_rec <- recipe(Status ~ ., data = credit_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_smote(Status)  # SMOTE resampling of minority class

# Other options from themis:
# step_upsample(Status)    - random oversampling
# step_downsample(Status)  - random undersampling
# step_rose(Status)        - ROSE algorithm
# step_adasyn(Status)      - ADASYN adaptive synthetic

# Build a workflow with the SMOTE recipe
lr_spec <- logistic_reg() |> set_engine("glm")
smote_wflow <- workflow() |>
  add_recipe(smote_rec) |>
  add_model(lr_spec)
SMOTE only on training data: resampling steps like step_smote() should be applied only during training. The recipes framework handles this automatically; these steps are skipped when the recipe is applied to new data at prediction time. Never apply SMOTE before splitting.
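To check whether SMOTE actually helps, resample the workflow and look at class-probability metrics (a sketch; the fold count and seed are arbitrary choices):

```r
set.seed(2024)
credit_folds <- vfold_cv(credit_data, v = 5, strata = Status)

smote_cv <- smote_wflow |>
  fit_resamples(
    resamples = credit_folds,
    metrics   = metric_set(roc_auc, accuracy)
  )

collect_metrics(smote_cv)  # compare against the same workflow without step_smote()
```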

8.10 Comparing Multiple Models with workflowsets

The workflowsets package lets you define many recipe-model combinations and evaluate them all at once.

library(workflowsets)

# Define multiple model specs
lm_spec  <- linear_reg() |> set_engine("lm")
rf_spec  <- rand_forest(trees = 500) |> set_engine("ranger") |> set_mode("regression")
xgb_spec <- boost_tree(trees = 500) |> set_engine("xgboost") |> set_mode("regression")

# Create workflow set: cross all recipes with all models
all_wflows <- workflow_set(
  preproc = list(base = ames_rec),
  models  = list(lm = lm_spec, rf = rf_spec, xgb = xgb_spec)
)

# Evaluate all workflows on the same folds
all_results <- all_wflows |>
  workflow_map("fit_resamples",
               resamples = folds,
               metrics   = metric_set(rmse, rsq))

# Compare performance
rank_results(all_results, rank_metric = "rmse")
autoplot(all_results)

8.11 Bayesian Hyperparameter Optimization

Grid search can be inefficient for large parameter spaces. Bayesian optimization uses a surrogate model to intelligently select the next set of hyperparameters to try.

# Bayesian optimization with tune_bayes()
xgb_tune_spec <- boost_tree(
  trees      = tune(),
  learn_rate = tune(),
  tree_depth = tune(),
  min_n      = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")

xgb_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(xgb_tune_spec)

# Bayesian search (uses Gaussian process surrogate)
xgb_bayes <- xgb_wflow |>
  tune_bayes(
    resamples = folds,
    metrics   = metric_set(rmse),
    initial   = 10,   # initial grid points
    iter      = 25,   # Bayesian iterations
    control   = control_bayes(verbose = TRUE, no_improve = 10)
  )

# Best results
show_best(xgb_bayes, metric = "rmse", n = 5)
autoplot(xgb_bayes)
Grid search vs. Bayesian optimization: tune_grid() evaluates all parameter combinations independently, which is exhaustive but slow. tune_bayes() learns from previous evaluations to focus on promising regions of the parameter space. Use Bayesian optimization when you have more than 2-3 parameters to tune, especially with expensive models like XGBoost or neural networks.

8.12 Model Explainability with vip and DALEX

After fitting a model, understanding which features drive predictions is essential for trust and communication. The vip package provides variable importance plots, and DALEX offers model-agnostic explanations.

library(vip)

# Variable importance for the final random forest
final_rf <- final_wflow |> fit(data = train)
final_rf |>
  extract_fit_parsnip() |>
  vip(num_features = 10, aesthetics = list(fill = "#276DC3"))

# DALEX: model-agnostic explainability
# explain_tidymodels() from DALEXtra applies the workflow's preprocessing,
# so the fitted workflow can be explained with the raw predictor columns
library(DALEX)
library(DALEXtra)

# Create an explainer object
explainer <- explain_tidymodels(
  final_rf,
  data  = train |> select(Gr_Liv_Area, Year_Built, Bldg_Type, Neighborhood),
  y     = log10(train$Sale_Price),
  label = "Random Forest"
)

# Partial dependence plot for a single feature
pdp <- model_profile(explainer, variables = "Gr_Liv_Area")
plot(pdp)

# SHAP-like breakdown for a single observation
new_obs <- test[1, ]
bd <- predict_parts(explainer, new_observation = new_obs, type = "shap")
plot(bd)
Model-specific vs. model-agnostic importance: tree-based models (random forest, XGBoost) have built-in importance measures based on impurity reduction or permutation. DALEX provides model-agnostic tools that work with any model, enabling fair comparison across model types. Use permutation-based importance for the most reliable cross-model comparisons.
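The permutation-based importance recommended above can be computed from the same explainer with model_parts(), which shuffles each feature and records how much the model's loss increases (a sketch using DALEX defaults):

```r
# Permutation variable importance (default loss for regression: RMSE)
perm_imp <- model_parts(explainer, type = "variable_importance")
plot(perm_imp)
```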

8.13 Stacking Models with stacks

Model stacking (ensemble learning) combines predictions from multiple models using a meta-learner. The stacks package integrates with tidymodels workflows.

library(stacks)

# Fit candidate models with control_stack_resamples()
ctrl <- control_stack_resamples()

lm_res <- lm_wflow |>
  fit_resamples(folds, control = ctrl)

rf_res <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(rf_spec) |>
  fit_resamples(folds, control = ctrl)

# Initialize a stack and add candidate members
ames_stack <- stacks() |>
  add_candidates(lm_res) |>
  add_candidates(rf_res)

# Blend predictions (penalized regression meta-learner)
ames_stack <- ames_stack |> blend_predictions()
autoplot(ames_stack)

# Fit the final stacked model
ames_stack <- ames_stack |> fit_members()

# Predict on test set
stack_preds <- predict(ames_stack, test)
bind_cols(test |> select(Sale_Price), stack_preds)
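To see whether the ensemble improves on its members, score the stacked predictions the same way as the individual models, remembering that the recipe put Sale_Price on the log10 scale:

```r
bind_cols(test |> select(Sale_Price), stack_preds) |>
  mutate(Sale_Price = log10(Sale_Price)) |>   # match the recipe's outcome transform
  rmse(truth = Sale_Price, estimate = .pred)
```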

8.14 Deployment Considerations

After building and evaluating a model, you may need to save it for later use, deploy it as an API, or embed it in a Shiny app.

# Save a fitted workflow for later use
saveRDS(final_rf, "fitted_model.rds")

# Load and predict later
loaded_model <- readRDS("fitted_model.rds")
predict(loaded_model, new_data = test[1:5, ])

# For production: use vetiver for model versioning and deployment
# library(vetiver)
# v <- vetiver_model(final_rf, "ames_price_model")
# vetiver_pin_write(board, v)  # pin to a board
# vetiver_deploy_rsconnect(board, "ames_price_model")  # deploy API
The vetiver package: vetiver provides a framework for versioning, deploying, and monitoring ML models in production. It integrates with tidymodels workflows and can deploy models as Plumber APIs on Posit Connect, or as Docker containers for cloud deployment.

Exercises

Exercise 8.1

Build a classification workflow to predict Species in the iris dataset using a decision tree (decision_tree() with engine "rpart"). Use 5-fold CV and report accuracy. Then try a random forest and compare.

Exercise 8.2

Using the Ames housing data, create a recipe that includes step_interact(~ Gr_Liv_Area:Year_Built). Fit an XGBoost model with tune() on learn_rate and trees. Which combination gives the best RMSE?

Exercise 8.3

Use workflowsets to compare linear regression, random forest, and XGBoost on the Ames data with the same recipe and same 10-fold CV splits. Rank the models by RMSE and plot the results with autoplot(). Then use vip to show the top 10 most important features for the best-performing model.

Exercise 8.4

Load the credit_data from the modeldata package. Build two classification workflows for predicting Status: one with a standard recipe and one with step_smote() from themis. Compare the two workflows using 5-fold CV with roc_auc as the metric. Does SMOTE improve the AUC? Explain why or why not.
