Recipes, workflows, cross-validation, tuning, and evaluation in a tidy framework.
tidymodels provides a unified, tidy interface for the full ML pipeline. Its core packages are:
| Package | Role |
|---|---|
| rsample | Train/test splits, cross-validation |
| recipes | Feature engineering and preprocessing |
| parsnip | Unified model specification (engine-agnostic) |
| workflows | Bundle recipe + model together |
| tune | Hyperparameter tuning |
| yardstick | Evaluation metrics |
```r
library(tidymodels)
tidymodels_prefer()  # resolve naming conflicts

data(ames, package = "modeldata")

# 75/25 train-test split, stratified by Sale_Price
set.seed(2024)
split <- initial_split(ames, prop = 0.75, strata = Sale_Price)
train <- training(split)
test  <- testing(split)
nrow(train)  # ~2198
nrow(test)   # ~732
```
A recipe defines preprocessing steps before modeling. Steps are declared but not executed until the workflow runs.
```r
ames_rec <- recipe(
  Sale_Price ~ Gr_Liv_Area + Year_Built + Bldg_Type + Neighborhood,
  data = train
) |>
  step_log(Sale_Price, base = 10) |>           # log-transform the outcome
  step_normalize(all_numeric_predictors()) |>  # center and scale
  step_dummy(all_nominal_predictors()) |>      # dummy-encode categoricals
  step_zv(all_predictors())                    # drop zero-variance cols

ames_rec
```
Other useful steps: step_impute_median() fills missing numeric values, step_interact(~ x1:x2) creates interaction terms, and step_pca() replaces correlated predictors with principal components.
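The steps above can be combined in a single recipe. A minimal sketch, reusing the Ames training data from earlier (the variable choices here are illustrative):

```r
# Illustrative recipe combining imputation, an interaction, and PCA
extra_rec <- recipe(Sale_Price ~ ., data = train) |>
  step_impute_median(all_numeric_predictors()) |>   # fill missing numerics
  step_interact(~ Gr_Liv_Area:Year_Built) |>        # numeric-by-numeric interaction
  step_normalize(all_numeric_predictors()) |>       # PCA expects scaled inputs
  step_pca(all_numeric_predictors(), num_comp = 5)  # keep first 5 components
```

Note that step_interact() on nominal variables requires dummy-encoding them first; numeric-by-numeric interactions, as here, work directly.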
```r
# Linear regression
lm_spec <- linear_reg() |>
  set_engine("lm")

# Random forest
rf_spec <- rand_forest(mtry = 3, trees = 500, min_n = 10) |>
  set_engine("ranger") |>
  set_mode("regression")

# XGBoost (requires the xgboost package)
xgb_spec <- boost_tree(trees = 500, learn_rate = 0.01) |>
  set_engine("xgboost") |>
  set_mode("regression")
```
```r
# Create and fit a workflow
lm_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(lm_spec)

lm_fit <- lm_wflow |> fit(data = train)

# Predict on the test set
preds <- lm_fit |>
  predict(new_data = test) |>
  bind_cols(test |> select(Sale_Price))
```
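With predictions and observed values side by side, yardstick can score the model. One subtlety: the recipe log10-transforms Sale_Price, so predictions come back on the log scale; putting the truth on the same scale first is one way to compare (a sketch):

```r
# Predictions are on the log10 scale (the recipe transforms the outcome),
# so put observed prices on the same scale before scoring
preds |>
  mutate(Sale_Price = log10(Sale_Price)) |>
  metrics(truth = Sale_Price, estimate = .pred)  # rmse, rsq, mae
```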
```r
# 10-fold CV, stratified
folds <- vfold_cv(train, v = 10, strata = Sale_Price)

# Fit the workflow across all folds
lm_cv <- lm_wflow |>
  fit_resamples(
    resamples = folds,
    metrics = metric_set(rmse, rsq, mae)
  )

# Aggregate CV metrics
collect_metrics(lm_cv)
```
```r
# Specify tunable parameters with tune()
rf_tune_spec <- rand_forest(
  mtry = tune(),
  trees = 500,
  min_n = tune()
) |>
  set_engine("ranger") |>
  set_mode("regression")

rf_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(rf_tune_spec)

# Grid search
rf_grid <- grid_regular(
  mtry(range = c(1, 4)),
  min_n(range = c(5, 20)),
  levels = 4
)

rf_tuned <- rf_wflow |>
  tune_grid(
    resamples = folds,
    grid = rf_grid,
    metrics = metric_set(rmse, rsq)
  )

# Best hyperparameters
show_best(rf_tuned, metric = "rmse", n = 3)
```
```r
# Select the best hyperparameters
best_rf <- select_best(rf_tuned, metric = "rmse")

# Finalize the workflow, refit on the full training set,
# and evaluate once on the test set
final_wflow <- rf_wflow |> finalize_workflow(best_rf)
final_fit <- final_wflow |> last_fit(split)

# Test-set metrics
collect_metrics(final_fit)

# Test-set predictions
collect_predictions(final_fit) |>
  ggplot(aes(x = Sale_Price, y = .pred)) +
  geom_point(alpha = 0.4) +
  geom_abline(color = "#276DC3") +
  labs(title = "Predicted vs. Actual (Test Set)")
```
When one class is rare (e.g., fraud detection, rare diseases), models tend to predict the majority class. The themis package provides recipe steps for resampling strategies.
```r
library(themis)

# Example: imbalanced classification
data(credit_data, package = "modeldata")
table(credit_data$Status)  # likely imbalanced

# Recipe with SMOTE (Synthetic Minority Over-sampling Technique)
smote_rec <- recipe(Status ~ ., data = credit_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_smote(Status)  # SMOTE resampling of the minority class

# Other options from themis:
# step_upsample(Status)   - random oversampling
# step_downsample(Status) - random undersampling
# step_rose(Status)       - ROSE algorithm
# step_adasyn(Status)     - ADASYN adaptive synthetic sampling

# Build a workflow with the SMOTE recipe
lr_spec <- logistic_reg() |> set_engine("glm")

smote_wflow <- workflow() |>
  add_recipe(smote_rec) |>
  add_model(lr_spec)
```
step_smote() must only be applied to training data. The recipes framework handles this automatically: themis steps default to skip = TRUE, so they are skipped when the recipe is applied to new data. Never apply SMOTE before splitting, or synthetic observations derived from test rows will leak into training.
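The skip behavior can be checked directly with prep() and bake(). A sketch, assuming an illustrative train/test split of credit_data:

```r
# Confirm SMOTE only changes the training data
set.seed(1)
cd_split <- initial_split(credit_data, strata = Status)
cd_train <- training(cd_split)
cd_test  <- testing(cd_split)

prepped <- recipe(Status ~ ., data = cd_train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_smote(Status) |>
  prep()

nrow(bake(prepped, new_data = NULL))     # training rows + synthetic rows
nrow(bake(prepped, new_data = cd_test))  # test rows unchanged: SMOTE skipped
```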
The workflowsets package lets you define many recipe-model combinations and evaluate them all at once.
```r
library(workflowsets)

# Define multiple model specs
lm_spec <- linear_reg() |> set_engine("lm")

rf_spec <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("regression")

xgb_spec <- boost_tree(trees = 500) |>
  set_engine("xgboost") |>
  set_mode("regression")

# Create a workflow set: cross all recipes with all models
all_wflows <- workflow_set(
  preproc = list(base = ames_rec),
  models = list(lm = lm_spec, rf = rf_spec, xgb = xgb_spec)
)

# Evaluate all workflows on the same folds
all_results <- all_wflows |>
  workflow_map("fit_resamples",
               resamples = folds,
               metrics = metric_set(rmse, rsq))

# Compare performance
rank_results(all_results, rank_metric = "rmse")
autoplot(all_results)
```
Grid search can be inefficient for large parameter spaces. Bayesian optimization uses a surrogate model to intelligently select the next set of hyperparameters to try.
```r
# Bayesian optimization with tune_bayes()
xgb_tune_spec <- boost_tree(
  trees = tune(),
  learn_rate = tune(),
  tree_depth = tune(),
  min_n = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")

xgb_wflow <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(xgb_tune_spec)

# Bayesian search (uses a Gaussian process surrogate)
xgb_bayes <- xgb_wflow |>
  tune_bayes(
    resamples = folds,
    metrics = metric_set(rmse),
    initial = 10,  # initial grid points
    iter = 25,     # Bayesian iterations
    control = control_bayes(verbose = TRUE, no_improve = 10)
  )

# Best results
show_best(xgb_bayes, metric = "rmse", n = 5)
autoplot(xgb_bayes)
```
tune_grid() evaluates all parameter combinations independently, which is exhaustive but slow. tune_bayes() learns from previous evaluations to focus on promising regions of the parameter space. Use Bayesian optimization when you have more than 2-3 parameters to tune, especially with expensive models like XGBoost or neural networks.
After fitting a model, understanding which features drive predictions is essential for trust and communication. The vip package provides variable importance plots, and DALEX offers model-agnostic explanations.
```r
library(vip)

# Variable importance for the final random forest
# (note: ranger computes importances only when requested, e.g.
# set_engine("ranger", importance = "impurity") in the model spec)
final_rf <- final_wflow |> fit(data = train)

final_rf |>
  extract_fit_parsnip() |>
  vip(num_features = 10, aesthetics = list(fill = "#276DC3"))

# DALEX: model-agnostic explainability
library(DALEX)

# Create an explainer object
explainer <- explain(
  extract_fit_parsnip(final_rf),
  data = train |> select(Gr_Liv_Area, Year_Built, Bldg_Type, Neighborhood),
  y = log10(train$Sale_Price),
  label = "Random Forest"
)

# Partial dependence plot for a single feature
pdp <- model_profile(explainer, variables = "Gr_Liv_Area")
plot(pdp)

# SHAP-like breakdown for a single observation
new_obs <- test[1, ]
bd <- predict_parts(explainer, new_observation = new_obs, type = "shap")
plot(bd)
```
Model stacking (ensemble learning) combines predictions from multiple models using a meta-learner. The stacks package integrates with tidymodels workflows.
```r
library(stacks)

# Fit candidate models, saving predictions with control_stack_resamples()
ctrl <- control_stack_resamples()

lm_res <- lm_wflow |> fit_resamples(folds, control = ctrl)

rf_res <- workflow() |>
  add_recipe(ames_rec) |>
  add_model(rf_spec) |>
  fit_resamples(folds, control = ctrl)

# Initialize a stack and add candidate members
ames_stack <- stacks() |>
  add_candidates(lm_res) |>
  add_candidates(rf_res)

# Blend predictions (penalized regression meta-learner)
ames_stack <- ames_stack |> blend_predictions()
autoplot(ames_stack)

# Fit the final stacked model
ames_stack <- ames_stack |> fit_members()

# Predict on the test set
stack_preds <- predict(ames_stack, test)
bind_cols(test |> select(Sale_Price), stack_preds)
```
After building and evaluating a model, you may need to save it for later use, deploy it as an API, or embed it in a Shiny app.
```r
# Save a fitted workflow for later use
saveRDS(final_rf, "fitted_model.rds")

# Load and predict later
loaded_model <- readRDS("fitted_model.rds")
predict(loaded_model, new_data = test[1:5, ])

# For production: use vetiver for model versioning and deployment
# library(vetiver)
# v <- vetiver_model(final_rf, "ames_price_model")
# vetiver_pin_write(board, v)                          # pin to a board
# vetiver_deploy_rsconnect(board, "ames_price_model")  # deploy as an API
```
vetiver provides a framework for versioning, deploying, and monitoring ML models in production. It integrates with tidymodels workflows and can deploy models as Plumber APIs on Posit Connect, or as Docker containers for cloud deployment.
Build a classification workflow to predict Species in the iris dataset using a decision tree (decision_tree() with engine "rpart"). Use 5-fold CV and report accuracy. Then try a random forest and compare.
Using the Ames housing data, create a recipe that includes step_interact(~ Gr_Liv_Area:Year_Built). Fit an XGBoost model with tune() on learn_rate and trees. Which combination gives the best RMSE?
Use workflowsets to compare linear regression, random forest, and XGBoost on the Ames data with the same recipe and same 10-fold CV splits. Rank the models by RMSE and plot the results with autoplot(). Then use vip to show the top 10 most important features for the best-performing model.
Load the credit_data from the modeldata package. Build two classification workflows for predicting Status: one with a standard recipe and one with step_smote() from themis. Compare the two workflows using 5-fold CV with roc_auc as the metric. Does SMOTE improve the AUC? Explain why or why not.
- vfold_cv() for cross-validation and tune_grid() for hyperparameter search.
- tune_bayes() offers smarter hyperparameter search for large parameter spaces.
- themis::step_smote() inside recipes (applied only during training).
- workflowsets enables systematic comparison of many recipe-model combinations.
- vip and DALEX provide variable importance and model-agnostic explainability.
- stacks combines predictions from multiple models via a meta-learner for improved accuracy.
- vetiver to version, deploy, and monitor models in production.