Lab Overview
Learning Objectives
By completing this lab, you will develop practical skills in implementing boosting algorithms for a real-world financial application. You will learn how to handle imbalanced datasets, optimize hyperparameters using cross-validation, interpret feature importance for regulatory compliance, and compare multiple gradient boosting frameworks. This lab simulates the work you would perform as a data scientist at a financial institution.
Business Context
You are a data scientist at FinanceFirst Bank, a regional bank processing 50,000 loan applications annually. The current credit scoring system approves 65% of applications with a 12% default rate, resulting in $24 million in annual losses. Your task is to build an XGBoost model that improves default prediction while maintaining approval rates above 60% to meet growth targets. The model must also provide feature importance scores to satisfy regulatory requirements for explainability in lending decisions.
Task 1: Data Preparation and Exploration
✏️ Your Task
Load the credit risk dataset, perform exploratory data analysis to understand feature distributions and default rates, identify missing values and outliers, and prepare the data for modeling by handling categorical variables and splitting into training and test sets.
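The preparation steps above can be sketched as follows. This is a minimal, assumption-level example: the column names (`credit_score`, `income`, `employment_type`, `default`) and the synthetic frame stand in for the lab's actual credit risk file.

```python
# Sketch of the preparation pipeline on a synthetic stand-in for the lab data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "credit_score": rng.integers(450, 850, n).astype(float),
    "income": rng.normal(60_000, 20_000, n),
    "employment_type": rng.choice(["salaried", "self_employed", "contract"], n),
    "default": rng.binomial(1, 0.12, n),  # ~12% positive class, as in the lab
})
df.loc[rng.choice(n, 50, replace=False), "income"] = np.nan  # inject missing values

# 1. Impute missing numeric values (median is robust to income outliers)
df["income"] = df["income"].fillna(df["income"].median())

# 2. One-hot encode categorical variables
df = pd.get_dummies(df, columns=["employment_type"], drop_first=True)

# 3. Stratified split preserves the ~12% default rate in both partitions
X = df.drop(columns="default")
y = df["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, round(y_train.mean(), 3), round(y_test.mean(), 3))
```

The `stratify=y` argument matters here: with only ~12% positives, a plain random split can leave the test set with a noticeably different default rate.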
Hint
Your exploration should reveal that younger applicants (18–30) have higher default rates (~18%), while applicants over 50 show rates below 8%. Income levels show a strong inverse relationship with default rates. Credit scores provide clear separation – most defaults occur below 620. Use pd.cut() for binning continuous features and groupby().mean() to compute group-level default rates.
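The binning-and-grouping pattern from the hint can be sketched like this; the bin edges and the toy generating process are assumptions, not the lab dataset.

```python
# Group-level default rates via pd.cut() + groupby().mean() on toy data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
age = rng.integers(18, 75, n)
# Toy generating process: younger applicants default more often
p_default = np.where(age <= 30, 0.18, np.where(age > 50, 0.07, 0.12))
df = pd.DataFrame({"age": age, "default": rng.binomial(1, p_default)})

# Bin the continuous feature, then average the 0/1 label per bin:
# the mean of a binary column is exactly the group's default rate.
df["age_band"] = pd.cut(df["age"], bins=[17, 30, 40, 50, 75],
                        labels=["18-30", "31-40", "41-50", "51+"])
rates = df.groupby("age_band", observed=True)["default"].mean()
print(rates)
```

The same two-liner works for income or credit-score bands; only the `bins` edges change.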
Checkpoint 1
Before proceeding, verify that you have identified the class imbalance (~12% default rate), confirmed that credit score and income are strong predictors, noted missing values requiring imputation, and created visualizations showing feature–default relationships. These foundational insights will inform your modeling decisions in the next tasks.
Task 2: Build and Train the XGBoost Model
✏️ Your Task
Implement an XGBoost classifier with appropriate parameters for the imbalanced credit risk dataset. Configure the model to handle class imbalance, set regularization parameters to prevent overfitting, and use early stopping to find the optimal number of trees. Train the model and evaluate its performance using metrics appropriate for imbalanced classification.
Hint
Use scale_pos_weight = count(negative) / count(positive) to account for class imbalance – this tells XGBoost to penalize false negatives more heavily. Set early_stopping_rounds=20 to prevent overfitting; optimal stopping typically occurs between 80–150 rounds. Evaluate with AUC rather than accuracy because accuracy is misleading on imbalanced data (a model predicting all non-defaults achieves 88% accuracy but zero recall on defaults).
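The two calculations behind this hint, in plain Python. The counts are assumptions chosen to mirror the lab's ~12% default rate.

```python
# Class-imbalance weight and the "accuracy trap", on assumed counts.
n_pos, n_neg = 1200, 8800          # defaults vs non-defaults in 10,000 loans

# Value to pass as XGBClassifier(scale_pos_weight=...)
scale_pos_weight = n_neg / n_pos   # ≈ 7.33: each default weighs ~7x a non-default
print(round(scale_pos_weight, 2))

# Why accuracy misleads: a model that predicts "no default" for everyone
accuracy = n_neg / (n_pos + n_neg)   # all negatives counted as correct -> 0.88
recall_on_defaults = 0 / n_pos       # but not a single default is caught -> 0.0
print(accuracy, recall_on_defaults)
```

This is why the checkpoint asks for AUC: it measures ranking quality across all thresholds and cannot be gamed by predicting the majority class.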
Checkpoint 2
Your model should achieve an AUC above 0.85. The approval rate should remain above 60%, while the default rate among approved loans should drop below 8% (from the current 12%). Early stopping should halt between 80–150 rounds. If metrics fall short, revisit your feature engineering or adjust scale_pos_weight to better balance precision and recall.
Task 3: Optimize and Deploy the Final Model
✏️ Your Task
Use cross-validation to find optimal hyperparameters, analyze feature importance to identify the top predictors of credit risk, calculate the expected financial impact of deploying your model, and prepare a model summary accessible to non-technical stakeholders.
Hint
In GridSearchCV use scoring='roc_auc' and cv=5 for stable estimates on imbalanced data. For the financial impact calculation, start with average loan = $100K and LGD = 20%. The formula is: savings = annual_volume × (old_default_rate − new_default_rate) × avg_loan × LGD. For regulators, frame feature importance in plain language – "credit score" and "debt-to-income ratio" are understandable; "feature_7" is not.
📊 Interactive Feature Importance Chart
Click any bar to see details about that feature and its role in credit risk prediction.
Checkpoint 3
Verify that grid search improved AUC by at least 0.002 over the initial model, that credit_score and debt_to_income dominate feature importance (together >40%), and that the projected annual savings exceed $8M. The interactive chart above shows the full feature ranking – click any bar for a plain-language explanation suitable for a regulatory filing.
Task 4: Model Comparison – RF vs XGBoost vs LightGBM
✏️ Your Task
Compare three state-of-the-art ensemble methods on the credit risk dataset: Random Forest, XGBoost, and LightGBM. Evaluate each model across five metrics – AUC, F1-score (default class), training time, inference speed, and memory footprint – to identify the best model for production deployment under FinanceFirst Bank's constraints.
Hint
LightGBM uses leaf-wise (best-first) tree growth vs XGBoost's level-wise growth, making it faster but more prone to overfitting on small datasets. Random Forest's bagging approach provides more variance reduction but less bias reduction than boosting. For credit risk (where false negatives = unpaid loans), prioritize recall on the default class over overall accuracy. LightGBM typically trains 3–10× faster than XGBoost on large datasets.
Model Comparison Summary
| Model | AUC | F1 (Default) | Train Time | Inference | Best For |
|---|---|---|---|---|---|
| Random Forest | 0.6812 | – | 14.7 s | 38.2 ms | Stability, low variance |
| XGBoost | 0.7231 | – | 8.3 s | 12.6 ms | Interpretability + performance |
| 🏆 LightGBM | 0.7319 | – | 3.1 s | 8.4 ms | Large-scale production |
Checkpoint 4
LightGBM edges out XGBoost on both AUC and F1-Default while training roughly 2.7× faster (3.1 s vs 8.3 s) – a compelling case for production deployment. However, note that all three models materially outperform the bank's current rule-based system. The final model choice should weigh regulatory interpretability requirements against raw performance; XGBoost's SHAP ecosystem gives it an edge in regulated environments.
Task 5: Hyperparameter Tuning – Interactive Learning Rate Explorer
✏️ Your Task
Explore how the learning rate and tree depth interact to produce different training and validation loss curves. Use the interactive controls below to simulate training runs and observe overfitting, underfitting, and optimal convergence patterns. Identify the hyperparameter combination that minimizes validation loss while maintaining a small train–val gap.
Hint
A high learning rate (≥ 0.3) causes validation loss to diverge after a few rounds – the model takes large gradient steps and overshoots the optimum. A very low learning rate (≤ 0.01) means the model learns slowly and needs many more trees to converge. The sweet spot is typically 0.05–0.10 with appropriate regularization. Watch for overfitting when the train–val gap exceeds 0.05 log-loss units. Set n_estimators roughly inversely proportional to the learning rate: halving the rate roughly doubles the number of trees needed.
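What the explorer visualizes can be reproduced offline. The sketch below uses scikit-learn's GradientBoostingClassifier (an assumption-level stand-in for the lab's simulator) and its `staged_predict_proba` to trace train and validation log-loss at each boosting round for two learning rates.

```python
# Per-round train/validation loss curves for two learning rates.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.88],
                           random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=1)

def loss_curves(lr, n_estimators=100, max_depth=5):
    """Train and validation log-loss after each boosting round."""
    gb = GradientBoostingClassifier(learning_rate=lr, n_estimators=n_estimators,
                                    max_depth=max_depth, random_state=1)
    gb.fit(X_tr, y_tr)
    train = np.array([log_loss(y_tr, p) for p in gb.staged_predict_proba(X_tr)])
    val = np.array([log_loss(y_va, p) for p in gb.staged_predict_proba(X_va)])
    return train, val

for lr in (0.05, 0.30):
    train, val = loss_curves(lr)
    print(f"lr={lr}: best val loss {val.min():.3f} at round {val.argmin() + 1}, "
          f"final train-val gap {val[-1] - train[-1]:.3f}")
```

The round at which validation loss bottoms out is exactly what early stopping detects automatically; the final train–val gap is the overfitting signal the hint tells you to watch.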
🎛️ Interactive Train / Validation Loss Explorer
Adjust learning rate and max depth, then click Run Simulation to see how the loss curves change.
Checkpoint 5
Try learning rates of 0.01, 0.05, 0.10, and 0.30 with max depth 5 and observe the simulation. You should see: slow convergence at 0.01, an optimal balance at 0.05, modest overfitting at 0.10, and clear overfitting (train–val divergence) at 0.30. The ideal configuration minimizes validation loss with the smallest possible train–val gap. Record your best configuration for the final model recommendation in your lab report.