🔬 Module 6 Lab: Credit Risk Assessment with Boosting

Build a production-grade credit scoring model using XGBoost

Lab Overview

Learning Objectives

By completing this lab, you will develop practical skills in implementing boosting algorithms for a real-world financial application. You will learn how to handle imbalanced datasets, optimize hyperparameters using cross-validation, interpret feature importance for regulatory compliance, and deploy a model that meets business requirements for accuracy and explainability. This lab simulates the work you would perform as a data scientist at a financial institution.

Business Context

You are a data scientist at FinanceFirst Bank, a regional bank processing 50,000 loan applications annually. The current credit scoring system approves 65% of applications with a 12% default rate, resulting in $24 million in annual losses. Your task is to build an XGBoost model that improves default prediction while maintaining approval rates above 60% to meet growth targets. The model must also provide feature importance scores to satisfy regulatory requirements for explainability in lending decisions.

Task 1: Data Preparation and Exploration

Your Task

Load the credit risk dataset, perform exploratory data analysis to understand feature distributions and default rates, identify missing values and outliers, and prepare the data for modeling by handling categorical variables and splitting into training and test sets.

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
# Dataset contains 50,000 loan applications with 23 features
df = pd.read_csv('credit_risk_data.csv')

# Initial exploration
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nFeature types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nTarget variable distribution:")
print(df['default'].value_counts())
print(f"Default rate: {df['default'].mean():.2%}")

# Visualize key relationships
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Default rate by age group
df['age_group'] = pd.cut(df['age'], bins=[18, 30, 40, 50, 60, 100],
                         labels=['18-30', '30-40', '40-50', '50-60', '60+'])
age_default = df.groupby('age_group')['default'].mean()
axes[0, 0].bar(age_default.index, age_default.values)
axes[0, 0].set_title('Default Rate by Age Group')
axes[0, 0].set_ylabel('Default Rate')

# Default rate by income level
df['income_group'] = pd.cut(df['annual_income'], bins=[0, 30000, 60000, 100000, 500000],
                            labels=['<30K', '30-60K', '60-100K', '>100K'])
income_default = df.groupby('income_group')['default'].mean()
axes[0, 1].bar(income_default.index, income_default.values)
axes[0, 1].set_title('Default Rate by Income Level')
axes[0, 1].set_ylabel('Default Rate')

# Loan amount distribution
axes[1, 0].hist([df[df['default']==0]['loan_amount'], df[df['default']==1]['loan_amount']],
                label=['Non-default', 'Default'], bins=30)
axes[1, 0].set_title('Loan Amount Distribution')
axes[1, 0].legend()

# Credit score distribution
axes[1, 1].hist([df[df['default']==0]['credit_score'], df[df['default']==1]['credit_score']],
                label=['Non-default', 'Default'], bins=30)
axes[1, 1].set_title('Credit Score Distribution')
axes[1, 1].legend()

plt.tight_layout()
plt.show()
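The task also asks you to identify outliers, which the exploration code above does not cover. Below is a minimal sketch using the interquartile-range rule on the main numeric columns; tree-based models such as XGBoost are fairly robust to outliers, so flagging and reviewing them is usually sufficient.

# Flag potential outliers with the IQR rule
numeric_cols = ['age', 'annual_income', 'loan_amount', 'credit_score']
for col in numeric_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {n_outliers} values outside [{lower:,.0f}, {upper:,.0f}]")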

Key Observations to Note

Your exploration should reveal that younger applicants (18-30) have higher default rates around 18%, while applicants over 50 show default rates below 8%. Income levels show a strong inverse relationship with default rates, with high-income borrowers defaulting at less than 5% compared to 22% for low-income borrowers. Credit scores provide clear separation between defaults and non-defaults, with most defaults occurring below 620. Loan amounts show moderate predictive power, with larger loans having slightly higher default rates. These insights will guide your feature engineering and model interpretation.
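One way to check the credit-score observation quantitatively is to bucket scores and compare default rates per band. A minimal sketch follows; the band edges are illustrative, not prescribed by the lab.

# Default rate by credit score band (band edges chosen for illustration)
score_bands = pd.cut(df['credit_score'], bins=[300, 580, 620, 680, 740, 850],
                     labels=['<580', '580-620', '620-680', '680-740', '740+'])
print(df.groupby(score_bands, observed=True)['default'].agg(['mean', 'count']))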

Checkpoint 1

Before proceeding, verify that you have successfully identified the class imbalance (default rate around 12%), confirmed that credit score and income are strong predictors based on distribution differences, noted any missing values that need imputation, and created visualizations that clearly show the relationship between key features and default rates. These foundational understandings will inform your modeling decisions in the next tasks.
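If your exploration surfaced missing values, a simple median/mode imputation keeps the pipeline straightforward. XGBoost can also handle NaNs natively, so this step is optional; the sketch below is one reasonable default.

# Optional: simple imputation (XGBoost also handles NaNs natively)
for col in df.columns:
    if df[col].isnull().any():
        if df[col].dtype == 'object':
            df[col] = df[col].fillna(df[col].mode()[0])   # most frequent category
        else:
            df[col] = df[col].fillna(df[col].median())    # robust to skewed distributions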

Task 2: Build and Train the XGBoost Model

Your Task

Implement an XGBoost classifier with appropriate parameters for the imbalanced credit risk dataset. Configure the model to handle class imbalance, set regularization parameters to prevent overfitting, and use early stopping to find the optimal number of trees. Train the model and evaluate its performance using metrics appropriate for imbalanced classification problems.

# Prepare data for modeling
# Handle categorical variables
categorical_cols = ['employment_status', 'home_ownership', 'loan_purpose']
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

# Separate features and target
X = df.drop(['default', 'age_group', 'income_group'], axis=1)
y = df['default']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training default rate: {y_train.mean():.2%}")

# Build XGBoost model
import xgboost as xgb
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

# Calculate scale_pos_weight for imbalanced data
scale_pos_weight = len(y_train[y_train==0]) / len(y_train[y_train==1])

# Set parameters (native API: the number of trees is set by num_boost_round below,
# so n_estimators is not needed here)
params = {
    'objective': 'binary:logistic',
    'max_depth': 5,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 3,
    'gamma': 0.1,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'scale_pos_weight': scale_pos_weight,
    'eval_metric': ['auc', 'logloss'],
    'seed': 42
}

# Create DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Train with early stopping
evals = [(dtrain, 'train'), (dtest, 'eval')]
model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=evals,
    early_stopping_rounds=20,
    verbose_eval=50
)

# Make predictions using the best iteration found by early stopping
y_pred_proba = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))
y_pred = (y_pred_proba >= 0.5).astype(int)

# Evaluate model
print("\nModel Performance:")
print(classification_report(y_test, y_pred))
print(f"\nAUC Score: {roc_auc_score(y_test, y_pred_proba):.4f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Calculate business metrics
# A predicted default (positive class) means the application is rejected,
# so approved applicants are those predicted as non-default.
tn, fp, fn, tp = cm.ravel()
approval_rate = (tn + fn) / len(y_test)
default_rate_approved = fn / (tn + fn) if (tn + fn) > 0 else 0
print(f"\nBusiness Metrics:")
print(f"Approval Rate: {approval_rate:.2%}")
print(f"Default Rate Among Approved: {default_rate_approved:.2%}")
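Because defaults are the minority class, it is worth looking beyond the single 0.5 cutoff used above. The sketch below plots a precision-recall curve and reports average precision; it reuses y_test and y_pred_proba from the block above.

# Precision-recall view of the test-set predictions (complements AUC on imbalanced data)
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
ap = average_precision_score(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'Average precision = {ap:.3f}')
plt.xlabel('Recall (share of defaults caught)')
plt.ylabel('Precision (flagged applicants who actually default)')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()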

Checkpoint 2

Your model should achieve an AUC score above 0.85, indicating strong discriminatory power between defaulters and non-defaulters. The approval rate should remain above 60% to meet business growth targets, while the default rate among approved loans should drop below 8% (compared to the current 12%). The early stopping mechanism should have halted training somewhere between 80 and 150 trees, demonstrating that the model converged to an optimal solution without unnecessary complexity. If your metrics fall short of these targets, revisit your feature engineering or adjust the scale_pos_weight parameter to better balance precision and recall.
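Retraining with a different scale_pos_weight is one option; a cheaper alternative is to sweep the decision threshold on the existing predictions and pick the cutoff that best balances precision, recall, and the approval rate. A minimal sketch reusing y_test and y_pred_proba from Task 2:

# Sweep the decision threshold instead of retraining (predicted non-default = approved)
from sklearn.metrics import precision_score, recall_score

for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    preds = (y_pred_proba >= threshold).astype(int)
    approval_rate_t = (preds == 0).mean()
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, preds, zero_division=0):.3f}  "
          f"recall={recall_score(y_test, preds):.3f}  "
          f"approval rate={approval_rate_t:.2%}")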

Task 3: Optimize and Deploy the Final Model

Your Task

Use cross-validation to find optimal hyperparameters for your XGBoost model, focusing on parameters that control model complexity and learning rate. Analyze feature importance to identify the top predictors of credit risk. Calculate the expected financial impact of deploying your model compared to the current system. Prepare a model summary that explains key drivers of credit risk in language accessible to non-technical stakeholders.

# Hyperparameter tuning with cross-validation
from sklearn.model_selection import GridSearchCV

# Define parameter grid (focused search)
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.8, 1.0]
}

# Create XGBoost classifier
xgb_clf = xgb.XGBClassifier(
    objective='binary:logistic',
    scale_pos_weight=scale_pos_weight,
    random_state=42
)

# Perform grid search
grid_search = GridSearchCV(
    xgb_clf,
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best AUC score:", grid_search.best_score_)

# Train final model with best parameters
final_model = grid_search.best_estimator_
y_pred_final = final_model.predict(X_test)
y_pred_proba_final = final_model.predict_proba(X_test)[:, 1]

print("\nFinal Model Performance:")
print(classification_report(y_test, y_pred_final))
print(f"AUC: {roc_auc_score(y_test, y_pred_proba_final):.4f}")

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': final_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

# Visualize feature importance
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(15)[::-1]   # reverse so the most important feature appears at the top
plt.barh(top_features['feature'], top_features['importance'])
plt.xlabel('Importance Score')
plt.title('Top 15 Features for Credit Risk Prediction')
plt.tight_layout()
plt.show()

# Calculate financial impact
cm_final = confusion_matrix(y_test, y_pred_final)
tn_f, fp_f, fn_f, tp_f = cm_final.ravel()

# Assume $100,000 average loan, 20% loss given default
avg_loan = 100000
lgd = 0.20
annual_volume = 50000

# Current system (from the business context): 65% approval rate, 12% default rate among approved loans
current_approval_rate = 0.65
current_default_rate = 0.12

# New model: approved applicants are those predicted as non-default
new_approval_rate = (tn_f + fn_f) / len(y_test)
new_default_rate = fn_f / (tn_f + fn_f) if (tn_f + fn_f) > 0 else 0

current_losses = annual_volume * current_approval_rate * current_default_rate * avg_loan * lgd
new_losses = annual_volume * new_approval_rate * new_default_rate * avg_loan * lgd
savings = current_losses - new_losses

# Assume a $100,000 one-time cost to build and deploy the model (not specified in the lab)
deployment_cost = 100000

print(f"\nFinancial Impact Analysis:")
print(f"Current annual losses: ${current_losses:,.0f}")
print(f"Projected annual losses: ${new_losses:,.0f}")
print(f"Annual savings: ${savings:,.0f}")
print(f"ROI from model deployment: {savings / deployment_cost:.1f}x")
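For the stakeholder-facing summary and the regulatory explainability requirement, per-applicant explanations can complement the global feature importances above. The sketch below uses the shap package (an assumption; it is not part of the lab's stated requirements) and saves the tuned model for deployment with joblib.

# Optional: per-applicant explanations with SHAP and model persistence
# (assumes the shap package is installed; not required by the lab)
import shap
import joblib

explainer = shap.TreeExplainer(final_model)
shap_values = explainer.shap_values(X_test)

# Global view: which features push applicants toward or away from default
shap.summary_plot(shap_values, X_test, max_display=10)

# Persist the tuned model for deployment
joblib.dump(final_model, 'credit_risk_xgb.joblib')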