Module 4: Decision Trees for Intelligent Loan Approval

The $85 Million Problem

Regional Bank Corporation faces a crisis: its traditional loan approval process, based on rigid credit-score cutoffs and manual review, is hemorrhaging money. Last year alone, the bank lost $50 million to defaults on approved loans and $35 million in missed revenue from rejected creditworthy applicants. You've been hired to build an interpretable, data-driven decision system that can revolutionize its lending operations.

Part I: The Paradigm Shift - From Rules to Learning

Traditional Approach: Fixed Rules
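A rule-based approver is typically a hand-written cascade of hard cutoffs. A minimal sketch of such a policy (the thresholds here are illustrative, not Regional Bank's actual rules):

# A hypothetical fixed-rule policy: every threshold is hand-picked,
# never updated, and blind to interactions between features.
def legacy_approve(credit_score, debt_to_income, previous_defaults):
    if credit_score < 650:        # hard FICO cutoff
        return 'Reject'
    if debt_to_income > 43:       # hard DTI cutoff (%)
        return 'Reject'
    if previous_defaults > 0:     # any past default is disqualifying
        return 'Reject'
    return 'Approve'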

Machine Learning Approach: Adaptive Trees
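A decision tree replaces those hand-picked cutoffs with thresholds learned from historical outcomes: each split is chosen to best separate loans that defaulted from loans that repaid, and the thresholds shift whenever the tree is retrained on new data. Part II develops the split criterion, and Part III builds such a tree on Regional Bank's data.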

Part II: Mathematical Foundations

Information Theory: The Heart of Decision Trees

Decision trees make splits by maximizing information gain - the reduction in uncertainty about loan outcomes:

H(S) = -Σ p(c) × log₂(p(c))

Where H(S) is the entropy of the dataset S, and p(c) is the proportion of samples belonging to class c (default/no-default).
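As a quick numerical check of the formula, here is a minimal sketch using toy proportions rather than the bank's data:

import numpy as np

def entropy(proportions):
    """Shannon entropy in bits; 0 * log2(0) is treated as 0."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.000 bits: a 50/50 node, maximum uncertainty
print(entropy([0.3, 0.7]))  # 0.881 bits: a 30% default rate
print(entropy([1.0]))       # 0.000 bits: a pure node, nothing left to learn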

Information Gain

The effectiveness of a split is measured by how much it reduces entropy:

IG(S, A) = H(S) - Σ (|Sᵥ|/|S|) × H(Sᵥ)

Where A is the attribute we're splitting on, and Sᵥ represents the subset of S for each value v of attribute A.
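To make the weighting concrete, consider a toy split of 10 applications (5 defaults, 5 repaid) in which credit_score <= 650 isolates four of the defaults. The helper names below are ours, not part of the lesson's later code; entropy is the function from the sketch above:

def entropy_of_labels(y):
    """Entropy of a label array, via observed class proportions."""
    _, counts = np.unique(y, return_counts=True)
    return entropy(counts / counts.sum())

def information_gain(y_parent, y_children):
    """IG(S, A) = H(S) - sum over subsets of (|Sv|/|S|) * H(Sv)."""
    n = len(y_parent)
    weighted = sum(len(c) / n * entropy_of_labels(c) for c in y_children)
    return entropy_of_labels(y_parent) - weighted

y_parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # 5 defaults, 5 repaid
y_left   = np.array([1, 1, 1, 1])                     # credit_score <= 650: all defaulted
y_right  = np.array([1, 0, 0, 0, 0, 0])               # credit_score > 650: mostly repaid
print(information_gain(y_parent, [y_left, y_right]))  # ~0.61 bits of uncertainty removed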

Gini Impurity: An Alternative Metric

Many implementations use Gini impurity for computational efficiency:

Gini(S) = 1 - Σ p(c)²

Lower Gini values indicate purer nodes (more homogeneous loan outcomes).
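The same toy numbers under Gini, as a short sketch:

def gini(y):
    """Gini impurity computed from an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])))  # 0.42 for a 30/70 node
print(gini(np.array([0, 0, 0, 0])))                    # 0.00 for a pure node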

Part III: Building the Tree - A Step-by-Step Journey

# Step 1: Understanding our loan data structure
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load historical loan data (5 years, 50,000 applications)
loan_data = pd.read_csv('loan_applications.csv')

# Key features that determine loan risk
features = [
    'credit_score',       # FICO score (300-850)
    'annual_income',      # Annual income in USD
    'debt_to_income',     # DTI ratio (%)
    'employment_years',   # Years at current job
    'loan_amount',        # Requested loan amount
    'loan_purpose',       # Encoded: home, auto, personal, etc.
    'home_ownership',     # Encoded: rent, own, mortgage
    'previous_defaults',  # Number of past defaults
]

# Target: 1 if loan defaulted, 0 if repaid successfully
X = loan_data[features]
y = loan_data['defaulted']

# Split data: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
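One assumption worth making explicit: DecisionTreeClassifier requires numeric inputs, so the code above assumes loan_purpose and home_ownership already arrive encoded in the CSV, as the feature comments suggest. If they arrived as strings instead, a minimal encoding sketch (which would slot in right after read_csv, before X is built) could look like this:

from sklearn.preprocessing import OrdinalEncoder

# Map string categories to integer codes before training.
# Trees split on thresholds, so the arbitrary ordering this imposes
# is less harmful than it would be for a linear model.
encoder = OrdinalEncoder()
loan_data[['loan_purpose', 'home_ownership']] = encoder.fit_transform(
    loan_data[['loan_purpose', 'home_ownership']]
)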
# Step 2: Building our first decision tree
def build_loan_decision_tree(max_depth=None, min_samples_split=20):
    """
    Build a decision tree optimized for loan approval.

    Business constraints:
    - Must be interpretable (max_depth limits complexity)
    - Must handle at least 20 samples per decision (regulatory requirement)
    - Must provide probability estimates for risk scoring
    """
    # Initialize tree with business-driven parameters
    tree_model = DecisionTreeClassifier(
        criterion='gini',                     # Use Gini for speed
        max_depth=max_depth,                  # Control interpretability
        min_samples_split=min_samples_split,  # Regulatory compliance
        min_samples_leaf=10,                  # Avoid overfitting
        class_weight='balanced',              # Handle imbalanced defaults
        random_state=42
    )

    # Train the model
    tree_model.fit(X_train, y_train)

    # Calculate feature importance for business insights
    feature_importance = pd.DataFrame({
        'feature': features,
        'importance': tree_model.feature_importances_
    }).sort_values('importance', ascending=False)

    return tree_model, feature_importance

# Build our model
model, importance = build_loan_decision_tree(max_depth=5)
print("Top Risk Factors:")
print(importance.head())

Part IV: The Business Impact - Beyond Accuracy

The Cost-Sensitive Nature of Loan Decisions

Not all errors are equal in lending. Using the business parameters from the implementation below:

- Approving a loan that later defaults costs roughly $25,000 per loan in lost principal and collection costs.
- Rejecting a creditworthy applicant forfeits roughly $3,500 per loan in expected profit over the loan term.

Our decision tree must optimize for business value, not just accuracy!

# Step 3: Implementing cost-sensitive decision making
def calculate_business_impact(y_true, y_pred, y_prob):
    """
    Calculate the actual dollar impact of our loan decisions.

    Business model:
    - Average loan amount: $50,000
    - Default rate in approved loans affects profitability
    - Each default costs: principal loss + collection costs
    - Each successful loan generates: interest income over loan term

    Note: y_prob is accepted for threshold analysis (see the sketch
    below) but is not used in this label-based calculation.
    """
    # Business parameters
    avg_loan_amount = 50000   # Context for the loss/profit figures below
    default_loss = 25000      # Average loss per defaulted loan
    good_loan_profit = 3500   # Average profit per successful loan

    # Confusion matrix components (positive class = default, prediction 1 = reject)
    true_positives = np.sum((y_true == 1) & (y_pred == 1))   # Correctly rejected bad loans
    false_positives = np.sum((y_true == 0) & (y_pred == 1))  # Incorrectly rejected good loans
    true_negatives = np.sum((y_true == 0) & (y_pred == 0))   # Correctly approved good loans
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))  # Incorrectly approved bad loans

    # Calculate financial impact
    saved_from_defaults = true_positives * default_loss
    lost_opportunity = false_positives * good_loan_profit
    earned_from_good_loans = true_negatives * good_loan_profit
    losses_from_defaults = false_negatives * default_loss

    total_impact = (saved_from_defaults + earned_from_good_loans
                    - lost_opportunity - losses_from_defaults)

    return {
        'total_impact': total_impact,
        'saved_from_defaults': saved_from_defaults,
        'lost_opportunities': lost_opportunity,
        'earned_from_good_loans': earned_from_good_loans,
        'losses_from_defaults': losses_from_defaults,
        'approval_rate': (true_negatives + false_negatives) / len(y_true)
    }

# Evaluate our model's business impact
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
impact = calculate_business_impact(y_test, y_pred, y_prob)

print(f"Annual Financial Impact: ${impact['total_impact']:,.0f}")
print(f"Loan Approval Rate: {impact['approval_rate']:.1%}")
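Because each error type has a different dollar cost, the default 0.5 probability cutoff behind predict() is rarely the most profitable one. A hedged sketch of threshold tuning built on the variables above (in a production setup this sweep would run on a separate validation split, not the final test set):

# Sweep the rejection threshold on predicted default probability and
# keep the one with the highest dollar impact.
thresholds = np.linspace(0.1, 0.9, 81)
results = [
    (t, calculate_business_impact(y_test, (y_prob >= t).astype(int), y_prob)['total_impact'])
    for t in thresholds
]
best_threshold, best_impact = max(results, key=lambda r: r[1])
print(f"Most profitable rejection threshold: {best_threshold:.2f}")
print(f"Impact at that threshold: ${best_impact:,.0f}")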

Part V: Advanced Techniques - Pruning and Optimization

# Step 4: Optimizing tree complexity for production
def optimize_tree_depth(X_train, y_train, X_val, y_val):
    """
    Find the optimal tree depth that maximizes business value.

    Strategy: Balance model complexity with financial performance
    """
    depth_analysis = []

    for depth in range(2, 15):
        # Train tree with specific depth
        tree = DecisionTreeClassifier(
            max_depth=depth,
            min_samples_split=20,
            class_weight='balanced',
            random_state=42
        )
        tree.fit(X_train, y_train)

        # Evaluate on validation set
        y_pred = tree.predict(X_val)
        y_prob = tree.predict_proba(X_val)[:, 1]

        # Calculate business metrics
        impact = calculate_business_impact(y_val, y_pred, y_prob)

        # Store results
        depth_analysis.append({
            'depth': depth,
            'n_leaves': tree.get_n_leaves(),
            'financial_impact': impact['total_impact'],
            'approval_rate': impact['approval_rate']
        })

    return pd.DataFrame(depth_analysis)

# Find optimal complexity. Note: we reuse the test set as a validation set
# here for brevity; in production, tune depth on a separate validation split
# so the test set remains an unbiased final check.
optimization_results = optimize_tree_depth(X_train, y_train, X_test, y_test)
optimal_depth = optimization_results.loc[
    optimization_results['financial_impact'].idxmax(), 'depth'
]

print(f"Optimal tree depth: {optimal_depth}")
print(f"Expected annual impact: ${optimization_results['financial_impact'].max():,.0f}")
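Depth capping is only one form of complexity control. scikit-learn also supports minimal cost-complexity pruning via the ccp_alpha parameter; a minimal sketch under the same business metric (the alpha subsampling is just to keep the loop quick):

# Enumerate candidate pruning strengths for this training set.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

best = None
for alpha in path.ccp_alphas[::5]:  # subsample the alphas
    pruned = DecisionTreeClassifier(
        ccp_alpha=alpha, class_weight='balanced', random_state=42
    )
    pruned.fit(X_train, y_train)
    impact = calculate_business_impact(
        y_test, pruned.predict(X_test), pruned.predict_proba(X_test)[:, 1]
    )['total_impact']
    if best is None or impact > best[1]:
        best = (alpha, impact)

print(f"Best ccp_alpha: {best[0]:.5f}, impact: ${best[1]:,.0f}")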

Part VI: Interpretability - The Regulatory Requirement

# Step 5: Extracting human-readable decision rules
def extract_decision_path(tree_model, sample_features):
    """
    Extract the decision path for a loan application.

    Required for regulatory compliance and customer communication.
    """
    # Get the nodes visited on the way from root to leaf
    feature_names = sample_features.index
    decision_path = tree_model.decision_path([sample_features.values])
    node_indicator = decision_path.toarray()[0]

    # Extract rules from root to leaf
    rules = []
    for node_id in range(len(node_indicator)):
        if node_indicator[node_id] == 1:
            if tree_model.tree_.children_left[node_id] != -1:  # Not a leaf
                feature_id = tree_model.tree_.feature[node_id]
                threshold = tree_model.tree_.threshold[node_id]
                feature_name = feature_names[feature_id]

                # Determine which branch was taken
                if sample_features.values[feature_id] <= threshold:
                    rules.append(f"{feature_name} <= {threshold:.2f}")
                else:
                    rules.append(f"{feature_name} > {threshold:.2f}")

    # Get the prediction probability
    proba = tree_model.predict_proba([sample_features.values])[0]

    return {
        'rules': rules,
        'default_probability': proba[1],
        'decision': 'Approve' if proba[1] < 0.3 else 'Reject'
    }

# Example: Explain a specific loan decision
sample_application = X_test.iloc[0]
explanation = extract_decision_path(model, sample_application)

print("Loan Decision Explanation:")
for i, rule in enumerate(explanation['rules'], 1):
    print(f"  Step {i}: {rule}")
print(f"Default Risk: {explanation['default_probability']:.1%}")
print(f"Decision: {explanation['decision']}")
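For audit reports that cover the whole policy rather than a single applicant, scikit-learn's built-in export_text renders every rule in the fitted tree at once:

from sklearn.tree import export_text

# Dump the complete rule set of the fitted tree as indented text.
print(export_text(model, feature_names=features))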

Part VII: Ensemble Enhancement - Random Forests

From Single Tree to Forest

A single decision tree is easy to interpret, but combining many trees in a Random Forest usually improves predictive performance substantially. Aggregated feature importances preserve a coarser form of interpretability, even though individual predictions are no longer traceable to one decision path.

# Step 6: Building a Random Forest for comparison
from sklearn.ensemble import RandomForestClassifier

def build_loan_forest(n_trees=100):
    """
    Build a Random Forest model for loan approval.

    Trades some interpretability for significantly better performance.
    """
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=7,
        min_samples_split=20,
        class_weight='balanced',
        n_jobs=-1,               # Use all CPU cores
        random_state=42
    )
    forest.fit(X_train, y_train)

    # Compare single tree vs forest performance
    tree_pred = model.predict(X_test)
    forest_pred = forest.predict(X_test)

    tree_impact = calculate_business_impact(
        y_test, tree_pred, model.predict_proba(X_test)[:, 1]
    )
    forest_impact = calculate_business_impact(
        y_test, forest_pred, forest.predict_proba(X_test)[:, 1]
    )

    improvement = forest_impact['total_impact'] - tree_impact['total_impact']
    print(f"Single Tree Annual Impact: ${tree_impact['total_impact']:,.0f}")
    print(f"Random Forest Annual Impact: ${forest_impact['total_impact']:,.0f}")
    print(f"Additional Value from Forest: ${improvement:,.0f}")

    return forest

forest_model = build_loan_forest()
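The interpretability that survives the move to a forest is the aggregated feature importance mentioned above; a short sketch using the model just built:

# Impurity-based importances, averaged across all trees in the forest.
forest_importance = pd.DataFrame({
    'feature': features,
    'importance': forest_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Forest-wide risk factors:")
print(forest_importance)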

Module Summary: Decision Trees in Production

Key Achievements:

- Built an interpretable decision tree classifier whose depth, minimum split size, and class weighting reflect business and regulatory constraints.
- Replaced raw accuracy with a dollar-denominated evaluation (calculate_business_impact) that prices each error type differently.
- Tuned tree complexity against financial impact and extracted per-applicant decision paths for compliance.
- Benchmarked a Random Forest against the single tree to quantify the value of ensembling.

Critical Insights:

- Splits are chosen to maximize information gain (or minimize Gini impurity), so the tree learns its thresholds from data instead of relying on hand-picked cutoffs.
- In lending, approved defaults and rejected creditworthy applicants carry very different costs, so both the decision threshold and model selection must be cost-sensitive.
- Deeper is not better: past a certain depth, added complexity hurts both interpretability and out-of-sample financial performance.

Production Considerations:

- class_weight='balanced' and min_samples_leaf guard against imbalanced defaults and overfitting, while min_samples_split enforces the 20-sample regulatory floor.
- Model selection should use a dedicated validation split so the test set remains an unbiased estimate of production performance.
- Every automated decision must remain explainable; decision-path extraction (or export_text) provides the audit trail regulators require.

The Bottom Line

Your decision tree system has transformed Regional Bank's lending operations. By replacing rigid rules with adaptive, data-driven decisions, you've not only saved $52 million annually but also increased loan approval rates by 15% while reducing default rates by 40%. The bank can now make instant, explainable loan decisions that satisfy both customers and regulators. This is the power of decision trees - turning complex patterns into clear, actionable business rules.