๐Ÿ  Course Home ๐Ÿ“š Teaching Home

Module 4: Decision Trees for Intelligent Loan Approval


The $85 Million Problem

Regional Bank Corporation faces a crisis: their traditional loan approval process, based on rigid credit score cutoffs and manual review, is hemorrhaging money. Last year alone, they lost $50 million to defaults on approved loans and $35 million in missed revenue from rejected creditworthy applicants. You've been hired to build an interpretable, data-driven decision system that can revolutionize their lending operations.

Part I: The Paradigm Shift - From Rules to Learning

Traditional Approach: Fixed Rules

Hand-written policies (for example, "reject any application below a fixed credit score cutoff") encode analyst intuition as rigid thresholds. They are easy to audit, but they never adapt as borrower behavior changes and they cannot capture interactions between features.

Machine Learning Approach: Adaptive Trees

A learned decision tree discovers its splitting thresholds and feature interactions directly from historical outcomes, and can be retrained as the applicant population shifts, while still remaining a readable set of if/then rules.

🧠 Quick Check — Part I

What is the key advantage of machine learning decision trees over hard-coded rule systems for loan approval?

They are faster to run on hardware
They automatically discover data-driven splitting points and feature interactions
They require no data to train
They eliminate the need for regulatory compliance

Part II: Mathematical Foundations

Information Theory: The Heart of Decision Trees

Decision trees choose splits by maximizing information gain, the reduction in uncertainty about loan outcomes:

H(S) = -Σ p(c) × log₂(p(c))

Where H(S) is the entropy of the dataset S, and p(c) is the proportion of samples belonging to class c (default/no-default).
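To make the formula concrete, here is a minimal sketch that evaluates H(S) for a couple of default rates (the helper name binary_entropy is ours, not part of the module code):

```python
import numpy as np

def binary_entropy(p_default):
    """H(S) in bits for a dataset with the given default rate."""
    probs = np.array([p_default, 1.0 - p_default])
    probs = probs[probs > 0]  # convention: 0 * log2(0) = 0
    return float(-np.sum(probs * np.log2(probs)))

print(round(binary_entropy(0.5), 3))  # 1.0   -> a 50/50 node is maximally uncertain
print(round(binary_entropy(0.1), 3))  # 0.469 -> mostly one class, low uncertainty
```

Entropy peaks at a 50% default rate and falls to zero as a node becomes pure, which is the behavior the slider widget below illustrates.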

Information Gain

The effectiveness of a split is measured by how much it reduces entropy:

IG(S, A) = H(S) - Σ (|Sᵥ|/|S|) × H(Sᵥ)

Where A is the attribute we're splitting on, and Sแตฅ represents the subset of S for each value v of attribute A.
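As a worked example, consider a toy sample of eight applications (the numbers are ours, purely illustrative). A split that perfectly separates defaulters from non-defaulters recovers all of H(S) as gain:

```python
import numpy as np

def entropy(labels):
    """H(S) in bits from a vector of 0/1 default labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, split_mask):
    """IG(S, A) = H(S) minus the size-weighted entropy of the two child nodes."""
    n = len(labels)
    left, right = labels[split_mask], labels[~split_mask]
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted_child

# 1 = defaulted; candidate split: credit_score <= 650
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
credit_score = np.array([580, 610, 640, 700, 720, 740, 760, 780])
print(round(information_gain(y, credit_score <= 650), 3))  # 0.954, equal to H(S)
```

Because both child nodes are pure, the weighted child entropy is zero and the gain equals the parent's entropy; any less clean split yields a smaller gain.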

Gini Impurity: An Alternative Metric

Many implementations use Gini impurity for computational efficiency:

Gini(S) = 1 - Σ p(c)²

Lower Gini values indicate purer nodes (more homogeneous loan outcomes).
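A quick sketch of the boundary cases (the helper name gini is ours):

```python
import numpy as np

def gini(labels):
    """Gini(S) = 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

pure_node = np.zeros(10, dtype=int)   # ten approved loans, no defaults
mixed_node = np.array([0, 1] * 5)     # 50/50 defaults
print(gini(pure_node))   # 0.0 -> perfectly pure
print(gini(mixed_node))  # 0.5 -> maximum impurity for two classes
```

Like entropy, Gini is zero for a pure node and maximal at a 50/50 mix, but it avoids the logarithm, which is why scikit-learn uses it as the default splitting criterion.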

🔢 Interactive Entropy Calculator

Drag the slider to change the proportion of defaults in a loan dataset and see how entropy changes:

0% defaults (pure) to 100% defaults (pure)
Default rate: 50% | Entropy: 1.000 bits

🔵 Low entropy = pure node (easier decision) | 🔴 High entropy = mixed node (harder decision)

🧠 Quick Check — Part II

At what default rate is entropy maximized in a binary classification problem?

0% (all non-defaults)
25% defaults
50% defaults
100% (all defaults)

Part III: Building the Tree - A Step-by-Step Journey

```python
# Step 1: Understanding our loan data structure
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load historical loan data (5 years, 50,000 applications)
loan_data = pd.read_csv('loan_applications.csv')

# Key features that determine loan risk
features = [
    'credit_score',       # FICO score (300-850)
    'annual_income',      # Annual income in USD
    'debt_to_income',     # DTI ratio (%)
    'employment_years',   # Years at current job
    'loan_amount',        # Requested loan amount
    'loan_purpose',       # Encoded: home, auto, personal, etc.
    'home_ownership',     # Encoded: rent, own, mortgage
    'previous_defaults',  # Number of past defaults
]

X = loan_data[features]
y = loan_data['defaulted']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```
```python
# Step 2: Building our first decision tree
def build_loan_decision_tree(max_depth=None, min_samples_split=20):
    tree_model = DecisionTreeClassifier(
        criterion='gini',
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=10,
        class_weight='balanced',
        random_state=42
    )
    tree_model.fit(X_train, y_train)

    feature_importance = pd.DataFrame({
        'feature': features,
        'importance': tree_model.feature_importances_
    }).sort_values('importance', ascending=False)

    return tree_model, feature_importance

model, importance = build_loan_decision_tree(max_depth=5)
```

🌳 Interactive Decision Tree Builder

Click any non-leaf node to expand it. Each node shows the Gini impurity and sample counts for the loan dataset.

Node color: 🟢 nearly pure (Gini < 0.3) | 🟡 mixed (Gini 0.3–0.45) | 🔴 highly impure (Gini > 0.45)

🧠 Quick Check — Part III

When building a decision tree for loan approval, which split would be chosen at the root node?

The split with the highest entropy after split
The split that maximizes information gain (minimizes weighted child entropy)
The split on the first feature alphabetically
A random split to ensure unbiased training

Part IV: The Business Impact - Beyond Accuracy

The Cost-Sensitive Nature of Loan Decisions

Not all errors are equal in lending: approving a loan that later defaults costs the bank roughly $25,000 on average, while rejecting a creditworthy applicant forgoes only about $3,500 in profit. The asymmetry is greater than 7 to 1.

Our decision tree must optimize for business value, not just accuracy!

```python
# Step 3: Implementing cost-sensitive decision making
def calculate_business_impact(y_true, y_pred, y_prob):
    default_loss = 25000      # average loss when an approved loan defaults
    good_loan_profit = 3500   # average profit on a repaid loan

    # Convention: class 1 = default, so a "positive" prediction is a rejection
    true_positives = np.sum((y_true == 1) & (y_pred == 1))   # bad loans rejected
    false_positives = np.sum((y_true == 0) & (y_pred == 1))  # good loans rejected
    true_negatives = np.sum((y_true == 0) & (y_pred == 0))   # good loans approved
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))  # bad loans approved

    total_impact = (true_positives * default_loss        # losses avoided
                    + true_negatives * good_loan_profit  # profit earned
                    - false_positives * good_loan_profit # profit forgone
                    - false_negatives * default_loss)    # losses incurred

    return {
        'total_impact': total_impact,
        'approval_rate': (true_negatives + false_negatives) / len(y_true)
    }
```

🔲 Interactive Confusion Matrix

Adjust the decision threshold and see how the confusion matrix and business metrics change:

More Rejections (low threshold) to More Approvals (high threshold)
Decision Threshold: 0.30 (approve if P(default) < 0.30)

                         Predicted Approved (0)   Predicted Rejected (1)
Actual No Default (0)           7,200                      800
Actual Default (1)                300                      700

Accuracy: 87.8% | Financial Impact: +$32.4M | Approval Rate: 83.3%
(metrics computed from the matrix above using calculate_business_impact from Step 3)
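The threshold behaviour can be reproduced directly from predict_proba. Below is a self-contained sketch on synthetic data (the helper approve_with_threshold, the toy dataset, and the noise model are ours, not part of the bank's pipeline):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def approve_with_threshold(model, X, threshold=0.30):
    """Return 1 (reject) where P(default) >= threshold, else 0 (approve)."""
    p_default = model.predict_proba(X)[:, 1]
    return (p_default >= threshold).astype(int)

# Toy world: one feature (credit score); lower scores default more often
rng = np.random.default_rng(0)
X = rng.uniform(300, 850, size=(500, 1))
y = ((X[:, 0] + rng.normal(0, 100, size=500)) < 600).astype(int)

clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

strict = approve_with_threshold(clf, X, threshold=0.10)   # reject at 10% risk
lenient = approve_with_threshold(clf, X, threshold=0.90)  # reject only near-certain defaults
print(strict.sum() >= lenient.sum())  # True: lowering the threshold can only add rejections
```

Sweeping the threshold trades approval volume against default risk, which is exactly what the financial-impact metric above is meant to arbitrate.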

🧠 Quick Check — Part IV

In a loan approval model, a False Negative means:

Rejecting a creditworthy applicant
Approving a loan that later defaults
Correctly approving a good loan
Correctly rejecting a bad loan

Part V: Advanced Techniques - Pruning and Optimization

```python
# Step 4: Optimizing tree complexity for production
def optimize_tree_depth(X_train, y_train, X_val, y_val):
    depth_analysis = []
    for depth in range(2, 15):
        tree = DecisionTreeClassifier(
            max_depth=depth,
            min_samples_split=20,
            class_weight='balanced',
            random_state=42
        )
        tree.fit(X_train, y_train)
        y_pred = tree.predict(X_val)
        impact = calculate_business_impact(
            y_val, y_pred, tree.predict_proba(X_val)[:, 1]
        )
        depth_analysis.append({
            'depth': depth,
            'n_leaves': tree.get_n_leaves(),
            'financial_impact': impact['total_impact']
        })
    return pd.DataFrame(depth_analysis)
```

๐ŸŽš๏ธ Max Depth Explorer: Bias-Variance Tradeoff

Use the slider to see how tree depth affects training accuracy, test accuracy, and model complexity:

Depth 1 (very simple) to Depth 10 (complex)
Max Depth = 3
Train Accuracy: 82.1% | Test Accuracy: 79.4% | Tree Leaves: 8 | Status: Good Fit

🧠 Quick Check — Part V

A decision tree with max_depth=20 on a training dataset of 10,000 samples will likely suffer from:

Underfitting โ€” too simple to capture patterns
Overfitting โ€” memorizing training data but failing on new data
Slow training โ€” too many parameters to optimize
Class imbalance โ€” unequal leaf node sizes

Part VI: Interpretability - The Regulatory Requirement

```python
# Step 5: Extracting human-readable decision rules
def extract_decision_path(tree_model, sample_features):
    """Trace one applicant (a pandas Series of feature values) through the tree."""
    feature_names = sample_features.index
    sample = sample_features.values.reshape(1, -1)  # sklearn expects a 2D array

    decision_path = tree_model.decision_path(sample)
    node_indicator = decision_path.toarray()[0]

    rules = []
    for node_id in range(len(node_indicator)):
        if node_indicator[node_id] == 1:
            # children_left == -1 marks a leaf; only internal nodes carry rules
            if tree_model.tree_.children_left[node_id] != -1:
                feature_id = tree_model.tree_.feature[node_id]
                threshold = tree_model.tree_.threshold[node_id]
                feature_name = feature_names[feature_id]
                if sample_features.values[feature_id] <= threshold:
                    rules.append(f"{feature_name} <= {threshold:.2f}")
                else:
                    rules.append(f"{feature_name} > {threshold:.2f}")

    proba = tree_model.predict_proba(sample)[0]
    return {
        'rules': rules,
        'default_probability': proba[1],
        'decision': 'Approve' if proba[1] < 0.3 else 'Reject'
    }
```

๐Ÿฆ Interactive Loan Approval Explorer

Enter a loan application and see how the decision tree evaluates it step by step:

🧠 Quick Check — Part VI

Why is interpretability particularly important for loan approval decision trees compared to, say, an image classifier?

Image classifiers are more accurate so they don't need explanation
Decision trees are too complex for regulators to review
Financial regulations require explainable decisions for applicant rights and regulatory audits
Loan amounts are too small to warrant complex explanations

Part VII: Ensemble Enhancement - Random Forests

From Single Tree to Forest

While a single decision tree provides interpretability, combining multiple trees through Random Forests can dramatically improve performance while maintaining reasonable interpretability through feature importance analysis.

```python
# Step 6: Building a Random Forest for comparison
from sklearn.ensemble import RandomForestClassifier

def build_loan_forest(n_trees=100):
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=7,
        min_samples_split=20,
        class_weight='balanced',
        n_jobs=-1,
        random_state=42
    )
    forest.fit(X_train, y_train)

    tree_pred = model.predict(X_test)
    forest_pred = forest.predict(X_test)

    tree_impact = calculate_business_impact(
        y_test, tree_pred, model.predict_proba(X_test)[:, 1]
    )
    forest_impact = calculate_business_impact(
        y_test, forest_pred, forest.predict_proba(X_test)[:, 1]
    )

    improvement = forest_impact['total_impact'] - tree_impact['total_impact']
    print(f"Additional Value from Forest: ${improvement:,.0f}")
    return forest

forest_model = build_loan_forest()
```

🌲 Single Tree vs Random Forest: Performance Comparison

Select the number of trees in the forest and see how ensemble size affects performance:

1 tree to 200 trees
Forest Size: 10 trees
Single Tree AUC: 0.81 | Forest AUC: 0.89 | Improvement: +$2.8M/yr

🧠 Quick Check — Part VII

What is the primary tradeoff when switching from a single decision tree to a Random Forest for loan approval?

Training speed vs accuracy (forests are much faster)
Interpretability vs performance (forests are harder to explain but more accurate)
Memory usage vs feature selection (forests use less memory)
Data requirements vs bias (forests need less data)

Module Summary: Decision Trees in Production

Key Achievements:

- Framed loan approval as a supervised classification problem and trained an interpretable decision tree on historical applications
- Replaced raw accuracy with a cost-sensitive financial-impact metric that reflects the true cost of each error type
- Tuned tree depth against the bias-variance tradeoff and benchmarked the single tree against a Random Forest ensemble

Critical Insights:

- Splits are chosen to maximize information gain, or equivalently to minimize weighted child impurity
- A false negative (an approved loan that defaults) costs far more than a false positive (a rejected creditworthy applicant), so the decision threshold should be set by business impact rather than fixed at 0.5
- Deeper trees fit the training data better but generalize worse beyond a point

Production Considerations:

- Financial regulations demand explainable decisions, so every automated approval or rejection must come with its extracted decision path
- Ensembles improve accuracy at the cost of interpretability; feature importance analysis recovers part of the explanation
- Models should be retrained periodically as applicant populations and economic conditions shift

The Bottom Line

Your decision tree system has transformed Regional Bank's lending operations. By replacing rigid rules with adaptive, data-driven decisions, you've not only saved $52 million annually but also increased loan approval rates by 15% while reducing default rates by 40%. The bank can now make instant, explainable loan decisions that satisfy both customers and regulators. This is the power of decision trees: turning complex patterns into clear, actionable business rules.