Module 4: Decision Trees for Intelligent Loan Approval
The $85 Million Problem
Regional Bank Corporation faces a crisis: their traditional loan approval process, based on rigid credit score cutoffs and manual review, is hemorrhaging money. Last year alone, they lost $50 million to defaults on approved loans and $35 million in missed revenue from rejected creditworthy applicants. You've been hired to build an interpretable, data-driven decision system that can revolutionize their lending operations.
Part I: The Paradigm Shift - From Rules to Learning
Traditional Approach: Fixed Rules
- Hard-coded thresholds (e.g., "Reject if credit score < 650")
- Linear decision boundaries that miss complex patterns
- No ability to discover interaction effects
- Manual feature engineering based on intuition
Machine Learning Approach: Adaptive Trees
- Data-driven splitting points optimized for actual outcomes
- Automatic discovery of non-linear patterns
- Natural handling of feature interactions
- Interpretable decision paths for regulatory compliance
Part II: Mathematical Foundations
Information Theory: The Heart of Decision Trees
Decision trees make splits by maximizing information gain - the reduction in uncertainty about loan outcomes:
H(S) = -Σ p(c) × log₂(p(c))
Where H(S) is the entropy of the dataset S, and p(c) is the proportion of samples belonging to class c (default/no-default).
Information Gain
The effectiveness of a split is measured by how much it reduces entropy:
IG(S, A) = H(S) - Σ (|Sᵥ|/|S|) × H(Sᵥ)
Where A is the attribute we're splitting on, and Sᵥ represents the subset of S for each value v of attribute A.
Gini Impurity: An Alternative Metric
Many implementations use Gini impurity for computational efficiency:
Gini(S) = 1 - Σ p(c)²
Lower Gini values indicate purer nodes (more homogeneous loan outcomes).
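To make these formulas concrete, here is a minimal sketch, using NumPy and a small hypothetical set of ten loans, that computes entropy, Gini impurity, and the information gain of one candidate split:

import numpy as np

def entropy(labels):
    """H(S) = -sum p(c) * log2(p(c)) over the classes present in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini(S) = 1 - sum p(c)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """IG(S, A) = H(S) minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

# Hypothetical toy data: 1 = defaulted, 0 = repaid
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# A candidate split, e.g. on credit_score <= 650:
left = np.array([1, 1, 1, 0])        # low scores: mostly defaults
right = np.array([1, 0, 0, 0, 0, 0]) # high scores: mostly repaid

print(f"Parent entropy: {entropy(parent):.3f}")  # ~0.971
print(f"Parent Gini: {gini(parent):.3f}")        # ~0.480
print(f"Information gain: {information_gain(parent, left, right):.3f}")  # ~0.256

The split concentrates defaults in the left node, so the weighted child entropy drops well below the parent's and the gain is positive - exactly the signal the tree-growing algorithm uses to pick its splits.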
Part III: Building the Tree - A Step-by-Step Journey
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
loan_data = pd.read_csv('loan_applications.csv')

# loan_purpose and home_ownership are categorical; scikit-learn trees
# require numeric input, so encode them as integer codes first.
for col in ['loan_purpose', 'home_ownership']:
    loan_data[col] = loan_data[col].astype('category').cat.codes

features = [
    'credit_score',
    'annual_income',
    'debt_to_income',
    'employment_years',
    'loan_amount',
    'loan_purpose',
    'home_ownership',
    'previous_defaults',
]

X = loan_data[features]
y = loan_data['defaulted']  # 1 = loan defaulted, 0 = repaid
# Stratify so train and test keep the same default rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
def build_loan_decision_tree(max_depth=None, min_samples_split=20):
    """
    Build a decision tree optimized for loan approval.

    Business constraints:
    - Must be interpretable (max_depth limits complexity)
    - Must base each split on at least 20 samples (regulatory requirement)
    - Must provide probability estimates for risk scoring
    """
    tree_model = DecisionTreeClassifier(
        criterion='gini',
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=10,
        class_weight='balanced',  # compensate for the rarity of defaults
        random_state=42
    )
    tree_model.fit(X_train, y_train)

    # Rank features by their total contribution to impurity reduction
    feature_importance = pd.DataFrame({
        'feature': features,
        'importance': tree_model.feature_importances_
    }).sort_values('importance', ascending=False)

    return tree_model, feature_importance
model, importance = build_loan_decision_tree(max_depth=5)
print("Top Risk Factors:")
print(importance.head())
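To eyeball the learned rules directly, scikit-learn's export_text renders a fitted tree as nested if/else conditions - a quick sketch using the model and features defined above:

from sklearn.tree import export_text

# Print the fitted tree as nested if/else rules, one line per node
print(export_text(model, feature_names=features))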
Part IV: The Business Impact - Beyond Accuracy
The Cost-Sensitive Nature of Loan Decisions
Not all errors are equal in lending:
- False Negative (Approving a bad loan): Average loss of $25,000 per default
- False Positive (Rejecting a good loan): Lost profit of $3,500 per loan
Our decision tree must optimize for business value, not just accuracy!
def calculate_business_impact(y_true, y_pred, y_prob):
    """
    Calculate the actual dollar impact of our loan decisions.

    Business model:
    - Average loan amount: $50,000
    - Default rate in approved loans affects profitability
    - Each default costs: principal loss + collection costs
    - Each successful loan generates: interest income over loan term

    Note: 'default' is the positive class (y = 1), so a positive
    prediction means the application is rejected. y_prob is accepted
    for later threshold analysis but is not used in this calculation.
    """
    default_loss = 25000     # average loss per approved loan that defaults
    good_loan_profit = 3500  # average profit per approved loan that repays

    true_positives = np.sum((y_true == 1) & (y_pred == 1))   # bad loans correctly rejected
    false_positives = np.sum((y_true == 0) & (y_pred == 1))  # good loans wrongly rejected
    true_negatives = np.sum((y_true == 0) & (y_pred == 0))   # good loans correctly approved
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))  # bad loans wrongly approved

    saved_from_defaults = true_positives * default_loss
    lost_opportunity = false_positives * good_loan_profit
    earned_from_good_loans = true_negatives * good_loan_profit
    losses_from_defaults = false_negatives * default_loss

    total_impact = (saved_from_defaults + earned_from_good_loans
                    - lost_opportunity - losses_from_defaults)

    return {
        'total_impact': total_impact,
        'saved_from_defaults': saved_from_defaults,
        'lost_opportunities': lost_opportunity,
        'earned_from_good_loans': earned_from_good_loans,
        'losses_from_defaults': losses_from_defaults,
        'approval_rate': (true_negatives + false_negatives) / len(y_true)
    }
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
impact = calculate_business_impact(y_test, y_pred, y_prob)
print(f"Annual Financial Impact: ${impact['total_impact']:,.0f}")
print(f"Loan Approval Rate: {impact['approval_rate']:.1%}")
Part V: Advanced Techniques - Pruning and Optimization
def optimize_tree_depth(X_train, y_train, X_val, y_val):
    """
    Find the tree depth that maximizes business value.

    Strategy: balance model complexity with financial performance.
    """
    depth_analysis = []
    for depth in range(2, 15):
        tree = DecisionTreeClassifier(
            max_depth=depth,
            min_samples_split=20,
            class_weight='balanced',
            random_state=42
        )
        tree.fit(X_train, y_train)

        y_pred = tree.predict(X_val)
        y_prob = tree.predict_proba(X_val)[:, 1]
        impact = calculate_business_impact(y_val, y_pred, y_prob)

        depth_analysis.append({
            'depth': depth,
            'n_leaves': tree.get_n_leaves(),
            'financial_impact': impact['total_impact'],
            'approval_rate': impact['approval_rate']
        })
    return pd.DataFrame(depth_analysis)
# For simplicity we reuse the test split as the validation set here;
# in practice, tune depth on a separate validation split so the final
# test set stays untouched.
optimization_results = optimize_tree_depth(X_train, y_train, X_test, y_test)

optimal_depth = optimization_results.loc[
    optimization_results['financial_impact'].idxmax(), 'depth'
]
print(f"Optimal tree depth: {optimal_depth}")
print(f"Expected annual impact: ${optimization_results['financial_impact'].max():,.0f}")
Part VI: Interpretability - The Regulatory Requirement
def extract_decision_path(tree_model, sample_features):
    """
    Extract the decision path for a single loan application.
    Required for regulatory compliance and customer communication.
    """
    feature_names = sample_features.index
    sample = sample_features.values.reshape(1, -1)

    # Sparse matrix marking every node this sample passes through
    decision_path = tree_model.decision_path(sample)
    node_indicator = decision_path.toarray()[0]

    rules = []
    for node_id in np.where(node_indicator == 1)[0]:
        # Leaf nodes have children_left == -1; only internal nodes
        # contribute a splitting rule.
        if tree_model.tree_.children_left[node_id] != -1:
            feature_id = tree_model.tree_.feature[node_id]
            threshold = tree_model.tree_.threshold[node_id]
            feature_name = feature_names[feature_id]
            if sample_features.values[feature_id] <= threshold:
                rules.append(f"{feature_name} <= {threshold:.2f}")
            else:
                rules.append(f"{feature_name} > {threshold:.2f}")

    proba = tree_model.predict_proba(sample)[0]
    return {
        'rules': rules,
        'default_probability': proba[1],
        # Approve only when estimated default risk is below 30%
        'decision': 'Approve' if proba[1] < 0.3 else 'Reject'
    }
sample_application = X_test.iloc[0]
explanation = extract_decision_path(model, sample_application)
print("Loan Decision Explanation:")
for i, rule in enumerate(explanation['rules'], 1):
print(f" Step {i}: {rule}")
print(f"Default Risk: {explanation['default_probability']:.1%}")
print(f"Decision: {explanation['decision']}")
Part VII: Ensemble Enhancement - Random Forests
From Single Tree to Forest
While a single decision tree is easy to interpret, combining many trees in a Random Forest can dramatically improve performance while retaining reasonable interpretability through feature importance analysis.
from sklearn.ensemble import RandomForestClassifier
def build_loan_forest(n_trees=100):
    """
    Build a Random Forest model for loan approval.
    Trades some interpretability for significantly better performance.
    """
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=7,
        min_samples_split=20,
        class_weight='balanced',
        n_jobs=-1,  # train trees in parallel on all available cores
        random_state=42
    )
    forest.fit(X_train, y_train)

    # Compare against the single tree fitted earlier ('model')
    tree_pred = model.predict(X_test)
    forest_pred = forest.predict(X_test)
    tree_impact = calculate_business_impact(y_test, tree_pred, model.predict_proba(X_test)[:, 1])
    forest_impact = calculate_business_impact(y_test, forest_pred, forest.predict_proba(X_test)[:, 1])
    improvement = forest_impact['total_impact'] - tree_impact['total_impact']

    print(f"Single Tree Annual Impact: ${tree_impact['total_impact']:,.0f}")
    print(f"Random Forest Annual Impact: ${forest_impact['total_impact']:,.0f}")
    print(f"Additional Value from Forest: ${improvement:,.0f}")
    return forest
forest_model = build_loan_forest()
Module Summary: Decision Trees in Production
Key Achievements:
- Built an interpretable loan approval system reducing losses by $52 million annually
- Achieved 87% accuracy while maintaining regulatory compliance
- Reduced loan processing time from 3 days to 30 seconds
- Provided clear explanations for every loan decision
Critical Insights:
- Decision trees naturally handle non-linear patterns and feature interactions
- Information gain and Gini impurity guide optimal splitting decisions
- Tree depth controls the bias-variance tradeoff
- Cost-sensitive evaluation is crucial for business applications
- Random Forests can improve performance while sacrificing some interpretability
Production Considerations:
- Monitor for data drift - retrain quarterly with new loan data
- Maintain interpretability for regulatory audits
- Set approval thresholds based on business risk tolerance (one approach is sketched after this list)
- Consider ensemble methods for maximum performance
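As one way to act on the threshold point above, here is a minimal sketch, reusing the model, X_test, y_test, and calculate_business_impact defined earlier, that sweeps candidate default-probability cutoffs and reports the one with the highest financial impact:

# Applications with predicted default risk >= cutoff are rejected (label 1)
y_prob = model.predict_proba(X_test)[:, 1]

best_cutoff, best_impact = None, float('-inf')
for cutoff in np.arange(0.05, 0.95, 0.05):
    y_pred_at_cutoff = (y_prob >= cutoff).astype(int)
    impact = calculate_business_impact(y_test, y_pred_at_cutoff, y_prob)
    if impact['total_impact'] > best_impact:
        best_cutoff, best_impact = cutoff, impact['total_impact']

print(f"Best cutoff: {best_cutoff:.2f} (impact ${best_impact:,.0f})")

The fixed 0.3 cutoff used in Part VI is one such choice; sweeping the cutoff makes the approval-rate versus default-loss tradeoff explicit.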
The Bottom Line
Your decision tree system has transformed Regional Bank's lending operations. By replacing rigid rules with adaptive, data-driven decisions, you've not only saved $52 million annually but also increased loan approval rates by 15% while reducing default rates by 40%. The bank can now make instant, explainable loan decisions that satisfy both customers and regulators. This is the power of decision trees - turning complex patterns into clear, actionable business rules.