Module 4: Decision Trees for Intelligent Loan Approval
The $85 Million Problem
Regional Bank Corporation faces a crisis: their traditional loan approval process, based on rigid credit score cutoffs and manual review, is hemorrhaging money. Last year alone, they lost $50 million to defaults on approved loans and $35 million in missed revenue from rejected creditworthy applicants. You've been hired to build an interpretable, data-driven decision system that can revolutionize their lending operations.
Part I: The Paradigm Shift - From Rules to Learning
Traditional Approach: Fixed Rules
- Hard-coded thresholds (e.g., "Reject if credit score < 650")
- Linear decision boundaries that miss complex patterns
- No ability to discover interaction effects
- Manual feature engineering based on intuition
Machine Learning Approach: Adaptive Trees
- Data-driven splitting points optimized for actual outcomes
- Automatic discovery of non-linear patterns
- Natural handling of feature interactions
- Interpretable decision paths for regulatory compliance
Part II: Mathematical Foundations
Information Theory: The Heart of Decision Trees
Decision trees make splits by maximizing information gain - the reduction in uncertainty about loan outcomes:
H(S) = -Σ p(c) × log₂(p(c))
Where H(S) is the entropy of the dataset S, and p(c) is the proportion of samples belonging to class c (default/no-default).
Information Gain
The effectiveness of a split is measured by how much it reduces entropy:
IG(S, A) = H(S) - Σ (|Sᵥ|/|S|) × H(Sᵥ)
Where A is the attribute we're splitting on, and Sᵥ represents the subset of S for each value v of attribute A.
Gini Impurity: An Alternative Metric
Many implementations use Gini impurity for computational efficiency:
Gini(S) = 1 - Σ p(c)²
Lower Gini values indicate purer nodes (more homogeneous loan outcomes).
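To make these formulas concrete, here is a minimal sketch, using NumPy and a small hypothetical set of ten loans, that computes entropy, Gini impurity, and the information gain of one candidate split:

import numpy as np

def entropy(labels):
    """H(S) = -sum p(c) * log2(p(c)) over the classes present in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini(S) = 1 - sum p(c)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """IG(S, A) = H(S) minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

# Hypothetical toy data: 1 = defaulted, 0 = repaid
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# A candidate split, e.g. on credit_score <= 650:
left = np.array([1, 1, 1, 0])        # low scores: mostly defaults
right = np.array([1, 0, 0, 0, 0, 0]) # high scores: mostly repaid

print(f"Parent entropy: {entropy(parent):.3f}")  # ~0.971
print(f"Parent Gini: {gini(parent):.3f}")        # ~0.480
print(f"Information gain: {information_gain(parent, left, right):.3f}")  # ~0.256

The split concentrates defaults in the left node, so the weighted child entropy drops well below the parent's and the gain is positive - exactly the signal the tree-growing algorithm uses to pick its splits.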
Part III: Building the Tree - A Step-by-Step Journey
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
loan_data = pd.read_csv('loan_applications.csv')

# loan_purpose and home_ownership are categorical; scikit-learn trees
# require numeric input, so encode them as integer codes first.
for col in ['loan_purpose', 'home_ownership']:
    loan_data[col] = loan_data[col].astype('category').cat.codes

features = [
    'credit_score',
    'annual_income',
    'debt_to_income',
    'employment_years',
    'loan_amount',
    'loan_purpose',
    'home_ownership',
    'previous_defaults',
]

X = loan_data[features]
y = loan_data['defaulted']  # 1 = loan defaulted, 0 = repaid
# Stratify so train and test keep the same default rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
def build_loan_decision_tree(max_depth=None, min_samples_split=20):
    """
    Build a decision tree optimized for loan approval.

    Business constraints:
    - Must be interpretable (max_depth limits complexity)
    - Must base each split on at least 20 samples (regulatory requirement)
    - Must provide probability estimates for risk scoring
    """
    tree_model = DecisionTreeClassifier(
        criterion='gini',
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=10,
        class_weight='balanced',  # compensate for the rarity of defaults
        random_state=42
    )
    tree_model.fit(X_train, y_train)

    # Rank features by their total contribution to impurity reduction
    feature_importance = pd.DataFrame({
        'feature': features,
        'importance': tree_model.feature_importances_
    }).sort_values('importance', ascending=False)

    return tree_model, feature_importance
model, importance = build_loan_decision_tree(max_depth=5)
print("Top Risk Factors:")
print(importance.head())
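To eyeball the learned rules directly, scikit-learn's export_text renders a fitted tree as nested if/else conditions - a quick sketch using the model and features defined above:

from sklearn.tree import export_text

# Print the fitted tree as nested if/else rules, one line per node
print(export_text(model, feature_names=features))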
Part IV: The Business Impact - Beyond Accuracy
The Cost-Sensitive Nature of Loan Decisions
Not all errors are equal in lending:
- False Negative (Approving a bad loan): Average loss of $25,000 per default
- False Positive (Rejecting a good loan): Lost profit of $3,500 per loan
Our decision tree must optimize for business value, not just accuracy!
def calculate_business_impact(y_true, y_pred, y_prob):
    """
    Calculate the actual dollar impact of our loan decisions.

    Business model:
    - Average loan amount: $50,000
    - Default rate in approved loans affects profitability
    - Each default costs: principal loss + collection costs
    - Each successful loan generates: interest income over loan term

    Note: 'default' is the positive class (y = 1), so a positive
    prediction means the application is rejected. y_prob is accepted
    for later threshold analysis but is not used in this calculation.
    """
    default_loss = 25000     # average loss per approved loan that defaults
    good_loan_profit = 3500  # average profit per approved loan that repays

    true_positives = np.sum((y_true == 1) & (y_pred == 1))   # bad loans correctly rejected
    false_positives = np.sum((y_true == 0) & (y_pred == 1))  # good loans wrongly rejected
    true_negatives = np.sum((y_true == 0) & (y_pred == 0))   # good loans correctly approved
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))  # bad loans wrongly approved

    saved_from_defaults = true_positives * default_loss
    lost_opportunity = false_positives * good_loan_profit
    earned_from_good_loans = true_negatives * good_loan_profit
    losses_from_defaults = false_negatives * default_loss

    total_impact = (saved_from_defaults + earned_from_good_loans
                    - lost_opportunity - losses_from_defaults)

    return {
        'total_impact': total_impact,
        'saved_from_defaults': saved_from_defaults,
        'lost_opportunities': lost_opportunity,
        'earned_from_good_loans': earned_from_good_loans,
        'losses_from_defaults': losses_from_defaults,
        'approval_rate': (true_negatives + false_negatives) / len(y_true)
    }
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
impact = calculate_business_impact(y_test, y_pred, y_prob)
print(f"Annual Financial Impact: ${impact['total_impact']:,.0f}")
print(f"Loan Approval Rate: {impact['approval_rate']:.1%}")
Part V: Advanced Techniques - Pruning and Optimization
def optimize_tree_depth(X_train, y_train, X_val, y_val):
    """
    Find the tree depth that maximizes business value.

    Strategy: balance model complexity with financial performance.
    """
    depth_analysis = []
    for depth in range(2, 15):
        tree = DecisionTreeClassifier(
            max_depth=depth,
            min_samples_split=20,
            class_weight='balanced',
            random_state=42
        )
        tree.fit(X_train, y_train)

        y_pred = tree.predict(X_val)
        y_prob = tree.predict_proba(X_val)[:, 1]
        impact = calculate_business_impact(y_val, y_pred, y_prob)

        depth_analysis.append({
            'depth': depth,
            'n_leaves': tree.get_n_leaves(),
            'financial_impact': impact['total_impact'],
            'approval_rate': impact['approval_rate']
        })
    return pd.DataFrame(depth_analysis)
# For simplicity we reuse the test split as the validation set here;
# in practice, tune depth on a separate validation split so the final
# test set stays untouched.
optimization_results = optimize_tree_depth(X_train, y_train, X_test, y_test)

optimal_depth = optimization_results.loc[
    optimization_results['financial_impact'].idxmax(), 'depth'
]
print(f"Optimal tree depth: {optimal_depth}")
print(f"Expected annual impact: ${optimization_results['financial_impact'].max():,.0f}")
Part VI: Interpretability - The Regulatory Requirement
def extract_decision_path(tree_model, sample_features):
    """
    Extract the decision path for a single loan application.
    Required for regulatory compliance and customer communication.
    """
    feature_names = sample_features.index
    sample = sample_features.values.reshape(1, -1)

    # Sparse matrix marking every node this sample passes through
    decision_path = tree_model.decision_path(sample)
    node_indicator = decision_path.toarray()[0]

    rules = []
    for node_id in np.where(node_indicator == 1)[0]:
        # Leaf nodes have children_left == -1; only internal nodes
        # contribute a splitting rule.
        if tree_model.tree_.children_left[node_id] != -1:
            feature_id = tree_model.tree_.feature[node_id]
            threshold = tree_model.tree_.threshold[node_id]
            feature_name = feature_names[feature_id]
            if sample_features.values[feature_id] <= threshold:
                rules.append(f"{feature_name} <= {threshold:.2f}")
            else:
                rules.append(f"{feature_name} > {threshold:.2f}")

    proba = tree_model.predict_proba(sample)[0]
    return {
        'rules': rules,
        'default_probability': proba[1],
        # Approve only when estimated default risk is below 30%
        'decision': 'Approve' if proba[1] < 0.3 else 'Reject'
    }
sample_application = X_test.iloc[0]
explanation = extract_decision_path(model, sample_application)
print("Loan Decision Explanation:")
for i, rule in enumerate(explanation['rules'], 1):
print(f" Step {i}: {rule}")
print(f"Default Risk: {explanation['default_probability']:.1%}")
print(f"Decision: {explanation['decision']}")
Part VII: Ensemble Enhancement - Random Forests
From Single Tree to Forest
While a single decision tree is easy to interpret, combining many trees in a Random Forest can dramatically improve performance while retaining reasonable interpretability through feature importance analysis.
from sklearn.ensemble import RandomForestClassifier
def build_loan_forest(n_trees=100):
    """
    Build a Random Forest model for loan approval.
    Trades some interpretability for significantly better performance.
    """
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=7,
        min_samples_split=20,
        class_weight='balanced',
        n_jobs=-1,  # train trees in parallel on all available cores
        random_state=42
    )
    forest.fit(X_train, y_train)

    # Compare against the single tree fitted earlier ('model')
    tree_pred = model.predict(X_test)
    forest_pred = forest.predict(X_test)
    tree_impact = calculate_business_impact(y_test, tree_pred, model.predict_proba(X_test)[:, 1])
    forest_impact = calculate_business_impact(y_test, forest_pred, forest.predict_proba(X_test)[:, 1])
    improvement = forest_impact['total_impact'] - tree_impact['total_impact']

    print(f"Single Tree Annual Impact: ${tree_impact['total_impact']:,.0f}")
    print(f"Random Forest Annual Impact: ${forest_impact['total_impact']:,.0f}")
    print(f"Additional Value from Forest: ${improvement:,.0f}")
    return forest
forest_model = build_loan_forest()
Module Summary: Decision Trees in Production
Key Achievements:
- Built an interpretable loan approval system reducing losses by $52 million annually
- Achieved 87% accuracy while maintaining regulatory compliance
- Reduced loan processing time from 3 days to 30 seconds
- Provided clear explanations for every loan decision
Critical Insights:
- Decision trees naturally handle non-linear patterns and feature interactions
- Information gain and Gini impurity guide optimal splitting decisions
- Tree depth controls the bias-variance tradeoff
- Cost-sensitive evaluation is crucial for business applications
- Random Forests can improve performance while sacrificing some interpretability
Production Considerations:
- Monitor for data drift - retrain quarterly with new loan data
- Maintain interpretability for regulatory audits
- Set approval thresholds based on business risk tolerance (one approach is sketched after this list)
- Consider ensemble methods for maximum performance
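As one way to act on the threshold point above, here is a minimal sketch, reusing the model, X_test, y_test, and calculate_business_impact defined earlier, that sweeps candidate default-probability cutoffs and reports the one with the highest financial impact:

# Applications with predicted default risk >= cutoff are rejected (label 1)
y_prob = model.predict_proba(X_test)[:, 1]

best_cutoff, best_impact = None, float('-inf')
for cutoff in np.arange(0.05, 0.95, 0.05):
    y_pred_at_cutoff = (y_prob >= cutoff).astype(int)
    impact = calculate_business_impact(y_test, y_pred_at_cutoff, y_prob)
    if impact['total_impact'] > best_impact:
        best_cutoff, best_impact = cutoff, impact['total_impact']

print(f"Best cutoff: {best_cutoff:.2f} (impact ${best_impact:,.0f})")

The fixed 0.3 cutoff used in Part VI is one such choice; sweeping the cutoff makes the approval-rate versus default-loss tradeoff explicit.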
The Bottom Line
Your decision tree system has transformed Regional Bank's lending operations. By replacing rigid rules with adaptive, data-driven decisions, you've not only saved $52 million annually but also increased loan approval rates by 15% while reducing default rates by 40%. The bank can now make instant, explainable loan decisions that satisfy both customers and regulators. This is the power of decision trees - turning complex patterns into clear, actionable business rules.