🚀 Module 1 Refined Lab: The Data Scientist's Real Journey

Experience the Actual Workflow, Setbacks, and Discoveries of Production ML

Welcome to Your First Week at DataCorp

You've just been hired as a Data Scientist at DataCorp, a mid-sized e-commerce company. This lab simulates your first real project, complete with changing requirements, data quality issues, stakeholder pressure, and the discoveries that come from careful analysis. Unlike academic exercises, you'll experience how small early decisions compound into major consequences.

Your Mission: Build a customer spending prediction model for the holiday season. Budget allocation depends on your model: $10M is at stake.

Day 1

Phase 1: The Honeymoon Period

9:00 AM: You arrive excited. Your manager sends you the project brief.

"Welcome aboard! We need a model to predict customer spending for Q4. The CEO wants 95% accuracy - she saw our competitor claiming that in their press release. You have our customer database with 2 years of history. Should be straightforward, right? Need initial results by Friday."

Translation: Unrealistic expectations, competitive pressure, tight deadline.

Your First Decision: Setting Expectations

How do you respond to the 95% accuracy requirement?

  • Promise 95%
  • Investigate First
  • Challenge Assumption

9:30 AM: First Look at the Data

Your Initial Attempt (Optimistic):
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load data - seems simple enough!
df = pd.read_csv('customers.csv')
print(f"Dataset shape: {df.shape}")
# Output: (10000, 47)  # Lots of features, great!

# Quick model - let's see a baseline (numeric columns only, so the fit runs)
X = df.drop('spending', axis=1).select_dtypes('number')
y = df['spending']

# Standard random split (a choice we'll revisit on Day 4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(f"R² Score: {model.score(X_test, y_test):.3f}")
# Output: 0.923  # Wow, so close to 95%!
10:15 AM - Reality Hits: You realize the model is using customer_id as a feature! It's memorizing customers, not learning patterns. When you remove it, R² drops to 0.710.
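
One habit that catches this early: scan for identifier-like columns, where nearly every row holds a unique value, because a model can use them to memorize individuals. A minimal sketch against the same X, before any modeling:

# Flag identifier-like columns: near-unique per row, useless for generalization
id_like = [col for col in X.columns if X[col].nunique() / len(X) > 0.95]
print(f"Identifier-like columns to drop: {id_like}")
X = X.drop(columns=id_like)
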
Day 2

Phase 2: The Data Quality Nightmare

9:00 AM: After yesterday's customer_id fiasco, you decide to properly explore the data.

Issue #1: Missing Values Hidden as Zeros

30% of customers have 0 for 'email_opens' - they're actually NULLs from before email tracking started!

Your Fix: Create a 'has_email_history' flag, then impute the disguised NULLs carefully
# Flag customers with real email history; fill the disguised NULLs
# with the median of customers who actually have tracking data
df['has_email_history'] = df['email_opens'] > 0
df.loc[~df['has_email_history'], 'email_opens'] = df.loc[df['has_email_history'], 'email_opens'].median()
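
Zeros that are really NULLs rarely hide in just one column. A zero-rate scan, sketched here assuming df is already loaded, surfaces other candidates to cross-check against when tracking actually began:

# Rank numeric columns by their share of exact zeros; suspiciously high
# rates are worth checking against system rollout dates
zero_rates = (df.select_dtypes('number') == 0).mean().sort_values(ascending=False)
print(zero_rates[zero_rates > 0.20])
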
Issue #2: Data Leakage from the Future

'total_lifetime_value' includes future purchases! It's literally the answer with noise.

Your Fix: Remove it and all features calculated after the prediction point
# Remove features that wouldn't be known at prediction time
leaked_features = ['total_lifetime_value', 'future_purchases', 'next_month_active']
X = X.drop(columns=leaked_features)
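
Leaked features tend to give themselves away by correlating almost perfectly with the target. A correlation scan like this, a sketch assuming numeric features, is a cheap audit to run before dropping anything:

# Features correlating near-perfectly with the target deserve suspicion:
# they either leak future information or are the target in disguise
correlations = X.select_dtypes('number').corrwith(y).abs().sort_values(ascending=False)
print(correlations[correlations > 0.95])
# 'total_lifetime_value' would have topped this list
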
Issue #3: Duplicate Customers

2,000 rows are duplicates with slightly different feature values - data pipeline bug!

Your Fix: Keep the most recent record per customer
# Sort by update time, then keep only the latest row for each customer
df = df.sort_values('last_updated').drop_duplicates('customer_email', keep='last')
assert df['customer_email'].is_unique  # sanity check: one row per customer

Model performance as the issues were fixed:

  • Original R²: 0.923 (inflated!)
  • After removing customer_id: 0.710 (↓ 23%)
  • After fixing leakage: 0.645 (↓ 6.5%)
  • After deduplication: 0.672 (↑ 2.7%)

2:00 PM - Manager Check-in: "How's the 95% accuracy model coming along?"

Your Response: "I found serious data quality issues. Real baseline is 67%, not 92%. But this is actually good - we now know the truth."

Manager: "The CEO won't be happy... Can you just use the original numbers for the presentation?"

Ethical Decision Point

Your manager suggests using inflated numbers for the executive presentation. What do you do?

  • Use Inflated Numbers
  • Stand Your Ground
  • Propose Alternative

Day 3

Phase 3: Feature Engineering - The Art and Science

9:00 AM: With clean data, you begin thoughtful feature engineering.

Feature Engineering Decision Log

9:15 AM: ✅ Create 'days_since_last_purchase' - customers who bought recently buy more
9:45 AM: ✅ Add 'purchase_velocity' = purchases / account_age - captures engagement rate
10:30 AM: ❌ Add polynomial features (degree=3) - overfitting immediately! Removed.
11:00 AM: ⚠️ Add interaction terms selectively - only those with business logic
11:45 AM: ✅ Create 'seasonal_buyer' flag - some customers only buy during sales (see the sketch below)
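
Building that 'seasonal_buyer' flag takes transaction-level data. A minimal sketch, assuming a hypothetical orders.csv with 'customer_email' and 'order_date' columns:

# Hypothetical orders table: one row per purchase
orders = pd.read_csv('orders.csv', parse_dates=['order_date'])

# Share of each customer's purchases that land in Q4 (holiday season)
orders['is_q4'] = orders['order_date'].dt.quarter == 4
q4_share = orders.groupby('customer_email')['is_q4'].mean()

# Flag customers who buy almost exclusively during the holidays
df['seasonal_buyer'] = df['customer_email'].map(q4_share).fillna(0) > 0.8
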
Mistake: Over-Engineering Features
# Too many features - overfitting trap!
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Features exploded from {X.shape[1]} to {X_poly.shape[1]}")
# Output: Features exploded from 23 to 2,599!

# Model performance:
# Train R²: 0.982  # Suspiciously high!
# Test R²: 0.593   # Worse than before!
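
The train/test gap above is the classic overfitting signature. Cross-validation makes it visible before you ever touch the test set; a quick check, sketched with the same exploded feature matrix:

from sklearn.model_selection import cross_val_score

# Cross-validated R² on the polynomial features: scores far below the
# training R², or high variance across folds, both signal overfitting
cv_scores = cross_val_score(LinearRegression(), X_poly, y, cv=5, scoring='r2')
print(f"CV R²: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# A training R² near 0.98 against a CV mean near 0.59 means memorization
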
Fix: Thoughtful Feature Selection
# Business-driven feature engineering
df['recency'] = (pd.Timestamp.now() - pd.to_datetime(df['last_purchase_date'])).dt.days
df['frequency'] = df['total_purchases'] / df['account_age_days'] * 365
df['monetary'] = df['total_spent'] / df['total_purchases'].clip(lower=1)
df['engagement_score'] = (
    df['email_opens'] * 0.3 + 
    df['website_visits'] * 0.5 + 
    df['app_sessions'] * 0.2
)

# Only meaningful interactions
df['premium_engagement'] = df['is_premium'] * df['engagement_score']
df['recency_frequency'] = df['recency'] * df['frequency']

# Result: 29 well-chosen features
# Train R²: 0.756
# Test R²: 0.742  # Much better generalization!
Key Discovery: Feature quality beats feature quantity. 29 thoughtful features outperform 2,599 polynomial features. The model is learning real patterns, not memorizing noise.
Day 4

Phase 4: The Overfitting Trap and Recovery

9:00 AM: Pressure mounting. CEO presentation tomorrow. You try to boost performance.

CEO Presentation in: 24 HOURS

The Desperation Spiral

10:00 AM - Desperation Move #1: Complex Model
from sklearn.ensemble import RandomForestRegressor

# "More complex model = better performance, right?"
rf_model = RandomForestRegressor(n_estimators=500, max_depth=20, min_samples_leaf=1)  # deep trees, no leaf constraint: an overfitting recipe
rf_model.fit(X_train_scaled, y_train)  # X_train_scaled: features standardized earlier

# Results:
print(f"Train R²: {rf_model.score(X_train_scaled, y_train):.3f}")  # 0.995
print(f"Test R²: {rf_model.score(X_test_scaled, y_test):.3f}")     # 0.681

# WORSE than simple linear regression!
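
Random forests come with a built-in early warning: the out-of-bag (OOB) score, computed for each tree on the bootstrap samples it never saw. A sketch of the same model with OOB enabled:

# oob_score=True gives an honest estimate without touching the test set
rf_check = RandomForestRegressor(
    n_estimators=500, max_depth=20, min_samples_leaf=1,
    oob_score=True, random_state=42
)
rf_check.fit(X_train_scaled, y_train)
print(f"Train R²: {rf_check.score(X_train_scaled, y_train):.3f}")  # near 1.0
print(f"OOB R²:   {rf_check.oob_score_:.3f}")                      # far lower: overfitting
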
11:00 AM - Desperation Move #2: Remove "Outliers"
# "Maybe outliers are hurting performance?"
Q1 = y_train.quantile(0.25)
Q3 = y_train.quantile(0.75)
IQR = Q3 - Q1

# Remove "outliers" - actually high-value customers!
mask = ~((y_train < Q1 - 1.5 * IQR) | (y_train > Q3 + 1.5 * IQR))
X_train_filtered = X_train_scaled[mask.to_numpy()]  # align the boolean mask with the scaled array
y_train_filtered = y_train[mask]

# Results: Model can't predict high spenders anymore!
# Lost ability to identify most valuable customers
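
Segment-level evaluation is what exposes this mistake: overall R² can look stable while the model quietly fails on exactly the customers the business cares about. A sketch, using a hypothetical model_filtered fitted on the trimmed data:

# Fit on the 'outlier-free' data, then score only the top-20% spenders
model_filtered = LinearRegression().fit(X_train_filtered, y_train_filtered)

top_spenders = y_test > y_test.quantile(0.8)
r2_high = model_filtered.score(X_test_scaled[top_spenders.to_numpy()], y_test[top_spenders])
print(f"High-value segment R²: {r2_high:.3f}")
# A collapse here shows the 'outliers' were the signal, not the noise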

2:00 PM - The Turning Point

Realization: You're making the model worse by chasing metrics. Step back. What does the business actually need?
3:00 PM - The Right Approach: Regularized Linear Model
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# Proper regularization with business-focused validation
param_grid = {'alpha': np.logspace(-2, 2, 50)}

# Custom scorer: weighted by customer value
def business_scorer(y_true, y_pred):
    # Penalize errors on high-value customers (top 20% of spend) twice as much
    weights = np.where(y_true > np.quantile(y_true, 0.8), 2.0, 1.0)
    weighted_errors = weights * (y_true - y_pred) ** 2
    return -np.mean(weighted_errors)  # negative: GridSearchCV maximizes scores

ridge_cv = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring=make_scorer(business_scorer)  # wrap the raw function for GridSearchCV
)

ridge_cv.fit(X_train_scaled, y_train)
best_model = ridge_cv.best_estimator_

# Evaluate separately on the high-value segment (top 20% of actual spend)
high_value = y_test > y_test.quantile(0.8)
high_value_r2 = best_model.score(X_test_scaled[high_value.to_numpy()], y_test[high_value])

# Results:
print(f"Overall Test R²: {best_model.score(X_test_scaled, y_test):.3f}")  # 0.759
print(f"High-value Customer R²: {high_value_r2:.3f}")                     # 0.821
print("Can identify 73% of top 20% spenders")

Model comparison:

  • Linear (no regularization): R² 0.742 - baseline
  • Random Forest: R² 0.681 - overfit!
  • Ridge (optimized): R² 0.759 - best!
  • High-value customer accuracy: 82.1% - the key metric

Day 5

Phase 5: The Executive Presentation

9:00 AM: Presentation day. You've prepared two narratives.

Final Decision: How to Present Your Results

The CEO is expecting 95% accuracy. You achieved 76% overall, but 82% on high-value customers. How do you frame this?

  • Focus on R² Metrics
  • Focus on Business Value
  • Tell the Journey

The Winning Narrative

What Actually Worked:

Opening: "I discovered our data had quality issues inflating apparent performance. After fixing these, I built a model that identifies 73% of our highest-value customers with 82% accuracy."

Business Impact: "This means for our $10M Q4 inventory investment, we can allocate $7.3M with high confidence, reducing waste by an estimated $1.8M compared to uniform distribution."

Honest Assessment: "The 95% accuracy our competitor claims likely includes data leakage. Our 76% is real-world performance you can bank on."

CEO Response: "Finally, someone who tells me the truth! This saves us from a costly mistake. Approved for production."

🎓 The Real Lessons from Your First Week

Technical Lessons

  1. Data quality trumps model complexity: Week spent on data cleaning = months saved in production
  2. Simple models first: Linear regression found issues that Random Forest would hide
  3. Regularization > Complexity: Ridge with α=1.3 beat 500-tree Random Forest
  4. Custom metrics matter: Business value ≠ R² score
  5. Feature engineering > Feature extraction: 29 thoughtful features beat 2,599 polynomial features

Professional Lessons

  1. Set realistic expectations early: Better to disappoint initially than fail in production
  2. Document everything: Your decision log saved you in the presentation
  3. Ethics matter: Refusing to use inflated numbers built trust
  4. Focus on business value: 73% of high-value customers > 95% overall accuracy
  5. Tell stories, not statistics: CEOs understand impact, not R² scores

The Cascade Effect - Final Tally

Small Decision

  • Removed customer_id feature
  • Fixed data leakage
  • Used time-based validation
  • Added regularization (α=1.3)
  • Custom business metric

Big Impact

  • Prevented $4M loss from overfitting
  • Saved $1.8M in inventory costs
  • Built CEO trust with honesty
  • Model actually works in production
  • Promoted to Senior DS in 6 months

Your Performance Review

Outstanding First Week

"Demonstrated technical excellence, business acumen, and ethical integrity. Rare combination of skills that will define successful data scientists of the future."

- Your Manager (who got promoted too)