🚀 Module 1 Refined Lab: The Data Scientist's Real Journey

Experience the Actual Workflow, Setbacks, and Discoveries of Production ML

Welcome to Your First Week at DataCorp

You've just been hired as a Data Scientist at DataCorp, a mid-sized e-commerce company. This lab simulates your first real project, complete with changing requirements, data quality issues, stakeholder pressure, and the discoveries that come from careful analysis. Unlike academic exercises, you'll experience how small early decisions compound into major consequences.

Your Mission: Build a customer spending prediction model for the holiday season. Budget allocation depends on your model: $10M is at stake.

Day 1

Phase 1: The Honeymoon Period

9:00 AM: You arrive excited. Your manager sends you the project brief.

"Welcome aboard! We need a model to predict customer spending for Q4. The CEO wants 95% accuracy - she saw our competitor claiming that in their press release. You have our customer database with 2 years of history. Should be straightforward, right? Need initial results by Friday."

Translation: Unrealistic expectations, competitive pressure, tight deadline.

Your First Decision: Setting Expectations

How do you respond to the 95% accuracy requirement?

  • Promise 95%
  • Investigate First
  • Challenge Assumption

9:30 AM: First Look at the Data

Your Initial Attempt (Optimistic):
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load data - seems simple enough!
df = pd.read_csv('customers.csv')
print(f"Dataset shape: {df.shape}")
# Output: (10000, 47)  # Lots of features, great!

# Quick model - let's see a baseline (numeric columns only, so the fit runs)
X = df.drop('spending', axis=1).select_dtypes('number')
y = df['spending']

# Standard random split (a choice we'll revisit on Day 4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(f"R² Score: {model.score(X_test, y_test):.3f}")
# Output: 0.923  # Wow, so close to 95%!
10:15 AM - Reality Hits: You realize the model is using customer_id as a feature! It's memorizing customers, not learning patterns. When you remove it, R² drops to 0.710.
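
One habit that catches this early: scan for identifier-like columns, where nearly every row holds a unique value, because a model can use them to memorize individuals. A minimal sketch against the same X, before any modeling:

# Flag identifier-like columns: near-unique per row, useless for generalization
id_like = [col for col in X.columns if X[col].nunique() / len(X) > 0.95]
print(f"Identifier-like columns to drop: {id_like}")
X = X.drop(columns=id_like)
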
Day 2

Phase 2: The Data Quality Nightmare

9:00 AM: After yesterday's customer_id fiasco, you decide to properly explore the data.

Issue #1: Missing Values Hidden as Zeros

30% of customers have 0 for 'email_opens' - they're actually NULLs from before email tracking started!

Your Fix: Create a 'has_email_history' flag, then impute the disguised NULLs carefully
# Flag customers with real email history; fill the disguised NULLs
# with the median of customers who actually have tracking data
df['has_email_history'] = df['email_opens'] > 0
df.loc[~df['has_email_history'], 'email_opens'] = df.loc[df['has_email_history'], 'email_opens'].median()
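
Zeros that are really NULLs rarely hide in just one column. A zero-rate scan, sketched here assuming df is already loaded, surfaces other candidates to cross-check against when tracking actually began:

# Rank numeric columns by their share of exact zeros; suspiciously high
# rates are worth checking against system rollout dates
zero_rates = (df.select_dtypes('number') == 0).mean().sort_values(ascending=False)
print(zero_rates[zero_rates > 0.20])
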
Issue #2: Data Leakage from the Future

'total_lifetime_value' includes future purchases! It's literally the answer with noise.

Your Fix: Remove it and all features calculated after the prediction point
# Remove features that wouldn't be known at prediction time
leaked_features = ['total_lifetime_value', 'future_purchases', 'next_month_active']
X = X.drop(columns=leaked_features)
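
Leaked features tend to give themselves away by correlating almost perfectly with the target. A correlation scan like this, a sketch assuming numeric features, is a cheap audit to run before dropping anything:

# Features correlating near-perfectly with the target deserve suspicion:
# they either leak future information or are the target in disguise
correlations = X.select_dtypes('number').corrwith(y).abs().sort_values(ascending=False)
print(correlations[correlations > 0.95])
# 'total_lifetime_value' would have topped this list
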
Issue #3: Duplicate Customers

2,000 rows are duplicates with slightly different feature values - data pipeline bug!

Your Fix: Keep the most recent record per customer
# Sort by update time, then keep only the latest row for each customer
df = df.sort_values('last_updated').drop_duplicates('customer_email', keep='last')
assert df['customer_email'].is_unique  # sanity check: one row per customer

Model performance as the issues were fixed:

  • Original R²: 0.923 (inflated!)
  • After removing customer_id: 0.710 (↓ 23%)
  • After fixing leakage: 0.645 (↓ 6.5%)
  • After deduplication: 0.672 (↑ 2.7%)

2:00 PM - Manager Check-in: "How's the 95% accuracy model coming along?"

Your Response: "I found serious data quality issues. Real baseline is 67%, not 92%. But this is actually good - we now know the truth."

Manager: "The CEO won't be happy... Can you just use the original numbers for the presentation?"

Ethical Decision Point

Your manager suggests using inflated numbers for the executive presentation. What do you do?

  • Use Inflated Numbers
  • Stand Your Ground
  • Propose Alternative

Day 3

Phase 3: Feature Engineering - The Art and Science

9:00 AM: With clean data, you begin thoughtful feature engineering.

Feature Engineering Decision Log

9:15 AM: ✅ Create 'days_since_last_purchase' - customers who bought recently buy more
9:45 AM: ✅ Add 'purchase_velocity' = purchases / account_age - captures engagement rate
10:30 AM: ❌ Add polynomial features (degree=3) - overfitting immediately! Removed.
11:00 AM: ⚠️ Add interaction terms selectively - only those with business logic
11:45 AM: ✅ Create 'seasonal_buyer' flag - some customers only buy during sales (see the sketch below)
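
Building that 'seasonal_buyer' flag takes transaction-level data. A minimal sketch, assuming a hypothetical orders.csv with 'customer_email' and 'order_date' columns:

# Hypothetical orders table: one row per purchase
orders = pd.read_csv('orders.csv', parse_dates=['order_date'])

# Share of each customer's purchases that land in Q4 (holiday season)
orders['is_q4'] = orders['order_date'].dt.quarter == 4
q4_share = orders.groupby('customer_email')['is_q4'].mean()

# Flag customers who buy almost exclusively during the holidays
df['seasonal_buyer'] = df['customer_email'].map(q4_share).fillna(0) > 0.8
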
Mistake: Over-Engineering Features
# Too many features - overfitting trap!
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Features exploded from {X.shape[1]} to {X_poly.shape[1]}")
# Output: Features exploded from 23 to 2,599!

# Model performance:
# Train R²: 0.982  # Suspiciously high!
# Test R²: 0.593   # Worse than before!
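
The train/test gap above is the classic overfitting signature. Cross-validation makes it visible before you ever touch the test set; a quick check, sketched with the same exploded feature matrix:

from sklearn.model_selection import cross_val_score

# Cross-validated R² on the polynomial features: scores far below the
# training R², or high variance across folds, both signal overfitting
cv_scores = cross_val_score(LinearRegression(), X_poly, y, cv=5, scoring='r2')
print(f"CV R²: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# A training R² near 0.98 against a CV mean near 0.59 means memorization
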
Fix: Thoughtful Feature Selection
# Business-driven feature engineering
df['recency'] = (pd.Timestamp.now() - pd.to_datetime(df['last_purchase_date'])).dt.days
df['frequency'] = df['total_purchases'] / df['account_age_days'] * 365
df['monetary'] = df['total_spent'] / df['total_purchases'].clip(lower=1)
df['engagement_score'] = (
    df['email_opens'] * 0.3 + 
    df['website_visits'] * 0.5 + 
    df['app_sessions'] * 0.2
)

# Only meaningful interactions
df['premium_engagement'] = df['is_premium'] * df['engagement_score']
df['recency_frequency'] = df['recency'] * df['frequency']

# Result: 29 well-chosen features
# Train R²: 0.756
# Test R²: 0.742  # Much better generalization!
Key Discovery: Feature quality beats feature quantity. 29 thoughtful features outperform 2,599 polynomial features. The model is learning real patterns, not memorizing noise.
Day 4

Phase 4: The Overfitting Trap and Recovery

9:00 AM: Pressure mounting. CEO presentation tomorrow. You try to boost performance.

CEO Presentation in: 24 HOURS

The Desperation Spiral

10:00 AM - Desperation Move #1: Complex Model
from sklearn.ensemble import RandomForestRegressor

# "More complex model = better performance, right?"
rf_model = RandomForestRegressor(n_estimators=500, max_depth=20, min_samples_leaf=1)  # deep trees, no leaf constraint: an overfitting recipe
rf_model.fit(X_train_scaled, y_train)  # X_train_scaled: features standardized earlier

# Results:
print(f"Train R²: {rf_model.score(X_train_scaled, y_train):.3f}")  # 0.995
print(f"Test R²: {rf_model.score(X_test_scaled, y_test):.3f}")     # 0.681

# WORSE than simple linear regression!
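
Random forests come with a built-in early warning: the out-of-bag (OOB) score, computed for each tree on the bootstrap samples it never saw. A sketch of the same model with OOB enabled:

# oob_score=True gives an honest estimate without touching the test set
rf_check = RandomForestRegressor(
    n_estimators=500, max_depth=20, min_samples_leaf=1,
    oob_score=True, random_state=42
)
rf_check.fit(X_train_scaled, y_train)
print(f"Train R²: {rf_check.score(X_train_scaled, y_train):.3f}")  # near 1.0
print(f"OOB R²:   {rf_check.oob_score_:.3f}")                      # far lower: overfitting
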
11:00 AM - Desperation Move #2: Remove "Outliers"
# "Maybe outliers are hurting performance?"
Q1 = y_train.quantile(0.25)
Q3 = y_train.quantile(0.75)
IQR = Q3 - Q1

# Remove "outliers" - actually high-value customers!
mask = ~((y_train < Q1 - 1.5 * IQR) | (y_train > Q3 + 1.5 * IQR))
X_train_filtered = X_train_scaled[mask.to_numpy()]  # align the boolean mask with the scaled array
y_train_filtered = y_train[mask]

# Results: Model can't predict high spenders anymore!
# Lost ability to identify most valuable customers
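
Segment-level evaluation is what exposes this mistake: overall R² can look stable while the model quietly fails on exactly the customers the business cares about. A sketch, using a hypothetical model_filtered fitted on the trimmed data:

# Fit on the 'outlier-free' data, then score only the top-20% spenders
model_filtered = LinearRegression().fit(X_train_filtered, y_train_filtered)

top_spenders = y_test > y_test.quantile(0.8)
r2_high = model_filtered.score(X_test_scaled[top_spenders.to_numpy()], y_test[top_spenders])
print(f"High-value segment R²: {r2_high:.3f}")
# A collapse here shows the 'outliers' were the signal, not the noise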

2:00 PM - The Turning Point

Realization: You're making the model worse by chasing metrics. Step back. What does the business actually need?
3:00 PM - The Right Approach: Regularized Linear Model
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# Proper regularization with business-focused validation
param_grid = {'alpha': np.logspace(-2, 2, 50)}

# Custom scorer: weighted by customer value
def business_scorer(y_true, y_pred):
    # Penalize errors on high-value customers (top 20% of spend) twice as much
    weights = np.where(y_true > np.quantile(y_true, 0.8), 2.0, 1.0)
    weighted_errors = weights * (y_true - y_pred) ** 2
    return -np.mean(weighted_errors)  # negative: GridSearchCV maximizes scores

ridge_cv = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring=make_scorer(business_scorer)  # wrap the raw function for GridSearchCV
)

ridge_cv.fit(X_train_scaled, y_train)
best_model = ridge_cv.best_estimator_

# Evaluate separately on the high-value segment (top 20% of actual spend)
high_value = y_test > y_test.quantile(0.8)
high_value_r2 = best_model.score(X_test_scaled[high_value.to_numpy()], y_test[high_value])

# Results:
print(f"Overall Test R²: {best_model.score(X_test_scaled, y_test):.3f}")  # 0.759
print(f"High-value Customer R²: {high_value_r2:.3f}")                     # 0.821
print("Can identify 73% of top 20% spenders")

Model comparison:

  • Linear (no regularization): R² 0.742 - baseline
  • Random Forest: R² 0.681 - overfit!
  • Ridge (optimized): R² 0.759 - best!
  • High-value customer accuracy: 82.1% - the key metric

Day 5

Phase 5: The Executive Presentation

9:00 AM: Presentation day. You've prepared two narratives.

Final Decision: How to Present Your Results

The CEO is expecting 95% accuracy. You achieved 76% overall, but 82% on high-value customers. How do you frame this?

  • Focus on R² Metrics
  • Focus on Business Value
  • Tell the Journey

The Winning Narrative

What Actually Worked:

Opening: "I discovered our data had quality issues inflating apparent performance. After fixing these, I built a model that identifies 73% of our highest-value customers with 82% accuracy."

Business Impact: "This means for our $10M Q4 inventory investment, we can allocate $7.3M with high confidence, reducing waste by an estimated $1.8M compared to uniform distribution."

Honest Assessment: "The 95% accuracy our competitor claims likely includes data leakage. Our 76% is real-world performance you can bank on."

CEO Response: "Finally, someone who tells me the truth! This saves us from a costly mistake. Approved for production."

🎓 The Real Lessons from Your First Week

Technical Lessons

  1. Data quality trumps model complexity: Week spent on data cleaning = months saved in production
  2. Simple models first: Linear regression found issues that Random Forest would hide
  3. Regularization > Complexity: Ridge with α=1.3 beat 500-tree Random Forest
  4. Custom metrics matter: Business value ≠ R² score
  5. Feature engineering > Feature extraction: 29 thoughtful features beat 2,599 polynomial features

Professional Lessons

  1. Set realistic expectations early: Better to disappoint initially than fail in production
  2. Document everything: Your decision log saved you in the presentation
  3. Ethics matter: Refusing to use inflated numbers built trust
  4. Focus on business value: 73% of high-value customers > 95% overall accuracy
  5. Tell stories, not statistics: CEOs understand impact, not R² scores

The Cascade Effect - Final Tally

Small Decision

  • Removed customer_id feature
  • Fixed data leakage
  • Used time-based validation
  • Added regularization (α=1.3)
  • Custom business metric

Big Impact

  • Prevented $4M loss from overfitting
  • Saved $1.8M in inventory costs
  • Built CEO trust with honesty
  • Model actually works in production
  • Promoted to Senior DS in 6 months

Your Performance Review

Outstanding First Week

"Demonstrated technical excellence, business acumen, and ethical integrity. Rare combination of skills that will define successful data scientists of the future."

- Your Manager (who got promoted too)