You've just been hired as a Data Scientist at DataCorp, a mid-sized e-commerce company. This lab simulates your first real project, complete with changing requirements, data quality issues, stakeholder pressure, and the discoveries that come from careful analysis. Unlike academic exercises, you'll experience how small early decisions compound into major consequences.
Your Mission: Build a customer spending prediction model for the holiday season. Budget allocation depends on your model: $10M is at stake.
Day 1, 9:00 AM: You arrive excited. Your manager sends you the project brief.
"Welcome aboard! We need a model to predict customer spending for Q4. The CEO wants 95% accuracy - she saw our competitor claiming that in their press release. You have our customer database with 2 years of history. Should be straightforward, right? Need initial results by Friday."
Translation: Unrealistic expectations, competitive pressure, tight deadline.
How do you respond to the 95% accuracy requirement?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load data - seems simple enough!
df = pd.read_csv('customers.csv')
print(f"Dataset shape: {df.shape}")
# Output: (10000, 47) # Lots of features, great!
# Quick model - let's see baseline
X = df.drop('spending', axis=1)
y = df['spending']
# Standard split (random_state pinned so results are reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(f"R² Score: {model.score(X_test, y_test):.3f}")
# Output: 0.923 # Wow, so close to 95%!
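An R² that high on a first pass deserves suspicion rather than celebration. One quick sanity check, sketched below (the 0.95 uniqueness threshold is an arbitrary choice), is to look for near-unique ID-like columns, the kind of scan that surfaces the customer_id problem mentioned the next morning:
# Flag ID-like columns: near-unique values are identifiers, not signals,
# and they let a model memorize rows instead of learning patterns.
id_like = [col for col in X.columns if X[col].nunique() / len(X) > 0.95]
print(f"Suspiciously unique columns: {id_like}")
# Any hits here should be dropped before trusting that 0.923.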
Day 2, 9:00 AM: After yesterday's customer_id fiasco, you decide to explore the data properly.
30% of customers have 0 for 'email_opens' - they're actually NULLs from before email tracking started!
# Flag who has real history, then impute the median of tracked customers
df['has_email_history'] = df['email_opens'] > 0
df.loc[~df['has_email_history'], 'email_opens'] = (
    df.loc[df['has_email_history'], 'email_opens'].median()
)
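How would you catch this in the first place? One approach, run before the imputation above and assuming a signup_date column exists in this dataset, is to compare the zero-rate across signup cohorts; real behavior varies smoothly, while a tracking-launch artifact shows a cliff:
# Zero-rate by signup year ('signup_date' is an assumed column name)
zero_rate = (
    df.assign(signup_year=pd.to_datetime(df['signup_date']).dt.year)
      .groupby('signup_year')['email_opens']
      .apply(lambda s: (s == 0).mean())
)
print(zero_rate)  # a cliff at the tracking launch year is the giveaway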
'total_lifetime_value' includes future purchases! It's literally the answer with noise.
# Remove features that wouldn't be known at prediction time
leaked_features = ['total_lifetime_value', 'future_purchases', 'next_month_active']
X = X.drop(columns=leaked_features)
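Leaked columns rarely announce themselves this clearly. A scan worth running, sketched here for numeric columns only, ranks features by absolute correlation with the target:
# Anything correlating near 1.0 with the target is probably an answer
# in disguise: ask "would we know this value before Q4 actually happens?"
leak_scan = X.select_dtypes('number').corrwith(y).abs().sort_values(ascending=False)
print(leak_scan.head(10))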
2,000 rows are duplicates with slightly different feature values - data pipeline bug!
# Keep the most recent record for each customer
df = df.sort_values('last_updated').drop_duplicates('customer_email', keep='last')
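A quick assertion confirms the dedup worked and guards against the pipeline bug resurfacing:
# One row per customer, or the bug is back
remaining = df['customer_email'].duplicated().sum()
print(f"Duplicate customers remaining: {remaining}")
assert remaining == 0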
Day 2, 2:00 PM - Manager Check-in: "How's the 95% accuracy model coming along?"
Your Response: "I found serious data quality issues. The real baseline is 67%, not 92%. But that's actually good news - now we know the truth."
Manager: "The CEO won't be happy... Can you just use the original numbers for the presentation?"
Your manager suggests using inflated numbers for the executive presentation. What do you do?
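For reference, the 67% quoted to your manager is what re-running the same simple baseline on the cleaned data would produce; a minimal sketch (the exact number depends on the data):
# Honest baseline: same model, deduplicated and leak-free data
X_clean = (df.drop(columns=['spending'] + leaked_features, errors='ignore')
             .select_dtypes('number'))
y_clean = df['spending']
X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=42
)
print(f"R² Score: {LinearRegression().fit(X_train, y_train).score(X_test, y_test):.3f}")
# Output: ~0.67 (lower, but real)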
Day 3, 9:00 AM: With clean data, you begin thoughtful feature engineering.
# Too many features - overfitting trap!
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Features exploded from {X.shape[1]} to {X_poly.shape[1]}")
# Output: Features exploded from 23 to 2,024!
# Model performance:
# Train R²: 0.982 # Suspiciously high!
# Test R²: 0.593 # Worse than before!
# Business-driven feature engineering
df['recency'] = (pd.Timestamp.now() - pd.to_datetime(df['last_purchase_date'])).dt.days
df['frequency'] = df['total_purchases'] / df['account_age_days'].clip(lower=1) * 365  # purchases per year; clip guards brand-new accounts
df['monetary'] = df['total_spent'] / df['total_purchases'].clip(lower=1)
df['engagement_score'] = (
df['email_opens'] * 0.3 +
df['website_visits'] * 0.5 +
df['app_sessions'] * 0.2
)
# Only meaningful interactions
df['premium_engagement'] = df['is_premium'] * df['engagement_score']
df['recency_frequency'] = df['recency'] * df['frequency']
# Result: 29 well-chosen features
# Train R²: 0.756
# Test R²: 0.742 # Much better generalization!
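A single train/test split can flatter or punish a feature set by luck. A steadier read, sketched here with feature_cols as a hypothetical list naming the 29 chosen columns, comes from 5-fold cross-validation:
from sklearn.model_selection import cross_val_score
# 'feature_cols' is a placeholder for the 29 engineered features above
scores = cross_val_score(
    LinearRegression(), df[feature_cols], df['spending'], cv=5, scoring='r2'
)
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")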
Day 4, 9:00 AM: Pressure mounting. CEO presentation tomorrow. You try to boost performance.
from sklearn.ensemble import RandomForestRegressor
# "More complex model = better performance, right?"
# (X_train_scaled/X_test_scaled presumably come from a StandardScaler step
# not shown here; trees don't need scaling, but reusing the arrays is harmless.)
rf_model = RandomForestRegressor(n_estimators=500, max_depth=20, min_samples_leaf=1, random_state=42)
rf_model.fit(X_train_scaled, y_train)
# Results:
print(f"Train R²: {rf_model.score(X_train_scaled, y_train):.3f}") # 0.995
print(f"Test R²: {rf_model.score(X_test_scaled, y_test):.3f}") # 0.681
# WORSE than simple linear regression!
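Before blaming the data, it's worth confirming where the failure lives. A validation curve over max_depth, a sketch of which follows, typically shows the train score climbing while the cross-validated score stalls - the signature of overfitting:
from sklearn.model_selection import validation_curve
depths = [2, 5, 10, 20, None]
train_sc, val_sc = validation_curve(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X_train_scaled, y_train,
    param_name='max_depth', param_range=depths, cv=5, scoring='r2'
)
for d, tr, va in zip(depths, train_sc.mean(axis=1), val_sc.mean(axis=1)):
    print(f"max_depth={d}: train R²={tr:.3f}, CV R²={va:.3f}")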
# "Maybe outliers are hurting performance?" Q1 = y_train.quantile(0.25) Q3 = y_train.quantile(0.75) IQR = Q3 - Q1 # Remove "outliers" - actually high-value customers! mask = ~((y_train < Q1 - 1.5 * IQR) | (y_train > Q3 + 1.5 * IQR)) X_train_filtered = X_train_scaled[mask] y_train_filtered = y_train[mask] # Results: Model can't predict high spenders anymore! # Lost ability to identify most valuable customers
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
# Proper regularization with business-focused validation
param_grid = {'alpha': np.logspace(-2, 2, 50)}
# Custom scorer: weighted by customer value
def business_scorer(y_true, y_pred):
    # Penalize errors on high-value customers (top 20%) twice as heavily
    weights = np.where(y_true > np.quantile(y_true, 0.8), 2.0, 1.0)
    weighted_errors = weights * (y_true - y_pred) ** 2
    return -np.mean(weighted_errors)  # negated so higher = better
ridge_cv = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring=make_scorer(business_scorer, greater_is_better=True)
)
ridge_cv.fit(X_train_scaled, y_train)
best_model = ridge_cv.best_estimator_
# Results:
print(f"Overall Test R²: {best_model.score(X_test_scaled, y_test):.3f}")  # 0.759
# Score separately on the top-20% spenders
high_value = (y_test >= y_test.quantile(0.8)).values
high_value_r2 = best_model.score(X_test_scaled[high_value], y_test[high_value])
print(f"High-value Customer R²: {high_value_r2:.3f}")  # 0.821
print("Can identify 73% of top 20% spenders")
Day 5, 9:00 AM: Presentation day. You've prepared two narratives.
The CEO is expecting 95% accuracy. You achieved 76% overall, but 82% on high-value customers. How do you frame this?
Opening: "I discovered our data had quality issues inflating apparent performance. After fixing these, I built a model that identifies 73% of our highest-value customers with 82% accuracy."
Business Impact: "This means for our $10M Q4 inventory investment, we can allocate $7.3M with high confidence, reducing waste by an estimated $1.8M compared to uniform distribution."
Honest Assessment: "The 95% accuracy our competitor claims likely includes data leakage. Our 76% is real-world performance you can bank on."
CEO Response: "Finally, someone who tells me the truth! This saves us from a costly mistake. Approved for production."
Outstanding First Week
"Demonstrated technical excellence, business acumen, and ethical integrity. Rare combination of skills that will define successful data scientists of the future."
- Your Manager (who got promoted too)