โ† System Architecture Course Home Next: MLOps โ†’

โš™๏ธ Framework 2: ML Pipeline Engineering

From Jupyter notebooks to production-grade ML systems that actually run.

๐Ÿ“ Your Progress

1Jupyter Gap
2Orchestration
3Feature Eng.
4Retraining
5Testing
Quiz Score: 0 / 0 Complete each section to advance your progress

🚨 The Scenario

"A data scientist at your company built a model in Jupyter that achieves 95% accuracy. 'Ship it!' says the VP of Product. You know this won't end well. Here's why, and how to fix it."

83% of ML projects never reach production (Gartner Research)
17% of models actually deployed use engineered pipelines
The core problem: A Jupyter notebook is a great exploration tool, but it is not a production system. This framework teaches you how to build the infrastructure that bridges the gap.

1

The Jupyter-to-Production Gap

Jupyter notebooks are fantastic for exploration. They're terrible for production. Here's an actual notebook from a real project. Can you spot what's wrong?

๐Ÿ” Diagnose This Notebook

This notebook achieved 95% accuracy. Click any cell to inspect it. Find the 5 production problems hidden inside.

📓 churn_model_FINAL_v3_USE_THIS.ipynb (47 cells | Python 3.8)
[1]:
import pandas as pd
import numpy as np
[7]:
DATA_PATH = "/Users/john/Desktop/data_march.csv"  # PROBLEM 1
df = pd.read_csv(DATA_PATH)
[2]:
df.head()  # Quick check
[15]:
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 45, 65, 100])  # PROBLEM 2
[16]:
df['clv'] = df['revenue'] * df['tenure_months']
[3]:
# Normalize features  -- PROBLEM 3
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[features])  # LEAKS test data!
[4]:
X_train, X_test, y_train, y_test = train_test_split(df_scaled, df['churn'])
[5]:
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
[6]:
import pickle  # PROBLEM 4
pickle.dump(model, open("model_2024_03_14.pkl", "wb"))
[42]:
# TODO: add error handling later  -- PROBLEM 5
print("Accuracy:", model.score(X_test, y_test))
... 36 more cells ...

💡 Click on highlighted cells (marked PROBLEM) to identify the issues.

🧠 Quiz 1: What's Wrong With This Notebook?

Select all 5 production problems you found. Check each one that applies:

Hardcoded file paths (/Users/john/Desktop/...)
Non-sequential cell execution order (out-of-order run numbers)
Using Random Forest instead of XGBoost
Data leakage: scaler fit on full dataset before train/test split
Too many features (over 10 columns)
No model versioning or metadata tracking
No random seed, no error handling, no data validation
The Fix: Production ML requires pipelines: reproducible, versioned, testable workflows that work identically every time, on any machine.
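As a concrete contrast with the notebook above, here is a minimal sketch of the same workflow as a reproducible script. The data path, feature names, and metadata layout are illustrative, not a prescribed structure:

```python
# Sketch: the notebook rewritten as a reproducible, testable script.
import json
import pickle
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42                            # fixed seed: reproducible splits and models
DATA_PATH = Path("data/churn.csv")   # relative, config-driven path (illustrative)

def train_model(df: pd.DataFrame, features: list, target: str = "churn"):
    # Split FIRST, so the scaler never sees test rows
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.25, random_state=SEED)
    pipe = Pipeline([
        ("scale", StandardScaler()),  # fit on train only, inside the pipeline
        ("model", RandomForestClassifier(n_estimators=100, random_state=SEED)),
    ])
    pipe.fit(X_train, y_train)
    return pipe, pipe.score(X_test, y_test)

def save_model(pipe, accuracy: float, out_dir: Path) -> None:
    # Versioned artifact plus metadata, instead of a bare pickle on a desktop
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "model.pkl", "wb") as f:
        pickle.dump(pipe, f)
    (out_dir / "metadata.json").write_text(
        json.dumps({"accuracy": accuracy, "seed": SEED}))
```

Note how each of the five notebook problems maps to one line here: a configurable path, top-to-bottom execution, leakage-free scaling inside the pipeline, versioned artifacts with metadata, and a fixed seed.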

2

Data Pipeline Orchestration

A pipeline is a directed acyclic graph (DAG) of tasks with defined dependencies. Tools like Apache Airflow, Prefect, and Dagster manage these workflows at scale.

🔧 Interactive DAG Explorer

Click each pipeline stage to learn what it does. The arrows show data flow and dependencies.

📥 Extract
✅ Validate
🔄 Transform
🗄️ Load
🧠 Train
📊 Evaluate
🚀 Deploy

⚡ Parallel Execution

Notice: Validate and Transform run in parallel after Extract. This is a key optimization: independent tasks run simultaneously, cutting total pipeline time.

๐Ÿ” Airflow Concept

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG('ml_pipeline') as dag:
  extract = PythonOperator(...)
  validate = PythonOperator(...)
  transform = PythonOperator(...)
  load = PythonOperator(...)
  train = PythonOperator(...)
  evaluate = PythonOperator(...)
  extract >> [validate, transform]
  [validate, transform] >> load
  load >> [train, evaluate]
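Downstream of Evaluate, Deploy typically sits behind a quality gate. The gate logic itself is plain Python, independent of any orchestrator; a minimal sketch, with metric names and thresholds as illustrative assumptions:

```python
# Sketch of a deploy quality gate: Deploy proceeds only when every tracked
# metric from the Evaluate step clears its floor. Thresholds are illustrative.
QUALITY_GATES = {"accuracy": 0.90, "auc": 0.85}

def passes_quality_gates(metrics: dict, gates: dict = QUALITY_GATES) -> bool:
    # A missing metric counts as a failure, not a silent pass
    return all(metrics.get(name, float("-inf")) >= floor
               for name, floor in gates.items())

def next_task(metrics: dict) -> str:
    # The returned name would select the downstream task in a branching step
    return "deploy" if passes_quality_gates(metrics) else "alert_and_halt"
```

In Airflow this kind of check is usually wired in as a branching or short-circuit task between Evaluate and Deploy, so a failed gate halts the DAG instead of shipping a bad model.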

🧠 Quiz 2: Pipeline Dependencies

In the DAG above, the Deploy node should only run when:

A) Extract completes successfully
B) Train completes successfully
C) Both Train AND Evaluate complete, and Evaluate passes quality gates
D) Any single upstream task completes

3

Feature Engineering Pipelines

Raw data is rarely model-ready. Feature engineering transforms it into representations that capture the signal your model needs. The key insight: feature engineering should live inside a reproducible pipeline, not scattered across notebook cells.

🧪 Interactive: Raw → Features

Watch how raw customer data is transformed step by step. Click each stage to apply the transformation.

📊 Raw Data → 🧹 Impute Nulls → 📏 Normalize → 🏷️ Encode Cats → ⚗️ Create Features

📈 Feature Quality vs. Model Performance

Drag the slider to see how feature engineering investment affects accuracy. Based on real ML project data.

Model Accuracy: 73% | False Positive Rate: 31% | Business Impact: $12K/mo

💻 Code Sandbox: Build a sklearn Pipeline

Complete the pipeline below. Replace # YOUR CODE HERE with a StandardScaler() step, then click Run.

pipeline_sandbox.py
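For reference, one way the completed sandbox could look. The surrounding scaffold isn't shown here, so the model choice and variable names are assumptions:

```python
# One possible completion of the exercise: the StandardScaler step slots in
# ahead of the model, so its mean/std are learned only during fit().
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),   # replaces "# YOUR CODE HERE"
    ("model", LogisticRegression()),
])
# pipeline.fit(X_train, y_train)  learns scaling statistics from X_train only
# pipeline.predict(X_new)         reuses those exact statistics at inference
```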

🧠 Quiz 3: Why Use a sklearn Pipeline?

What is the most important production benefit of wrapping preprocessing in a Pipeline object?

A) It makes the code run faster
B) Preprocessing parameters (e.g., scaler mean/std) learned on training data are automatically applied to new data at inference, preventing data leakage
C) It automatically selects the best features
D) It handles model versioning automatically

4

Continuous Training & Retraining Triggers

Models decay. The world changes: customer behavior shifts, economic conditions evolve, new product lines launch. A model trained on last year's data becomes less accurate over time. This is called concept drift.

📉 Concept Drift Visualizer

This chart shows model accuracy over 12 months. Use the controls to simulate a retraining event and explore different drift patterns.

Select a drift scenario above to see how concept drift affects model performance.

🤔 When Should You Retrain?

Compare the three main retraining strategies. Click each to see pros and cons.

โฐ Time-Based

Retrain every N days, regardless of performance

📊 Performance-Based

Retrain when accuracy drops below threshold

🌊 Data Drift Detection

Monitor input data distribution; retrain when it shifts
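The third strategy can be sketched with a two-sample Kolmogorov-Smirnov test per monitored feature. The feature, sample sizes, and the 0.05 significance level below are illustrative choices, not fixed recommendations:

```python
# Sketch of data drift detection: compare the live distribution of a feature
# against the distribution it had at training time.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha: float = 0.05) -> bool:
    # Reject "same distribution" at significance level alpha -> treat as drift
    _, p_value = ks_2samp(train_values, live_values)
    return bool(p_value < alpha)

rng = np.random.default_rng(7)
baseline = rng.normal(40, 10, size=2_000)   # e.g. customer ages at training time
shifted = rng.normal(55, 10, size=2_000)    # live traffic after a population shift
# A detected drift on a monitored feature becomes the retraining trigger.
```

In production this check would run per feature on each incoming batch, with the set of monitored features and the alert threshold tuned to tolerate normal seasonality.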

🧠 Quiz 4: Diagnosing Accuracy Drops

Your churn model's accuracy dropped from 92% to 87% this month. What should you check first?

A) Immediately retrain on all available data
B) Switch to a more complex model (e.g., XGBoost → Neural Network)
C) Check if the input data distribution has changed (new data sources, schema changes, seasonality)
D) Roll back to the previous model version

5

Testing ML Pipelines

Would you deploy software without testing it? ML pipelines need tests too โ€” but they're different from regular software tests. You test data, transformations, model behavior, and end-to-end system behavior.

🔺 The ML Testing Pyramid

Click each layer to see what to test at that level.

🚀 E2E Tests (few, slow)
🔗 Integration Tests (some)
🔬 Unit Tests (many, fast)
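At the base of the pyramid, a unit test exercises one transformation as a pure function. As an illustration, here is the CLV feature from the notebook earlier, refactored into a testable function (the function name and test framing are assumptions):

```python
# Sketch of a unit test for a single feature transformation.
import pandas as pd

def add_clv(df: pd.DataFrame) -> pd.DataFrame:
    """Customer lifetime value = revenue * tenure; never mutates its input."""
    out = df.copy()
    out["clv"] = out["revenue"] * out["tenure_months"]
    return out

def test_add_clv():
    df = pd.DataFrame({"revenue": [100.0, 0.0], "tenure_months": [12, 6]})
    result = add_clv(df)
    assert list(result["clv"]) == [1200.0, 0.0]
    assert "clv" not in df.columns   # the input frame is left untouched

test_add_clv()
```

Tests like this run in milliseconds, so they can gate every commit, while the slower integration and E2E layers run on a schedule or before deploys.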

💻 Code Editor: Data Validation Tests

This is a Great Expectations-style data validation suite. Edit and run to see validation results.

data_validation.py
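The editor's contents aren't shown here, but the idea behind a Great Expectations-style suite can be sketched with plain pandas: each expectation is a named boolean check, and a batch passes only when all checks do. Column names and rules are illustrative:

```python
# Sketch of a data validation suite: named expectations over an input batch.
import pandas as pd

def validate(df: pd.DataFrame) -> list:
    checks = {
        "age_in_range":     df["age"].between(0, 120).all(),
        "no_null_customer": df["customer_id"].notna().all(),
        "churn_is_binary":  df["churn"].isin([0, 1]).all(),
    }
    failures = [name for name, ok in checks.items() if not ok]
    return failures                   # an empty list means the batch passes

good = pd.DataFrame({"customer_id": [1, 2], "age": [34, 71], "churn": [0, 1]})
bad = good.assign(age=[34, -1])       # the invalid age = -1 records from Quiz 5
```

A pipeline would run `validate` as its Validate stage and route any batch with a non-empty failure list to quarantine rather than to training.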

🧠 Quiz 5: ML Testing Strategy

Your data validation suite detects that 3% of incoming records have age = -1 (clearly invalid). Your pipeline is set to fail on any validation error. What should you do?

A) Change the pipeline to ignore validation errors and train anyway
B) Fix only the validation rule (allow negative age)
C) Quarantine invalid records, investigate the root cause in the data source, then retrain on clean data
D) Since only 3% are affected, the model can handle the noise

🎓 Framework 2 Summary: ML Pipeline Engineering

What you learned:

  • Why Jupyter notebooks fail in production (5 specific problems)
  • How to design a DAG-based pipeline with proper dependencies
  • Feature engineering as a reproducible, versioned pipeline step
  • When and how to retrain models (3 strategies)
  • The ML testing pyramid: unit → integration → E2E

Key tools to know:

  • Apache Airflow – DAG orchestration
  • Prefect / Dagster – modern orchestration
  • sklearn Pipeline – preprocessing + model
  • Great Expectations – data validation
  • MLflow – model versioning (next framework)
Continue to Framework 3: MLOps →