โ† System Architecture Course Home Next: MLOps โ†’

โš™๏ธ Framework 2: ML Pipeline Engineering

From Jupyter notebooks to production-grade ML systems that actually run.

๐Ÿ“ Your Progress

1Jupyter Gap
2Orchestration
3Feature Eng.
4Retraining
5Testing
Quiz Score: 0 / 0 Complete each section to advance your progress

🚨 The Scenario

"A data scientist at your company built a model in Jupyter that achieves 95% accuracy. 'Ship it!' says the VP of Product. You know this won't end well. Here's why, and how to fix it."

83% of ML projects never reach production (Gartner Research)
17% of models actually deployed use engineered pipelines
The core problem: A Jupyter notebook is a great exploration tool, but it is not a production system. This framework teaches you how to build the infrastructure that bridges the gap.

1

The Jupyter-to-Production Gap

Jupyter notebooks are fantastic for exploration. They're terrible for production. Here's an actual notebook from a real project. Can you spot what's wrong?

๐Ÿ” Diagnose This Notebook

This notebook achieved 95% accuracy. Click any cell to inspect it. Find the 5 production problems hidden inside.

📓 churn_model_FINAL_v3_USE_THIS.ipynb (47 cells | Python 3.8)
[1]:
import pandas as pd
import numpy as np
[7]:
DATA_PATH = "/Users/john/Desktop/data_march.csv"  # PROBLEM 1
df = pd.read_csv(DATA_PATH)
[2]:
df.head()  # Quick check
[15]:
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 45, 65, 100])  # PROBLEM 2
[16]:
df['clv'] = df['revenue'] * df['tenure_months']
[3]:
# Normalize features  -- PROBLEM 3
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[features])  # LEAKS test data!
[4]:
X_train, X_test, y_train, y_test = train_test_split(df_scaled, df['churn'])
[5]:
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
[6]:
import pickle  # PROBLEM 4
pickle.dump(model, open("model_2024_03_14.pkl", "wb"))
[42]:
# TODO: add error handling later  -- PROBLEM 5
print("Accuracy:", model.score(X_test, y_test))
... 36 more cells ...

💡 Click on highlighted cells (marked PROBLEM) to identify the issues.

🧠 Quiz 1: What's Wrong With This Notebook?

Select all 5 production problems you found. Check each one that applies:

Hardcoded file paths (/Users/john/Desktop/...)
Non-sequential cell execution order (out-of-order run numbers)
Using Random Forest instead of XGBoost
Data leakage: scaler fit on full dataset before train/test split
Too many features (over 10 columns)
No model versioning or metadata tracking
No random seed, no error handling, no data validation
The Fix: Production ML requires pipelines: reproducible, versioned, testable workflows that work identically every time, on any machine.
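As a concrete contrast with the notebook above, here is a minimal sketch of the same workflow as a reproducible script. The data path, feature names, and metadata layout are illustrative, not a prescribed structure:

```python
# Sketch: the notebook rewritten as a reproducible, testable script.
import json
import pickle
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42                            # fixed seed: reproducible splits and models
DATA_PATH = Path("data/churn.csv")   # relative, config-driven path (illustrative)

def train_model(df: pd.DataFrame, features: list, target: str = "churn"):
    # Split FIRST, so the scaler never sees test rows
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.25, random_state=SEED)
    pipe = Pipeline([
        ("scale", StandardScaler()),  # fit on train only, inside the pipeline
        ("model", RandomForestClassifier(n_estimators=100, random_state=SEED)),
    ])
    pipe.fit(X_train, y_train)
    return pipe, pipe.score(X_test, y_test)

def save_model(pipe, accuracy: float, out_dir: Path) -> None:
    # Versioned artifact plus metadata, instead of a bare pickle on a desktop
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "model.pkl", "wb") as f:
        pickle.dump(pipe, f)
    (out_dir / "metadata.json").write_text(
        json.dumps({"accuracy": accuracy, "seed": SEED}))
```

Note how each of the five notebook problems maps to one line here: a configurable path, top-to-bottom execution, leakage-free scaling inside the pipeline, versioned artifacts with metadata, and a fixed seed.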

2

Data Pipeline Orchestration

A pipeline is a directed acyclic graph (DAG) of tasks with defined dependencies. Tools like Apache Airflow, Prefect, and Dagster manage these workflows at scale.

🔧 Interactive DAG Explorer

Click each pipeline stage to learn what it does. The arrows show data flow and dependencies.

📥 Extract
✅ Validate
🔄 Transform
🗄️ Load
🧠 Train
📊 Evaluate
🚀 Deploy

⚡ Parallel Execution

Notice: Validate and Transform run in parallel after Extract. This is a key optimization: independent tasks run simultaneously, cutting total pipeline time.

๐Ÿ” Airflow Concept

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG('ml_pipeline') as dag:
  extract = PythonOperator(...)
  validate = PythonOperator(...)
  transform = PythonOperator(...)
  load = PythonOperator(...)
  train = PythonOperator(...)
  evaluate = PythonOperator(...)
  extract >> [validate, transform]
  [validate, transform] >> load
  load >> [train, evaluate]
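Downstream of Evaluate, Deploy typically sits behind a quality gate. The gate logic itself is plain Python, independent of any orchestrator; a minimal sketch, with metric names and thresholds as illustrative assumptions:

```python
# Sketch of a deploy quality gate: Deploy proceeds only when every tracked
# metric from the Evaluate step clears its floor. Thresholds are illustrative.
QUALITY_GATES = {"accuracy": 0.90, "auc": 0.85}

def passes_quality_gates(metrics: dict, gates: dict = QUALITY_GATES) -> bool:
    # A missing metric counts as a failure, not a silent pass
    return all(metrics.get(name, float("-inf")) >= floor
               for name, floor in gates.items())

def next_task(metrics: dict) -> str:
    # The returned name would select the downstream task in a branching step
    return "deploy" if passes_quality_gates(metrics) else "alert_and_halt"
```

In Airflow this kind of check is usually wired in as a branching or short-circuit task between Evaluate and Deploy, so a failed gate halts the DAG instead of shipping a bad model.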

🧠 Quiz 2: Pipeline Dependencies

In the DAG above, the Deploy node should only run when:

A) Extract completes successfully
B) Train completes successfully
C) Both Train AND Evaluate complete, and Evaluate passes quality gates
D) Any single upstream task completes

3

Feature Engineering Pipelines

Raw data is rarely model-ready. Feature engineering transforms it into representations that capture the signal your model needs. The key insight: feature engineering should live inside a reproducible pipeline, not scattered across notebook cells.

🧪 Interactive: Raw → Features

Watch how raw customer data is transformed step by step. Click each stage to apply the transformation.

📊 Raw Data → 🧹 Impute Nulls → 📏 Normalize → 🏷️ Encode Cats → ⚗️ Create Features

📈 Feature Quality vs. Model Performance

Drag the slider to see how feature engineering investment affects accuracy. Based on real ML project data.

Model Accuracy: 73% | False Positive Rate: 31% | Business Impact: $12K/mo

💻 Code Sandbox: Build a sklearn Pipeline

Complete the pipeline below. Replace # YOUR CODE HERE with a StandardScaler() step, then click Run.

pipeline_sandbox.py
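For reference, one way the completed sandbox could look. The surrounding scaffold isn't shown here, so the model choice and variable names are assumptions:

```python
# One possible completion of the exercise: the StandardScaler step slots in
# ahead of the model, so its mean/std are learned only during fit().
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),   # replaces "# YOUR CODE HERE"
    ("model", LogisticRegression()),
])
# pipeline.fit(X_train, y_train)  learns scaling statistics from X_train only
# pipeline.predict(X_new)         reuses those exact statistics at inference
```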

🧠 Quiz 3: Why Use a sklearn Pipeline?

What is the most important production benefit of wrapping preprocessing in a Pipeline object?

A) It makes the code run faster
B) Preprocessing parameters (e.g., scaler mean/std) learned on training data are automatically applied to new data at inference, preventing data leakage
C) It automatically selects the best features
D) It handles model versioning automatically

4

Continuous Training & Retraining Triggers

Models decay. The world changes: customer behavior shifts, economic conditions evolve, new product lines launch. A model trained on last year's data becomes less accurate over time. This is called concept drift.

📉 Concept Drift Visualizer

This chart shows model accuracy over 12 months. Use the controls to simulate a retraining event and explore different drift patterns.

Select a drift scenario above to see how concept drift affects model performance.

🤔 When Should You Retrain?

Compare the three main retraining strategies. Click each to see pros and cons.

โฐ Time-Based

Retrain every N days, regardless of performance

📊 Performance-Based

Retrain when accuracy drops below threshold

🌊 Data Drift Detection

Monitor input data distribution; retrain when it shifts
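The third strategy can be sketched with a two-sample Kolmogorov-Smirnov test per monitored feature. The feature, sample sizes, and the 0.05 significance level below are illustrative choices, not fixed recommendations:

```python
# Sketch of data drift detection: compare the live distribution of a feature
# against the distribution it had at training time.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha: float = 0.05) -> bool:
    # Reject "same distribution" at significance level alpha -> treat as drift
    _, p_value = ks_2samp(train_values, live_values)
    return bool(p_value < alpha)

rng = np.random.default_rng(7)
baseline = rng.normal(40, 10, size=2_000)   # e.g. customer ages at training time
shifted = rng.normal(55, 10, size=2_000)    # live traffic after a population shift
# A detected drift on a monitored feature becomes the retraining trigger.
```

In production this check would run per feature on each incoming batch, with the set of monitored features and the alert threshold tuned to tolerate normal seasonality.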

🧠 Quiz 4: Diagnosing Accuracy Drops

Your churn model's accuracy dropped from 92% to 87% this month. What should you check first?

A) Immediately retrain on all available data
B) Switch to a more complex model (e.g., XGBoost → Neural Network)
C) Check if the input data distribution has changed (new data sources, schema changes, seasonality)
D) Roll back to the previous model version

5

Testing ML Pipelines

Would you deploy software without testing it? ML pipelines need tests too โ€” but they're different from regular software tests. You test data, transformations, model behavior, and end-to-end system behavior.

🔺 The ML Testing Pyramid

Click each layer to see what to test at that level.

🚀 E2E Tests (few, slow)
🔗 Integration Tests (some)
🔬 Unit Tests (many, fast)
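At the base of the pyramid, a unit test exercises one transformation as a pure function. As an illustration, here is the CLV feature from the notebook earlier, refactored into a testable function (the function name and test framing are assumptions):

```python
# Sketch of a unit test for a single feature transformation.
import pandas as pd

def add_clv(df: pd.DataFrame) -> pd.DataFrame:
    """Customer lifetime value = revenue * tenure; never mutates its input."""
    out = df.copy()
    out["clv"] = out["revenue"] * out["tenure_months"]
    return out

def test_add_clv():
    df = pd.DataFrame({"revenue": [100.0, 0.0], "tenure_months": [12, 6]})
    result = add_clv(df)
    assert list(result["clv"]) == [1200.0, 0.0]
    assert "clv" not in df.columns   # the input frame is left untouched

test_add_clv()
```

Tests like this run in milliseconds, so they can gate every commit, while the slower integration and E2E layers run on a schedule or before deploys.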

💻 Code Editor: Data Validation Tests

This is a Great Expectations-style data validation suite. Edit and run to see validation results.

data_validation.py
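The editor's contents aren't shown here, but the idea behind a Great Expectations-style suite can be sketched with plain pandas: each expectation is a named boolean check, and a batch passes only when all checks do. Column names and rules are illustrative:

```python
# Sketch of a data validation suite: named expectations over an input batch.
import pandas as pd

def validate(df: pd.DataFrame) -> list:
    checks = {
        "age_in_range":     df["age"].between(0, 120).all(),
        "no_null_customer": df["customer_id"].notna().all(),
        "churn_is_binary":  df["churn"].isin([0, 1]).all(),
    }
    failures = [name for name, ok in checks.items() if not ok]
    return failures                   # an empty list means the batch passes

good = pd.DataFrame({"customer_id": [1, 2], "age": [34, 71], "churn": [0, 1]})
bad = good.assign(age=[34, -1])       # the invalid age = -1 records from Quiz 5
```

A pipeline would run `validate` as its Validate stage and route any batch with a non-empty failure list to quarantine rather than to training.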

🧠 Quiz 5: ML Testing Strategy

Your data validation suite detects that 3% of incoming records have age = -1 (clearly invalid). Your pipeline is set to fail on any validation error. What should you do?

A) Change the pipeline to ignore validation errors and train anyway
B) Fix only the validation rule (allow negative age)
C) Quarantine invalid records, investigate the root cause in the data source, then retrain on clean data
D) Since only 3% are affected, the model can handle the noise

🎓 Framework 2 Summary: ML Pipeline Engineering

What you learned:

  • Why Jupyter notebooks fail in production (5 specific problems)
  • How to design a DAG-based pipeline with proper dependencies
  • Feature engineering as a reproducible, versioned pipeline step
  • When and how to retrain models (3 strategies)
  • The ML testing pyramid: unit → integration → E2E

Key tools to know:

  • Apache Airflow – DAG orchestration
  • Prefect / Dagster – modern orchestration
  • sklearn Pipeline – preprocessing + model
  • Great Expectations – data validation
  • MLflow – model versioning (next framework)
Continue to Framework 3: MLOps →