โ† Pipeline Engineering Course Home Next: Enterprise Integration โ†’
ML Frameworks & Applied Analytics · Framework 3 of 5

🚀 Model Deployment & MLOps

From Jupyter notebook to production system: keeping your model alive, monitored, and profitable

โฑ 90 min ๐ŸŽ“ Business Students ๐Ÿ›  5 Hands-on Labs ๐Ÿ’ฐ $50K/hr Stakes

🚨 3 AM. Your Phone Rings.

"It's 3 AM. Your phone rings. The recommendation model is returning the same product for every user. Revenue is dropping $50,000 per hour. What do you do?"


This isn't a hypothetical. It happened to a major e-commerce company in 2022. The root cause? A Docker container that worked perfectly on every engineer's laptop, but silently failed when the base image was updated in production.

Today you'll learn to prevent incidents, detect them faster, and recover in minutes, not hours.

🗺 Your Learning Journey: 5 MLOps Pillars

1. Containerization · 2. Model Serving · 3. A/B Testing · 4. Monitoring · 5. Rollback

📦 Pillar 1: Containerization for ML (Docker)

A container packages your model, code, and exact environment into one portable unit. If it runs in your container, it runs in production. That's the promise of Docker, and the cure for "works on my machine."

Without Docker 😱

  • Python 3.8 locally → Python 3.11 in prod
  • scikit-learn 1.0 → 1.3 in prod
  • No GPU driver on prod server
  • Missing system library libgomp
  • Result: 3 AM phone call

With Docker ✅

  • Exact Python version locked
  • Exact library versions pinned
  • All system dependencies included
  • Runs identically everywhere
  • Result: peaceful sleep
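Docker locks the environment at build time; a startup self-check can still catch gaps at run time. A minimal sketch in Python — the pins and the `check_environment` helper are illustrative, not part of any real deployment; real values would come from your Dockerfile and requirements.txt:

```python
# Startup self-check: verify the runtime matches the pinned environment.
# All pins here are hypothetical examples.
import sys
import ctypes.util


def check_environment(observed, pins):
    """Return a list of human-readable mismatches between env and pins."""
    problems = []
    for name, expected in pins.items():
        actual = observed.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected}, got {actual}")
    return problems


def current_environment():
    """Collect the facts the 3 AM incident hinged on."""
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # a missing libgomp in the base image is the classic silent failure
        "libgomp": "present" if ctypes.util.find_library("gomp") else "missing",
    }


issues = check_environment(current_environment(),
                           {"python": "3.11", "libgomp": "present"})
print(issues or "environment OK")
```

Running this as the container's first step turns a silent production failure into a loud, immediate one.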

🔨 Lab 1: Build Your Dockerfile Layer by Layer

Add each layer by clicking the buttons below. Watch the Dockerfile assemble, and see how image size grows with each addition.


🧩 Quiz 1: The Production Gotcha

Your container works perfectly locally but crashes immediately in production. The error says: libgomp.so.1: cannot open shared object file. What's most likely the cause?

A) Your Python code has a bug that only appears at scale
B) The production server has too little RAM
C) Your Dockerfile uses a different base image than production (missing system library)
D) You forgot to push the latest model file

⚡ Pillar 2: Model Serving Patterns

Not all inference is equal. Choosing the wrong serving pattern can mean paying 10× more for the same results, or missing your latency SLA entirely.

📦 Batch
⚡ Real-time
🌊 Streaming

Batch Inference

Process large volumes of data at scheduled intervals (hourly, nightly). Results are pre-computed and stored.

  • Latency: Minutes to hours (acceptable)
  • Throughput: Very high (millions of rows)
  • Cost: Low (run only when needed)
  • Use when: Results don't need to be instant
🎬 Netflix
Batch Serving
Recommendations computed nightly for all 260M subscribers. By morning, your homepage is ready instantly. Real-time computation would cost 100× more.

Real-time Inference

Model responds to each request within milliseconds. Results are computed on demand.

  • Latency: < 100ms (tight SLA)
  • Throughput: Medium (requests/sec)
  • Cost: High (always-on infrastructure)
  • Use when: User is waiting for the answer
๐Ÿ” Google Search
Real-time Serving
Every search query triggers 200+ ML models in <200ms โ€” spam detection, query understanding, result ranking. You can't pre-compute "what will people search?"

Streaming Inference

Continuously process data streams as events arrive in near-real-time (seconds, not hours).

  • Latency: 1–10 seconds
  • Throughput: Very high (event streams)
  • Cost: Medium (always-on + scale)
  • Use when: Data arrives continuously, decisions needed fast
🚗 Uber
Streaming Serving
Surge pricing model ingests real-time GPS data from millions of drivers and riders. Prices update every few seconds. Too slow → wrong price. Too fast → unstable UI.

📊 Lab 2: Latency vs Throughput Simulator

Adjust batch size and see how latency and throughput change. Find the sweet spot for your use case.

(Interactive simulator: slide batch size from 1 (real-time) through 128 to 512 (batch). Sample readout at batch size 1: P99 latency 8 ms, throughput 125 req/s, cost $0.08 per 1K requests, <200 ms SLA met, pattern: real-time.)
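The tradeoff the simulator demonstrates fits in a few lines. A toy cost model, where the 5 ms overhead and 0.5 ms per-item figures are made-up constants rather than measurements:

```python
# Toy model of the Lab 2 tradeoff: bigger batches amortize the fixed
# cost of a forward pass (throughput rises), but every request waits
# for the whole batch (latency rises). Constants are illustrative.
OVERHEAD_MS = 5.0    # assumed fixed cost per model invocation
PER_ITEM_MS = 0.5    # assumed marginal cost per example


def batch_profile(batch_size):
    """Return (latency in ms, throughput in requests/sec)."""
    latency_ms = OVERHEAD_MS + PER_ITEM_MS * batch_size
    throughput_rps = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_rps


for size in (1, 128, 512):
    lat, tput = batch_profile(size)
    verdict = "meets" if lat < 200 else "misses"
    print(f"batch={size:>3}  latency={lat:6.1f} ms  "
          f"throughput={tput:7.0f} req/s  {verdict} the 200 ms SLA")
```

Under these assumed constants, batch size 512 roughly doubles the throughput of batch size 128 but blows through the 200 ms SLA: the "sweet spot" depends entirely on which metric your use case cares about.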

🎯 Which Pattern Fits Your Use Case?

Use Case | Recommended Pattern | Key Reason
Credit card fraud detection | Real-time | Transaction must be approved/denied in milliseconds
Weekly sales forecast | Batch | Results consumed next morning; high volume
Ride-share surge pricing | Streaming | Driver/rider GPS updates every few seconds
Email spam filter | Real-time | User expects immediate delivery or block
Product recommendations (homepage) | Batch | Pre-compute for all users nightly
Social media content moderation | Streaming | Posts arrive continuously; hours is too slow
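The table above amounts to a simple decision rule. A sketch, where the `choose_pattern` function and its thresholds are illustrative cut-offs rather than industry standards:

```python
# The use-case table as a decision rule. Thresholds are illustrative.
def choose_pattern(answer_needed_within_s, data_arrives_continuously):
    if answer_needed_within_s <= 0.2:       # someone is actively waiting
        return "real-time"
    if data_arrives_continuously and answer_needed_within_s <= 60:
        return "streaming"
    return "batch"


print(choose_pattern(0.05, False))      # fraud check -> real-time
print(choose_pattern(10, True))         # surge pricing -> streaming
print(choose_pattern(8 * 3600, False))  # weekly forecast -> batch
```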

🧩 Quiz 2: Serving Pattern Choice

A hospital wants to flag patients at high risk of sepsis. The model analyzes vital signs (updated every 5 minutes). Which serving pattern is most appropriate?

A) Batch: run the model nightly for all patients
B) Streaming: process vital sign updates as they arrive in near-real-time
C) Real-time API: wait for a doctor to request a prediction
D) No ML needed: use rule-based thresholds only

🎲 Pillar 3: A/B Testing & Canary Deployments

Never deploy a new model to 100% of users at once. Split the risk. Measure the impact. Let data, not opinions, decide if the new model is better.

🧪 Lab 3: Run Your Own A/B Test

Simulate an A/B test comparing your current model (control) against a new model (treatment). Adjust parameters and watch statistical significance emerge.

(Interactive: choose a sample size per arm from 100 to 10,000 and a true lift from -3% to +10%, then run the test to see control vs. treatment conversion rates, the p-value, and statistical power.)
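Behind the lab's p-value readout is a standard two-sided two-proportion z-test. A stdlib-only sketch; the conversion counts are invented, and `two_proportion_p_value` is a hypothetical helper, not a library function:

```python
# Two-sided two-proportion z-test, like the one behind Lab 3's p-value.
from math import sqrt, erf


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both arms share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # standard normal CDF
    return 2 * (1 - phi)


# invented example: 3.2% control CVR vs 4.0% treatment CVR, 5,000 per arm
p = two_proportion_p_value(160, 5000, 200, 5000)
print(f"p = {p:.4f}")   # around 0.03: significant at 0.05 for ONE metric
# Quiz 3's trap: if you tested k metrics, compare p against 0.05/k
# (Bonferroni correction) instead of 0.05.
```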

๐Ÿค Lab 3b: Canary Deployment โ€” Gradual Traffic Shift

A canary deployment gradually shifts traffic to the new model. If metrics degrade at any stage, you roll back, and only a fraction of users is affected.

Baseline (traffic split: 0% new model, 100% current v1): error rate 0.1% (normal), P99 latency 82 ms (normal), conversion rate 3.2% (baseline), 0 users affected. Set the canary percentage to begin the gradual rollout.
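The rollout logic can be sketched as a loop over traffic stages. The stage fractions, baseline error rate, and 3× degradation threshold below are illustrative choices, not recommended values:

```python
# Canary controller sketch: shift traffic in stages, abort if the new
# model's error rate degrades. Stages and thresholds are illustrative.
STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]   # fraction of traffic on new model
BASELINE_ERROR_RATE = 0.001               # current model's error rate (0.1%)
MAX_DEGRADATION = 3.0                     # tolerate up to 3x baseline


def run_canary(observe_error_rate):
    """Return ('promoted', 1.0) or ('rolled_back', failing_stage)."""
    for fraction in STAGES:
        if observe_error_rate(fraction) > BASELINE_ERROR_RATE * MAX_DEGRADATION:
            return "rolled_back", fraction
        # a real controller would also watch latency and conversion here
    return "promoted", 1.0


# healthy rollout: new model matches baseline at every stage
print(run_canary(lambda frac: 0.001))          # ('promoted', 1.0)
# broken rollout: errors spike once the canary reaches 25% of traffic
print(run_canary(lambda frac: 0.05 if frac >= 0.25 else 0.001))
```

In the broken rollout, only a quarter of users ever saw the bad model, which is the whole point of the stages.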

🧩 Quiz 3: The Multiple Testing Trap

You ran an A/B test. Your new recommendation model shows a 2% conversion lift with p = 0.04. Your team is excited. Do you ship it?

A) Yes: p < 0.05, it's statistically significant. Ship immediately.
B) Yes: a 2% lift is huge business value regardless of p-value.
C) Not yet. Ask: Was this the only metric tested? Did we peek early? Multiple testing inflates false positives.
D) No: p < 0.05 is not significant enough for production deployment.

📡 Pillar 4: Model Monitoring & Drift Detection

A model that was 92% accurate at launch might be 71% accurate 6 months later, without a single line of code changing. Why? The world changed. Your model didn't.

Data Drift

Input feature distribution changes. Customers who used to be 25–34 years old are now predominantly 45–54.

PSI · KS Test · Jensen-Shannon

Concept Drift

The relationship between features and outcome changes. A "good credit" score meant something different before vs. after a recession.

Performance metrics · Label shift
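PSI (Population Stability Index), the first of the drift scores listed under data drift, compares a feature's binned distribution in production against the one seen at training time. A small sketch; the age-bucket fractions are invented for illustration:

```python
# Population Stability Index (PSI), a standard data-drift score.
# The age-bucket fractions below are invented for illustration.
from math import log


def psi(expected_pct, actual_pct, eps=1e-6):
    """PSI between two binned distributions given as lists of fractions."""
    total = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        total += (a - e) * log(a / e)
    return total


training_ages = [0.10, 0.40, 0.30, 0.15, 0.05]   # distribution at training
live_ages     = [0.05, 0.20, 0.30, 0.30, 0.15]   # same buckets in production
# common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 act now
print(f"PSI = {psi(training_ages, live_ages):.3f}")
```

Here the audience has shifted toward the older buckets and PSI lands well above 0.25, the same territory as the PSI = 0.31 alert in the incident walkthrough below.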

📊 Lab 4: Live Monitoring Dashboard

This simulates a real production monitoring dashboard. Click "Start Monitoring" and watch for anomalies, then diagnose and respond.

Healthy baseline (simulated time T+0h): model accuracy 92.1% (healthy), PSI score 0.04 (no drift), KS statistic 0.06 (stable), average latency 87 ms (normal). Alert thresholds for accuracy (%) and latency (ms) are configurable.

🚑 Incident Response: Step Through the Process

Click each step to advance through a real incident response. This is the process your on-call engineer follows at 3 AM.

🔍 1. Detect
PagerDuty alert fires. Accuracy dropped from 92% → 61%. Drift PSI = 0.31.

🩺 2. Diagnose
Is it data drift, concept drift, or infrastructure failure? Check feature distributions vs. training data.

🛠 3. Mitigate
Option A: Rollback to v1 (immediate). Option B: Hotfix input preprocessing. Option C: Emergency retrain.

✅ 4. Verify
Monitor for 30 min post-fix. Confirm accuracy recovered. Check no new alerts.

📝 5. Postmortem
Write incident report: root cause, timeline, what broke, what was missing in monitoring, prevention plan.

🧩 Quiz 4: Drift Detection

Your fraud detection model's accuracy is still 91% (same as launch), but fraud losses have increased 40% over 6 months. What type of drift is most likely occurring?

A) Data drift: input features have shifted distribution
B) Concept drift: fraudsters adapted to your model; the accuracy metric is misleading because the fraud rate itself changed
C) Infrastructure drift: latency increased, causing more fraud
D) No drift: 91% accuracy proves the model is working fine

โช Pillar 5: Rollback Strategies

Every deployment needs an escape hatch. The fastest fix is almost always rolling back to the last known-good model version, not debugging at 3 AM.

8 min: average time to roll back with a good MLOps pipeline, versus 4+ hours of debugging without one.

🔄 Lab 5: Practice a Model Rollback

Walk through a simulated rollback. Your model is failing: you need to detect the issue, switch versions, and verify recovery.

Starting state: model v2.3 serving 100% of traffic, accuracy 92.1% (normal), error rate 0.1% (normal), $0 revenue lost so far.
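Fast rollback depends on keeping every deployed version addressable. A toy registry showing the core idea; real registries such as MLflow expose the same operations through their own APIs, and this class is purely illustrative:

```python
# Toy model registry with the one operation that matters at 3 AM:
# roll back to the newest version still marked healthy.
class ModelRegistry:
    def __init__(self):
        self.versions = []      # [version, healthy] in deployment order
        self.active = None

    def deploy(self, version):
        self.versions.append([version, True])
        self.active = version

    def mark_unhealthy(self, version):
        for entry in self.versions:
            if entry[0] == version:
                entry[1] = False

    def rollback(self):
        """Point traffic at the newest healthy version; return it."""
        for version, healthy in reversed(self.versions):
            if healthy and version != self.active:
                self.active = version
                return version
        raise RuntimeError("no healthy version to roll back to")


registry = ModelRegistry()
registry.deploy("v2.2")
registry.deploy("v2.3")            # the release that pages you at 3 AM
registry.mark_unhealthy("v2.3")
print(registry.rollback())         # switches traffic back to v2.2
```

Because the previous artifact is still registered, the switch is a pointer flip measured in minutes, not a debugging session measured in hours.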

🌳 Decision Tree: Rollback vs Hotfix vs Retrain?

Use this framework when an incident occurs. Click to explore the decision path.

โ“ Model performance is degrading in production. What now?
โ†“

📋 Rollback vs Hotfix vs Retrain: Quick Reference

Option | When to Use | Time to Execute | Risk
Rollback | New model broke something; previous version was good | Minutes | Low
Hotfix | Infrastructure/data pipeline issue, not model logic | 30 min – 2 hrs | Medium
Retrain | Data drift, world has changed, no good previous version | Hours to days | High
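The quick-reference table reduces to a short decision function. A sketch, where the simplified three-way `cause` taxonomy is an assumption made for illustration:

```python
# The quick-reference table as a decision function. The three-way
# 'cause' taxonomy is a simplification made for illustration.
def choose_response(previous_version_good, cause):
    """cause is one of: 'model', 'infrastructure', 'drift'."""
    if cause == "model" and previous_version_good:
        return "rollback"   # minutes to execute, low risk
    if cause == "infrastructure":
        return "hotfix"     # 30 min - 2 hrs, medium risk
    return "retrain"        # hours to days, high risk


print(choose_response(True, "model"))           # broken release -> rollback
print(choose_response(True, "infrastructure"))  # pipeline bug -> hotfix
print(choose_response(False, "drift"))          # world changed -> retrain
```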

🧩 Quiz 5: The Right Response

At 2 PM on Black Friday, your pricing model starts returning $0.00 for all products. Error rate spikes to 95%. Your last working deployment was 3 hours ago. What do you do FIRST?

A) Start investigating the root cause: understand the bug before acting
B) Retrain the model with today's data to fix the issue
C) Immediately roll back to the version from 3 hours ago, then investigate
D) Disable the ML model and use manual pricing rules

🎯 MLOps Mastery: Your Incident Survival Kit

📦 Containerize
Lock your environment. Docker prevents "works on my machine."

⚡ Right Serving
Match batch/real-time/streaming to your latency requirements.

🎲 Test & Canary
Never deploy to 100%. Measure first. Let data decide.

📡 Monitor Drift
Your model decays. Watch PSI, KS, accuracy continuously.

⏪ Rollback Fast
Every minute costs $800+. Roll back first, debug second.

Coming up next: Framework 4, Enterprise ML Integration: connecting models to business systems, governance, and organizational change management.

ML Frameworks & Applied Analytics · Framework 3 of 5 · Chenhao Zhou, Rutgers Business School