
πŸ—οΈ Framework 1: ML System Architecture

Building production-grade ML platforms β€” from monoliths to microservices, feature stores to model registries.

🚨 Your First Week as Head of ML

Your company just raised $50M Series B. The CEO walks into your office and asks:

"How do we build an ML platform that scales? We're competing with companies that have 10Γ— our team. What's our move?"

You have 6 months and a team of 5. Every architectural decision you make in the next 30 minutes will determine whether you ship on time β€” or burn through the Series B with nothing to show.

In this module, you'll learn the 4 architectural pillars every production ML system needs. We'll make the same decisions real companies made β€” and see which ones paid off.

Module roadmap: 1. Monolith vs Microservices Β· 2. Feature Stores Β· 3. Model Registry Β· 4. API Design Β· βœ“ Mission Complete

🧱 Part 1: Monolithic vs Microservices Architecture

Day 1 decision: Your team needs to ship a fraud detection model in 8 weeks. Do you build everything in one system, or split it into independent services? This choice will shape your codebase for years.

⚑ Architecture Explorer

Click on any component to learn what it does and where it belongs. Use the tabs to compare the two architectures.

🏠 Monolithic (All-in-One)
πŸ”§ Microservices (Modular)
πŸ“Š Side-by-Side
🏠 ML Application (Single Codebase): Data Ingestion (ETL pipelines) Β· Feature Engineering (preprocessing) Β· Model Training (sklearn / torch) Β· API Serving (Flask / Django) Β· Model Registry (file system) Β· Monitoring (logging / alerts)

βœ… Fast to build. One repo, one deploy. Great for small teams & early stage.

When Monolith Wins πŸ†

  • Early stage startup β€” move fast, ship features, iterate
  • Small team (1–5 ML engineers) β€” coordination overhead kills microservices
  • Simple, homogeneous models β€” one model type, predictable traffic
  • Tight deadline β€” 8 weeks to launch? Don't architect, ship
3Γ— faster
Time-to-first-model with a monolith vs microservices, for teams of fewer than 5 engineers
Data Service (Kafka / S3, :8001) Β· Feature Store (Redis / Feast, :8002) Β· Training Svc (Kubernetes, :8003) Β· Model Registry (MLflow, :8004) Β· Serving Svc (FastAPI, :8005), all behind an API Gateway (Auth Β· Rate Limiting Β· Routing), with Monitoring & Observability (Prometheus Β· Grafana Β· PagerDuty)

βš™οΈ Each service independently deployable. Scale what needs scaling.

When Microservices Win πŸ†

  • Large, diverse team β€” 10+ engineers can work in parallel without conflicts
  • Multiple model types β€” recommendation, fraud, NLP β€” each scales differently
  • High availability needs β€” one service failure shouldn't kill everything
  • Rapid iteration at scale β€” deploy one service without touching others
$2.4M saved
Netflix's per-service scaling saves ~$2.4M/month vs monolithic over-provisioning

🏠 Monolithic

  β€’ Deploy time: ⚑ 15 min
  β€’ Team size fit: 1–8 engineers
  β€’ Scaling: scale everything together
  β€’ Fault isolation: none
  β€’ Initial cost: $$ low
  β€’ Tech debt risk: high at scale

Best for: MVPs, small teams, homogeneous workloads

πŸ”§ Microservices

  β€’ Deploy time: πŸ• 2–4 hours setup
  β€’ Team size fit: 5–500+ engineers
  β€’ Scaling: per-service
  β€’ Fault isolation: excellent
  β€’ Initial cost: $$$$ high infra
  β€’ Tech debt risk: low long-term

Best for: Scale-ups, diverse models, high availability

🧠 Decision Time: Architecture Choice

Scenario: You're a Series A startup with 3 ML engineers. You need to ship a recommendation engine in 6 weeks for your flagship product. What do you build?

A. Full microservices from day one β€” we'll need to scale eventually
B. Monolithic application β€” ship fast, refactor when we have traction
C. Serverless functions β€” cheapest option per request
D. Buy a managed ML platform β€” don't build anything

πŸ“± Case Study: Netflix's Architecture Journey

Netflix is the canonical example of a monolith-to-microservices migration done right. Here's how they did it:

1. 2008: Monolithic "DVD rental" architecture. One codebase, one database. 3-hour deployments.
2. 2009–2011: Database corruption incident takes down the entire platform for 3 days. Cost: ~$50M in subscriber credits. Decision made: break up the monolith.
3. 2012–2015: Gradual migration to microservices. ML recommendation engine broken out first. Each team owns their service.
4. 2016+: 700+ microservices. Recommendation ML deploys 4,000+ times/day. A/B test new models in hours, not months.

Lesson: They started monolithic deliberately β€” not by accident. The monolith let them build product intuition before investing in infrastructure.

πŸ—ƒοΈ Part 2: Feature Stores β€” The Heart of ML Infrastructure

Week 3: Your fraud detection model is in production. Now the recommendation team wants to use some of the same features β€” user transaction history, session behavior. Without a feature store, they'll spend 3 months rebuilding pipelines you already built. There's a better way.

πŸ“Š The Feature Reuse Multiplier

How much engineering time do you save as feature reuse increases? Drag the slider to see.

At 1Γ— reuse (no feature store):

  β€’ Engineering weeks per model: 12 wks (πŸ• ~3 months per team)
  β€’ Annual savings (@ $180K/engineer): $0
  β€’ Model delivery velocity: 1 model/quarter
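The slider's arithmetic can be sketched in a few lines. The constants (12 base engineering weeks per model, $180K fully loaded engineer cost, 48 working weeks per year) come from the widget above; the exact formula is an assumption for illustration, not the course's reference implementation.

```python
def feature_reuse_savings(reuse_factor: float,
                          base_weeks: float = 12,       # weeks per model with no reuse
                          eng_cost: float = 180_000,    # fully loaded annual cost
                          weeks_per_year: int = 48) -> dict:
    """Estimate per-model effort and dollar savings at a given reuse factor."""
    weeks_per_model = base_weeks / reuse_factor          # reused features cut build time
    weeks_saved = base_weeks - weeks_per_model
    savings = weeks_saved * (eng_cost / weeks_per_year)  # dollar value of saved weeks
    return {"weeks_per_model": weeks_per_model, "annual_savings": round(savings)}

print(feature_reuse_savings(1))  # no reuse: 12 wks/model, $0 saved
print(feature_reuse_savings(3))  # 3x reuse: 4 wks/model, $30K saved per model
```

At 3Γ— reuse, eight weeks of duplicated pipeline work disappear per model, which is where the "feature reuse multiplier" compounds as more teams consume the same store.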

πŸ”§ Build a Feature Pipeline

Click "Run Step" to walk through how raw data becomes model-ready features in a production feature store.

πŸ“Š Raw Data (S3 / DB) β†’ πŸ”„ Transform (Spark / dbt) β†’ πŸ—ƒοΈ Feature Store (Feast / Redis) β†’ πŸ€– Model (Training) β†’ ⚑ Serving (Real-time)
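The five stages can be sketched end-to-end in plain Python. A dict stands in for the feature store (Feast/Redis in production), and the transform step would run on Spark or dbt at scale; the event fields and feature names here are illustrative, not from the course.

```python
from collections import defaultdict
from statistics import mean

# 1. Raw data (would be read from S3 / a database)
raw_events = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 35.0},
    {"user_id": 2, "amount": 12.0},
]

# 2. Transform: aggregate raw events into per-user features
amounts = defaultdict(list)
for event in raw_events:
    amounts[event["user_id"]].append(event["amount"])

# 3. Feature store: features keyed by entity id for fast lookup
feature_store = {
    user: {"txn_count": len(vals), "avg_amount": mean(vals)}
    for user, vals in amounts.items()
}

# 4. Training reads the full feature table; 5. serving does point lookups
print(feature_store[1])  # {'txn_count': 2, 'avg_amount': 27.5}
```

The key design point: the same `feature_store` entry feeds both training (scan everything) and serving (look up one entity), so the two never drift apart.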

🧠 Feature Store Architecture

Your recommendation model needs features that must be computed in real-time (e.g., "items the user clicked in the last 5 minutes") AND historical features computed offline (e.g., "user's 30-day purchase history"). What architecture supports both?

A. Online-only store β€” real-time is more important, use Redis for everything
B. Offline-only store β€” batch process nightly, good enough for recommendations
C. Lambda architecture β€” online store (Redis) for real-time + offline store (S3/BigQuery) for batch
D. Recompute all features at serving time β€” freshest possible data
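The lambda pattern in option C can be sketched as a merge of two stores at serving time. The stores are plain dicts here (Redis and S3/BigQuery in production), and the feature names are made up for illustration.

```python
online_store = {          # updated by a streaming job, seconds-fresh
    "user_42": {"clicks_last_5min": 3},
}
offline_store = {         # recomputed nightly by a batch job
    "user_42": {"purchases_30d": 7, "avg_order_value": 54.2},
}

def get_serving_features(entity_id: str) -> dict:
    """Merge batch (offline) and real-time (online) features for one entity."""
    features = dict(offline_store.get(entity_id, {}))
    features.update(online_store.get(entity_id, {}))  # fresh values win on conflict
    return features

print(get_serving_features("user_42"))
# {'purchases_30d': 7, 'avg_order_value': 54.2, 'clicks_last_5min': 3}
```

The model sees one flat feature vector; it never needs to know which store each value came from.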

πŸš— Real World: Uber's Michelangelo Feature Store

Uber's Michelangelo is one of the most influential ML platforms ever built. Here's the problem it solved:

⚠
Problem (2015): 20 teams each building their own feature pipelines. ETA model recomputing the same "driver location history" features as the surge pricing model. 60% of ML engineering time was duplicated feature work.
βœ“
Solution: Centralized feature store. Teams register features once. Any model can consume them. Features backed by Cassandra (real-time) + Hive (historical).
πŸ“ˆ
Result: Feature development time dropped 70%. New models ship in days instead of months. 10,000+ features registered across the org β€” every team benefits from every other team's work.

Business impact: Uber's ETA model accuracy improvement (enabled by richer features) is estimated to have reduced driver idle time by 20%, generating ~$300M/year in efficiency gains.

πŸ“‹ Part 3: Model Registry & Versioning

Month 2: Your fraud model is live. It's working great. Then an engineer updates a preprocessing step, retrains the model, and pushes it directly to production. Performance tanks β€” fraud escapes undetected. You don't know which version is running or how to roll back. Sound familiar?

😱 The Versioning Horror Show

You deployed a model without versioning. The model starts performing poorly. Which nightmare scenario are you in?

A. "Which model is in production?" β€” nobody knows the exact version
B. "We can't roll back" β€” the old model weights were overwritten
C. "Training-serving skew" β€” production data doesn't match training features
D. All of the above β€” and your CEO is on the phone

πŸ”¬ MLflow-Style Model Registry

Simulate promoting a model through the development lifecycle. Click on each stage to move the model forward.

  β€’ πŸ”¬ Development: fraud-v1.3.2 (Accuracy: 91.2%, F1: 0.87)
  β€’ πŸ§ͺ Staging: awaiting (shadow traffic)
  β€’ πŸš€ Production: awaiting (live traffic)

Model registry initialized. fraud-v1.3.2 in development.

πŸ“œ What Goes in a Model Version?

πŸ”’ Version Metadata: model name, version tag (semantic: v1.3.2), training timestamp, git commit hash, author, description, tags.
πŸ“Š Performance Metrics: all training and validation metrics (accuracy, F1, AUC-ROC, precision@K, RMSE β€” whatever your task requires), stored immutably per version.
πŸ—‚οΈ Artifacts & Dependencies: model weights/pickle file, feature schema, preprocessing pipeline, conda environment / requirements.txt, Docker image tag. Everything needed to reproduce inference exactly.
πŸ“‹ Lineage & Provenance: training dataset version, feature store snapshot, hyperparameters, training code version. Answers: "If this model makes a bad prediction, why?" and "Can we reproduce this result in 2 years?"
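A registry record covering those four buckets can be sketched as a small dataclass with a stage-promotion method. This is a hand-rolled illustration of the concept, not MLflow's actual API; all field names are assumptions.

```python
from dataclasses import dataclass

STAGES = ["development", "staging", "production"]

@dataclass
class ModelVersion:
    name: str                  # version metadata
    version: str
    git_commit: str
    metrics: dict              # performance metrics, immutable per version
    artifact_uri: str          # weights + preprocessing pipeline location
    dataset_version: str       # lineage: which data trained it
    stage: str = "development"

    def promote(self) -> str:
        """Move the model one stage forward (dev -> staging -> production)."""
        i = STAGES.index(self.stage)
        if i < len(STAGES) - 1:
            self.stage = STAGES[i + 1]
        return self.stage

m = ModelVersion("fraud", "v1.3.2", "a1b2c3d", {"f1": 0.87},
                 "s3://models/fraud/v1.3.2", "txns-2024-03")
m.promote()
print(m.stage)  # staging
```

Because every field travels with the version, rolling back is just pointing production at the previous `ModelVersion` record; nothing has to be reconstructed from memory.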
87%
of ML teams report production incidents caused by model versioning failures (Algorithmia 2022 State of ML Report)

πŸ”Œ Part 4: API Design for ML Services

Month 4: Your model is solid. Now engineering wants to integrate it into the mobile app. You need to expose it as an API. How you design this API determines latency, reliability, and how easily other teams can use your model.

⏱️ Batch vs Real-Time Latency Simulator

Different serving patterns have very different latency profiles. Use the controls to simulate request types and see the impact.

Simulating 1 request against a medium model (XGBoost):

  β€’ ⚑ Real-Time (synchronous): ~12ms p50
  β€’ πŸ“¦ Micro-Batch (100ms window): ~110ms p50
  β€’ πŸ—ƒοΈ Batch (async queue): ~5,000ms p50
  β€’ πŸ’Ύ Pre-computed (cache hit): ~2ms p50

πŸ’‘ For 1 request: Real-time is best. Use synchronous REST API with <50ms SLA.

🧠 API Design Decision

Your fraud detection model must return a decision within 200ms while the user is completing a checkout. The model uses 50 features. What serving pattern do you use?

A. Synchronous REST API with pre-fetched features from online feature store
B. Async batch job β€” queue the request and return a job ID
C. Pre-compute scores for all users nightly and cache them
D. GraphQL subscription β€” stream predictions as they're ready

πŸ’» Code Lab: Build Your ML Serving API

Write a FastAPI prediction endpoint. Your task: implement the /predict endpoint that (1) validates input, (2) loads features, (3) runs inference, and (4) returns a structured response. Click "Run" to test it.

fraud_api.py (FastAPI Β· Python 3.11, simulated execution)

πŸ’‘ Key API Design Principles for ML

  β€’ Schema Versioning: include model_version in every response. When predictions change, clients know why.
  β€’ Latency SLAs: set p99 latency budgets. Fraud: <200ms. Recommendations: <100ms. Log latency always.
  β€’ Fallback Logic: if the model fails, fall back to a rule-based system. Never return HTTP 500 to a payment flow.
  β€’ Input Validation: validate all inputs before inference. Bad inputs cause silent model degradation, which is worse than loud errors.
$18M / year
Stripe's estimated savings from <100ms fraud API latency β€” faster decisions catch more fraud without false positives that block legitimate purchases

🎯 Mission Complete: Your ML Architecture Playbook

πŸ“Š Your Progress

Complete the quizzes above to see your score!

0/4 quizzes

πŸ—ΊοΈ The $50M Series B Architecture Decision

Based on everything you've learned, here's the playbook for your company's ML platform:

Month | Decision | Why | Cost
1–2 | Monolithic MVP | Ship fast, learn what matters | $2K/mo infra
3–4 | Feature Store (Feast) | Second model proves reuse value | 1 eng-month setup
4–5 | Model Registry (MLflow) | Multiple models β†’ governance needed | Open source, free
5–6 | REST APIs (FastAPI) | Product integrations go live | Minimal overhead
6+ | Begin microservices migration | Now you know what to split | 3–6 months effort

πŸ”‘ Key Takeaways

  β€’ Start monolithic on purpose: a small team with a tight deadline ships faster in one codebase, and you can split services once you know what actually needs splitting.
  β€’ Add a feature store when the second team wants the same features; reuse, not novelty, is where the savings come from.
  β€’ Version every model with metadata, metrics, artifacts, and lineage. If you can't say which version is in production, you can't roll back.
  β€’ Design APIs with latency SLAs, input validation, fallback logic, and a model_version in every response.

ML Frameworks & Applied Analytics Β· Chenhao Zhou Β· Rutgers Business School
Framework 1 of 6 Β· Teaching Portfolio