
πŸ—οΈ Framework 1: ML System Architecture

Building production-grade ML platforms β€” from monoliths to microservices, feature stores to model registries.

🚨 Your First Week as Head of ML

Your company just raised $50M Series B. The CEO walks into your office and asks:

"How do we build an ML platform that scales? We're competing with companies that have 10Γ— our team. What's our move?"

You have 6 months and a team of 5. Every architectural decision you make in the next 30 minutes will determine whether you ship on time β€” or burn through the Series B with nothing to show.

In this module, you'll learn the 4 architectural pillars every production ML system needs. We'll make the same decisions real companies made β€” and see which ones paid off.

Module roadmap: 1. Monolith vs Microservices Β· 2. Feature Stores Β· 3. Model Registry Β· 4. API Design Β· βœ“ Mission Complete

🧱 Part 1: Monolithic vs Microservices Architecture

Day 1 decision: Your team needs to ship a fraud detection model in 8 weeks. Do you build everything in one system, or split it into independent services? This choice will shape your codebase for years.

⚑ Architecture Explorer

Click on any component to learn what it does and where it belongs. Use the tabs to compare the two architectures.

🏠 Monolithic (All-in-One)
πŸ”§ Microservices (Modular)
πŸ“Š Side-by-Side
🏠 ML Application (Single Codebase): Data Ingestion (ETL pipelines) Β· Feature Engineering (preprocessing) Β· Model Training (sklearn / torch) Β· API Serving (Flask / Django) Β· Model Registry (file system) Β· Monitoring (logging / alerts)

βœ… Fast to build. One repo, one deploy. Great for small teams & early stage.

When Monolith Wins πŸ†

  • Early stage startup β€” move fast, ship features, iterate
  • Small team (1–5 ML engineers) β€” coordination overhead kills microservices
  • Simple, homogeneous models β€” one model type, predictable traffic
  • Tight deadline β€” 8 weeks to launch? Don't architect, ship
3Γ— faster
Time-to-first-model with a monolith vs microservices, for teams of fewer than 5 engineers
Data Service (Kafka / S3, :8001) Β· Feature Store (Redis / Feast, :8002) Β· Training Svc (Kubernetes, :8003) Β· Model Registry (MLflow, :8004) Β· Serving Svc (FastAPI, :8005), all behind an API Gateway (Auth Β· Rate Limiting Β· Routing), with Monitoring & Observability (Prometheus Β· Grafana Β· PagerDuty)

βš™οΈ Each service independently deployable. Scale what needs scaling.

When Microservices Win πŸ†

  • Large, diverse team β€” 10+ engineers can work in parallel without conflicts
  • Multiple model types β€” recommendation, fraud, NLP β€” each scales differently
  • High availability needs β€” one service failure shouldn't kill everything
  • Rapid iteration at scale β€” deploy one service without touching others
$2.4M saved
Netflix's per-service scaling saves ~$2.4M/month vs monolithic over-provisioning

🏠 Monolithic

  β€’ Deploy time: ⚑ 15 min
  β€’ Team size fit: 1–8 engineers
  β€’ Scaling: scale everything together
  β€’ Fault isolation: none
  β€’ Initial cost: $$ low
  β€’ Tech debt risk: high at scale

Best for: MVPs, small teams, homogeneous workloads

πŸ”§ Microservices

  β€’ Deploy time: πŸ• 2–4 hours setup
  β€’ Team size fit: 5–500+ engineers
  β€’ Scaling: per-service
  β€’ Fault isolation: excellent
  β€’ Initial cost: $$$$ high infra
  β€’ Tech debt risk: low long-term

Best for: Scale-ups, diverse models, high availability

🧠 Decision Time: Architecture Choice

Scenario: You're a Series A startup with 3 ML engineers. You need to ship a recommendation engine in 6 weeks for your flagship product. What do you build?

A. Full microservices from day one β€” we'll need to scale eventually
B. Monolithic application β€” ship fast, refactor when we have traction
C. Serverless functions β€” cheapest option per request
D. Buy a managed ML platform β€” don't build anything

πŸ“± Case Study: Netflix's Architecture Journey

Netflix is the canonical example of a monolith-to-microservices migration done right. Here's how they did it:

1. 2008: Monolithic "DVD rental" architecture. One codebase, one database. 3-hour deployments.
2. 2009–2011: Database corruption incident takes down the entire platform for 3 days. Cost: ~$50M in subscriber credits. Decision made: break up the monolith.
3. 2012–2015: Gradual migration to microservices. ML recommendation engine broken out first. Each team owns their service.
4. 2016+: 700+ microservices. Recommendation ML deploys 4,000+ times/day. A/B test new models in hours, not months.

Lesson: They started monolithic deliberately β€” not by accident. The monolith let them build product intuition before investing in infrastructure.

πŸ—ƒοΈ Part 2: Feature Stores β€” The Heart of ML Infrastructure

Week 3: Your fraud detection model is in production. Now the recommendation team wants to use some of the same features β€” user transaction history, session behavior. Without a feature store, they'll spend 3 months rebuilding pipelines you already built. There's a better way.

πŸ“Š The Feature Reuse Multiplier

How much engineering time do you save as feature reuse increases? Drag the slider to see.

At 1Γ— reuse (no feature store):

  β€’ Engineering weeks per model: 12 wks (πŸ• ~3 months per team)
  β€’ Annual savings (@ $180K/engineer): $0
  β€’ Model delivery velocity: 1 model/quarter
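The slider's arithmetic can be sketched in a few lines. The constants (12 base engineering weeks per model, $180K fully loaded engineer cost, 48 working weeks per year) come from the widget above; the exact formula is an assumption for illustration, not the course's reference implementation.

```python
def feature_reuse_savings(reuse_factor: float,
                          base_weeks: float = 12,       # weeks per model with no reuse
                          eng_cost: float = 180_000,    # fully loaded annual cost
                          weeks_per_year: int = 48) -> dict:
    """Estimate per-model effort and dollar savings at a given reuse factor."""
    weeks_per_model = base_weeks / reuse_factor          # reused features cut build time
    weeks_saved = base_weeks - weeks_per_model
    savings = weeks_saved * (eng_cost / weeks_per_year)  # dollar value of saved weeks
    return {"weeks_per_model": weeks_per_model, "annual_savings": round(savings)}

print(feature_reuse_savings(1))  # no reuse: 12 wks/model, $0 saved
print(feature_reuse_savings(3))  # 3x reuse: 4 wks/model, $30K saved per model
```

At 3Γ— reuse, eight weeks of duplicated pipeline work disappear per model, which is where the "feature reuse multiplier" compounds as more teams consume the same store.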

πŸ”§ Build a Feature Pipeline

Click "Run Step" to walk through how raw data becomes model-ready features in a production feature store.

πŸ“Š Raw Data (S3 / DB) β†’ πŸ”„ Transform (Spark / dbt) β†’ πŸ—ƒοΈ Feature Store (Feast / Redis) β†’ πŸ€– Model (Training) β†’ ⚑ Serving (Real-time)
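The five stages can be sketched end-to-end in plain Python. A dict stands in for the feature store (Feast/Redis in production), and the transform step would run on Spark or dbt at scale; the event fields and feature names here are illustrative, not from the course.

```python
from collections import defaultdict
from statistics import mean

# 1. Raw data (would be read from S3 / a database)
raw_events = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 35.0},
    {"user_id": 2, "amount": 12.0},
]

# 2. Transform: aggregate raw events into per-user features
amounts = defaultdict(list)
for event in raw_events:
    amounts[event["user_id"]].append(event["amount"])

# 3. Feature store: features keyed by entity id for fast lookup
feature_store = {
    user: {"txn_count": len(vals), "avg_amount": mean(vals)}
    for user, vals in amounts.items()
}

# 4. Training reads the full feature table; 5. serving does point lookups
print(feature_store[1])  # {'txn_count': 2, 'avg_amount': 27.5}
```

The key design point: the same `feature_store` entry feeds both training (scan everything) and serving (look up one entity), so the two never drift apart.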

🧠 Feature Store Architecture

Your recommendation model needs features that must be computed in real-time (e.g., "items the user clicked in the last 5 minutes") AND historical features computed offline (e.g., "user's 30-day purchase history"). What architecture supports both?

A. Online-only store β€” real-time is more important, use Redis for everything
B. Offline-only store β€” batch process nightly, good enough for recommendations
C. Lambda architecture β€” online store (Redis) for real-time + offline store (S3/BigQuery) for batch
D. Recompute all features at serving time β€” freshest possible data
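The lambda pattern in option C can be sketched as a merge of two stores at serving time. The stores are plain dicts here (Redis and S3/BigQuery in production), and the feature names are made up for illustration.

```python
online_store = {          # updated by a streaming job, seconds-fresh
    "user_42": {"clicks_last_5min": 3},
}
offline_store = {         # recomputed nightly by a batch job
    "user_42": {"purchases_30d": 7, "avg_order_value": 54.2},
}

def get_serving_features(entity_id: str) -> dict:
    """Merge batch (offline) and real-time (online) features for one entity."""
    features = dict(offline_store.get(entity_id, {}))
    features.update(online_store.get(entity_id, {}))  # fresh values win on conflict
    return features

print(get_serving_features("user_42"))
# {'purchases_30d': 7, 'avg_order_value': 54.2, 'clicks_last_5min': 3}
```

The model sees one flat feature vector; it never needs to know which store each value came from.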

πŸš— Real World: Uber's Michelangelo Feature Store

Uber's Michelangelo is one of the most influential ML platforms ever built. Here's the problem it solved:

⚠
Problem (2015): 20 teams each building their own feature pipelines. ETA model recomputing the same "driver location history" features as the surge pricing model. 60% of ML engineering time was duplicated feature work.
βœ“
Solution: Centralized feature store. Teams register features once. Any model can consume them. Features backed by Cassandra (real-time) + Hive (historical).
πŸ“ˆ
Result: Feature development time dropped 70%. New models ship in days instead of months. 10,000+ features registered across the org β€” every team benefits from every other team's work.

Business impact: Uber's ETA model accuracy improvement (enabled by richer features) is estimated to have reduced driver idle time by 20%, generating ~$300M/year in efficiency gains.

πŸ“‹ Part 3: Model Registry & Versioning

Month 2: Your fraud model is live. It's working great. Then an engineer updates a preprocessing step, retrains the model, and pushes it directly to production. Performance tanks β€” fraud escapes undetected. You don't know which version is running or how to roll back. Sound familiar?

😱 The Versioning Horror Show

You deployed a model without versioning. The model starts performing poorly. Which nightmare scenario are you in?

A. "Which model is in production?" β€” nobody knows the exact version
B. "We can't roll back" β€” the old model weights were overwritten
C. "Training-serving skew" β€” production data doesn't match training features
D. All of the above β€” and your CEO is on the phone

πŸ”¬ MLflow-Style Model Registry

Simulate promoting a model through the development lifecycle. Click on each stage to move the model forward.

  β€’ πŸ”¬ Development: fraud-v1.3.2 (Accuracy: 91.2%, F1: 0.87)
  β€’ πŸ§ͺ Staging: awaiting (shadow traffic)
  β€’ πŸš€ Production: awaiting (live traffic)

Model registry initialized. fraud-v1.3.2 in development.

πŸ“œ What Goes in a Model Version?

πŸ”’ Version Metadata: model name, version tag (semantic: v1.3.2), training timestamp, git commit hash, author, description, tags.
πŸ“Š Performance Metrics: all training and validation metrics (accuracy, F1, AUC-ROC, precision@K, RMSE β€” whatever your task requires), stored immutably per version.
πŸ—‚οΈ Artifacts & Dependencies: model weights/pickle file, feature schema, preprocessing pipeline, conda environment / requirements.txt, Docker image tag. Everything needed to reproduce inference exactly.
πŸ“‹ Lineage & Provenance: training dataset version, feature store snapshot, hyperparameters, training code version. Answers: "If this model makes a bad prediction, why?" and "Can we reproduce this result in 2 years?"
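A registry record covering those four buckets can be sketched as a small dataclass with a stage-promotion method. This is a hand-rolled illustration of the concept, not MLflow's actual API; all field names are assumptions.

```python
from dataclasses import dataclass

STAGES = ["development", "staging", "production"]

@dataclass
class ModelVersion:
    name: str                  # version metadata
    version: str
    git_commit: str
    metrics: dict              # performance metrics, immutable per version
    artifact_uri: str          # weights + preprocessing pipeline location
    dataset_version: str       # lineage: which data trained it
    stage: str = "development"

    def promote(self) -> str:
        """Move the model one stage forward (dev -> staging -> production)."""
        i = STAGES.index(self.stage)
        if i < len(STAGES) - 1:
            self.stage = STAGES[i + 1]
        return self.stage

m = ModelVersion("fraud", "v1.3.2", "a1b2c3d", {"f1": 0.87},
                 "s3://models/fraud/v1.3.2", "txns-2024-03")
m.promote()
print(m.stage)  # staging
```

Because every field travels with the version, rolling back is just pointing production at the previous `ModelVersion` record; nothing has to be reconstructed from memory.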
87%
of ML teams report production incidents caused by model versioning failures (Algorithmia 2022 State of ML Report)

πŸ”Œ Part 4: API Design for ML Services

Month 4: Your model is solid. Now engineering wants to integrate it into the mobile app. You need to expose it as an API. How you design this API determines latency, reliability, and how easily other teams can use your model.

⏱️ Batch vs Real-Time Latency Simulator

Different serving patterns have very different latency profiles. Use the controls to simulate request types and see the impact.

Simulating 1 request against a medium model (XGBoost):

  β€’ ⚑ Real-Time (synchronous): ~12ms p50
  β€’ πŸ“¦ Micro-Batch (100ms window): ~110ms p50
  β€’ πŸ—ƒοΈ Batch (async queue): ~5,000ms p50
  β€’ πŸ’Ύ Pre-computed (cache hit): ~2ms p50

πŸ’‘ For 1 request: Real-time is best. Use synchronous REST API with <50ms SLA.

🧠 API Design Decision

Your fraud detection model must return a decision within 200ms while the user is completing a checkout. The model uses 50 features. What serving pattern do you use?

A. Synchronous REST API with pre-fetched features from online feature store
B. Async batch job β€” queue the request and return a job ID
C. Pre-compute scores for all users nightly and cache them
D. GraphQL subscription β€” stream predictions as they're ready

πŸ’» Code Lab: Build Your ML Serving API

Write a FastAPI prediction endpoint. Your task: implement the /predict endpoint that (1) validates input, (2) loads features, (3) runs inference, and (4) returns a structured response. Click "Run" to test it.

fraud_api.py (FastAPI Β· Python 3.11, simulated execution)

πŸ’‘ Key API Design Principles for ML

  β€’ Schema Versioning: include model_version in every response. When predictions change, clients know why.
  β€’ Latency SLAs: set p99 latency budgets. Fraud: <200ms. Recommendations: <100ms. Log latency always.
  β€’ Fallback Logic: if the model fails, fall back to a rule-based system. Never return HTTP 500 to a payment flow.
  β€’ Input Validation: validate all inputs before inference. Bad inputs cause silent model degradation, which is worse than loud errors.
$18M / year
Stripe's estimated savings from <100ms fraud API latency β€” faster decisions catch more fraud without false positives that block legitimate purchases

🎯 Mission Complete: Your ML Architecture Playbook

πŸ“Š Your Progress

Complete the quizzes above to see your score!

0/4 quizzes

πŸ—ΊοΈ The $50M Series B Architecture Decision

Based on everything you've learned, here's the playbook for your company's ML platform:

Month | Decision | Why | Cost
1–2 | Monolithic MVP | Ship fast, learn what matters | $2K/mo infra
3–4 | Feature Store (Feast) | Second model proves reuse value | 1 eng-month setup
4–5 | Model Registry (MLflow) | Multiple models β†’ governance needed | Open source, free
5–6 | REST APIs (FastAPI) | Product integrations go live | Minimal overhead
6+ | Begin microservices migration | Now you know what to split | 3–6 months effort

πŸ”‘ Key Takeaways

  β€’ Start monolithic on purpose: a small team with a tight deadline ships faster in one codebase, and you can split services once you know what actually needs splitting.
  β€’ Add a feature store when the second team wants the same features; reuse, not novelty, is where the savings come from.
  β€’ Version every model with metadata, metrics, artifacts, and lineage. If you can't say which version is in production, you can't roll back.
  β€’ Design APIs with latency SLAs, input validation, fallback logic, and a model_version in every response.

ML Frameworks & Applied Analytics Β· Chenhao Zhou Β· Rutgers Business School
Framework 1 of 6 Β· Teaching Portfolio