# ML System Design
This skill provides frameworks for designing production machine learning systems, from data pipelines to model serving.
## When to Use This Skill

**Keywords**: ML pipeline, machine learning system, feature store, model training, model serving, ML infrastructure, MLOps, A/B testing ML, feature engineering, model deployment

Use this skill when:

- Designing end-to-end ML systems for production
- Planning feature store architecture
- Designing model training pipelines
- Planning model serving infrastructure
- Preparing for ML system design interviews
- Evaluating ML platform tools and frameworks
## ML System Architecture Overview

### The ML System Lifecycle

```
┌──────────────────────────────────────────────────────────────────────────┐
│                            ML SYSTEM LIFECYCLE                           │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐  │
│  │   Data   │──▶│ Feature  │──▶│  Model   │──▶│  Model   │──▶│ Monitor│  │
│  │ Ingestion│   │ Pipeline │   │ Training │   │ Serving  │   │ & Eval │  │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └────────┘  │
│       │              │              │              │             │       │
│       ▼              ▼              ▼              ▼             ▼       │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐  │
│  │   Data   │   │ Feature  │   │  Model   │   │ Inference│   │ Metrics│  │
│  │   Lake   │   │  Store   │   │ Registry │   │  Cache   │   │ Store  │  │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └────────┘  │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
```
### Key Components

| Component        | Purpose                        | Examples                         |
|------------------|--------------------------------|----------------------------------|
| Data Ingestion   | Collect raw data from sources  | Kafka, Kinesis, Pub/Sub          |
| Feature Pipeline | Transform raw data to features | Spark, Flink, dbt                |
| Feature Store    | Store and serve features       | Feast, Tecton, Vertex AI         |
| Model Training   | Train and validate models      | SageMaker, Vertex AI, Kubeflow   |
| Model Registry   | Version and track models       | MLflow, Weights & Biases         |
| Model Serving    | Serve predictions              | TensorFlow Serving, Triton, vLLM |
| Monitoring       | Track model performance        | Evidently, WhyLabs, Arize        |
## Feature Store Architecture

### Why Feature Stores?

Problems without a feature store:

- Training-serving skew (features computed differently in training and serving)
- Duplicate feature computation across teams
- No feature versioning or lineage
- Slow feature experimentation
### Feature Store Components

```
┌──────────────────────────────────────────────────────────────────┐
│                          FEATURE STORE                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────────────┐          ┌─────────────────────┐       │
│   │    OFFLINE STORE    │          │    ONLINE STORE     │       │
│   │                     │          │                     │       │
│   │ - Historical data   │          │ - Low-latency       │       │
│   │ - Training queries  │  ────▶   │ - Point lookups     │       │
│   │ - Batch features    │   sync   │ - Real-time serving │       │
│   │                     │          │                     │       │
│   │  (Data Warehouse)   │          │  (Redis, DynamoDB)  │       │
│   └─────────────────────┘          └─────────────────────┘       │
│                                                                  │
│   ┌──────────────────────────────────────────────────────────┐   │
│   │                     FEATURE REGISTRY                     │   │
│   │   - Feature definitions        - Version control         │   │
│   │   - Data lineage               - Access control          │   │
│   └──────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘
```
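To make the offline/online split concrete, here is a minimal in-memory sketch. All names are illustrative: a real system backs the offline store with a data warehouse and the online store with Redis or DynamoDB.

```python
# Minimal sketch of a feature store's offline/online split (illustrative only).
import pandas as pd

class MiniFeatureStore:
    def __init__(self) -> None:
        self._batches: list[pd.DataFrame] = []           # offline store (warehouse stand-in)
        self.online: dict[tuple[str, str], float] = {}   # online store (key-value stand-in)

    def write_offline(self, rows: pd.DataFrame) -> None:
        # Append historical feature values; training queries scan these.
        self._batches.append(rows)

    def offline_table(self) -> pd.DataFrame:
        return pd.concat(self._batches, ignore_index=True)

    def sync_to_online(self) -> None:
        # Materialize the latest value per (entity, feature) for serving.
        latest = self.offline_table().sort_values("ts").groupby(
            ["entity_id", "feature"]).tail(1)
        for row in latest.itertuples(index=False):
            self.online[(row.entity_id, row.feature)] = row.value

    def get_online(self, entity_id: str, feature: str) -> float | None:
        # Point lookup, as a key-value store would serve it.
        return self.online.get((entity_id, feature))

store = MiniFeatureStore()
store.write_offline(pd.DataFrame({
    "entity_id": ["u1", "u1"], "ts": [1, 2],
    "feature": ["purchases_30d", "purchases_30d"], "value": [3.0, 7.0],
}))
store.sync_to_online()
print(store.get_online("u1", "purchases_30d"))  # 7.0 -- the latest value wins
```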
### Feature Types

| Type      | Computation                | Storage          | Example                       |
|-----------|----------------------------|------------------|-------------------------------|
| Batch     | Scheduled (hourly/daily)   | Offline → Online | User purchase count (30 days) |
| Streaming | Real-time event processing | Direct to online | Items in cart (current)       |
| On-demand | Request-time computation   | Not stored       | Distance to nearest store     |
### Training-Serving Consistency

```
TRAINING (Historical):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Historical  │───▶│ Point-in-Time│───▶│   Training   │
│    Events    │    │     Join     │    │   Dataset    │
└──────────────┘    └──────────────┘    └──────────────┘
                           │
                           │  Uses feature definitions
                           ▼
SERVING (Real-time):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    Online    │───▶│ Same Feature │───▶│  Prediction  │
│    Store     │    │ Definitions  │    │   Request    │
└──────────────┘    └──────────────┘    └──────────────┘
```
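The point-in-time join is what keeps training honest: each label row may only see feature values computed at or before its own timestamp. A minimal pandas sketch (column names are illustrative):

```python
# Point-in-time join sketch: join each label to the latest feature value
# computed *before* the label's event timestamp, preventing label leakage.
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-01-10"]),
    "label": [0, 1, 0],
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2025-01-01", "2025-01-15", "2025-01-08"]),
    "purchases_30d": [3, 7, 1],
})

# merge_asof picks, per label row, the most recent feature row at or before
# event_ts for the same user -- never a future value.
train = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts", right_on="feature_ts", by="user_id",
)
print(train[["user_id", "event_ts", "purchases_30d", "label"]])
```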
## Model Training Infrastructure

### Training Pipeline Components

```
┌─────────────────────────────────────────────────────────────────────────┐
│                            TRAINING PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │     Data     │─▶│   Feature    │─▶│    Model     │─▶│    Model     │ │
│  │    Loader    │  │  Transform   │  │    Train     │  │   Validate   │ │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘ │
│         │                 │                 │                 │         │
│         ▼                 ▼                 ▼                 ▼         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │  Experiment  │  │Hyperparameter│  │  Checkpoint  │  │    Model     │ │
│  │   Tracking   │  │    Tuning    │  │   Storage    │  │   Registry   │ │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘ │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
### Training Infrastructure Patterns

| Pattern                   | Use Case                                   | Tools                     |
|---------------------------|--------------------------------------------|---------------------------|
| Single-node               | Small datasets, quick experiments          | Jupyter, local GPU        |
| Distributed data-parallel | Large datasets, same model on every worker | Horovod, PyTorch DDP      |
| Model-parallel            | Models too large for one device's memory   | DeepSpeed, FSDP, Megatron |
| Hyperparameter tuning     | Automated model optimization               | Optuna, Ray Tune          |
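For the distributed data-parallel row, here is a minimal PyTorch DDP sketch, assuming it is launched with `torchrun --nproc_per_node=8 train.py`; the model and dataset are placeholders.

```python
# Data-parallel training sketch: each rank holds a full model replica and a
# shard of the data; DDP all-reduces gradients during backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main() -> None:
    dist.init_process_group("nccl")          # torchrun supplies rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 1).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    dataset = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 1))
    sampler = DistributedSampler(dataset)    # shards the dataset across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)             # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()                  # gradients all-reduced here by DDP
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```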
### Experiment Tracking

Track for reproducibility:

| What to Track   | Why                        |
|-----------------|----------------------------|
| Hyperparameters | Reproduce training runs    |
| Metrics         | Compare model performance  |
| Artifacts       | Model files, datasets      |
| Code version    | Git commit hash            |
| Environment     | Docker image, dependencies |
| Data version    | Dataset hash or snapshot   |
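A minimal logging sketch using MLflow's core API, covering most of the rows above. The metric values and the `model.pt` artifact path are placeholders, and the git call assumes the run happens inside a checkout.

```python
# Experiment-tracking sketch with MLflow: params, tags, metrics, artifacts.
import subprocess
import mlflow

with mlflow.start_run(run_name="baseline"):
    # Hyperparameters for reproducing the run.
    mlflow.log_params({"lr": 1e-3, "batch_size": 32, "model": "transformer-100m"})

    # Code version and environment as tags (assumes a git checkout).
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    mlflow.set_tags({"git_commit": commit, "docker_image": "train:2025-12-26"})

    # Metrics logged per step for comparison across runs.
    for step in range(100):
        val_auc = 0.70 + step * 0.002        # placeholder metric value
        mlflow.log_metric("val_auc", val_auc, step=step)

    mlflow.log_artifact("model.pt")          # placeholder; assumes file exists
```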
## Model Serving Architecture

### Serving Patterns

| Pattern            | Latency      | Throughput | Use Case                 |
|--------------------|--------------|------------|--------------------------|
| Online (REST/gRPC) | Low (<100ms) | Medium     | Real-time predictions    |
| Batch              | High (hours) | Very high  | Bulk scoring             |
| Streaming          | Medium       | High       | Event-driven predictions |
| Embedded           | Very low     | Varies     | Edge/mobile inference    |
### Online Serving Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                        MODEL SERVING SYSTEM                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│                          ┌──────────────┐                           │
│                          │   Clients    │                           │
│                          └──────┬───────┘                           │
│                                 ▼                                   │
│                          ┌──────────────┐                           │
│                          │Load Balancer │                           │
│                          └──────┬───────┘                           │
│                                 ▼                                   │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                          API Gateway                          │ │
│  │  - Authentication    - Rate limiting    - Request validation  │ │
│  └──────────────────────────────┬────────────────────────────────┘ │
│                                 │                                   │
│          ┌──────────────────────┼──────────────────────┐           │
│          ▼                      ▼                      ▼           │
│    ┌────────────┐        ┌────────────┐        ┌────────────┐      │
│    │  Model A   │        │  Model B   │        │  Model C   │      │
│    │   (v1.2)   │        │   (v2.0)   │        │   (v1.0)   │      │
│    └────────────┘        └────────────┘        └────────────┘      │
│          │                      │                      │           │
│          └──────────────────────┼──────────────────────┘           │
│                                 ▼                                   │
│                        ┌────────────────┐                          │
│                        │ Feature Store  │                          │
│                        │    (Online)    │                          │
│                        └────────────────┘                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
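A minimal sketch of the serving path in FastAPI, with an in-memory dict standing in for the online feature store and a toy linear scorer standing in for a model loaded from the registry. All names are illustrative.

```python
# Online serving sketch: fetch precomputed features with a point lookup, score.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Stand-in for the online feature store (Redis/DynamoDB in production).
ONLINE_FEATURES = {"user_42": {"purchases_30d": 7.0, "cart_items": 2.0}}

class PredictRequest(BaseModel):
    user_id: str

class PredictResponse(BaseModel):
    score: float
    model_version: str

def model_predict(features: dict) -> float:
    # Placeholder model; swap in one loaded from the model registry.
    return 0.1 * features["purchases_30d"] + 0.05 * features["cart_items"]

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Point lookup against the online store, then a single forward pass.
    features = ONLINE_FEATURES.get(
        req.user_id, {"purchases_30d": 0.0, "cart_items": 0.0})
    return PredictResponse(score=model_predict(features), model_version="v1.2")
```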
### Latency Optimization

| Technique     | Latency Impact             | Trade-off                                |
|---------------|----------------------------|------------------------------------------|
| Batching      | Amortizes per-request cost | Increases latency for the first request  |
| Caching       | 10-100x faster             | May serve stale predictions              |
| Quantization  | 2-4x faster                | Slight accuracy loss                     |
| Distillation  | Variable                   | Training overhead                        |
| GPU inference | 10-100x faster             | Cost increase                            |
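A minimal sketch of server-side dynamic batching, the technique behind the first row: requests queue up and are flushed either when the batch is full or when a small timeout expires, trading first-request latency for throughput. The `batch_infer` function is a placeholder for one batched forward pass.

```python
# Dynamic batching sketch with asyncio: collect up to MAX_BATCH requests or
# wait at most MAX_WAIT_MS, then run one batched inference call.
import asyncio

MAX_BATCH = 8
MAX_WAIT_MS = 5

def batch_infer(inputs: list[list[float]]) -> list[float]:
    # Placeholder model call: one forward pass over the whole batch.
    return [sum(x) for x in inputs]

async def batcher(queue: asyncio.Queue) -> None:
    while True:
        batch = [await queue.get()]                  # wait for first request
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH and (t := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), t))
            except asyncio.TimeoutError:
                break                                # timeout: flush partial batch
        for (_, fut), out in zip(batch, batch_infer([x for x, _ in batch])):
            fut.set_result(out)                      # resolve each caller

async def predict(queue: asyncio.Queue, x: list[float]) -> float:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    print(await asyncio.gather(*(predict(queue, [i, i + 1]) for i in range(20))))

asyncio.run(main())
```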
## A/B Testing ML Models

### Experiment Design

```
┌───────────────────────────────────────────────────────────────────────────┐
│                          A/B TESTING ARCHITECTURE                         │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│                 ┌──────────────┐                                          │
│                 │   Traffic    │                                          │
│                 └──────┬───────┘                                          │
│                        ▼                                                  │
│            ┌───────────────────────┐                                      │
│            │ Experiment Assignment │ ◀── Experiment Config                │
│            │ - User bucketing      │     - Allocation %                   │
│            │ - Feature flags       │     - Target segments                │
│            └───────────┬───────────┘     - Guardrails                     │
│                        │                                                  │
│               ┌────────┴────────┐                                         │
│               ▼                 ▼                                         │
│          ┌─────────┐       ┌─────────┐                                    │
│          │ Control │       │Treatment│                                    │
│          │ Model A │       │ Model B │                                    │
│          └────┬────┘       └────┬────┘                                    │
│               └────────┬────────┘                                         │
│                        ▼                                                  │
│                ┌────────────────┐                                         │
│                │ Metrics Logger │                                         │
│                └───────┬────────┘                                         │
│                        ▼                                                  │
│                ┌────────────────┐                                         │
│                │  Statistical   │ ──▶ Decision: Ship / Iterate / Kill     │
│                │   Analysis     │                                         │
│                └────────────────┘                                         │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
```
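The assignment step is usually a deterministic hash, so a user lands in the same arm on every request without any lookup table. A minimal sketch (experiment and user IDs are illustrative):

```python
# Deterministic experiment assignment: hashing (experiment, user) gives a
# stable bucket in [0, 100); salting with the experiment ID keeps different
# experiments uncorrelated.
import hashlib

def bucket(experiment_id: str, user_id: str) -> int:
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def assign(experiment_id: str, user_id: str, treatment_pct: int = 50) -> str:
    return "treatment" if bucket(experiment_id, user_id) < treatment_pct else "control"

print(assign("ranker-v2", "user_42"))   # stable across calls and sessions
```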
### Metrics to Track

| Metric Type       | Examples                        | Purpose                      |
|-------------------|---------------------------------|------------------------------|
| Model metrics     | AUC, RMSE, precision/recall     | Model quality                |
| Business metrics  | CTR, conversion, revenue        | Business impact              |
| Guardrail metrics | Latency, error rate, engagement | Prevent regressions          |
| Segment metrics   | Metrics by user segment         | Detect heterogeneous effects |
### Statistical Considerations

- Sample size: calculate statistical power before the experiment (see the sketch after this list)
- Duration: account for novelty effects and time-of-day/day-of-week patterns
- Multiple testing: adjust for multiple metrics (Bonferroni, FDR)
- Early stopping: use sequential testing methods rather than ad-hoc peeking
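A minimal sample-size sketch for a two-proportion test (e.g. CTR), using the standard normal-approximation formula; the alpha and power values are the conventional defaults, not prescriptions.

```python
# Per-arm sample size for detecting a relative lift on a baseline proportion:
# n = (z_{1-a/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
from scipy.stats import norm

def sample_size_per_arm(p_base: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    p_new = p_base * (1 + mde_rel)
    z_a = norm.ppf(1 - alpha / 2)            # two-sided significance
    z_b = norm.ppf(power)                    # desired power
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_a + z_b) ** 2 * var / (p_new - p_base) ** 2
    return int(n) + 1

# Detecting a 2% relative lift on a 5% baseline CTR:
print(sample_size_per_arm(0.05, 0.02))       # roughly 750K users per arm
```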
## Model Monitoring

### What to Monitor

| Category          | Metrics                               | Alert Threshold     |
|-------------------|---------------------------------------|---------------------|
| Data quality      | Missing values, schema drift          | >1% change          |
| Feature drift     | Distribution shift (PSI, KL)          | PSI > 0.2           |
| Prediction drift  | Output distribution shift             | Depends on use case |
| Model performance | Accuracy, AUC (when labels available) | >5% degradation     |
| Operational       | Latency, throughput, errors           | SLO violations      |
### Drift Detection

```
┌─────────────────────────────────────────────────────────────────────┐
│                       DRIFT DETECTION PIPELINE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│     Training Data                     Production Data               │
│    ┌──────────────┐                   ┌──────────────┐              │
│    │  Reference   │                   │   Current    │              │
│    │ Distribution │                   │ Distribution │              │
│    └──────┬───────┘                   └──────┬───────┘              │
│           │                                  │                      │
│           └─────────────────┬────────────────┘                      │
│                             ▼                                       │
│                   ┌────────────────────┐                            │
│                   │  Statistical Test  │                            │
│                   │  - PSI (Population │                            │
│                   │    Stability Index)│                            │
│                   │  - KS test         │                            │
│                   │  - Chi-squared     │                            │
│                   └─────────┬──────────┘                            │
│                             ▼                                       │
│                   ┌────────────────────┐                            │
│                   │    Drift Score     │                            │
│                   └─────────┬──────────┘                            │
│                ┌────────────┼────────────┐                          │
│                ▼            ▼            ▼                          │
│            No Drift      Warning      Critical                      │
│            (< 0.1)      (0.1-0.2)     (> 0.2)                       │
│                │            │            │                          │
│                ▼            ▼            ▼                          │
│            Continue    Investigate    Retrain                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
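A minimal PSI implementation in NumPy: bins are fixed from the reference (training) distribution, and the same thresholds as the diagram above are applied to the resulting score.

```python
# PSI = sum over bins of (cur_frac - ref_frac) * ln(cur_frac / ref_frac).
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Bin edges from reference quantiles, so each bin has similar mass.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range values
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    # Small epsilon guards against log(0) for empty bins.
    eps = 1e-6
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 100_000)
prod = rng.normal(0.3, 1.1, 100_000)             # shifted production data
print(f"PSI = {psi(train, prod):.3f}")           # a shift this size typically
                                                 # lands in the warning range
```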
## Common ML System Design Patterns

### Pattern 1: Recommendation System
Components needed:
- Candidate Generation (retrieve 100s-1000s)
- Ranking Model (score and sort)
- Feature Store (user features, item features)
- Real-time personalization (recent behavior)
- A/B testing infrastructure
### Pattern 2: Fraud Detection
Components needed:
- Real-time feature computation
- Low-latency model serving (<50ms)
- High recall focus (can't miss fraud)
- Explainability for compliance
- Human-in-the-loop review
- Feedback loop for labels
### Pattern 3: Search Ranking
Components needed:
- Two-stage ranking (retrieval + ranking; see the sketch after this list)
- Feature store for query/document features
- Low latency (<200ms end-to-end)
- Learning to rank models
- Click-through rate prediction
- A/B testing with interleaving
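To make the two-stage pattern concrete, here is a minimal NumPy sketch: a cheap embedding dot product retrieves candidates from the full catalog, and a heavier model (a placeholder here) scores only those. In production the retrieval stage would use an approximate nearest-neighbor index rather than a brute-force matmul.

```python
# Two-stage ranking sketch: cheap retrieval over everything, expensive
# ranking over a few hundred candidates.
import numpy as np

rng = np.random.default_rng(0)
item_emb = rng.normal(size=(1_000_000, 64)).astype(np.float32)   # catalog
query_emb = rng.normal(size=64).astype(np.float32)

# Stage 1 - retrieval: top-500 by dot product (ANN index in production).
scores = item_emb @ query_emb
candidates = np.argpartition(-scores, 500)[:500]

# Stage 2 - ranking: heavier scoring over candidates only (placeholder model).
def ranking_model(query: np.ndarray, items: np.ndarray) -> np.ndarray:
    return items @ query + 0.1 * np.linalg.norm(items, axis=1)

ranked = candidates[np.argsort(-ranking_model(query_emb, item_emb[candidates]))]
print(ranked[:10])   # final top-10 for the results page
```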
## Estimation for ML Systems

### Training Infrastructure

Training time estimation:

- Dataset size: 100M examples
- Model: Transformer (100M params)
- GPU: A100 (80GB, 312 TFLOPS)
- Batch size: 32
- Training steps: 100M / 32 ≈ 3.1M steps (one epoch)
- Time per step: ~100ms
- Total time: 3.1M steps × 0.1s ≈ 86 hours on a single GPU
- With 8 GPUs (data-parallel): ~11 hours
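The same arithmetic as a small helper. Linear scaling across GPUs is an optimistic assumption; communication overhead lowers it in practice.

```python
# Back-of-envelope training time: steps per epoch * seconds per step,
# divided by GPU count under an assumed linear-scaling model.
def training_hours(examples: int, batch_size: int,
                   step_seconds: float, num_gpus: int = 1) -> float:
    steps = examples / batch_size            # one pass over the data
    return steps * step_seconds / num_gpus / 3600

print(f"{training_hours(100_000_000, 32, 0.1):.0f} h on 1 GPU")      # ~87 h
print(f"{training_hours(100_000_000, 32, 0.1, 8):.0f} h on 8 GPUs")  # ~11 h
```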
### Serving Infrastructure

Inference estimation:

- QPS: 10,000
- Model latency: 20ms
- Batch size: 1 (real-time)
- GPU utilization target: 50% (headroom for the latency constraint)
- Requests per GPU per second: (1 / 0.02s) × 50% = 25
- GPUs needed: 10,000 / 25 = 400 GPUs
- With batching (batch of 8, at roughly 2x step latency): 100 GPUs (4x reduction)
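The serving math as a helper. The batched case assumes the batched forward pass takes about twice as long as the single-request pass, which is what reproduces the 4x (rather than 8x) reduction above.

```python
# Back-of-envelope GPU count: per-GPU throughput = (batch / step latency)
# scaled by a utilization target, then divide the QPS requirement by it.
import math

def gpus_needed(qps: float, step_latency_s: float,
                utilization: float = 0.5, batch_size: int = 1) -> int:
    per_gpu_qps = (batch_size / step_latency_s) * utilization
    return math.ceil(qps / per_gpu_qps)

print(gpus_needed(10_000, 0.020))                  # batch 1: 400 GPUs
# Batch of 8 with ~2x step latency -> ~4x net throughput gain:
print(gpus_needed(10_000, 0.040, batch_size=8))    # 100 GPUs
```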
## Related Skills

- `llm-serving-patterns` - LLM-specific serving and optimization
- `rag-architecture` - Retrieval-Augmented Generation patterns
- `vector-databases` - Vector search and embeddings
- `ml-inference-optimization` - Latency and cost optimization
- `estimation-techniques` - Back-of-envelope calculations
- `quality-attributes-taxonomy` - NFR definitions
## Related Commands

- `/sd:ml-pipeline <problem>` - Design an ML system interactively
- `/sd:estimate <scenario>` - Capacity calculations
## Related Agents

- `ml-systems-designer` - Design ML architectures
- `ml-interviewer` - Mock ML system design interviews
## Version History
- v1.0.0 (2025-12-26): Initial release
## Last Updated

Date: 2025-12-26
Model: claude-opus-4-5-20251101