# ML System Design
This skill provides frameworks for designing production machine learning systems, from data pipelines to model serving.
## When to Use This Skill

**Keywords**: ML pipeline, machine learning system, feature store, model training, model serving, ML infrastructure, MLOps, A/B testing ML, feature engineering, model deployment

Use this skill when:

- Designing end-to-end ML systems for production
- Planning feature store architecture
- Designing model training pipelines
- Planning model serving infrastructure
- Preparing for ML system design interviews
- Evaluating ML platform tools and frameworks
## ML System Architecture Overview

### The ML System Lifecycle

```
┌──────────────────────────────────────────────────────────────────────────┐
│                            ML SYSTEM LIFECYCLE                           │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐  │
│  │   Data   │──▶│ Feature  │──▶│  Model   │──▶│  Model   │──▶│ Monitor│  │
│  │ Ingestion│   │ Pipeline │   │ Training │   │ Serving  │   │ & Eval │  │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └────────┘  │
│       │              │              │              │             │       │
│       ▼              ▼              ▼              ▼             ▼       │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐  │
│  │   Data   │   │ Feature  │   │  Model   │   │ Inference│   │ Metrics│  │
│  │   Lake   │   │  Store   │   │ Registry │   │  Cache   │   │ Store  │  │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └────────┘  │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
```
### Key Components

| Component        | Purpose                        | Examples                         |
|------------------|--------------------------------|----------------------------------|
| Data Ingestion   | Collect raw data from sources  | Kafka, Kinesis, Pub/Sub          |
| Feature Pipeline | Transform raw data to features | Spark, Flink, dbt                |
| Feature Store    | Store and serve features       | Feast, Tecton, Vertex AI         |
| Model Training   | Train and validate models      | SageMaker, Vertex AI, Kubeflow   |
| Model Registry   | Version and track models       | MLflow, Weights & Biases         |
| Model Serving    | Serve predictions              | TensorFlow Serving, Triton, vLLM |
| Monitoring       | Track model performance        | Evidently, WhyLabs, Arize        |
## Feature Store Architecture

### Why Feature Stores?

Problems without a feature store:

- Training-serving skew (features computed differently in training and serving)
- Duplicate feature computation across teams
- No feature versioning or lineage
- Slow feature experimentation
### Feature Store Components

```
┌──────────────────────────────────────────────────────────────────┐
│                          FEATURE STORE                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────────────┐          ┌─────────────────────┐       │
│   │    OFFLINE STORE    │          │    ONLINE STORE     │       │
│   │                     │          │                     │       │
│   │ - Historical data   │          │ - Low-latency       │       │
│   │ - Training queries  │  ────▶   │ - Point lookups     │       │
│   │ - Batch features    │   sync   │ - Real-time serving │       │
│   │                     │          │                     │       │
│   │  (Data Warehouse)   │          │  (Redis, DynamoDB)  │       │
│   └─────────────────────┘          └─────────────────────┘       │
│                                                                  │
│   ┌──────────────────────────────────────────────────────────┐   │
│   │                     FEATURE REGISTRY                     │   │
│   │   - Feature definitions        - Version control         │   │
│   │   - Data lineage               - Access control          │   │
│   └──────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘
```
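To make the offline/online split concrete, here is a minimal in-memory sketch. All names are illustrative: a real system backs the offline store with a data warehouse and the online store with Redis or DynamoDB.

```python
# Minimal sketch of a feature store's offline/online split (illustrative only).
import pandas as pd

class MiniFeatureStore:
    def __init__(self) -> None:
        self._batches: list[pd.DataFrame] = []           # offline store (warehouse stand-in)
        self.online: dict[tuple[str, str], float] = {}   # online store (key-value stand-in)

    def write_offline(self, rows: pd.DataFrame) -> None:
        # Append historical feature values; training queries scan these.
        self._batches.append(rows)

    def offline_table(self) -> pd.DataFrame:
        return pd.concat(self._batches, ignore_index=True)

    def sync_to_online(self) -> None:
        # Materialize the latest value per (entity, feature) for serving.
        latest = self.offline_table().sort_values("ts").groupby(
            ["entity_id", "feature"]).tail(1)
        for row in latest.itertuples(index=False):
            self.online[(row.entity_id, row.feature)] = row.value

    def get_online(self, entity_id: str, feature: str) -> float | None:
        # Point lookup, as a key-value store would serve it.
        return self.online.get((entity_id, feature))

store = MiniFeatureStore()
store.write_offline(pd.DataFrame({
    "entity_id": ["u1", "u1"], "ts": [1, 2],
    "feature": ["purchases_30d", "purchases_30d"], "value": [3.0, 7.0],
}))
store.sync_to_online()
print(store.get_online("u1", "purchases_30d"))  # 7.0 -- the latest value wins
```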
### Feature Types

| Type      | Computation                | Storage          | Example                       |
|-----------|----------------------------|------------------|-------------------------------|
| Batch     | Scheduled (hourly/daily)   | Offline → Online | User purchase count (30 days) |
| Streaming | Real-time event processing | Direct to online | Items in cart (current)       |
| On-demand | Request-time computation   | Not stored       | Distance to nearest store     |
### Training-Serving Consistency

```
TRAINING (Historical):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Historical  │───▶│ Point-in-Time│───▶│   Training   │
│    Events    │    │     Join     │    │   Dataset    │
└──────────────┘    └──────────────┘    └──────────────┘
                           │
                           │  Uses feature definitions
                           ▼
SERVING (Real-time):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    Online    │───▶│ Same Feature │───▶│  Prediction  │
│    Store     │    │ Definitions  │    │   Request    │
└──────────────┘    └──────────────┘    └──────────────┘
```
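The point-in-time join is what keeps training honest: each label row may only see feature values computed at or before its own timestamp. A minimal pandas sketch (column names are illustrative):

```python
# Point-in-time join sketch: join each label to the latest feature value
# computed *before* the label's event timestamp, preventing label leakage.
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-01-10"]),
    "label": [0, 1, 0],
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2025-01-01", "2025-01-15", "2025-01-08"]),
    "purchases_30d": [3, 7, 1],
})

# merge_asof picks, per label row, the most recent feature row at or before
# event_ts for the same user -- never a future value.
train = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts", right_on="feature_ts", by="user_id",
)
print(train[["user_id", "event_ts", "purchases_30d", "label"]])
```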
## Model Training Infrastructure

### Training Pipeline Components

```
┌─────────────────────────────────────────────────────────────────────────┐
│                            TRAINING PIPELINE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │     Data     │─▶│   Feature    │─▶│    Model     │─▶│    Model     │ │
│  │    Loader    │  │  Transform   │  │    Train     │  │   Validate   │ │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘ │
│         │                 │                 │                 │         │
│         ▼                 ▼                 ▼                 ▼         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │  Experiment  │  │Hyperparameter│  │  Checkpoint  │  │    Model     │ │
│  │   Tracking   │  │    Tuning    │  │   Storage    │  │   Registry   │ │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘ │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
### Training Infrastructure Patterns

| Pattern                   | Use Case                                   | Tools                     |
|---------------------------|--------------------------------------------|---------------------------|
| Single-node               | Small datasets, quick experiments          | Jupyter, local GPU        |
| Distributed data-parallel | Large datasets, same model on every worker | Horovod, PyTorch DDP      |
| Model-parallel            | Models too large for one device's memory   | DeepSpeed, FSDP, Megatron |
| Hyperparameter tuning     | Automated model optimization               | Optuna, Ray Tune          |
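For the distributed data-parallel row, here is a minimal PyTorch DDP sketch, assuming it is launched with `torchrun --nproc_per_node=8 train.py`; the model and dataset are placeholders.

```python
# Data-parallel training sketch: each rank holds a full model replica and a
# shard of the data; DDP all-reduces gradients during backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main() -> None:
    dist.init_process_group("nccl")          # torchrun supplies rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 1).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    dataset = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 1))
    sampler = DistributedSampler(dataset)    # shards the dataset across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)             # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()                  # gradients all-reduced here by DDP
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```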
### Experiment Tracking

Track for reproducibility:

| What to Track   | Why                        |
|-----------------|----------------------------|
| Hyperparameters | Reproduce training runs    |
| Metrics         | Compare model performance  |
| Artifacts       | Model files, datasets      |
| Code version    | Git commit hash            |
| Environment     | Docker image, dependencies |
| Data version    | Dataset hash or snapshot   |
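A minimal logging sketch using MLflow's core API, covering most of the rows above. The metric values and the `model.pt` artifact path are placeholders, and the git call assumes the run happens inside a checkout.

```python
# Experiment-tracking sketch with MLflow: params, tags, metrics, artifacts.
import subprocess
import mlflow

with mlflow.start_run(run_name="baseline"):
    # Hyperparameters for reproducing the run.
    mlflow.log_params({"lr": 1e-3, "batch_size": 32, "model": "transformer-100m"})

    # Code version and environment as tags (assumes a git checkout).
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    mlflow.set_tags({"git_commit": commit, "docker_image": "train:2025-12-26"})

    # Metrics logged per step for comparison across runs.
    for step in range(100):
        val_auc = 0.70 + step * 0.002        # placeholder metric value
        mlflow.log_metric("val_auc", val_auc, step=step)

    mlflow.log_artifact("model.pt")          # placeholder; assumes file exists
```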
## Model Serving Architecture

### Serving Patterns

| Pattern            | Latency      | Throughput | Use Case                 |
|--------------------|--------------|------------|--------------------------|
| Online (REST/gRPC) | Low (<100ms) | Medium     | Real-time predictions    |
| Batch              | High (hours) | Very high  | Bulk scoring             |
| Streaming          | Medium       | High       | Event-driven predictions |
| Embedded           | Very low     | Varies     | Edge/mobile inference    |
### Online Serving Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                        MODEL SERVING SYSTEM                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│                          ┌──────────────┐                           │
│                          │   Clients    │                           │
│                          └──────┬───────┘                           │
│                                 ▼                                   │
│                          ┌──────────────┐                           │
│                          │Load Balancer │                           │
│                          └──────┬───────┘                           │
│                                 ▼                                   │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                          API Gateway                          │ │
│  │  - Authentication    - Rate limiting    - Request validation  │ │
│  └──────────────────────────────┬────────────────────────────────┘ │
│                                 │                                   │
│          ┌──────────────────────┼──────────────────────┐           │
│          ▼                      ▼                      ▼           │
│    ┌────────────┐        ┌────────────┐        ┌────────────┐      │
│    │  Model A   │        │  Model B   │        │  Model C   │      │
│    │   (v1.2)   │        │   (v2.0)   │        │   (v1.0)   │      │
│    └────────────┘        └────────────┘        └────────────┘      │
│          │                      │                      │           │
│          └──────────────────────┼──────────────────────┘           │
│                                 ▼                                   │
│                        ┌────────────────┐                          │
│                        │ Feature Store  │                          │
│                        │    (Online)    │                          │
│                        └────────────────┘                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
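A minimal sketch of the serving path in FastAPI, with an in-memory dict standing in for the online feature store and a toy linear scorer standing in for a model loaded from the registry. All names are illustrative.

```python
# Online serving sketch: fetch precomputed features with a point lookup, score.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Stand-in for the online feature store (Redis/DynamoDB in production).
ONLINE_FEATURES = {"user_42": {"purchases_30d": 7.0, "cart_items": 2.0}}

class PredictRequest(BaseModel):
    user_id: str

class PredictResponse(BaseModel):
    score: float
    model_version: str

def model_predict(features: dict) -> float:
    # Placeholder model; swap in one loaded from the model registry.
    return 0.1 * features["purchases_30d"] + 0.05 * features["cart_items"]

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Point lookup against the online store, then a single forward pass.
    features = ONLINE_FEATURES.get(
        req.user_id, {"purchases_30d": 0.0, "cart_items": 0.0})
    return PredictResponse(score=model_predict(features), model_version="v1.2")
```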
### Latency Optimization

| Technique     | Latency Impact             | Trade-off                                |
|---------------|----------------------------|------------------------------------------|
| Batching      | Amortizes per-request cost | Increases latency for the first request  |
| Caching       | 10-100x faster             | May serve stale predictions              |
| Quantization  | 2-4x faster                | Slight accuracy loss                     |
| Distillation  | Variable                   | Training overhead                        |
| GPU inference | 10-100x faster             | Cost increase                            |
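A minimal sketch of server-side dynamic batching, the technique behind the first row: requests queue up and are flushed either when the batch is full or when a small timeout expires, trading first-request latency for throughput. The `batch_infer` function is a placeholder for one batched forward pass.

```python
# Dynamic batching sketch with asyncio: collect up to MAX_BATCH requests or
# wait at most MAX_WAIT_MS, then run one batched inference call.
import asyncio

MAX_BATCH = 8
MAX_WAIT_MS = 5

def batch_infer(inputs: list[list[float]]) -> list[float]:
    # Placeholder model call: one forward pass over the whole batch.
    return [sum(x) for x in inputs]

async def batcher(queue: asyncio.Queue) -> None:
    while True:
        batch = [await queue.get()]                  # wait for first request
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH and (t := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), t))
            except asyncio.TimeoutError:
                break                                # timeout: flush partial batch
        for (_, fut), out in zip(batch, batch_infer([x for x, _ in batch])):
            fut.set_result(out)                      # resolve each caller

async def predict(queue: asyncio.Queue, x: list[float]) -> float:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(batcher(queue))
    print(await asyncio.gather(*(predict(queue, [i, i + 1]) for i in range(20))))

asyncio.run(main())
```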
## A/B Testing ML Models

### Experiment Design

```
┌───────────────────────────────────────────────────────────────────────────┐
│                          A/B TESTING ARCHITECTURE                         │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│                 ┌──────────────┐                                          │
│                 │   Traffic    │                                          │
│                 └──────┬───────┘                                          │
│                        ▼                                                  │
│            ┌───────────────────────┐                                      │
│            │ Experiment Assignment │ ◀── Experiment Config                │
│            │ - User bucketing      │     - Allocation %                   │
│            │ - Feature flags       │     - Target segments                │
│            └───────────┬───────────┘     - Guardrails                     │
│                        │                                                  │
│               ┌────────┴────────┐                                         │
│               ▼                 ▼                                         │
│          ┌─────────┐       ┌─────────┐                                    │
│          │ Control │       │Treatment│                                    │
│          │ Model A │       │ Model B │                                    │
│          └────┬────┘       └────┬────┘                                    │
│               └────────┬────────┘                                         │
│                        ▼                                                  │
│                ┌────────────────┐                                         │
│                │ Metrics Logger │                                         │
│                └───────┬────────┘                                         │
│                        ▼                                                  │
│                ┌────────────────┐                                         │
│                │  Statistical   │ ──▶ Decision: Ship / Iterate / Kill     │
│                │   Analysis     │                                         │
│                └────────────────┘                                         │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
```
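The assignment step is usually a deterministic hash, so a user lands in the same arm on every request without any lookup table. A minimal sketch (experiment and user IDs are illustrative):

```python
# Deterministic experiment assignment: hashing (experiment, user) gives a
# stable bucket in [0, 100); salting with the experiment ID keeps different
# experiments uncorrelated.
import hashlib

def bucket(experiment_id: str, user_id: str) -> int:
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def assign(experiment_id: str, user_id: str, treatment_pct: int = 50) -> str:
    return "treatment" if bucket(experiment_id, user_id) < treatment_pct else "control"

print(assign("ranker-v2", "user_42"))   # stable across calls and sessions
```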
### Metrics to Track

| Metric Type       | Examples                        | Purpose                      |
|-------------------|---------------------------------|------------------------------|
| Model metrics     | AUC, RMSE, precision/recall     | Model quality                |
| Business metrics  | CTR, conversion, revenue        | Business impact              |
| Guardrail metrics | Latency, error rate, engagement | Prevent regressions          |
| Segment metrics   | Metrics by user segment         | Detect heterogeneous effects |
### Statistical Considerations

- Sample size: calculate statistical power before the experiment (see the sketch after this list)
- Duration: account for novelty effects and time-of-day/day-of-week patterns
- Multiple testing: adjust for multiple metrics (Bonferroni, FDR)
- Early stopping: use sequential testing methods rather than ad-hoc peeking
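A minimal sample-size sketch for a two-proportion test (e.g. CTR), using the standard normal-approximation formula; the alpha and power values are the conventional defaults, not prescriptions.

```python
# Per-arm sample size for detecting a relative lift on a baseline proportion:
# n = (z_{1-a/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
from scipy.stats import norm

def sample_size_per_arm(p_base: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    p_new = p_base * (1 + mde_rel)
    z_a = norm.ppf(1 - alpha / 2)            # two-sided significance
    z_b = norm.ppf(power)                    # desired power
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_a + z_b) ** 2 * var / (p_new - p_base) ** 2
    return int(n) + 1

# Detecting a 2% relative lift on a 5% baseline CTR:
print(sample_size_per_arm(0.05, 0.02))       # roughly 750K users per arm
```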
## Model Monitoring

### What to Monitor

| Category          | Metrics                               | Alert Threshold     |
|-------------------|---------------------------------------|---------------------|
| Data quality      | Missing values, schema drift          | >1% change          |
| Feature drift     | Distribution shift (PSI, KL)          | PSI > 0.2           |
| Prediction drift  | Output distribution shift             | Depends on use case |
| Model performance | Accuracy, AUC (when labels available) | >5% degradation     |
| Operational       | Latency, throughput, errors           | SLO violations      |
### Drift Detection

```
┌─────────────────────────────────────────────────────────────────────┐
│                       DRIFT DETECTION PIPELINE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│     Training Data                     Production Data               │
│    ┌──────────────┐                   ┌──────────────┐              │
│    │  Reference   │                   │   Current    │              │
│    │ Distribution │                   │ Distribution │              │
│    └──────┬───────┘                   └──────┬───────┘              │
│           │                                  │                      │
│           └─────────────────┬────────────────┘                      │
│                             ▼                                       │
│                   ┌────────────────────┐                            │
│                   │  Statistical Test  │                            │
│                   │  - PSI (Population │                            │
│                   │    Stability Index)│                            │
│                   │  - KS test         │                            │
│                   │  - Chi-squared     │                            │
│                   └─────────┬──────────┘                            │
│                             ▼                                       │
│                   ┌────────────────────┐                            │
│                   │    Drift Score     │                            │
│                   └─────────┬──────────┘                            │
│                ┌────────────┼────────────┐                          │
│                ▼            ▼            ▼                          │
│            No Drift      Warning      Critical                      │
│            (< 0.1)      (0.1-0.2)     (> 0.2)                       │
│                │            │            │                          │
│                ▼            ▼            ▼                          │
│            Continue    Investigate    Retrain                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
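A minimal PSI implementation in NumPy: bins are fixed from the reference (training) distribution, and the same thresholds as the diagram above are applied to the resulting score.

```python
# PSI = sum over bins of (cur_frac - ref_frac) * ln(cur_frac / ref_frac).
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Bin edges from reference quantiles, so each bin has similar mass.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range values
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    # Small epsilon guards against log(0) for empty bins.
    eps = 1e-6
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 100_000)
prod = rng.normal(0.3, 1.1, 100_000)             # shifted production data
print(f"PSI = {psi(train, prod):.3f}")           # a shift this size typically
                                                 # lands in the warning range
```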
## Common ML System Design Patterns

### Pattern 1: Recommendation System
Components needed:
- Candidate Generation (retrieve 100s-1000s)
- Ranking Model (score and sort)
- Feature Store (user features, item features)
- Real-time personalization (recent behavior)
- A/B testing infrastructure
### Pattern 2: Fraud Detection
Components needed:
- Real-time feature computation
- Low-latency model serving (<50ms)
- High recall focus (can't miss fraud)
- Explainability for compliance
- Human-in-the-loop review
- Feedback loop for labels
### Pattern 3: Search Ranking
Components needed:
- Two-stage ranking (retrieval + ranking; see the sketch after this list)
- Feature store for query/document features
- Low latency (<200ms end-to-end)
- Learning to rank models
- Click-through rate prediction
- A/B testing with interleaving
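To make the two-stage pattern concrete, here is a minimal NumPy sketch: a cheap embedding dot product retrieves candidates from the full catalog, and a heavier model (a placeholder here) scores only those. In production the retrieval stage would use an approximate nearest-neighbor index rather than a brute-force matmul.

```python
# Two-stage ranking sketch: cheap retrieval over everything, expensive
# ranking over a few hundred candidates.
import numpy as np

rng = np.random.default_rng(0)
item_emb = rng.normal(size=(1_000_000, 64)).astype(np.float32)   # catalog
query_emb = rng.normal(size=64).astype(np.float32)

# Stage 1 - retrieval: top-500 by dot product (ANN index in production).
scores = item_emb @ query_emb
candidates = np.argpartition(-scores, 500)[:500]

# Stage 2 - ranking: heavier scoring over candidates only (placeholder model).
def ranking_model(query: np.ndarray, items: np.ndarray) -> np.ndarray:
    return items @ query + 0.1 * np.linalg.norm(items, axis=1)

ranked = candidates[np.argsort(-ranking_model(query_emb, item_emb[candidates]))]
print(ranked[:10])   # final top-10 for the results page
```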
## Estimation for ML Systems

### Training Infrastructure

Training time estimation:

- Dataset size: 100M examples
- Model: Transformer (100M params)
- GPU: A100 (80GB, 312 TFLOPS)
- Batch size: 32
- Training steps: 100M / 32 ≈ 3.1M steps (one epoch)
- Time per step: ~100ms
- Total time: 3.1M steps × 0.1s ≈ 86 hours on a single GPU
- With 8 GPUs (data-parallel): ~11 hours
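The same arithmetic as a small helper. Linear scaling across GPUs is an optimistic assumption; communication overhead lowers it in practice.

```python
# Back-of-envelope training time: steps per epoch * seconds per step,
# divided by GPU count under an assumed linear-scaling model.
def training_hours(examples: int, batch_size: int,
                   step_seconds: float, num_gpus: int = 1) -> float:
    steps = examples / batch_size            # one pass over the data
    return steps * step_seconds / num_gpus / 3600

print(f"{training_hours(100_000_000, 32, 0.1):.0f} h on 1 GPU")      # ~87 h
print(f"{training_hours(100_000_000, 32, 0.1, 8):.0f} h on 8 GPUs")  # ~11 h
```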
### Serving Infrastructure

Inference estimation:

- QPS: 10,000
- Model latency: 20ms
- Batch size: 1 (real-time)
- GPU utilization target: 50% (headroom for the latency constraint)
- Requests per GPU per second: (1 / 0.02s) × 50% = 25
- GPUs needed: 10,000 / 25 = 400 GPUs
- With batching (batch of 8, at roughly 2x step latency): 100 GPUs (4x reduction)
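The serving math as a helper. The batched case assumes the batched forward pass takes about twice as long as the single-request pass, which is what reproduces the 4x (rather than 8x) reduction above.

```python
# Back-of-envelope GPU count: per-GPU throughput = (batch / step latency)
# scaled by a utilization target, then divide the QPS requirement by it.
import math

def gpus_needed(qps: float, step_latency_s: float,
                utilization: float = 0.5, batch_size: int = 1) -> int:
    per_gpu_qps = (batch_size / step_latency_s) * utilization
    return math.ceil(qps / per_gpu_qps)

print(gpus_needed(10_000, 0.020))                  # batch 1: 400 GPUs
# Batch of 8 with ~2x step latency -> ~4x net throughput gain:
print(gpus_needed(10_000, 0.040, batch_size=8))    # 100 GPUs
```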
## Related Skills

- `llm-serving-patterns` - LLM-specific serving and optimization
- `rag-architecture` - Retrieval-Augmented Generation patterns
- `vector-databases` - Vector search and embeddings
- `ml-inference-optimization` - Latency and cost optimization
- `estimation-techniques` - Back-of-envelope calculations
- `quality-attributes-taxonomy` - NFR definitions
## Related Commands

- `/sd:ml-pipeline <problem>` - Design an ML system interactively
- `/sd:estimate <scenario>` - Capacity calculations
## Related Agents

- `ml-systems-designer` - Design ML architectures
- `ml-interviewer` - Mock ML system design interviews
## Version History
- v1.0.0 (2025-12-26): Initial release
## Last Updated

Date: 2025-12-26
Model: claude-opus-4-5-20251101