ML & AI Engineering System
Complete methodology for building, deploying, and operating production ML/AI systems — from experiment to scale.
Phase 1: Problem Framing
Before writing any code, define the ML problem precisely.
ML Problem Brief
```yaml
problem_brief:
  business_objective: ""      # What business metric improves?
  success_metric: ""          # Quantified target (e.g., "reduce churn 15%")
  baseline: ""                # Current performance without ML
  ml_task_type: ""            # classification | regression | ranking | generation | clustering | anomaly_detection | recommendation
  prediction_target: ""       # What exactly are we predicting?
  prediction_consumer: ""     # Who/what uses the prediction? (API | dashboard | email | automated action)
  latency_requirement: ""     # real-time (<100ms) | near-real-time (<1s) | batch (minutes-hours)
  data_available: ""          # What data exists today?
  data_gaps: ""               # What's missing?
  ethical_considerations: ""  # Bias risks, fairness requirements, privacy
  kill_criteria:              # When to abandon the ML approach
    - "Baseline heuristic achieves >90% of ML performance"
    - "Data quality too poor after 2 weeks of cleaning"
    - "Model can't beat random by >10% on holdout set"
```
ML vs Rules Decision
| Signal | Use Rules | Use ML |
|---|---|---|
| Logic is explainable in <10 rules | ✅ | ❌ |
| Pattern is too complex for humans | ❌ | ✅ |
| Training data >1,000 labeled examples | — | ✅ |
| Needs to adapt to new patterns | ❌ | ✅ |
| Must be 100% auditable/deterministic | ✅ | ❌ |
| Pattern changes faster than you can update rules | ❌ | ✅ |
Rule of thumb: Start with rules/heuristics. Only add ML when rules fail to capture the pattern.
Phase 2: Data Engineering for ML
Data Quality Assessment
Score each data source (0-5 per dimension):
| Dimension | 0 (Terrible) | 5 (Excellent) |
|---|---|---|
| Completeness | >50% missing | <1% missing |
| Accuracy | Known errors, no validation | Validated against source of truth |
| Consistency | Different formats, duplicates | Standardized, deduplicated |
| Timeliness | Months stale | Real-time or daily refresh |
| Relevance | Weak proxy for target | Direct signal for prediction |
| Volume | <100 samples | >10,000 samples per class |
Minimum score to proceed: 18/30. Below 18 → fix data first, don't build models.
Feature Engineering Patterns
```yaml
feature_types:
  numerical:
    - raw_value            # Use as-is if normally distributed
    - log_transform        # Right-skewed distributions (revenue, counts)
    - standardize          # z-score for algorithms sensitive to scale (SVM, KNN, neural nets)
    - bin_to_categorical   # When relationship is non-linear and data is limited
  categorical:
    - one_hot              # <20 categories; tree-based models handle natively
    - target_encoding      # High-cardinality (>20 categories); use with K-fold to prevent leakage
    - embedding            # Very high cardinality (user IDs, product IDs) with deep learning
  temporal:
    - lag_features         # Value at t-1, t-7, t-30
    - rolling_statistics   # Mean, std, min, max over windows
    - time_since_event     # Days since last purchase, hours since login
    - cyclical_encoding    # sin/cos for hour-of-day, day-of-week, month
  text:
    - tfidf                # Simple, interpretable, good baseline
    - sentence_embeddings  # Semantic similarity, modern NLP
    - llm_extraction       # Use an LLM to extract structured fields from unstructured text
  interaction:
    - ratios               # Feature A / Feature B (e.g., clicks/impressions = CTR)
    - differences          # Feature A - Feature B (e.g., price - competitor_price)
    - polynomial           # A * B, A^2 (use sparingly; feature count grows quickly)
```
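Two of the less obvious patterns above as a minimal sketch: cyclical encoding and leakage-safe K-fold target encoding with pandas and scikit-learn. The `signup_hour` column name is a hypothetical example; the target-encoding helper works on any categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def add_cyclical_hour(df: pd.DataFrame, col: str = "signup_hour") -> pd.DataFrame:
    """Encode an hour-of-day column as sin/cos so 23:00 and 00:00 end up close together."""
    radians = 2 * np.pi * df[col] / 24.0
    df[f"{col}_sin"] = np.sin(radians)
    df[f"{col}_cos"] = np.cos(radians)
    return df

def kfold_target_encode(df: pd.DataFrame, cat_col: str, target_col: str, n_splits: int = 5) -> pd.Series:
    """Target-encode a high-cardinality categorical using out-of-fold means to avoid leakage."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=42).split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[val_idx] = (
            df.iloc[val_idx][cat_col].map(fold_means).fillna(global_mean).to_numpy()
        )
    return encoded
```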
Feature Store Design
```yaml
feature_store:
  offline_store:   # For training — batch computed, stored in data warehouse
    storage: "BigQuery | Snowflake | S3+Parquet"
    compute: "Spark | dbt | SQL"
    refresh: "daily | hourly"
  online_store:    # For serving — low-latency lookups
    storage: "Redis | DynamoDB | Feast online"
    latency_target: "<10ms p99"
    refresh: "streaming | near-real-time"
  registry:        # Feature metadata
    naming: "{entity}_{feature_name}_{window}_{aggregation}"  # e.g., user_purchase_count_30d_sum
    documentation:
      - description
      - data_type
      - source_table
      - owner
      - created_date
      - known_issues
```
Data Leakage Prevention Checklist
Phase 3: Experiment Management
Experiment Tracking Template
```yaml
experiment:
  id: "EXP-{YYYY-MM-DD}-{NNN}"
  hypothesis: ""             # "Adding user tenure features will improve churn prediction AUC by >2%"
  dataset_version: ""        # Hash or version of training data
  features_used: []          # List of feature names
  model_type: ""             # Algorithm name
  hyperparameters: {}        # All hyperparams logged
  training_time: ""          # Wall clock
  metrics:
    primary: {}              # The one metric that matters
    secondary: {}            # Supporting metrics
    baseline_comparison: ""  # Delta vs baseline
  verdict: "promoted | archived | iterate"
  notes: ""
  artifacts:
    - model_path: ""
    - notebook_path: ""
    - confusion_matrix: ""
```
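A minimal sketch of logging one run against this template with MLflow's tracking API; the experiment name, run name, and the sklearn model flavor are illustrative assumptions.

```python
import mlflow
import mlflow.sklearn

def log_experiment(model, params: dict, metrics: dict, dataset_version: str, artifacts: dict):
    """Log one run following the experiment-tracking template above."""
    mlflow.set_experiment("churn-prediction")           # hypothetical experiment name
    with mlflow.start_run(run_name="EXP-2024-01-15-001"):
        mlflow.set_tag("dataset_version", dataset_version)
        mlflow.log_params(params)                       # all hyperparameters
        mlflow.log_metrics(metrics)                     # primary + secondary metrics
        mlflow.sklearn.log_model(model, "model")        # assumes an sklearn-compatible model
        for name, path in artifacts.items():            # e.g., confusion matrix plot, notebook
            mlflow.log_artifact(path, artifact_path=name)
```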
Model Selection Guide
| Task | Start With | Scale To | Avoid |
|---|---|---|---|
| Tabular classification | XGBoost/LightGBM | Neural nets only if >100K samples | Deep learning on <10K samples |
| Tabular regression | XGBoost/LightGBM | CatBoost for high-cardinality cats | Linear regression without feature engineering |
| Image classification | Fine-tune ResNet/EfficientNet | Vision Transformer if >100K images | Training from scratch |
| Text classification | Fine-tune BERT/RoBERTa | LLM few-shot if labeled data scarce | Bag-of-words for nuanced tasks |
| Text generation | GPT-4/Claude API | Fine-tuned Llama/Mistral for cost | Training from scratch |
| Time series | Prophet/ARIMA baseline → LightGBM | Temporal Fusion Transformer | LSTM without strong reason |
| Recommendation | Collaborative filtering baseline | Two-tower neural | Complex models on <1K users |
| Anomaly detection | Isolation Forest | Autoencoder if high-dimensional | Supervised methods without labeled anomalies |
| Search/ranking | BM25 baseline → Learning to Rank | Cross-encoder reranking | Pure keyword without semantic |
Hyperparameter Tuning Strategy
- Manual first — understand 3-5 most impactful parameters
- Bayesian optimization (Optuna) — 50-100 trials for production models
- Grid search — only for final fine-tuning of 2-3 parameters
- Random search — better than grid for >4 parameters
Key hyperparameters by model:
| Model | Critical Params | Typical Range |
|---|---|---|
| XGBoost | learning_rate, max_depth, n_estimators, min_child_weight | 0.01-0.3, 3-10, 100-1000, 1-10 |
| LightGBM | learning_rate, num_leaves, feature_fraction, min_data_in_leaf | 0.01-0.3, 15-255, 0.5-1.0, 5-100 |
| Neural Net | learning_rate, batch_size, hidden_dims, dropout | 1e-5 to 1e-2, 32-512, arch-dependent, 0.1-0.5 |
| Random Forest | n_estimators, max_depth, min_samples_leaf | 100-1000, 5-30, 1-20 |
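A minimal Optuna sketch of the Bayesian-optimization strategy above, tuning the critical LightGBM parameters over the ranges in the table. It assumes pre-split `X_train`/`y_train`/`X_valid`/`y_valid` arrays and a binary AUC objective; adjust both to your task.

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import roc_auc_score

def objective(trial: optuna.Trial) -> float:
    params = {
        "objective": "binary",
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 5, 100),
        "verbosity": -1,
    }
    model = lgb.train(params, lgb.Dataset(X_train, y_train), num_boost_round=500)
    preds = model.predict(X_valid)
    return roc_auc_score(y_valid, preds)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)   # 50-100 trials for production models
print(study.best_params, study.best_value)
```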
Phase 4: Model Evaluation
Metric Selection by Task
| Task | Primary Metric | When to Use | Watch Out For |
|---|---|---|---|
| Binary classification (balanced) | F1-score | Equal importance of precision/recall | — |
| Binary classification (imbalanced) | PR-AUC | Rare positive class (<5%) | ROC-AUC hides poor performance on minority |
| Multi-class | Macro F1 | All classes equally important | Micro F1 if class frequency = importance |
| Regression | MAE | Outliers should not dominate | RMSE penalizes large errors more |
| Ranking | NDCG@K | Top-K results matter most | MAP if binary relevance |
| Generation | Human eval + automated | Quality is subjective | BLEU/ROUGE alone are insufficient |
| Anomaly detection | Precision@K | False positives are expensive | Recall if missing anomalies is dangerous |
Evaluation Rigor Checklist
Offline-to-Online Gap Analysis
Before deploying, verify that none of these offline/online differences introduces train-serving skew:
| Check | Offline | Online | Action |
|---|---|---|---|
| Feature computation | Batch SQL | Real-time API | Verify same logic, test with replay |
| Data freshness | Point-in-time snapshot | Latest value | Document acceptable staleness |
| Missing values | Imputed in pipeline | May be truly missing | Handle gracefully in serving |
| Feature distributions | Training period | Current period | Monitor drift post-deploy |
Phase 5: Model Deployment
Deployment Pattern Decision Tree
```
Is latency < 100ms required?
├── Yes → Is model < 500MB?
│   ├── Yes → Embedded serving (FastAPI + model in memory)
│   └── No → Model server (Triton, TorchServe, vLLM)
└── No → Is it a batch prediction?
    ├── Yes → Batch pipeline (Spark, Airflow + offline inference)
    └── No → Async queue (Celery/SQS → worker → result store)
```
Production Serving Checklist
```yaml
serving_config:
  model:
    format: ""                 # ONNX | TorchScript | SavedModel | safetensors
    version: ""                # Semantic version
    size_mb: null
    load_time_seconds: null
  infrastructure:
    compute: ""                # CPU | GPU (T4/A10/A100/H100)
    instances: null            # Min/max for autoscaling
    autoscale_metric: ""       # RPS | latency_p99 | GPU_utilization
    autoscale_target: null
  api:
    endpoint: ""
    input_schema: {}           # Pydantic model or JSON schema
    output_schema: {}
    timeout_ms: null
    rate_limit: null
  reliability:
    health_check: "/health"
    readiness_check: "/ready"  # Model loaded and warm
    graceful_shutdown: true
    circuit_breaker: true
    fallback: ""               # Rules-based fallback when model is down
```
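A minimal sketch of the embedded-serving pattern from the decision tree above: FastAPI with the model loaded once into memory, plus the `/health` and `/ready` endpoints the reliability block expects. The model path, input schema, and classifier assumption are placeholders; this would live in `src/serve.py`, the module the Dockerfile below runs.

```python
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

model = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    model = joblib.load("model/model.joblib")  # hypothetical path; load once at startup
    yield

app = FastAPI(lifespan=lifespan)

class PredictRequest(BaseModel):
    features: list[float]  # placeholder schema; use explicit named fields in production

class PredictResponse(BaseModel):
    score: float
    model_version: str = "1.0.0"

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/ready")
def ready():
    # Readiness means the model is actually loaded and warm, not just the process being up.
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return {"status": "ready"}

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    # Assumes a scikit-learn-style binary classifier held in memory.
    score = float(model.predict_proba([req.features])[0][1])
    return PredictResponse(score=score)
```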
Containerization Template
```dockerfile
# Multi-stage build for a minimal image
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY model/ ./model/
COPY src/ ./src/

# Non-root user
RUN useradd -m appuser && chown -R appuser /app
USER appuser

# Health check (python:3.11-slim ships without curl, so use the stdlib instead)
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.serve:app", "--host", "0.0.0.0", "--port", "8080"]
```
A/B Testing for Models
```yaml
ab_test:
  name: ""
  hypothesis: ""
  primary_metric: ""          # Business metric (revenue, engagement, etc.)
  guardrail_metrics: []       # Metrics that must NOT degrade
  traffic_split:
    control: 50               # Current model
    treatment: 50             # New model
  minimum_sample_size: null   # Power analysis: use statsmodels or online calculator
  minimum_runtime_days: null  # At least 1 full business cycle (7 days min)
  decision_criteria:
    ship: "Treatment > control by >X% with p<0.05 AND no guardrail regression"
    iterate: "Promising signal but not significant — extend test or refine model"
    kill: "No improvement after 2x minimum runtime OR guardrail breach"
```
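For `minimum_sample_size`, a short power-analysis sketch with statsmodels; the 5% baseline conversion rate and 10% relative lift are invented numbers used only to show the calculation.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05                   # hypothetical control conversion rate
treatment_rate = baseline_rate * 1.10  # hypothetical +10% relative lift we want to detect

effect_size = proportion_effectsize(treatment_rate, baseline_rate)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # significance level
    power=0.8,    # 80% chance of detecting the lift if it exists
    ratio=1.0,    # 50/50 traffic split
)
print(f"Minimum sample size per arm: {n_per_arm:,.0f}")
```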
Phase 6: LLM Engineering
LLM Application Architecture
```
┌────────────────────────────────────────────┐
│             Application Layer              │
│ (Prompt templates, chains, output parsers) │
├────────────────────────────────────────────┤
│            Orchestration Layer             │
│    (Routing, fallback, retry, caching)     │
├────────────────────────────────────────────┤
│                Model Layer                 │
│ (API calls, fine-tuned models, embeddings) │
├────────────────────────────────────────────┤
│                 Data Layer                 │
│ (Vector store, context retrieval, memory)  │
└────────────────────────────────────────────┘
```
Model Selection for LLM Tasks
| Task | Best Option | Cost-Effective Option | When to Fine-Tune |
|---|---|---|---|
| General reasoning | Claude Opus / GPT-4o | Claude Sonnet / GPT-4o-mini | Never for general reasoning |
| Classification | Fine-tuned small model | Few-shot with Sonnet | >1,000 labeled examples + high volume |
| Extraction | Structured output API | Regex + LLM fallback | Consistent format needed at scale |
| Summarization | Claude Sonnet | GPT-4o-mini | Domain-specific style needed |
| Code generation | Claude Sonnet | Codestral / DeepSeek | Internal codebase conventions |
| Embeddings | text-embedding-3-large | text-embedding-3-small | Domain-specific vocab (medical, legal) |
RAG System Architecture
```yaml
rag_pipeline:
  ingestion:
    chunking:
      strategy: "semantic"    # semantic | fixed_size | recursive
      chunk_size: 512         # tokens (512-1024 for most use cases)
      overlap: 50             # tokens overlap between chunks
    metadata_to_preserve:
      - source_document
      - page_number
      - section_heading
      - date_created
    embedding:
      model: "text-embedding-3-large"
      dimensions: 1536        # Or 256/512 with Matryoshka for cost savings
      vector_store: "Pinecone | Weaviate | pgvector | Qdrant"
  retrieval:
    strategy: "hybrid"        # dense | sparse | hybrid (recommended)
    top_k: 10                 # Retrieve more, then rerank
    reranking:
      model: "Cohere rerank | cross-encoder"
      top_n: 3                # Final context chunks
    filters: []               # Metadata filters (date range, source, etc.)
  generation:
    model: ""
    system_prompt: |
      Answer based ONLY on the provided context.
      If the context doesn't contain the answer, say "I don't have enough information."
      Cite sources using [Source: document_name, page X].
    temperature: 0.1          # Low for factual, higher for creative
    max_tokens: null
```
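A minimal sketch of the `fixed_size` chunking strategy with overlap, approximating tokens by whitespace-split words; a real pipeline would count tokens with the embedding model's tokenizer (e.g. tiktoken).

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks. Word-based here as a rough proxy for tokens."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the document
    return chunks

# Example: a 1,200-word document with 512-word chunks and 50-word overlap
# yields chunks starting at words 0, 462, and 924.
```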
RAG Quality Checklist
LLM Cost Optimization
| Strategy | Savings | Trade-off |
|---|---|---|
| Prompt caching | 50-90% on repeated prefixes | Requires cache-friendly prompt design |
| Model routing (small → large) | 40-70% | Slightly higher latency, need router logic |
| Batch API | 50% | Hours of delay, batch-only workloads |
| Shorter prompts | Linear with token reduction | May reduce quality |
| Fine-tuned small model | 80-95% vs large model API | Training cost + maintenance |
| Semantic caching | 50-80% for similar queries | May return stale/wrong cached result |
| Output token limits | Proportional | May truncate useful information |
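A sketch of the small-to-large model routing strategy from the table: answer with the cheap model first and escalate only when a confidence heuristic fails. The model callables and the confidence check are placeholders, not any specific provider's API.

```python
from typing import Callable

def route_completion(
    prompt: str,
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
    is_confident: Callable[[str], bool],
) -> tuple[str, str]:
    """Answer with the small model when possible; escalate to the large model otherwise."""
    draft = small_model(prompt)
    if is_confident(draft):
        return draft, "small"            # cheap path covers the easy majority of queries
    return large_model(prompt), "large"  # expensive path for ambiguous or complex queries

# Example confidence heuristic (placeholder): reject refusals and suspiciously short answers.
def simple_confidence(answer: str) -> bool:
    return len(answer) > 20 and "i don't know" not in answer.lower()
```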
Phase 7: Model Monitoring
Monitoring Dashboard
```yaml
monitoring:
  model_performance:
    metrics:
      - name: "primary_metric"    # Same as offline evaluation
        threshold: null           # Alert if below
        window: "1h | 1d | 7d"
      - name: "prediction_distribution"
        alert: "KL divergence > 0.1 from training distribution"
  latency:
    p50_ms: null
    p95_ms: null
    p99_ms: null
    alert_threshold_ms: null
  throughput:
    requests_per_second: null
    error_rate_threshold: 0.01    # Alert if >1% errors
  data_drift:
    method: "PSI | KS-test | JS-divergence"
    features_to_monitor: []       # Top 10 most important features
    check_frequency: "hourly | daily"
    alert_threshold: null         # PSI > 0.2 = significant drift
  concept_drift:
    method: "performance_degradation"
    ground_truth_delay: ""        # How long until we get labels?
    proxy_metrics: []             # Metrics available before ground truth
    retraining_trigger: ""        # When to retrain
```
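A minimal PSI (Population Stability Index) sketch for the `data_drift` check: bin the training-period feature by quantiles, then compare current traffic against those bins. The 0.1/0.25 thresholds echo the playbook below; the function assumes a continuous feature.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a training-period feature (expected) and a serving-period feature (actual)."""
    # Interior bin edges come from training-period quantiles, so skewed features bin sensibly.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    expected_counts = np.bincount(np.digitize(expected, edges), minlength=bins)
    actual_counts = np.bincount(np.digitize(actual, edges), minlength=bins)
    # Clip proportions to avoid log(0) on empty bins.
    expected_pct = np.clip(expected_counts / len(expected), 1e-6, None)
    actual_pct = np.clip(actual_counts / len(actual), 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 investigate, > 0.25 significant drift.
```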
Drift Response Playbook
| Drift Type | Detection | Severity | Response |
|---|---|---|---|
| Feature drift (input distribution shifts) | PSI > 0.1 | Warning | Investigate cause, monitor performance |
| Feature drift (severe) | PSI > 0.25 | Critical | Retrain on recent data within 24h |
| Concept drift (relationship changes) | Performance drop >5% | Critical | Retrain with new labels, review features |
| Label drift (target distribution changes) | Chi-square test | Warning | Verify label quality, check for data issues |
| Prediction drift (output distribution shifts) | KL divergence | Warning | May indicate upstream data issue |
Automated Retraining Pipeline
```yaml
retraining:
  triggers:
    - type: "scheduled"
      frequency: "weekly | monthly"
    - type: "performance"
      condition: "primary_metric < threshold for 24h"
    - type: "drift"
      condition: "PSI > 0.2 on any top-10 feature"
  pipeline:
    1_data_validation:
      - check_completeness
      - check_distribution_shift
      - check_label_quality
    2_training:
      - use_latest_N_months_data
      - same_hyperparameters_as_production   # Unless scheduled tuning
      - log_all_metrics
    3_evaluation:
      - compare_vs_production_model
      - must_beat_production_on_primary_metric
      - must_not_regress_on_guardrail_metrics
      - evaluate_on_golden_test_set
    4_deployment:
      - canary_deployment: "5%"
      - monitor_for: "4h minimum"
      - auto_rollback_if: "error_rate > 2x baseline"
      - gradual_rollout: "5% → 25% → 50% → 100%"
    5_notification:
      - log_retraining_event
      - notify_team_on_failure
      - update_model_registry
```
Phase 8: MLOps Infrastructure
ML Platform Components
| Component | Purpose | Tools |
|---|---|---|
| Experiment tracking | Log runs, compare results | MLflow, W&B, Neptune |
| Feature store | Centralized feature management | Feast, Tecton, Hopsworks |
| Model registry | Version, stage, approve models | MLflow Registry, SageMaker |
| Pipeline orchestration | DAG-based ML workflows | Airflow, Prefect, Dagster, Kubeflow |
| Model serving | Low-latency inference | Triton, TorchServe, vLLM, BentoML |
| Monitoring | Drift, performance, data quality | Evidently, Whylogs, Great Expectations |
| Vector store | Embedding storage for RAG | Pinecone, Weaviate, pgvector, Qdrant |
| GPU management | Training and inference compute | K8s + GPU operator, RunPod, Modal |
CI/CD for ML
```yaml
ml_cicd:
  on_code_change:
    - lint_and_type_check
    - unit_tests (data transforms, feature logic)
    - integration_tests (pipeline end-to-end on sample data)
  on_data_change:
    - data_validation (Great Expectations / custom)
    - feature_pipeline_run
    - smoke_test_predictions
  on_model_change:
    - full_evaluation_suite
    - bias_and_fairness_check
    - performance_regression_test
    - model_size_and_latency_check
    - security_scan (model file, dependencies)
    - staging_deployment
    - integration_test_in_staging
    - approval_gate (manual for major versions)
    - canary_deployment
```
Model Registry Workflow
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ Development  │ ───→ │   Staging    │ ───→ │  Production  │
│              │      │              │      │              │
│ - Experiment │      │ - Eval suite │      │ - Canary     │
│ - Log metrics│      │ - Load test  │      │ - Monitor    │
│ - Compare    │      │ - Approval   │      │ - Rollback   │
└──────────────┘      └──────────────┘      └──────────────┘
```
Promotion criteria:
- Dev → Staging: Beats current production on offline metrics
- Staging → Production: Passes load test + integration test + human approval
- Auto-rollback: Error rate >2x OR latency >2x OR primary metric drops >5%
Phase 9: Responsible AI
Bias Detection Checklist
Model Card Template
```yaml
model_card:
  model_name: ""
  version: ""
  date: ""
  owner: ""
  description: ""
  intended_use: ""
  out_of_scope_uses: ""
  training_data:
    source: ""
    size: ""
    date_range: ""
    known_biases: ""
  evaluation:
    metrics: {}
    datasets: []
    sliced_metrics: {}   # Performance by subgroup
  limitations: []
  ethical_considerations: []
  maintenance:
    retraining_schedule: ""
    monitoring: ""
    contact: ""
```
Phase 10: Cost & Performance Optimization
GPU Selection Guide
| Use Case | GPU | VRAM | Cost/hr (cloud) | Best For |
|---|---|---|---|---|
| Fine-tune 7B model | A10G | 24GB | ~$1 | LoRA/QLoRA fine-tuning |
| Fine-tune 70B model | A100 80GB | 80GB | ~$4 | QLoRA on 70B; full fine-tuning of smaller models |
| Serve 7B model | T4 | 16GB | ~$0.50 | Inference at scale |
| Serve 70B model | A100 40GB | 40GB | ~$2 | Quantized (INT4) large-model inference |
| Train from scratch | H100 | 80GB | ~$8 | Pre-training, large-scale training |
Inference Optimization Techniques
| Technique | Speedup | Quality Impact | Complexity |
|---|---|---|---|
| Quantization (INT8) | 2-3x | <1% degradation | Low |
| Quantization (INT4/GPTQ) | 3-4x | 1-3% degradation | Medium |
| Batching | 2-10x throughput | None | Low |
| KV-cache optimization | 20-40% memory savings | None | Medium |
| Speculative decoding | 2-3x for LLMs | None (mathematically exact) | High |
| Model distillation | 5-10x smaller model | 2-5% degradation | High |
| ONNX Runtime | 1.5-3x | None | Low |
| TensorRT | 2-5x | <1% | Medium |
| vLLM (PagedAttention) | 2-4x throughput for LLMs | None | Low |
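A minimal INT8 dynamic-quantization sketch with PyTorch, the lowest-effort row in the table; it quantizes linear layers only and mainly benefits CPU inference. The two-layer model is a placeholder.

```python
import torch
import torch.nn as nn

# Placeholder model; substitute your trained network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1)).eval()

# Dynamic quantization: weights stored as INT8, activations quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
with torch.no_grad():
    print("fp32:", model(x).item(), "int8:", quantized(x).item())
```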
Cost Tracking Template
```yaml
ml_costs:
  training:
    compute_cost_per_run: null
    runs_per_month: null
    data_storage_monthly: null
    experiment_tracking: null
  inference:
    cost_per_1k_predictions: null
    daily_volume: null
    monthly_cost: null
    cost_per_query_breakdown:
      compute: null
      model_api_calls: null
      vector_db: null
      data_transfer: null
  optimization_targets:
    cost_per_prediction: null   # Target
    monthly_budget: null
    cost_reduction_goal: ""
```
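A small sketch that rolls the `cost_per_query_breakdown` up into cost per 1K predictions and a monthly figure; every unit price below is an invented placeholder.

```python
from dataclasses import dataclass

@dataclass
class InferenceCosts:
    """Per-query cost components, in dollars (placeholder values)."""
    compute: float = 0.0002
    model_api_calls: float = 0.0010
    vector_db: float = 0.0001
    data_transfer: float = 0.00005

    @property
    def per_query(self) -> float:
        return self.compute + self.model_api_calls + self.vector_db + self.data_transfer

costs = InferenceCosts()
daily_volume = 50_000  # hypothetical prediction volume
monthly_cost = costs.per_query * daily_volume * 30
print(f"Cost per 1K predictions: ${costs.per_query * 1000:.2f}")
print(f"Estimated monthly inference cost: ${monthly_cost:,.0f}")
```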
Phase 11: ML System Quality Rubric
Score your ML system (0-100):
| Dimension | Weight | 0-2 (Poor) | 3-4 (Good) | 5 (Excellent) |
|---|---|---|---|---|
| Problem framing | 15% | No clear business metric | Defined success metric | Kill criteria + baseline + ROI estimate |
| Data quality | 15% | Ad-hoc data, no validation | Automated quality checks | Feature store + lineage + versioning |
| Experiment rigor | 15% | No tracking, one-off notebooks | MLflow/W&B tracking | Reproducible pipelines + proper evaluation |
| Model performance | 15% | Barely beats baseline | Significant improvement | Calibrated, fair, robust to edge cases |
| Deployment | 10% | Manual deployment | CI/CD for models | Canary + auto-rollback + A/B testing |
| Monitoring | 15% | No monitoring | Basic metrics dashboard | Drift detection + auto-retraining + alerts |
| Documentation | 5% | Nothing documented | Model card exists | Full model card + runbooks + decision log |
| Cost efficiency | 10% | No cost tracking | Budget exists | Optimized inference + cost-per-prediction tracking |
Scoring:
- 80-100: Production-grade ML system
- 60-79: Good foundations, missing operational maturity
- 40-59: Prototype quality, not ready for production
- <40: Science project, needs fundamental rework
Common Mistakes
| Mistake | Fix |
|---|---|
| Optimizing model before fixing data | Data quality > model complexity. Always. |
| Using accuracy on imbalanced data | Use PR-AUC, F1, or domain-specific metric |
| No baseline comparison | Always start with simple heuristic baseline |
| Training on future data | Temporal splits for time-series, strict leakage checks |
| Deploying without monitoring | No model in production without drift detection |
| Fine-tuning when prompting works | Try few-shot prompting first — fine-tune only for scale/cost |
| GPU for everything | CPU inference is often sufficient and 10x cheaper |
| Ignoring calibration | If probabilities matter (risk scoring), calibrate |
| One-time model deployment | ML is a continuous system — plan for retraining from day 1 |
| Premature scaling | Prove value with batch predictions before building real-time serving |
Quick Commands
- "Frame ML problem" → Phase 1 brief
- "Assess data quality" → Phase 2 scoring
- "Select model" → Phase 3 guide
- "Evaluate model" → Phase 4 checklist
- "Deploy model" → Phase 5 serving config
- "Build RAG" → Phase 6 RAG architecture
- "Set up monitoring" → Phase 7 dashboard
- "Optimize costs" → Phase 10 tracking
- "Score ML system" → Phase 11 rubric
- "Detect drift" → Phase 7 playbook
- "A/B test model" → Phase 5 template
- "Create model card" → Phase 9 template