# ML Engineering Guide

Production-grade ML/AI systems, MLOps, and model deployment.
## When to Use

- Deploying ML models to production
- Building ML platforms and infrastructure
- Implementing MLOps pipelines
- Integrating LLMs into production systems
- Setting up model monitoring and drift detection
## Tech Stack

| Category | Tools |
|---|---|
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| LLM Frameworks | LangChain, LlamaIndex, DSPy |
| Data Tools | Spark, Airflow, dbt, Kafka, Databricks |
| Deployment | Docker, Kubernetes, AWS/GCP/Azure |
| Monitoring | MLflow, Weights & Biases, Prometheus |
| Databases | PostgreSQL, BigQuery, Snowflake, Pinecone |
## Production Patterns

### Model Deployment Pipeline

```python
# Model serving with FastAPI
import torch
from fastapi import FastAPI

app = FastAPI()
# Load the model once at startup, not per request
model = torch.load("model.pth")
model.eval()

@app.post("/predict")
async def predict(data: dict):
    tensor = preprocess(data)  # preprocess() maps the JSON payload to a tensor
    with torch.no_grad():
        prediction = model(tensor)
    return {"prediction": prediction.tolist()}
```
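The endpoint above accepts an untyped `dict`; a Pydantic schema lets FastAPI reject malformed payloads with a 422 before they ever reach the model. A minimal sketch — the field names are illustrative, not part of the original pattern:

```python
from typing import List
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    # Hypothetical input contract: adjust fields to your model's features
    features: List[float]

class PredictionResponse(BaseModel):
    prediction: List[float]
```

Declaring the endpoint as `async def predict(req: PredictionRequest)` then gives free validation and OpenAPI docs.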
### Feature Store Integration

```python
# Feast feature store
from feast import FeatureStore

store = FeatureStore(repo_path=".")
features = store.get_online_features(
    features=["user_features:age", "user_features:location"],
    entity_rows=[{"user_id": 123}],
).to_dict()
```
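Online stores return `None` for entities they have never seen (e.g. a brand-new user), so serving code usually backfills defaults. A small defensive sketch — function and feature names are illustrative:

```python
def fill_feature_defaults(features: dict, defaults: dict) -> dict:
    # Replace None values from the online store with per-feature fallbacks,
    # so a cold-start entity does not crash the model's preprocessing.
    return {k: (defaults.get(k) if v is None else v) for k, v in features.items()}
```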
### Model Monitoring

```python
# Drift detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_df, current_data=curr_df)
```
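Evidently's preset handles drift across a whole DataFrame; for a single numeric feature, the underlying idea can be sketched with the Population Stability Index in pure NumPy. The thresholds below are the common rule of thumb, not values from the original:

```python
import numpy as np

def population_stability_index(ref, cur, bins=10):
    """PSI between a reference sample and a current sample.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate drift, > 0.2 significant."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(ref, np.linspace(0.0, 1.0, bins + 1))
    # Clip current data into the reference range so every point lands in a bin
    cur = np.clip(cur, edges[0], edges[-1])
    ref_pct = np.histogram(ref, bins=edges)[0] / len(ref)
    cur_pct = np.histogram(cur, bins=edges)[0] / len(cur)
    eps = 1e-6  # avoid log(0) on empty bins
    return float(np.sum((cur_pct - ref_pct) * np.log((cur_pct + eps) / (ref_pct + eps))))
```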
## MLOps Best Practices

### Development

- Test-driven development for ML pipelines
- Version control for models and data
- Reproducible experiments with MLflow

### Production

- A/B testing infrastructure
- Canary deployments for models
- Automated retraining pipelines
- Model monitoring and drift detection
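Canary deployments need a deterministic traffic split so the same request key always hits the same model version. A hash-based sketch — the 5% fraction and key choice are illustrative:

```python
import hashlib

def route_to_canary(request_id: str, canary_fraction: float = 0.05) -> bool:
    # Deterministic bucket in [0, 10000) from a stable hash of the request id;
    # the same id always routes to the same version across replicas.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < canary_fraction * 10_000
```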
## Performance Targets

| Metric | Target |
|---|---|
| P50 Latency | < 50ms |
| P95 Latency | < 100ms |
| P99 Latency | < 200ms |
| Throughput | 1000 RPS |
| Availability | 99.9% |
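Whether a deployment meets the latency targets can be checked directly from raw request timings. A NumPy sketch, with the target values taken from the table above:

```python
import numpy as np

TARGETS_MS = {"p50": 50.0, "p95": 100.0, "p99": 200.0}

def latency_report(samples_ms):
    # Percentiles of observed request latencies, in milliseconds
    p50, p95, p99 = np.percentile(samples_ms, [50, 95, 99])
    return {"p50": float(p50), "p95": float(p95), "p99": float(p99)}

def meets_targets(report, targets=TARGETS_MS):
    # Every percentile must come in under its target
    return all(report[k] < targets[k] for k in targets)
```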
## LLM Integration Patterns

### RAG System

```python
# Basic RAG with LangChain
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA

vectorstore = Pinecone.from_existing_index(
    index_name="docs",
    embedding=OpenAIEmbeddings(),
)
qa = RetrievalQA.from_chain_type(
    llm=llm,  # any LangChain-compatible LLM instance
    retriever=vectorstore.as_retriever(),
)
```
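Under the hood, the retriever performs nearest-neighbor search over embedding vectors. The core operation can be sketched without any vector database, using cosine similarity over toy vectors:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    # Cosine similarity = dot product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar docs
```

A real retriever adds an approximate index (e.g. HNSW) so this scales past brute force.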
### Prompt Management

```python
# Structured prompts with DSPy
import dspy

class QA(dspy.Signature):
    """Answer questions based on context."""

    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

qa = dspy.Predict(QA)
```
## Common Commands

### Development

```bash
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/
```

### Training

```bash
python scripts/train.py --config prod.yaml
mlflow run . -P epochs=10
```

### Deployment

```bash
docker build -t model:v1 .
kubectl apply -f k8s/model-serving.yaml
```

### Monitoring

```bash
mlflow ui --port 5000
```
## Security & Compliance

- Authentication for model endpoints
- Data encryption (at rest & in transit)
- PII handling and anonymization
- GDPR/CCPA compliance
- Model access audit logging
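For the PII item above, one common anonymization technique is keyed pseudonymization: a keyed HMAC rather than a bare hash, so low-entropy values like emails cannot be reversed by a dictionary attack. A minimal sketch — key management (rotation, storage) is out of scope here:

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    # HMAC-SHA256: deterministic per key (joins still work), irreversible without the key
    return hmac.new(secret_key, value.encode(), hashlib.sha256).hexdigest()
```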