# AI/ML Engineering
Build production AI systems with modern patterns and tools.
## Quick Reference

### The 2026 AI Stack

| Layer | Tool | Purpose |
| --- | --- | --- |
| Prompting | DSPy | Programmatic prompt optimization |
| Orchestration | LangGraph | Stateful multi-agent workflows |
| RAG | LlamaIndex | Document ingestion and retrieval |
| Vectors | Qdrant / Pinecone | Embedding storage and search |
| Evaluation | RAGAS | RAG quality metrics |
| Experiment Tracking | MLflow / W&B | Logging, versioning, comparison |
| Serving | BentoML / vLLM | Model deployment |
| Protocol | MCP | Tool and context integration |
## DSPy: Programmatic Prompting
Hand-tuned prompt strings are brittle and hard to maintain. DSPy treats prompts as optimizable code:
```python
import dspy

# Assumes an LM is configured first, e.g.:
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))


class QA(dspy.Signature):
    """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="1-5 words")


# Create module
qa = dspy.Predict(QA)

# Use it
result = qa(question="What is the capital of France?")
print(result.answer)  # "Paris"
```
Optimize with real data:
```python
from dspy.teleprompt import BootstrapFewShot

# exact_match is a metric function; train_data is a list of dspy.Example
optimizer = BootstrapFewShot(metric=exact_match)
optimized_qa = optimizer.compile(qa, trainset=train_data)
```
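For reference, a minimal sketch of what `exact_match` and `train_data` could look like; the metric follows DSPy's `(example, pred, trace)` convention, and the rows are made-up examples:

```python
import dspy

def exact_match(example, pred, trace=None):
    # True when the predicted answer matches the gold answer (case-insensitive)
    return example.answer.strip().lower() == pred.answer.strip().lower()

train_data = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
]
```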
## RAG Architecture (Production)
```
Query → Rewrite → Hybrid Retrieval → Rerank → Generate → Cite
           │             │              │
           v             v              v
    Query expansion  Dense + BM25  Cross-encoder
```
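The rerank stage is usually the cheapest quality win. A minimal sketch using a sentence-transformers cross-encoder; the model name is illustrative, not prescribed by this stack:

```python
from sentence_transformers import CrossEncoder

# Illustrative model choice; any cross-encoder reranker works here
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[str]:
    # Score each (query, passage) pair jointly, then keep the best top_k
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:top_k]]
```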
LlamaIndex + LangGraph Pattern:
```python
from llama_index.core import VectorStoreIndex
from langgraph.graph import StateGraph

# Data layer (LlamaIndex); `docs` is your loaded document list
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

# Control layer (LangGraph)
def retrieve(state):
    response = query_engine.query(state["question"])
    return {"context": response.response, "sources": response.source_nodes}

# `State` and `generate_answer` are defined by your application
graph = StateGraph(State)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate_answer)
graph.add_edge("retrieve", "generate")
```
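A sketch of the pieces the snippet leaves open: the `State` schema, a stub `generate_answer` node, and compilation. These shapes are assumptions, and in a real module the definitions would precede the graph construction:

```python
from typing import TypedDict

class State(TypedDict):
    question: str
    context: str
    sources: list
    answer: str

def generate_answer(state: State) -> dict:
    # Stub: replace with a real LLM call that conditions on state["context"]
    return {"answer": f"(grounded answer to {state['question']!r})"}

# Wire up and run, continuing the graph built above
graph.set_entry_point("retrieve")
app = graph.compile()
result = app.invoke({"question": "What is our refund policy?"})
```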
## MCP Integration
The Model Context Protocol (MCP) is the open standard for connecting models to tools and context:
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-tools")

@mcp.tool()
async def search_docs(query: str) -> str:
    """Search the knowledge base."""
    # vector_store and format_results come from your application
    results = await vector_store.search(query)
    return format_results(results)

if __name__ == "__main__":
    mcp.run()
```
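On the consumer side, a sketch of calling that tool over stdio with the SDK's client primitives; the server path and tool arguments are illustrative, and the client API surface may shift between SDK versions:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server as a subprocess and talk to it over stdio
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("search_docs", {"query": "pricing"})
            print(result.content)

asyncio.run(main())
```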
## Embeddings (2026)

| Model | Dimensions | Best For |
| --- | --- | --- |
| text-embedding-3-large | 3072 | General purpose |
| BGE-M3 | 1024 | Multilingual RAG |
| Qwen3-Embedding | Flexible | Custom domains |
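As an illustration of the first row, a minimal sketch of embedding and comparing texts with the OpenAI SDK (assumes `OPENAI_API_KEY` is set in the environment):

```python
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-large", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

print(cosine(embed("vector database"), embed("embedding store")))
```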
## Fine-Tuning with LoRA/QLoRA
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)

# base_model is a pretrained transformers model loaded elsewhere
model = get_peft_model(base_model, config)
```
With QLoRA (4-bit quantized base weights plus LoRA adapters), training fits in roughly 24 GB of VRAM, i.e., a single RTX 4090.
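A minimal sketch of that QLoRA setup: quantize the base model to 4-bit with bitsandbytes, prepare it for k-bit training, then attach the adapters. The checkpoint name is an assumption; swap in your own:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # assumed checkpoint; use your own
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
```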
## MLOps Pipeline
```python
import mlflow

# MLflow tracking
mlflow.set_experiment("rag-v2")

with mlflow.start_run():
    mlflow.log_params({"chunk_size": 512, "model": "gpt-4"})
    mlflow.log_metrics({"faithfulness": 0.92, "relevance": 0.88})
    mlflow.log_artifact("prompts/qa.txt")
```
## Evaluation with RAGAS
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# dataset: a datasets.Dataset with question/answer/contexts/ground_truth columns
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)  # {'faithfulness': 0.92, 'answer_relevancy': 0.88, ...}
```
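A sketch of the `dataset` shape RAGAS expects; the column names follow the RAGAS docs (older versions use `ground_truths`), and the rows are made-up examples:

```python
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France's capital and largest city is Paris."]],
    "ground_truth": ["Paris"],
})
```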
## Vector Database Selection

| DB | Best For | Pricing |
| --- | --- | --- |
| Qdrant | Self-hosted, filtering | 1GB free forever |
| Pinecone | Managed, zero-ops | Free tier available |
| Weaviate | Knowledge graphs | 14-day trial |
| Milvus | Billion-scale | Self-hosted |
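For the self-hosted row, a minimal sketch with the official Qdrant client; the collection name and toy vectors are illustrative (1024 dims matches BGE-M3 above):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # in-process for demos; use a URL in production

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.1] * 1024, payload={"text": "hello"})],
)
hits = client.search(collection_name="docs", query_vector=[0.1] * 1024, limit=3)
```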
## Agents

- **ai-engineer** - LLM integration, RAG, MCP, production AI
- **mlops-engineer** - Model deployment, monitoring, pipelines
- **data-scientist** - Analysis, modeling, experimentation
- **ml-researcher** - Cutting-edge architectures, paper implementation
- **cv-engineer** - Computer vision, VLMs, image processing
## Deep Dives

- `references/dspy-guide.md`
- `references/rag-patterns.md`
- `references/mcp-integration.md`
- `references/fine-tuning.md`
- `references/evaluation.md`
## Examples

- `examples/rag-pipeline/`
- `examples/mcp-server/`
- `examples/dspy-optimization/`