LLM Development & Engineering — Complete Reference
Build, evaluate, and deploy LLM systems with modern production standards.
This skill covers the full LLM lifecycle:
- Development: Strategy selection, dataset design, instruction tuning, PEFT/LoRA fine-tuning
- Evaluation: Automated testing, LLM-as-judge, metrics, rollout gates
- Deployment: Serving handoff, latency/cost budgeting, reliability patterns (see ai-llm-inference)
- Operations: Quality monitoring, change management, incident response (see ai-mlops)
- Safety: Threat modeling, data governance, layered mitigations (NIST AI RMF: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf)
Modern Best Practices (2026):
- Treat the model as a component with contracts, budgets, and rollback plans (not "magic").
- Separate core concepts (tokenization, context, training vs adaptation) from implementation choices (providers, SDKs).
- Gate upgrades with repeatable evals and staged rollout; avoid blind model swaps.
- Cost-aware engineering: Measure cost per successful outcome, not just cost per token; design tiering/caching early.
- Security-by-design: Threat model prompt injection, data leakage, and tool abuse; treat guardrails as production code.
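As a minimal sketch of the "cost per successful outcome" idea in the list above: the `RequestRecord` schema, the meaning of `succeeded`, and the per-1k-token prices are all hypothetical placeholders, not any provider's actual log format or pricing.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One logged LLM request (hypothetical schema for illustration)."""
    input_tokens: int
    output_tokens: int
    succeeded: bool  # e.g. output passed validation or the user accepted it


def cost_per_successful_outcome(
    records: list[RequestRecord],
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Total spend (including failed requests) divided by successful requests."""
    total_cost = sum(
        r.input_tokens / 1000 * usd_per_1k_input
        + r.output_tokens / 1000 * usd_per_1k_output
        for r in records
    )
    successes = sum(1 for r in records if r.succeeded)
    if successes == 0:
        return float("inf")  # money was spent but nothing succeeded
    return total_cost / successes
```

Failed requests still count toward spend, which is why this number can rise even when cost per token stays flat.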
For detailed patterns: See Resources and Templates sections below.
Quick Reference
| Task | Tool/Framework | Command/Pattern | When to Use |
|------|----------------|-----------------|-------------|
| Choose architecture | Prompt vs RAG vs fine-tune | Start simple; add retrieval/adaptation only if needed | New products and migrations |
| Model selection | Scoring matrix | Quality/latency/cost/privacy/license weighting | Provider changes and procurement |
| Cost optimization | Tiered models + caching | Cascade routing, prompt caching, budget guardrails | Cost-sensitive production |
| Fine-tuning ROI | ROI calculator | Break-even analysis, TCO comparison | Investment decisions |
| Prompt contracts | Structured output + constraints | JSON schema, max tokens, refusal rules | Reliability and integration |
| RAG integration | Hybrid retrieval + grounding | Retrieve → rerank → pack → cite → verify | Fresh/large corpora, traceability |
| Fine-tuning | PEFT/LoRA (when justified) | Small targeted datasets + regression suite | Stable domains, repeated tasks |
| Evaluation | Offline + online | Golden sets + A/B + canary + monitoring | Prevent regressions and drift |
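The "Scoring matrix" row can be illustrated with a small weighted-sum ranking. This is a sketch under assumptions: the criterion names, the 0 to 1 score scale, and the `score_models` helper are hypothetical, and real selection would also include hard constraints (for example licensing) that a weighted sum cannot express.

```python
def score_models(
    candidates: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> list[tuple[str, float]]:
    """Rank candidates by the weighted average of their criterion scores.

    Scores are assumed to be normalized to [0, 1], higher is better
    (so "cost" should be entered as cost-efficiency, not raw price).
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    ranked = []
    for name, scores in candidates.items():
        missing = set(weights) - set(scores)
        if missing:
            raise ValueError(f"{name} is missing scores for {sorted(missing)}")
        weighted = sum(weights[c] * scores[c] for c in weights) / total_weight
        ranked.append((name, weighted))
    # Best candidate first
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

For example, `score_models({"a": {"quality": 1.0, "cost": 0.0}, "b": {"quality": 0.0, "cost": 1.0}}, {"quality": 2, "cost": 1})` ranks `"a"` first because quality carries twice the weight of cost.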
Decision Tree: LLM System Architecture
Building LLM application: [Architecture Selection]
├─ Need current knowledge?
│   ├─ Simple Q&A? → Basic RAG (page-level chunking + hybrid retrieval)
│   └─ Complex retrieval? → Advanced RAG (reranking + contextual retrieval)
│
├─ Need tool use / actions?
│   ├─ Single task? → Simple agent (ReAct pattern)
│   └─ Multi-step workflow? → Multi-agent (LangGraph, CrewAI)
│
├─ Static behavior sufficient?
│   ├─ Quick MVP? → Prompt engineering (CI/CD integrated)
│   └─ Production quality? → Fine-tuning (PEFT/LoRA)
│
└─ Best results?
    └─ Hybrid (RAG + Fine-tuning + Agents) → Comprehensive solution
See Decision Matrices for detailed selection criteria.
Cost-Quality Decision Framework
LLM spend is driven by usage-based inference (tokens/requests) plus supporting infra and engineering. Model selection is a cost-quality-latency-risk tradeoff.
Model Tier Strategy
| Tier | Typical profile | Use For |
|------|-----------------|---------|
| Value | Small/fast models | High-volume, simple tasks |
| Balanced | General-purpose models | Most production workloads |
| Premium | Frontier/large models | Hardest tasks, low volume |
Cost Optimization Levers
- Model tiering: Route simple requests to cheaper models; savings grow with the share of traffic the cheaper tier can serve
- Prompt caching: Reuse stable prefixes/context (provider-specific discounts and constraints)
- Prompt optimization: Compress examples and instructions to cut input tokens (re-run evals to confirm quality holds)
- Output limits: Set appropriate max_tokens (prevents runaway costs)
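Model tiering is often implemented as a cascade: try the cheapest tier first and escalate only when the answer looks unreliable. The sketch below assumes each model call returns an answer plus a confidence score in [0, 1]; that signature, the `cascade` helper, and the 0.8 threshold are hypothetical, and how confidence is obtained (self-check, verifier model, classifier) is left open.

```python
from typing import Callable

# Hypothetical signature: a model call returns (answer, confidence in [0, 1]).
ModelFn = Callable[[str], tuple[str, float]]


def cascade(
    prompt: str,
    tiers: list[tuple[str, ModelFn]],
    min_confidence: float = 0.8,
) -> tuple[str, str]:
    """Try tiers cheapest-first and return (tier_name, answer).

    Stops at the first tier whose confidence clears the threshold; if none
    does, the answer from the last (most capable) tier is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, call in tiers:
        answer, confidence = call(prompt)
        if confidence >= min_confidence:
            break
    return name, answer
```

Logging which tier answered each request gives the traffic split needed to estimate how much the cascade actually saves.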
When to Fine-Tune (ROI-Based)
Fine-tuning pays off when:
- Volume justifies it: >10k requests/month, enough for per-request savings to add up
- Domain is stable: Requirements unchanged for >6 months
- Data exists: >1,000 quality training examples available
- Break-even achievable: <12 months to recover investment
See Cost Economics for TCO modeling and Fine-Tuning ROI Calculator for investment analysis.
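The break-even criterion above reduces to simple arithmetic. The following is an illustrative sketch, not the Fine-Tuning ROI Calculator itself: the function name, its parameters, and the cost model (a one-time upfront cost, a flat monthly maintenance cost, and a per-request saving) are assumptions made for this example.

```python
import math


def break_even_months(
    upfront_cost_usd: float,
    monthly_requests: int,
    baseline_cost_per_request_usd: float,
    tuned_cost_per_request_usd: float,
    monthly_maintenance_usd: float = 0.0,
) -> float:
    """Months needed for per-request savings to repay the fine-tuning investment."""
    monthly_savings = (
        monthly_requests
        * (baseline_cost_per_request_usd - tuned_cost_per_request_usd)
        - monthly_maintenance_usd
    )
    if monthly_savings <= 0:
        return math.inf  # the investment is never recovered
    return upfront_cost_usd / monthly_savings
```

With a hypothetical 6,000 USD upfront cost, 100,000 requests/month, a 0.01 USD saving per request, and 500 USD/month maintenance, net savings are 500 USD/month and break-even lands at 12 months, right at the threshold listed above.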
Core Concepts (Vendor-Agnostic)
- Model classes: encoder-only, decoder-only, encoder-decoder, multimodal; choose based on task and latency.
- Tokenization & limits: context window, max output, and prompt/template overhead drive both cost and tail latency.
- Adaptation options: prompting → retrieval → adapters (LoRA) → full fine-tune; choose by stability and ROI (LoRA: https://arxiv.org/abs/2106.09685).
- Evaluation: metrics must map to user value; report uncertainty and slice performance, not only global averages.
- Governance: data retention, residency, licensing, and auditability are product requirements (EU AI Act: https://eur-lex.europa.eu/eli/reg/2024/1689/oj; NIST GenAI Profile: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf).
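To make "report uncertainty and slice performance" concrete, here is one possible sketch that computes a pass rate per slice with a Wilson score interval. The choice of the Wilson interval, the `(slice_name, passed)` result format, and both function names are assumptions for illustration; bootstrap intervals or other methods would serve equally well.

```python
import math
from collections import defaultdict


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def pass_rate_by_slice(
    results: list[tuple[str, bool]],
) -> dict[str, tuple[float, float, float]]:
    """Map each slice name to (pass_rate, ci_low, ci_high)."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for slice_name, passed in results:
        counts[slice_name][0] += int(passed)  # passes
        counts[slice_name][1] += 1            # total
    report = {}
    for name, (ok, n) in counts.items():
        low, high = wilson_interval(ok, n)
        report[name] = (ok / n, low, high)
    return report
```

A slice with very few examples gets a wide interval, which is a useful signal that a high or low pass rate there should not yet drive a rollout decision.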
Implementation Practices (Tooling Examples)
- Use a provider abstraction (gateway/router) to enable fallbacks and staged upgrades.
- Instrument requests with tokens, latency, and error classes (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).
- Maintain prompt/model registries with versioning, changelogs, and rollback criteria.
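A provider abstraction with fallbacks can be as small as an ordered list of callables. The sketch below is vendor-neutral on purpose: `Provider`, `AllProvidersFailed`, and `complete_with_fallback` are hypothetical names, and real gateways would add timeouts, retry policy, and telemetry on top.

```python
from typing import Callable

# Hypothetical provider interface: takes a prompt, returns completion text.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an exception."""


def complete_with_fallback(
    prompt: str,
    providers: list[tuple[str, Provider]],
) -> tuple[str, str]:
    """Try providers in order and return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # record the error class, then fall through
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the completion makes it straightforward to record which backend served each request.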
Do / Avoid
Do
- Do pin model + prompt versions in production, and re-run evals before any change.
- Do enforce budgets at the boundary: max tokens, max tools, max retries, max cost.
- Do plan for degraded modes (smaller model, cached answers, “unable to answer”).
Avoid
- Avoid model sprawl (unowned variants with no eval coverage).
- Avoid blind upgrades based on anecdotal quality; require measured impact.
- Avoid training on production logs without consent, governance, and leakage controls.
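The "enforce budgets at the boundary" and "plan for degraded modes" items can be combined in one wrapper. This is a sketch under stated assumptions: the `Budget` fields, the convention that a call returns `(answer_or_None, cost_usd)`, and the fallback string are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FALLBACK_ANSWER = "unable to answer"

# Hypothetical call signature: returns (answer or None if unusable, cost in USD).
CallFn = Callable[[str], tuple[Optional[str], float]]


@dataclass
class Budget:
    max_retries: int = 2
    max_cost_usd: float = 0.05


def call_within_budget(call: CallFn, prompt: str, budget: Budget) -> str:
    """Retry until an answer is usable, retries run out, or spend hits the cap."""
    spent = 0.0
    for _attempt in range(1 + budget.max_retries):
        if spent >= budget.max_cost_usd:
            break  # cost cap reached: stop before starting another attempt
        answer, cost = call(prompt)
        spent += cost
        if answer is not None:
            return answer
    return FALLBACK_ANSWER  # degraded mode instead of an unbounded retry loop
```

Because the cost of an attempt is only known after it completes, this version can overshoot the cap by at most one call; a stricter variant would estimate cost before calling.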
When to Use This Skill
Claude should invoke this skill when the user asks about:
- LLM preflight/project checklists, production best practices, or data pipelines
- Building or deploying RAG, agentic, or prompt-based LLM apps
- Prompt design, chain-of-thought (CoT), ReAct, or template patterns
- Troubleshooting LLM hallucination, bias, retrieval issues, or production failures
- Evaluating LLMs: benchmarks, multi-metric eval, or rollout/monitoring
- LLMOps: deployment, rollback, scaling, resource optimization
- Technology stack selection (models, vector DBs, frameworks)
- Production deployment strategies and operational patterns
Scope Boundaries (Use These Skills for Depth)
- Prompt design & CI/CD → ai-prompt-engineering
- RAG pipelines & chunking → ai-rag
- Search tuning (BM25, HNSW, hybrid) → ai-rag
- Agent architectures & tools → ai-agents
- Serving optimization/quantization → ai-llm-inference
- Production deployment/monitoring → ai-mlops
- Security/guardrails → ai-mlops
Resources (Best Practices & Operational Patterns)
Comprehensive operational guides with checklists, patterns, and decision frameworks:
Core Operational Patterns
Cost Economics & Decision Frameworks - Cost modeling, unit economics, TCO analysis
- Pricing/discount assumptions (verify against current provider docs)
- Cost-quality tradeoff framework and decision matrix
- Total Cost of Ownership (TCO) calculation
- Fine-tuning ROI framework and break-even analysis
- Prompt caching economics
- Cost monitoring and budget guardrails
Project Planning Patterns - Stack selection, FTI pipeline, performance budgeting
- AI engineering stack selection matrix
- Feature/Training/Inference (FTI) pipeline blueprint
- Performance budgeting and goodput gates
- Progressive complexity (prompt → RAG → fine-tune → hybrid)
Production Checklists - Pre-deployment validation and operational checklists
- LLM lifecycle checklist (modern production standards)
- Data & training, RAG pipeline, deployment & serving
- Safety/guardrails, evaluation, agentic systems
- Reliability & data infrastructure (DDIA-grade)
- Weekly production tasks
Common Design Patterns - Copy-paste ready implementation examples
- Chain-of-Thought (CoT) prompting
- ReAct (Reason + Act) pattern
- RAG pipeline (minimal to advanced)
- Agentic planning loop
- Self-reflection and multi-agent collaboration
Decision Matrices - Quick reference tables for selection
- RAG type decision matrix (naive → advanced → modular)
- Production evaluation table with targets and actions
- Model selection matrix (tier-based, vendor-agnostic)
- Vector database, embedding model, framework selection
- Deployment strategy matrix
Anti-Patterns - Common mistakes and prevention strategies
- Data leakage, prompt dilution, RAG context overload
- Agentic runaway, over-engineering, ignoring evaluation
- Hard-coded prompts, missing observability
- Detection methods and prevention code examples
Domain-Specific Patterns
- LLMOps Best Practices - Operational lifecycle and deployment patterns
- Evaluation Patterns - Testing, metrics, and quality validation
- Prompt Engineering Patterns - Quick reference (canonical skill: ai-prompt-engineering)
- Agentic Patterns - Quick reference (canonical skill: ai-agents)
- RAG Best Practices - Quick reference (canonical skill: ai-rag)
Emerging Patterns
- Structured Output Patterns - JSON mode, constrained decoding, schema enforcement, validation pipelines
- Multimodal Patterns - Vision-language models, audio/image inputs, cross-modal pipelines, cost management
- Model Migration Guide - Provider migration playbook, eval-gated rollout, prompt adaptation, fallback strategies
Note: Each resource file includes preflight/validation checklists, copy-paste reference tables, inline templates, anti-patterns, and decision matrices.
Templates (Copy-Paste Ready)
Production templates by use case and technology:
Selection & Governance
- Model Selection Matrix - Documented selection, scoring, licensing, and governance
- Fine-Tuning ROI Calculator - Investment analysis, break-even, go/no-go decisions
RAG Pipelines
- Basic RAG - Simple retrieval-augmented generation
- Advanced RAG - Hybrid retrieval, reranking, contextual embeddings
Prompt Engineering
- Chain-of-Thought - Step-by-step reasoning pattern
- ReAct - Reason + Act for tool use
Agentic Workflows
- Reflection Agent - Self-critique and improvement
- Multi-Agent - Manager-worker orchestration
Data Pipelines
- Data Quality - Validation, deduplication, PII detection
Deployment
- LLM Deployment - Production deployment with monitoring
Evaluation
- Multi-Metric Evaluation - Comprehensive testing suite
Shared Utilities (Centralized patterns — extract, don't duplicate)
- ../software-clean-code-standard/utilities/llm-utilities.md - Token counting, streaming, cost estimation
- ../software-clean-code-standard/utilities/error-handling.md - Effect Result types, correlation IDs
- ../software-clean-code-standard/utilities/resilience-utilities.md - p-retry v6, circuit breaker for LLM API calls
- ../software-clean-code-standard/utilities/logging-utilities.md - pino v9 + OpenTelemetry integration
- ../software-clean-code-standard/utilities/observability-utilities.md - OpenTelemetry SDK, tracing, metrics
- ../software-clean-code-standard/utilities/config-validation.md - Zod 3.24+, secrets management for API keys
- ../software-clean-code-standard/utilities/testing-utilities.md - Test factories, fixtures, mocks
- ../software-clean-code-standard/references/clean-code-standard.md - Canonical clean code rules (CC-*) for citation
Trend Awareness Protocol
IMPORTANT: For “best/latest” recommendations, verify recency using current sources (official docs/release notes/benchmarks). If you can’t browse, state assumptions and ask for timeframe + constraints.
Trigger Conditions
- "What's the best LLM model for [use case]?"
- "What should I use for [RAG/fine-tuning/agents]?"
- "What's the latest in LLM development?"
- "Current best practices for [prompting/evaluation/deployment]?"
- "Is [model/framework] still relevant in 2026?"
- "[Model A] vs [Model B]?" or "[Framework A] vs [Framework B]?"
- "Best vector database for [use case]?"
- "What agent framework should I use?"
Minimal Verification Checklist
- Confirm user constraints: latency, cost, privacy/compliance, deployment target, and toolchain.
- Check at least 2 authoritative sources from data/sources.json (provider docs, release notes, pricing/quotas, deprecations).
- Prefer stable guidance (tradeoffs + decision criteria) over “one best model/framework”.
What to Report
After searching, provide:
- Current landscape: What models/frameworks are popular NOW (not 6 months ago)
- Emerging trends: New models, frameworks, or techniques gaining traction
- Deprecated/declining: Models/frameworks losing relevance or support
- Recommendation: Based on fresh data, not just static knowledge
Example Topics (verify with fresh sources)
- Latest frontier models (GPT-4.5, Claude 4, Gemini 2.x, Llama 4)
- Agent frameworks (LangGraph, CrewAI, AutoGen, Semantic Kernel)
- Vector databases (Pinecone, Qdrant, Weaviate, pgvector)
- RAG techniques (contextual retrieval, agentic RAG, graph RAG)
- Inference engines (vLLM, TensorRT-LLM, SGLang)
- Evaluation frameworks (RAGAS, DeepEval, Braintrust)
Related Skills
This skill integrates with complementary Claude Code skills:
Core Dependencies
- ai-rag - Retrieval pipelines: chunking, hybrid search, reranking, evaluation
- ai-prompt-engineering - Systematic prompt design, evaluation, testing, and optimization
- ai-agents - Agent architectures, tool use, multi-agent systems, autonomous workflows
Production & Operations
- ai-llm-inference - Production serving, quantization, batching, GPU optimization
- ai-mlops - Deployment, monitoring, incident response, security, and governance
External Resources
See data/sources.json for 50+ curated authoritative sources:
- Official LLM platform docs - OpenAI, Anthropic, Gemini, Mistral, Azure OpenAI, AWS Bedrock
- Open-source models and frameworks - HuggingFace Transformers, open-weight models, PEFT/LoRA, distributed training/inference stacks
- RAG frameworks and vector DBs - LlamaIndex, LangChain 1.2+, LangGraph, LangGraph Studio v2, Haystack, Pinecone, Qdrant, Chroma
- Agent frameworks (examples) - LangGraph, Semantic Kernel, AutoGen, CrewAI
- RAG innovations (examples) - Graph-based retrieval, hybrid retrieval, online evaluation loops
- Prompt engineering - Anthropic Prompt Library, Prompt Engineering Guide, CoT/ReAct patterns
- Evaluation and monitoring - OpenAI Evals, HELM, Anthropic Evals, LangSmith, W&B, Arize Phoenix
- Production deployment - Model gateways/routers, self-hosted serving, managed endpoints
Usage
For New Projects
- Start with Production Checklists - Validate all pre-deployment requirements
- Use Decision Matrices - Select technology stack
- Reference Project Planning Patterns - Design FTI pipeline
- Implement with Common Design Patterns - Copy-paste code examples
- Avoid Anti-Patterns - Learn from common mistakes
For Troubleshooting
-
Check Anti-Patterns - Identify failure modes and mitigations
-
Use Decision Matrices - Evaluate if architecture fits use case
-
Reference Common Design Patterns - Verify implementation correctness
For Ongoing Operations
- Follow Production Checklists - Weekly operational tasks
- Integrate Evaluation Patterns - Continuous quality monitoring
- Apply LLMOps Best Practices - Deployment and rollback procedures
Navigation Summary
Quick Decisions: Decision Matrices | Pre-Deployment: Production Checklists | Planning: Project Planning Patterns | Implementation: Common Design Patterns | Troubleshooting: Anti-Patterns
Domain Depth: LLMOps | Evaluation | Prompts | Agents | RAG
Templates: assets/ - Copy-paste ready production code
Sources: data/sources.json - Authoritative documentation links
Fact-Checking
- Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
- Prefer primary sources; report source links and dates for volatile information.
- If web access is unavailable, state the limitation and mark guidance as unverified.