
Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "llm-integration" with this command: npx skills add yonatangross/orchestkit/yonatangross-orchestkit-llm-integration

LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has individual rule files in rules/ loaded on-demand.

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Function Calling | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| Streaming | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| Local Inference | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| Fine-Tuning | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| Context Optimization | 2 | HIGH | Window management, compression, caching, budget scaling |
| Evaluation | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| Prompt Engineering | 4 | HIGH | CoT, few-shot, versioning, DSPy optimization, ReAct, cost optimization |

Total: 20 rules across 7 categories

Quick Start

Function calling: strict mode tool definition

    tools = [{
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search knowledge base",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Max results"},
                },
                "required": ["query", "limit"],
                "additionalProperties": False,
            },
        },
    }]

Streaming: SSE endpoint with FastAPI

    from sse_starlette.sse import EventSourceResponse

    @app.get("/chat/stream")
    async def stream_chat(prompt: str):
        async def generate():
            async for token in async_stream(prompt):
                yield {"event": "token", "data": token}
            yield {"event": "done", "data": ""}
        return EventSourceResponse(generate())

Local inference: Ollama with LangChain

    from langchain_ollama import ChatOllama

    llm = ChatOllama(
        model="deepseek-r1:70b",
        base_url="http://localhost:11434",
        temperature=0.0,
        num_ctx=32768,
    )

Fine-tuning: QLoRA with Unsloth

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B",
        max_seq_length=2048,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)

Function Calling

Enable LLMs to use external tools and return structured data. Use strict-mode schemas (2026 best practice) for reliability, limit each request to 5-15 tools, validate all inputs with Pydantic/Zod, and return errors as tool results rather than raising.

  • calling-tool-definition.md -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding

  • calling-parallel.md -- Parallel tool execution, asyncio.gather, strict mode constraints

  • calling-validation.md -- Input validation, error handling, tool execution loops
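The parallel-execution and errors-as-results patterns above can be sketched as follows. This is a minimal illustration, not the exact schema from the rule files: the local `search_documents` implementation and the call/result dict shapes are assumptions.

```python
import asyncio
import json

# Hypothetical local tool; a real tool would hit an API or database.
async def search_documents(query: str, limit: int) -> list[str]:
    return [f"doc-{i}: {query}" for i in range(limit)]

TOOLS = {"search_documents": search_documents}

async def run_tool_call(call: dict) -> dict:
    """Execute one tool call; return failures as tool results, never raise."""
    try:
        fn = TOOLS[call["name"]]
        result = await fn(**json.loads(call["arguments"]))
        return {"tool_call_id": call["id"], "content": json.dumps(result)}
    except Exception as exc:  # surface the failure to the model as text
        return {"tool_call_id": call["id"], "content": f"error: {exc}"}

async def run_parallel(calls: list[dict]) -> list[dict]:
    # asyncio.gather preserves input order, so results align with calls
    return await asyncio.gather(*(run_tool_call(c) for c in calls))
```

Because errors come back as ordinary tool results, the model can see the failure and retry or recover instead of the whole request aborting.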

Streaming

Deliver LLM responses in real-time for better UX. Use SSE for web, WebSocket for bidirectional. Handle backpressure with bounded queues.

  • streaming-sse.md -- FastAPI SSE endpoints, frontend consumers, async iterators

  • streaming-structured.md -- Streaming with tool calls, partial JSON parsing, chunk accumulation

  • streaming-backpressure.md -- Backpressure handling, bounded buffers, cancellation
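The bounded-queue backpressure pattern can be sketched with `asyncio.Queue`; the producer/consumer names and the `None` end-of-stream sentinel here are illustrative assumptions.

```python
import asyncio

async def produce(tokens, queue: asyncio.Queue) -> None:
    # put() blocks when the queue is full, so a slow consumer
    # naturally throttles the producer (backpressure).
    for tok in tokens:
        await queue.put(tok)
    await queue.put(None)  # sentinel: stream finished

async def consume(queue: asyncio.Queue) -> list[str]:
    out = []
    while (tok := await queue.get()) is not None:
        out.append(tok)
    return out

async def stream_with_backpressure(tokens, maxsize: int = 100) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    _, received = await asyncio.gather(produce(tokens, queue), consume(queue))
    return received
```

The `maxsize` bound is what caps memory use; the 50-200 token range in Key Decisions maps to this parameter.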

Local Inference

Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models and use a provider factory for cloud/local switching.

  • local-ollama-setup.md -- Installation, model pulling, environment configuration

  • local-model-selection.md -- Model comparison by task, hardware profiles, quantization

  • local-gpu-optimization.md -- Apple Silicon tuning, keep-alive, CI integration

Fine-Tuning

Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.

  • tuning-lora.md -- LoRA/QLoRA configuration, Unsloth training, adapter merging

  • tuning-dataset-prep.md -- Synthetic data generation, quality validation, deduplication

  • tuning-evaluation.md -- DPO alignment, evaluation metrics, anti-patterns
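As a small illustration of the dataset-preparation step, a hash-based exact-duplicate filter might look like this (the `prompt`/`response` field names are assumptions about the dataset format; near-duplicate detection would need embedding or MinHash techniques beyond this sketch):

```python
import hashlib

def normalize(text: str) -> str:
    # crude normalization: lowercase and collapse whitespace
    return " ".join(text.lower().split())

def dedupe_examples(examples: list[dict]) -> list[dict]:
    """Drop normalized exact duplicates from a prompt/response dataset."""
    seen: set[str] = set()
    kept = []
    for ex in examples:
        key = hashlib.sha256(
            normalize(ex["prompt"] + ex["response"]).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept
```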

Context Optimization

Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.

  • context-window-management.md -- Five-layer architecture, anchored summarization, compression triggers

  • context-caching.md -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
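The 70%-trigger / 50%-target compression policy from Key Decisions reduces to two small helpers. This is a sketch of the arithmetic only; counting tokens and performing the actual summarization are out of scope here.

```python
def should_compress(used_tokens: int, window: int, trigger: float = 0.70) -> bool:
    """Trigger compression once context utilization crosses the threshold."""
    return used_tokens / window >= trigger

def tokens_to_free(used_tokens: int, window: int, target: float = 0.50) -> int:
    """How many tokens compression must reclaim to reach target utilization."""
    return max(0, used_tokens - int(window * target))
```

For a 32,768-token window, compression fires at roughly 22,938 used tokens and must reclaim enough to get back under 16,384.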

Evaluation

Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.

  • evaluation-metrics.md -- LLM-as-judge, RAGAS metrics, hallucination detection

  • evaluation-benchmarks.md -- Quality gates, batch evaluation, pairwise comparison
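A minimal quality-gate sketch over per-dimension judge scores. The dimension names and the mean-aggregation choice are assumptions; the 0.7 default comes from the Key Decisions table.

```python
from statistics import mean

def passes_quality_gate(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Gate on the mean of per-dimension judge scores, each in [0, 1]."""
    return mean(scores.values()) >= threshold
```

A stricter variant could gate on the minimum score instead of the mean, so a single failing dimension blocks release.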

Prompt Engineering

Design, version, and optimize prompts for production LLM applications.

  • prompt-design.md -- Chain-of-Thought, few-shot learning, pattern selection guide

  • prompt-testing.md -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency

  • prompt-react-pattern.md -- ReAct loop for tool-using agents, thought-action-observation format

  • prompt-optimization.md -- Token reduction, cost optimization, model tiering, prompt spec format
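A few-shot prompt builder illustrating the 3-5 example guidance might look like this; the `Input:`/`Output:` template is an illustrative assumption, not a prescribed format.

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble instruction + worked examples + the new input."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts += [f"Input: {inp}", f"Output: {out}", ""]
    parts.append(f"Input: {query}")
    parts.append("Output:")  # trailing cue for the model to complete
    return "\n".join(parts)
```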

Key Decisions

| Decision | Recommendation |
|---|---|
| Tool schema mode | `strict: true` (2026 best practice) |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | deepseek-r1:70b |
| Local model (coding) | qwen2.5-coder:32b |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |

Related Skills

  • ork:rag-retrieval -- Embedding patterns, when RAG is better than fine-tuning

  • agent-loops -- Multi-step tool use with reasoning

  • llm-evaluation -- Evaluate fine-tuned and local models

  • langfuse-observability -- Track training experiments

Capability Details

function-calling

Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools

Solves:

  • Define tools with clear descriptions and strict schemas

  • Execute tool calls in parallel with asyncio.gather

  • Validate inputs and handle errors in tool execution loops

streaming

Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream

Solves:

  • Stream LLM tokens via SSE endpoints

  • Handle tool calls within streams

  • Manage backpressure with bounded queues

local-inference

Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon

Solves:

  • Set up Ollama for local LLM inference

  • Select models based on task and hardware

  • Optimize GPU usage and CI integration

fine-tuning

Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment

Solves:

  • Configure LoRA/QLoRA for parameter-efficient training

  • Generate and validate synthetic training data

  • Align models with DPO and evaluate results

