
Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "llm-integration" with this command: npx skills add yonatangross/orchestkit/yonatangross-orchestkit-llm-integration

LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has individual rule files in rules/ loaded on-demand.

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Function Calling | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| Streaming | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| Local Inference | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| Fine-Tuning | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| Context Optimization | 2 | HIGH | Window management, compression, caching, budget scaling |
| Evaluation | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| Prompt Engineering | 4 | HIGH | CoT, few-shot, versioning, DSPy optimization, ReAct, cost optimization |

Total: 20 rules across 7 categories

Quick Start

Function calling: strict mode tool definition

    tools = [{
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search knowledge base",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Max results"},
                },
                "required": ["query", "limit"],
                "additionalProperties": False,
            },
        },
    }]

Streaming: SSE endpoint with FastAPI

    from sse_starlette.sse import EventSourceResponse

    @app.get("/chat/stream")
    async def stream_chat(prompt: str):
        async def generate():
            async for token in async_stream(prompt):
                yield {"event": "token", "data": token}
            yield {"event": "done", "data": ""}
        return EventSourceResponse(generate())

Local inference: Ollama with LangChain

    from langchain_ollama import ChatOllama

    llm = ChatOllama(
        model="deepseek-r1:70b",
        base_url="http://localhost:11434",
        temperature=0.0,
        num_ctx=32768,
    )

Fine-tuning: QLoRA with Unsloth

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B",
        max_seq_length=2048,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)

Function Calling

Enable LLMs to use external tools and return structured data. Use strict-mode schemas (2026 best practice) for reliability, limit each request to 5-15 tools, validate all inputs with Pydantic/Zod, and return errors as tool results rather than raising.

  • calling-tool-definition.md -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding

  • calling-parallel.md -- Parallel tool execution, asyncio.gather, strict mode constraints

  • calling-validation.md -- Input validation, error handling, tool execution loops
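The parallel-execution and errors-as-results patterns above can be sketched as follows. This is a minimal illustration, not the exact schema from the rule files: the local `search_documents` implementation and the call/result dict shapes are assumptions.

```python
import asyncio
import json

# Hypothetical local tool; a real tool would hit an API or database.
async def search_documents(query: str, limit: int) -> list[str]:
    return [f"doc-{i}: {query}" for i in range(limit)]

TOOLS = {"search_documents": search_documents}

async def run_tool_call(call: dict) -> dict:
    """Execute one tool call; return failures as tool results, never raise."""
    try:
        fn = TOOLS[call["name"]]
        result = await fn(**json.loads(call["arguments"]))
        return {"tool_call_id": call["id"], "content": json.dumps(result)}
    except Exception as exc:  # surface the failure to the model as text
        return {"tool_call_id": call["id"], "content": f"error: {exc}"}

async def run_parallel(calls: list[dict]) -> list[dict]:
    # asyncio.gather preserves input order, so results align with calls
    return await asyncio.gather(*(run_tool_call(c) for c in calls))
```

Because errors come back as ordinary tool results, the model can see the failure and retry or recover instead of the whole request aborting.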

Streaming

Deliver LLM responses in real-time for better UX. Use SSE for web, WebSocket for bidirectional. Handle backpressure with bounded queues.

  • streaming-sse.md -- FastAPI SSE endpoints, frontend consumers, async iterators

  • streaming-structured.md -- Streaming with tool calls, partial JSON parsing, chunk accumulation

  • streaming-backpressure.md -- Backpressure handling, bounded buffers, cancellation
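The bounded-queue backpressure pattern can be sketched with `asyncio.Queue`; the producer/consumer names and the `None` end-of-stream sentinel here are illustrative assumptions.

```python
import asyncio

async def produce(tokens, queue: asyncio.Queue) -> None:
    # put() blocks when the queue is full, so a slow consumer
    # naturally throttles the producer (backpressure).
    for tok in tokens:
        await queue.put(tok)
    await queue.put(None)  # sentinel: stream finished

async def consume(queue: asyncio.Queue) -> list[str]:
    out = []
    while (tok := await queue.get()) is not None:
        out.append(tok)
    return out

async def stream_with_backpressure(tokens, maxsize: int = 100) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    _, received = await asyncio.gather(produce(tokens, queue), consume(queue))
    return received
```

The `maxsize` bound is what caps memory use; the 50-200 token range in Key Decisions maps to this parameter.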

Local Inference

Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models and use a provider factory for cloud/local switching.

  • local-ollama-setup.md -- Installation, model pulling, environment configuration

  • local-model-selection.md -- Model comparison by task, hardware profiles, quantization

  • local-gpu-optimization.md -- Apple Silicon tuning, keep-alive, CI integration

Fine-Tuning

Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.

  • tuning-lora.md -- LoRA/QLoRA configuration, Unsloth training, adapter merging

  • tuning-dataset-prep.md -- Synthetic data generation, quality validation, deduplication

  • tuning-evaluation.md -- DPO alignment, evaluation metrics, anti-patterns
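As a small illustration of the dataset-preparation step, a hash-based exact-duplicate filter might look like this (the `prompt`/`response` field names are assumptions about the dataset format; near-duplicate detection would need embedding or MinHash techniques beyond this sketch):

```python
import hashlib

def normalize(text: str) -> str:
    # crude normalization: lowercase and collapse whitespace
    return " ".join(text.lower().split())

def dedupe_examples(examples: list[dict]) -> list[dict]:
    """Drop normalized exact duplicates from a prompt/response dataset."""
    seen: set[str] = set()
    kept = []
    for ex in examples:
        key = hashlib.sha256(
            normalize(ex["prompt"] + ex["response"]).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept
```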

Context Optimization

Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.

  • context-window-management.md -- Five-layer architecture, anchored summarization, compression triggers

  • context-caching.md -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
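The 70%-trigger / 50%-target compression policy from Key Decisions reduces to two small helpers. This is a sketch of the arithmetic only; counting tokens and performing the actual summarization are out of scope here.

```python
def should_compress(used_tokens: int, window: int, trigger: float = 0.70) -> bool:
    """Trigger compression once context utilization crosses the threshold."""
    return used_tokens / window >= trigger

def tokens_to_free(used_tokens: int, window: int, target: float = 0.50) -> int:
    """How many tokens compression must reclaim to reach target utilization."""
    return max(0, used_tokens - int(window * target))
```

For a 32,768-token window, compression fires at roughly 22,938 used tokens and must reclaim enough to get back under 16,384.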

Evaluation

Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.

  • evaluation-metrics.md -- LLM-as-judge, RAGAS metrics, hallucination detection

  • evaluation-benchmarks.md -- Quality gates, batch evaluation, pairwise comparison
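A minimal quality-gate sketch over per-dimension judge scores. The dimension names and the mean-aggregation choice are assumptions; the 0.7 default comes from the Key Decisions table.

```python
from statistics import mean

def passes_quality_gate(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Gate on the mean of per-dimension judge scores, each in [0, 1]."""
    return mean(scores.values()) >= threshold
```

A stricter variant could gate on the minimum score instead of the mean, so a single failing dimension blocks release.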

Prompt Engineering

Design, version, and optimize prompts for production LLM applications.

  • prompt-design.md -- Chain-of-Thought, few-shot learning, pattern selection guide

  • prompt-testing.md -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency

  • prompt-react-pattern.md -- ReAct loop for tool-using agents, thought-action-observation format

  • prompt-optimization.md -- Token reduction, cost optimization, model tiering, prompt spec format
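A few-shot prompt builder illustrating the 3-5 example guidance might look like this; the `Input:`/`Output:` template is an illustrative assumption, not a prescribed format.

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble instruction + worked examples + the new input."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts += [f"Input: {inp}", f"Output: {out}", ""]
    parts.append(f"Input: {query}")
    parts.append("Output:")  # trailing cue for the model to complete
    return "\n".join(parts)
```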

Key Decisions

| Decision | Recommendation |
|---|---|
| Tool schema mode | `strict: true` (2026 best practice) |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | deepseek-r1:70b |
| Local model (coding) | qwen2.5-coder:32b |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |

Related Skills

  • ork:rag-retrieval -- Embedding patterns, when RAG is better than fine-tuning

  • agent-loops -- Multi-step tool use with reasoning

  • llm-evaluation -- Evaluate fine-tuned and local models

  • langfuse-observability -- Track training experiments

Capability Details

function-calling

Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools

Solves:

  • Define tools with clear descriptions and strict schemas

  • Execute tool calls in parallel with asyncio.gather

  • Validate inputs and handle errors in tool execution loops

streaming

Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream

Solves:

  • Stream LLM tokens via SSE endpoints

  • Handle tool calls within streams

  • Manage backpressure with bounded queues

local-inference

Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon

Solves:

  • Set up Ollama for local LLM inference

  • Select models based on task and hardware

  • Optimize GPU usage and CI integration

fine-tuning

Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment

Solves:

  • Configure LoRA/QLoRA for parameter-efficient training

  • Generate and validate synthetic training data

  • Align models with DPO and evaluate results

