llm-caching

Optimize LLM costs and latency through KV caching and prompt caching. Use when (1) structuring prompts for cache hits, (2) configuring API cache_control for Anthropic/Cohere/OpenAI/Gemini, (3) setting up self-hosted inference with vLLM/SGLang/Ollama, (4) building agentic workflows with prefix reuse, (5) designing batch processing pipelines, or (6) understanding cache pricing and tradeoffs.

Install skill "llm-caching" with this command: npx skills add rshvr/llm-caching/rshvr-llm-caching-llm-caching

LLM Caching

Maximize KV cache reuse to reduce costs and latency.

Core Concept

LLMs compute Key (K) and Value (V) vectors for every token during inference; these vectors encode the model's contextual "understanding" of the prompt. Caching them avoids recomputing the same prefix on every request.

Level 1: KV Cache (inference)     - Within one generation, reuse previous tokens' K,V
Level 2: Prompt Cache (API)       - Across requests, persist KV state server-side
Level 3: Prefix Sharing (batch)   - Across users/requests, share common prefixes
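Level 1 can be sketched conceptually (toy code only: strings stand in for real K,V tensors, and fake_kv is a hypothetical stand-in for the per-token projection):

```python
def fake_kv(token: str) -> tuple[str, str]:
    """Stand-in for the per-token K,V projection (illustrative only)."""
    return (f"K({token})", f"V({token})")

def generate_step(kv_cache: list, new_token: str) -> list:
    # Only the newest token's K,V are computed; earlier entries are reused.
    kv_cache.append(fake_kv(new_token))
    return kv_cache

cache = []
for tok in ["The", "cat", "sat"]:
    cache = generate_step(cache, tok)
print(len(cache))  # -> 3: one K,V pair per token, none recomputed
```

Levels 2 and 3 apply the same idea across requests: if a new prompt starts with tokens whose K,V pairs are already stored, those entries are reused instead of recomputed.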

The Golden Rule

Static content first, variable content last.

[System prompt]         <- cacheable, same every request
[Tool definitions]      <- cacheable
[Few-shot examples]     <- cacheable (same order!)
[Reference documents]   <- cacheable if stable
[User message]          <- variable, at the end

Cache hits require the prefix (beginning) to match exactly. Any difference breaks caching for everything after.
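The exact-match requirement can be illustrated with a toy token-level check (hypothetical helper, not a provider API): only the tokens before the first difference can hit the cache.

```python
def cached_prefix_len(prev_tokens: list[str], new_tokens: list[str]) -> int:
    """Return how many leading tokens match; only these can hit the cache."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A timestamp at position 1 breaks caching for everything after it.
prev = ["SYSTEM:", "2024-01-01", "You", "are", "helpful"]
new  = ["SYSTEM:", "2024-01-02", "You", "are", "helpful"]
print(cached_prefix_len(prev, new))  # -> 1
```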

Prompt Structure Template

┌─────────────────────────────────────┐
│  1. System instructions (static)    │  <- cache_control
├─────────────────────────────────────┤
│  2. Tool definitions (static)       │  <- cache_control
├─────────────────────────────────────┤
│  3. Few-shot examples (static)      │  <- cache_control
├─────────────────────────────────────┤
│  4. Documents/context (semi-static) │  <- cache_control if reused
├─────────────────────────────────────┤
│  5. Conversation history (growing)  │  <- cache after N turns
├─────────────────────────────────────┤
│  6. Current user message (variable) │  <- no caching
└─────────────────────────────────────┘
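The template above maps onto Anthropic's prompt-caching API, where cache_control markers are attached to content blocks. A minimal sketch (the model name is illustrative and block shapes should be checked against current Anthropic docs; assumes at least one tool):

```python
def build_request(system_text: str, tools: list[dict],
                  history: list[dict], user_msg: str) -> dict:
    """Assemble a Messages API payload with static parts marked cacheable."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 1024,
        # Static system prompt: written to cache once, reused afterwards.
        "system": [
            {"type": "text", "text": system_text,
             "cache_control": {"type": "ephemeral"}},
        ],
        # cache_control on the last tool caches all tool definitions.
        "tools": tools[:-1] + [{**tools[-1],
                                "cache_control": {"type": "ephemeral"}}],
        # Variable content goes last so the cached prefix stays stable.
        "messages": history + [{"role": "user", "content": user_msg}],
    }
```

Pass the resulting dict to the Messages endpoint; because the system prompt and tools come first and never change, every request after the first reads them from cache.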

Anti-Patterns

Anti-Pattern                 | Why It Breaks Caching
-----------------------------|---------------------------------------------
Variable content early       | Prefix changes every request
Randomizing few-shot order   | Different order = different prefix
Timestamps in system prompt  | Changes every request
User ID in prefix            | Per-user cache = no sharing
Prompts < minimum threshold  | Too small to cache (1024 tokens for Claude)
Shuffling tool definitions   | Tool order is part of prefix
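As a sketch of fixing the timestamp anti-pattern: keep the system prompt byte-identical across requests and inject per-request values into the final (uncached) user message instead. The prompt text and message shape here are illustrative.

```python
import datetime

SYSTEM_PROMPT = "You are a support assistant. Answer from the docs."  # static

def bad_prompt() -> str:
    # Anti-pattern: timestamp in the prefix => new prefix every request.
    return f"[{datetime.datetime.now().isoformat()}] {SYSTEM_PROMPT}"

def good_request(user_msg: str) -> list[dict]:
    # Fix: the variable timestamp rides in the last, uncached message.
    now = datetime.datetime.now().isoformat()
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # stable prefix
        {"role": "user", "content": f"(sent {now}) {user_msg}"},
    ]
```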

Cost Impact

Operation    | Typical Pricing | Notes
-------------|-----------------|-------------------------------
Cache write  | ~1.25x input    | One-time, stores KV state
Cache read   | ~0.1x input     | ~90% savings on cache hit
No caching   | 1x input        | Full recomputation every time

Example: 50k token system prompt, 100 requests

  • Without cache: 50k × 100 × $3/1M = $15.00
  • With cache: 50k × $3.75/1M + 50k × 99 × $0.30/1M = $1.67 (89% savings)
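The arithmetic above can be reproduced with a small calculator (the rates are the illustrative per-million-token prices from the table; check your provider's current pricing):

```python
def caching_cost(prompt_tokens: int, requests: int,
                 input_rate: float = 3.00,   # $ per 1M input tokens
                 write_mult: float = 1.25,   # cache-write premium
                 read_mult: float = 0.10) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) in dollars."""
    per_tok = input_rate / 1_000_000
    without = prompt_tokens * requests * per_tok
    # One cache write, then cache reads for the remaining requests.
    with_cache = (prompt_tokens * write_mult * per_tok
                  + prompt_tokens * (requests - 1) * read_mult * per_tok)
    return without, with_cache

without, cached = caching_cost(50_000, 100)
print(f"${without:.2f} vs ${cached:.2f}")  # -> $15.00 vs $1.67
```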

Provider References

Cookbooks

Practical examples: references/cookbooks.md

Pattern            | Key Insight
-------------------|--------------------------------------------
Web scraping agent | Same tools + system prompt, different URLs
RAG pipeline       | Cache document chunks, vary queries
Multi-turn chat    | Growing prefix, cache conversation history
Batch processing   | Same prompt template, different inputs
Agentic tool use   | Cache tool definitions + examples
Multi-tenant SaaS  | Shared base prompt, tenant-specific suffix
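For the multi-turn chat pattern, the key is an append-only history: each turn extends the cached prefix instead of invalidating it. A minimal sketch with a hypothetical helper class:

```python
class CachedConversation:
    """Append-only history: each turn extends the cacheable prefix."""

    def __init__(self, system: str):
        self.system = system           # never mutated after creation
        self.history: list[dict] = []  # grows; earlier turns stay identical

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        self.history.append({"role": "user", "content": user_msg})
        self.history.append({"role": "assistant", "content": assistant_msg})

    def next_request(self, user_msg: str) -> list[dict]:
        # Prefix = system + all prior turns (cache hit); suffix = new message.
        return ([{"role": "system", "content": self.system}]
                + self.history
                + [{"role": "user", "content": user_msg}])
```

Because earlier messages are never edited or reordered, request N+1 starts with request N's token sequence verbatim, so the whole prior conversation is a cache hit.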
