Prompt Caching
Cache LLM prompt prefixes to cut the cost of repeated input tokens by up to 90%.
Supported Models (2026)
| Provider | Models |
|---|---|
| Claude | Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3 |
| OpenAI | gpt-5.2, gpt-5.2-mini, o3, o3-mini (automatic caching) |
Claude Prompt Caching
```python
def build_cached_messages(
    system_prompt: str,
    few_shot_examples: str | None,
    user_content: str,
    use_extended_cache: bool = False,
) -> list[dict]:
    """Build messages with cache breakpoints.

    Cache structure (processing order: tools → system → messages):
      1. System prompt (cached)
      2. Few-shot examples (cached)
      ─────── CACHE BREAKPOINT ───────
      3. User content (NOT cached)
    """
    # TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
    ttl = "1h" if use_extended_cache else "5m"

    content_parts = []

    # Breakpoint 1: System prompt
    content_parts.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    })

    # Breakpoint 2: Few-shot examples (up to 4 breakpoints allowed)
    if few_shot_examples:
        content_parts.append({
            "type": "text",
            "text": few_shot_examples,
            "cache_control": {"type": "ephemeral", "ttl": ttl},
        })

    # Dynamic content (NOT cached)
    content_parts.append({
        "type": "text",
        "text": user_content,
    })

    return [{"role": "user", "content": content_parts}]
```
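A minimal usage sketch for the builder above, assuming the `anthropic` Python SDK and an `ANTHROPIC_API_KEY` in the environment; the model ID and prompt contents are placeholders, not prescribed by this doc:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."   # your stable system prompt (large enough to be cacheable)
FEW_SHOT_EXAMPLES = "..."    # stable examples, also cached

messages = build_cached_messages(
    system_prompt=LONG_SYSTEM_PROMPT,
    few_shot_examples=FEW_SHOT_EXAMPLES,
    user_content="Summarize this ticket: ...",  # variable, not cached
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder: any cache-capable model
    max_tokens=1024,
    messages=messages,
)

# Usage fields covered under "Monitoring Cache Effectiveness" below
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```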
Cache Pricing (2026)
| Operation | Cost multiplier (relative to base input price) |
|---|---|
| 5-minute cache write | 1.25x |
| 1-hour cache write | 2.00x |
| Cache read | 0.10x (90% off) |
Example: Claude Sonnet 4 @ $3/MTok input
```
Without Prompt Caching:
  System prompt:       2,000 tokens @ $3/MTok    = $0.006
  Few-shot examples:   5,000 tokens @ $3/MTok    = $0.015
  User content:       10,000 tokens @ $3/MTok    = $0.030
  ──────────────────────────────────────────────────────
  Total:              17,000 tokens              = $0.051

With 5m Caching (first request = cache write):
  Cached prefix:       7,000 tokens @ $3.75/MTok = $0.02625  (1.25x)
  User content:       10,000 tokens @ $3/MTok    = $0.03000
  Total first req:                               = $0.05625

With 5m Caching (subsequent = cache read):
  Cached prefix:       7,000 tokens @ $0.30/MTok = $0.0021   (0.1x)
  User content:       10,000 tokens @ $3/MTok    = $0.0300
  Total cached req:                              = $0.0321
```
Savings: 37% per cached request, break-even after 2 requests
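The same arithmetic in a few lines of Python, with the prices and token counts copied from the example above:

```python
BASE = 3.00                    # $/MTok, Claude Sonnet 4 input price
WRITE_5M, READ = 1.25, 0.10    # cache write / read multipliers

prefix, user = 7_000, 10_000   # tokens

uncached  = (prefix + user) / 1e6 * BASE                        # $0.05100
first_req = prefix / 1e6 * BASE * WRITE_5M + user / 1e6 * BASE  # $0.05625
cached    = prefix / 1e6 * BASE * READ + user / 1e6 * BASE      # $0.03210

print(f"Savings per cached request: {1 - cached / uncached:.0%}")  # ~37%
```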
Extended Cache (1-hour TTL)
Use the 1-hour cache when:

- Prompt reused >10 times per hour
- System prompts are highly stable
- Token count >10k (to maximize savings)
Extended cache: 2x write cost, but the entry persists 12x longer.

```python
"cache_control": {"type": "ephemeral", "ttl": "1h"}
```

Break-even: the 2.0x write cost is recovered after ~2 cache reads (each read costs 0.1x instead of 1.0x of the base input price, saving 0.9x on the cached prefix). The 1h TTL is worth the extra write cost mainly when requests are spaced too far apart for the 5-minute cache to stay warm.
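A small helper that encodes the guidance above; the thresholds are the ones listed in this doc and the function name is illustrative:

```python
def choose_cache_ttl(reads_per_hour: float, prefix_tokens: int) -> str:
    """Return "1h" only when reuse and prefix size justify the 2x write cost."""
    if reads_per_hour > 10 and prefix_tokens > 10_000:
        return "1h"   # heavy reuse of a large, stable prefix
    return "5m"       # default: cheapest write, fine for most traffic
```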
OpenAI Automatic Caching
OpenAI caches prompt prefixes automatically; no `cache_control` markers are needed.

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "system", "content": system_prompt},  # Cached (stable prefix)
        {"role": "user", "content": user_content},     # Not cached
    ],
)

# Check cache usage in the response
cache_tokens = response.usage.prompt_tokens_details.cached_tokens
```
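As a quick sanity check, the fraction of the prompt served from cache can be computed from the same usage object (a sketch reusing `response` from the call above):

```python
details = response.usage.prompt_tokens_details  # may be absent on some models
cached = (details.cached_tokens or 0) if details else 0
total = response.usage.prompt_tokens
print(f"Prefix served from cache: {cached}/{total} tokens ({cached / max(total, 1):.0%})")
```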
Cache Processing Order
Cache entries cover the prompt as a prefix, in this order:
- Tools (cached first)
- System messages (cached second)
- User messages (cached last)
⚠️ Extended thinking changes invalidate message caches (but NOT system/tools caches)
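Because tools sit at the very front of the prefix, they can carry a cache breakpoint too. A sketch with an illustrative tool schema (not from this doc): attaching `cache_control` to the last tool definition caches the whole tools block.

```python
tools = [
    {
        "name": "get_weather",  # illustrative tool definition
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        # Breakpoint on the last tool caches everything up to the end of the tools block
        "cache_control": {"type": "ephemeral"},
    },
]
```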
Monitoring Cache Effectiveness
Track these fields in the API response:

```python
response = await client.messages.create(...)

cache_created = response.usage.cache_creation_input_tokens  # New cache entry written
cache_read = response.usage.cache_read_input_tokens         # Cache hit
regular = response.usage.input_tokens                       # Not cached

# Calculate cache hit rate
if cache_created + cache_read > 0:
    hit_rate = cache_read / (cache_created + cache_read)
    print(f"Cache hit rate: {hit_rate:.1%}")
```
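To track hit rate across many requests rather than a single response, a small accumulator is enough (a sketch; the class and field names are ours, not part of any SDK):

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    created: int = 0   # tokens written to cache
    read: int = 0      # tokens served from cache

    def record(self, usage) -> None:
        self.created += usage.cache_creation_input_tokens or 0
        self.read += usage.cache_read_input_tokens or 0

    @property
    def hit_rate(self) -> float:
        total = self.created + self.read
        return self.read / total if total else 0.0
```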
Best Practices
✅ Good: Long, stable prefix first

```python
messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
    {"role": "user", "content": FEW_SHOT_EXAMPLES},
    {"role": "user", "content": user_input},  # Variable
]
```

❌ Bad: Variable content early (breaks cache)

```python
messages = [
    {"role": "user", "content": user_input},          # Breaks cache!
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
]
```
Key Decisions
| Decision | Recommendation |
|---|---|
| Min prefix size | 1,024 tokens (Claude) |
| Breakpoint count | 2-4 per request |
| Content order | Stable prefix first |
| Default TTL | 5m for most cases |
| Extended TTL | 1h if >10 reads/hour |
Common Mistakes
- Variable content before the cached prefix
- Too many breakpoints (overhead)
- Prefix too short (min 1,024 tokens)
- Not checking cache_read_input_tokens
- Using the 1h TTL for infrequent calls (wastes the 2x write cost)
Related Skills
- semantic-caching: Redis similarity caching
- cache-cost-tracking: Cost monitoring
- llm-streaming: Streaming with caching
Capability Details
anthropic-caching
Keywords: anthropic, claude, cache_control, ephemeral

Solves:
- Use Anthropic prompt caching
- Set cache breakpoints
- Reduce API costs
openai-caching
Keywords: openai, gpt, cached_tokens, automatic

Solves:
- Use OpenAI prompt caching
- Structure prompts for cache hits
- Monitor cache effectiveness
wrapper-template
Keywords: wrapper, template, implementation, python

Solves:
- Prompt cache wrapper template
- Python implementation
- Drop-in caching layer
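A sketch of what such a drop-in layer could look like, building on `build_cached_messages` from above (class and method names are illustrative, not an existing API):

```python
class CachedPromptClient:
    """Thin wrapper that prepends a stable, cached prefix to every request."""

    def __init__(self, client, system_prompt: str, few_shot: str | None = None):
        self.client = client              # e.g. an anthropic.Anthropic() instance
        self.system_prompt = system_prompt
        self.few_shot = few_shot

    def ask(self, user_content: str, model: str, max_tokens: int = 1024):
        messages = build_cached_messages(
            system_prompt=self.system_prompt,
            few_shot_examples=self.few_shot,
            user_content=user_content,
        )
        return self.client.messages.create(
            model=model, max_tokens=max_tokens, messages=messages
        )
```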