Prompt Caching
Cache LLM prompt prefixes to cut the cost of repeated input tokens by up to 90%.
Supported Models (2026)
| Provider | Models |
|---|---|
| Claude | Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3 |
| OpenAI | gpt-5.2, gpt-5.2-mini, o3, o3-mini (automatic caching) |
Claude Prompt Caching
```python
def build_cached_messages(
    system_prompt: str,
    few_shot_examples: str | None,
    user_content: str,
    use_extended_cache: bool = False,
) -> list[dict]:
    """Build messages with cache breakpoints.

    Cache structure (processing order: tools → system → messages):
      1. System prompt (cached)
      2. Few-shot examples (cached)
      ─────── CACHE BREAKPOINT ───────
      3. User content (NOT cached)
    """
    # TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
    ttl = "1h" if use_extended_cache else "5m"

    content_parts = []

    # Breakpoint 1: System prompt
    content_parts.append({
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    })

    # Breakpoint 2: Few-shot examples (up to 4 breakpoints allowed)
    if few_shot_examples:
        content_parts.append({
            "type": "text",
            "text": few_shot_examples,
            "cache_control": {"type": "ephemeral", "ttl": ttl},
        })

    # Dynamic content (NOT cached)
    content_parts.append({
        "type": "text",
        "text": user_content,
    })

    return [{"role": "user", "content": content_parts}]
```
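A minimal usage sketch for the builder above, assuming the `anthropic` Python SDK and an `ANTHROPIC_API_KEY` in the environment; the model ID and prompt contents are placeholders, not prescribed by this doc:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."   # your stable system prompt (large enough to be cacheable)
FEW_SHOT_EXAMPLES = "..."    # stable examples, also cached

messages = build_cached_messages(
    system_prompt=LONG_SYSTEM_PROMPT,
    few_shot_examples=FEW_SHOT_EXAMPLES,
    user_content="Summarize this ticket: ...",  # variable, not cached
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder: any cache-capable model
    max_tokens=1024,
    messages=messages,
)

# Usage fields covered under "Monitoring Cache Effectiveness" below
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```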
Cache Pricing (2026)
| Operation | Cost multiplier (relative to base input price) |
|---|---|
| 5-minute cache write | 1.25x |
| 1-hour cache write | 2.00x |
| Cache read | 0.10x (90% off) |
Example: Claude Sonnet 4 @ $3/MTok input
```
Without Prompt Caching:
  System prompt:       2,000 tokens @ $3/MTok    = $0.006
  Few-shot examples:   5,000 tokens @ $3/MTok    = $0.015
  User content:       10,000 tokens @ $3/MTok    = $0.030
  ──────────────────────────────────────────────────────
  Total:              17,000 tokens              = $0.051

With 5m Caching (first request = cache write):
  Cached prefix:       7,000 tokens @ $3.75/MTok = $0.02625  (1.25x)
  User content:       10,000 tokens @ $3/MTok    = $0.03000
  Total first req:                               = $0.05625

With 5m Caching (subsequent = cache read):
  Cached prefix:       7,000 tokens @ $0.30/MTok = $0.0021   (0.1x)
  User content:       10,000 tokens @ $3/MTok    = $0.0300
  Total cached req:                              = $0.0321
```
Savings: 37% per cached request, break-even after 2 requests
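The same arithmetic in a few lines of Python, with the prices and token counts copied from the example above:

```python
BASE = 3.00                    # $/MTok, Claude Sonnet 4 input price
WRITE_5M, READ = 1.25, 0.10    # cache write / read multipliers

prefix, user = 7_000, 10_000   # tokens

uncached  = (prefix + user) / 1e6 * BASE                        # $0.05100
first_req = prefix / 1e6 * BASE * WRITE_5M + user / 1e6 * BASE  # $0.05625
cached    = prefix / 1e6 * BASE * READ + user / 1e6 * BASE      # $0.03210

print(f"Savings per cached request: {1 - cached / uncached:.0%}")  # ~37%
```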
Extended Cache (1-hour TTL)
Use the 1-hour cache when:

- Prompt reused >10 times per hour
- System prompts are highly stable
- Token count >10k (to maximize savings)
Extended cache: 2x write cost, but the entry persists 12x longer.

```python
"cache_control": {"type": "ephemeral", "ttl": "1h"}
```

Break-even: the 2.0x write cost is recovered after ~2 cache reads (each read costs 0.1x instead of 1.0x of the base input price, saving 0.9x on the cached prefix). The 1h TTL is worth the extra write cost mainly when requests are spaced too far apart for the 5-minute cache to stay warm.
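A small helper that encodes the guidance above; the thresholds are the ones listed in this doc and the function name is illustrative:

```python
def choose_cache_ttl(reads_per_hour: float, prefix_tokens: int) -> str:
    """Return "1h" only when reuse and prefix size justify the 2x write cost."""
    if reads_per_hour > 10 and prefix_tokens > 10_000:
        return "1h"   # heavy reuse of a large, stable prefix
    return "5m"       # default: cheapest write, fine for most traffic
```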
OpenAI Automatic Caching
OpenAI caches prompt prefixes automatically; no `cache_control` markers are needed.

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "system", "content": system_prompt},  # Cached (stable prefix)
        {"role": "user", "content": user_content},     # Not cached
    ],
)

# Check cache usage in the response
cache_tokens = response.usage.prompt_tokens_details.cached_tokens
```
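As a quick sanity check, the fraction of the prompt served from cache can be computed from the same usage object (a sketch reusing `response` from the call above):

```python
details = response.usage.prompt_tokens_details  # may be absent on some models
cached = (details.cached_tokens or 0) if details else 0
total = response.usage.prompt_tokens
print(f"Prefix served from cache: {cached}/{total} tokens ({cached / max(total, 1):.0%})")
```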
Cache Processing Order
Cache entries cover the prompt as a prefix, in this order:
- Tools (cached first)
- System messages (cached second)
- User messages (cached last)
⚠️ Extended thinking changes invalidate message caches (but NOT system/tools caches)
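Because tools sit at the very front of the prefix, they can carry a cache breakpoint too. A sketch with an illustrative tool schema (not from this doc): attaching `cache_control` to the last tool definition caches the whole tools block.

```python
tools = [
    {
        "name": "get_weather",  # illustrative tool definition
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        # Breakpoint on the last tool caches everything up to the end of the tools block
        "cache_control": {"type": "ephemeral"},
    },
]
```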
Monitoring Cache Effectiveness
Track these fields in the API response:

```python
response = await client.messages.create(...)

cache_created = response.usage.cache_creation_input_tokens  # New cache entry written
cache_read = response.usage.cache_read_input_tokens         # Cache hit
regular = response.usage.input_tokens                       # Not cached

# Calculate cache hit rate
if cache_created + cache_read > 0:
    hit_rate = cache_read / (cache_created + cache_read)
    print(f"Cache hit rate: {hit_rate:.1%}")
```
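To track hit rate across many requests rather than a single response, a small accumulator is enough (a sketch; the class and field names are ours, not part of any SDK):

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    created: int = 0   # tokens written to cache
    read: int = 0      # tokens served from cache

    def record(self, usage) -> None:
        self.created += usage.cache_creation_input_tokens or 0
        self.read += usage.cache_read_input_tokens or 0

    @property
    def hit_rate(self) -> float:
        total = self.created + self.read
        return self.read / total if total else 0.0
```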
Best Practices
✅ Good: Long, stable prefix first

```python
messages = [
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
    {"role": "user", "content": FEW_SHOT_EXAMPLES},
    {"role": "user", "content": user_input},  # Variable
]
```

❌ Bad: Variable content early (breaks cache)

```python
messages = [
    {"role": "user", "content": user_input},          # Breaks cache!
    {"role": "system", "content": LONG_SYSTEM_PROMPT},
]
```
Key Decisions
| Decision | Recommendation |
|---|---|
| Min prefix size | 1,024 tokens (Claude) |
| Breakpoint count | 2-4 per request |
| Content order | Stable prefix first |
| Default TTL | 5m for most cases |
| Extended TTL | 1h if >10 reads/hour |
Common Mistakes
- Variable content before the cached prefix
- Too many breakpoints (overhead)
- Prefix too short (min 1,024 tokens)
- Not checking cache_read_input_tokens
- Using the 1h TTL for infrequent calls (wastes the 2x write cost)
Related Skills
- semantic-caching: Redis similarity caching
- cache-cost-tracking: Cost monitoring
- llm-streaming: Streaming with caching
Capability Details
anthropic-caching
Keywords: anthropic, claude, cache_control, ephemeral

Solves:
- Use Anthropic prompt caching
- Set cache breakpoints
- Reduce API costs
openai-caching
Keywords: openai, gpt, cached_tokens, automatic

Solves:
- Use OpenAI prompt caching
- Structure prompts for cache hits
- Monitor cache effectiveness
wrapper-template
Keywords: wrapper, template, implementation, python

Solves:
- Prompt cache wrapper template
- Python implementation
- Drop-in caching layer
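A sketch of what such a drop-in layer could look like, building on `build_cached_messages` from above (class and method names are illustrative, not an existing API):

```python
class CachedPromptClient:
    """Thin wrapper that prepends a stable, cached prefix to every request."""

    def __init__(self, client, system_prompt: str, few_shot: str | None = None):
        self.client = client              # e.g. an anthropic.Anthropic() instance
        self.system_prompt = system_prompt
        self.few_shot = few_shot

    def ask(self, user_content: str, model: str, max_tokens: int = 1024):
        messages = build_cached_messages(
            system_prompt=self.system_prompt,
            few_shot_examples=self.few_shot,
            user_content=user_content,
        )
        return self.client.messages.create(
            model=model, max_tokens=max_tokens, messages=messages
        )
```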