prompt-caching

Cache LLM prompt prefixes for 90% token savings.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "prompt-caching" with this command: npx skills add yonatangross/orchestkit/yonatangross-orchestkit-prompt-caching

Prompt Caching

Cache LLM prompt prefixes for 90% token savings.

Supported Models (2026)

Provider Models

Claude Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Sonnet 3.7, Haiku 4.5, Haiku 3.5, Haiku 3

OpenAI gpt-5.2, gpt-5.2-mini, o3, o3-mini (automatic caching)

Claude Prompt Caching

def build_cached_messages( system_prompt: str, few_shot_examples: str | None, user_content: str, use_extended_cache: bool = False ) -> list[dict]: """Build messages with cache breakpoints.

Cache structure (processing order: tools → system → messages):
1. System prompt (cached)
2. Few-shot examples (cached)
─────── CACHE BREAKPOINT ───────
3. User content (NOT cached)
"""
# TTL: "5m" (default, 1.25x write cost) or "1h" (extended, 2x write cost)
ttl = "1h" if use_extended_cache else "5m"

content_parts = []

# Breakpoint 1: System prompt
content_parts.append({
    "type": "text",
    "text": system_prompt,
    "cache_control": {"type": "ephemeral", "ttl": ttl}
})

# Breakpoint 2: Few-shot examples (up to 4 breakpoints allowed)
if few_shot_examples:
    content_parts.append({
        "type": "text",
        "text": few_shot_examples,
        "cache_control": {"type": "ephemeral", "ttl": ttl}
    })

# Dynamic content (NOT cached)
content_parts.append({
    "type": "text",
    "text": user_content
})

return [{"role": "user", "content": content_parts}]

Cache Pricing (2026)

┌─────────────────────────────────────────────────────────────┐ │ Cache Cost Multipliers (relative to base input price) │ ├─────────────────────────────────────────────────────────────┤ │ 5-minute cache write: 1.25x base input price │ │ 1-hour cache write: 2.00x base input price │ │ Cache read: 0.10x base input price (90% off!) │ └─────────────────────────────────────────────────────────────┘

Example: Claude Sonnet 4 @ $3/MTok input

Without Prompt Caching: System prompt: 2,000 tokens @ $3/MTok = $0.006 Few-shot examples: 5,000 tokens @ $3/MTok = $0.015 User content: 10,000 tokens @ $3/MTok = $0.030 ─────────────────────────────────────────────────── Total: 17,000 tokens = $0.051

With 5m Caching (first request = cache write): Cached prefix: 7,000 tokens @ $3.75/MTok = $0.02625 (1.25x) User content: 10,000 tokens @ $3/MTok = $0.03000 Total first req: = $0.05625

With 5m Caching (subsequent = cache read): Cached prefix: 7,000 tokens @ $0.30/MTok = $0.0021 (0.1x) User content: 10,000 tokens @ $3/MTok = $0.0300 Total cached req: = $0.0321

Savings: 37% per cached request, break-even after 2 requests

Extended Cache (1-hour TTL)

Use 1-hour cache when:

  • Prompt reused > 10 times per hour

  • System prompts are highly stable

  • Token count > 10k (maximize savings)

Extended cache: 2x write cost but persists 12x longer

"cache_control": {"type": "ephemeral", "ttl": "1h"}

Break-even: 1h cache pays off after ~8 reads

(2x write cost ÷ 0.9 savings per read ≈ 8 reads)

OpenAI Automatic Caching

OpenAI caches prefixes automatically

No cache_control markers needed

response = await openai.chat.completions.create( model="gpt-5.2", messages=[ {"role": "system", "content": system_prompt}, # Cached {"role": "user", "content": user_content} # Not cached ] )

Check cache usage in response

cache_tokens = response.usage.prompt_tokens_cached

Cache Processing Order

Cache references entire prompt in order:

  1. Tools (cached first)
  2. System messages (cached second)
  3. User messages (cached last)

⚠️ Extended thinking changes invalidate message caches (but NOT system/tools caches)

Monitoring Cache Effectiveness

Track these fields in API response

response = await client.messages.create(...)

cache_created = response.usage.cache_creation_input_tokens # New cache cache_read = response.usage.cache_read_input_tokens # Cache hit regular = response.usage.input_tokens # Not cached

Calculate cache hit rate

if cache_created + cache_read > 0: hit_rate = cache_read / (cache_created + cache_read) print(f"Cache hit rate: {hit_rate:.1%}")

Best Practices

✅ Good: Long, stable prefix first

messages = [ {"role": "system", "content": LONG_SYSTEM_PROMPT}, {"role": "user", "content": FEW_SHOT_EXAMPLES}, {"role": "user", "content": user_input} # Variable ]

❌ Bad: Variable content early (breaks cache)

messages = [ {"role": "user", "content": user_input}, # Breaks cache! {"role": "system", "content": LONG_SYSTEM_PROMPT} ]

Key Decisions

Decision Recommendation

Min prefix size 1,024 tokens (Claude)

Breakpoint count 2-4 per request

Content order Stable prefix first

Default TTL 5m for most cases

Extended TTL 1h if >10 reads/hour

Common Mistakes

  • Variable content before cached prefix

  • Too many breakpoints (overhead)

  • Prefix too short (min 1024 tokens)

  • Not checking cache_read_input_tokens

  • Using 1h TTL for infrequent calls (wastes 2x write)

Related Skills

  • semantic-caching

  • Redis similarity caching

  • cache-cost-tracking

  • Cost monitoring

  • llm-streaming

  • Streaming with caching

Capability Details

anthropic-caching

Keywords: anthropic, claude, cache_control, ephemeral Solves:

  • Use Anthropic prompt caching

  • Set cache breakpoints

  • Reduce API costs

openai-caching

Keywords: openai, gpt, cached_tokens, automatic Solves:

  • Use OpenAI prompt caching

  • Structure prompts for cache hits

  • Monitor cache effectiveness

wrapper-template

Keywords: wrapper, template, implementation, python Solves:

  • Prompt cache wrapper template

  • Python implementation

  • Drop-in caching layer

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

responsive-patterns

No summary provided by upstream source.

Repository SourceNeeds Review
General

domain-driven-design

No summary provided by upstream source.

Repository SourceNeeds Review
General

dashboard-patterns

No summary provided by upstream source.

Repository SourceNeeds Review
General

rag-retrieval

No summary provided by upstream source.

Repository SourceNeeds Review