
Prompt Caching Patterns

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "prompt-caching-patterns" with this command: npx skills add latestaiagents/agent-skills/latestaiagents-agent-skills-prompt-caching-patterns


Implement effective caching strategies to reduce LLM costs by up to 90%.

When to Use

  • Same or similar prompts are sent repeatedly

  • Large system prompts are reused across requests

  • Responses can be reused for identical queries

  • Need to reduce latency for common requests

  • Optimizing costs for high-volume applications

Caching Strategies

  1. Provider-Level Caching (Anthropic)

Anthropic offers built-in prompt caching: cached input tokens cost 90% less to read than uncached input tokens.

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Large system context that will be reused
const systemContext = `[Your long system prompt, documentation, examples, etc.]
This can be many thousands of tokens that you want to cache.`;

async function queryWithCache(userQuestion: string) {
  const response = await client.messages.create({
    // Prompt caching requires a supported model (e.g. Claude 3.5 Sonnet,
    // Claude 3 Opus, Claude 3 Haiku); the original Claude 3 Sonnet is not one.
    model: 'claude-3-5-sonnet-20240620',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: systemContext,
        cache_control: { type: 'ephemeral' } // Cache for 5 minutes
      }
    ],
    messages: [
      { role: 'user', content: userQuestion }
    ]
  });

  // Check cache usage
  console.log('Cache read tokens:', response.usage.cache_read_input_tokens);
  console.log('Cache creation tokens:', response.usage.cache_creation_input_tokens);

  return response;
}
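For illustration, two back-to-back calls show the write-then-read pattern: the first call pays the cache-write premium, and any call within the 5-minute window reads the cached prefix at the discounted rate.

async function demo() {
  // First call: cache_creation_input_tokens > 0 (the prefix is written).
  await queryWithCache('Summarize the documentation.');
  // Second call within 5 minutes: cache_read_input_tokens > 0 (the prefix is read).
  await queryWithCache('List the key API endpoints.');
}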

Pricing with cache:

  • Cache write: 25% more than base input price

  • Cache read: 90% less than base input price

  • Break-even: ~2 requests with same cached content
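These numbers are easy to sanity-check. A minimal sketch, assuming a hypothetical base input price of $3 per million tokens (substitute your model's actual rate):

// Hypothetical base input price in USD per million tokens.
const basePer1M = 3.0;
const writePer1M = basePer1M * 1.25; // cache write: +25%
const readPer1M = basePer1M * 0.10;  // cache read: -90%

// Cost of n requests sharing one cached prefix vs. no caching at all.
const withCaching = (n: number) => writePer1M + (n - 1) * readPer1M;
const withoutCaching = (n: number) => n * basePer1M;

console.log(withCaching(2), withoutCaching(2)); // 4.05 vs 6.00: ahead by request 2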

  2. Response Caching

Cache LLM responses for identical or similar queries.

import crypto from 'node:crypto';

interface CacheEntry {
  response: string;
  createdAt: number;
  ttlMs: number;
  metadata: {
    model: string;
    inputTokens: number;
    outputTokens: number;
  };
}

class ResponseCache {
  private cache = new Map<string, CacheEntry>();

  private hashPrompt(prompt: string): string {
    // Simple hash for exact matching
    return crypto.createHash('sha256').update(prompt).digest('hex');
  }

  get(prompt: string): string | null {
    const key = this.hashPrompt(prompt);
    const entry = this.cache.get(key);

    if (!entry) return null;

    // Check TTL
    if (Date.now() - entry.createdAt > entry.ttlMs) {
      this.cache.delete(key);
      return null;
    }

    return entry.response;
  }

  set(prompt: string, response: string, options: { ttlMs?: number; metadata?: any } = {}): void {
    const key = this.hashPrompt(prompt);
    this.cache.set(key, {
      response,
      createdAt: Date.now(),
      ttlMs: options.ttlMs || 3600000, // 1 hour default
      metadata: options.metadata
    });
  }
}

// Usage
const cache = new ResponseCache();

async function cachedQuery(prompt: string): Promise<string> {
  // Check cache first
  const cached = cache.get(prompt);
  if (cached) {
    console.log('Cache hit!');
    return cached;
  }

  // Make API call
  const response = await llm.complete(prompt);

  // Cache the response
  cache.set(prompt, response, { ttlMs: 3600000 });

  return response;
}
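One caveat with exact-match hashing: prompts that differ only in whitespace or casing miss the cache. A minimal normalization sketch (the normalizePrompt helper is illustrative, not part of the original, and lowercasing may be too aggressive for case-sensitive content):

// Canonicalize prompts before hashing so that
// "What is X?" and "  what is x? " share one cache entry.
function normalizePrompt(prompt: string): string {
  return prompt.trim().replace(/\s+/g, ' ').toLowerCase();
}

async function cachedQueryNormalized(userInput: string): Promise<string> {
  return cachedQuery(normalizePrompt(userInput));
}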

  3. Semantic Caching

Cache based on meaning, not exact match.

import { OpenAIEmbeddings } from 'langchain/embeddings/openai';

class SemanticCache {
  private entries: { embedding: number[]; response: string; prompt: string }[] = [];
  private embeddings: OpenAIEmbeddings;
  private similarityThreshold = 0.95;

  constructor() {
    this.embeddings = new OpenAIEmbeddings();
  }

  async get(prompt: string): Promise<string | null> {
    const queryEmbedding = await this.embeddings.embedQuery(prompt);

    // Find most similar cached prompt
    let bestMatch: { similarity: number; response: string } | null = null;

    for (const entry of this.entries) {
      const similarity = this.cosineSimilarity(queryEmbedding, entry.embedding);

      if (similarity > this.similarityThreshold) {
        if (!bestMatch || similarity > bestMatch.similarity) {
          bestMatch = { similarity, response: entry.response };
        }
      }
    }

    return bestMatch?.response || null;
  }

  async set(prompt: string, response: string): Promise<void> {
    const embedding = await this.embeddings.embedQuery(prompt);
    this.entries.push({ embedding, response, prompt });
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    let dotProduct = 0;
    let normA = 0;
    let normB = 0;

    for (let i = 0; i < a.length; i++) {
      dotProduct += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }

    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}

// Usage
const semanticCache = new SemanticCache();

// These would hit the cache:
// "What is the capital of France?" -> cached
// "What's France's capital city?"  -> semantic match!
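Since every lookup pays for an embedding call, embedding costs can erode the savings. One mitigation, sketched below, is to layer a free exact-match cache in front (reusing the ResponseCache and SemanticCache instances defined above):

// Check the cheap in-memory cache first; only fall back to the
// embedding-based lookup (one paid API call) on an exact miss.
async function layeredGet(prompt: string): Promise<string | null> {
  const exact = cache.get(prompt);
  if (exact) return exact;
  return semanticCache.get(prompt);
}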

  4. Template Caching

Cache static parts, vary dynamic parts.

import crypto from 'node:crypto';

interface PromptTemplate {
  staticPart: string;
  dynamicParts: string[];
}

class TemplateCache {
  private templates = new Map<string, {
    staticPartHash: string;
    responses: Map<string, string>; // dynamicHash -> response
  }>();

  private hash(input: string): string {
    return crypto.createHash('sha256').update(input).digest('hex');
  }

  generateKey(template: PromptTemplate, values: Record<string, string>): {
    templateKey: string;
    valuesKey: string;
  } {
    const templateKey = this.hash(template.staticPart);
    const valuesKey = this.hash(JSON.stringify(values));
    return { templateKey, valuesKey };
  }

  get(template: PromptTemplate, values: Record<string, string>): string | null {
    const { templateKey, valuesKey } = this.generateKey(template, values);
    return this.templates.get(templateKey)?.responses.get(valuesKey) || null;
  }

  set(template: PromptTemplate, values: Record<string, string>, response: string): void {
    const { templateKey, valuesKey } = this.generateKey(template, values);

    if (!this.templates.has(templateKey)) {
      this.templates.set(templateKey, {
        staticPartHash: templateKey,
        responses: new Map()
      });
    }

    this.templates.get(templateKey)!.responses.set(valuesKey, response);
  }
}

// Usage
const templateCache = new TemplateCache();

const template: PromptTemplate = {
  staticPart: `You are a helpful assistant that translates text.
Translate the following to the target language. Be accurate and natural.`,
  dynamicParts: ['text', 'targetLanguage']
};

// Cache hit for same text + language combo
const cached = templateCache.get(template, {
  text: 'Hello world',
  targetLanguage: 'Spanish'
});
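The lookup above only reads; here is a minimal read-through sketch that also populates the cache on a miss (llm.complete is the same placeholder client used elsewhere on this page):

async function translate(text: string, targetLanguage: string): Promise<string> {
  const values = { text, targetLanguage };
  const hit = templateCache.get(template, values);
  if (hit) return hit;

  // Assemble the full prompt from the static template plus dynamic values.
  const response = await llm.complete(
    `${template.staticPart}\n\nText: ${text}\nTarget language: ${targetLanguage}`
  );
  templateCache.set(template, values, response);
  return response;
}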

Redis-Based Distributed Cache

import crypto from 'node:crypto';
import Redis from 'ioredis';

class DistributedPromptCache {
  private redis: Redis;
  private prefix = 'llm:cache:';

  constructor(redisUrl: string) {
    this.redis = new Redis(redisUrl);
  }

  private key(prompt: string): string {
    const hash = crypto.createHash('sha256').update(prompt).digest('hex');
    return `${this.prefix}${hash}`;
  }

  async get(prompt: string): Promise<string | null> {
    const cached = await this.redis.get(this.key(prompt));
    if (cached) {
      await this.redis.hincrby(`${this.prefix}stats`, 'hits', 1);
    } else {
      await this.redis.hincrby(`${this.prefix}stats`, 'misses', 1);
    }
    return cached;
  }

  async set(prompt: string, response: string, ttlSeconds: number = 3600): Promise<void> {
    await this.redis.setex(this.key(prompt), ttlSeconds, response);
  }

  async getStats(): Promise<{ hits: number; misses: number; hitRate: number }> {
    const stats = await this.redis.hgetall(`${this.prefix}stats`);
    const hits = parseInt(stats.hits || '0');
    const misses = parseInt(stats.misses || '0');
    const total = hits + misses;

    return {
      hits,
      misses,
      hitRate: total > 0 ? hits / total : 0
    };
  }
}
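A minimal wiring sketch: because entries live in Redis, every instance of the application shares one cache (the Redis URL and llm.complete are placeholders):

const distCache = new DistributedPromptCache('redis://localhost:6379');

async function sharedCachedQuery(prompt: string): Promise<string> {
  const hit = await distCache.get(prompt); // also records hit/miss stats
  if (hit) return hit;

  const response = await llm.complete(prompt);
  await distCache.set(prompt, response, 3600); // visible to all instances
  return response;
}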

Cache Invalidation

interface CachePolicy {
  ttlMs: number;
  invalidateOn: string[]; // Events that invalidate cache
  tags: string[]; // For tag-based invalidation
}

class SmartCache {
  private cache = new Map<string, { value: string; policy: CachePolicy; createdAt: number }>();
  private tagIndex = new Map<string, Set<string>>(); // tag -> keys

  set(key: string, value: string, policy: CachePolicy): void {
    this.cache.set(key, { value, policy, createdAt: Date.now() });

    // Index by tags
    for (const tag of policy.tags) {
      if (!this.tagIndex.has(tag)) {
        this.tagIndex.set(tag, new Set());
      }
      this.tagIndex.get(tag)!.add(key);
    }
  }

  invalidateByTag(tag: string): number {
    const keys = this.tagIndex.get(tag) || new Set();
    let count = 0;

    for (const key of keys) {
      if (this.cache.delete(key)) count++;
    }

    this.tagIndex.delete(tag);
    return count;
  }

  invalidateByEvent(event: string): number {
    let count = 0;

    for (const [key, entry] of this.cache) {
      if (entry.policy.invalidateOn.includes(event)) {
        this.cache.delete(key);
        count++;
      }
    }

    return count;
  }
}

// Usage
const smartCache = new SmartCache();

smartCache.set('user:123:summary', response, {
  ttlMs: 3600000,
  invalidateOn: ['user:123:updated', 'user:123:deleted'],
  tags: ['user:123', 'summaries']
});

// When user updates their profile
smartCache.invalidateByEvent('user:123:updated');

// Or invalidate all summaries
smartCache.invalidateByTag('summaries');
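In a real service, invalidation events usually arrive over some event bus. A minimal sketch using Node's built-in EventEmitter, reusing the event name from the example above:

import { EventEmitter } from 'node:events';

const events = new EventEmitter();

// Bridge application events into cache invalidation.
events.on('cache:invalidate', (event: string) => {
  const removed = smartCache.invalidateByEvent(event);
  console.log(`Invalidated ${removed} entries for ${event}`);
});

// Emitted wherever the profile-update handler lives.
events.emit('cache:invalidate', 'user:123:updated');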

Cost Savings Calculator

function calculateCacheSavings(
  stats: { hits: number; misses: number },
  avgInputTokens: number,
  avgOutputTokens: number,
  pricing: { inputPer1M: number; outputPer1M: number }
): {
  withoutCache: number;
  withCache: number;
  savings: number;
  savingsPercent: number;
} {
  const totalRequests = stats.hits + stats.misses;

  // Without cache: all requests hit API
  const withoutCache = totalRequests * (
    (avgInputTokens / 1_000_000) * pricing.inputPer1M +
    (avgOutputTokens / 1_000_000) * pricing.outputPer1M
  );

  // With cache: only misses hit API
  const withCache = stats.misses * (
    (avgInputTokens / 1_000_000) * pricing.inputPer1M +
    (avgOutputTokens / 1_000_000) * pricing.outputPer1M
  );

  return {
    withoutCache,
    withCache,
    savings: withoutCache - withCache,
    // Guard against division by zero when there are no requests yet
    savingsPercent: withoutCache > 0 ? ((withoutCache - withCache) / withoutCache) * 100 : 0
  };
}
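For example, with a 70% hit rate over 10,000 requests and hypothetical pricing (substitute your model's real rates):

const savings = calculateCacheSavings(
  { hits: 7000, misses: 3000 },
  2000, // avg input tokens per request
  500,  // avg output tokens per request
  { inputPer1M: 10, outputPer1M: 30 } // hypothetical rates
);

// Uncached cost: 10,000 * $0.035 = $350; only the 3,000 misses ($105) hit
// the API, so savings are $245 (70%): the hit rate equals the savings rate.
console.log(savings);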

Best Practices

  • Cache at the right level - Response, prompt part, or embedding

  • Set appropriate TTLs - Balance freshness vs. savings

  • Monitor hit rates - Low hit rate means cache isn't helping

  • Invalidate intelligently - Don't serve stale data

  • Use semantic caching carefully - Embedding costs add up

  • Warm the cache - Pre-populate for known queries (see the sketch after this list)

  • Consider cache size - Memory isn't free either
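On warming: a minimal sketch that pre-populates the response cache at startup, assuming the cachedQuery helper from earlier and a hand-picked list of known-common queries:

const commonQueries = [
  'What is your refund policy?',
  'How do I reset my password?'
];

async function warmCache(): Promise<void> {
  // Run at startup/deploy so the first real users hit a warm cache.
  await Promise.all(commonQueries.map((q) => cachedQuery(q)));
}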

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • graphrag-patterns

  • agentic-rag

  • production-rag-checklist

  • rag-evaluation