# Prompt Caching Patterns

Implement effective caching strategies to reduce LLM costs by up to 90%.
## When to Use

- Same or similar prompts are sent repeatedly
- Large system prompts are reused across requests
- Responses can be reused for identical queries
- Need to reduce latency for common requests
- Optimizing costs for high-volume applications
## Caching Strategies
### Provider-Level Caching (Anthropic)

Anthropic offers built-in prompt caching that cuts the cost of cached input tokens by up to 90%.
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Large system context that will be reused
const systemContext = `[Your long system prompt, documentation, examples, etc.] This can be many thousands of tokens that you want to cache.`;

async function queryWithCache(userQuestion: string) {
  const response = await client.messages.create({
    model: 'claude-3-sonnet-20240229',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: systemContext,
        cache_control: { type: 'ephemeral' } // Cache for ~5 minutes
      }
    ],
    messages: [
      { role: 'user', content: userQuestion }
    ]
  });

  // Check cache usage
  console.log('Cache read tokens:', response.usage.cache_read_input_tokens);
  console.log('Cache creation tokens:', response.usage.cache_creation_input_tokens);

  return response;
}
```
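Calling it twice in quick succession shows the effect: the first call pays the cache-write surcharge, the second reads the cached prefix (the prompts here are illustrative).

```typescript
// First call writes the system prompt to the cache
// (usage.cache_creation_input_tokens > 0).
await queryWithCache('Summarize the key points of the documentation.');

// A second call within the cache window reuses it
// (usage.cache_read_input_tokens > 0).
await queryWithCache('List the main API endpoints covered above.');
```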
Pricing with cache:

- Cache write: 25% more than base input price
- Cache read: 90% less than base input price
- Break-even: ~2 requests with same cached content (see the sketch below)
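The break-even point follows directly from those two numbers. A rough sketch, assuming an illustrative base input price of $3 per million tokens:

```typescript
// Illustrative prices: base input $3.00 / 1M tokens,
// cache write = 1.25x base, cache read = 0.10x base.
const basePer1M = 3.0;
const cacheWritePer1M = basePer1M * 1.25; // 25% surcharge on the caching request
const cacheReadPer1M = basePer1M * 0.1;   // 90% discount on subsequent reads

function costWithCache(cachedTokens: number, requests: number): number {
  const millions = cachedTokens / 1_000_000;
  // First request writes the cache; the rest read it.
  return millions * cacheWritePer1M + (requests - 1) * millions * cacheReadPer1M;
}

function costWithoutCache(cachedTokens: number, requests: number): number {
  return (cachedTokens / 1_000_000) * basePer1M * requests;
}

// With 10,000 cached tokens:
// 1 request:  $0.0375 with cache vs $0.03 without (caching costs more)
// 2 requests: $0.0405 with cache vs $0.06 without (already cheaper)
console.log(costWithCache(10_000, 2), costWithoutCache(10_000, 2));
```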
### Response Caching

Cache LLM responses for identical or similar queries.
```typescript
import crypto from 'crypto';

interface CacheEntry {
  response: string;
  createdAt: number;
  ttlMs: number;
  metadata?: {
    model: string;
    inputTokens: number;
    outputTokens: number;
  };
}

class ResponseCache {
  private cache = new Map<string, CacheEntry>();

  private hashPrompt(prompt: string): string {
    // Simple hash for exact matching
    return crypto.createHash('sha256').update(prompt).digest('hex');
  }

  get(prompt: string): string | null {
    const key = this.hashPrompt(prompt);
    const entry = this.cache.get(key);

    if (!entry) return null;

    // Check TTL
    if (Date.now() - entry.createdAt > entry.ttlMs) {
      this.cache.delete(key);
      return null;
    }

    return entry.response;
  }

  set(prompt: string, response: string, options: { ttlMs?: number; metadata?: CacheEntry['metadata'] } = {}): void {
    const key = this.hashPrompt(prompt);
    this.cache.set(key, {
      response,
      createdAt: Date.now(),
      ttlMs: options.ttlMs || 3600000, // 1 hour default
      metadata: options.metadata
    });
  }
}

// Usage
const cache = new ResponseCache();

async function cachedQuery(prompt: string): Promise<string> {
  // Check cache first
  const cached = cache.get(prompt);
  if (cached) {
    console.log('Cache hit!');
    return cached;
  }

  // Make API call (llm.complete is a stand-in for your model client)
  const response = await llm.complete(prompt);

  // Cache the response
  cache.set(prompt, response, { ttlMs: 3600000 });

  return response;
}
```
### Semantic Caching

Cache based on meaning, not exact match.
```typescript
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';

class SemanticCache {
  private entries: { embedding: number[]; response: string; prompt: string }[] = [];
  private embeddings: OpenAIEmbeddings;
  private similarityThreshold = 0.95;

  constructor() {
    this.embeddings = new OpenAIEmbeddings();
  }

  async get(prompt: string): Promise<string | null> {
    const queryEmbedding = await this.embeddings.embedQuery(prompt);

    // Find most similar cached prompt
    let bestMatch: { similarity: number; response: string } | null = null;

    for (const entry of this.entries) {
      const similarity = this.cosineSimilarity(queryEmbedding, entry.embedding);
      if (similarity > this.similarityThreshold) {
        if (!bestMatch || similarity > bestMatch.similarity) {
          bestMatch = { similarity, response: entry.response };
        }
      }
    }

    return bestMatch?.response || null;
  }

  async set(prompt: string, response: string): Promise<void> {
    const embedding = await this.embeddings.embedQuery(prompt);
    this.entries.push({ embedding, response, prompt });
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    let dotProduct = 0;
    let normA = 0;
    let normB = 0;

    for (let i = 0; i < a.length; i++) {
      dotProduct += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }

    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}

// Usage
const semanticCache = new SemanticCache();

// These would hit the cache:
// "What is the capital of France?"  -> cached
// "What's France's capital city?"   -> semantic match!
```
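Wired into a query path, the semantic cache works the same way as the exact-match cache above. A minimal sketch, where `llm.complete` is the same stand-in client used earlier:

```typescript
async function semanticCachedQuery(prompt: string): Promise<string> {
  // Reuse a cached answer if a semantically close prompt was seen before
  const cached = await semanticCache.get(prompt);
  if (cached) {
    console.log('Semantic cache hit!');
    return cached;
  }

  // Otherwise call the model and remember the answer for similar future prompts
  const response = await llm.complete(prompt); // stand-in client, as above
  await semanticCache.set(prompt, response);
  return response;
}
```

Note that every lookup costs an embedding call, which is why the threshold and the decision to use semantic caching at all are worth tuning.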
### Template Caching

Cache static parts, vary dynamic parts.
```typescript
import crypto from 'crypto';

interface PromptTemplate {
  staticPart: string;
  dynamicParts: string[];
}

class TemplateCache {
  private templates = new Map<string, {
    staticPartHash: string;
    responses: Map<string, string>; // dynamicHash -> response
  }>();

  private hash(value: string): string {
    return crypto.createHash('sha256').update(value).digest('hex');
  }

  generateKey(template: PromptTemplate, values: Record<string, string>): {
    templateKey: string;
    valuesKey: string;
  } {
    const templateKey = this.hash(template.staticPart);
    const valuesKey = this.hash(JSON.stringify(values));
    return { templateKey, valuesKey };
  }

  get(template: PromptTemplate, values: Record<string, string>): string | null {
    const { templateKey, valuesKey } = this.generateKey(template, values);
    return this.templates.get(templateKey)?.responses.get(valuesKey) || null;
  }

  set(template: PromptTemplate, values: Record<string, string>, response: string): void {
    const { templateKey, valuesKey } = this.generateKey(template, values);

    if (!this.templates.has(templateKey)) {
      this.templates.set(templateKey, {
        staticPartHash: templateKey,
        responses: new Map()
      });
    }

    this.templates.get(templateKey)!.responses.set(valuesKey, response);
  }
}

// Usage
const templateCache = new TemplateCache();

const template: PromptTemplate = {
  staticPart: `You are a helpful assistant that translates text. Translate the following to the target language. Be accurate and natural.`,
  dynamicParts: ['text', 'targetLanguage']
};

// Cache hit for same text + language combo
const cached = templateCache.get(template, {
  text: 'Hello world',
  targetLanguage: 'Spanish'
});
```
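On a miss, the static and dynamic parts still have to be combined into a full prompt before calling the model. A sketch of that flow, reusing `template` and `templateCache` from above; `renderPrompt` is a hypothetical helper and `llm.complete` the same stand-in client as before:

```typescript
// Hypothetical helper: append the dynamic values to the static template text
function renderPrompt(template: PromptTemplate, values: Record<string, string>): string {
  const dynamicSection = template.dynamicParts
    .map((part) => `${part}: ${values[part] ?? ''}`)
    .join('\n');
  return `${template.staticPart}\n\n${dynamicSection}`;
}

async function translateWithCache(text: string, targetLanguage: string): Promise<string> {
  const values = { text, targetLanguage };

  // Same static part + same values -> cached response
  const cachedResponse = templateCache.get(template, values);
  if (cachedResponse) return cachedResponse;

  const response = await llm.complete(renderPrompt(template, values)); // stand-in client
  templateCache.set(template, values, response);
  return response;
}
```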
## Redis-Based Distributed Cache
```typescript
import crypto from 'crypto';
import Redis from 'ioredis';

class DistributedPromptCache {
  private redis: Redis;
  private prefix = 'llm:cache:';

  constructor(redisUrl: string) {
    this.redis = new Redis(redisUrl);
  }

  private key(prompt: string): string {
    const hash = crypto.createHash('sha256').update(prompt).digest('hex');
    return `${this.prefix}${hash}`;
  }

  async get(prompt: string): Promise<string | null> {
    const cached = await this.redis.get(this.key(prompt));

    if (cached) {
      await this.redis.hincrby(`${this.prefix}stats`, 'hits', 1);
    } else {
      await this.redis.hincrby(`${this.prefix}stats`, 'misses', 1);
    }

    return cached;
  }

  async set(prompt: string, response: string, ttlSeconds: number = 3600): Promise<void> {
    await this.redis.setex(this.key(prompt), ttlSeconds, response);
  }

  async getStats(): Promise<{ hits: number; misses: number; hitRate: number }> {
    const stats = await this.redis.hgetall(`${this.prefix}stats`);
    const hits = parseInt(stats.hits || '0', 10);
    const misses = parseInt(stats.misses || '0', 10);
    const total = hits + misses;

    return {
      hits,
      misses,
      hitRate: total > 0 ? hits / total : 0
    };
  }
}
```
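Usage mirrors the in-memory response cache, but the entries and hit/miss stats are shared across processes. A sketch, assuming a local Redis instance and the same stand-in `llm` client:

```typescript
const distributedCache = new DistributedPromptCache('redis://localhost:6379');

async function distributedCachedQuery(prompt: string): Promise<string> {
  const cached = await distributedCache.get(prompt);
  if (cached) return cached;

  const response = await llm.complete(prompt); // stand-in client, as above
  await distributedCache.set(prompt, response, 3600); // 1 hour TTL
  return response;
}

// Periodically report how well the cache is performing
async function reportCacheStats(): Promise<void> {
  const { hits, misses, hitRate } = await distributedCache.getStats();
  console.log(`hits=${hits} misses=${misses} hitRate=${(hitRate * 100).toFixed(1)}%`);
}
```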
## Cache Invalidation
```typescript
interface CachePolicy {
  ttlMs: number;
  invalidateOn: string[]; // Events that invalidate the cache
  tags: string[];         // For tag-based invalidation
}

class SmartCache {
  private cache = new Map<string, { value: string; policy: CachePolicy; createdAt: number }>();
  private tagIndex = new Map<string, Set<string>>(); // tag -> keys

  set(key: string, value: string, policy: CachePolicy): void {
    this.cache.set(key, { value, policy, createdAt: Date.now() });

    // Index by tags
    for (const tag of policy.tags) {
      if (!this.tagIndex.has(tag)) {
        this.tagIndex.set(tag, new Set());
      }
      this.tagIndex.get(tag)!.add(key);
    }
  }

  invalidateByTag(tag: string): number {
    const keys = this.tagIndex.get(tag) || new Set<string>();
    let count = 0;

    for (const key of keys) {
      if (this.cache.delete(key)) count++;
    }

    this.tagIndex.delete(tag);
    return count;
  }

  invalidateByEvent(event: string): number {
    let count = 0;

    for (const [key, entry] of this.cache) {
      if (entry.policy.invalidateOn.includes(event)) {
        this.cache.delete(key);
        count++;
      }
    }

    return count;
  }
}

// Usage (response is a previously generated LLM response for user 123)
const cache = new SmartCache();

cache.set('user:123:summary', response, {
  ttlMs: 3600000,
  invalidateOn: ['user:123:updated', 'user:123:deleted'],
  tags: ['user:123', 'summaries']
});

// When the user updates their profile
cache.invalidateByEvent('user:123:updated');

// Or invalidate all summaries
cache.invalidateByTag('summaries');
```
## Cost Savings Calculator
```typescript
function calculateCacheSavings(
  stats: { hits: number; misses: number },
  avgInputTokens: number,
  avgOutputTokens: number,
  pricing: { inputPer1M: number; outputPer1M: number }
): {
  withoutCache: number;
  withCache: number;
  savings: number;
  savingsPercent: number;
} {
  const totalRequests = stats.hits + stats.misses;

  // Without cache: all requests hit the API
  const withoutCache = totalRequests * (
    (avgInputTokens / 1_000_000) * pricing.inputPer1M +
    (avgOutputTokens / 1_000_000) * pricing.outputPer1M
  );

  // With cache: only misses hit the API
  const withCache = stats.misses * (
    (avgInputTokens / 1_000_000) * pricing.inputPer1M +
    (avgOutputTokens / 1_000_000) * pricing.outputPer1M
  );

  return {
    withoutCache,
    withCache,
    savings: withoutCache - withCache,
    savingsPercent: withoutCache > 0 ? ((withoutCache - withCache) / withoutCache) * 100 : 0
  };
}
```
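For example, a 70% hit rate over 10,000 requests with illustrative $3 / $15 per-million-token prices works out like this:

```typescript
const savings = calculateCacheSavings(
  { hits: 7_000, misses: 3_000 },    // 70% hit rate over 10,000 requests
  2_000,                             // average input tokens per request
  500,                               // average output tokens per request
  { inputPer1M: 3, outputPer1M: 15 } // assumed prices, per million tokens
);

// Per request: (2,000 / 1M) * $3 + (500 / 1M) * $15 = $0.0135
// withoutCache: 10,000 * $0.0135 = $135
// withCache:     3,000 * $0.0135 = $40.50
// savings: $94.50, savingsPercent: 70 (it tracks the hit rate)
console.log(savings);
```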
## Best Practices

- **Cache at the right level** - Response, prompt part, or embedding
- **Set appropriate TTLs** - Balance freshness vs. savings
- **Monitor hit rates** - Low hit rate means the cache isn't helping
- **Invalidate intelligently** - Don't serve stale data
- **Use semantic caching carefully** - Embedding costs add up
- **Warm the cache** - Pre-populate for known queries (see the sketch below)
- **Consider cache size** - Memory isn't free either
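For the cache-warming point above, a minimal sketch: loop over known high-traffic prompts at startup and let the `cachedQuery` wrapper from the response-caching section populate the cache (the prompt list is illustrative and would normally come from query logs):

```typescript
// Pre-populate the response cache with answers to known high-traffic prompts
const commonPrompts = [
  'What is your refund policy?',
  'How do I reset my password?',
  'What are your support hours?'
];

async function warmCache(): Promise<void> {
  for (const prompt of commonPrompts) {
    // cachedQuery checks the cache first, so re-running warmCache is idempotent
    await cachedQuery(prompt);
  }
  console.log(`Warmed cache with ${commonPrompts.length} prompts`);
}
```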