How to Cache AI Responses Without Breaking Your App
Friday 27/02/2026
Your AI feature works great — until you get the bill. Half those API calls are near-duplicates: users asking the same question slightly differently, the same document getting summarized over and over, identical support tickets hitting your chatbot daily. You're paying full price every time. Caching AI API responses is the single fastest way to slash costs and improve latency, but get it wrong and you'll serve stale garbage to your users.
The tricky part isn't the caching itself — it's deciding what counts as "the same request" when every prompt is a slightly different string. I've shipped three different caching strategies in production, each with different trade-offs. Here's exactly how to implement them, when to use each, and the gotchas that'll bite you.
Why standard HTTP caching doesn't work for AI
You can't just slap a Cache-Control header on an LLM response. AI API calls have a few properties that break traditional caching:
- Prompts are long and variable — a system prompt + user message + conversation history might be 4,000 tokens. Even a one-character difference in the input produces a completely different cache key if you're doing naive string matching.
- Responses aren't deterministic — the same prompt can produce different outputs (unless you set temperature: 0, and even then it's not guaranteed).
- Context windows change — in a chatbot, every message changes the full prompt, so every turn is a "new" request.
You need caching strategies designed specifically for LLM workloads. Let's build three of them, from simplest to most sophisticated.
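To see just how brittle naive string matching is, fingerprint two prompts that differ by a single trailing space. This standalone sketch uses the same SHA-256 scheme the exact-match cache below relies on:

```typescript
import { createHash } from 'crypto'

// Hash a prompt into a short cache key, as an exact-match cache would.
function fingerprint(prompt: string): string {
  return createHash('sha256').update(prompt).digest('hex').slice(0, 16)
}

const a = fingerprint('Summarize this document.')
const b = fingerprint('Summarize this document. ') // one trailing space
console.log(a === b) // false: near-identical prompts miss the cache entirely
```

One flipped bit anywhere in the input produces a completely unrelated hash, which is exactly what you want for correctness and exactly what kills your hit rate.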
Strategy 1: Exact match caching with prompt fingerprinting
The simplest approach: hash the full prompt and use it as a cache key. If the exact same prompt comes in again, return the cached response.
// src/lib/ai-cache.ts
import { createHash } from 'crypto'
import { Redis } from 'ioredis'
interface CachedResponse {
content: string
model: string
cachedAt: number
tokensSaved: { input: number; output: number }
}
interface CacheOptions {
ttlSeconds?: number
namespace?: string
}
export class ExactMatchCache {
private redis: Redis
private defaultTTL: number
constructor(redisUrl: string, defaultTTL = 3600) {
this.redis = new Redis(redisUrl)
this.defaultTTL = defaultTTL
}
private buildKey(prompt: string, model: string, namespace?: string): string {
const hash = createHash('sha256')
.update(`${model}:${prompt}`)
.digest('hex')
.slice(0, 16)
const prefix = namespace ? `ai:${namespace}` : 'ai'
return `${prefix}:${hash}`
}
async get(
prompt: string,
model: string,
namespace?: string
): Promise<CachedResponse | null> {
const key = this.buildKey(prompt, model, namespace)
const cached = await this.redis.get(key)
if (!cached) return null
return JSON.parse(cached) as CachedResponse
}
async set(
prompt: string,
model: string,
response: string,
tokens: { input: number; output: number },
options: CacheOptions = {}
): Promise<void> {
const key = this.buildKey(prompt, model, options.namespace)
const entry: CachedResponse = {
content: response,
model,
cachedAt: Date.now(),
tokensSaved: tokens,
}
const ttl = options.ttlSeconds ?? this.defaultTTL
await this.redis.setex(key, ttl, JSON.stringify(entry))
}
}
Now wrap your AI calls with the cache:
// src/lib/cached-ai-client.ts
import Anthropic from '@anthropic-ai/sdk'
import { ExactMatchCache } from './ai-cache'
const anthropic = new Anthropic()
const cache = new ExactMatchCache(process.env.REDIS_URL!)
export async function cachedComplete(
systemPrompt: string,
userMessage: string,
model: string = 'claude-sonnet-4-20250514'
): Promise<{ content: string; fromCache: boolean }> {
const fullPrompt = `${systemPrompt}\n---\n${userMessage}`
// Check cache first
const cached = await cache.get(fullPrompt, model)
if (cached) {
return { content: cached.content, fromCache: true }
}
// Cache miss — call the API
const response = await anthropic.messages.create({
model,
max_tokens: 1024,
system: systemPrompt,
messages: [{ role: 'user', content: userMessage }],
})
const content =
response.content[0].type === 'text' ? response.content[0].text : ''
await cache.set(fullPrompt, model, content, {
input: response.usage.input_tokens,
output: response.usage.output_tokens,
})
return { content, fromCache: false }
}
When to use this: Summarization, data extraction, classification — anything where the same input should always produce the same output. I use this for a document summarizer that processes the same PDFs repeatedly during testing and staging.
Gotcha: This has a 0% hit rate for chatbots, because the conversation history changes with every message. You need something smarter for that.
Strategy 2: Normalized prompt caching
Most cache misses don't happen because the question is different; they happen because the prompt merely looks different. Extra whitespace, different casing, trailing punctuation. Normalizing prompts before hashing dramatically improves hit rates.
// src/lib/prompt-normalizer.ts
export function normalizePrompt(prompt: string): string {
return prompt
.toLowerCase()
.replace(/\s+/g, ' ')
.replace(/[^\w\s]/g, '')
.trim()
}
export function normalizeForChatbot(
systemPrompt: string,
userMessage: string
): string {
// For chatbots, only cache based on the current user message
// + system prompt, NOT the full conversation history.
// This is a trade-off: you lose conversational context
// but gain cache hits across different users.
const normalizedSystem = normalizePrompt(systemPrompt)
const normalizedUser = normalizePrompt(userMessage)
return `${normalizedSystem}|${normalizedUser}`
}
This simple normalization got me from a 2% hit rate to 18% on a customer support chatbot. But there's a trade-off with the chatbot approach: by caching only the system prompt + latest user message, you lose conversational context. If a user says "tell me more about that," you'll get a cached response from a different conversation where "that" meant something else.
Rule of thumb: Only use message-level caching for the first message in a conversation, or for stateless Q&A interfaces where there's no conversation history.
Strategy 3: Semantic similarity caching
This is the powerful one. Instead of requiring an exact match, you check if a similar enough question has been asked before. "How do I reset my password?" and "Where can I change my password?" should return the same cached answer.
You'll need embeddings and a vector store. Here's the implementation with Supabase pgvector:
// src/lib/semantic-cache.ts
import OpenAI from 'openai'
import { createClient, SupabaseClient } from '@supabase/supabase-js'
interface SemanticCacheEntry {
id: string
embedding: number[]
prompt: string
response: string
model: string
created_at: string
}
export class SemanticCache {
private openai: OpenAI
private supabase: SupabaseClient
private similarityThreshold: number
constructor(similarityThreshold = 0.92) {
this.openai = new OpenAI()
this.supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_SERVICE_KEY!
)
this.similarityThreshold = similarityThreshold
}
private async getEmbedding(text: string): Promise<number[]> {
const response = await this.openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
})
return response.data[0].embedding
}
async get(prompt: string): Promise<{ content: string; similarity: number } | null> {
const embedding = await this.getEmbedding(prompt)
const { data, error } = await this.supabase.rpc('match_ai_cache', {
query_embedding: embedding,
match_threshold: this.similarityThreshold,
match_count: 1,
})
if (error || !data || data.length === 0) return null
return {
content: data[0].response,
similarity: data[0].similarity,
}
}
async set(prompt: string, response: string, model: string): Promise<void> {
const embedding = await this.getEmbedding(prompt)
await this.supabase.from('ai_cache').insert({
prompt,
response,
model,
embedding,
})
}
}
You'll need this SQL function in your Supabase database:
-- Run this in the Supabase SQL editor
create extension if not exists vector;
create table ai_cache (
id uuid primary key default gen_random_uuid(),
prompt text not null,
response text not null,
model text not null,
embedding vector(1536),
created_at timestamptz default now()
);
create index on ai_cache using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
create or replace function match_ai_cache(
query_embedding vector(1536),
match_threshold float,
match_count int
)
returns table (
id uuid,
prompt text,
response text,
similarity float
)
language sql stable
as $$
select
id,
prompt,
response,
1 - (embedding <=> query_embedding) as similarity
from ai_cache
where 1 - (embedding <=> query_embedding) > match_threshold
order by embedding <=> query_embedding
limit match_count;
$$;
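The `<=>` operator in pgvector is cosine distance, so `1 - (embedding <=> query_embedding)` is cosine similarity. Here's a minimal TypeScript sketch of the number the database is computing (OpenAI's embeddings come back normalized to length 1, so in practice the denominator is ~1):

```typescript
// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

console.log(cosineSimilarity([1, 0], [1, 0])) // 1: identical direction
console.log(cosineSimilarity([1, 0], [0, 1])) // 0: unrelated
```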
The critical number: the similarity threshold. Set it too low (0.85) and you'll serve wrong answers. Set it too high (0.98) and you'll barely get any cache hits. I've found these ranges work well in practice:
- 0.95+ for factual Q&A where accuracy matters most
- 0.92 for customer support where similar questions have similar answers
- 0.88 for creative/casual use cases where close-enough is fine
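One way to keep these magic numbers out of your call sites is a small lookup table with a strict default. The use-case names here are illustrative, not from any library:

```typescript
// Illustrative mapping from use case to similarity threshold.
const similarityThresholds: Record<string, number> = {
  'factual-qa': 0.95, // accuracy matters most
  'support': 0.92,    // similar questions have similar answers
  'casual': 0.88,     // close enough is fine
}

// Fall back to the strictest threshold for unknown use cases:
// a missed cache hit is cheaper than a wrong answer.
function thresholdFor(useCase: string): number {
  return similarityThresholds[useCase] ?? 0.95
}
```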
Cost consideration: You're making an embedding API call on every request to check the cache. At $0.02 per million tokens for text-embedding-3-small, this is negligible compared to the Claude/GPT-4 calls you're saving. A typical user question is ~50 tokens, so 1 million cache lookups costs about $1 — basically free next to the LLM calls those lookups replace.
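The back-of-envelope math, as a function you can tweak for your own traffic (the pricing is hardcoded as an assumption; check current rates):

```typescript
// Estimated embedding cost in dollars for N cache lookups.
// Assumes text-embedding-3-small pricing of $0.02 per million tokens.
function embeddingLookupCost(
  lookups: number,
  avgTokensPerPrompt: number,
  pricePerMillionTokens = 0.02
): number {
  const totalTokens = lookups * avgTokensPerPrompt
  return (totalTokens / 1_000_000) * pricePerMillionTokens
}

console.log(embeddingLookupCost(1_000_000, 50)) // ≈ 1: about $1 for a million lookups
```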
Putting it all together: a tiered cache
In production, I use all three strategies in a waterfall:
// src/lib/tiered-ai-cache.ts
import Anthropic from '@anthropic-ai/sdk'
import { ExactMatchCache } from './ai-cache'
import { normalizePrompt } from './prompt-normalizer'
import { SemanticCache } from './semantic-cache'
interface TieredCacheResult {
content: string
cacheLevel: 'exact' | 'normalized' | 'semantic' | 'none'
similarity?: number
}
export class TieredAICache {
private exactCache: ExactMatchCache
private semanticCache: SemanticCache
private anthropic: Anthropic
constructor(redisUrl: string) {
this.exactCache = new ExactMatchCache(redisUrl)
this.semanticCache = new SemanticCache(0.92)
this.anthropic = new Anthropic()
}
async query(
systemPrompt: string,
userMessage: string,
model: string = 'claude-sonnet-4-20250514'
): Promise<TieredCacheResult> {
const fullPrompt = `${systemPrompt}\n---\n${userMessage}`
// Level 1: Exact match (fastest, cheapest)
const exact = await this.exactCache.get(fullPrompt, model)
if (exact) {
return { content: exact.content, cacheLevel: 'exact' }
}
// Level 2: Normalized match
const normalized = normalizePrompt(fullPrompt)
const normalizedHit = await this.exactCache.get(normalized, model, 'norm')
if (normalizedHit) {
return { content: normalizedHit.content, cacheLevel: 'normalized' }
}
// Level 3: Semantic similarity (slower, needs embedding call)
const semantic = await this.semanticCache.get(userMessage)
if (semantic) {
return {
content: semantic.content,
cacheLevel: 'semantic',
similarity: semantic.similarity,
}
}
// Cache miss — call the API
const response = await this.anthropic.messages.create({
model,
max_tokens: 1024,
system: systemPrompt,
messages: [{ role: 'user', content: userMessage }],
})
const content =
response.content[0].type === 'text'
? response.content[0].text
: ''
// Store in all cache layers
await Promise.all([
this.exactCache.set(fullPrompt, model, content, {
input: response.usage.input_tokens,
output: response.usage.output_tokens,
}),
this.exactCache.set(
normalized,
model,
content,
{
input: response.usage.input_tokens,
output: response.usage.output_tokens,
},
{ namespace: 'norm' }
),
this.semanticCache.set(userMessage, content, model),
])
return { content, cacheLevel: 'none' }
}
}
Cache invalidation: the hard part
Caching LLM responses has a unique invalidation problem: the "correct" answer can change even though the question hasn't. Your product docs get updated, your pricing changes, a new feature launches. Suddenly your cached responses are confidently wrong.
Here's what works:
TTL-based expiration is your first line of defense. Set TTLs based on how often the underlying data changes:
// src/lib/cache-ttl-config.ts
export const cacheTTLs: Record<string, number> = {
// Product FAQ: docs change weekly
'support-faq': 24 * 60 * 60, // 1 day
// Code explanation: stable unless codebase changes
'code-explain': 7 * 24 * 60 * 60, // 1 week
// News/current events: stale quickly
'news-summary': 60 * 60, // 1 hour
// Data extraction from static docs: very stable
'pdf-extract': 30 * 24 * 60 * 60, // 30 days
}
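A small accessor keeps the fallback behavior in one place. The one-hour default for unknown namespaces is an assumption here; pick whatever matches how volatile your data is:

```typescript
// TTL lookup with a conservative one-hour default for unknown namespaces.
const cacheTTLs: Record<string, number> = {
  'support-faq': 24 * 60 * 60,      // 1 day
  'code-explain': 7 * 24 * 60 * 60, // 1 week
  'news-summary': 60 * 60,          // 1 hour
  'pdf-extract': 30 * 24 * 60 * 60, // 30 days
}

function ttlFor(namespace: string): number {
  return cacheTTLs[namespace] ?? 60 * 60
}
```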
Version-tagged cache keys for when your system prompt or RAG context changes:
// src/lib/versioned-cache.ts
import { createHash } from 'crypto'
import fs from 'fs'
export function versionedCacheKey(
prompt: string,
model: string,
contextVersion: string
): string {
const hash = createHash('sha256')
.update(`${model}:${contextVersion}:${prompt}`)
.digest('hex')
.slice(0, 16)
return `ai:v:${hash}`
}
// When your docs update, bump the version
const CONTEXT_VERSION = createHash('md5')
.update(fs.readFileSync('docs/faq.md', 'utf-8'))
.digest('hex')
.slice(0, 8)
This way, when your FAQ document changes, the hash changes, and old cache entries naturally stop being hit. No manual invalidation needed.
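To convince yourself that bumping the version really rotates every key, compare two keys for the same prompt. The function mirrors `versionedCacheKey` above so the snippet runs standalone; the version strings are made-up examples:

```typescript
import { createHash } from 'crypto'

// Same scheme as versionedCacheKey above: model + context version + prompt.
function versionedCacheKey(
  prompt: string,
  model: string,
  contextVersion: string
): string {
  const hash = createHash('sha256')
    .update(`${model}:${contextVersion}:${prompt}`)
    .digest('hex')
    .slice(0, 16)
  return `ai:v:${hash}`
}

const before = versionedCacheKey('what is your refund policy?', 'claude-sonnet-4-20250514', 'a1b2c3d4')
const after = versionedCacheKey('what is your refund policy?', 'claude-sonnet-4-20250514', 'e5f6a7b8')
console.log(before === after) // false: the doc update silently invalidated the old entry
```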
Measuring your cache performance
You can't optimize what you don't measure. Track these metrics:
// src/lib/cache-metrics.ts
interface CacheMetrics {
hits: { exact: number; normalized: number; semantic: number }
misses: number
totalTokensSaved: { input: number; output: number }
estimatedCostSaved: number
}
export class CacheTracker {
private metrics: CacheMetrics = {
hits: { exact: 0, normalized: 0, semantic: 0 },
misses: 0,
totalTokensSaved: { input: 0, output: 0 },
estimatedCostSaved: 0,
}
recordHit(
level: 'exact' | 'normalized' | 'semantic',
tokensSaved: { input: number; output: number }
): void {
this.metrics.hits[level]++
this.metrics.totalTokensSaved.input += tokensSaved.input
this.metrics.totalTokensSaved.output += tokensSaved.output
// Estimate based on Claude Sonnet pricing
const inputCost = (tokensSaved.input / 1_000_000) * 3.0
const outputCost = (tokensSaved.output / 1_000_000) * 15.0
this.metrics.estimatedCostSaved += inputCost + outputCost
}
recordMiss(): void {
this.metrics.misses++
}
getHitRate(): number {
const totalHits = Object.values(this.metrics.hits).reduce(
(sum, count) => sum + count,
0
)
const total = totalHits + this.metrics.misses
return total === 0 ? 0 : totalHits / total
}
getSummary(): string {
const hitRate = (this.getHitRate() * 100).toFixed(1)
const saved = this.metrics.estimatedCostSaved.toFixed(2)
return `Cache hit rate: ${hitRate}% | Est. saved: $${saved}`
}
}
In my experience, a well-tuned tiered cache on a customer support chatbot hits a 35-50% cache rate, which translates to roughly 40% cost reduction. For document summarization pipelines, exact match alone gets you 60-80%, because the same documents keep getting processed.
Common mistakes to avoid
Don't cache streaming responses mid-stream. Cache the final complete response. If you try to cache partial chunks, you'll end up with incomplete entries when requests get interrupted.
Don't forget to cache errors. If a prompt consistently fails (content filter, too long, etc.), you don't want to keep burning API calls. Cache the failure for a short TTL:
// Cache failures for 5 minutes to avoid hammering the API
if (error instanceof Anthropic.APIError && error.status === 400) {
await cache.set(prompt, model, `__ERROR__:${error.message}`, {
input: 0,
output: 0,
}, { ttlSeconds: 300 })
}
Don't use semantic caching for tool-use or agentic workflows. The same question can require completely different tool calls depending on the current state. Semantic caching works for stateless Q&A, not for multi-step agents.
Don't skip the in-memory layer. Before hitting Redis, check a local LRU cache. For a Next.js API route handling 100 req/s, the Redis round-trip matters. An in-memory cache with a 60-second TTL catches the most repeated requests:
pnpm add lru-cache
// src/lib/memory-cache.ts
import { LRUCache } from 'lru-cache'
const memoryCache = new LRUCache<string, string>({
max: 500,
ttl: 60 * 1000, // 60 seconds
})
export function getFromMemory(key: string): string | undefined {
return memoryCache.get(key)
}
export function setInMemory(key: string, value: string): void {
memoryCache.set(key, value)
}
What's next
Once your caching layer is in place, you'll want to keep your AI features reliable even as you scale. Next up: How to Test AI Features: Unit Testing LLM-Powered Code — because testing non-deterministic code is its own kind of puzzle, and doing it right will save you from shipping broken caches and wrong responses to production.