Cut Your Claude API Bill by 90% Using Prompt Caching in TypeScript

Monday 18/05/2026

You shipped a Claude-powered feature, it works great, and now your monthly bill looks like a rent payment. The dashboard shows why: the same 8,000-token system prompt and document context goes out on every single request, and you pay full input price each time. Anthropic prompt caching in TypeScript is the fix: it's deterministic, provider-native, and stacks with whatever response caching you already have. On a real RAG chatbot I'll walk through below, switching it on cut input costs by about 85% with a 12-line change.

This isn't the same as the response caching I wrote about earlier. That layer reuses outputs across requests. Prompt caching is different: Anthropic stores the prefix of your prompt server-side, and on subsequent calls you only pay 10% of the input price for the cached portion — even if the final user message is brand new. That means the savings compound across every single chat turn, not just exact-match repeats.

What prompt caching actually does

When you send a request to Claude, you can mark up to four spots in your prompt with a cache_control: { type: "ephemeral" } breakpoint. Everything from the start of the prompt up to that breakpoint becomes a cacheable prefix. The first request that creates the cache pays a write premium — 1.25× the normal input price for the 5-minute TTL, or 2× for the 1-hour TTL. Every subsequent request that hits the same prefix pays just 0.1× the input price for those cached tokens.

The math is what makes this absurd. A 10,000-token system prompt at Sonnet's $3 per million input tokens costs $0.03 per call. With caching, the first call costs $0.0375 (write), and every cache hit costs $0.003. If 100 requests share that prefix within five minutes (one write plus 99 reads), you've paid about $0.33 instead of $3.00, roughly an 89% reduction on that portion.
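To sanity-check that arithmetic, here is a minimal sketch that reproduces it. The function name `prefixCost` and its parameters are my own, not part of any SDK; the multipliers are the ones quoted above (1.25× for a 5-minute write, 0.1× for a read), and it assumes every call after the first lands inside the TTL.

```typescript
// Hypothetical helper: cost in USD of sending one cacheable prefix `calls` times.
// Assumes the first call writes the cache and every later call hits it.
const WRITE_MULTIPLIER = 1.25 // 5-minute TTL write premium
const READ_MULTIPLIER = 0.1 // cache hit

export function prefixCost(
    prefixTokens: number,
    calls: number,
    pricePerMillionTokens: number
) {
    const base = (prefixTokens * pricePerMillionTokens) / 1_000_000
    const uncached = base * calls
    // One write, then (calls - 1) reads.
    const cached =
        calls <= 0
            ? 0
            : base * WRITE_MULTIPLIER + base * READ_MULTIPLIER * (calls - 1)
    const savings = uncached === 0 ? 0 : 1 - cached / uncached
    return { uncached, cached, savings }
}

// The example from the text: 10,000 tokens, 100 calls, $3 per million tokens.
const result = prefixCost(10_000, 100, 3)
console.log(result.cached.toFixed(4)) // 0.3345
console.log(result.uncached.toFixed(2)) // 3.00
```

With a single call the cached path is the more expensive one ($0.0375 against $0.03), which is the write premium showing up with nothing to amortize it.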

There are constraints to know up-front:

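  • Minimum cacheable size: 1,024 tokens for most Claude Sonnet and Opus models and 2,048 for older Haiku models. Some newer models require more, so check the current documentation for the model you use. Below the minimum, the breakpoint is silently ignored.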
  • Exact prefix match required: Caching is structural, not semantic. One extra space or a re-ordered field means a cache miss.
  • TTL is from last hit, not last write: Every cache read refreshes the timer, so a busy cache stays warm.
  • You get four breakpoints per request: Order matters. We'll use that.
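Because matching is exact, it helps to have a way to see where two supposedly identical prefixes diverge. The sketch below is a debugging aid of my own, not an SDK feature: `firstDifference` and the sample objects are hypothetical, and it assumes you serialize the cached part of each request to a string (here with JSON.stringify) before comparing.

```typescript
// Hypothetical debugging helper: index of the first character where two
// serialized prefixes differ, or -1 when they are identical.
export function firstDifference(a: string, b: string): number {
    const shared = Math.min(a.length, b.length)
    for (let i = 0; i < shared; i++) {
        if (a[i] !== b[i]) return i
    }
    // One string is a strict prefix of the other, or they are equal.
    return a.length === b.length ? -1 : shared
}

// Two prefixes that look the same at a glance; the second has a trailing space.
const previous = JSON.stringify({
    system: 'You are a docs assistant.',
    tools: ['search'],
})
const current = JSON.stringify({
    system: 'You are a docs assistant. ',
    tools: ['search'],
})

console.log(firstDifference(previous, previous)) // -1
const at = firstDifference(previous, current)
console.log(at, JSON.stringify(current.slice(at - 10, at + 1)))
```

Logging the serialized prefix of two consecutive requests and running them through a check like this is usually faster than staring at the usage numbers.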

Installing the SDK

pnpm add @anthropic-ai/sdk

The SDK has supported prompt caching natively for over a year — no extra packages, no beta header. You just add cache_control to the blocks you want cached.

Where to place breakpoints: the layered approach

The biggest mistake I see is people slapping one breakpoint at the end of the system prompt and calling it done. That works, but you leave half the savings on the table. The real pattern is to layer breakpoints from most-stable to least-stable, so each one builds on the cache below it.

For a typical RAG chatbot, that hierarchy looks like:

  1. System prompt — changes only on deploy
  2. Tool definitions — change rarely
  3. Retrieved documents — change per conversation, stable within a session
  4. Conversation history — grows each turn

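You put a breakpoint at the end of each layer you want cached. One caveat on ordering: the API assembles the cacheable prefix as tools, then system, then messages, whatever order the fields appear in your request, so a change to the tool list also invalidates the cached system prompt. Here's the full call, with breakpoints on the first three layers (history is covered in the gotchas below):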

// src/lib/claude-cached.ts
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! })

export interface CachedChatInput {
    systemPrompt: string
    tools: Anthropic.Tool[]
    retrievedDocs: string
    history: Anthropic.MessageParam[]
    userMessage: string
}

export async function cachedChat(input: CachedChatInput) {
    const response = await client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 1024,
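        // Breakpoint 1: the system prompt, which only changes on deploy.
        system: [
            {
                type: 'text',
                text: input.systemPrompt,
                cache_control: { type: 'ephemeral' },
            },
        ],
        // Breakpoint 2: marking the last tool caches the whole tool list.
        tools: input.tools.map((tool, idx) =>
            idx === input.tools.length - 1
                ? { ...tool, cache_control: { type: 'ephemeral' } }
                : tool
        ),
        messages: [
            {
                role: 'user',
                content: [
                    {
                        // Breakpoint 3: retrieved docs, stable within a session.
                        type: 'text',
                        text: `<retrieved_documents>\n${input.retrievedDocs}\n</retrieved_documents>`,
                        cache_control: { type: 'ephemeral' },
                    },
                ],
            },
            // No breakpoint from here on: history and the new message are sent uncached.
            ...input.history,
            { role: 'user', content: input.userMessage },
        ],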
    })

    return {
        content: response.content,
        usage: response.usage,
    }
}

Three things to notice. First, the cache_control field sits inside the block you want cached: system text, the last tool definition, or a content block in a message. Second, marking the last tool implicitly caches every tool before it, so one breakpoint covers the whole tool list. Third, the retrieved documents go in their own content block so we can cache them separately from the user's actual message, which changes every turn.

Reading the cache usage to verify it's working

The usage object on the response tells you exactly what happened with the cache. You want to log this in dev to catch misconfigurations early:

// src/lib/log-cache-usage.ts
import type Anthropic from '@anthropic-ai/sdk'

export function logCacheUsage(usage: Anthropic.Usage) {
    const cacheWriteTokens = usage.cache_creation_input_tokens ?? 0
    const cacheReadTokens = usage.cache_read_input_tokens ?? 0
    const uncachedInputTokens = usage.input_tokens

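    // Sonnet input price in USD per token; adjust for the model you call.
    const inputPrice = 3 / 1_000_000
    // 1.25× assumes 5-minute-TTL writes; 1-hour writes are billed at 2×.
    const costWrite = cacheWriteTokens * inputPrice * 1.25
    const costRead = cacheReadTokens * inputPrice * 0.1
    const costUncached = uncachedInputTokens * inputPrice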

    console.log({
        cacheWriteTokens,
        cacheReadTokens,
        uncachedInputTokens,
        totalInputCost: (costWrite + costRead + costUncached).toFixed(6),
        wouldHaveCostWithoutCache: (
            (cacheWriteTokens + cacheReadTokens + uncachedInputTokens) *
            inputPrice
        ).toFixed(6),
    })
}

cache_read_input_tokens is the one you care about — that's tokens you paid 10% for. If it's zero and you expected a hit, you have a bug. The most common cause: a timestamp or request ID got injected somewhere in your prefix, so every call writes a fresh cache instead of reading one.
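One way to turn that check into something you can alert on is to classify each response by its usage fields. This is a sketch under my own naming: `CacheUsage`, `CacheOutcome`, and `classifyCacheOutcome` are hypothetical, and the interface is a local stand-in that models only the three usage fields discussed here.

```typescript
// Local stand-in for the usage fields used above; the SDK's type has more fields.
interface CacheUsage {
    input_tokens: number
    cache_creation_input_tokens?: number | null
    cache_read_input_tokens?: number | null
}

type CacheOutcome = 'hit' | 'write-only' | 'not-cached'

// Hypothetical helper: summarize what the cache did for one request.
export function classifyCacheOutcome(usage: CacheUsage): CacheOutcome {
    const read = usage.cache_read_input_tokens ?? 0
    const written = usage.cache_creation_input_tokens ?? 0
    if (read > 0) return 'hit' // at least part of the prefix was served from cache
    if (written > 0) return 'write-only' // a cache entry was created, nothing was reused
    return 'not-cached' // no breakpoint took effect at all
}

console.log(
    classifyCacheOutcome({
        input_tokens: 200,
        cache_creation_input_tokens: 0,
        cache_read_input_tokens: 9200,
    })
) // hit
```

A 'write-only' result on the first request of a session is expected. Seeing it on every request points at an unstable prefix, while a steady 'not-cached' suggests the breakpoint is missing or the prefix is below the minimum size.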

The 5-minute vs 1-hour TTL tradeoff

By default, ephemeral cache entries expire after 5 minutes of inactivity. For chat apps where users send messages every few seconds, this is fine because the cache stays warm. For batch jobs, scheduled summarization, or anything where requests arrive more than five minutes apart, the 5-minute TTL means you pay the write premium again on every request.

The 1-hour TTL exists for those workloads. You opt in per breakpoint:

// src/lib/long-ttl-cache.ts
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! })

export async function summarizeWithStableContext(
    stableContext: string,
    document: string
) {
    return client.messages.create({
        model: 'claude-sonnet-4-6',
        max_tokens: 1024,
        system: [
            {
                type: 'text',
                text: stableContext,
                cache_control: { type: 'ephemeral', ttl: '1h' },
            },
        ],
        messages: [{ role: 'user', content: `Summarize:\n\n${document}` }],
    })
}

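The 1-hour write costs 2× the base input price, which is 0.75× of the base price more than the 1.25× 5-minute write. The breakeven is easy to work out: a request that arrives after the 5-minute cache has expired pays 1.25× to rewrite the prefix, while the same request against a live 1-hour cache pays 0.1×, so the first read that the 5-minute cache would have missed already recovers the extra 0.75×. If your traffic would have kept the 5-minute cache warm anyway, the 1-hour TTL only adds cost. Don't pick 1-hour by default; it has a real cost premium. Pick it when you can show your workload doesn't refresh the cache often enough on its own.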
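To compare the two TTLs on sparse traffic, a small model is enough. This is a sketch under one stated assumption: every request arrives more than 5 minutes but less than 1 hour after the previous one, so the 5-minute cache always misses and the 1-hour cache always hits after the first write. The function name `sparseTrafficCost` is mine, and costs are expressed in units of one uncached send of the prefix so the result does not depend on the model's price.

```typescript
// Hypothetical model: relative prefix cost for `requests` sparse requests.
// Unit = the cost of sending the prefix once with no caching.
export function sparseTrafficCost(requests: number) {
    if (requests <= 0) return { noCache: 0, fiveMinute: 0, oneHour: 0 }
    return {
        // Full price every time.
        noCache: requests,
        // Every request finds the cache expired and rewrites it at 1.25×.
        fiveMinute: requests * 1.25,
        // One write at 2×, then reads at 0.1× that also refresh the timer.
        oneHour: 2 + (requests - 1) * 0.1,
    }
}

for (const n of [1, 2, 3, 10]) {
    const c = sparseTrafficCost(n)
    console.log(n, c.noCache, c.fiveMinute, c.oneHour.toFixed(1))
}
```

Under that assumption the 1-hour TTL beats the 5-minute TTL from the second request, and beats no caching at all from the third. With a single request it is the most expensive option of the three.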

Real cost data from a RAG chatbot

Here's the actual before/after from a documentation chatbot I shipped earlier this year. The setup: an 800-token system prompt, 12 tool definitions (~2,100 tokens), and on average 6,500 tokens of retrieved doc context per conversation. Average conversation: 7 turns.

| Metric                      | No caching | With caching |
| --------------------------- | ---------- | ------------ |
| Input tokens per turn       | 9,400      | 9,400        |
| Cache reads per turn        | 0          | 9,200        |
| Cache writes on turn 1      | 0          | 9,400        |
| Input cost per 7-turn convo | $0.197     | $0.029       |
| Savings                     | n/a        | 85.3%        |

Output tokens are unchanged — caching only affects input pricing. But for RAG-style apps where input dominates, the total bill drops by 60–80% in practice, not just on the input line.

Gotchas that will burn you

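Conversation history breaks caching when it gets rewritten. Appending turns is safe: the earlier turns are still an identical prefix, so a breakpoint on the last assistant message of the prior turn keeps hitting. What kills the cache is editing what came before it, such as trimming old turns to a sliding window, summarizing them, or re-serializing them differently, because everything after the first changed token misses. If you do rewrite history, put the breakpoint before the conversation history so the prefix stays stable. Either way, the user's new message goes after the cached portion.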
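The "last assistant message of the prior turn" placement can be sketched as a small pure function. Everything here is illustrative: `TextBlock` and `ChatMessage` are simplified local stand-ins for the SDK's message types (only text blocks are modeled), and `markLastAssistantTurn` is a hypothetical helper name.

```typescript
// Simplified local stand-ins for the SDK's message types; text blocks only.
interface TextBlock {
    type: 'text'
    text: string
    cache_control?: { type: 'ephemeral' }
}

interface ChatMessage {
    role: 'user' | 'assistant'
    content: string | TextBlock[]
}

// Hypothetical helper: returns a copy of `history` with a breakpoint on the
// final block of the most recent assistant message. The input is not mutated.
export function markLastAssistantTurn(history: ChatMessage[]): ChatMessage[] {
    const target = history.map((m) => m.role).lastIndexOf('assistant')
    if (target === -1) return history

    return history.map((message, i): ChatMessage => {
        if (i !== target) return message
        // String content becomes a single text block so it can carry cache_control.
        const blocks: TextBlock[] =
            typeof message.content === 'string'
                ? [{ type: 'text', text: message.content }]
                : message.content
        const marked = blocks.map(
            (block, j): TextBlock =>
                j === blocks.length - 1
                    ? { ...block, cache_control: { type: 'ephemeral' } }
                    : block
        )
        return { ...message, content: marked }
    })
}

const history: ChatMessage[] = [
    { role: 'user', content: 'How do I rotate an API key?' },
    { role: 'assistant', content: 'Open Settings, then API Keys.' },
]
console.log(JSON.stringify(markLastAssistantTurn(history)[1]))
```

Because the function returns a copy, the stored history never accumulates markers from earlier turns. That matters with a four-breakpoint limit: in the layered call above, this would be the fourth.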

System prompts with timestamps or session IDs. I've watched a team's "Today is 2026-05-18T07:03:25.827Z" line in their system prompt cost them $400 a day because it changed on every request, killing every cache hit. Either pin the timestamp to date-only granularity ("Today is 2026-05-18") or move it out of the cached portion entirely.
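Pinning to date-only granularity is a one-liner. A minimal sketch, assuming UTC dates are acceptable for your prompt; `todayLine` is a hypothetical helper name.

```typescript
// Hypothetical helper: a date line that changes at most once per day (UTC),
// so it is safe to keep inside a cached system prompt.
export function todayLine(now: Date = new Date()): string {
    // toISOString() is always UTC and always starts with YYYY-MM-DD.
    const dateOnly = now.toISOString().slice(0, 10)
    return `Today is ${dateOnly}`
}

console.log(todayLine(new Date('2026-05-18T07:03:25.827Z'))) // Today is 2026-05-18
```

The prefix still changes once a day at UTC midnight, which costs one extra cache write per day rather than one per request.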

Tools change order between deploys. If you generate the tool array from an object whose key order isn't stable, you'll get cache misses on every deploy. Sort tools by name explicitly:

const sortedTools = Object.values(toolRegistry).sort((a, b) =>
    a.name.localeCompare(b.name)
)

Streaming responses also support caching. Nothing to change: the cache_control blocks work identically with client.messages.stream(). The same usage fields are available once the stream completes, for example on the message returned by stream.finalMessage().

Below-minimum breakpoints fail silently. If the prefix up to a breakpoint is only 800 tokens (say, a short system prompt with no tools in front of it), marking it for caching does nothing; you need at least 1,024. The API does not return an error; both cache_creation_input_tokens and cache_read_input_tokens just stay at zero. Always verify with the usage log above.

When prompt caching isn't worth it

Two cases. First, very short prompts: if your entire input is under 1,024 tokens, you can't cache anything anyway. Second, single-shot batch jobs where each request has unique context and no prefix repeats: caching adds the write premium with no reads to amortize it across, so you'd lose money. Run the math: the write premium is 0.25× extra input cost (5-min TTL) and each cache read saves 0.9×, so a single read per write already puts you ahead, and with zero reads you are simply out the 0.25×.

For chatbots, RAG, agentic loops, and anything with stable system prompts, it's nearly always a win.

What's next

Once you've cut the input bill, the next lever is routing simple queries to cheaper models entirely. See How to Route LLM Requests to Cheap vs Expensive Models Automatically in TypeScript — prompt caching plus model routing typically gets a Claude app down to 10–15% of its original bill without touching the user-facing behavior.

Vadim Alakhverdov

Software developer writing about JavaScript, web development, and developer tools.