How to Handle AI API Rate Limits and Errors in Production (TypeScript)

Wednesday, 18 February 2026 · 12 min read

Your AI feature works great in development. Then it hits production, 50 users send requests at the same time, and everything falls apart. Anthropic returns 429s, OpenAI times out, your server logs fill up with unhandled promise rejections, and your users see a blank screen. The happy-path tutorial code you copied from a blog post wasn't built for this.

AI API rate limiting in TypeScript production apps is a different game than making a single API call work. You need exponential backoff, request queuing, token budgets, and a plan for when the API is just... down. Here are the patterns I use in production, all as copy-pasteable TypeScript utilities.

Exponential backoff that actually works

Every developer knows "just add exponential backoff." But most implementations get the details wrong — they don't add jitter (so all retries hit at the same time), they retry on errors that will never succeed, or they retry forever.

Here's a retry wrapper that handles the real-world edge cases:

// src/lib/retry.ts
interface RetryOptions {
    maxRetries: number
    baseDelayMs: number
    maxDelayMs: number
    retryableStatuses: Set<number>
}

const DEFAULT_OPTIONS: RetryOptions = {
    maxRetries: 3,
    baseDelayMs: 1000,
    maxDelayMs: 30000,
    retryableStatuses: new Set([429, 500, 502, 503, 529]),
}

interface ApiError {
    status?: number
    headers?: Record<string, string>
    message: string
}

function isApiError(error: unknown): error is ApiError {
    return (
        typeof error === 'object' &&
        error !== null &&
        'message' in error
    )
}

function getRetryAfterMs(error: ApiError): number | null {
    const retryAfter = error.headers?.['retry-after']
    if (!retryAfter) return null

    const seconds = parseFloat(retryAfter)
    if (!isNaN(seconds)) return seconds * 1000

    // Some APIs return a date instead of seconds
    const date = new Date(retryAfter).getTime()
    if (!isNaN(date)) return Math.max(0, date - Date.now())

    return null
}

export async function withRetry<T>(
    fn: () => Promise<T>,
    options: Partial<RetryOptions> = {}
): Promise<T> {
    const opts = { ...DEFAULT_OPTIONS, ...options }

    for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
        try {
            return await fn()
        } catch (error) {
            const isLastAttempt = attempt === opts.maxRetries

            if (!isApiError(error) || isLastAttempt) {
                throw error
            }

            // Don't retry client errors (except 429)
            if (error.status && error.status < 500 && error.status !== 429) {
                throw error
            }

            // Don't retry non-retryable status codes
            if (error.status && !opts.retryableStatuses.has(error.status)) {
                throw error
            }

            // Use Retry-After header if provided, otherwise exponential backoff
            const retryAfterMs = getRetryAfterMs(error)
            const exponentialDelay = Math.min(
                opts.baseDelayMs * Math.pow(2, attempt),
                opts.maxDelayMs
            )

            // Add jitter: random value between 0 and the calculated delay,
            // capped so the total never exceeds maxDelayMs
            const jitter = Math.random() * exponentialDelay
            const delay =
                retryAfterMs ?? Math.min(exponentialDelay + jitter, opts.maxDelayMs)

            console.warn(
                `Retry ${attempt + 1}/${opts.maxRetries} after ${Math.round(delay)}ms` +
                    (error.status ? ` (status: ${error.status})` : '')
            )

            await new Promise((resolve) => setTimeout(resolve, delay))
        }
    }

    // TypeScript needs this — the loop always returns or throws
    throw new Error('Retry loop exited unexpectedly')
}

The key details that matter:

  • Jitter is not optional. Without it, if 10 requests all hit a rate limit at the same time, they'll all retry at the exact same moment, causing another rate limit. Jitter spreads them out.
  • Respect Retry-After headers. Both Anthropic and OpenAI send this header on 429 responses. It tells you exactly how long to wait. Ignoring it means you're guessing when you don't need to.
  • Status 529 is Anthropic's "overloaded" code. It's not in the HTTP spec, but Anthropic uses it when their servers are under heavy load. Treat it like a 503.
  • Don't retry 400/401/403 errors. A bad API key won't become valid after waiting 2 seconds. Only retry errors that might succeed on the next attempt.
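
The Retry-After handling deserves a closer look, since the header comes in two forms. Here's a standalone check of both (getRetryAfterMs repeated from retry.ts so the snippet runs on its own):

```typescript
// getRetryAfterMs from src/lib/retry.ts, repeated here so this runs standalone
interface ApiError {
    status?: number
    headers?: Record<string, string>
    message: string
}

function getRetryAfterMs(error: ApiError): number | null {
    const retryAfter = error.headers?.['retry-after']
    if (!retryAfter) return null

    const seconds = parseFloat(retryAfter)
    if (!isNaN(seconds)) return seconds * 1000

    // Some APIs return a date instead of seconds
    const date = new Date(retryAfter).getTime()
    if (!isNaN(date)) return Math.max(0, date - Date.now())

    return null
}

// Delta-seconds form, the common case on 429 responses
const fromSeconds = getRetryAfterMs({
    message: 'rate limited',
    status: 429,
    headers: { 'retry-after': '5' },
})
console.log(fromSeconds) // 5000

// HTTP-date form: a timestamp ~10s in the future
const future = new Date(Date.now() + 10_000).toUTCString()
const fromDate = getRetryAfterMs({
    message: 'rate limited',
    status: 429,
    headers: { 'retry-after': future },
})
console.log(fromDate) // roughly 10000, minus sub-second truncation
```

Note that `parseFloat` returns NaN for an HTTP-date (they start with a day name), so the fallthrough to date parsing is safe.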

Usage is simple:

// src/lib/claude.ts
import Anthropic from '@anthropic-ai/sdk'
import { withRetry } from './retry'

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

export async function askClaude(prompt: string): Promise<string> {
    const response = await withRetry(() =>
        client.messages.create({
            model: 'claude-sonnet-4-5-20250929',
            max_tokens: 1024,
            messages: [{ role: 'user', content: prompt }],
        })
    )

    return response.content[0].type === 'text' ? response.content[0].text : ''
}

Request queue: don't flood the API

Retry logic handles failures, but it doesn't prevent them. If your app fires 100 Claude requests simultaneously, most of them will get rate-limited no matter how good your retry logic is. You need a queue that limits concurrency.

// src/lib/queue.ts
type QueuedTask<T> = {
    fn: () => Promise<T>
    resolve: (value: T) => void
    reject: (error: unknown) => void
}

export class RequestQueue {
    private queue: QueuedTask<unknown>[] = []
    private activeCount = 0
    private readonly maxConcurrent: number
    private readonly delayBetweenMs: number

    constructor(maxConcurrent = 5, delayBetweenMs = 100) {
        this.maxConcurrent = maxConcurrent
        this.delayBetweenMs = delayBetweenMs
    }

    async add<T>(fn: () => Promise<T>): Promise<T> {
        return new Promise<T>((resolve, reject) => {
            this.queue.push({
                fn: fn as () => Promise<unknown>,
                resolve: resolve as (value: unknown) => void,
                reject,
            })
            this.processNext()
        })
    }

    private async processNext(): Promise<void> {
        if (this.activeCount >= this.maxConcurrent || this.queue.length === 0) {
            return
        }

        const task = this.queue.shift()
        if (!task) return

        this.activeCount++

        try {
            const result = await task.fn()
            task.resolve(result)
        } catch (error) {
            task.reject(error)
        } finally {
            this.activeCount--

            // Small delay between requests to avoid bursts
            if (this.queue.length > 0) {
                await new Promise((resolve) =>
                    setTimeout(resolve, this.delayBetweenMs)
                )
            }
            this.processNext()
        }
    }

    get pending(): number {
        return this.queue.length
    }

    get active(): number {
        return this.activeCount
    }
}

Now combine the queue with the retry wrapper:

// src/lib/claude.ts
import Anthropic from '@anthropic-ai/sdk'
import { withRetry } from './retry'
import { RequestQueue } from './queue'

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const queue = new RequestQueue(5, 200) // 5 concurrent, 200ms gap

export async function askClaude(prompt: string): Promise<string> {
    const response = await queue.add(() =>
        withRetry(() =>
            client.messages.create({
                model: 'claude-sonnet-4-5-20250929',
                max_tokens: 1024,
                messages: [{ role: 'user', content: prompt }],
            })
        )
    )

    return response.content[0].type === 'text' ? response.content[0].text : ''
}

The queue wraps the retry, not the other way around. Each queued item gets its own retry attempts, and the queue ensures you never have more than 5 in-flight requests at once. The 200ms delay between requests gives the API a moment to breathe.

Gotcha: Don't make maxConcurrent too low. Setting it to 1 turns your app into a serial queue where everyone waits for everyone else. Setting it to 50 defeats the purpose. I've found 5-10 works well for Anthropic's rate limits on most plans. Check your specific rate limits in the Anthropic console.
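
You can sanity-check the limit yourself. This standalone simulation (RequestQueue condensed from the class above so it runs on its own) pushes 20 fake 50ms tasks through a 3-wide queue and records peak concurrency:

```typescript
// RequestQueue condensed from src/lib/queue.ts so this demo runs standalone
type QueuedTask<T> = {
    fn: () => Promise<T>
    resolve: (value: T) => void
    reject: (error: unknown) => void
}

class RequestQueue {
    private queue: QueuedTask<unknown>[] = []
    private activeCount = 0

    constructor(
        private readonly maxConcurrent = 5,
        private readonly delayBetweenMs = 100
    ) {}

    async add<T>(fn: () => Promise<T>): Promise<T> {
        return new Promise<T>((resolve, reject) => {
            this.queue.push({
                fn: fn as () => Promise<unknown>,
                resolve: resolve as (value: unknown) => void,
                reject,
            })
            this.processNext()
        })
    }

    private async processNext(): Promise<void> {
        if (this.activeCount >= this.maxConcurrent || this.queue.length === 0) return
        const task = this.queue.shift()!
        this.activeCount++
        try {
            task.resolve(await task.fn())
        } catch (error) {
            task.reject(error)
        } finally {
            this.activeCount--
            if (this.queue.length > 0) {
                await new Promise((r) => setTimeout(r, this.delayBetweenMs))
            }
            this.processNext()
        }
    }
}

// Fire 20 fake 50ms "API calls" through a 3-wide queue, tracking the peak
const queue = new RequestQueue(3, 10)
let current = 0
let peak = 0

const fakeApiCall = () => {
    current++
    peak = Math.max(peak, current)
    return new Promise<void>((resolve) =>
        setTimeout(() => {
            current--
            resolve()
        }, 50)
    )
}

await Promise.all(Array.from({ length: 20 }, () => queue.add(fakeApiCall)))
console.log(`peak concurrency: ${peak}`) // never exceeds the limit of 3
```

Swap the fake task for a real Claude call and the behavior is identical: the queue holds everything beyond the first three until a slot frees up.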

Token budgeting: stop the money leak

Rate limits aren't just about requests per minute. They're also about tokens per minute, and more importantly, tokens per dollar. Without a budget, a single malicious or buggy prompt can burn through your monthly allocation in hours.

// src/lib/token-budget.ts
interface BudgetConfig {
    maxTokensPerMinute: number
    maxTokensPerHour: number
    maxTokensPerRequest: number
}

const DEFAULT_BUDGET: BudgetConfig = {
    maxTokensPerMinute: 80000,
    maxTokensPerHour: 1000000,
    maxTokensPerRequest: 4096,
}

interface TokenRecord {
    tokens: number
    timestamp: number
}

export class TokenBudget {
    private usage: TokenRecord[] = []
    private readonly config: BudgetConfig

    constructor(config: Partial<BudgetConfig> = {}) {
        this.config = { ...DEFAULT_BUDGET, ...config }
    }

    canSpend(estimatedTokens: number): { allowed: boolean; reason?: string } {
        if (estimatedTokens > this.config.maxTokensPerRequest) {
            return {
                allowed: false,
                reason: `Request exceeds max tokens per request (${estimatedTokens} > ${this.config.maxTokensPerRequest})`,
            }
        }

        const now = Date.now()
        this.cleanup(now)

        const minuteUsage = this.getUsageSince(now - 60_000)
        if (minuteUsage + estimatedTokens > this.config.maxTokensPerMinute) {
            return {
                allowed: false,
                reason: `Would exceed per-minute token budget (${minuteUsage + estimatedTokens} > ${this.config.maxTokensPerMinute})`,
            }
        }

        const hourUsage = this.getUsageSince(now - 3600_000)
        if (hourUsage + estimatedTokens > this.config.maxTokensPerHour) {
            return {
                allowed: false,
                reason: `Would exceed per-hour token budget (${hourUsage + estimatedTokens} > ${this.config.maxTokensPerHour})`,
            }
        }

        return { allowed: true }
    }

    record(tokens: number): void {
        this.usage.push({ tokens, timestamp: Date.now() })
    }

    private getUsageSince(since: number): number {
        return this.usage
            .filter((r) => r.timestamp >= since)
            .reduce((sum, r) => sum + r.tokens, 0)
    }

    private cleanup(now: number): void {
        // Remove records older than 1 hour
        this.usage = this.usage.filter((r) => r.timestamp >= now - 3600_000)
    }
}

Wire it into your Claude wrapper:

// src/lib/claude.ts — updated with token budget
import Anthropic from '@anthropic-ai/sdk'
import { withRetry } from './retry'
import { RequestQueue } from './queue'
import { TokenBudget } from './token-budget'

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const queue = new RequestQueue(5, 200)
const budget = new TokenBudget({
    maxTokensPerMinute: 80000,
    maxTokensPerHour: 1000000,
    maxTokensPerRequest: 4096,
})

export async function askClaude(
    prompt: string,
    maxTokens = 1024
): Promise<string> {
    // Rough estimate: 1 token ≈ 4 chars for English text
    const estimatedInputTokens = Math.ceil(prompt.length / 4)
    const estimatedTotal = estimatedInputTokens + maxTokens

    const check = budget.canSpend(estimatedTotal)
    if (!check.allowed) {
        throw new Error(`Token budget exceeded: ${check.reason}`)
    }

    const response = await queue.add(() =>
        withRetry(() =>
            client.messages.create({
                model: 'claude-sonnet-4-5-20250929',
                max_tokens: maxTokens,
                messages: [{ role: 'user', content: prompt }],
            })
        )
    )

    // Record actual usage from the response
    const actualTokens =
        response.usage.input_tokens + response.usage.output_tokens
    budget.record(actualTokens)

    return response.content[0].type === 'text' ? response.content[0].text : ''
}

The token estimate before the call is rough (1 token ≈ 4 chars), but it's good enough for budget gating. After the call, you record the actual usage from the API response. Over time, this gives you accurate tracking.

Why not just rely on the API's built-in rate limits? Because by the time the API rejects you, you've already made the request. Budget checking on your side prevents unnecessary requests and gives your users an immediate error message instead of a timeout.
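
Here's the gating behavior in isolation (TokenBudget condensed from the class above), with deliberately tiny limits so the rejections are easy to trigger:

```typescript
// TokenBudget condensed from src/lib/token-budget.ts so this runs standalone
interface BudgetConfig {
    maxTokensPerMinute: number
    maxTokensPerHour: number
    maxTokensPerRequest: number
}

interface TokenRecord {
    tokens: number
    timestamp: number
}

class TokenBudget {
    private usage: TokenRecord[] = []

    constructor(private readonly config: BudgetConfig) {}

    canSpend(estimatedTokens: number): { allowed: boolean; reason?: string } {
        if (estimatedTokens > this.config.maxTokensPerRequest) {
            return { allowed: false, reason: 'exceeds per-request cap' }
        }
        const now = Date.now()
        this.usage = this.usage.filter((r) => r.timestamp >= now - 3600_000)
        if (this.getUsageSince(now - 60_000) + estimatedTokens > this.config.maxTokensPerMinute) {
            return { allowed: false, reason: 'would exceed per-minute budget' }
        }
        if (this.getUsageSince(now - 3600_000) + estimatedTokens > this.config.maxTokensPerHour) {
            return { allowed: false, reason: 'would exceed per-hour budget' }
        }
        return { allowed: true }
    }

    record(tokens: number): void {
        this.usage.push({ tokens, timestamp: Date.now() })
    }

    private getUsageSince(since: number): number {
        return this.usage
            .filter((r) => r.timestamp >= since)
            .reduce((sum, r) => sum + r.tokens, 0)
    }
}

// Tiny limits: 1000 tokens/minute, 500 tokens per request
const budget = new TokenBudget({
    maxTokensPerMinute: 1000,
    maxTokensPerHour: 5000,
    maxTokensPerRequest: 500,
})

const first = budget.canSpend(400) // 400 of 1000: allowed
budget.record(400)
const second = budget.canSpend(400) // 800 of 1000: allowed
budget.record(400)
const third = budget.canSpend(400) // would be 1200 of 1000: blocked
const tooBig = budget.canSpend(600) // over the 500/request cap: blocked

console.log(first.allowed, second.allowed, third.allowed, tooBig.allowed)
// true true false false
```

The key property: `canSpend` rejects before any money is spent, and the window slides, so blocked requests become allowed again once old records age out.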

Graceful degradation: what to show when Claude is down

The worst thing you can do when the AI API fails is show a generic "Something went wrong" message. The second worst thing is showing nothing at all. Here's a pattern that gives users a useful experience even when the API is unavailable:

// src/lib/fallback.ts
interface FallbackConfig<T> {
    primary: () => Promise<T>
    fallback: () => T | Promise<T>
    isHealthy?: () => boolean // optional override for the breaker's own check
    onError?: (error: unknown) => void
}

export class CircuitBreaker<T> {
    private failures = 0
    private lastFailure = 0
    private readonly threshold: number
    private readonly resetTimeMs: number

    constructor(threshold = 5, resetTimeMs = 60000) {
        this.threshold = threshold
        this.resetTimeMs = resetTimeMs
    }

    isHealthy(): boolean {
        if (this.failures < this.threshold) return true

        // Check if enough time has passed to try again
        if (Date.now() - this.lastFailure > this.resetTimeMs) {
            this.failures = 0
            return true
        }

        return false
    }

    recordFailure(): void {
        this.failures++
        this.lastFailure = Date.now()
    }

    recordSuccess(): void {
        this.failures = 0
    }

    async execute(config: FallbackConfig<T>): Promise<T> {
        // Use the caller's health check if provided, else the breaker's own
        const healthy = config.isHealthy ? config.isHealthy() : this.isHealthy()
        if (!healthy) {
            console.warn('Circuit breaker open — using fallback')
            return config.fallback()
        }

        try {
            const result = await config.primary()
            this.recordSuccess()
            return result
        } catch (error) {
            this.recordFailure()
            config.onError?.(error)

            if (!this.isHealthy()) {
                return config.fallback()
            }

            throw error
        }
    }
}

Here's how you'd use it for a feature like AI-powered search suggestions:

// src/lib/search.ts
import { CircuitBreaker } from './fallback'
import { askClaude } from './claude'

const breaker = new CircuitBreaker<string[]>(3, 30000) // 3 failures, 30s reset

// Pre-computed suggestions that don't need AI
const STATIC_SUGGESTIONS: Record<string, string[]> = {
    auth: ['How to implement JWT auth', 'OAuth2 with Next.js', 'Session management'],
    api: ['REST API best practices', 'GraphQL vs REST', 'API rate limiting'],
    // ... more categories
}

function getStaticSuggestions(query: string): string[] {
    const key = Object.keys(STATIC_SUGGESTIONS).find((k) =>
        query.toLowerCase().includes(k)
    )
    return key ? STATIC_SUGGESTIONS[key] : ['Try searching for something specific']
}

export async function getSearchSuggestions(query: string): Promise<string[]> {
    return breaker.execute({
        primary: async () => {
            const response = await askClaude(
                `Given the search query "${query}", suggest 5 related search terms. Return a JSON array of strings only.`
            )
            // Malformed JSON throws here and counts as a breaker failure
            return JSON.parse(response) as string[]
        },
        fallback: () => getStaticSuggestions(query),
        isHealthy: () => breaker.isHealthy(),
        onError: (error) =>
            console.error('AI suggestions failed, using static:', error),
    })
}

The circuit breaker pattern prevents cascading failures. After 3 consecutive failures, it stops trying the AI endpoint for 30 seconds and immediately returns the fallback. This means your users get instant (if less clever) results instead of waiting for timeouts.
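
The state machine is easier to see in isolation. This sketch (CircuitBreaker condensed from the class above, with the record methods inlined) drives a breaker with a threshold of 2 against a primary that always fails:

```typescript
// CircuitBreaker condensed from src/lib/fallback.ts so this runs standalone
interface FallbackConfig<T> {
    primary: () => Promise<T>
    fallback: () => T | Promise<T>
    onError?: (error: unknown) => void
}

class CircuitBreaker<T> {
    private failures = 0
    private lastFailure = 0

    constructor(
        private readonly threshold = 5,
        private readonly resetTimeMs = 60000
    ) {}

    isHealthy(): boolean {
        if (this.failures < this.threshold) return true
        if (Date.now() - this.lastFailure > this.resetTimeMs) {
            this.failures = 0
            return true
        }
        return false
    }

    async execute(config: FallbackConfig<T>): Promise<T> {
        if (!this.isHealthy()) return config.fallback()
        try {
            const result = await config.primary()
            this.failures = 0
            return result
        } catch (error) {
            this.failures++
            this.lastFailure = Date.now()
            config.onError?.(error)
            if (!this.isHealthy()) return config.fallback()
            throw error
        }
    }
}

// Threshold of 2: the first failure propagates, the second trips the breaker
const breaker = new CircuitBreaker<string>(2, 60_000)
const config: FallbackConfig<string> = {
    primary: async () => {
        throw new Error('API down')
    },
    fallback: () => 'static result',
}

let firstThrew = false
try {
    await breaker.execute(config)
} catch {
    firstThrew = true // failure 1: below threshold, error reaches the caller
}

const second = await breaker.execute(config) // failure 2 trips the breaker
const third = await breaker.execute(config) // breaker open: primary never runs

console.log(firstThrew, second, third)
// true static result static result
```

Note the asymmetry: below the threshold, errors still propagate to the caller, so isolated blips surface in your logs. Only sustained failure flips the breaker into fallback mode.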

Putting it all together

Here's the full request path through all four layers:

  1. User request comes in
  2. Token budget checks if we can afford this request → reject early if not
  3. Request queue ensures we don't exceed concurrency limits
  4. Retry with backoff handles transient failures (429s, 500s)
  5. Circuit breaker switches to fallback if the API is consistently failing

Each layer solves a different problem:

| Layer | Problem it solves | Without it |
|-------|-------------------|------------|
| Token budget | Runaway costs | A single bug burns your $500 monthly budget overnight |
| Request queue | Concurrency limits | 100 simultaneous requests all get rate-limited |
| Retry + backoff | Transient failures | One 429 = one failed user request |
| Circuit breaker | Sustained outages | Every request waits for timeout during an outage |

The total added code is about 200 lines of TypeScript. No external dependencies beyond the Anthropic SDK itself. You can drop these four files into any project and immediately have production-grade AI API handling.

A note on monitoring

These utilities log warnings and errors to console, which is fine for getting started. In production, you'll want to send these to your monitoring stack (Datadog, Sentry, whatever you use). The key metrics to track:

  • Retry rate — if it's consistently above 10%, you're hitting limits too often
  • Circuit breaker trips — alerts you to API outages faster than status pages
  • Token budget utilization — tells you when to upgrade your plan or optimize prompts
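
To make the retry-rate metric concrete, here's a minimal in-memory counter as a stand-in for a real metrics client (the metric names are illustrative, not from any particular SDK):

```typescript
// A tiny stand-in for a Datadog/Sentry-style counter client
class Counters {
    private counts = new Map<string, number>()

    increment(name: string, by = 1): void {
        this.counts.set(name, (this.counts.get(name) ?? 0) + by)
    }

    get(name: string): number {
        return this.counts.get(name) ?? 0
    }

    ratio(numerator: string, denominator: string): number {
        const d = this.get(denominator)
        return d === 0 ? 0 : this.get(numerator) / d
    }
}

const metrics = new Counters()

// Simulate a traffic window: 100 requests, 12 of which needed a retry
for (let i = 0; i < 100; i++) metrics.increment('ai.requests')
for (let i = 0; i < 12; i++) metrics.increment('ai.retries')

const retryRate = metrics.ratio('ai.retries', 'ai.requests')
console.log(retryRate) // 0.12, above the 10% alert threshold
```

In production you'd call `increment` from inside withRetry's warning path and from the queue's add method, then alert whenever the ratio crosses your threshold.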

What's next

These patterns keep your AI features running when the API misbehaves. But what about building features that are more than a single API call? In the next post, I'll cover how to build a multi-step AI agent with tool use in TypeScript — where you chain multiple Claude calls together and the error handling patterns from this post become even more critical.

If you're looking for more ways to save money on AI API calls, check out the upcoming post on caching AI responses — combining caching with the token budget system here can cut your costs by 50-80%.


Vadim Alakhverdov

Software developer writing about JavaScript, web development, and developer tools.
