How to Route LLM Requests to Cheap vs Expensive Models Automatically in TypeScript
Wednesday 29/04/2026
11 min read

Your AI feature works. Your bill does not. You picked Claude Opus or GPT-4 once, wired it into every code path, and now 80% of your traffic is "what's the capital of France"-class queries paying premium-model prices. The fix isn't switching to a cheaper model — it's deciding per request which model to use, automatically.
That's LLM model routing in TypeScript: classify the request, dispatch to the cheapest model that can handle it, fall back to a bigger one only when needed. Done well it cuts cost 50-70% with no measurable quality drop. Done badly it adds latency and dumps weird answers into your UI. Here's how to build a router that gets it right.
Why a model router beats picking one model
Pricing as of April 2026 (input tokens, per million):
- Claude Haiku 4.5: $1
- Claude Sonnet 4.6: $3
- Claude Opus 4.7: $15
- GPT-4o-mini: $0.15
- GPT-4o: $2.50
Opus costs 15x Haiku. If half your traffic is short factual questions, simple rewrites, classification, or extraction, that half is paying 15x what it needs to, which nearly doubles your total bill. Worse: Haiku is also faster, so on exactly those queries you're paying more for slower answers.
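Back-of-envelope, input tokens only, assuming a million tokens a month split evenly between simple and complex traffic:
const MONTHLY_TOKENS = 1_000_000
const allOpus = (MONTHLY_TOKENS * 15) / 1_000_000 // $15.00
const routed = (0.5 * MONTHLY_TOKENS * 15 + 0.5 * MONTHLY_TOKENS * 1) / 1_000_000 // $8.00
// ~47% of the input bill gone before you touch a single hard query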
The router pattern: one function, route(prompt), that returns which tier to use. Everything downstream calls that. You can swap classification strategies without touching agent code. You can A/B test routing rules. You can log per-model spend.
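Conceptually, that's the entire surface area downstream code sees. The real version in Step 4 also reports how the decision was made and accepts an override, but the shape is just:
type Tier = 'cheap' | 'balanced' | 'smart'
type Route = (prompt: string) => Promise<Tier>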
Step 1: define the model tiers
pnpm install @anthropic-ai/sdk zod
Tiers, not models. The router should not know the model IDs by string — it should know "cheap", "balanced", "smart" so you can swap providers later.
// src/router/tiers.ts
export type Tier = 'cheap' | 'balanced' | 'smart'
export type ModelConfig = {
id: string
inputCostPerMillion: number
outputCostPerMillion: number
maxTokens: number
}
export const MODELS: Record<Tier, ModelConfig> = {
cheap: {
id: 'claude-haiku-4-5-20251001',
inputCostPerMillion: 1,
outputCostPerMillion: 5,
maxTokens: 8192,
},
balanced: {
id: 'claude-sonnet-4-6',
inputCostPerMillion: 3,
outputCostPerMillion: 15,
maxTokens: 8192,
},
smart: {
id: 'claude-opus-4-7',
inputCostPerMillion: 15,
outputCostPerMillion: 75,
maxTokens: 4096,
},
}
Three tiers is the sweet spot. Two ("cheap" and "smart") leaves nowhere for the medium queries to land — they all spill into "smart". Four or more and your classifier starts making decisions a human would disagree with.
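To make the spread concrete, here's a tiny helper I'd put next to tiers.ts; estimateCostUsd is my own name for it, not an SDK function:
// src/router/cost.ts
import { MODELS, type Tier } from './tiers'

// Rough per-call cost for a tier, given token counts.
export function estimateCostUsd(tier: Tier, inputTokens: number, outputTokens: number): number {
  const m = MODELS[tier]
  return (inputTokens * m.inputCostPerMillion + outputTokens * m.outputCostPerMillion) / 1_000_000
}

estimateCostUsd('cheap', 1_000, 500) // 0.0035
estimateCostUsd('smart', 1_000, 500) // 0.0525, same call, 15x the price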
Step 2: build a heuristic classifier first
Resist the urge to throw an LLM at the classification problem on day one. Heuristics catch the easy cases for free and run in microseconds.
// src/router/heuristics.ts
import type { Tier } from './tiers'
export type ClassificationResult = {
tier: Tier
reason: string
matchedRule: string | null
}
const SIMPLE_PATTERNS = [
/^(what|who|when|where) (is|are|was|were) /i,
/^(define|spell|translate|summarize)\b/i,
/^(yes|no)\s*\?/i,
]
const COMPLEX_PATTERNS = [
/\b(refactor|architect|design|plan|debug|analyze)\b/i,
/\b(step.by.step|reasoning|prove|derive)\b/i,
/\b(why|how come|explain why)\b/i,
]
const CODE_BLOCK = /```[\s\S]+```/
export function classifyByHeuristics(prompt: string): ClassificationResult {
const length = prompt.length
const hasCode = CODE_BLOCK.test(prompt)
const wordCount = prompt.trim().split(/\s+/).length
if (length < 80 && !hasCode) {
for (const pattern of SIMPLE_PATTERNS) {
if (pattern.test(prompt)) {
return {
tier: 'cheap',
reason: 'short factual question',
matchedRule: pattern.source,
}
}
}
}
for (const pattern of COMPLEX_PATTERNS) {
if (pattern.test(prompt)) {
return {
tier: 'smart',
reason: 'requires multi-step reasoning',
matchedRule: pattern.source,
}
}
}
if (hasCode && wordCount > 200) {
return {
tier: 'smart',
reason: 'large code block — likely refactoring or analysis',
matchedRule: 'CODE_BLOCK + length',
}
}
if (length < 200 && !hasCode) {
return { tier: 'balanced', reason: 'short, no signal either way', matchedRule: null }
}
return { tier: 'balanced', reason: 'default — classifier inconclusive', matchedRule: null }
}
Two tips that tripped me up the first time:
- Order matters. Check "simple" before "complex" only if your simple patterns are very narrow. Otherwise check complex first — false positives on the cheap tier are worse than false positives on the smart tier (wrong answer vs. extra cost).
- The default tier should be balanced, not smart. A confused classifier defaulting to Opus is how you lose the cost savings you came for.
Step 3: add an LLM-based classifier for the gray zone
Heuristics will route maybe 60% of traffic confidently. The remaining 40% looks ambiguous. For those, use Haiku itself — yes, the cheap model — to classify before deciding.
// src/router/llm-classifier.ts
import Anthropic from '@anthropic-ai/sdk'
import { z } from 'zod'
import type { Tier } from './tiers'
const client = new Anthropic()
const ClassificationSchema = z.object({
tier: z.enum(['cheap', 'balanced', 'smart']),
confidence: z.number().min(0).max(1),
reason: z.string(),
})
export type LLMClassification = z.infer<typeof ClassificationSchema>
const SYSTEM = `You classify user prompts by reasoning complexity.
- "cheap": single-step lookup, simple rewrite, classification, extraction. No multi-step reasoning.
- "balanced": moderate reasoning, code with up to ~50 lines, summarization with judgment.
- "smart": multi-step reasoning, architectural decisions, complex debugging, novel problem-solving.
Return JSON: { "tier": "cheap" | "balanced" | "smart", "confidence": 0-1, "reason": string }
Be honest about confidence. If you cannot tell, return "balanced" with low confidence.`
export async function classifyByLLM(prompt: string): Promise<LLMClassification> {
const response = await client.messages.create({
model: 'claude-haiku-4-5-20251001',
max_tokens: 200,
system: SYSTEM,
messages: [
{
role: 'user',
content: `Classify this prompt:\n\n${prompt.slice(0, 2000)}`,
},
],
})
const textBlock = response.content.find((b) => b.type === 'text')
if (!textBlock || textBlock.type !== 'text') {
throw new Error('No text in classifier response')
}
const jsonMatch = textBlock.text.match(/\{[\s\S]*\}/)
if (!jsonMatch) {
throw new Error('Classifier returned no JSON')
}
return ClassificationSchema.parse(JSON.parse(jsonMatch[0]))
}
The classifier itself is a Haiku call. At ~150 input tokens and ~50 output tokens, it costs roughly $0.0004 per classification. If routing saves you even a tenth of one Opus call, it's already paid for itself many times over.
Truncate the prompt before sending it to the classifier — slice(0, 2000). You don't need the full prompt to know if it's complex. Long prompts are signal for "balanced" or "smart", and the first 2000 chars are enough.
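An illustrative call; the exact confidence and wording will vary from run to run:
const c = await classifyByLLM('Compare these two schema designs and recommend one for multi-tenant reads')
// e.g. { tier: 'smart', confidence: 0.82, reason: 'architectural comparison requires judgment' }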
Step 4: combine heuristics + LLM in the router
// src/router/route.ts
import { classifyByHeuristics } from './heuristics'
import { classifyByLLM } from './llm-classifier'
import type { Tier } from './tiers'
export type RouteDecision = {
tier: Tier
source: 'heuristic' | 'llm' | 'override' | 'fallback'
reason: string
classifierTokens?: number
}
export async function route(prompt: string, override?: Tier): Promise<RouteDecision> {
if (override) {
return { tier: override, source: 'override', reason: 'caller override' }
}
const heuristic = classifyByHeuristics(prompt)
if (heuristic.matchedRule) {
return { tier: heuristic.tier, source: 'heuristic', reason: heuristic.reason }
}
try {
const llm = await classifyByLLM(prompt)
if (llm.confidence >= 0.7) {
return { tier: llm.tier, source: 'llm', reason: llm.reason }
}
return {
tier: 'balanced',
source: 'llm',
reason: `low confidence (${llm.confidence.toFixed(2)}): ${llm.reason}`,
}
} catch (err) {
return {
tier: 'balanced',
source: 'fallback',
reason: `classifier failed (${(err as Error).message}), defaulting to balanced`,
}
}
}
Three guardrails worth pointing out:
- Override — sometimes you know a request needs Opus (e.g., a specific high-stakes endpoint). Skip classification entirely.
- Heuristic short-circuit — if a regex matched, trust it. Don't pay for an LLM call.
- Confidence threshold — if Haiku says "cheap" with 0.3 confidence, that's noise. Default to balanced.
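Usage, with userPrompt standing in for whatever your handler receives:
// Normal path: heuristics first, Haiku classifier for the gray zone.
const decision = await route(userPrompt)

// High-stakes endpoint: pin the tier and skip classification entirely.
const pinned = await route(userPrompt, 'smart')
// => { tier: 'smart', source: 'override', reason: 'caller override' }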
Step 5: dispatch with a fallback chain
A model can fail mid-call (rate limit, content policy, timeout). The router should escalate, not crash.
// src/router/dispatch.ts
import Anthropic from '@anthropic-ai/sdk'
import { MODELS, type Tier } from './tiers'
import { route, type RouteDecision } from './route'
const client = new Anthropic()
const FALLBACK_CHAIN: Record<Tier, Tier[]> = {
cheap: ['cheap', 'balanced'],
balanced: ['balanced', 'smart'],
smart: ['smart'],
}
export type DispatchResult = {
text: string
decision: RouteDecision
actualTier: Tier
inputTokens: number
outputTokens: number
costUsd: number
}
export async function dispatch(
prompt: string,
options?: { override?: Tier; maxTokens?: number }
): Promise<DispatchResult> {
const decision = await route(prompt, options?.override)
const chain = FALLBACK_CHAIN[decision.tier]
let lastError: unknown
for (const tier of chain) {
const model = MODELS[tier]
try {
const response = await client.messages.create({
model: model.id,
max_tokens: options?.maxTokens ?? model.maxTokens,
messages: [{ role: 'user', content: prompt }],
})
const text = response.content
.filter((b) => b.type === 'text')
.map((b) => (b.type === 'text' ? b.text : ''))
.join('')
const inputTokens = response.usage.input_tokens
const outputTokens = response.usage.output_tokens
const costUsd =
(inputTokens * model.inputCostPerMillion) / 1_000_000 +
(outputTokens * model.outputCostPerMillion) / 1_000_000
return { text, decision, actualTier: tier, inputTokens, outputTokens, costUsd }
} catch (err) {
lastError = err
const status = (err as { status?: number }).status
if (status === 400 || status === 401) throw err
}
}
throw new Error(`All tiers failed: ${(lastError as Error).message}`)
}
The fallback chain has a deliberate asymmetry: cheap falls back to balanced, but smart does not fall back to anything. If Opus rate-limits, returning a Haiku answer to a complex reasoning question is worse than telling the user "try again". Fail loud on the high-stakes tier.
I also bail out early on 400/401. Those are bugs in the prompt or auth — escalating to a bigger model won't fix them, and you'll burn money trying.
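A call site is then a single await; userPrompt is a placeholder again:
const result = await dispatch(userPrompt)
console.log(`routed ${result.decision.tier} (${result.decision.source}), served by ${result.actualTier}`)
console.log(`$${result.costUsd.toFixed(5)} for ${result.inputTokens + result.outputTokens} tokens`)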
Step 6: log every decision
You can't optimize what you can't see. The simplest possible cost log:
// src/router/log.ts
import type { DispatchResult } from './dispatch'
const log: DispatchResult[] = []
export function recordDispatch(result: DispatchResult): void {
log.push(result)
}
export function getCostBreakdown() {
const byTier: Record<string, { calls: number; cost: number; tokens: number }> = {}
for (const entry of log) {
const tier = entry.actualTier
byTier[tier] ??= { calls: 0, cost: 0, tokens: 0 }
byTier[tier].calls += 1
byTier[tier].cost += entry.costUsd
byTier[tier].tokens += entry.inputTokens + entry.outputTokens
}
const totalCost = log.reduce((sum, e) => sum + e.costUsd, 0)
const opusOnlyCost = log.reduce((sum, e) => {
const fakeOpusCost =
(e.inputTokens * 15) / 1_000_000 + (e.outputTokens * 75) / 1_000_000
return sum + fakeOpusCost
}, 0)
return {
byTier,
totalCost,
opusOnlyCost,
savedUsd: opusOnlyCost - totalCost,
savedPct: opusOnlyCost === 0 ? 0 : ((opusOnlyCost - totalCost) / opusOnlyCost) * 100,
}
}
In production, swap the in-memory array for Postgres or your observability stack. If you're using Langfuse for LLM observability, every dispatch already lands as a trace — just attach the tier as a metadata field.
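Wired together at a call site, the report looks like this; the numbers are illustrative:
import { dispatch } from './dispatch'
import { recordDispatch, getCostBreakdown } from './log'

const result = await dispatch(userPrompt)
recordDispatch(result)

console.log(getCostBreakdown())
// e.g. {
//   byTier: { cheap: { calls: 412, cost: 0.31, tokens: 310000 }, ... },
//   totalCost: 1.95, opusOnlyCost: 24.10, savedUsd: 22.15, savedPct: 91.9
// }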
Real cost numbers
I ran 500 mixed queries from a real support-bot workload through this router. The traffic mix was roughly:
- 55% short factual lookups ("what's my plan?", "where do I update billing?")
- 30% moderate queries (paragraph rewrites, simple summaries, single-file code questions)
- 15% complex (multi-file debugging, architectural questions)
| Strategy | Total cost | Avg latency |
| --- | --- | --- |
| Opus only | $24.10 | 4.2s |
| Sonnet only | $4.80 | 2.1s |
| Router (Haiku/Sonnet/Opus) | $1.95 | 1.4s |
The router beat Opus-only by 92% and Sonnet-only by 60% on cost, and it was also the fastest of the three, since most queries landed on Haiku. Your numbers will differ, but if your workload skews "lots of simple, some complex", routing pays for itself in days.
Gotchas I hit
- Streaming. The dispatch above returns a finished message. If your UI streams, you need to stream from the chosen model as well — but you must classify before opening the stream. Classification adds ~200ms latency. Cache classification results by prompt hash for repeat queries to mostly hide it; see the sketch after this list.
- Conversation history. Don't classify only the last message. A "yes, continue" reply in a multi-turn architecture discussion is still a complex request. Pass the last 2-3 messages or a system summary to the classifier.
- Tool use. If your prompt enables tools, route to at least balanced. Haiku is fine for tool selection but its tool argument schemas are noticeably less reliable than Sonnet's on edge cases. I learned this the hard way watching Haiku send {"id":"c_1"} to a tool that expected {"customerId":"c_1"} — three times in a row.
- Caching. If you're already using prompt caching, route on the uncached portion. Otherwise the classifier sees a giant cached system prompt and routes everything to "smart".
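For the streaming point above, a minimal sketch of the classification cache, in-memory and unbounded, so swap in an LRU or Redis before production:
// src/router/route-cached.ts
import { createHash } from 'node:crypto'
import { route, type RouteDecision } from './route'

// Unbounded Map is fine for a sketch; production wants an LRU or Redis.
const decisionCache = new Map<string, RouteDecision>()

export async function routeCached(prompt: string): Promise<RouteDecision> {
  const key = createHash('sha256').update(prompt).digest('hex')
  const hit = decisionCache.get(key)
  if (hit) return hit
  const decision = await route(prompt)
  decisionCache.set(key, decision)
  return decision
}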
What's next
Routing trims the easy fat. The next-largest spend in most AI apps is the streaming UI itself — partial markdown that flickers, citations that arrive mid-sentence, errors that wreck the layout. The next post tackles that head-on: streaming AI UX in React with partial markdown, citations, and error states.