The Real Cost of Running an AI Feature in Production (With Math)
Wednesday 25/02/2026
You're about to ship an AI feature. Your prototype works, the demo impressed your team, and now someone asks: "What will this cost us at 10,000 users?" You open the pricing page, see "per million tokens," and realize you have no idea how to turn that into a monthly bill. The cost of an AI API in a production app is one of those things that's easy to hand-wave and painful to get wrong.
I've shipped AI features that cost $3/month and ones that burned through $200/day before anyone noticed. The difference was understanding the token math before deploying, not after. Here's exactly how to estimate costs for real use cases, with a TypeScript calculator you can drop into your project.
How AI API pricing actually works
Both Anthropic and OpenAI charge per token — roughly 1 token per 4 characters of English text, or about 0.75 tokens per word. But there's a critical detail: input tokens and output tokens have different prices.
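That 4-characters rule of thumb is good enough for back-of-envelope estimates. Here's a throwaway helper based on it (for exact counts, use the provider's tokenizer or token-counting endpoint; this is only an approximation):

```typescript
// Rough token estimate: ~4 characters of English per token.
// Only an approximation; real tokenizers vary by model and language.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

estimateTokens('How do I reset my password?') // 7 tokens (27 chars / 4, rounded up)
```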
Here's the current pricing for the models you're most likely using (as of early 2026):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|----------------------|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
Output tokens cost 4-5x more than input tokens. This matters a lot because different features have very different input/output ratios. A summarizer sends lots of input and gets little output. A chatbot conversation accumulates input over time as the context window fills up.
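To make the asymmetry concrete, here's the arithmetic for one hypothetical request: 3,000 input tokens and 300 output tokens on Claude 3.5 Sonnet.

```typescript
// One request on Claude 3.5 Sonnet: $3.00/M input, $15.00/M output.
const inputCost = (3_000 / 1_000_000) * 3.0 // $0.009
const outputCost = (300 / 1_000_000) * 15.0 // $0.0045
const requestCost = inputCost + outputCost // $0.0135 per request
```

The output is a tenth the size of the input but contributes a third of the cost.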
Token math for four common AI features
Let's do the math for four real use cases, each at 10,000 monthly active users.
1. Customer support chatbot
Assumptions: average 5 messages per conversation, 3 conversations per user per month, system prompt of ~500 tokens, each user message ~50 tokens, each assistant response ~200 tokens.
// src/lib/cost-calculator.ts
interface UsageScenario {
name: string
inputTokensPerRequest: number
outputTokensPerRequest: number
requestsPerUserPerMonth: number
monthlyActiveUsers: number
}
interface CostEstimate {
name: string
totalInputTokens: number
totalOutputTokens: number
inputCost: number
outputCost: number
totalCost: number
costPerUser: number
}
function estimateMonthlyCost(
scenario: UsageScenario,
inputPricePerMillion: number,
outputPricePerMillion: number
): CostEstimate {
const totalRequests = scenario.requestsPerUserPerMonth * scenario.monthlyActiveUsers
const totalInputTokens = scenario.inputTokensPerRequest * totalRequests
const totalOutputTokens = scenario.outputTokensPerRequest * totalRequests
const inputCost = (totalInputTokens / 1_000_000) * inputPricePerMillion
const outputCost = (totalOutputTokens / 1_000_000) * outputPricePerMillion
return {
name: scenario.name,
totalInputTokens,
totalOutputTokens,
inputCost: Math.round(inputCost * 100) / 100,
outputCost: Math.round(outputCost * 100) / 100,
totalCost: Math.round((inputCost + outputCost) * 100) / 100,
costPerUser:
Math.round(((inputCost + outputCost) / scenario.monthlyActiveUsers) * 10000) / 10000,
}
}
Now let's run the chatbot scenario. The tricky part: in a multi-turn conversation, you re-send the full history on every turn. By message 5, you're sending the system prompt + all previous messages as input.
// src/lib/scenarios.ts
function calculateChatbotTokens(
systemPromptTokens: number,
avgUserMsgTokens: number,
avgAssistantMsgTokens: number,
turnsPerConversation: number
): { avgInputPerRequest: number; avgOutputPerRequest: number } {
let totalInput = 0
for (let turn = 1; turn <= turnsPerConversation; turn++) {
const historyTokens =
(turn - 1) * (avgUserMsgTokens + avgAssistantMsgTokens)
totalInput += systemPromptTokens + historyTokens + avgUserMsgTokens
}
return {
avgInputPerRequest: Math.round(totalInput / turnsPerConversation),
avgOutputPerRequest: avgAssistantMsgTokens,
}
}
const chatbot = calculateChatbotTokens(500, 50, 200, 5)
// avgInputPerRequest: 1050, avgOutputPerRequest: 200
const chatbotScenario: UsageScenario = {
name: 'Customer Support Chatbot',
inputTokensPerRequest: chatbot.avgInputPerRequest, // 1050
outputTokensPerRequest: chatbot.avgOutputPerRequest, // 200
requestsPerUserPerMonth: 15, // 5 turns × 3 conversations
monthlyActiveUsers: 10_000,
}
Result with Claude 3.5 Sonnet: ~$923/month ($0.092/user). With Haiku: ~$246/month ($0.025/user).
That context window accumulation is the gotcha. A 5-turn conversation doesn't cost 5x a single turn — it costs more, because each turn re-sends the growing history.
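A quick sanity check of that claim, using the same assumptions (500-token system prompt, 50-token user messages, 200-token replies):

```typescript
// Input per turn = system prompt + accumulated history + new user message.
// Each turn re-sends 250 more tokens (one user/assistant pair) than the last.
const perTurnInput = [1, 2, 3, 4, 5].map(
  (turn) => 500 + (turn - 1) * (50 + 200) + 50
)
// perTurnInput: [550, 800, 1050, 1300, 1550]
const totalInput = perTurnInput.reduce((a, b) => a + b, 0) // 5250 tokens
// vs. 5 independent single turns: 5 * 550 = 2750 tokens (nearly 2x more)
```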
2. Document summarization
Assumptions: users upload documents averaging 3,000 tokens (a few pages of text), the summary output is ~300 tokens, each user summarizes 10 documents per month.
const summarization: UsageScenario = {
name: 'Document Summarization',
inputTokensPerRequest: 3200, // doc + system prompt
outputTokensPerRequest: 300,
requestsPerUserPerMonth: 10,
monthlyActiveUsers: 10_000,
}
Result with Claude 3.5 Sonnet: $1,410/month. Heavy on input, light on output. This is where a cheaper model like Haiku ($376/month) makes a huge difference — summarization quality is often good enough with smaller models.
3. RAG search (retrieval-augmented generation)
Assumptions: 5 retrieved chunks of ~500 tokens each in the prompt, user query ~30 tokens, response ~400 tokens, 20 queries per user per month.
const ragSearch: UsageScenario = {
name: 'RAG Search',
inputTokensPerRequest: 2730, // system prompt (200) + 5 chunks (2500) + query (30)
outputTokensPerRequest: 400,
requestsPerUserPerMonth: 20,
monthlyActiveUsers: 10_000,
}
Result with Claude 3.5 Sonnet: ~$2,838/month. RAG is expensive because you're stuffing the context window with retrieved chunks on every request. If you can reduce from 5 chunks to 3 without hurting quality (often possible with better embeddings), input drops to 1,730 tokens per request and the total falls to ~$2,240/month, at which point the 400-token outputs account for most of the bill.
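Running the 3-chunk variant through the same arithmetic:

```typescript
// RAG with 3 chunks: system (200) + 3 chunks (1,500) + query (30) = 1,730 input tokens.
const requests = 20 * 10_000 // 200,000 requests/month
const inputCost = ((1_730 * requests) / 1_000_000) * 3.0 // $1,038
const outputCost = ((400 * requests) / 1_000_000) * 15.0 // $1,200
const total = inputCost + outputCost // ~$2,238: output tokens now dominate
```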
4. AI code review
Assumptions: a PR diff averages 2,000 tokens, system prompt with review instructions is 800 tokens, review output is 600 tokens, 50 PRs per month across the team.
const codeReview: UsageScenario = {
name: 'AI Code Review',
inputTokensPerRequest: 2800,
outputTokensPerRequest: 600,
requestsPerUserPerMonth: 50, // per-team, not per-user
monthlyActiveUsers: 1, // treating the team as one "user"
}
Result with Claude 3.5 Sonnet: $0.87/month. Code review is dirt cheap because the volume is low. Even at 500 PRs/month, you're under $10. This is the kind of feature where AI costs are a rounding error.
A cost comparison runner
Here's how to plug all the scenarios together and print a comparison table:
// src/lib/run-cost-estimate.ts
const CLAUDE_SONNET = { input: 3.0, output: 15.0 }
const CLAUDE_HAIKU = { input: 0.8, output: 4.0 }
const scenarios: UsageScenario[] = [
chatbotScenario,
summarization,
ragSearch,
codeReview,
]
function printCostTable(
scenarios: UsageScenario[],
modelName: string,
pricing: { input: number; output: number }
): void {
console.log(`\n=== ${modelName} ===`)
console.log(
'Feature'.padEnd(30),
'Monthly Cost'.padStart(14),
'Per User'.padStart(10)
)
console.log('-'.repeat(56))
for (const scenario of scenarios) {
const estimate = estimateMonthlyCost(
scenario,
pricing.input,
pricing.output
)
console.log(
estimate.name.padEnd(30),
`$${estimate.totalCost.toFixed(2)}`.padStart(14),
`$${estimate.costPerUser.toFixed(4)}`.padStart(10)
)
}
}
printCostTable(scenarios, 'Claude 3.5 Sonnet', CLAUDE_SONNET)
printCostTable(scenarios, 'Claude 3.5 Haiku', CLAUDE_HAIKU)
Five ways to cut your AI costs in half
The math above is your baseline. Here's how to bring it down.
1. Use the smallest model that works
This is the single biggest lever. Haiku costs ~70% less than Sonnet for most tasks. Run an eval on 50-100 real inputs — if the cheaper model scores within 5% of the expensive one, use it. You can always route complex queries to the bigger model and use the small one for everything else.
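A routing sketch along those lines. The model names are current Anthropic aliases; the complexity heuristics are placeholders you'd tune against your own eval set, not a recommendation:

```typescript
// Route simple queries to the cheap model, complex ones to the big one.
// The length threshold and keyword list below are illustrative only.
function pickModel(userInput: string): string {
  const looksComplex =
    userInput.length > 800 ||
    /\b(compare|analyze|debug|explain why)\b/i.test(userInput)
  return looksComplex ? 'claude-3-5-sonnet-latest' : 'claude-3-5-haiku-latest'
}
```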
2. Cache repeated prompts
If your system prompt is 500 tokens and you're sending 150,000 requests/month, that's 75 million tokens just for the same system prompt over and over. Anthropic's prompt caching stores repeated prompt prefixes server-side and charges 90% less for cached tokens. For a chatbot with a static system prompt, this alone can cut input costs by 30-40%.
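With the Anthropic SDK, you opt in by attaching a cache_control block to the static prefix. A sketch that builds the request params (cachedParams is a name I'm introducing here; systemPrompt and userInput are assumed from context):

```typescript
// Build Messages API params with the static system prompt marked as a
// cacheable prefix via Anthropic prompt caching (cache_control block).
function cachedParams(systemPrompt: string, userInput: string) {
  return {
    model: 'claude-3-5-haiku-latest',
    max_tokens: 400,
    system: [
      {
        type: 'text',
        text: systemPrompt, // static prefix, cached across requests
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: userInput }],
  }
}

// Usage: await client.messages.create(cachedParams(systemPrompt, userInput))
```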
3. Trim your context window
For RAG, fewer chunks means fewer input tokens. Improve your retrieval quality (better embeddings, re-ranking) so you can send 3 high-quality chunks instead of 5 mediocre ones. For chatbots, summarize old conversation turns instead of sending the full history every time.
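One way to implement the chatbot side is to keep only the last few turns verbatim and fold everything older into a single summary message. A minimal sketch (Turn and trimHistory are names I'm inventing; the call that produces the summary text is not shown):

```typescript
interface Turn {
  role: 'user' | 'assistant'
  content: string
}

// Keep the most recent turns verbatim; replace older turns with one summary
// message so input tokens stop growing linearly with conversation length.
function trimHistory(history: Turn[], keepLastTurns: number, summary: string): Turn[] {
  if (history.length <= keepLastTurns) return history
  return [
    { role: 'user', content: `Summary of the conversation so far: ${summary}` },
    ...history.slice(-keepLastTurns),
  ]
}
```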
4. Set max output tokens
If your summarizer only needs 300 tokens, set max_tokens: 400 — not 4,096. This won't save money directly (you only pay for tokens generated), but it prevents runaway responses that burn through output tokens. More importantly, it keeps your latency predictable.
// src/lib/claude-client.ts
import Anthropic from '@anthropic-ai/sdk'
const client = new Anthropic() // reads ANTHROPIC_API_KEY from the environment
const response = await client.messages.create({
model: 'claude-3-5-haiku-latest',
max_tokens: 400,
system: systemPrompt,
messages: [{ role: 'user', content: userInput }],
})
5. Add a cost tracking middleware
You can't optimize what you don't measure. Wrap your API calls with a tracker that logs token usage per feature, per user, and per day. The Anthropic SDK returns usage.input_tokens and usage.output_tokens on every response — use them.
// src/lib/cost-tracker.ts
interface UsageRecord {
feature: string
inputTokens: number
outputTokens: number
model: string
timestamp: Date
estimatedCost: number
}
const usageLog: UsageRecord[] = []
function trackUsage(
feature: string,
model: string,
inputTokens: number,
outputTokens: number,
pricing: { input: number; output: number }
): void {
const cost =
(inputTokens / 1_000_000) * pricing.input +
(outputTokens / 1_000_000) * pricing.output
usageLog.push({
feature,
inputTokens,
outputTokens,
model,
timestamp: new Date(),
estimatedCost: Math.round(cost * 1_000_000) / 1_000_000,
})
}
function getDailyCost(): number {
const today = new Date().toDateString()
return usageLog
.filter((r) => r.timestamp.toDateString() === today)
.reduce((sum, r) => sum + r.estimatedCost, 0)
}
function getCostByFeature(): Record<string, number> {
const costs: Record<string, number> = {}
for (const record of usageLog) {
costs[record.feature] = (costs[record.feature] || 0) + record.estimatedCost
}
return costs
}
Wire this into your API wrapper and check the dashboard weekly. You'll be surprised which features are actually expensive — it's rarely the one you'd guess.
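If you want more than a weekly glance, you can alert when the day's spend crosses a budget. A minimal sketch; checkDailyBudget and the alert hook are hypothetical names, and you'd feed it the result of getDailyCost():

```typescript
// Fire an alert when daily AI spend crosses a budget threshold.
// Pair with a getDailyCost()-style aggregate over your usage log.
function checkDailyBudget(
  dailyCostUsd: number,
  budgetUsd: number,
  alert: (message: string) => void
): void {
  if (dailyCostUsd >= budgetUsd) {
    alert(`AI spend is $${dailyCostUsd.toFixed(2)} today; daily budget is $${budgetUsd}`)
  }
}
```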
The bottom line
For most apps, the cost of AI API calls in production is somewhere between "barely noticeable" and "significant but manageable." A chatbot at 10K users on Sonnet runs ~$900/month. Switch to Haiku and add prompt caching, and you're down to roughly $200. A low-volume feature like code review costs less than your morning coffee.
The dangerous zone is features with high per-request token counts and high volume — RAG search, long-document analysis, anything that stuffs the context window. That's where the math matters most, and where choosing the right model and trimming your prompts pays for itself immediately.
Run the calculator above with your actual numbers before you ship. Future you will be grateful.
What's next
Once you know what your AI feature costs, the next step is making sure you're not paying for the same answer twice. Up next: How to Cache AI Responses Without Breaking Your App — strategies for caching LLM responses with TypeScript implementations using Redis and in-memory options.