How to Test AI Features: Unit Testing LLM-Powered Code
Wednesday 04/03/2026
You shipped an AI feature. It works great — until it doesn't. A user reports that your summarizer started returning one-word answers. Your code review bot is approving everything. Your RAG chatbot confidently cites documents that don't exist. You have no tests, so you have no idea when this broke or why.
Testing AI features feels impossible because the output is non-deterministic. Ask Claude the same question twice and you'll get two different answers. Traditional expect(result).toBe("exact string") assertions are useless. But "it's non-deterministic" isn't an excuse to skip testing — it just means you need different patterns. Here's how to test AI features in TypeScript using Vitest, with real strategies that catch real bugs.
The testing pyramid for AI features
Before writing any tests, you need a mental model. AI code has three testable layers:
- Deterministic logic — prompt construction, response parsing, token counting, caching keys. These are regular functions. Test them like regular functions.
- API integration — does your code send the right request and handle the response correctly? Mock the API and test the glue code.
- Output quality — does the AI actually return useful answers? This is the hard part. Use eval frameworks and assertion heuristics.
Most teams skip layers 1 and 2 because they're fixated on layer 3. That's backwards. Layers 1 and 2 catch 80% of bugs with normal unit tests.
Setting up the project
pnpm add -D vitest @vitest/coverage-v8 tsx
pnpm add @anthropic-ai/sdk zod
// vitest.config.ts
import { defineConfig } from 'vitest/config'
export default defineConfig({
test: {
globals: true,
environment: 'node',
coverage: {
provider: 'v8',
reporter: ['text', 'json-summary'],
},
},
})
Add the test script to your package.json:
// package.json (scripts section)
{
"scripts": {
"test": "vitest run",
"test:watch": "vitest"
}
}
Layer 1: Testing deterministic logic
This is the biggest win with the least effort. Your AI feature probably has functions that build prompts, parse responses, and manage context. These are pure functions. Test them.
// src/lib/prompt-builder.ts
export interface SummaryOptions {
maxLength: number
style: 'bullet-points' | 'paragraph' | 'tldr'
language: string
}
export function buildSummaryPrompt(document: string, options: SummaryOptions): string {
const styleInstructions: Record<SummaryOptions['style'], string> = {
'bullet-points': 'Use concise bullet points. Each bullet should be one sentence.',
paragraph: 'Write 2-3 short paragraphs.',
tldr: 'Write a single sentence summary.',
}
return [
`Summarize the following document in ${options.language}.`,
styleInstructions[options.style],
`Keep the summary under ${options.maxLength} words.`,
'',
'<document>',
document,
'</document>',
].join('\n')
}
export function extractJsonFromResponse(response: string): unknown {
const jsonMatch = response.match(/```json\s*([\s\S]*?)```/)
if (jsonMatch) {
return JSON.parse(jsonMatch[1].trim())
}
// Try parsing the whole response as JSON
return JSON.parse(response)
}
export function countTokensEstimate(text: string): number {
// Rough estimate: 1 token ≈ 4 characters for English text
return Math.ceil(text.length / 4)
}
The tests are straightforward:
// src/lib/__tests__/prompt-builder.test.ts
import { describe, it, expect } from 'vitest'
import {
buildSummaryPrompt,
extractJsonFromResponse,
countTokensEstimate,
} from '../prompt-builder'
describe('buildSummaryPrompt', () => {
it('includes the document in XML tags', () => {
const prompt = buildSummaryPrompt('Hello world', {
maxLength: 100,
style: 'paragraph',
language: 'English',
})
expect(prompt).toContain('<document>')
expect(prompt).toContain('Hello world')
expect(prompt).toContain('</document>')
})
it('sets the correct style instructions', () => {
const prompt = buildSummaryPrompt('doc', {
maxLength: 50,
style: 'bullet-points',
language: 'English',
})
expect(prompt).toContain('bullet points')
})
it('includes the word limit', () => {
const prompt = buildSummaryPrompt('doc', {
maxLength: 200,
style: 'tldr',
language: 'Spanish',
})
expect(prompt).toContain('200 words')
expect(prompt).toContain('Spanish')
})
})
describe('extractJsonFromResponse', () => {
it('extracts JSON from markdown code blocks', () => {
const response = 'Here is the data:\n```json\n{"name": "test"}\n```'
expect(extractJsonFromResponse(response)).toEqual({ name: 'test' })
})
it('parses raw JSON responses', () => {
expect(extractJsonFromResponse('{"count": 42}')).toEqual({ count: 42 })
})
it('throws on invalid JSON', () => {
expect(() => extractJsonFromResponse('not json at all')).toThrow()
})
})
describe('countTokensEstimate', () => {
it('estimates tokens roughly as length / 4', () => {
const text = 'a'.repeat(100)
expect(countTokensEstimate(text)).toBe(25)
})
it('rounds up for partial tokens', () => {
expect(countTokensEstimate('hello')).toBe(2) // 5 / 4 = 1.25 → 2
})
})
Nothing fancy. These tests run in milliseconds, cost nothing, and catch the most common bugs: prompt format changes that silently break your AI feature.
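Cache-key generation, listed under layer 1 earlier, deserves the same treatment: an unstable key silently kills your cache hit rate, and a key that omits an option serves stale summaries when that option changes. Here is a sketch of what such a helper might look like (summaryCacheKey and its field set are illustrative assumptions, not code from this project):

```typescript
// src/lib/cache-key.ts (hypothetical)
import { createHash } from 'node:crypto'

export interface CacheKeyInput {
  document: string
  maxLength: number
  style: string
  language: string
}

// Serialize the fields in a fixed order so the key is stable across runs,
// then hash so the key length is bounded regardless of document size.
export function summaryCacheKey(input: CacheKeyInput): string {
  const canonical = JSON.stringify([
    input.document,
    input.maxLength,
    input.style,
    input.language,
  ])
  return createHash('sha256').update(canonical).digest('hex')
}
```

The properties worth asserting are exactly the deterministic ones: the same input always yields the same key, changing any single field yields a different key, and the key is a fixed-length hex string.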
Layer 2: Mocking the AI API
Here's where most developers get stuck. You don't want to call Claude in your unit tests — it's slow, expensive, and non-deterministic. Mock it.
// src/lib/summarizer.ts
import Anthropic from '@anthropic-ai/sdk'
import { buildSummaryPrompt, SummaryOptions } from './prompt-builder'
export interface SummaryResult {
summary: string
model: string
inputTokens: number
outputTokens: number
}
export async function summarizeDocument(
client: Anthropic,
document: string,
options: SummaryOptions
): Promise<SummaryResult> {
const prompt = buildSummaryPrompt(document, options)
const response = await client.messages.create({
model: 'claude-sonnet-4-5-20250929',
max_tokens: 1024,
messages: [{ role: 'user', content: prompt }],
})
const textBlock = response.content.find((block) => block.type === 'text')
if (!textBlock || textBlock.type !== 'text') {
throw new Error('No text content in response')
}
return {
summary: textBlock.text,
model: response.model,
inputTokens: response.usage.input_tokens,
outputTokens: response.usage.output_tokens,
}
}
The key is dependency injection — pass the Anthropic client as a parameter instead of importing it directly. This makes mocking trivial:
// src/lib/__tests__/summarizer.test.ts
import { describe, it, expect, vi } from 'vitest'
import { summarizeDocument } from '../summarizer'
import Anthropic from '@anthropic-ai/sdk'
function createMockClient(responseText: string) {
return {
messages: {
create: vi.fn().mockResolvedValue({
content: [{ type: 'text', text: responseText }],
model: 'claude-sonnet-4-5-20250929',
usage: { input_tokens: 150, output_tokens: 50 },
role: 'assistant',
stop_reason: 'end_turn',
}),
},
} as unknown as Anthropic
}
describe('summarizeDocument', () => {
it('returns the summary text from Claude response', async () => {
const mockClient = createMockClient('This is a summary.')
const result = await summarizeDocument(mockClient, 'Long document...', {
maxLength: 100,
style: 'paragraph',
language: 'English',
})
expect(result.summary).toBe('This is a summary.')
})
it('passes the correct model and max_tokens', async () => {
const mockClient = createMockClient('Summary')
await summarizeDocument(mockClient, 'doc', {
maxLength: 50,
style: 'tldr',
language: 'English',
})
expect(mockClient.messages.create).toHaveBeenCalledWith(
expect.objectContaining({
model: 'claude-sonnet-4-5-20250929',
max_tokens: 1024,
})
)
})
it('includes token usage in the result', async () => {
const mockClient = createMockClient('Summary')
const result = await summarizeDocument(mockClient, 'doc', {
maxLength: 50,
style: 'tldr',
language: 'English',
})
expect(result.inputTokens).toBe(150)
expect(result.outputTokens).toBe(50)
})
it('throws when response has no text content', async () => {
const mockClient = {
messages: {
create: vi.fn().mockResolvedValue({
content: [],
model: 'claude-sonnet-4-5-20250929',
usage: { input_tokens: 10, output_tokens: 0 },
role: 'assistant',
stop_reason: 'end_turn',
}),
},
} as unknown as Anthropic
await expect(
summarizeDocument(mockClient, 'doc', {
maxLength: 50,
style: 'tldr',
language: 'English',
})
).rejects.toThrow('No text content in response')
})
})
Gotcha: mocking tool use responses. If your AI feature uses Claude's tool calling, the mock gets more complex. The response includes tool_use blocks, and your code needs to handle the tool call loop:
// src/lib/__tests__/tool-use-mock.test.ts
import { describe, it, expect, vi } from 'vitest'
function createToolUseMockClient() {
const create = vi
.fn()
// First call: Claude decides to use a tool
.mockResolvedValueOnce({
content: [
{
type: 'tool_use',
id: 'toolu_01ABC',
name: 'search_database',
input: { query: 'revenue Q4 2025' },
},
],
model: 'claude-sonnet-4-5-20250929',
usage: { input_tokens: 200, output_tokens: 80 },
role: 'assistant',
stop_reason: 'tool_use',
})
// Second call: Claude uses the tool result to answer
.mockResolvedValueOnce({
content: [{ type: 'text', text: 'Q4 2025 revenue was $2.1M.' }],
model: 'claude-sonnet-4-5-20250929',
usage: { input_tokens: 350, output_tokens: 30 },
role: 'assistant',
stop_reason: 'end_turn',
})
return { messages: { create } }
}
describe('tool use flow', () => {
it('handles the tool call loop correctly', async () => {
const mockClient = createToolUseMockClient()
// First call - gets tool_use response
const firstResponse = await mockClient.messages.create({})
expect(firstResponse.stop_reason).toBe('tool_use')
const toolBlock = firstResponse.content[0]
expect(toolBlock.type).toBe('tool_use')
// Second call - sends tool result, gets final answer
const secondResponse = await mockClient.messages.create({})
expect(secondResponse.content[0].type).toBe('text')
expect(secondResponse.content[0].text).toContain('$2.1M')
})
})
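The test above only proves the mock dispenses responses in order; the code you actually want covered is the loop that consumes them. A minimal version of that loop, sketched with structural types so it stands alone (runWithTools, the turn cap, and the tool_result shape are assumptions based on the Messages API, not code from this article's project):

```typescript
// Structural types standing in for the richer SDK types. This sketch
// omits error handling and streaming that production code would need.
interface ToolUseBlock { type: 'tool_use'; id: string; name: string; input: unknown }
interface TextBlock { type: 'text'; text: string }
type ContentBlock = ToolUseBlock | TextBlock

interface ModelResponse { content: ContentBlock[]; stop_reason: string }
interface MessageParam { role: 'user' | 'assistant'; content: unknown }

interface MinimalClient {
  messages: { create(params: { messages: MessageParam[] }): Promise<ModelResponse> }
}

export async function runWithTools(
  client: MinimalClient,
  initialMessages: MessageParam[],
  executeTool: (name: string, input: unknown) => Promise<string>
): Promise<string> {
  const messages = [...initialMessages]
  // Hard cap on turns so a misbehaving mock (or model) can't loop forever
  for (let turn = 0; turn < 5; turn++) {
    const response = await client.messages.create({ messages })
    if (response.stop_reason !== 'tool_use') {
      const text = response.content.find((b): b is TextBlock => b.type === 'text')
      return text?.text ?? ''
    }
    // Echo the assistant turn back, then answer each tool_use block
    // with a tool_result carrying the executor's output
    messages.push({ role: 'assistant', content: response.content })
    const results = await Promise.all(
      response.content
        .filter((b): b is ToolUseBlock => b.type === 'tool_use')
        .map(async (b) => ({
          type: 'tool_result',
          tool_use_id: b.id,
          content: await executeTool(b.name, b.input),
        }))
    )
    messages.push({ role: 'user', content: results })
  }
  throw new Error('Tool loop exceeded max turns')
}
```

Feed a function like this the mock from createToolUseMockClient and assert both the final text and that executeTool received the expected tool name and input.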
Layer 3: Testing output quality with assertions
This is the hard part. You can't expect(summary).toBe("exact text") because the model won't produce the same output every time. Instead, use structural and semantic assertions.
Pattern 1: Schema validation with Zod
When your AI returns structured data, validate the shape — not the exact content:
// src/lib/__tests__/structured-output.test.ts
import { describe, it, expect } from 'vitest'
import { z } from 'zod'
const InvoiceSchema = z.object({
vendor: z.string().min(1),
amount: z.number().positive(),
currency: z.enum(['USD', 'EUR', 'GBP']),
date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
lineItems: z.array(
z.object({
description: z.string().min(1),
quantity: z.number().int().positive(),
unitPrice: z.number().positive(),
})
),
})
describe('invoice extraction output', () => {
it('validates against the expected schema', () => {
// This would come from your AI extraction function
const aiOutput = {
vendor: 'Acme Corp',
amount: 1500.0,
currency: 'USD',
date: '2025-11-15',
lineItems: [
{ description: 'Widget A', quantity: 10, unitPrice: 100.0 },
{ description: 'Widget B', quantity: 5, unitPrice: 100.0 },
],
}
const result = InvoiceSchema.safeParse(aiOutput)
expect(result.success).toBe(true)
})
it('rejects invalid amounts', () => {
const badOutput = {
vendor: 'Acme Corp',
amount: -500,
currency: 'USD',
date: '2025-11-15',
lineItems: [],
}
const result = InvoiceSchema.safeParse(badOutput)
expect(result.success).toBe(false)
})
})
Pattern 2: Assertion helpers for fuzzy matching
Build a small set of assertion helpers for common checks on AI output:
// src/lib/test-utils/ai-assertions.ts
import { expect } from 'vitest'
export function expectWithinWordCount(text: string, min: number, max: number) {
const wordCount = text.split(/\s+/).filter(Boolean).length
expect(wordCount).toBeGreaterThanOrEqual(min)
expect(wordCount).toBeLessThanOrEqual(max)
}
export function expectContainsAllKeywords(text: string, keywords: string[]) {
const lowerText = text.toLowerCase()
const missing = keywords.filter((kw) => !lowerText.includes(kw.toLowerCase()))
expect(missing, `Missing keywords: ${missing.join(', ')}`).toHaveLength(0)
}
export function expectNoBannedPhrases(text: string, banned: string[]) {
const lowerText = text.toLowerCase()
const found = banned.filter((phrase) => lowerText.includes(phrase.toLowerCase()))
expect(found, `Found banned phrases: ${found.join(', ')}`).toHaveLength(0)
}
export function expectValidMarkdown(text: string) {
// Check that code blocks are closed
const codeBlockCount = (text.match(/```/g) || []).length
expect(codeBlockCount % 2, 'Unclosed code block detected').toBe(0)
// Check that links have valid format
const brokenLinks = text.match(/\[([^\]]*)\]\(\s*\)/g)
expect(brokenLinks, 'Found empty links').toBeNull()
}
Use them in your tests:
// src/lib/__tests__/summary-quality.test.ts
import { describe, it } from 'vitest'
import {
expectWithinWordCount,
expectContainsAllKeywords,
expectNoBannedPhrases,
} from '../test-utils/ai-assertions'
describe('summary quality checks', () => {
// In a real test, this comes from your mocked or recorded AI response
const summary =
'The quarterly report shows revenue grew 15% to $2.1M. ' +
'Cloud services drove most of the growth, while hardware sales declined. ' +
'The company plans to expand into the European market next quarter.'
it('stays within the requested word count', () => {
expectWithinWordCount(summary, 20, 100)
})
it('mentions key topics from the source document', () => {
expectContainsAllKeywords(summary, ['revenue', 'cloud', 'European'])
})
it('does not include banned filler phrases', () => {
expectNoBannedPhrases(summary, [
'as an AI',
'I cannot',
"I don't have access",
'in conclusion',
])
})
})
Pattern 3: Snapshot testing for prompt regression
Snapshot tests won't help with AI output, but they're perfect for catching unintended prompt changes:
// src/lib/__tests__/prompt-snapshots.test.ts
import { describe, it, expect } from 'vitest'
import { buildSummaryPrompt } from '../prompt-builder'
describe('prompt snapshots', () => {
it('summary prompt matches snapshot', () => {
const prompt = buildSummaryPrompt('Test document content.', {
maxLength: 100,
style: 'bullet-points',
language: 'English',
})
expect(prompt).toMatchInlineSnapshot(`
"Summarize the following document in English.
Use concise bullet points. Each bullet should be one sentence.
Keep the summary under 100 words.

<document>
Test document content.
</document>"
`)
})
})
When someone changes the prompt template, the snapshot test fails and forces a review. This is intentional — prompt changes should be deliberate, not accidental side effects of a refactor.
Building a lightweight eval runner
For serious AI features, you need something beyond unit tests. An eval runner sends real prompts to the model and scores the responses against a rubric. This is expensive and slow, so run it separately from your unit tests — think of it as your AI-specific integration test suite.
// src/lib/eval/eval-runner.ts
import Anthropic from '@anthropic-ai/sdk'
export interface EvalCase {
name: string
input: string
rubric: EvalCheck[]
}
export interface EvalCheck {
description: string
check: (output: string) => boolean
}
export interface EvalResult {
caseName: string
passed: boolean
details: { description: string; passed: boolean }[]
output: string
durationMs: number
}
export async function runEval(
client: Anthropic,
systemPrompt: string,
evalCase: EvalCase
): Promise<EvalResult> {
const start = Date.now()
const response = await client.messages.create({
model: 'claude-sonnet-4-5-20250929',
max_tokens: 1024,
system: systemPrompt,
messages: [{ role: 'user', content: evalCase.input }],
})
const textBlock = response.content.find((b) => b.type === 'text')
const output = textBlock && textBlock.type === 'text' ? textBlock.text : ''
const durationMs = Date.now() - start
const details = evalCase.rubric.map((check) => ({
description: check.description,
passed: check.check(output),
}))
return {
caseName: evalCase.name,
passed: details.every((d) => d.passed),
details,
output,
durationMs,
}
}
export function printEvalResults(results: EvalResult[]) {
const passed = results.filter((r) => r.passed).length
console.log(`\nEval Results: ${passed}/${results.length} passed\n`)
for (const result of results) {
const icon = result.passed ? 'PASS' : 'FAIL'
console.log(`[${icon}] ${result.caseName} (${result.durationMs}ms)`)
for (const detail of result.details) {
const checkIcon = detail.passed ? ' +' : ' -'
console.log(`${checkIcon} ${detail.description}`)
}
if (!result.passed) {
console.log(` Output: ${result.output.slice(0, 200)}...`)
}
}
}
Define your eval cases:
// src/lib/eval/summarizer.eval.ts
import Anthropic from '@anthropic-ai/sdk'
import { runEval, printEvalResults, EvalCase } from './eval-runner'
const SYSTEM_PROMPT = `You are a document summarizer. Summarize the given text in 2-3 sentences. Be factual and concise.`
const evalCases: EvalCase[] = [
{
name: 'basic article summary',
input:
'TypeScript 5.4 introduces the NoInfer utility type, which prevents unwanted type inference in generic functions. Previously, developers had to use workarounds like intermediate type parameters. The new feature is backward compatible and requires no changes to existing code.',
rubric: [
{
description: 'mentions TypeScript 5.4',
check: (out) => out.includes('TypeScript 5.4') || out.includes('TS 5.4'),
},
{
description: 'mentions NoInfer',
check: (out) => out.toLowerCase().includes('noinfer'),
},
{
description: 'is concise (under 100 words)',
check: (out) => out.split(/\s+/).length <= 100,
},
{
description: 'does not hallucinate features',
check: (out) => !out.toLowerCase().includes('decorators'),
},
],
},
{
name: 'handles empty-ish input gracefully',
input: 'Meeting notes: nothing discussed.',
rubric: [
{
description: 'produces a short response',
check: (out) => out.split(/\s+/).length <= 30,
},
{
description: 'does not invent content',
check: (out) => !out.toLowerCase().includes('action items'),
},
],
},
]
async function main() {
const client = new Anthropic()
const results = await Promise.all(
evalCases.map((evalCase) => runEval(client, SYSTEM_PROMPT, evalCase))
)
printEvalResults(results)
const allPassed = results.every((r) => r.passed)
process.exit(allPassed ? 0 : 1)
}
main().catch((err) => {
console.error(err)
process.exit(1)
})
Run it separately:
pnpm tsx src/lib/eval/summarizer.eval.ts
Gotcha: flaky evals. AI evals will sometimes fail non-deterministically. Run them 3 times and require 2/3 passes, or set temperature: 0 to reduce variance. Don't add them to your CI pipeline's required checks — run them on a schedule (nightly) and alert on trends, not individual failures.
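The run-three-require-two policy is worth codifying so every eval script applies it the same way. A sketch of a generic quorum wrapper (passesWithQuorum is a hypothetical helper; it works over any async pass/fail check, including a runEval call reduced to its passed flag):

```typescript
// Hypothetical flake-mitigation wrapper: run an async pass/fail check
// several times and require a quorum of passes.
export async function passesWithQuorum(
  run: () => Promise<boolean>,
  attempts = 3,
  required = 2
): Promise<boolean> {
  let passes = 0
  for (let i = 0; i < attempts; i++) {
    if (await run()) passes++
    // Stop early once the quorum is reached...
    if (passes >= required) return true
    // ...or once it can no longer be reached with the runs remaining
    if (passes + (attempts - 1 - i) < required) return false
  }
  return passes >= required
}
```

It exits early once the verdict is decided, so a clean pass costs two API calls instead of three. Setting temperature: 0 on the messages.create call reduces variance further, though it doesn't eliminate it.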
Putting it all together: a practical test strategy
Here's the test setup I use for AI features in production:
Unit tests (run on every commit):
- Prompt construction functions
- Response parsing and validation
- Token counting and budget logic
- Cache key generation
- Error handling branches (mocked API failures)
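For that last bullet, mocked API failures, the pattern is a mock whose create rejects (for example vi.fn().mockRejectedValueOnce(new Error('529 overloaded')) followed by mockResolvedValueOnce) plus a retry wrapper under test. A minimal sketch of such a wrapper (withRetry is a hypothetical helper; production code should inspect the error type and back off between attempts):

```typescript
// Hypothetical retry wrapper for transient API failures such as rate
// limits or overloaded errors. Retries on any rejection, rethrows the
// last error once attempts are exhausted.
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
    }
  }
  throw lastError
}
```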
Integration tests (run on PRs):
- Mocked API calls testing the full request/response flow
- Snapshot tests on prompt templates
- Schema validation on structured outputs
Evals (run nightly):
- Real API calls with a test budget
- Rubric-based scoring against known inputs
- Track pass rates over time in a dashboard
Keep your unit tests fast and free by mocking the API. Keep your evals honest by hitting the real model. Don't mix the two — that way lies confusion and wasted money.
What's next
Testing gives you the confidence to iterate fast on your AI features without breaking production. If you're building the kind of AI-powered apps we've covered in this series — from streaming Claude responses in Next.js to RAG chatbots to Slack bots that know your codebase — these testing patterns will save you from the 3 AM "the AI is saying what?" incident.
Start with layer 1 — test your prompts and parsers. Add mocks for layer 2. Graduate to evals when your feature is mature enough to justify the cost. You don't need all three on day one, but you'll want all three before you scale.