How to Add LLM Observability and Tracing to Your TypeScript AI App with Langfuse

Friday 17/04/2026

12 min read

Your AI feature works in development. Users are hitting it in production. But you have no idea what's actually happening inside those LLM calls. A user reports the chatbot gave a wrong answer yesterday — which prompt fired? What context was retrieved? How many tokens did it burn? You check your logs and find... nothing useful.

This is the observability gap that kills AI features in production. Traditional logging tells you that something happened. LLM tracing tells you why — every prompt, every completion, every tool call, every retrieval step, with timing and cost data attached. Langfuse is an open-source LLM observability platform (recently acquired by ClickHouse) that gives you this visibility without vendor lock-in.

Why Langfuse over other options

There are several LLM observability tools out there — LangSmith, Helicone, Braintrust. Here's why Langfuse is my pick for TypeScript apps:

  • Open source — self-host or use their cloud. No vendor lock-in
  • First-class Vercel AI SDK integration — one-line setup if you're already using the ai package
  • Provider-agnostic — works with Claude, OpenAI, Gemini, local models, anything
  • ClickHouse-backed analytics — fast queries over millions of traces
  • Cost tracking — automatic token counting and cost calculation per model

Setting up the project

We'll add Langfuse tracing to a Vercel AI SDK app that uses Claude with tool calling. If you have an existing AI SDK app, you can skip the scaffolding and jump to the integration.

mkdir ai-app-with-tracing && cd ai-app-with-tracing
pnpm init
pnpm add ai @ai-sdk/anthropic langfuse zod
pnpm add -D typescript @types/node tsx

// tsconfig.json
{
    "compilerOptions": {
        "target": "ES2022",
        "module": "Node16",
        "moduleResolution": "Node16",
        "outDir": "./dist",
        "strict": true,
        "esModuleInterop": true,
        "skipLibCheck": true
    },
    "include": ["src/**/*"]
}

Set up your environment variables:

# .env
ANTHROPIC_API_KEY=sk-ant-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASEURL=https://cloud.langfuse.com

You get the Langfuse keys from cloud.langfuse.com after creating a project, or from your self-hosted instance.
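
One note on the examples that follow: each snippet constructs its own Langfuse client from these three variables so every file stands alone. In a real app you'd probably centralize that into a shared module — a minimal sketch (the src/langfuse.ts path is just a suggestion):

// src/langfuse.ts — one shared client instead of re-instantiating per file
import { Langfuse } from 'langfuse'

export const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})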

The Vercel AI SDK integration (the easy path)

If you're using Vercel AI SDK, Langfuse has a direct integration that wraps the SDK's generateText and streamText with tracing — zero manual instrumentation. This is the approach I'd recommend for most apps.

// src/traced-chat.ts
import { generateText, tool } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { Langfuse } from 'langfuse'
import { z } from 'zod'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

const weatherTool = tool({
    description: 'Get current weather for a city',
    parameters: z.object({
        city: z.string().describe('City name'),
    }),
    execute: async ({ city }) => {
        // Simulate weather API call
        return {
            city,
            temperature: Math.round(Math.random() * 30 + 5),
            condition: ['sunny', 'cloudy', 'rainy'][Math.floor(Math.random() * 3)],
        }
    },
})

async function chat(userMessage: string) {
    // Create a Langfuse trace for this conversation turn
    const trace = langfuse.trace({
        name: 'chat',
        userId: 'user-123',
        metadata: { source: 'cli' },
        tags: ['production', 'weather-bot'],
    })

    const result = await generateText({
        model: anthropic('claude-sonnet-4-20250514'),
        tools: { weather: weatherTool },
        maxSteps: 5,
        messages: [{ role: 'user', content: userMessage }],
        experimental_telemetry: {
            isEnabled: true,
            metadata: {
                langfuseTraceId: trace.id,
            },
        },
    })

    // Log the final output
    trace.update({
        output: result.text,
    })

    // Flush to ensure traces are sent before process exits
    await langfuse.flushAsync()

    return result.text
}

// Run it
chat("What's the weather in Tel Aviv and Berlin?").then(console.log)

Once an OpenTelemetry exporter is registered (see the gotcha below), that experimental_telemetry config with the langfuseTraceId is all the per-call wiring you need. Langfuse hooks into the AI SDK's OpenTelemetry instrumentation and captures every generation, tool call, and step automatically.

Gotcha: The experimental_telemetry approach requires you to set up an OpenTelemetry exporter. For a simpler path without OTel, use the manual approach below — it gives you more control anyway.
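
If you do stick with the telemetry path, the exporter wiring is small. A minimal sketch for a plain Node script, assuming you add the langfuse-vercel and @opentelemetry/sdk-node packages (they're not part of the earlier install):

// src/otel.ts — register a tracer that exports AI SDK spans to Langfuse
import { NodeSDK } from '@opentelemetry/sdk-node'
import { LangfuseExporter } from 'langfuse-vercel'

// LangfuseExporter reads LANGFUSE_SECRET_KEY / LANGFUSE_PUBLIC_KEY /
// LANGFUSE_BASEURL from the environment; you can also pass them explicitly
export const sdk = new NodeSDK({
    traceExporter: new LangfuseExporter(),
})

sdk.start()

Import this file once at the top of your entry point, and call await sdk.shutdown() before the process exits so buffered spans get flushed.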

Manual tracing (full control)

The direct SDK integration is convenient, but manual tracing gives you finer control over what gets traced and how it's structured. This is what I use in production because I want to trace my own logic — retrieval steps, caching decisions, input validation — not just LLM calls.

// src/manual-tracing.ts
import { generateText, tool } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { Langfuse } from 'langfuse'
import { z } from 'zod'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

interface Document {
    id: string
    content: string
    score: number
}

// Simulate a retrieval step
async function retrieveDocuments(query: string): Promise<Document[]> {
    // In reality, this would be a vector search
    return [
        { id: 'doc-1', content: 'Langfuse supports TypeScript natively...', score: 0.92 },
        { id: 'doc-2', content: 'Tracing helps debug prompt regressions...', score: 0.87 },
    ]
}

async function ragChat(userMessage: string) {
    const trace = langfuse.trace({
        name: 'rag-chat',
        input: { message: userMessage },
        userId: 'user-456',
        sessionId: 'session-abc', // Groups multiple turns in the same conversation
        tags: ['rag', 'production'],
    })

    // Trace the retrieval step as a span
    const retrievalSpan = trace.span({
        name: 'document-retrieval',
        input: { query: userMessage },
    })

    const documents = await retrieveDocuments(userMessage)

    retrievalSpan.end({
        output: { documentCount: documents.length, topScore: documents[0]?.score },
        metadata: { source: 'pinecone', index: 'docs-v2' },
    })

    // Build the prompt with retrieved context
    const systemPrompt = `You are a helpful assistant. Use the following context to answer questions:

${documents.map((d) => d.content).join('\n\n')}

If the context doesn't contain relevant information, say so.`

    // Trace the LLM call as a generation
    const generation = trace.generation({
        name: 'llm-call',
        model: 'claude-sonnet-4-20250514',
        input: {
            system: systemPrompt,
            user: userMessage,
        },
        modelParameters: {
            maxTokens: 1024,
            temperature: 0.7,
        },
    })

    const result = await generateText({
        model: anthropic('claude-sonnet-4-20250514'),
        system: systemPrompt,
        messages: [{ role: 'user', content: userMessage }],
        maxTokens: 1024,
        temperature: 0.7,
    })

    generation.end({
        output: result.text,
        usage: {
            input: result.usage.promptTokens,
            output: result.usage.completionTokens,
        },
    })

    // Update the top-level trace with the final output
    trace.update({
        output: result.text,
        metadata: {
            documentsUsed: documents.length,
            totalTokens: result.usage.promptTokens + result.usage.completionTokens,
        },
    })

    await langfuse.flushAsync()
    return result.text
}

This creates a trace tree that looks like:

rag-chat (trace)
├── document-retrieval (span) — 45ms, 2 docs found
└── llm-call (generation) — 1.2s, 847 tokens, $0.003

Every node has timing, input/output, and metadata. When something goes wrong, you click into the trace and see exactly what happened at each step.

Tracing tool calls in a multi-step agent

Things get more interesting with agents that make multiple tool calls. Here's how to trace a multi-step agent loop:

// src/traced-agent.ts
import { generateText, tool } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { Langfuse } from 'langfuse'
import { z } from 'zod'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

// Define tools
const searchTool = tool({
    description: 'Search the knowledge base',
    parameters: z.object({ query: z.string() }),
    execute: async ({ query }) => {
        return { results: [`Result for: ${query}`], count: 1 }
    },
})

const calculatorTool = tool({
    description: 'Perform a calculation',
    parameters: z.object({
        expression: z.string().describe('Math expression to evaluate'),
    }),
    execute: async ({ expression }) => {
        // Demo-only eval of the model-generated expression — don't do this with untrusted input in production
        const result = new Function(`return ${expression}`)()
        return { expression, result: Number(result) }
    },
})

async function tracedAgent(userMessage: string) {
    const trace = langfuse.trace({
        name: 'agent-run',
        input: { message: userMessage },
    })

    let stepCount = 0

    const result = await generateText({
        model: anthropic('claude-sonnet-4-20250514'),
        tools: { search: searchTool, calculator: calculatorTool },
        maxSteps: 10,
        messages: [{ role: 'user', content: userMessage }],
        onStepFinish: async (step) => {
            stepCount++

            if (step.toolCalls.length > 0) {
                // Trace each tool call
                for (const toolCall of step.toolCalls) {
                    trace.span({
                        name: `tool:${toolCall.toolName}`,
                        input: toolCall.args,
                        output: step.toolResults.find(
                            (r) => r.toolCallId === toolCall.toolCallId
                        )?.result,
                        metadata: { step: stepCount },
                    })
                }
            }

            // Trace the generation step
            trace.generation({
                name: `step-${stepCount}`,
                model: 'claude-sonnet-4-20250514',
                input: step.request?.body, // raw provider request body, when the SDK exposes it
                output: step.text || step.toolCalls,
                usage: {
                    input: step.usage.promptTokens,
                    output: step.usage.completionTokens,
                },
            })
        },
    })

    trace.update({
        output: result.text,
        metadata: { totalSteps: stepCount },
    })

    await langfuse.flushAsync()
    return result.text
}

Now in the Langfuse dashboard, you can see the full agent loop — every step, every tool call, every token spent — and figure out why your agent decided to call search three times before giving an answer.

Adding scores and evaluations

Tracing shows you what happened. Scores tell you how well it went. Langfuse lets you attach scores to any trace — from user feedback, automated evaluations, or both.

// src/scoring.ts
import { Langfuse } from 'langfuse'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

// After getting user feedback (e.g., thumbs up/down in your UI)
async function recordUserFeedback(traceId: string, isPositive: boolean) {
    langfuse.score({
        traceId,
        name: 'user-feedback',
        value: isPositive ? 1 : 0,
        comment: isPositive ? 'User found the response helpful' : 'User marked as unhelpful',
    })
    await langfuse.flushAsync()
}

// Automated evaluation — check if the response actually used the retrieved context
async function evaluateGroundedness(traceId: string, response: string, context: string) {
    const contextWords = new Set(
        context
            .toLowerCase()
            .split(/\s+/)
            .filter((w) => w.length > 4)
    )
    const responseWords = response.toLowerCase().split(/\s+/)
    const overlapCount = responseWords.filter((w) => contextWords.has(w)).length
    const groundednessScore = Math.min(overlapCount / 10, 1) // Normalize to 0-1

    langfuse.score({
        traceId,
        name: 'groundedness',
        value: groundednessScore,
        comment: `${overlapCount} context words found in response`,
    })
    await langfuse.flushAsync()
}

A simple word-overlap check isn't going to catch everything, but it's a decent heuristic to start with. For production, you'd use an LLM-as-a-judge pattern — send the context and response to a smaller model and ask "does this response accurately reflect the provided context?" Then score the trace with the result.
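
Here's a minimal sketch of that judge pattern using generateObject from the AI SDK — the prompt wording, score schema, and the choice of Haiku as the judge model are all illustrative:

// src/llm-judge.ts — LLM-as-a-judge groundedness scoring
import { generateObject } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { Langfuse } from 'langfuse'
import { z } from 'zod'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

async function judgeGroundedness(traceId: string, response: string, context: string) {
    // Ask a cheap model for a structured verdict instead of counting word overlap
    const { object: verdict } = await generateObject({
        model: anthropic('claude-3-5-haiku-latest'), // any small, cheap model works here
        schema: z.object({
            grounded: z.number().min(0).max(1).describe('0 = unsupported, 1 = fully supported by the context'),
            reasoning: z.string().describe('One-sentence justification'),
        }),
        prompt: `Does this response accurately reflect the provided context?\n\nContext:\n${context}\n\nResponse:\n${response}`,
    })

    langfuse.score({
        traceId,
        name: 'groundedness-llm',
        value: verdict.grounded,
        comment: verdict.reasoning,
    })
    await langfuse.flushAsync()
}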

Prompt management (underrated feature)

Langfuse doubles as a prompt management system. Instead of hardcoding system prompts in your code, you store them in Langfuse and fetch them at runtime. This means your PM can tweak prompts in the Langfuse UI without deploying code.

// src/prompt-management.ts
import { Langfuse } from 'langfuse'
import { generateText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

async function chatWithManagedPrompt(userMessage: string) {
    // Fetch the prompt from Langfuse (cached locally, refreshes every 60s by default)
    const prompt = await langfuse.getPrompt('rag-assistant')

    const compiledPrompt = prompt.compile({
        context: 'Some retrieved context here',
        maxLength: '500 words',
    })

    const trace = langfuse.trace({
        name: 'managed-prompt-chat',
        input: { message: userMessage },
    })

    const generation = trace.generation({
        name: 'llm-call',
        model: 'claude-sonnet-4-20250514',
        prompt, // Link the generation to the prompt version — crucial for tracking
    })

    const result = await generateText({
        model: anthropic('claude-sonnet-4-20250514'),
        system: compiledPrompt,
        messages: [{ role: 'user', content: userMessage }],
    })

    generation.end({
        output: result.text,
        usage: {
            input: result.usage.promptTokens,
            output: result.usage.completionTokens,
        },
    })

    await langfuse.flushAsync()
    return result.text
}

The killer feature here is version tracking. When you update a prompt in Langfuse, it creates a new version. Every generation is linked to the prompt version that produced it. So when your AI starts giving worse answers on Tuesday, you can see that someone changed the system prompt on Monday and compare metrics between versions.
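
Version pinning is also available from code. A short sketch, assuming the SDK's getPrompt(name, version, options) signature with label and cacheTtlSeconds options — the 'staging' label and TTL value here are illustrative:

// src/prompt-versions.ts — fetching specific prompt versions and labels
import { Langfuse } from 'langfuse'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

async function loadPromptVariants() {
    const latest = await langfuse.getPrompt('rag-assistant') // default: the version labeled "production"
    const pinned = await langfuse.getPrompt('rag-assistant', 3) // a specific version number
    const staged = await langfuse.getPrompt('rag-assistant', undefined, {
        label: 'staging', // whichever version currently carries this label
        cacheTtlSeconds: 300, // override the default 60s client-side cache
    })
    return { latest, pinned, staged }
}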

Building a cost and latency dashboard

Once traces are flowing in, Langfuse gives you a built-in dashboard with latency percentiles, token usage, cost by model, and error rates. But the real power is in custom analytics.

Here's a pattern for tracking cost per feature — useful when you have multiple AI features and want to know which one is burning your budget:

// src/feature-tracking.ts
import { Langfuse } from 'langfuse'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

type FeatureName = 'chat' | 'search' | 'summarize' | 'code-review'

function createFeatureTrace(feature: FeatureName, userId: string) {
    return langfuse.trace({
        name: feature,
        userId,
        tags: [feature, 'production'],
        metadata: {
            feature,
            version: process.env.APP_VERSION || 'dev',
        },
    })
}

// Usage in your API routes
async function handleChatRequest(userId: string, message: string) {
    const trace = createFeatureTrace('chat', userId)
    // ... your chat logic with trace.generation() calls
    await langfuse.flushAsync()
}

async function handleSearchRequest(userId: string, query: string) {
    const trace = createFeatureTrace('search', userId)
    // ... your search logic
    await langfuse.flushAsync()
}

In the Langfuse dashboard, filter by the feature tag and you get per-feature breakdowns. Last month I discovered our "summarize" feature was costing 3x what "chat" cost because a prompt regression was sending redundant context — something I never would have caught without per-feature cost tracking.

Integration with Next.js API routes

If you're running a Next.js app (like most of my projects), here's how to wire Langfuse into your API routes:

// src/app/api/chat/route.ts (Next.js App Router)
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { Langfuse } from 'langfuse'

const langfuse = new Langfuse({
    secretKey: process.env.LANGFUSE_SECRET_KEY!,
    publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
    baseUrl: process.env.LANGFUSE_BASEURL,
})

export async function POST(req: Request) {
    const { messages, userId } = await req.json()

    const trace = langfuse.trace({
        name: 'chat-api',
        userId,
        input: messages[messages.length - 1],
    })

    const generation = trace.generation({
        name: 'stream-response',
        model: 'claude-sonnet-4-20250514',
        input: messages,
    })

    const result = streamText({
        model: anthropic('claude-sonnet-4-20250514'),
        messages,
        onFinish: async ({ text, usage }) => {
            generation.end({
                output: text,
                usage: {
                    input: usage.promptTokens,
                    output: usage.completionTokens,
                },
            })
            trace.update({ output: text })
            await langfuse.flushAsync()
        },
    })

    return result.toDataStreamResponse()
}

Gotcha: Don't forget flushAsync() in serverless environments. Langfuse batches traces and sends them asynchronously by default. In a serverless function that dies after returning the response, your traces might never get sent. The flushAsync() call forces the SDK to send everything before the function exits.

Debugging a prompt regression (real scenario)

Here's how tracing actually saved me hours. Our RAG chatbot started giving vague, unhelpful answers. Users were complaining. Without tracing, I'd be staring at logs trying to reproduce the issue.

With Langfuse, I filtered traces by the user-feedback score (looking for low scores), clicked into a bad trace, and immediately saw the problem: the retrieval step was returning 0 documents. The vector store index had been rebuilt with a new embedding model, but the query-time embedding was still using the old model. The cosine similarity scores were garbage, so everything fell below the threshold.

Total debugging time: 12 minutes. Without tracing, this would have been hours of "it works on my machine" followed by "let me add some console.logs."

What's next

Once you have observability in place, the natural next step is building evaluation pipelines that automatically score your AI outputs. If you're building a RAG system that needs to cite its sources, check out my post on adding AI-powered citations and source attribution to your RAG chatbot — tracing those citation chains is where Langfuse really shines.

If you're still choosing between AI frameworks, read Vercel AI SDK vs Mastra vs LangChain.js — all three have different observability stories, and it's worth knowing before you commit.

Vadim Alakhverdov

Software developer writing about JavaScript, web development, and developer tools.
