Build a RAG Chatbot in 100 Lines of TypeScript
Monday 09/02/2026 · 9 min read

You have a pile of documents — product docs, internal wikis, markdown files — and you want a chatbot that actually answers questions about your content instead of hallucinating. You search for a RAG tutorial and find 50 Python notebooks using LangChain with 400 lines of boilerplate. You're a TypeScript developer. You just want something that works.
Here's how to build a RAG chatbot in about 100 lines of TypeScript. No LangChain, no heavyweight frameworks. Just the Anthropic SDK, a vector database, and your documents.
What RAG actually is (in 30 seconds)
Retrieval-Augmented Generation means: before asking the LLM a question, you search your documents for relevant chunks, then stuff those chunks into the prompt as context. The LLM answers based on your actual content instead of its training data. That's it. The concept is simple — the implementation details are where people get stuck.
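Boiled down to a sketch, the whole loop is three steps. The searchDocs and askClaude helpers here are stand-ins for the real implementations we build below:

// rag-loop-sketch.ts (illustrative only)
// searchDocs and askClaude are placeholders for the functions built later in this post.
type SearchDocs = (question: string) => Promise<string[]>
type AskClaude = (prompt: string) => Promise<string>

async function answer(question: string, searchDocs: SearchDocs, askClaude: AskClaude): Promise<string> {
  const chunks = await searchDocs(question) // 1. retrieve relevant chunks from your docs
  const prompt = `Context:\n${chunks.join('\n\n')}\n\nQuestion: ${question}` // 2. stuff them into the prompt
  return askClaude(prompt) // 3. the LLM answers from that context
}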
The stack
We'll use:
- @anthropic-ai/sdk — for calling Claude
- @pinecone-database/pinecone — vector database for storing and searching document embeddings
- OpenAI's embedding API — for generating embeddings (Anthropic doesn't offer an embedding model yet, so we use OpenAI's text-embedding-3-small for this part)
Install everything:
pnpm add @anthropic-ai/sdk @pinecone-database/pinecone openai
You'll need three API keys in your .env:
# .env
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=pcsk_...
Step 1: Chunk your documents
Before you can search your docs, you need to split them into chunks and generate embeddings. Here's a script that reads markdown files from a directory, splits them into overlapping chunks, and uploads them to Pinecone.
// scripts/ingest.ts
import { Pinecone } from '@pinecone-database/pinecone'
import OpenAI from 'openai'
import fs from 'fs'
import path from 'path'
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
interface DocumentChunk {
id: string
text: string
source: string
}
function chunkText(text: string, filename: string, chunkSize = 500, overlap = 50): DocumentChunk[] {
const chunks: DocumentChunk[] = []
let start = 0
while (start < text.length) {
const end = Math.min(start + chunkSize, text.length)
chunks.push({
id: `${filename}-chunk-${chunks.length}`,
text: text.slice(start, end),
source: filename,
})
start += chunkSize - overlap
}
return chunks
}
async function ingest(docsDir: string) {
const index = pinecone.index('rag-chatbot')
const files = fs.readdirSync(docsDir).filter((f) => f.endsWith('.md'))
const allChunks: DocumentChunk[] = []
for (const file of files) {
const content = fs.readFileSync(path.join(docsDir, file), 'utf-8')
allChunks.push(...chunkText(content, file))
}
console.log(`Processing ${allChunks.length} chunks from ${files.length} files`)
// Embed and upsert in batches of 100 to keep each request comfortably under both APIs' size limits
for (let i = 0; i < allChunks.length; i += 100) {
const batch = allChunks.slice(i, i + 100)
const embedResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: batch.map((c) => c.text),
})
const vectors = batch.map((chunk, idx) => ({
id: chunk.id,
values: embedResponse.data[idx].embedding,
metadata: { text: chunk.text, source: chunk.source },
}))
await index.upsert(vectors)
console.log(`Upserted batch ${Math.floor(i / 100) + 1}`)
}
console.log('Done!')
}
ingest('./docs').catch(console.error)
Run it with (one note: tsx doesn't load .env automatically, so either export the variables in your shell or load them with a package like dotenv):
npx tsx scripts/ingest.ts
Gotcha: chunk size matters more than you think
If your chunks are too small (< 200 chars), you lose context and Claude sees disjointed fragments. Too large (> 1000 chars) and your search results get noisy — you'll retrieve paragraphs where only one sentence is relevant. I've found 400-600 characters with a 50-character overlap works well for most documentation. The overlap ensures you don't cut sentences in half at chunk boundaries.
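If you want cleaner boundaries without pulling in a text-splitting library, one option is to snap each chunk to the nearest sentence end. This is a sketch of the idea, not what the ingest script above uses:

// A sentence-aware variant of chunkText (illustrative sketch only).
// Same sliding window, but each chunk is extended to the next period so
// boundaries land on sentence ends instead of mid-sentence.
function chunkTextBySentence(text: string, filename: string, chunkSize = 500, overlap = 50) {
  const chunks: { id: string; text: string; source: string }[] = []
  let start = 0
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length)
    const nextStop = text.indexOf('.', end)
    if (nextStop !== -1 && nextStop - end < 100) end = nextStop + 1 // extend to a nearby sentence end
    chunks.push({
      id: `${filename}-chunk-${chunks.length}`,
      text: text.slice(start, end),
      source: filename,
    })
    if (end >= text.length) break // last chunk reached
    start = end - overlap
  }
  return chunks
}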
Step 2: The RAG chatbot (the actual 100 lines)
Here's the complete chatbot. It takes a user question, searches Pinecone for relevant chunks, then asks Claude to answer based on those chunks.
// src/rag-chat.ts
import Anthropic from '@anthropic-ai/sdk'
import { Pinecone } from '@pinecone-database/pinecone'
import OpenAI from 'openai'
import * as readline from 'readline'
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
const index = pinecone.index('rag-chatbot')
interface SearchResult {
text: string
source: string
score: number
}
async function searchDocs(query: string, topK = 5): Promise<SearchResult[]> {
const embedResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query,
})
const results = await index.query({
vector: embedResponse.data[0].embedding,
topK,
includeMetadata: true,
})
return (results.matches || []).map((match) => ({
text: (match.metadata?.text as string) || '',
source: (match.metadata?.source as string) || '',
score: match.score || 0,
}))
}
function buildPrompt(question: string, context: SearchResult[]): string {
const contextBlock = context
.map((c) => `[Source: ${c.source} | Relevance: ${c.score.toFixed(2)}]\n${c.text}`)
.join('\n\n---\n\n')
return `Answer the user's question based ONLY on the following context. If the context doesn't contain enough information to answer, say so — don't make things up.
Context:
${contextBlock}
Question: ${question}`
}
async function chat(question: string): Promise<string> {
const context = await searchDocs(question)
if (context.length === 0 || context[0].score < 0.3) {
return "I couldn't find any relevant information in the documents to answer that question."
}
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-5-20250929',
max_tokens: 1024,
system:
'You are a helpful assistant that answers questions based on provided documentation. ' +
'Cite your sources by mentioning the filename when possible. ' +
'Be concise and direct.',
messages: [{ role: 'user', content: buildPrompt(question, context) }],
})
// Narrow to the text block; response.content can also contain other block types.
const textBlock = response.content.find((block): block is Anthropic.TextBlock => block.type === 'text')
return textBlock ? textBlock.text : 'No response generated.'
}
async function main() {
const rl = readline.createInterface({ input: process.stdin, output: process.stdout })
console.log('RAG Chatbot ready. Type your questions (Ctrl+C to exit).\n')
const askQuestion = () => {
rl.question('You: ', async (question) => {
if (!question.trim()) return askQuestion()
try {
const answer = await chat(question)
console.log(`\nAssistant: ${answer}\n`)
} catch (error) {
const message = error instanceof Error ? error.message : 'Unknown error'
console.error(`\nError: ${message}\n`)
}
askQuestion()
})
}
askQuestion()
}
main()
Run it:
npx tsx src/rag-chat.ts
That's around 90 lines of actual code (excluding blank lines). Let's break down the key decisions.
Why this works (and where it breaks)
The relevance threshold
Notice this line:
if (context.length === 0 || context[0].score < 0.3) {
return "I couldn't find any relevant information..."
}
This is crucial. Without a threshold, Pinecone will always return results — even if they're completely irrelevant. A question about "the weather" will still match something in your docs. The 0.3 threshold is a starting point; you'll need to tune it based on your data. Lower it if the bot is too conservative, raise it if it's answering questions it shouldn't.
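One small refinement, sketched below: pull the cutoff out into a constant or environment variable (RAG_MIN_SCORE here is a made-up name, not something the SDKs know about) so you can tune it without touching the logic:

// Sketch: make the relevance cutoff tunable per deployment.
const RELEVANCE_THRESHOLD = Number(process.env.RAG_MIN_SCORE ?? 0.3)

function hasRelevantContext(context: { score: number }[]): boolean {
  return context.length > 0 && context[0].score >= RELEVANCE_THRESHOLD
}

// inside chat(), the guard then becomes:
// if (!hasRelevantContext(context)) {
//   return "I couldn't find any relevant information in the documents to answer that question."
// }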
The system prompt
Telling Claude to answer "ONLY based on the following context" and to admit when it doesn't know is what separates a useful RAG chatbot from a hallucination machine. Without this instruction, Claude will happily fill in gaps with its training data, which defeats the entire purpose.
Embedding model choice
We're using OpenAI's text-embedding-3-small because it's cheap ($0.02 per million tokens), fast, and good enough for most use cases. The text-embedding-3-large variant is marginally better but costs 6.5x more. For a docs chatbot, small is the right call.
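If you do want to try the larger model, the only code change is the model name, but keep the index dimensions in mind: text-embedding-3-large returns 3072-dimensional vectors, so the 1536-dimension index we created won't accept them. A minimal sketch:

// Sketch: embedding with text-embedding-3-large instead of -small.
// The Pinecone index would need to be created with 3072 dimensions for this model.
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

async function embedLarge(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: text,
  })
  return res.data[0].embedding // 3072 numbers instead of 1536
}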
Plugging this into a Next.js API route
If you want this as a web endpoint instead of a CLI tool, wrap the chat() function in an API route:
// src/pages/api/rag-chat.ts
import type { NextApiRequest, NextApiResponse } from 'next'
import Anthropic from '@anthropic-ai/sdk'
import { Pinecone } from '@pinecone-database/pinecone'
import OpenAI from 'openai'
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
const index = pinecone.index('rag-chatbot')
interface SearchResult {
text: string
source: string
score: number
}
async function searchDocs(query: string, topK = 5): Promise<SearchResult[]> {
const embedResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query,
})
const results = await index.query({
vector: embedResponse.data[0].embedding,
topK,
includeMetadata: true,
})
return (results.matches || []).map((match) => ({
text: (match.metadata?.text as string) || '',
source: (match.metadata?.source as string) || '',
score: match.score || 0,
}))
}
export default async function handler(req: NextApiRequest, res: NextApiResponse) {
if (req.method !== 'POST') {
return res.status(405).json({ error: 'Method not allowed' })
}
const { question } = req.body
if (!question || typeof question !== 'string') {
return res.status(400).json({ error: 'Missing or invalid question' })
}
try {
const context = await searchDocs(question)
if (context.length === 0 || context[0].score < 0.3) {
return res.json({
answer: "I couldn't find relevant information in the docs for that question.",
sources: [],
})
}
const contextBlock = context
.map((c) => `[Source: ${c.source}]\n${c.text}`)
.join('\n\n---\n\n')
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-5-20250929',
max_tokens: 1024,
system:
'You are a helpful assistant that answers questions based on provided documentation. ' +
'Cite sources by filename. Be concise.',
messages: [
{
role: 'user',
content: `Answer based ONLY on this context. If unsure, say so.\n\nContext:\n${contextBlock}\n\nQuestion: ${question}`,
},
],
})
// Narrow to the text block; response.content can also contain other block types.
const textBlock = response.content.find((block): block is Anthropic.TextBlock => block.type === 'text')
return res.json({
answer: textBlock?.text || 'No response generated.',
sources: context.map((c) => c.source),
})
} catch (error) {
const message =
error instanceof Anthropic.APIError
? `API error: ${error.status} - ${error.message}`
: 'Internal server error'
return res.status(500).json({ error: message })
}
}
You can combine this with the streaming pattern from my previous post to stream the RAG responses token-by-token instead of waiting for the full answer.
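As a rough sketch of what that looks like with the SDK's messages.stream helper (adapt the delta handling to however you send data to the client):

// Sketch: stream the RAG answer token-by-token.
// onDelta is a hypothetical callback, e.g. res.write() in an API route
// or process.stdout.write in the CLI version.
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

async function streamAnswer(prompt: string, onDelta: (text: string) => void): Promise<string> {
  const stream = anthropic.messages.stream({
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 1024,
    system: 'You are a helpful assistant that answers questions based on provided documentation.',
    messages: [{ role: 'user', content: prompt }],
  })
  stream.on('text', (delta) => onDelta(delta)) // fires for each text chunk as it arrives
  const final = await stream.finalMessage() // resolves once the full response is in
  const textBlock = final.content.find((block): block is Anthropic.TextBlock => block.type === 'text')
  return textBlock?.text ?? ''
}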
Setting up Pinecone
If you haven't used Pinecone before, here's the quick setup:
- Create a free account at pinecone.io
- Create an index called rag-chatbot with 1536 dimensions (that's the dimension for text-embedding-3-small) and the cosine metric
- Copy your API key to .env
That's it. The free tier gives you 1 index with 2GB of storage, which is plenty for a docs chatbot.
Gotcha: Pinecone's serverless vs pod-based indexes
Pinecone now defaults to serverless indexes, which are cheaper and auto-scale. Use serverless unless you need guaranteed low-latency queries (< 50ms). For a chatbot where the LLM call takes 1-3 seconds anyway, the extra 100ms from serverless cold starts is irrelevant.
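You can also create the index from code instead of the dashboard. A sketch (the cloud and region values are assumptions, use whatever your account's free tier offers):

// Sketch: create the serverless index programmatically.
import { Pinecone } from '@pinecone-database/pinecone'

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })

async function createRagIndex() {
  await pinecone.createIndex({
    name: 'rag-chatbot',
    dimension: 1536, // must match text-embedding-3-small
    metric: 'cosine',
    spec: { serverless: { cloud: 'aws', region: 'us-east-1' } }, // assumed defaults
  })
}

createRagIndex().catch(console.error)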
Things you'll want to add next
This is a minimal RAG chatbot. For production, consider:
- Conversation history — right now each question is independent. Pass previous Q&A pairs to Claude so it can handle follow-ups like "tell me more about that." (There's a short sketch of this after the list.)
- Better chunking — split on paragraph or section boundaries instead of character counts. Markdown headers make natural chunk boundaries.
- Reranking — retrieve 20 results from Pinecone, then use a reranker (like Cohere Rerank) to pick the top 5. This dramatically improves answer quality.
- Metadata filtering — tag chunks with categories and let users filter ("search only in the API docs").
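Conversation history is usually the first addition. Here's a rough sketch of the idea: keep prior turns in an array and replay them with each new question. Where the history lives (per CLI session, per user) is up to you, and contextBlock is the same retrieved-chunks string built in chat() above:

// Sketch: multi-turn chat. Prior Q&A pairs are replayed so Claude can resolve
// follow-ups like "tell me more about that".
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const history: Anthropic.MessageParam[] = []

async function chatWithHistory(question: string, contextBlock: string): Promise<string> {
  history.push({
    role: 'user',
    content: `Context:\n${contextBlock}\n\nQuestion: ${question}`,
  })
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5-20250929',
    max_tokens: 1024,
    system: 'Answer based only on the provided context. If the context is not enough, say so.',
    messages: history, // previous turns plus the new question
  })
  const textBlock = response.content.find((block): block is Anthropic.TextBlock => block.type === 'text')
  const answer = textBlock?.text ?? 'No response generated.'
  history.push({ role: 'assistant', content: answer })
  return answer
}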
What's next
The RAG chatbot retrieves and answers, but what if you need an AI that takes actions — calling APIs, querying databases, running code? In an upcoming post, I'll cover how to build a multi-step AI agent with tool use in TypeScript, where Claude doesn't just read your docs but actually does things based on what it finds.