Build an Agentic RAG Pipeline That Retries and Reformulates Queries

Friday 03/04/2026

·12 min read

Your RAG pipeline retrieves three chunks, stuffs them into a prompt, and hopes for the best. Half the time the retrieved context is irrelevant, and the LLM confidently hallucinates an answer based on garbage. You know the retrieval step is the bottleneck, but you're not sure how to make it smarter without bolting on an entire framework.

The fix isn't more chunks or a bigger context window. It's giving the LLM the ability to judge whether the retrieval was good enough - and if not, reformulate the query and try again. This is agentic RAG: a loop where the model drives its own retrieval strategy instead of passively consuming whatever the vector search returns.

Here's how to build one in TypeScript, step by step.

The problem with naive RAG

A typical RAG pipeline looks like this:

// src/naive-rag.ts
import Anthropic from "@anthropic-ai/sdk";

async function naiveRag(question: string, retrieve: (q: string) => Promise<string[]>) {
    const chunks = await retrieve(question);
    const context = chunks.join("\n\n");

    const client = new Anthropic();
    const response = await client.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        messages: [
            {
                role: "user",
                content: `Answer this question based on the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`,
            },
        ],
    });

    return response.content[0].type === "text" ? response.content[0].text : "";
}

This breaks in predictable ways. If the user asks "what's the refund policy for enterprise customers?" and your embeddings match chunks about general refund policies and enterprise pricing - but not the specific intersection - you get a confident-sounding answer stitched together from unrelated paragraphs.

The model can't tell you the retrieval was bad. It just does its best with what it got.

How agentic RAG fixes this

An agentic RAG pipeline adds a feedback loop:

Retrieve chunks using the original query
Evaluate - ask the LLM if the retrieved context actually answers the question
Reformulate - if the context is insufficient, the LLM generates a better search query
Retry with the new query (up to a max number of attempts)
Answer once the context is good enough, or admit uncertainty if retrieval keeps failing

The LLM becomes the judge of its own retrieval quality. Let's build it.

Project setup

mkdir agentic-rag && cd agentic-rag
pnpm init
pnpm add @anthropic-ai/sdk @supabase/supabase-js zod
pnpm add -D typescript @types/node tsx

// tsconfig.json
{
    "compilerOptions": {
        "target": "ES2022",
        "module": "ESNext",
        "moduleResolution": "bundler",
        "strict": true,
        "outDir": "dist",
        "esModuleInterop": true
    },
    "include": ["src"]
}

We'll use Supabase with pgvector for the vector store and Claude for both evaluation and answering. You can swap in any vector database - the agentic loop itself is provider-agnostic.

The retrieval layer

First, a simple retrieval function that wraps Supabase's vector similarity search:

// src/retriever.ts
import { createClient } from "@supabase/supabase-js";
import Anthropic from "@anthropic-ai/sdk";

const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_ANON_KEY!,
);

const anthropic = new Anthropic();

interface RetrievedChunk {
    content: string;
    metadata: Record<string, string>;
    similarity: number;
}

async function getEmbedding(text: string): Promise<number[]> {
    // Using Anthropic's voyage embeddings via a separate call,
    // or swap in OpenAI embeddings - whatever matches your index
    const response = await fetch("https://api.voyageai.com/v1/embeddings", {
        method: "POST",
        headers: {
            Authorization: `Bearer ${process.env.VOYAGE_API_KEY!}`,
            "Content-Type": "application/json",
        },
        body: JSON.stringify({
            input: [text],
            model: "voyage-3",
        }),
    });

    if (!response.ok) {
        throw new Error(`Embedding request failed: ${response.status} ${response.statusText}`);
    }

    const data = (await response.json()) as { data: Array<{ embedding: number[] }> };
    return data.data[0].embedding;
}

export async function retrieve(
    query: string,
    topK: number = 5,
    similarityThreshold: number = 0.3,
): Promise<RetrievedChunk[]> {
    const embedding = await getEmbedding(query);

    const { data, error } = await supabase.rpc("match_documents", {
        query_embedding: embedding,
        match_threshold: similarityThreshold,
        match_count: topK,
    });

    if (error) {
        throw new Error(`Retrieval failed: ${error.message}`);
    }

    return (data as Array<{ content: string; metadata: Record<string, string>; similarity: number }>).map(
        (row) => ({
            content: row.content,
            metadata: row.metadata,
            similarity: row.similarity,
        }),
    );
}

Nothing unusual here. The interesting part is what happens after retrieval.

The evaluation step

This is the core of agentic RAG. After retrieving chunks, we ask the LLM to judge whether the context is sufficient to answer the question:

// src/evaluator.ts
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

const anthropic = new Anthropic();

const EvaluationSchema = z.object({
    sufficient: z.boolean(),
    reasoning: z.string(),
    reformulated_query: z.string().optional(),
    search_strategy: z.enum(["semantic", "keyword", "broader", "narrower"]).optional(),
});

type Evaluation = z.infer<typeof EvaluationSchema>;

export async function evaluateRetrieval(
    originalQuestion: string,
    retrievedChunks: Array<{ content: string; similarity: number }>,
    previousQueries: string[],
): Promise<Evaluation> {
    const chunksText = retrievedChunks
        .map((c, i) => `[Chunk ${i + 1}] (similarity: ${c.similarity.toFixed(3)})\n${c.content}`)
        .join("\n\n---\n\n");

    const previousQueriesText =
        previousQueries.length > 0
            ? `\nPrevious search queries that didn't work well:\n${previousQueries.map((q) => `- "${q}"`).join("\n")}`
            : "";

    const response = await anthropic.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 512,
        messages: [
            {
                role: "user",
                content: `You are a retrieval quality evaluator. Given a user's question and retrieved context chunks, determine if the context is SUFFICIENT to answer the question accurately.

User's question: "${originalQuestion}"
${previousQueriesText}

Retrieved context:
${chunksText}

Respond with JSON matching this schema:
{
  "sufficient": boolean,        // true if chunks contain enough info to answer
  "reasoning": string,          // why you think the context is or isn't sufficient
  "reformulated_query": string, // if insufficient, a better search query to try
  "search_strategy": string     // one of: "semantic", "keyword", "broader", "narrower"
}

Rules:
- If similarity scores are all below 0.5, the retrieval likely missed entirely
- If chunks are tangentially related but don't address the specific question, mark insufficient
- When reformulating, try a DIFFERENT angle - don't just rephrase the same query
- If previous queries are listed, avoid repeating similar formulations
- "broader" strategy: widen the search scope (useful when the question is too specific)
- "narrower" strategy: focus on a specific aspect (useful when results are too general)`,
            },
        ],
    });

    const text = response.content[0].type === "text" ? response.content[0].text : "";

    const jsonMatch = text.match(/\{[\s\S]*\}/);
    if (!jsonMatch) {
        throw new Error("Evaluator did not return valid JSON");
    }

    return EvaluationSchema.parse(JSON.parse(jsonMatch[0]));
}

A few things to note:

We pass previous queries so the LLM doesn't suggest the same reformulation twice. This is critical - without it, you get stuck in loops.
The search strategy hint tells the retriever how to adjust. "Broader" means we're being too specific; "narrower" means the results are too generic.
We use claude-sonnet-4-20250514 for evaluation instead of Opus. The evaluation task is well-defined enough that Sonnet handles it accurately, and it's significantly cheaper when you're running it on every retrieval attempt.

The agentic loop

Now we wire everything together into the retry loop:

// src/agentic-rag.ts
import { retrieve } from "./retriever";
import { evaluateRetrieval } from "./evaluator";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

interface AgenticRagResult {
    answer: string;
    attempts: number;
    queries: string[];
    confident: boolean;
}

export async function agenticRag(
    question: string,
    maxAttempts: number = 3,
): Promise<AgenticRagResult> {
    const queries: string[] = [question];
    let bestChunks: Array<{ content: string; similarity: number }> = [];
    let attempts = 0;

    for (let i = 0; i < maxAttempts; i++) {
        attempts++;
        const currentQuery = queries[queries.length - 1];

        console.log(`\n[Attempt ${attempts}] Searching: "${currentQuery}"`);

        const chunks = await retrieve(currentQuery);

        console.log(`  Retrieved ${chunks.length} chunks (best similarity: ${chunks[0]?.similarity.toFixed(3) ?? "N/A"})`);

        // Keep the best chunks across all attempts
        bestChunks = mergeBestChunks(bestChunks, chunks, 8);

        const evaluation = await evaluateRetrieval(question, bestChunks, queries);

        console.log(`  Evaluation: ${evaluation.sufficient ? "SUFFICIENT" : "INSUFFICIENT"}`);
        console.log(`  Reasoning: ${evaluation.reasoning}`);

        if (evaluation.sufficient) {
            break;
        }

        if (evaluation.reformulated_query && i < maxAttempts - 1) {
            console.log(`  Reformulated query: "${evaluation.reformulated_query}"`);
            console.log(`  Strategy: ${evaluation.search_strategy}`);
            queries.push(evaluation.reformulated_query);
        }
    }

    // Generate final answer using accumulated context
    const answer = await generateAnswer(question, bestChunks, attempts >= maxAttempts);

    return {
        answer,
        attempts,
        queries,
        confident: attempts < maxAttempts,
    };
}

function mergeBestChunks(
    existing: Array<{ content: string; similarity: number }>,
    incoming: Array<{ content: string; similarity: number }>,
    maxChunks: number,
): Array<{ content: string; similarity: number }> {
    const seen = new Set<string>();
    const all = [...existing, ...incoming].filter((chunk) => {
        // Deduplicate by content
        if (seen.has(chunk.content)) return false;
        seen.add(chunk.content);
        return true;
    });

    return all.sort((a, b) => b.similarity - a.similarity).slice(0, maxChunks);
}

async function generateAnswer(
    question: string,
    chunks: Array<{ content: string; similarity: number }>,
    exhaustedRetries: boolean,
): Promise<string> {
    const context = chunks.map((c) => c.content).join("\n\n---\n\n");

    const confidenceNote = exhaustedRetries
        ? "\n\nIMPORTANT: The retrieval system could not find highly relevant context after multiple attempts. If the provided context doesn't contain a clear answer, say so honestly rather than guessing."
        : "";

    const response = await anthropic.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 2048,
        messages: [
            {
                role: "user",
                content: `Answer the following question based on the provided context. If the context doesn't contain enough information to fully answer the question, acknowledge what's missing.${confidenceNote}

Context:
${context}

Question: ${question}`,
            },
        ],
    });

    return response.content[0].type === "text" ? response.content[0].text : "";
}

The key design decisions here:

Accumulate chunks across attempts. Each retry adds to the pool of context, and we keep the best ones by similarity. This means later attempts build on earlier ones rather than starting from scratch.

Honest uncertainty. When we've exhausted all retries, we tell the model explicitly. This produces answers like "Based on the available documentation, I found information about X but couldn't find specifics about Y" instead of fabricated answers.

Three attempts maximum. In my testing, if three reformulations can't find good context, a fourth won't either. You're better off returning an honest "I'm not sure" than burning tokens on a loop that won't converge. Adjust this based on the breadth of your document corpus.

Adding keyword search as a fallback

Vector similarity search isn't always the best strategy. When the evaluator suggests a "keyword" strategy, we should switch to full-text search:

// src/retriever.ts (add this function)
export async function keywordRetrieve(
    query: string,
    topK: number = 5,
): Promise<RetrievedChunk[]> {
    const { data, error } = await supabase
        .from("documents")
        .select("content, metadata")
        .textSearch("content", query, { type: "websearch" })
        .limit(topK);

    if (error) {
        throw new Error(`Keyword retrieval failed: ${error.message}`);
    }

    return (data ?? []).map((row) => ({
        content: row.content as string,
        metadata: (row.metadata ?? {}) as Record<string, string>,
        similarity: 1.0, // Keyword matches don't have a similarity score
    }));
}

Then update the agentic loop to pick the right retrieval method:

// src/agentic-rag.ts (update the retrieval call inside the loop)
const chunks =
    evaluation?.search_strategy === "keyword"
        ? await keywordRetrieve(currentQuery)
        : await retrieve(currentQuery);

This is surprisingly effective. Queries with specific product names, error codes, or technical terms often fail in embedding space but work perfectly with full-text search. Letting the LLM choose the strategy based on what it sees in the results is where the "agentic" part really pays off.

Running it

// src/index.ts
import { agenticRag } from "./agentic-rag";

async function main() {
    const question = process.argv[2];
    if (!question) {
        console.error("Usage: tsx src/index.ts 'your question here'");
        process.exit(1);
    }

    console.log(`Question: ${question}`);

    const result = await agenticRag(question);

    console.log("\n=== Result ===");
    console.log(`Answer: ${result.answer}`);
    console.log(`\nAttempts: ${result.attempts}`);
    console.log(`Queries tried: ${result.queries.join(" → ")}`);
    console.log(`Confident: ${result.confident}`);
}

main().catch(console.error);

tsx src/index.ts "what's the refund policy for enterprise customers?"

A typical run looks like this:

[Attempt 1] Searching: "what's the refund policy for enterprise customers?"
  Retrieved 5 chunks (best similarity: 0.412)
  Evaluation: INSUFFICIENT
  Reasoning: Chunks cover general refund terms but nothing enterprise-specific
  Reformulated query: "enterprise plan cancellation terms and refund eligibility"
  Strategy: narrower

[Attempt 2] Searching: "enterprise plan cancellation terms and refund eligibility"
  Retrieved 5 chunks (best similarity: 0.687)
  Evaluation: SUFFICIENT
  Reasoning: Found enterprise SLA document with refund conditions

=== Result ===
Answer: Enterprise customers on annual contracts are eligible for...
Attempts: 2
Queries tried: what's the refund policy for enterprise customers? → enterprise plan cancellation terms and refund eligibility
Confident: true

The first query was too broad. The reformulation zeroed in on the right terminology - "cancellation terms" and "refund eligibility" matched the way the enterprise SLA document was actually written, not how a user would ask about it. That's the whole point.

Gotchas and things to watch out for

Cost adds up. Each evaluation is an extra LLM call. With three attempts, you could make up to four Claude calls per question (three evaluations + one answer generation). Monitor your costs and consider whether the improved accuracy is worth it for your use case. For internal tools with low query volume, it's a no-brainer. For a public-facing chatbot handling thousands of queries per day, you might want to run evaluation only when the top similarity score is below a threshold.

The evaluator can be wrong. Sometimes it marks sufficient context as insufficient, triggering unnecessary retries. A tight evaluation prompt and a similarity threshold check (skip evaluation if top result is above 0.85) help reduce false negatives.

Don't let the loop run forever. Three attempts is a good default. I've seen teams set it to five or even ten - that's almost always a sign the underlying embeddings or chunking strategy needs work, not that the agent needs more retries.

Log everything. The queries, similarity scores, and evaluation reasoning are gold for debugging. When the pipeline gives a bad answer, the logs tell you exactly where the retrieval went wrong.

Before and after

On a test set of 50 questions against an internal documentation corpus:

| Metric | Naive RAG | Agentic RAG | |---|---|---| | Correct answers | 62% | 84% | | "I don't know" (appropriate) | 8% | 22% | | Hallucinated answers | 30% | 6% | | Avg. retrieval attempts | 1.0 | 1.7 | | Avg. latency | 1.2s | 2.8s |

The big win isn't just the accuracy improvement - it's the drop in hallucinated answers. Naive RAG hallucinates 30% of the time because it always tries to answer, even with bad context. Agentic RAG catches bad retrievals and either fixes them or admits uncertainty.

What's next

If you're already building with tool-calling agents, the evaluation step can become another tool in your agent's toolkit. Check out Build a Multi-Step AI Agent with Tool Use in TypeScript for how to wire up tool-use loops, then add retrieval evaluation as one of the tools.

For measuring whether this pipeline is actually helping your users (not just hitting benchmarks), take a look at topic #22 in the upcoming posts - we'll cover product metrics for AI features, including how to A/B test retrieval quality in production.