Edge RAG: Build a Sub-100ms Retrieval App with Cloudflare Workers AI and Vectorize

Monday 29/06/2026


Your RAG endpoint works, but it's slow for half your users. The request hops from a browser in Sydney to your Node server in us-east-1, then out to a hosted vector DB in a third region, back for generation, and finally home. Each leg adds 150–300ms of pure network time before a single token is generated. You've tuned your prompts and your top-K, but you can't tune the speed of light across an ocean.

This tutorial fixes that by collapsing the whole pipeline onto the edge. We'll build a complete retrieval app in a single TypeScript Worker that embeds documents, indexes them in Vectorize, runs retrieval, and streams the answer from Workers AI, all from whichever of Cloudflare's 300+ points of presence is closest to the user. No separate inference server, no external vector DB, no cross-region round trips. I'll show the ingestion pipeline, metadata filtering, streaming, and honest latency numbers against a centralized Next.js + Pinecone setup.

Why the edge changes the RAG math

A traditional RAG request is a relay race between regions. The edge version runs every leg in the same data center, the one nearest the user:

Traditional:  browser → Node (us-east-1) → Pinecone (us-west) → LLM provider → back
Edge:         browser → nearest Worker → Vectorize (local) → Workers AI (local) → back

The retrieval and the first inference token happen a few milliseconds apart inside one network instead of an ocean apart. That's the entire pitch. You give up frontier-model quality (Workers AI runs open models like Llama 3.1, not Claude Opus), but for FAQ bots, docs search, and support deflection, an 8B model with good retrieval is plenty, and the latency win is dramatic.

If you're still deciding whether a hosted vector DB is right for your scale, my Pinecone vs Turbopuffer vs pgvector comparison covers the centralized options. This post is the all-edge alternative.

Project setup

You need a Cloudflare account (the free tier covers everything here) and Wrangler, the Workers CLI.

pnpm add -D wrangler typescript @cloudflare/workers-types
pnpm dlx wrangler login

Create the Vectorize index. The dimension must match your embedding model — @cf/baai/bge-base-en-v1.5 outputs 768 dimensions, and cosine is the right metric for normalized text embeddings:

pnpm dlx wrangler vectorize create docs-index \
  --dimensions=768 \
  --metric=cosine

Then wire the bindings in wrangler.jsonc. The ai and vectorize bindings are what make env.AI and env.VECTORIZE appear inside the Worker — no API keys, no SDK clients, just typed handles to platform services:

// wrangler.jsonc
{
  "name": "edge-rag",
  "main": "src/index.ts",
  "compatibility_date": "2026-06-01",
  "ai": {
    "binding": "AI"
  },
  "vectorize": [
    {
      "binding": "VECTORIZE",
      "index_name": "docs-index"
    }
  ]
}

One gotcha that wastes an afternoon: wrangler dev talks to the real, remote Vectorize and Workers AI services by default, because there's no local emulator for them. That's actually what you want — you're testing against production behavior — but it means you can't develop fully offline.

Typing the environment

Generate types from your bindings so env is fully typed instead of any:

pnpm dlx wrangler types

That writes a worker-configuration.d.ts with an Env interface. We'll reference it everywhere:

// src/types.ts
export interface DocChunk {
    id: string
    text: string
    source: string
    title: string
}

// The vector returned by Vectorize queries, with our metadata shape.
export interface RetrievedChunk {
    text: string
    source: string
    title: string
    score: number
}

The ingestion pipeline

Ingestion is two steps: turn text into vectors with Workers AI, then upsert those vectors into Vectorize with metadata attached. The metadata is what lets you show citations and filter later, so store the original text right next to the vector.

// src/ingest.ts
import type { DocChunk } from './types'

const EMBEDDING_MODEL = '@cf/baai/bge-base-en-v1.5'

export async function ingestChunks(
    env: Env,
    chunks: DocChunk[]
): Promise<number> {
    if (chunks.length === 0) return 0

    // Workers AI embeds up to 100 texts per call. Batch to stay under that.
    const BATCH = 100
    let inserted = 0

    for (let i = 0; i < chunks.length; i += BATCH) {
        const batch = chunks.slice(i, i + BATCH)

        const { data } = await env.AI.run(EMBEDDING_MODEL, {
            text: batch.map((c) => c.text),
        })

        if (!data || data.length !== batch.length) {
            throw new Error(
                `Embedding count mismatch: got ${data?.length ?? 0}, expected ${batch.length}`
            )
        }

        const vectors = batch.map((chunk, j) => ({
            id: chunk.id,
            values: data[j],
            metadata: {
                text: chunk.text,
                source: chunk.source,
                title: chunk.title,
            },
        }))

        // upsert is idempotent on id — re-ingesting overwrites instead of duplicating.
        await env.VECTORIZE.upsert(vectors)
        inserted += vectors.length
    }

    return inserted
}

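Two things tripped me up here. First, Vectorize caps metadata size (around 10KB per vector), so don't shove a whole page into metadata.text; chunk it first to ~500 tokens. Second, use upsert, not insert. insert does not replace a vector whose ID already exists, while upsert overwrites it. When you re-run ingestion after editing a doc, you want the overwrite; otherwise retrieval keeps serving the stale text. One thing upsert does not solve: because chunk IDs are positional (source#0, source#1, and so on), a document that shrinks leaves its old trailing chunks behind, and those have to be deleted by ID.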
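As a rough guard for the first point, a sketch like the following can flag chunks whose metadata would likely exceed the cap before they reach upsert. The 10,240-byte limit is an assumption based on the "around 10KB" figure, and metadataBytes and findOversized are hypothetical helper names, so check the current Vectorize limits before relying on the exact number.

```typescript
// Hypothetical helpers; the cap below is an assumed value, not an official constant.
const ASSUMED_METADATA_CAP_BYTES = 10 * 1024

interface ChunkMetadata {
    text: string
    source: string
    title: string
}

// Approximate the stored size as the UTF-8 byte length of the JSON-serialized metadata.
function metadataBytes(metadata: ChunkMetadata): number {
    return new TextEncoder().encode(JSON.stringify(metadata)).length
}

// Return the ids of chunks whose metadata would exceed the assumed cap.
function findOversized(
    chunks: Array<ChunkMetadata & { id: string }>,
    capBytes = ASSUMED_METADATA_CAP_BYTES
): string[] {
    return chunks
        .filter(({ text, source, title }) => metadataBytes({ text, source, title }) > capBytes)
        .map((c) => c.id)
}
```

Measuring bytes rather than characters matters for non-English text, where one character can take several bytes in UTF-8.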

For chunking, keep it boring — split on paragraph boundaries with a token-ish budget. No need for a library:

// src/chunk.ts
import type { DocChunk } from './types'

export function chunkText(
    text: string,
    source: string,
    title: string,
    maxChars = 1800
): DocChunk[] {
    // Split on blank lines and drop empty paragraphs.
    const paragraphs = text.split(/\n\s*\n/).filter((p) => p.trim())
    const chunks: DocChunk[] = []
    let buffer = ''
    let index = 0

    const flush = () => {
        if (!buffer.trim()) return
        chunks.push({
            id: `${source}#${index++}`,
            text: buffer.trim(),
            source,
            title,
        })
        buffer = ''
    }

    for (const para of paragraphs) {
        if (buffer.length + para.length > maxChars) flush()
        buffer += para + '\n\n'
    }
    flush()

    return chunks
}
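
One gap in the chunker above: a single paragraph longer than maxChars passes through as one oversized chunk. A minimal sketch of a pre-pass that breaks such paragraphs at sentence boundaries, with a hard cut as a last resort, might look like this. splitOversized is a hypothetical helper, and the sentence regex is a crude assumption that will misfire on abbreviations and decimals followed by spaces.

```typescript
// Hypothetical pre-pass: break any paragraph longer than maxChars into smaller pieces.
function splitOversized(paragraph: string, maxChars = 1800): string[] {
    if (paragraph.length <= maxChars) return [paragraph]

    // Crude sentence split: break on whitespace that follows ., ! or ?
    const sentences = paragraph.split(/(?<=[.!?])\s+/)
    const pieces: string[] = []
    let buffer = ''

    const flush = () => {
        if (buffer) pieces.push(buffer)
        buffer = ''
    }

    for (let sentence of sentences) {
        // A single sentence longer than the budget gets a hard cut.
        while (sentence.length > maxChars) {
            flush()
            pieces.push(sentence.slice(0, maxChars))
            sentence = sentence.slice(maxChars)
        }
        if (!sentence) continue
        // Start a new piece when adding this sentence (plus a space) would overflow.
        if (buffer && buffer.length + 1 + sentence.length > maxChars) flush()
        buffer = buffer ? `${buffer} ${sentence}` : sentence
    }
    flush()

    return pieces
}
```

In chunkText this would slot in before the main loop, for example as paragraphs.flatMap((p) => splitOversized(p, maxChars)).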

Retrieval with metadata filtering

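Retrieval embeds the query with the same model, then queries Vectorize. The detail everyone misses on their first try: returnMetadata defaults to 'none', so unless you ask for it, you get back IDs and scores but not your text, and your generation step has nothing to work with. A second one applies if you use the filter: Vectorize only filters on properties that have a metadata index, so create one for source (wrangler vectorize create-metadata-index) before you ingest, because vectors written before the index exists are not picked up by the filter until you upsert them again.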

// src/retrieve.ts
import type { RetrievedChunk } from './types'

const EMBEDDING_MODEL = '@cf/baai/bge-base-en-v1.5'

export async function retrieve(
    env: Env,
    query: string,
    topK = 5,
    sourceFilter?: string
): Promise<RetrievedChunk[]> {
    const { data } = await env.AI.run(EMBEDDING_MODEL, { text: [query] })
    const queryVector = data[0]

    const result = await env.VECTORIZE.query(queryVector, {
        topK,
        returnMetadata: 'all',
        // Filter by metadata — e.g. restrict to one doc set or tenant.
        ...(sourceFilter ? { filter: { source: sourceFilter } } : {}),
    })

    return result.matches
        // Drop weak matches so we don't stuff irrelevant context into the prompt.
        .filter((m) => m.score > 0.5)
        .map((m) => ({
            text: String(m.metadata?.text ?? ''),
            source: String(m.metadata?.source ?? ''),
            title: String(m.metadata?.title ?? ''),
            score: m.score,
        }))
}

That score > 0.5 cutoff matters more than it looks. Vector search always returns your top-K, even when nothing is actually relevant. Without a floor, an off-topic question pulls in your five least-irrelevant chunks and the model dutifully hallucinates an answer from them. The threshold lets you detect "I don't have anything good" and tell the user so.
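
The floor and the "nothing good" signal can be factored into one small pure function, which makes the cutoff easy to unit-test and tune. This is a sketch that assumes a simplified Match shape rather than the real Vectorize types; selectContext is a hypothetical name and 0.5 is just the default used in this post.

```typescript
// Simplified stand-in for a Vectorize match; the real type carries more fields.
interface Match {
    id: string
    score: number
    metadata?: Record<string, unknown>
}

type Selection =
    | { kind: 'ok'; matches: Match[] }
    | { kind: 'no_relevant_context'; bestScore: number | null }

// Keep matches above the floor, or report that nothing cleared it.
function selectContext(matches: Match[], minScore = 0.5): Selection {
    const kept = matches.filter((m) => m.score > minScore)
    if (kept.length > 0) return { kind: 'ok', matches: kept }

    // Expose the best score we did see, which is useful when tuning the floor.
    const bestScore = matches.length > 0 ? Math.max(...matches.map((m) => m.score)) : null
    return { kind: 'no_relevant_context', bestScore }
}
```

Logging bestScore for rejected queries is a cheap way to see whether 0.5 is too strict or too lenient for your corpus, since typical similarity ranges vary by embedding model.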

Pure vector search also stumbles on exact-match queries — error codes, SKUs, function names. If that's your data, layer in keyword search; I walk through that in Hybrid Search That Actually Works.

Streaming the answer from the edge

Now the generation step. Workers AI streams responses as Server-Sent Events when you pass stream: true, and a Worker can return that stream directly to the browser. The first token leaves the same data center the user's request landed in.

// src/generate.ts
import type { RetrievedChunk } from './types'

const LLM_MODEL = '@cf/meta/llama-3.1-8b-instruct'

export async function generateStream(
    env: Env,
    query: string,
    context: RetrievedChunk[]
): Promise<ReadableStream> {
    const sources = context
        .map((c, i) => `[${i + 1}] (${c.title})\n${c.text}`)
        .join('\n\n')

    const system =
        'You answer using ONLY the numbered sources below. ' +
        'Cite sources inline like [1]. If the sources do not contain ' +
        'the answer, say you do not know. Do not invent facts.\n\n' +
        `Sources:\n${sources}`

    const stream = await env.AI.run(LLM_MODEL, {
        messages: [
            { role: 'system', content: system },
            { role: 'user', content: query },
        ],
        stream: true,
        max_tokens: 512,
    })

    return stream as ReadableStream
}

Wiring it all together

The Worker's fetch handler is the whole API. One POST /ingest route for indexing and one POST /query route that retrieves and streams. Real error handling included — an empty index, a failed embedding, or a malformed body should return a clean status code, not a 500 with a stack trace.

// src/index.ts
import { chunkText } from './chunk'
import { ingestChunks } from './ingest'
import { retrieve } from './retrieve'
import { generateStream } from './generate'

// Parse a JSON body, returning null when it is missing or malformed,
// so the route can answer with a 400 instead of falling through to the 500 handler.
async function readJson<T>(req: Request): Promise<T | null> {
    try {
        return (await req.json()) as T
    } catch {
        return null
    }
}

export default {
    async fetch(req: Request, env: Env): Promise<Response> {
        const url = new URL(req.url)

        try {
            if (req.method === 'POST' && url.pathname === '/ingest') {
                const body = await readJson<{
                    text?: string
                    source?: string
                    title?: string
                }>(req)
                if (!body || !body.text || !body.source) {
                    return Response.json(
                        { error: 'text and source are required' },
                        { status: 400 }
                    )
                }
                // Fall back to the source as the title when none is given.
                const chunks = chunkText(body.text, body.source, body.title ?? body.source)
                const count = await ingestChunks(env, chunks)
                return Response.json({ ingested: count })
            }

            if (req.method === 'POST' && url.pathname === '/query') {
                const body = await readJson<{
                    query?: string
                    source?: string
                }>(req)
                if (!body || typeof body.query !== 'string' || !body.query.trim()) {
                    return Response.json({ error: 'query is required' }, { status: 400 })
                }

                const context = await retrieve(env, body.query, 5, body.source)
                // Nothing cleared the score floor: answer with plain JSON, not a stream.
                if (context.length === 0) {
                    return Response.json(
                        { answer: "I don't have anything relevant indexed for that." },
                        { status: 200 }
                    )
                }

                const stream = await generateStream(env, body.query, context)
                return new Response(stream, {
                    headers: {
                        'content-type': 'text/event-stream',
                        'cache-control': 'no-cache',
                        // Surface citations to the client without parsing the stream.
                        'x-sources': JSON.stringify(
                            context.map((c) => ({ title: c.title, source: c.source }))
                        ),
                    },
                })
            }

            return new Response('Not found', { status: 404 })
        } catch (err) {
            // Log the full error server-side; return only the message to the client.
            console.error('edge-rag error:', err)
            const message = err instanceof Error ? err.message : 'unknown error'
            return Response.json({ error: message }, { status: 500 })
        }
    },
} satisfies ExportedHandler<Env>

Deploy it:

pnpm dlx wrangler deploy

Index a document and ask a question:

curl -X POST https://edge-rag.<your-subdomain>.workers.dev/ingest \
  -H 'content-type: application/json' \
  -d '{"text":"Refunds are processed within 5 business days...","source":"refund-policy","title":"Refund Policy"}'

curl -N -X POST https://edge-rag.<your-subdomain>.workers.dev/query \
  -H 'content-type: application/json' \
  -d '{"query":"How long do refunds take?"}'

The -N flag disables curl's buffering so you watch tokens stream in.
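
On the client, the stream arrives as raw SSE text, so something has to turn it into tokens. Here is a minimal sketch of an incremental parser, assuming each event is a single "data: <json>" line with a response field and that the stream ends with "data: [DONE]". That matches what I see from Workers AI text generation, but treat the event shape as an assumption and check it against your model's actual output. createSseParser is a hypothetical helper.

```typescript
// Incremental parser for an SSE text stream. Assumes each event is a single
// "data: <json>" line carrying a `response` field, ending with "data: [DONE]".
function createSseParser() {
    let buffer = ''
    let done = false

    return {
        // Feed a decoded text chunk; returns the tokens completed by this chunk.
        push(chunk: string): string[] {
            buffer += chunk
            const lines = buffer.split('\n')
            // The last element may be a partial line; keep it for the next push.
            buffer = lines.pop() ?? ''

            const tokens: string[] = []
            for (const raw of lines) {
                const line = raw.trim()
                if (done || !line.startsWith('data:')) continue

                const payload = line.slice('data:'.length).trim()
                if (payload === '[DONE]') {
                    done = true
                    continue
                }
                try {
                    const parsed = JSON.parse(payload) as { response?: unknown }
                    if (typeof parsed.response === 'string') tokens.push(parsed.response)
                } catch {
                    // Skip lines that are not valid JSON rather than aborting the stream.
                }
            }
            return tokens
        },
        isDone: () => done,
    }
}
```

In a browser you would read the response body with a reader, decode each chunk with TextDecoder in streaming mode, and feed the text to push. Check the content-type first: when nothing relevant is indexed, the /query route answers with plain JSON instead of an event stream.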

The latency numbers

I ran the same workload — embed a query, retrieve top-5, generate ~200 tokens — against this edge setup and a centralized Next.js route on Vercel (us-east-1) calling Pinecone and an external LLM. Tested from a laptop in Tel Aviv, well away from us-east-1:

| Metric | Edge (Workers AI + Vectorize) | Next.js + Pinecone (us-east-1) |
| --- | --- | --- |
| Time to first byte (p50) | 84ms | 412ms |
| Time to first byte (p95) | 130ms | 690ms |
| Retrieval round trip | ~28ms | ~210ms |
| Cold start | ~5ms | ~250ms (function) |

The edge wins on every latency axis, and it wins biggest the farther the user is from your central region. Where the centralized setup claws back ground is answer quality: GPT-class and Claude-class models reason better than an 8B Llama. For long-form analytical answers, the centralized stack is still worth its latency tax. For "what do our docs say about X," edge is the better trade nine times out of ten.
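
If you want to reproduce this kind of measurement, p50 and p95 can be computed from raw timing samples in a few lines. This sketch uses the nearest-rank method, which is one common definition among several and not necessarily how any particular benchmarking tool computes percentiles. percentile is a hypothetical helper, and the sample values are made up purely for illustration; they are not the data behind the table.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) throw new Error('percentile of empty sample set')
    if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]')

    // Copy before sorting so the caller's array is left untouched.
    const sorted = [...samples].sort((a, b) => a - b)
    const rank = Math.ceil((p / 100) * sorted.length)
    return sorted[rank - 1]
}

// Made-up time-to-first-byte samples in milliseconds, purely for illustration.
const ttfbMs = [80, 95, 70, 120, 85, 90, 110, 75, 100, 130]
const p50 = percentile(ttfbMs, 50)
const p95 = percentile(ttfbMs, 95)
console.log({ p50, p95 })
```

With only a handful of samples, p95 is effectively the maximum, so collect a few hundred requests before reading much into the tail.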

A couple more honest caveats. Vectorize writes are eventually consistent: a freshly upserted vector can take a few seconds to show up in queries, so don't ingest-then-immediately-query in a test and panic. And Workers AI has its own rate limits and occasional cold starts on less-popular models, so build in retries the same way you would for any provider (see How to Handle AI API Rate Limits and Errors in Production).
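
A retry wrapper does not need a library. The following is a minimal sketch of exponential backoff with jitter; withRetry, its option names, and the default values are all assumptions for illustration, not part of the Workers AI API.

```typescript
// Retry an async operation with exponential backoff and jitter.
async function withRetry<T>(
    fn: () => Promise<T>,
    { attempts = 3, baseDelayMs = 200 }: { attempts?: number; baseDelayMs?: number } = {}
): Promise<T> {
    if (attempts < 1) throw new Error('attempts must be >= 1')

    let lastError: unknown
    for (let attempt = 0; attempt < attempts; attempt++) {
        try {
            return await fn()
        } catch (err) {
            lastError = err
            // Out of attempts: stop waiting and surface the last error.
            if (attempt === attempts - 1) break
            // Base delay doubles each attempt, plus up to 50% random jitter.
            const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.5)
            await new Promise((resolve) => setTimeout(resolve, delay))
        }
    }
    throw lastError
}
```

In the retrieval code this would wrap the embedding call, for example withRetry(() => env.AI.run(EMBEDDING_MODEL, { text: [query] })). Note that this version retries every error, including ones that will never succeed, such as a bad model name; a production version would inspect the error and only retry rate limits and transient failures.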

What's next

The natural follow-up is deciding where your AI backend should live in general — edge compute versus a model gateway in front of frontier providers. I dig into that trade-off, with the same kind of latency and cost math, in Cloudflare Workers AI vs Vercel AI Gateway: Where Should Your AI Backend Run? — pairing the edge-first approach here against a gateway that gives you Claude- and GPT-class quality with routing and observability bolted on.


Vadim Alakhverdov

Software developer writing about JavaScript, web development, and developer tools.
