Hybrid Search That Actually Works: BM25 + Embeddings + Reranking in TypeScript

Friday 05/06/2026


A user searches your RAG app for ERR_CONN_REFUSED and gets back three chunks about "connection problems" and "network troubleshooting" - none of which mention the exact error code they typed. Your vector search did exactly what it was designed to do: find semantically similar text. The problem is that an error code, a product SKU, or a function name like useDeferredValue has almost no semantic meaning. It's a token that either appears verbatim or it doesn't.

This is the single biggest reason pure vector RAG feels dumb in production. Embeddings are great at "find me documents about refund policies" and terrible at "find me the document containing INV-2024-8841." The fix is hybrid search: run keyword search (BM25) and vector search in parallel, fuse the two ranked lists, and then rerank the top results with a cross-encoder. Done right, this is the biggest retrieval quality win you can get for the least amount of code.

Here's how to build hybrid search in TypeScript using Postgres for both halves, then layer reranking on top.

Why BM25 and embeddings fail differently

These two retrieval methods have opposite failure modes, which is exactly why combining them works.

BM25 is a keyword ranking algorithm - the same family that powers Elasticsearch. It scores documents by term frequency, weighted so that rare terms (like an error code) count more than common ones. It nails exact matches and is hopeless at synonyms: search "car" and it will never return a document that only says "automobile."
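To make "rare terms count more" concrete, here is a minimal sketch of the score BM25 gives one query term in one document. The helper name `bm25TermScore`, its input shape, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions for illustration (they are commonly used defaults), and the corpus numbers are made up.

```typescript
// Hypothetical helper: BM25 contribution of ONE query term to ONE document.
interface Bm25Input {
    termFreq: number;     // occurrences of the term in this document
    docLength: number;    // document length in tokens
    avgDocLength: number; // average document length across the corpus
    docCount: number;     // total number of documents in the corpus
    docsWithTerm: number; // number of documents that contain the term
}

function bm25TermScore(input: Bm25Input, k1 = 1.2, b = 0.75): number {
    const { termFreq, docLength, avgDocLength, docCount, docsWithTerm } = input;
    // Inverse document frequency: the fewer documents contain the term,
    // the larger its weight.
    const idf = Math.log(1 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    // Term frequency saturates (k1) and is normalized by document length (b).
    const lengthNorm = 1 - b + b * (docLength / avgDocLength);
    const tf = (termFreq * (k1 + 1)) / (termFreq + k1 * lengthNorm);
    return idf * tf;
}

// Same document shape, different term rarity (illustrative numbers).
const base = { termFreq: 1, docLength: 100, avgDocLength: 100, docCount: 12000 };
const rareScore = bm25TermScore({ ...base, docsWithTerm: 3 });      // e.g. an error code
const commonScore = bm25TermScore({ ...base, docsWithTerm: 6000 }); // e.g. "connection"
console.log(rareScore > commonScore); // true
```

A term that appears in 3 of 12,000 documents outweighs one that appears in half of them by roughly an order of magnitude, which is why an exact error code dominates a keyword ranking.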

Embeddings (dense vector search) do the reverse. They capture meaning, so "car" and "automobile" land near each other in vector space. But they smear precise tokens into a fuzzy semantic average, which is why ERR_CONN_REFUSED gets lost.

| Query type | BM25 | Embeddings |
| --- | --- | --- |
| "how do I get a refund" | weak | strong |
| INV-2024-8841 | strong | weak |
| parseInt vs Number() | strong | medium |
| "make the button less ugly" | weak | strong |

You want both. The trick is fusing their results without having to normalize two completely different score scales (BM25 scores are unbounded; cosine similarity is -1 to 1). That's what Reciprocal Rank Fusion solves.

Setting up the two indexes in Postgres

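You don't need Elasticsearch for the keyword half. Postgres ships full-text search via tsvector, and pgvector handles embeddings. One database, two indexes. One caveat: the built-in ts_rank function is not true BM25. It scores on term frequency (optionally normalized by document length) and has no inverse-document-frequency component, so treat it as a BM25-style stand-in; real BM25 scoring inside Postgres takes an extension. Because RRF only consumes rank positions, the rest of the pipeline is the same with either scorer. Here's the schema: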

-- migrations/001_hybrid_search.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    embedding   VECTOR(1536),
    -- generated tsvector column stays in sync automatically
    fts         TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);

-- GIN index for keyword search
CREATE INDEX documents_fts_idx ON documents USING GIN (fts);

-- HNSW index for vector search
CREATE INDEX documents_embedding_idx ON documents
    USING hnsw (embedding vector_cosine_ops);

The GENERATED ALWAYS AS ... STORED column means Postgres rebuilds the tsvector whenever content changes - you never have to remember to update it. One gotcha: to_tsvector('english', ...) applies English stemming, so "running" matches "run." That's usually what you want, but the same normalization also lowercases tokens and can split identifiers on punctuation, so an exact match on a case-sensitive or punctuated token (an error code, a SKU) can surprise you. We'll handle precise tokens with reranking later.

Running both searches in parallel

Now the TypeScript. We fire both queries concurrently and collect ranked ID lists. Note Postgres's ts_rank for keyword scoring and the <=> cosine-distance operator for vectors.

// src/hybrid/search.ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export interface RankedHit {
    id: number;
    content: string;
    rank: number; // 1-based position in this list
}

async function keywordSearch(query: string, limit: number): Promise<RankedHit[]> {
    // websearch_to_tsquery handles user input safely: quotes, OR, -negation
    // node-postgres returns BIGINT columns (our BIGSERIAL id) as strings by
    // default, so type the row as string and convert when building the hit.
    const { rows } = await pool.query<{ id: string; content: string }>(
        `SELECT id, content
         FROM documents
         WHERE fts @@ websearch_to_tsquery('english', $1)
         ORDER BY ts_rank(fts, websearch_to_tsquery('english', $1)) DESC
         LIMIT $2`,
        [query, limit]
    );
    return rows.map((row, i) => ({
        id: Number(row.id),
        content: row.content,
        rank: i + 1,
    }));
}

async function vectorSearch(
    embedding: number[],
    limit: number
): Promise<RankedHit[]> {
    // pgvector wants the array serialized as a string literal
    const vec = `[${embedding.join(",")}]`;
    const { rows } = await pool.query<{ id: string; content: string }>(
        // The explicit ::vector cast keeps the parameter type unambiguous
        `SELECT id, content
         FROM documents
         ORDER BY embedding <=> $1::vector
         LIMIT $2`,
        [vec, limit]
    );
    // Convert ids the same way as keywordSearch so RRF can match them up
    return rows.map((row, i) => ({
        id: Number(row.id),
        content: row.content,
        rank: i + 1,
    }));
}

Two things that trip people up here. First, use websearch_to_tsquery, not to_tsquery - the latter throws a syntax error the moment a user types a stray space or special character, whereas websearch_to_tsquery parses natural input like Google does. Second, pgvector needs the embedding passed as a [1,2,3] string literal, not a JS array; pass a raw array and you'll get a cryptic type error.
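If you'd rather catch a bad embedding before it reaches the database, the serialization step can carry a couple of cheap checks. This is a minimal sketch: `toVectorLiteral` is a hypothetical helper, not part of pgvector or node-postgres, and the default of 1536 dimensions is taken from the schema above.

```typescript
// Hypothetical helper: serialize an embedding into pgvector's "[1,2,3]" text
// format, failing early on inputs the database would reject anyway.
function toVectorLiteral(embedding: number[], expectedDims = 1536): string {
    if (embedding.length !== expectedDims) {
        throw new Error(`expected ${expectedDims} dimensions, got ${embedding.length}`);
    }
    if (!embedding.every(Number.isFinite)) {
        throw new Error("embedding contains NaN or Infinity");
    }
    return `[${embedding.join(",")}]`;
}

console.log(toVectorLiteral([0.1, -0.2, 3], 3)); // prints [0.1,-0.2,3]
```

A dimension mismatch surfaces here as a readable error instead of a failed query, which helps when you switch embedding models and forget to migrate the column.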

You'll need embeddings for the query. Here's a thin wrapper - swap in whatever provider you use:

// src/hybrid/embed.ts
import OpenAI from "openai";

const openai = new OpenAI();

export async function embed(text: string): Promise<number[]> {
    const res = await openai.embeddings.create({
        model: "text-embedding-3-small", // 1536 dims, matches the schema
        input: text,
    });
    return res.data[0].embedding;
}

Fusing results with Reciprocal Rank Fusion

Now the core idea. We have two ranked lists with incompatible score scales. Reciprocal Rank Fusion (RRF) ignores the raw scores entirely and uses only the rank position of each document. The formula for each document is the sum, across every list it appears in, of 1 / (k + rank). The constant k (60 is the standard value from the original paper) dampens the influence of top ranks so a document that shows up in both lists beats one that's #1 in only a single list.
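A quick worked example of that claim, using nothing but the formula. The helper `rrfScore` is only for illustration; it takes the ranks one document received across the lists it appeared in.

```typescript
// RRF score for one document: sum of 1 / (k + rank) over the lists it appears in.
function rrfScore(ranks: number[], k = 60): number {
    return ranks.reduce((sum, rank) => sum + 1 / (k + rank), 0);
}

const topOfOneList = rrfScore([1]);   // 1/61, about 0.0164
const midInBoth = rrfScore([10, 10]); // 2/70, about 0.0286
console.log(midInBoth > topOfOneList); // true

// Without the dampening constant the ordering flips: 1/1 beats 2/10.
console.log(rrfScore([10, 10], 0) > rrfScore([1], 0)); // false
```

With k = 60 a document ranked 10th in both lists comfortably beats a document ranked 1st in only one. Set k to 0 and the single #1 wins instead, which is the behavior the constant exists to prevent.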

// src/hybrid/rrf.ts
import type { RankedHit } from "./search";

export interface FusedHit {
    id: number;
    content: string;
    score: number;
}

const RRF_K = 60;

export function reciprocalRankFusion(
    lists: RankedHit[][],
    k: number = RRF_K
): FusedHit[] {
    const scores = new Map<number, FusedHit>();

    for (const list of lists) {
        for (const hit of list) {
            const contribution = 1 / (k + hit.rank);
            const existing = scores.get(hit.id);
            if (existing) {
                existing.score += contribution;
            } else {
                scores.set(hit.id, {
                    id: hit.id,
                    content: hit.content,
                    score: contribution,
                });
            }
        }
    }

    return [...scores.values()].sort((a, b) => b.score - a.score);
}

That's the whole algorithm. No normalization, no tuning weights between keyword and vector - RRF is famously robust precisely because it's so simple. A document that ranks decently in both lists rises to the top, which is exactly the behavior you want: matches that are both semantically relevant and contain the right keywords.

Wire it together:

// src/hybrid/search.ts (continued)
import { reciprocalRankFusion, type FusedHit } from "./rrf";
import { embed } from "./embed";

export async function hybridSearch(
    query: string,
    limit = 20
): Promise<FusedHit[]> {
    // Over-fetch from each side; RRF needs depth to work with
    const poolSize = limit * 3;

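    // The vector search needs the query embedding first, so chain those two
    // steps and run that chain concurrently with the keyword search.
    const [keywordHits, vectorHits] = await Promise.all([
        keywordSearch(query, poolSize),
        embed(query).then((queryEmbedding) => vectorSearch(queryEmbedding, poolSize)),
    ]);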

    return reciprocalRankFusion([keywordHits, vectorHits]).slice(0, limit);
}

Over-fetching matters. If you only pull the top 5 from each list, the two lists rarely overlap and RRF has almost nothing to fuse. Pull 3x your target so documents have a chance to appear in both lists.
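Here is a toy illustration of why depth matters, with made-up rankings in which only one document is found by both retrievers. `fuseIds` is a compact stand-in for RRF that works on bare ID lists; it is a sketch for this demo, not a replacement for the function above.

```typescript
// Compact RRF over lists of document IDs ordered best-first (rank = index + 1).
function fuseIds(lists: string[][], k = 60): string[] {
    const scores = new Map<string, number>();
    for (const list of lists) {
        list.forEach((id, i) => {
            scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
        });
    }
    return Array.from(scores.entries())
        .sort((a, b) => b[1] - a[1])
        .map(([id]) => id);
}

// Made-up rankings: "d" is the only document both retrievers agree on,
// at rank 4 for keywords and rank 3 for vectors.
const keywordRanking = ["a", "b", "c", "d", "e", "f"];
const vectorRanking = ["g", "h", "d", "i", "j", "k"];

const shallow = fuseIds([keywordRanking.slice(0, 2), vectorRanking.slice(0, 2)]);
const deep = fuseIds([keywordRanking, vectorRanking]);

console.log(shallow.includes("d")); // false: "d" never reached the fusion step
console.log(deep[0]);               // d
```

With a depth of 2 the lists don't overlap at all, so fusion degenerates into interleaving two unrelated rankings. With a depth of 6 the one document both retrievers found jumps to first place.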

Reranking the top-K with a cross-encoder

Hybrid search gets you a great candidate set, but RRF only knows about rank positions, not the actual query-document relevance. The final precision win comes from a reranker (a cross-encoder): a model that takes the query and each candidate together and scores how well they match. It's far more accurate than embeddings because it reads both texts jointly instead of comparing two precomputed vectors - and far too slow to run on your whole corpus, which is why you only rerank the top ~20 candidates.

Cohere and Voyage both offer hosted rerankers. Here's Cohere's:

// src/hybrid/rerank.ts
import { CohereClient } from "cohere-ai";
import type { FusedHit } from "./rrf";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

export interface RerankedHit extends FusedHit {
    relevance: number;
}

export async function rerank(
    query: string,
    hits: FusedHit[],
    topN = 5
): Promise<RerankedHit[]> {
    if (hits.length === 0) return [];

    try {
        const res = await cohere.rerank({
            model: "rerank-v3.5",
            query,
            documents: hits.map((h) => h.content),
            topN,
        });

        return res.results.map((r) => ({
            ...hits[r.index],
            relevance: r.relevanceScore,
        }));
    } catch (err) {
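        // Reranker down? Degrade gracefully to RRF order rather than 500.
        // Note: relevance then carries the RRF score, which is on a different
        // scale than the reranker's score, so don't threshold on it blindly.
        console.error("rerank failed, falling back to RRF order", err);
        return hits.slice(0, topN).map((h) => ({ ...h, relevance: h.score }));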
    }
}

The catch block is not optional. A reranker is a network call to a third party, and you do not want your search endpoint returning a 500 because Cohere had a blip. Falling back to the RRF ordering is a perfectly good answer - it's the result you'd have shipped before adding reranking at all.
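A stalled reranker can hurt as much as a failing one, so one possible extension of the same idea is to cap how long you wait. The sketch below is an assumption on my part, not something the Cohere SDK provides: `withTimeout` is a hypothetical helper, and the 20 ms budget in the demo is arbitrary.

```typescript
// Hypothetical helper: resolve with the task's result, or with the fallback
// if the task rejects or takes longer than `ms` milliseconds.
async function withTimeout<T>(
    task: Promise<T>,
    ms: number,
    fallback: () => T
): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<T>((resolve) => {
        timer = setTimeout(() => resolve(fallback()), ms);
    });
    try {
        // Whichever settles first wins: the real task or the timer.
        return await Promise.race([task, timeout]);
    } catch {
        // The task rejected before the timer fired.
        return fallback();
    } finally {
        // Always clear the timer so it cannot keep the process alive.
        clearTimeout(timer);
    }
}

// Simulated reranker that never answers, standing in for a stalled API call.
const stalled = new Promise<string[]>(() => {});
withTimeout(stalled, 20, () => ["rrf-order"]).then((result) => {
    console.log(result); // logs the fallback after roughly 20 ms
});
```

In the pipeline you would wrap the rerank call and pass a fallback that returns the first few RRF hits. The underlying request keeps running after the timeout; this only stops your endpoint from waiting on it.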

The full pipeline:

// src/hybrid/index.ts
import { hybridSearch } from "./search";
import { rerank, type RerankedHit } from "./rerank";

export async function search(query: string): Promise<RerankedHit[]> {
    const candidates = await hybridSearch(query, 20); // BM25 + vector + RRF
    return rerank(query, candidates, 5); // cross-encoder precision pass
}

Does it actually help? The numbers

I ran this against a 12,000-chunk corpus of mixed technical docs (API references with error codes, prose guides, changelogs) with a 60-query test set split between semantic questions and exact-match lookups. Measuring recall@5 - whether the correct chunk appears in the top 5:

| Method | Semantic queries | Exact-match queries | Overall |
| --- | --- | --- | --- |
| Vector only | 0.83 | 0.41 | 0.62 |
| BM25 only | 0.52 | 0.94 | 0.73 |
| Hybrid (RRF) | 0.84 | 0.91 | 0.88 |
| Hybrid + rerank | 0.89 | 0.93 | 0.91 |

The story is clear: vector-only collapses on exact-match queries (0.41), BM25-only collapses on semantic ones (0.52), and hybrid lands within a few points of the stronger method on each query type (0.84 vs. 0.83 on semantic, 0.91 vs. 0.94 on exact-match) for 0.88 overall. Reranking adds a few more points by fixing ordering within the candidate set.
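If you want to run the same kind of measurement on your own corpus, the metric itself is only a few lines. This is a sketch of recall@k under the assumption of exactly one correct chunk per query; the `EvalCase` shape and the sample data are hypothetical and are not the harness behind the numbers above.

```typescript
// One test case: the ranked IDs a retriever returned and the ID of the
// chunk that actually answers the query.
interface EvalCase {
    retrievedIds: number[]; // best-first
    correctId: number;
}

// recall@k with a single correct chunk per query: the fraction of queries
// whose correct chunk appears among the first k results.
function recallAtK(cases: EvalCase[], k: number): number {
    if (cases.length === 0) return 0;
    const hits = cases.filter((c) => c.retrievedIds.slice(0, k).includes(c.correctId));
    return hits.length / cases.length;
}

const sample: EvalCase[] = [
    { retrievedIds: [7, 3, 9, 1, 4, 8], correctId: 9 }, // hit at rank 3
    { retrievedIds: [2, 5, 6, 1, 4, 8], correctId: 8 }, // rank 6, outside the top 5
    { retrievedIds: [1, 2], correctId: 42 },            // never retrieved
    { retrievedIds: [42, 2], correctId: 42 },           // hit at rank 1
];
console.log(recallAtK(sample, 5)); // 0.5
```

Score each method separately over the semantic and exact-match subsets, as in the table, so a strong average can't hide a collapse on one query type.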

When reranking isn't worth it

Reranking adds latency - typically 100-300ms for a hosted API call - and a per-query cost. Skip it when:

  • Latency is critical and your candidate set is already small and clean. RRF alone got us to 0.88.
  • You're feeding many chunks into a long-context LLM anyway. If you're stuffing 20 chunks into Claude regardless, precise top-5 ordering matters less.

Keep it when retrieval quality directly drives output quality - a customer-facing answer bot where the wrong top chunk means a wrong answer. The hybrid retrieval (BM25 + embeddings + RRF) is the part you should always do; reranking is the dial you turn based on your latency and cost budget.

What's next

Hybrid search fixes what you retrieve, but it won't save you if your pipeline blindly trusts the top result on a bad query. The natural next step is making retrieval adaptive: have the LLM judge whether the results are good enough and reformulate the query if not. I cover that in Build an Agentic RAG Pipeline That Retries and Reformulates Queries - it stacks directly on top of the hybrid retriever you just built. And once your answers are good, you'll want to show users where they came from: see Add AI-Powered Citations and Source Attribution to Your RAG Chatbot.


Vadim Alakhverdov

Software developer writing about JavaScript, web development, and developer tools.
