Pinecone vs Turbopuffer vs pgvector: Which Vector Database for Production RAG in 2026

Friday 22/05/2026

You're picking a vector database for a real RAG app and the internet is useless. Half the posts say "just use pgvector" based on a 10k-row demo; the other half are Pinecone marketing wrapped in a blog template. Nobody benchmarks at the size you actually care about, and Turbopuffer - which has quietly become the most interesting entrant in the last year - barely shows up in comparisons at all.

So I indexed the same 1M-chunk dataset into Pinecone Serverless, Turbopuffer, and Postgres 16 with pgvector, ran identical TypeScript retrieval workloads against all three, and measured latency, recall, and ingestion throughput at 1M chunks, plus monthly cost at 1M, 10M, and 100M chunks. This is the head-to-head Pinecone vs Turbopuffer vs pgvector comparison I wish had existed when I started.

The test setup

Every benchmark below uses the same dataset and the same TypeScript client patterns:

Corpus: 1,012,440 chunks from a public docs corpus, ~340 tokens each, embedded with text-embedding-3-small (1536 dimensions).
Query set: 500 real user queries collected from an in-prod search log, with human-labeled relevant chunk IDs for recall@10 scoring.
Filter test: Each chunk has a tenant_id, doc_type, and updated_at field - the filtered runs constrain to a single tenant (~10% of the corpus) and a doc_type enum.
Client: A single Node 22 process on a t3.large in us-east-1, measuring end-to-end latency around each await client.query(...) call.
Concurrency: 32 concurrent queries, 10-minute steady-state windows.

If you want to reproduce, the embedding/retrieval scaffold is the same one from Build a RAG Chatbot in 100 Lines of TypeScript - just swap the store layer.

The TypeScript retrieval layer

All three stores live behind the same interface so I could swap them without changing the rest of the pipeline:

// src/lib/vector-store.ts
export interface SearchResult {
    id: string
    score: number
    text: string
    metadata: Record<string, unknown>
}

export interface VectorStore {
    upsert(chunks: Chunk[]): Promise<void>
    search(
        embedding: number[],
        opts: { topK: number; filter?: Record<string, unknown> }
    ): Promise<SearchResult[]>
}

export interface Chunk {
    id: string
    embedding: number[]
    text: string
    metadata: { tenant_id: string; doc_type: string; updated_at: string }
}

For Pinecone Serverless:

// src/lib/stores/pinecone.ts
import { Pinecone } from '@pinecone-database/pinecone'
import type { VectorStore, Chunk, SearchResult } from '../vector-store'

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
const index = pc.index('rag-bench')

export const pineconeStore: VectorStore = {
    async upsert(chunks: Chunk[]) {
        const batches = chunkArray(chunks, 100)
        for (const batch of batches) {
            await index.namespace('default').upsert(
                batch.map((c) => ({
                    id: c.id,
                    values: c.embedding,
                    metadata: { text: c.text, ...c.metadata },
                }))
            )
        }
    },
    async search(embedding, { topK, filter }) {
        const res = await index.namespace('default').query({
            vector: embedding,
            topK,
            includeMetadata: true,
            filter,
        })
        return (res.matches ?? []).map((m) => ({
            id: m.id,
            score: m.score ?? 0,
            text: String(m.metadata?.text ?? ''),
            metadata: m.metadata ?? {},
        }))
    },
}

function chunkArray<T>(arr: T[], size: number): T[][] {
    const out: T[][] = []
    for (let i = 0; i < arr.length; i += size) out.push(arr.slice(i, i + size))
    return out
}

For Turbopuffer:

// src/lib/stores/turbopuffer.ts
import { Turbopuffer } from '@turbopuffer/turbopuffer'
import type { VectorStore, Chunk, SearchResult } from '../vector-store'
// Assumption: chunkArray (defined in the Pinecone store above) has been moved
// to a shared module; the path is illustrative.
import { chunkArray } from '../chunk-array'

const tpuf = new Turbopuffer({ apiKey: process.env.TURBOPUFFER_API_KEY! })
const ns = tpuf.namespace('rag-bench')

export const turbopufferStore: VectorStore = {
    async upsert(chunks: Chunk[]) {
        const batches = chunkArray(chunks, 1000)
        for (const batch of batches) {
            await ns.write({
                // Metadata is stored as flat top-level attributes, not a nested object.
                upsert_rows: batch.map((c) => ({
                    id: c.id,
                    vector: c.embedding,
                    text: c.text,
                    tenant_id: c.metadata.tenant_id,
                    doc_type: c.metadata.doc_type,
                    updated_at: c.metadata.updated_at,
                })),
                distance_metric: 'cosine_distance',
            })
        }
    },
    async search(embedding, { topK, filter }) {
        const res = await ns.query({
            rank_by: ['vector', 'ANN', embedding],
            top_k: topK,
            include_attributes: ['text', 'tenant_id', 'doc_type'],
            // filterToTpuf is a local helper that maps the generic filter
            // object to Turbopuffer's filter syntax.
            filters: filterToTpuf(filter),
        })
        return res.rows.map((r) => ({
            id: String(r.id),
            // $dist is a cosine distance (lower is better); convert it to a
            // similarity so scores are comparable with the other stores.
            score: 1 - Number(r.$dist ?? 1),
            text: String(r.text ?? ''),
            metadata: { tenant_id: r.tenant_id, doc_type: r.doc_type },
        }))
    },
}
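
The Turbopuffer store calls filterToTpuf, which is not among the listings in this post. A minimal sketch of what it could look like, assuming the generic filter is a flat object of attribute/value pairs and that equality (or membership, for arrays) is all the benchmark needs. The type names here are made up for illustration; the tuple shape follows Turbopuffer's documented ['And', [[attribute, operator, value], ...]] filter format.

```typescript
// Hypothetical helper: converts { tenant_id: 't1', doc_type: 'guide' } into
// Turbopuffer-style filter tuples. Only Eq and In are handled in this sketch.
type TpufCondition = [string, 'Eq' | 'In', unknown]
type TpufFilter = TpufCondition | ['And', TpufCondition[]]

function filterToTpuf(filter?: Record<string, unknown>): TpufFilter | undefined {
    if (!filter) return undefined

    const conditions: TpufCondition[] = Object.entries(filter)
        // Skip keys whose value is absent so they do not constrain the query.
        .filter(([, value]) => value !== undefined && value !== null)
        .map(([key, value]): TpufCondition =>
            // Arrays become a membership test, scalars an equality test.
            Array.isArray(value) ? [key, 'In', value] : [key, 'Eq', value]
        )

    if (conditions.length === 0) return undefined
    // A single condition does not need an And wrapper.
    return conditions.length === 1 ? conditions[0] : ['And', conditions]
}
```

Returning undefined for an empty filter keeps the unfiltered benchmark runs on the same code path as the filtered ones.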

For pgvector (Postgres 16 with pgvector 0.8, HNSW index, m=16, ef_construction=64):

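// src/lib/stores/pgvector.ts
import postgres from 'postgres'
import type { VectorStore, Chunk, SearchResult } from '../vector-store'
// Assumption: chunkArray (defined in the Pinecone store above) has been moved
// to a shared module; the path is illustrative.
import { chunkArray } from '../chunk-array'

const sql = postgres(process.env.DATABASE_URL!)

// Row shape returned by the search query below.
type ChunkRow = { id: string; text: string; tenant_id: string; doc_type: string; score: number }

export const pgvectorStore: VectorStore = {
    async upsert(chunks: Chunk[]) {
        for (const batch of chunkArray(chunks, 500)) {
            const rows = batch.map((c) => ({
                id: c.id,
                embedding: toVectorLiteral(c.embedding),
                text: c.text,
                tenant_id: c.metadata.tenant_id,
                doc_type: c.metadata.doc_type,
                updated_at: c.metadata.updated_at,
            }))
            // postgres.js expands an array of objects into a multi-row VALUES list.
            // On conflict, refresh every column so edited text is not left stale.
            await sql`
                INSERT INTO chunks ${sql(rows, 'id', 'embedding', 'text', 'tenant_id', 'doc_type', 'updated_at')}
                ON CONFLICT (id) DO UPDATE SET
                    embedding = EXCLUDED.embedding,
                    text = EXCLUDED.text,
                    tenant_id = EXCLUDED.tenant_id,
                    doc_type = EXCLUDED.doc_type,
                    updated_at = EXCLUDED.updated_at
            `
        }
    },
    async search(embedding, { topK, filter }) {
        // postgres.js rejects undefined parameters, so absent filters become NULL.
        const tenant = (filter?.tenant_id as string | undefined) ?? null
        const docType = (filter?.doc_type as string | undefined) ?? null
        const vec = toVectorLiteral(embedding)
        const rows = await sql<ChunkRow[]>`
            SELECT id, text, tenant_id, doc_type,
                   1 - (embedding <=> ${vec}::vector) AS score
            FROM chunks
            WHERE (${tenant}::text IS NULL OR tenant_id = ${tenant})
              AND (${docType}::text IS NULL OR doc_type = ${docType})
            ORDER BY embedding <=> ${vec}::vector
            LIMIT ${topK}
        `
        return rows.map((r): SearchResult => ({
            id: r.id,
            score: Number(r.score),
            text: r.text,
            metadata: { tenant_id: r.tenant_id, doc_type: r.doc_type },
        }))
    },
}

// pgvector accepts vectors as a text literal such as '[0.1,0.2,0.3]'.
function toVectorLiteral(v: number[]) {
    return `[${v.join(',')}]`
}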

One gotcha I lost an afternoon to: pgvector's <=> operator returns cosine distance, not similarity. The HNSW index is only used when the query has the shape ORDER BY embedding <=> $vec LIMIT k, ascending on the raw distance. Order by a derived expression instead, such as 1 - (embedding <=> $vec) DESC, and Postgres silently does a sequential scan on a million rows. I learned that watching EXPLAIN ANALYZE after our p95 latency hit 4 seconds.
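
To make the query shape concrete, here is a small sketch that builds the index-friendly form as a parameterised statement. buildKnnQuery and the KnnQuery type are hypothetical names for illustration; the chunks table and embedding column are the ones used in the store above. The similarity is computed in the SELECT list for display, while ORDER BY stays on the raw distance.

```typescript
interface KnnQuery {
    text: string
    values: unknown[]
}

// Hypothetical helper: builds a k-nearest-neighbour query that pgvector's
// HNSW index can serve. $1 is the vector literal, $2 is the row limit.
function buildKnnQuery(embedding: number[], topK: number): KnnQuery {
    const vec = `[${embedding.join(',')}]`
    return {
        text: [
            'SELECT id, text,',
            '       1 - (embedding <=> $1::vector) AS score', // similarity, for display only
            'FROM chunks',
            'ORDER BY embedding <=> $1::vector', // raw distance, ascending: index-eligible
            'LIMIT $2',
        ].join('\n'),
        values: [vec, topK],
    }
}

// The shape to avoid: sorting on the derived similarity defeats the index.
const sequentialScanShape = 'ORDER BY 1 - (embedding <=> $1::vector) DESC'
```

Prefixing the generated statement with EXPLAIN ANALYZE and looking for an index scan node is a quick way to check that the planner agrees.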

Latency at p50/p95/p99 (1M chunks)

All numbers in milliseconds, end-to-end from the client. topK=10, 32 concurrent queries, 10-minute window, warm caches.

| Store | Unfiltered p50 | Unfiltered p95 | Unfiltered p99 | Filtered p50 | Filtered p95 | Filtered p99 |
|---|---|---|---|---|---|---|
| Pinecone Serverless | 41 | 78 | 124 | 48 | 92 | 156 |
| Turbopuffer (warm) | 38 | 71 | 118 | 44 | 83 | 142 |
| Turbopuffer (cold) | 612 | 980 | 1340 | 690 | 1100 | 1480 |
| pgvector (HNSW, db.m6g.xlarge) | 12 | 28 | 47 | 18 | 39 | 64 |
| pgvector (HNSW, db.t3.large) | 31 | 88 | 211 | 47 | 134 | 298 |

A few things jump out. pgvector on a properly sized instance is genuinely fast - the index sits in Postgres memory next to your transactional data, and there's no extra hop to a third-party API. Pinecone and Turbopuffer are within shouting distance of each other on warm queries, but Turbopuffer's cold-start penalty is real. The first query to an idle namespace can take a full second or more while the index loads from object storage. For a 24/7 chatbot, that doesn't matter. For a feature that goes hours without traffic and then needs to respond fast, it does.

The filtered numbers are where things get interesting. Pinecone and Turbopuffer both apply filters efficiently at the index level. pgvector filters by combining the HNSW index with the WHERE clause, which works well as long as your filter isn't too selective - if you're slicing down to <1% of the corpus, the planner can flip to a sequential scan on the filtered subset, and your p99 spikes.
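
For anyone reproducing the p50/p95/p99 columns above, this is roughly how per-query timings can be reduced to percentiles. It is a sketch using the nearest-rank method, which is an assumption on my part; other percentile definitions interpolate and give slightly different values on small samples. The function names are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
    if (samplesMs.length === 0) throw new Error('no samples')
    const sorted = [...samplesMs].sort((a, b) => a - b)
    const rank = Math.ceil((p * sorted.length) / 100)
    return sorted[Math.max(rank, 1) - 1]
}

// Times one async call end-to-end, the way the client would experience it.
async function timeMs(run: () => Promise<unknown>): Promise<number> {
    const start = performance.now()
    await run()
    return performance.now() - start
}

// Summarises a steady-state window of latency samples.
function summarize(samplesMs: number[]) {
    return {
        p50: percentile(samplesMs, 50),
        p95: percentile(samplesMs, 95),
        p99: percentile(samplesMs, 99),
    }
}
```

Collect one timeMs sample per query during the steady-state window, then call summarize on the whole array; averaging per-second percentiles tends to hide the tail.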

Recall@10 across stores

Filtered runs only (single-tenant, single doc_type) - the more realistic case.

| Store | Recall@10 vs exact KNN |
|---|---|
| Pinecone Serverless | 0.97 |
| Turbopuffer | 0.96 |
| pgvector (HNSW, ef_search=80) | 0.93 |
| pgvector (HNSW, ef_search=200) | 0.97 |

pgvector's default ef_search=40 gets you 0.88 recall, which is bad enough that your RAG pipeline will start hallucinating on edge cases. Bump it to 80 or 200 and you're competitive - but it costs you latency proportionally. This is the dial nobody tells you about. If you're seeing irrelevant chunks come back from your basic pipeline, recall is probably the culprit before you reach for the agentic RAG retry loop.
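
If you want to check recall on your own data, the metric itself is only a few lines. A minimal sketch, assuming you have a reference set of IDs per query (exact KNN results or human labels, whichever you trust) and that IDs within each list are unique. The function names are illustrative.

```typescript
// Recall@k for one query: the share of reference IDs that appear in the
// top k retrieved results.
function recallAtK(retrievedIds: string[], relevantIds: string[], k = 10): number {
    if (relevantIds.length === 0) return 0
    const topK = new Set(retrievedIds.slice(0, k))
    const hits = relevantIds.filter((id) => topK.has(id)).length
    // Normalise by how many reference items could possibly fit in the top k.
    return hits / Math.min(k, relevantIds.length)
}

// Mean recall@k over a labeled query set.
function meanRecallAtK(
    runs: { retrieved: string[]; relevant: string[] }[],
    k = 10
): number {
    if (runs.length === 0) return 0
    const total = runs.reduce((sum, r) => sum + recallAtK(r.retrieved, r.relevant, k), 0)
    return total / runs.length
}
```

On the pgvector side, the dial is the hnsw.ef_search setting, which can be changed per session or per transaction, so you can sweep a few values against the same labeled set and watch recall and latency move together.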

Ingestion throughput

Time to index the full 1M chunks from a single client process:

| Store | Time | Chunks/sec |
|---|---|---|
| Pinecone Serverless | 38 min | ~440 |
| Turbopuffer | 14 min | ~1,190 |
| pgvector (m6g.xlarge, async COPY) | 22 min | ~760 |
| pgvector (with HNSW build time included) | 31 min | ~540 |

Turbopuffer is noticeably faster on writes because it batches into its object-storage layer asynchronously. pgvector's catch is the HNSW build - for the first 1M rows it's manageable, but rebuilding the index after a schema migration on 100M rows can take hours.

Monthly cost at 1M / 10M / 100M chunks

Using 1536-dim float32 embeddings, ~6KB per vector + payload, with 1M queries/month and standard filter usage. Prices reflect mid-2026 pricing tiers.

| Store | 1M | 10M | 100M |
|---|---|---|---|
| Pinecone Serverless | ~$45 | ~$280 | ~$2,400 |
| Turbopuffer | ~$18 | ~$130 | ~$1,050 |
| pgvector (managed Postgres, sized for the workload) | ~$110 (m6g.xlarge) | ~$340 (m6g.2xlarge) | ~$1,800+ (r6g.4xlarge, plus operational pain) |

At 1M chunks, pgvector "looks expensive" only if you don't already have a Postgres instance. If you do - and most apps do - your marginal cost is closer to zero. At 100M, the comparison flips: pgvector requires a beefy instance with enough RAM to hold the HNSW graph (rule of thumb: ~1.5x raw vector size in RAM), plus you eat the operational burden of vacuum, backups, replication lag, and slow index rebuilds.
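
The RAM rule of thumb is easy to turn into arithmetic. This sketch takes the 1536-dimension float32 vectors and the ~1.5x multiplier from the text at face value and counts vectors only, not text or metadata; the function names are illustrative, and the multiplier is a rough guide rather than a measured figure.

```typescript
const BYTES_PER_FLOAT32 = 4

// Raw size of the vectors alone, ignoring text and metadata payload.
function rawVectorBytes(count: number, dims: number): number {
    return count * dims * BYTES_PER_FLOAT32
}

// Applies the ~1.5x rule of thumb for holding the HNSW graph in memory.
function estimateHnswRamGB(count: number, dims: number, overhead = 1.5): number {
    return (rawVectorBytes(count, dims) * overhead) / 1e9
}

for (const count of [1_000_000, 10_000_000, 100_000_000]) {
    console.log(count, estimateHnswRamGB(count, 1536).toFixed(1), 'GB')
}
// 1M   ≈ 9.2 GB
// 10M  ≈ 92.2 GB
// 100M ≈ 921.6 GB
```

Taken literally, the estimate grows past the memory of a single mid-sized instance well before 100M vectors, so compare it against the RAM of whatever instance class you are pricing before trusting a monthly figure.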

Hybrid search, metadata filtering, and multi-tenancy

This is where the marketing pages all sound identical and the operational reality diverges:

Hybrid search (BM25 + vector): Turbopuffer has it natively - you can express rank_by: [['text', 'BM25', query], ['vector', 'ANN', embedding]] and it fuses for you. Pinecone supports sparse-dense vectors but you produce the sparse vectors yourself. pgvector pairs naturally with Postgres tsvector, but you have to write the RRF fusion in SQL.
Metadata filtering: All three support it. Pinecone and Turbopuffer apply filters during the ANN search. pgvector relies on the planner; for high-cardinality filters add a partial index per tenant.
Multi-tenancy: Pinecone gives you free namespaces (use one per tenant - this is the right pattern). Turbopuffer same: namespaces are first-class and cheap. pgvector's "namespace" is a tenant_id column with proper indexing, which is fine until one tenant has 50M chunks and dominates the planner's stats.
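
For the pgvector path, the fusion step can also live in application code rather than SQL. Below is a sketch of reciprocal rank fusion over two or more ranked ID lists, for example one from a tsvector query and one from the vector query. Doing it client-side is my substitution for the SQL version mentioned above, and k = 60 is the constant commonly used for RRF, not something tuned on this benchmark.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) for every ID
// it contains, with rank starting at 1. IDs that rank well in several lists
// accumulate the highest fused scores.
function reciprocalRankFusion(
    rankings: string[][],
    k = 60
): { id: string; score: number }[] {
    const scores = new Map<string, number>()
    for (const ranking of rankings) {
        ranking.forEach((id, index) => {
            scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1))
        })
    }
    return [...scores.entries()]
        .map(([id, score]) => ({ id, score }))
        .sort((a, b) => b.score - a.score)
}

// Example: 'b' ranks second in keyword search and first in vector search,
// so it edges out 'a', which ranks first and third.
const fused = reciprocalRankFusion([
    ['a', 'b', 'c'], // keyword (BM25 / tsvector) order
    ['b', 'd', 'a'], // vector order
])
console.log(fused.map((r) => r.id)) // [ 'b', 'a', 'd', 'c' ]
```

Because RRF only looks at ranks, it sidesteps the problem that keyword scores and cosine similarities live on different scales.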

Operational reality

The thing benchmarks don't capture is the day you have to migrate a schema or recover from a bad deploy. pgvector lets you ALTER TABLE with the rest of your data, run point-in-time recovery, and reuse the team's existing Postgres knowledge - backups, monitoring, connection pooling, the works. The flip side: rebuilding an HNSW index on 100M rows is hours of degraded performance, and you cannot do it online without setup work.

Pinecone and Turbopuffer abstract all of this away. There's no index rebuild, no vacuum, no replication setup. You upsert and it works. The cost is that when something is slow, your debugging surface is "open a support ticket." For Pinecone that's usually fine. For Turbopuffer - still young, smaller team - your tolerance for that matters.

My honest verdict per scale tier

Under 1M chunks: Use pgvector. The latency is great, you almost certainly have Postgres already, and the operational overhead is what you're already paying. The "use pgvector" Twitter take is correct at this scale.

1M to 10M chunks: Turbopuffer is my default. The cost per query is the lowest, the warm-state latency is competitive with Pinecone, and the hybrid search is built in. The cold-start caveat matters only if your traffic pattern is bursty with long idle gaps - and in that case you can ping it once a minute from a cron.

10M to 100M+ chunks: Pinecone Serverless if you want to stop thinking about infrastructure entirely. Turbopuffer if you're cost-sensitive and your team is comfortable with a smaller vendor. pgvector at this scale only if you have a Postgres team that wants the job, and the rebuild story doesn't scare you.

The framework I actually use: pick the cheapest store that meets your p99 latency budget at your traffic pattern, then verify recall is above 0.95 on a labeled query set before shipping. Everything else is noise.
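
That framework fits in a few lines of code. The sketch below filters candidates by a p99 budget and a recall floor, then takes the cheapest survivor. The Candidate type and pickStore are hypothetical names, and the sample figures are the filtered p99, recall@10, and 1M-chunk cost numbers for Pinecone and Turbopuffer from the tables in this post; plug in your own measurements.

```typescript
interface Candidate {
    name: string
    monthlyCostUsd: number
    p99Ms: number
    recallAt10: number
}

// Cheapest store that meets both the latency budget and the recall floor,
// or undefined when nothing qualifies.
function pickStore(
    candidates: Candidate[],
    p99BudgetMs: number,
    minRecall = 0.95
): Candidate | undefined {
    return candidates
        .filter((c) => c.p99Ms <= p99BudgetMs && c.recallAt10 >= minRecall)
        .sort((a, b) => a.monthlyCostUsd - b.monthlyCostUsd)[0]
}

const at1M: Candidate[] = [
    { name: 'Pinecone Serverless', monthlyCostUsd: 45, p99Ms: 156, recallAt10: 0.97 },
    { name: 'Turbopuffer (warm)', monthlyCostUsd: 18, p99Ms: 142, recallAt10: 0.96 },
    { name: 'Turbopuffer (cold)', monthlyCostUsd: 18, p99Ms: 1480, recallAt10: 0.96 },
]

console.log(pickStore(at1M, 200)?.name) // Turbopuffer (warm)
```

Listing the cold-start row as its own candidate makes the traffic-pattern question explicit: if your p99 budget has to hold after long idle gaps, that is the row that applies.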

What's next

Once you've picked a store, the next thing that bites is retrieval quality. If your pipeline is still returning irrelevant chunks, the fix isn't a different database - it's a smarter retrieval loop. Walk through that next in Build an Agentic RAG Pipeline That Retries and Reformulates Queries.

Vadim Alakhverdov
