Run Real AI Features in the Browser with Transformers.js v4 and WebGPU

Wednesday 01/07/2026

11 min read

Every semantic-search feature you ship starts the same way: sign up for an embeddings API, provision a vector store, wire up a backend route, and then watch the bill climb every time a user types in the search box. For a feature that runs thousands of times a day over a user's own private notes, that's a lot of infrastructure and recurring cost for something the browser can now do by itself. And if the user is offline — on a plane, on a train, behind a corporate proxy — your feature is just dead.

This tutorial uses Transformers.js v4 and WebGPU to run AI features in the browser, in TypeScript. We'll build a client-side semantic search over a user's notes using local embeddings, add a small generative model for on-device summaries, and, importantly, measure it. You'll get real tokens/sec numbers, a WebGPU-with-WASM-fallback strategy, and an honest map of where on-device beats a server call and where it absolutely doesn't. If you read my earlier post on running models in the browser with WebLLM, this is the other half of the story: different library, very different tradeoffs.

Why v4 changes the calculus

Transformers.js v3 already ran ONNX models in the browser, but WebGPU was opt-in and rough. v4 makes WebGPU the default backend where the browser supports it, which matters because WebGPU now ships on by default in current Chrome, Edge, and Safari. The practical upshot: embedding and small-model inference that used to run on the CPU via WASM (slow, single-threaded-ish) now hits the GPU with a one-line device flag. Everything else — the pipeline() API, the model hub, quantization — stays familiar.

The package name is @huggingface/transformers (the old @xenova/transformers is dead — don't install it).

pnpm add @huggingface/transformers

What we're building

A notes app with two AI features, both 100% client-side:

Semantic search — embed every note once with a small embedding model, then rank notes against a query by cosine similarity. No vector DB, no API.
Summaries — a small instruction-tuned generative model produces a one-line summary of a selected note.

The stack:

feature-extraction pipeline with Xenova/all-MiniLM-L6-v2 (384-dim embeddings, ~23MB quantized) for search
text-generation pipeline with onnx-community/Qwen2.5-0.5B-Instruct for summaries
A Web Worker so model loading and inference never freeze the UI
WebGPU with an automatic WASM fallback
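To make "rank notes against a query by cosine similarity" concrete before any model is involved, here is a minimal sketch with hand-written toy vectors. The helper names (l2Normalize, dotProduct, topK) and the three-dimensional vectors are illustrative assumptions; real MiniLM embeddings have 384 dimensions. The point it shows: once vectors are L2-normalized, cosine similarity reduces to a plain dot product.

```typescript
// Toy stand-ins for real embeddings; all names here are illustrative.
function l2Normalize(v: number[]): Float32Array {
    const norm = Math.hypot(...v)
    return Float32Array.from(v, (x) => (norm === 0 ? 0 : x / norm))
}

// On unit-length vectors, the dot product equals cosine similarity.
function dotProduct(a: Float32Array, b: Float32Array): number {
    let sum = 0
    for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
    return sum
}

// Returns the indices of the k highest-scoring vectors, best first.
function topK(query: Float32Array, corpus: Float32Array[], k: number): number[] {
    return corpus
        .map((v, index) => ({ index, score: dotProduct(query, v) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k)
        .map((s) => s.index)
}

const corpus = [
    l2Normalize([1, 0, 0]), // note 0
    l2Normalize([0.9, 0.1, 0]), // note 1: points almost the same way as the query
    l2Normalize([0, 0, 1]), // note 2: orthogonal to the query
]
const query = l2Normalize([1, 0.2, 0])
const ranked = topK(query, corpus, 2)
console.log(ranked) // [1, 0]
```

Note 1 wins because its direction is closest to the query's, even though note 0 has the larger first component before normalization.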

Gotcha #1: run inference in a Web Worker, always

The single biggest mistake with Transformers.js is calling pipeline() on the main thread. Model loading pulls anywhere from tens of megabytes (the embedder) to hundreds (the generator) and compiles shaders; inference is a tight compute loop. Do that on the UI thread and your app locks up: scroll jank, frozen buttons, the works. Everything below lives in a worker.

// src/ai/worker.ts
import {
    pipeline,
    type FeatureExtractionPipeline,
    type TextGenerationPipeline,
} from '@huggingface/transformers'

type Device = 'webgpu' | 'wasm'

// Lazily-created singletons — load each model exactly once.
let embedder: FeatureExtractionPipeline | null = null
let generator: TextGenerationPipeline | null = null

function pickDevice(): Device {
    // navigator.gpu is present in the worker's global scope when WebGPU is available.
    return 'gpu' in navigator ? 'webgpu' : 'wasm'
}

async function getEmbedder(device: Device): Promise<FeatureExtractionPipeline> {
    if (embedder) return embedder
    embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
        device,
        // q8 keeps quality high on WebGPU; MiniLM is small enough not to need q4.
        dtype: 'q8',
        progress_callback: (p) => self.postMessage({ type: 'progress', model: 'embedder', payload: p }),
    })
    return embedder
}

async function getGenerator(device: Device): Promise<TextGenerationPipeline> {
    if (generator) return generator
    generator = await pipeline('text-generation', 'onnx-community/Qwen2.5-0.5B-Instruct', {
        device,
        // q4f16 = 4-bit weights, fp16 compute. Best size/speed on WebGPU.
        dtype: device === 'webgpu' ? 'q4f16' : 'q4',
        progress_callback: (p) => self.postMessage({ type: 'progress', model: 'generator', payload: p }),
    })
    return generator
}

dtype is where most of your quality-vs-speed decisions live. For embeddings, q8 is the sweet spot — the vectors stay close enough to the fp32 originals that ranking quality is indistinguishable, and the model is tiny anyway. For the generative model, q4f16 on WebGPU roughly halves memory versus q8 with barely perceptible quality loss on short summaries; on WASM, use plain q4 since fp16 compute isn't accelerated there.
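The "roughly halves memory" claim is easy to sanity-check with back-of-the-envelope arithmetic: weight storage is about parameters × bits per weight / 8. The sketch below assumes roughly 0.5 billion parameters (read off the "0.5B" in the model name) and ignores quantization scales, any tensors kept at higher precision, activations, and the KV cache, so real file sizes will differ.

```typescript
// Rough weight-storage estimate: parameters × bits per weight / 8.
// Deliberately ignores scales/zero-points, higher-precision layers, activations, and KV cache.
function estimateWeightBytes(paramCount: number, bitsPerWeight: number): number {
    return (paramCount * bitsPerWeight) / 8
}

// Assumption: ~0.5B parameters, taken from the "0.5B" in the model name.
const PARAMS = 0.5e9

const estimates = {
    fp32: estimateWeightBytes(PARAMS, 32),
    q8: estimateWeightBytes(PARAMS, 8),
    q4: estimateWeightBytes(PARAMS, 4),
}

const toMB = (bytes: number): number => Math.round(bytes / 1e6)
console.log(toMB(estimates.fp32), toMB(estimates.q8), toMB(estimates.q4)) // 2000 500 250
```

Treat these as lower bounds for the weights alone; the download you actually see depends on how the ONNX export was quantized.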

Now the worker's message handler. Two operations: embed a batch of texts, and generate a summary.

// src/ai/worker.ts (continued)
const device = pickDevice()

// Cosine similarity on normalized vectors is just a dot product.
function dot(a: Float32Array, b: Float32Array): number {
    let sum = 0
    for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
    return sum
}

type Request =
    | { type: 'embed'; id: number; texts: string[] }
    | { type: 'search'; id: number; query: string; vectors: Float32Array[] }
    | { type: 'summarize'; id: number; text: string }

self.onmessage = async (e: MessageEvent<Request>) => {
    const msg = e.data
    try {
        if (msg.type === 'embed') {
            const embedder = await getEmbedder(device)
            // pooling+normalize gives one L2-normalized vector per input.
            const out = await embedder(msg.texts, { pooling: 'mean', normalize: true })
            // out.tolist() -> number[][]; ship back as transferable Float32Arrays.
            const vectors = (out.tolist() as number[][]).map((v) => new Float32Array(v))
            self.postMessage(
                { type: 'result', id: msg.id, vectors },
                vectors.map((v) => v.buffer)
            )
        } else if (msg.type === 'search') {
            const embedder = await getEmbedder(device)
            const q = await embedder(msg.query, { pooling: 'mean', normalize: true })
            const qv = new Float32Array(q.tolist()[0] as number[])
            const scores = msg.vectors.map((v, index) => ({ index, score: dot(qv, v) }))
            scores.sort((a, b) => b.score - a.score)
            self.postMessage({ type: 'result', id: msg.id, scores })
        } else if (msg.type === 'summarize') {
            const generator = await getGenerator(device)
            const messages = [
                { role: 'system', content: 'Summarize the note in one short sentence. No preamble.' },
                { role: 'user', content: msg.text },
            ]
            const output = await generator(messages, { max_new_tokens: 48, do_sample: false })
            // Chat pipelines return the full message list; grab the assistant's reply.
            const last = (output[0].generated_text as { role: string; content: string }[]).at(-1)
            self.postMessage({ type: 'result', id: msg.id, summary: last?.content ?? '' })
        }
    } catch (err) {
        self.postMessage({ type: 'error', id: msg.id, message: (err as Error).message })
    }
}

A few things worth calling out. Embeddings come back from .tolist() as nested arrays; I convert them to Float32Array and transfer the buffers back to the main thread instead of copying — for a few thousand notes that's the difference between a snappy UI and a stutter. And notice the generative call passes a chat-style message array; the Qwen pipeline applies the chat template for you, so you don't hand-write <|im_start|> tokens.
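One optional variation on the transfer trick, not what the worker above does: instead of transferring one buffer per note, pack every vector into a single Float32Array and transfer that one buffer. The Packed type and the packVectors and rowAt helpers are hypothetical names for this sketch.

```typescript
// Hypothetical layout: all vectors stored back to back in one buffer.
type Packed = { data: Float32Array; dim: number; count: number }

function packVectors(vectors: Float32Array[]): Packed {
    const count = vectors.length
    const dim = count === 0 ? 0 : vectors[0].length
    const data = new Float32Array(count * dim)
    vectors.forEach((v, i) => {
        if (v.length !== dim) throw new Error(`vector ${i} has length ${v.length}, expected ${dim}`)
        data.set(v, i * dim)
    })
    return { data, dim, count }
}

// Zero-copy view of row i; subarray shares memory with the packed buffer.
function rowAt(p: Packed, i: number): Float32Array {
    return p.data.subarray(i * p.dim, (i + 1) * p.dim)
}

const packed = packVectors([Float32Array.of(1, 2), Float32Array.of(3, 4)])
const second = rowAt(packed, 1)
console.log(Array.from(second)) // [3, 4]

// In a worker, the whole corpus would then move with a single transferable:
//   self.postMessage({ type: 'result', id, ...packed }, [packed.data.buffer])
```

The trade-off is that the receiving side has to index by row rather than hold an array of independent vectors, so the dot-product loop needs an offset instead of a separate array per note.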

The main-thread client

Wrap the worker in a small promise-based client so the rest of your app never touches postMessage. Each request carries an incrementing id we match against the response.

// src/ai/client.ts
type Pending = { resolve: (v: unknown) => void; reject: (e: Error) => void }

export type LoadProgress = {
    model: 'embedder' | 'generator'
    file: string
    progress: number // 0-100
    status: string
}

export class BrowserAI {
    private worker: Worker
    private pending = new Map<number, Pending>()
    private seq = 0

    constructor(private onProgress?: (p: LoadProgress) => void) {
        this.worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })
        this.worker.onmessage = (e) => this.handle(e.data)
        this.worker.onerror = (e) => {
            // A worker-level error rejects everything in flight so callers aren't left hanging.
            for (const p of this.pending.values()) p.reject(new Error(e.message))
            this.pending.clear()
        }
    }

    private handle(msg: any) {
        if (msg.type === 'progress') {
            const p = msg.payload
            this.onProgress?.({
                model: msg.model,
                file: p.file ?? '',
                progress: Math.round(p.progress ?? 0),
                status: p.status ?? '',
            })
            return
        }
        const entry = this.pending.get(msg.id)
        if (!entry) return
        this.pending.delete(msg.id)
        if (msg.type === 'error') entry.reject(new Error(msg.message))
        else entry.resolve(msg)
    }

    private call<T>(payload: object): Promise<T> {
        const id = this.seq++
        return new Promise<T>((resolve, reject) => {
            this.pending.set(id, { resolve: resolve as (v: unknown) => void, reject })
            this.worker.postMessage({ id, ...payload })
        })
    }

    embed(texts: string[]): Promise<{ vectors: Float32Array[] }> {
        return this.call({ type: 'embed', texts })
    }

    search(query: string, vectors: Float32Array[]): Promise<{ scores: { index: number; score: number }[] }> {
        return this.call({ type: 'search', query, vectors })
    }

    summarize(text: string): Promise<{ summary: string }> {
        return this.call({ type: 'summarize', text })
    }
}
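A model load usually involves more than one file, so a per-file percentage can jump back to zero several times. Here is a small sketch that folds the LoadProgress events from the client above into one overall number. ProgressAggregator is a hypothetical helper, and it assumes each file reports its own 0-100 progress; it weights every file equally, which is crude when one file is far larger than the rest.

```typescript
// Same shape as the LoadProgress type exported by the client.
type LoadProgress = {
    model: 'embedder' | 'generator'
    file: string
    progress: number // 0-100
    status: string
}

// Hypothetical helper: tracks the latest progress per file and averages them.
class ProgressAggregator {
    private perFile = new Map<string, number>()

    update(p: LoadProgress): number {
        // Events without a file name carry no per-file progress, so skip them.
        if (p.file) {
            const clamped = Math.min(100, Math.max(0, p.progress))
            this.perFile.set(`${p.model}:${p.file}`, clamped)
        }
        return this.overall()
    }

    overall(): number {
        if (this.perFile.size === 0) return 0
        let sum = 0
        this.perFile.forEach((v) => (sum += v))
        return Math.round(sum / this.perFile.size)
    }
}

const agg = new ProgressAggregator()
agg.update({ model: 'embedder', file: 'tokenizer.json', progress: 100, status: 'progress' })
const overall = agg.update({ model: 'embedder', file: 'model.onnx', progress: 50, status: 'progress' })
console.log(overall) // 75
```

Because the average is taken over files seen so far, the number can dip when a new file starts downloading. If that bothers you, clamp the displayed value so it never decreases.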

Using it from a React component looks like any other async data flow. The one UX rule that matters: show the load progress. The first embed call triggers a ~23MB download and a warm-up; if you don't surface that, users think your app is broken.

// src/components/NoteSearch.tsx
import { useEffect, useRef, useState } from 'react'
import { BrowserAI, type LoadProgress } from '@/src/ai/client'

type Note = { id: string; text: string }

export function NoteSearch({ notes }: { notes: Note[] }) {
    const ai = useRef<BrowserAI | null>(null)
    const vectors = useRef<Float32Array[]>([])
    const [progress, setProgress] = useState<LoadProgress | null>(null)
    const [ready, setReady] = useState(false)
    const [results, setResults] = useState<Note[]>(notes)

    useEffect(() => {
        const client = new BrowserAI(setProgress)
        ai.current = client
        // Warm up + embed the corpus once on mount.
        client
            .embed(notes.map((n) => n.text))
            .then(({ vectors: v }) => {
                vectors.current = v
                setReady(true)
            })
            .catch((e) => console.error('embedding failed', e))
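        // Demo shortcut: this cleanup is a no-op, so the worker outlives the component and a
        // new one is created whenever `notes` changes. BrowserAI keeps its worker private; in
        // production, add a method that calls worker.terminate() and invoke it here.
        return () => {}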
    }, [notes])

    async function onSearch(query: string) {
        if (!ai.current || !ready || !query.trim()) return setResults(notes)
        const { scores } = await ai.current.search(query, vectors.current)
        setResults(scores.slice(0, 10).map((s) => notes[s.index]))
    }

    if (!ready) {
        return (
            <div>
                Loading model… {progress?.file} {progress?.progress ?? 0}%
            </div>
        )
    }

    return (
        <div>
            <input placeholder="Search your notes…" onChange={(e) => onSearch(e.target.value)} />
            <ul>
                {results.map((n) => (
                    <li key={n.id}>{n.text}</li>
                ))}
            </ul>
        </div>
    )
}
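One thing the component above doesn't guard against is out-of-order responses: it fires a search on every keystroke, and a slow early request can resolve after a fast later one and overwrite fresher results. A minimal sketch of a "latest call wins" wrapper follows. latestOnly is a hypothetical helper, and the fake search with timers only simulates the race.

```typescript
// Wraps an async function so that only the most recent call's result is delivered.
function latestOnly<A extends unknown[], R>(
    fn: (...args: A) => Promise<R>,
    onResult: (result: R) => void
): (...args: A) => Promise<void> {
    let latest = 0
    return async (...args: A) => {
        const ticket = ++latest
        const result = await fn(...args)
        // A newer call started while this one was in flight: drop the stale result.
        if (ticket !== latest) return
        onResult(result)
    }
}

// Simulated race: the first query is slow, the second is fast.
const delivered: string[] = []
const fakeSearch = (q: string, ms: number): Promise<string> =>
    new Promise((resolve) => setTimeout(() => resolve(q), ms))

const guarded = latestOnly(fakeSearch, (r) => {
    delivered.push(r)
})
const done = Promise.all([guarded('no', 30), guarded('notes', 5)])
done.then(() => console.log(delivered)) // ['notes']
```

In the component, the wrapped function would call ai.current.search(...) and the callback would call setResults(...). Pairing this with a short debounce on the input also cuts down how many embeds you run while someone is typing.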

Embed the corpus once and keep the vectors in memory (or persist them to IndexedDB so you skip re-embedding on the next visit). After that, every keystroke costs one query embedding plus a few thousand dot products; the ranking step itself is sub-millisecond, so most of the time goes to the query embedding.
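If you do persist vectors, you need a way to tell which notes changed since the last visit. Here is a sketch of that bookkeeping under stated assumptions: the Map stands in for whatever you load back from IndexedDB, planEmbedding and CacheEntry are hypothetical names, and the hash is an FNV-1a-style fingerprint over UTF-16 code units, chosen only for brevity.

```typescript
type Note = { id: string; text: string }
type CacheEntry = { hash: string; vector: Float32Array }

// FNV-1a-style 32-bit fingerprint. Cheap and non-cryptographic, so collisions are possible.
function fnv1a(text: string): string {
    let h = 0x811c9dc5
    for (let i = 0; i < text.length; i++) {
        h ^= text.charCodeAt(i)
        h = Math.imul(h, 0x01000193)
    }
    return (h >>> 0).toString(16)
}

// Splits notes into those whose cached vector is still valid and those that need embedding.
function planEmbedding(notes: Note[], cache: Map<string, CacheEntry>): { reuse: Note[]; embed: Note[] } {
    const reuse: Note[] = []
    const embed: Note[] = []
    for (const note of notes) {
        const hit = cache.get(note.id)
        if (hit && hit.hash === fnv1a(note.text)) reuse.push(note)
        else embed.push(note)
    }
    return { reuse, embed }
}

// Stand-in for entries loaded from persistent storage.
const cache = new Map<string, CacheEntry>([
    ['a', { hash: fnv1a('buy milk'), vector: new Float32Array(384) }],
    ['c', { hash: fnv1a('draft v1'), vector: new Float32Array(384) }],
])

const plan = planEmbedding(
    [
        { id: 'a', text: 'buy milk' }, // unchanged: reuse the cached vector
        { id: 'b', text: 'call mom' }, // not cached yet: embed
        { id: 'c', text: 'draft v2' }, // edited since caching: embed again
    ],
    cache
)
console.log(plan.reuse.map((n) => n.id), plan.embed.map((n) => n.id)) // ['a'] ['b', 'c']
```

Only the embed list goes to the worker; the rest come straight from the cache. For anything beyond a demo, a SHA-256 digest via crypto.subtle.digest is a safer fingerprint than a 32-bit hash.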

Gotcha #2: WebGPU detection is necessary but not sufficient

navigator.gpu existing means the API is present, not that a usable adapter exists. On a locked-down VM or an ancient GPU, requestAdapter() returns null and pipeline creation throws. Do a real capability check before committing to WebGPU, and fall back cleanly:

// src/ai/detect.ts
export async function hasUsableWebGPU(): Promise<boolean> {
    if (!('gpu' in navigator)) return false
    try {
        const adapter = await (navigator as any).gpu.requestAdapter()
        return adapter !== null
    } catch {
        return false
    }
}

In the worker, upgrade pickDevice() to await this, and if a WebGPU pipeline throws during creation, retry once with device: 'wasm'. In my benchmarks below, WASM is roughly 15–20× slower on the same desktop, but it works everywhere, which for a fallback is the whole point.
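The retry-once logic can be sketched as a small generic wrapper. createWithFallback is a hypothetical helper; it takes the pipeline factory as a parameter, so the example uses a fake factory that fails on WebGPU rather than a real pipeline() call.

```typescript
type Device = 'webgpu' | 'wasm'

// Tries the preferred device first; on failure, retries exactly once on WASM.
async function createWithFallback<T>(
    create: (device: Device) => Promise<T>,
    preferred: Device
): Promise<{ value: T; device: Device }> {
    try {
        return { value: await create(preferred), device: preferred }
    } catch (err) {
        // Nothing left to fall back to if WASM itself failed.
        if (preferred === 'wasm') throw err
        console.warn('WebGPU pipeline creation failed, retrying on WASM:', err)
        return { value: await create('wasm'), device: 'wasm' }
    }
}

// Fake factory standing in for a pipeline() call: WebGPU throws, WASM succeeds.
const attempts: Device[] = []
const fakeCreate = async (device: Device): Promise<string> => {
    attempts.push(device)
    if (device === 'webgpu') throw new Error('no adapter')
    return `pipeline on ${device}`
}

const created = createWithFallback(fakeCreate, 'webgpu')
created.then((r) => console.log(r.device, attempts)) // wasm ['webgpu', 'wasm']
```

In the worker, the factory would be something like (d) => pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', { device: d, dtype: 'q8' }), with the preferred device coming from hasUsableWebGPU(). For the generator, pick dtype inside the factory, since it depends on the device. Returning the device that actually worked lets later code choose settings to match.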

The benchmarks (and where on-device wins)

Numbers from my own machines, Qwen2.5-0.5B-Instruct at q4f16 for generation and MiniLM at q8 for embeddings. Treat them as ballpark — your GPU and thermal state will shift them — but the ratios hold:

| Device | Embedding (256 notes) | Generation | Model load (cold) |
| --- | --- | --- | --- |
| M2 MacBook Air, WebGPU | ~180ms | ~42 tok/sec | ~2.5s |
| Desktop RTX 3060, WebGPU | ~70ms | ~95 tok/sec | ~1.8s |
| Same desktop, WASM fallback | ~1.4s | ~6 tok/sec | ~1.2s |

Two takeaways. First, on the same desktop WebGPU beats WASM by more than an order of magnitude on generation (~95 vs ~6 tok/sec), so the fallback is a safety net, not a plan A. Second, embeddings are cheap enough on WebGPU that client-side semantic search is genuinely production-viable; a 42 tok/sec summary, by contrast, is fine for a "summarize this note" button but would feel sluggish as a streaming chat.
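If you want to collect numbers like these on your own hardware, a timing harness along these lines is enough. It is a sketch rather than the exact script behind the table: timed, tokensPerSecond, and median are hypothetical helpers, and the token count is something you supply yourself (for example by tokenizing the output, or by treating max_new_tokens as an upper bound).

```typescript
// Times any async operation with performance.now().
async function timed<T>(run: () => Promise<T>): Promise<{ result: T; ms: number }> {
    const start = performance.now()
    const result = await run()
    return { result, ms: performance.now() - start }
}

// Converts a token count and elapsed milliseconds into tokens per second.
function tokensPerSecond(tokenCount: number, elapsedMs: number): number {
    if (elapsedMs <= 0) throw new Error('elapsedMs must be positive')
    return (tokenCount * 1000) / elapsedMs
}

// Median is less sensitive than the mean to one slow run (GC pause, thermal throttling).
function median(samples: number[]): number {
    if (samples.length === 0) throw new Error('no samples')
    const s = [...samples].sort((a, b) => a - b)
    const mid = Math.floor(s.length / 2)
    return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2
}

const rate = tokensPerSecond(48, 1200)
console.log(rate) // 40
```

Run the operation once and throw that sample away before measuring, since the first call includes warm-up, then report the median of several runs.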

So: on-device wins when the work is embeddings-heavy (search, classification, dedup, clustering), when privacy is a hard requirement, when offline matters, or when per-call server cost would dominate at scale. It loses when you need a large model's reasoning quality, long outputs, or consistent latency on unknown low-end hardware. That last one is the real dividing line versus a server call — you control your server's GPU; you don't control your user's. For anything customer-facing where quality can't wobble, keep it server-side (see the real cost breakdown for when the math tips one way or the other).

Transformers.js vs WebLLM: which do you reach for?

They overlap but aren't interchangeable. WebLLM is built for chat — it runs larger quantized LLMs (Llama, Qwen 7B+) with an OpenAI-compatible streaming API, and it's what you want for a full in-browser assistant. Transformers.js is a general model runtime: embeddings, classification, ASR, object detection, and small generative models, all through one pipeline() API. If your feature is "semantic search over the user's data" or "auto-tag this," Transformers.js is the lighter, more flexible pick. If it's "chat with a capable model entirely offline," reach for WebLLM. I use Transformers.js for embeddings and utility models even in apps where WebLLM handles the chat.

What's next

We ran embeddings entirely on-device here, but the moment your corpus outgrows the browser's memory you'll want a server-backed vector store instead. My next post — Interactive MCP Tools: Elicitation and Task-Based Execution with the MCP TypeScript SDK v2 — goes the other direction, into server-side tools that can pause mid-call to ask the user for input. If you'd rather stay on the retrieval track, compare this client-side approach against the server version in How to Add AI Search with Embeddings and Supabase.


Vadim Alakhverdov
