Run AI Models Directly in the Browser with WebLLM and WebGPU
Wednesday 25/03/2026 · 11 min read

Your users don't want their private notes sent to a server. Your company doesn't want to pay per-token API costs for features that run thousands of times a day. And sometimes your users are on a plane with no internet. Client-side AI inference solves all three problems — and with WebLLM and WebGPU, it's finally practical enough to ship in a real app.
Running AI models directly in the browser using WebLLM and WebGPU means zero API costs, zero latency from network round-trips, and complete data privacy. The tradeoff is model size and hardware requirements, but for focused tasks like summarization and tagging, small models running locally are surprisingly capable.
## What we're building
A note-taking app where every note gets an AI-generated summary and auto-tags — all running in the browser. No API keys. No backend. The LLM loads once into GPU memory via WebGPU and handles all inference locally.
Here's the stack:
- WebLLM — loads and runs MLC-format quantized models in the browser via WebGPU
- React — for the UI
- IndexedDB (via `idb-keyval`) — for persisting notes locally

```bash
pnpm install @mlc-ai/web-llm idb-keyval
```
## How WebLLM and WebGPU work together

WebGPU is the successor to WebGL — a low-level GPU API that browsers are shipping right now. Chrome and Edge have supported it since version 113, and Firefox and Safari have shipped support more recently (with some platform gaps).
WebLLM sits on top of WebGPU. It compiles quantized LLMs into WebGPU shaders using Apache TVM, then runs the entire transformer inference loop on your GPU. No WASM fallback, no CPU inference — it's real GPU acceleration in the browser.
The key insight: you download the model once (cached in the browser), and then every inference call is a local GPU operation. For a 1.5B parameter model like Qwen2.5-1.5B-Instruct, the download is about 1.5GB and inference runs at 30-60 tokens/second on a decent GPU.
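That 1.5GB figure follows from simple quantization arithmetic. A back-of-envelope sketch; the function name and the overhead factor are mine, not part of WebLLM:

```typescript
// Rough download size for a quantized model: raw weight bytes
// times an overhead factor. The overhead (fp16 scales, any
// unquantized layers, tokenizer and metadata) is a loose
// assumption, not a measured constant.
function estimateDownloadGB(
  params: number,
  bitsPerWeight: number,
  overhead = 1.0
): number {
  return ((params * bitsPerWeight) / 8 / 1e9) * overhead
}

// 1.5e9 params at 4 bits is 0.75GB of raw weights; the fp16
// extras roughly double it, which is how a "4-bit" 1.5B model
// ends up near a 1.5GB download.
```

The doubling is specific to this model family; check the actual artifact sizes in WebLLM's model list before relying on an estimate.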
## Setting up the WebLLM engine
The engine is the core abstraction. You create it once, load a model, and then call it like you'd call any chat API.
```ts
// src/lib/llm-engine.ts
import { CreateMLCEngine, MLCEngine } from '@mlc-ai/web-llm'

let engine: MLCEngine | null = null

export type LoadProgress = {
  progress: number
  text: string
}

export async function getEngine(
  onProgress?: (progress: LoadProgress) => void
): Promise<MLCEngine> {
  if (engine) return engine
  engine = await CreateMLCEngine('Qwen2.5-1.5B-Instruct-q4f16_1-MLC', {
    initProgressCallback: (report) => {
      onProgress?.({
        progress: report.progress,
        text: report.text,
      })
    },
  })
  return engine
}
```
A few things to note:
- `CreateMLCEngine` downloads the model weights on first call, then caches them in the browser's Cache API. Subsequent page loads skip the download entirely.
- The model ID `Qwen2.5-1.5B-Instruct-q4f16_1-MLC` is a 4-bit quantized version of Qwen 2.5 1.5B. Small enough for browser delivery, capable enough for summarization and classification.
- The progress callback is crucial for UX — you need to show users what's happening during the initial 1.5GB download.
Why Qwen 2.5 1.5B? I tested several models. Llama 3.2 1B is smaller but struggles with structured output. Phi-3.5-mini is good but 3.6GB is too heavy for a first load. Qwen 2.5 1.5B hits the sweet spot: reliable instruction following at a reasonable size. You can swap models later — the API is the same.
## Building the summarize and tag functions
Now let's build the two AI functions our app needs. Both use the same engine instance.
```ts
// src/lib/note-ai.ts
import { getEngine } from './llm-engine'

export async function summarizeNote(content: string): Promise<string> {
  const engine = await getEngine()
  const response = await engine.chat.completions.create({
    messages: [
      {
        role: 'system',
        content:
          'You are a note summarizer. Given a note, write a 1-2 sentence summary. Be concise and capture the key point. Return only the summary, nothing else.',
      },
      {
        role: 'user',
        content: content,
      },
    ],
    max_tokens: 100,
    temperature: 0.3,
  })
  return response.choices[0]?.message?.content?.trim() ?? ''
}

export async function tagNote(content: string): Promise<string[]> {
  const engine = await getEngine()
  const response = await engine.chat.completions.create({
    messages: [
      {
        role: 'system',
        content:
          'You are a note tagger. Given a note, return 1-3 relevant tags as a JSON array of lowercase strings. Example: ["meeting", "project-alpha", "deadline"]. Return only the JSON array, nothing else.',
      },
      {
        role: 'user',
        content: content,
      },
    ],
    max_tokens: 50,
    temperature: 0.1,
  })
  const raw = response.choices[0]?.message?.content?.trim() ?? '[]'
  try {
    const parsed: unknown = JSON.parse(raw)
    if (Array.isArray(parsed) && parsed.every((t) => typeof t === 'string')) {
      return parsed as string[]
    }
    return []
  } catch {
    // Small models occasionally output malformed JSON
    // Fall back to extracting quoted strings
    const matches = raw.match(/"([^"]+)"/g)
    return matches ? matches.map((m) => m.replace(/"/g, '')) : []
  }
}
```
The `temperature: 0.1` for tagging keeps output close to deterministic. For summarization, 0.3 gives slightly more natural language without going off the rails.
Notice the JSON parsing fallback in `tagNote`. Small quantized models occasionally produce slightly malformed JSON — an extra comma, a missing bracket. The regex fallback catches most of these cases. In production you'd want a more robust parser, but for tags this is sufficient.
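The fallback is easy to unit-test once it's pulled out as a pure function. A sketch of the same logic, independent of the engine (the parseTags name is mine):

```typescript
// Parse a model's tag output: strict JSON first, then fall back
// to extracting quoted strings when the JSON is malformed.
function parseTags(raw: string): string[] {
  try {
    const parsed: unknown = JSON.parse(raw)
    if (Array.isArray(parsed) && parsed.every((t) => typeof t === 'string')) {
      return parsed as string[]
    }
    return []
  } catch {
    // Trailing commas, missing brackets, etc. land here; pull
    // out anything that looks like a quoted string.
    const matches = raw.match(/"([^"]+)"/g)
    return matches ? matches.map((m) => m.replace(/"/g, '')) : []
  }
}
```

A trailing comma like `'["a", "b",]'` fails `JSON.parse` but still yields `['a', 'b']` through the regex path.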
## The model loading UX
This is where most browser AI demos fall flat. They show a blank page while 1.5GB downloads. Let's build a proper loading screen.
```tsx
// src/components/ModelLoader.tsx
import { useState, useEffect, useCallback } from 'react'
import { getEngine, LoadProgress } from '../lib/llm-engine'

type ModelLoaderProps = {
  onReady: () => void
}

export function ModelLoader({ onReady }: ModelLoaderProps) {
  const [progress, setProgress] = useState<LoadProgress>({
    progress: 0,
    text: 'Preparing to load model...',
  })
  const [error, setError] = useState<string | null>(null)

  const loadModel = useCallback(async () => {
    try {
      await getEngine(setProgress)
      onReady()
    } catch (err) {
      if (err instanceof Error && err.message.includes('WebGPU')) {
        setError(
          'Your browser does not support WebGPU. Please use Chrome 113+ or Edge 113+.'
        )
      } else {
        setError(
          `Failed to load AI model: ${err instanceof Error ? err.message : 'Unknown error'}`
        )
      }
    }
  }, [onReady])

  useEffect(() => {
    loadModel()
  }, [loadModel])

  if (error) {
    return (
      <div className="flex flex-col items-center justify-center min-h-screen p-8">
        <div className="bg-red-50 border border-red-200 rounded-lg p-6 max-w-md">
          <h2 className="text-red-800 font-semibold mb-2">AI Not Available</h2>
          <p className="text-red-600 text-sm">{error}</p>
          <p className="text-gray-500 text-sm mt-3">
            The app will still work — you just won't get AI summaries and tags.
          </p>
        </div>
      </div>
    )
  }

  const percent = Math.round(progress.progress * 100)
  return (
    <div className="flex flex-col items-center justify-center min-h-screen p-8">
      <div className="max-w-md w-full">
        <h2 className="text-lg font-semibold mb-4">Loading AI Model</h2>
        <div className="w-full bg-gray-200 rounded-full h-3 mb-3">
          <div
            className="bg-blue-600 h-3 rounded-full transition-all duration-300"
            style={{ width: `${percent}%` }}
          />
        </div>
        <p className="text-sm text-gray-500">{progress.text}</p>
        <p className="text-xs text-gray-400 mt-2">
          {percent < 100
            ? 'First load downloads ~1.5GB. Cached after that.'
            : 'Compiling model shaders...'}
        </p>
      </div>
    </div>
  )
}
```
Key UX decisions:
- Tell users the download size upfront. "First load downloads ~1.5GB" prevents users from thinking the app is broken.
- Graceful fallback for no WebGPU. Don't crash — tell them the app works without AI features.
- After first load, it's fast. The cached model loads in 2-5 seconds on subsequent visits. The shader compilation step takes another second or two.
## Wiring it all together
Here's the main app component that ties notes, persistence, and AI together.
```tsx
// src/App.tsx
import { useState, useCallback } from 'react'
import { get, set } from 'idb-keyval'
import { ModelLoader } from './components/ModelLoader'
import { summarizeNote, tagNote } from './lib/note-ai'

type Note = {
  id: string
  content: string
  summary?: string
  tags?: string[]
  createdAt: number
}

const NOTES_KEY = 'browser-ai-notes'

async function loadNotes(): Promise<Note[]> {
  return (await get<Note[]>(NOTES_KEY)) ?? []
}

async function saveNotes(notes: Note[]): Promise<void> {
  await set(NOTES_KEY, notes)
}

export default function App() {
  const [ready, setReady] = useState(false)
  const [notes, setNotes] = useState<Note[]>([])
  const [draft, setDraft] = useState('')
  const [processing, setProcessing] = useState(false)

  const handleReady = useCallback(async () => {
    const saved = await loadNotes()
    setNotes(saved)
    setReady(true)
  }, [])

  if (!ready) {
    return <ModelLoader onReady={handleReady} />
  }

  const handleSave = async () => {
    if (!draft.trim() || processing) return
    setProcessing(true)
    try {
      const newNote: Note = {
        id: crypto.randomUUID(),
        content: draft.trim(),
        createdAt: Date.now(),
      }
      // Run summarization and tagging in parallel
      const [summary, tags] = await Promise.all([
        summarizeNote(newNote.content),
        tagNote(newNote.content),
      ])
      newNote.summary = summary
      newNote.tags = tags
      const updated = [newNote, ...notes]
      setNotes(updated)
      await saveNotes(updated)
      setDraft('')
    } finally {
      // Always clear the busy flag, even if inference throws,
      // so the Save button doesn't stay disabled forever
      setProcessing(false)
    }
  }

  return (
    <div className="max-w-2xl mx-auto p-6">
      <h1 className="text-2xl font-bold mb-6">Private Notes</h1>
      <div className="mb-8">
        <textarea
          value={draft}
          onChange={(e) => setDraft(e.target.value)}
          placeholder="Write a note..."
          className="w-full h-32 p-3 border rounded-lg resize-none"
          disabled={processing}
        />
        <button
          onClick={handleSave}
          disabled={!draft.trim() || processing}
          className="mt-2 px-4 py-2 bg-blue-600 text-white rounded-lg disabled:opacity-50"
        >
          {processing ? 'AI is thinking...' : 'Save Note'}
        </button>
      </div>
      {notes.map((note) => (
        <div key={note.id} className="border rounded-lg p-4 mb-4">
          <p className="whitespace-pre-wrap">{note.content}</p>
          {note.summary && (
            <p className="mt-2 text-sm text-gray-500 italic">{note.summary}</p>
          )}
          {note.tags && note.tags.length > 0 && (
            <div className="mt-2 flex gap-1">
              {note.tags.map((tag) => (
                <span
                  key={tag}
                  className="px-2 py-0.5 bg-gray-100 text-gray-600 rounded-full text-xs"
                >
                  {tag}
                </span>
              ))}
            </div>
          )}
        </div>
      ))}
    </div>
  )
}
```
Notice `Promise.all` on the summarization and tagging calls. Even though they're both hitting the same local engine, WebLLM handles queuing internally. In practice they run sequentially on the GPU, but the parallel call keeps the code cleaner than chaining.
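Conceptually, that internal queuing behaves like a promise chain where each job waits for the previous one to settle. A minimal sketch of the pattern, as an illustration rather than WebLLM's actual implementation:

```typescript
// A minimal async queue: each enqueued job starts only after the
// previous one settles, so concurrent callers run in order, the
// way requests against a single GPU end up serialized.
function createQueue() {
  let tail: Promise<unknown> = Promise.resolve()
  return function enqueue<T>(job: () => Promise<T>): Promise<T> {
    // Run the job whether the previous one resolved or rejected.
    const run = tail.then(job, job)
    // Swallow errors on the chain so one failure doesn't wedge it;
    // callers still see the rejection via the returned promise.
    tail = run.then(
      () => undefined,
      () => undefined
    )
    return run
  }
}
```

With a queue like this, two `enqueue` calls made back-to-back resolve in submission order even though both were started "in parallel" from the caller's point of view.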
## Performance benchmarks
I tested on three machines to give you realistic expectations:
| Device | GPU | Model Load (cached) | Summarize (100-word note) | Tags |
|--------|-----|---------------------|---------------------------|------|
| MacBook Pro M3 | Integrated | 3.2s | 1.8s (~45 tok/s) | 0.6s |
| Desktop, RTX 3060 | Discrete | 2.1s | 0.9s (~62 tok/s) | 0.3s |
| ThinkPad, Intel Iris Xe | Integrated | 5.8s | 4.2s (~15 tok/s) | 1.4s |
The ThinkPad is the worst case you'll see on WebGPU-compatible hardware. Still under 5 seconds for a summary — acceptable for a background operation, too slow for a real-time typing assistant.
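Those summarize timings are consistent with simple arithmetic: generation time is roughly output tokens divided by decode speed, with prefill adding a little on top. A quick sanity check (the ~60-token summary length is my estimate, not a measured value):

```typescript
// Back-of-envelope decode time: output tokens / decode speed.
// Ignores prefill, which adds a little extra on long inputs.
function decodeSeconds(outputTokens: number, tokensPerSecond: number): number {
  return outputTokens / tokensPerSecond
}
```

A two-sentence summary is on the order of 60 tokens; at the ThinkPad's ~15 tok/s that predicts roughly 4 seconds, which is what the table shows.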
Gotcha: first-run shader compilation. The very first inference after loading the model takes 2-3x longer because WebGPU compiles the shaders on first use. After that, they're cached. Your loading screen should account for this — or run a throwaway inference during load.
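One way to hide that cost is a throwaway inference while the loading screen is still up. A sketch against the chat-completions shape used earlier; the ChatEngine type is a minimal stand-in for the real MLCEngine, and the one-token prompt is arbitrary:

```typescript
// Minimal shape of the engine API this sketch relies on.
type ChatEngine = {
  chat: {
    completions: {
      create(req: {
        messages: { role: string; content: string }[]
        max_tokens: number
      }): Promise<unknown>
    }
  }
}

// Fire one tiny inference so WebGPU compiles and caches its
// shaders before the user's first real request.
async function warmUp(engine: ChatEngine): Promise<void> {
  await engine.chat.completions.create({
    messages: [{ role: 'user', content: 'hi' }],
    max_tokens: 1,
  })
}
```

Call this at the end of the loading flow, while the progress screen still says "Compiling model shaders...", so the user's first save feels fast.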
## Handling the WebGPU compatibility gap
Not every browser supports WebGPU yet. Here's how to handle it without breaking the app.
```ts
// src/lib/check-webgpu.ts
export async function checkWebGPUSupport(): Promise<{
  supported: boolean
  reason?: string
}> {
  if (!navigator.gpu) {
    return {
      supported: false,
      reason: 'WebGPU is not available in this browser. Use Chrome 113+ or Edge 113+.',
    }
  }
  try {
    const adapter = await navigator.gpu.requestAdapter()
    if (!adapter) {
      return {
        supported: false,
        reason: 'No compatible GPU adapter found. Your GPU may not support WebGPU.',
      }
    }
    return { supported: true }
  } catch {
    return {
      supported: false,
      reason: 'Failed to initialize WebGPU.',
    }
  }
}
```
Use this check before attempting to load the model. If WebGPU isn't available, show the notes app without AI features. The app should always work — AI is an enhancement, not a requirement.
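The boot decision itself can stay a pure function, which keeps it testable without a browser. A sketch (the 'ai'/'basic' mode names and the pickAppMode helper are mine):

```typescript
type SupportCheck = { supported: boolean; reason?: string }
type AppMode = 'ai' | 'basic'

// Turn the WebGPU check result into a boot decision. 'basic'
// still gets notes and persistence, just no summaries or tags.
function pickAppMode(check: SupportCheck): { mode: AppMode; notice?: string } {
  if (check.supported) return { mode: 'ai' }
  return {
    mode: 'basic',
    notice: check.reason ?? 'AI features are unavailable in this browser.',
  }
}
```

The caller awaits `checkWebGPUSupport()`, passes the result in, and renders the loading screen only in 'ai' mode.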
As of early 2026, WebGPU support looks roughly like this:
- Chrome 113+ — full support
- Edge 113+ — full support (Chromium-based)
- Firefox — enabled by default on Windows since Firefox 141; other platforms still rolling out
- Safari — shipping as of Safari 26
That covers the large majority of desktop browser traffic. On mobile, support is much more limited — Android Chrome ships WebGPU, but performance is inconsistent across devices. I'd recommend treating browser AI as a desktop-first feature for now.
## What you should know before shipping this
Model size matters more than you think. A 1.5GB download is fine for a power-user tool. It's not fine for a landing page demo. If you need something lighter, `Qwen2.5-0.5B-Instruct` is about 500MB quantized and still handles basic classification.
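If the model becomes a config decision, a small size table makes the tradeoff explicit. The sizes below are the article's ballpark figures, and both IDs should be verified against WebLLM's current prebuilt model list before shipping:

```typescript
// Approximate quantized download sizes (GB) per MLC model ID.
// Ballpark figures; confirm against WebLLM's model list.
const MODEL_SIZES_GB: Record<string, number> = {
  'Qwen2.5-0.5B-Instruct-q4f16_1-MLC': 0.5,
  'Qwen2.5-1.5B-Instruct-q4f16_1-MLC': 1.5,
}

// Pick the largest model that fits a download budget, or null
// if nothing fits (fall back to the no-AI mode in that case).
function chooseModel(maxGB: number): string | null {
  const fits = Object.entries(MODEL_SIZES_GB)
    .filter(([, gb]) => gb <= maxGB)
    .sort((a, b) => b[1] - a[1])
  return fits.length > 0 ? fits[0][0] : null
}
```

Because the engine API is the same across models, swapping the ID passed to `CreateMLCEngine` is the only change the rest of the app sees.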
Memory usage is real. The 1.5B model uses about 1.2GB of GPU memory. On devices with shared memory (most laptops), this comes from the same pool as everything else. Users with 8GB of RAM and 30 browser tabs open will have a bad time.
Streaming works, but it buys you less than it does against a cloud API. WebLLM supports streaming via `engine.chat.completions.create` with `stream: true`, and it returns an async iterable just like the cloud API. The perceived speed difference is less dramatic, though, since there's no network latency to mask.
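Consuming that async iterable looks just like the cloud pattern. A sketch with the consumer decoupled from WebLLM so it can be tested with a fake stream; the chunk type mirrors the OpenAI-style delta shape WebLLM follows, and the collectStream name is mine:

```typescript
// OpenAI-style streaming chunk: each chunk carries a text delta.
type StreamChunk = { choices: { delta: { content?: string } }[] }

// Collect a token stream into the full response, invoking
// onToken per delta so the UI can render text as it arrives.
async function collectStream(
  stream: AsyncIterable<StreamChunk>,
  onToken?: (t: string) => void
): Promise<string> {
  let full = ''
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? ''
    full += delta
    if (delta) onToken?.(delta)
  }
  return full
}
```

In the app you'd pass the result of `engine.chat.completions.create({ ..., stream: true })` straight in and append each token to the note UI.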
Don't ignore the cold start. The 2-5 second model load on cached visits is noticeable. Consider loading the model in a Web Worker so it doesn't block the main thread, or trigger loading proactively when the user navigates to a page that will need AI.
## What's next
If running models in the browser feels limiting, the next post covers building an AI-powered autocomplete for any text input — a server-side approach where you stream inline suggestions as ghost text. It's the other end of the spectrum: low latency, high capability, and works in every browser.