Build a Voice-Enabled AI Assistant in the Browser with TypeScript
Monday 11/05/2026 · 12 min read

You want a voice AI assistant in the browser. You wire up SpeechRecognition, pipe transcripts to an LLM, pipe responses to speechSynthesis.speak(), and ship. Then real users hit it. The assistant talks over itself. Safari refuses to recognize anything until the page is reloaded. The user interrupts mid-answer and the bot keeps droning on for another 12 seconds. On mobile Chrome the mic cuts off every 3 seconds. The "wake word" fires every time someone says "hey" in a podcast playing in the background.
A browser-based voice AI assistant in TypeScript isn't hard to start. It's hard to make it feel like a real conversation. This post walks through the parts the demos skip: barge-in (interruption), wake word gating, restart loops, mobile autoplay rules, and how to fall back gracefully when the Web Speech API isn't available. Full TypeScript, full React, no native SDKs.
What we're building
A React component that:
- Listens continuously for a wake phrase ("hey claude") using the browser's SpeechRecognition.
- After wake, records the user's utterance until they stop talking.
- Sends the transcript to an LLM endpoint and streams the response.
- Speaks the response back using speechSynthesis (with an ElevenLabs fallback path for higher quality).
- Stops speaking the instant the user starts talking again (barge-in).
- Handles Chrome's auto-restart bug, Safari's permission quirks, and mobile autoplay restrictions.
Architecture:
[mic] → SpeechRecognition → wake gate → utterance buffer
↓
/api/chat (SSE stream)
↓
speechSynthesis ← barge-in monitor
No backend dependency for speech — it all runs client-side. The only server call is the LLM stream.
The Web Speech API and why it's quirky
The Web Speech API gives you two pieces: SpeechRecognition (speech-to-text) and SpeechSynthesis (text-to-speech). Both are widely supported but inconsistent in important ways.
SpeechRecognition is prefixed webkitSpeechRecognition in Chrome and Edge. Firefox has no support without a flag. Safari supports it but does the recognition on Apple servers (you'll see a notification on macOS). Mobile Chrome on Android works but auto-stops after ~3 seconds of silence and needs to be restarted manually.
SpeechSynthesis is universal but the available voices vary per device. iOS gives you good voices for free; Android voices sound robotic. voiceschanged fires asynchronously, so reading getVoices() synchronously on page load returns an empty array half the time.
Both APIs are designed around one-shot interactions, not continuous conversation. That's where most tutorials stop. We're going past that.
Step 1: A safe speech recognition wrapper
Start with a typed wrapper that papers over the prefix mess and handles the auto-restart pattern.
pnpm add nanoid
pnpm add -D @types/dom-speech-recognition
// src/lib/speech-recognition.ts
type RecognitionEvent =
| { type: 'partial'; transcript: string }
| { type: 'final'; transcript: string }
| { type: 'error'; error: string }
| { type: 'end' }
type Listener = (event: RecognitionEvent) => void
export class ContinuousRecognizer {
private recognition: SpeechRecognition | null = null
private listener: Listener
private shouldRun = false
private restartTimer: number | null = null
constructor(listener: Listener) {
this.listener = listener
}
static isSupported(): boolean {
return (
typeof window !== 'undefined' &&
('SpeechRecognition' in window || 'webkitSpeechRecognition' in window)
)
}
start() {
if (!ContinuousRecognizer.isSupported()) {
this.listener({ type: 'error', error: 'unsupported' })
return
}
this.shouldRun = true
this.spawn()
}
stop() {
this.shouldRun = false
if (this.restartTimer) window.clearTimeout(this.restartTimer)
this.recognition?.stop()
this.recognition = null
}
private spawn() {
const Ctor =
(window as any).SpeechRecognition ||
(window as any).webkitSpeechRecognition
const rec: SpeechRecognition = new Ctor()
rec.continuous = true
rec.interimResults = true
rec.lang = 'en-US'
rec.onresult = (event) => {
for (let i = event.resultIndex; i < event.results.length; i++) {
const result = event.results[i]
const transcript = result[0].transcript.trim()
if (!transcript) continue
this.listener({
type: result.isFinal ? 'final' : 'partial',
transcript,
})
}
}
rec.onerror = (event) => {
// 'no-speech' and 'aborted' are normal — don't surface them
if (event.error === 'no-speech' || event.error === 'aborted') return
this.listener({ type: 'error', error: event.error })
}
rec.onend = () => {
this.listener({ type: 'end' })
if (this.shouldRun) {
// Chrome stops recognition after ~60s even with continuous=true.
// Restart it after a short delay to avoid a tight loop on errors.
this.restartTimer = window.setTimeout(() => this.spawn(), 250)
}
}
try {
rec.start()
this.recognition = rec
} catch (err) {
// start() throws if called too soon after stop()
this.restartTimer = window.setTimeout(() => this.spawn(), 500)
}
}
}
Two gotchas baked in here: the auto-restart loop (Chrome stops continuous recognition after about 60 seconds) and the try/catch around start(), which throws an InvalidStateError if you call it before the previous recognition fully ended. The 250ms delay is empirical — anything shorter fails on slower devices.
Step 2: Wake word gating without paying for Porcupine
Real wake-word detection uses on-device neural models (Porcupine, Picovoice). For a hobby app or internal tool, you can gate on a phrase in the live transcript instead. It's not robust against noise but it's free.
// src/lib/wake-word.ts
const WAKE_PHRASES = ['hey claude', 'hi claude', 'okay claude']
const COOLDOWN_MS = 2000
export class WakeWordDetector {
private lastTrigger = 0
matches(transcript: string): boolean {
const now = Date.now()
if (now - this.lastTrigger < COOLDOWN_MS) return false
const lower = transcript.toLowerCase()
const hit = WAKE_PHRASES.some((phrase) => lower.includes(phrase))
if (hit) this.lastTrigger = now
return hit
}
stripWakeWord(transcript: string): string {
let result = transcript
for (const phrase of WAKE_PHRASES) {
const idx = result.toLowerCase().indexOf(phrase)
if (idx >= 0) {
result = result.slice(idx + phrase.length).trim()
break
}
}
return result
}
}
The cooldown is the part most demos miss. Without it, every partial event fires the wake word again, because the phrase stays in the transcript buffer for a second or two.
If you want real on-device wake-word detection later, swap this class for a Porcupine Web wrapper with the same matches/stripWakeWord interface. Nothing else has to change.
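To make that swap painless, you can pin down the interface now. A minimal sketch, assuming a hypothetical WakeDetector interface and PorcupineWakeDetector class — the names are illustrative, not part of the Porcupine SDK:
// src/lib/wake-detector.ts — illustrative interface, not from any SDK
export interface WakeDetector {
  matches(transcript: string): boolean
  stripWakeWord(transcript: string): string
}

// A Porcupine-backed detector would listen to raw audio itself and set a flag
// from its keyword callback instead of scanning the transcript.
export class PorcupineWakeDetector implements WakeDetector {
  private triggered = false

  // Wire this up to the Porcupine Web keyword-detection callback (not shown).
  notifyKeyword() {
    this.triggered = true
  }

  matches(_transcript: string): boolean {
    const hit = this.triggered
    this.triggered = false
    return hit
  }

  stripWakeWord(transcript: string): string {
    // The wake phrase never appears in SpeechRecognition's transcript here.
    return transcript
  }
}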
Step 3: Detecting end of utterance
After the wake word fires, you need to know when the user has finished talking. The naive approach is to wait for a final event from SpeechRecognition, but on mobile that can take 5+ seconds. A better signal is a silence timer that resets every time a new word arrives.
// src/lib/utterance-buffer.ts
export class UtteranceBuffer {
private buffer = ''
private silenceTimer: number | null = null
private onComplete: (transcript: string) => void
private silenceMs: number
constructor(onComplete: (transcript: string) => void, silenceMs = 1200) {
this.onComplete = onComplete
this.silenceMs = silenceMs
}
append(text: string) {
this.buffer = this.buffer ? `${this.buffer} ${text}` : text
this.resetSilenceTimer()
}
flush() {
if (this.silenceTimer) window.clearTimeout(this.silenceTimer)
const transcript = this.buffer.trim()
this.buffer = ''
this.silenceTimer = null
if (transcript) this.onComplete(transcript)
}
cancel() {
if (this.silenceTimer) window.clearTimeout(this.silenceTimer)
this.buffer = ''
this.silenceTimer = null
}
private resetSilenceTimer() {
if (this.silenceTimer) window.clearTimeout(this.silenceTimer)
this.silenceTimer = window.setTimeout(() => this.flush(), this.silenceMs)
}
}
1200ms is a reasonable default. Shorter and you cut people off mid-sentence; longer and the assistant feels sluggish.
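Standalone, the buffer hangs off the recognizer's final results. A wiring sketch of the same pattern the Step 6 component uses:
import { ContinuousRecognizer } from './speech-recognition'
import { UtteranceBuffer } from './utterance-buffer'

// The silence timer decides when the utterance is done; only final results
// are appended so interim guesses don't pollute the transcript.
const buffer = new UtteranceBuffer((utterance) => {
  console.log('utterance complete:', utterance)
}, 1200)

const recognizer = new ContinuousRecognizer((event) => {
  if (event.type === 'final') buffer.append(event.transcript)
})
recognizer.start()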
Step 4: Speech synthesis with barge-in
The killer feature for a voice assistant is being interruptible. If the user starts talking while the bot is speaking, the bot needs to shut up immediately. This is the difference between a toy demo and something you'd actually use.
// src/lib/speech-output.ts
export class SpeechOutput {
private utterance: SpeechSynthesisUtterance | null = null
private voicesReady: Promise<void>
constructor() {
// voiceschanged fires async on some platforms
this.voicesReady = new Promise((resolve) => {
if (speechSynthesis.getVoices().length > 0) return resolve()
speechSynthesis.onvoiceschanged = () => resolve()
})
}
async speak(text: string, onEnd?: () => void) {
await this.voicesReady
this.stop()
const utterance = new SpeechSynthesisUtterance(text)
utterance.rate = 1.05
utterance.pitch = 1.0
utterance.volume = 1.0
const preferred = speechSynthesis
.getVoices()
.find((v) => v.lang.startsWith('en') && v.localService)
if (preferred) utterance.voice = preferred
utterance.onend = () => {
this.utterance = null
onEnd?.()
}
utterance.onerror = () => {
this.utterance = null
onEnd?.()
}
this.utterance = utterance
speechSynthesis.speak(utterance)
}
stop() {
if (this.utterance) {
this.utterance.onend = null
this.utterance.onerror = null
this.utterance = null
}
speechSynthesis.cancel()
}
isSpeaking(): boolean {
return speechSynthesis.speaking
}
}
Note the manual onend = null before calling cancel(). Without it, Chrome fires the onend handler after cancellation, which can trigger the next state transition in your component and lead to surprising loops.
If you want higher-quality voices, swap speechSynthesis.speak() for an ElevenLabs streaming call:
// src/lib/elevenlabs-output.ts
export async function speakWithElevenLabs(text: string, signal: AbortSignal) {
const response = await fetch('/api/tts', {
method: 'POST',
body: JSON.stringify({ text }),
signal,
})
if (!response.ok || !response.body) throw new Error('TTS failed')
const audio = new Audio()
const mediaSource = new MediaSource()
// Attach the sourceopen handler before setting src and calling play():
// play() won't resolve until data is buffered, and that data is only
// appended inside this handler.
mediaSource.addEventListener('sourceopen', async () => {
const sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg')
const reader = response.body!.getReader()
try {
while (true) {
const { done, value } = await reader.read()
if (done) break
await new Promise<void>((resolve) => {
sourceBuffer.addEventListener('updateend', () => resolve(), { once: true })
sourceBuffer.appendBuffer(value)
})
}
if (mediaSource.readyState === 'open') mediaSource.endOfStream()
} catch {
// reader.read() rejects when the fetch is aborted by barge-in — nothing to do
}
})
audio.src = URL.createObjectURL(mediaSource)
await audio.play()
return () => {
audio.pause()
audio.src = ''
}
}
The signal lets you abort the stream when barge-in fires. The server route is a thin proxy to ElevenLabs' streaming endpoint — straightforward.
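For reference, here's a minimal sketch of that proxy as a Next.js App Router route handler. The stack is assumed, and the ElevenLabs URL shape, the ELEVENLABS_VOICE_ID env var, and the model name are placeholders to verify against their current docs:
// src/app/api/tts/route.ts — sketch; adjust paths and env vars to your setup
export async function POST(req: Request) {
  const { text } = await req.json()

  const upstream = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${process.env.ELEVENLABS_VOICE_ID}/stream`,
    {
      method: 'POST',
      headers: {
        'xi-api-key': process.env.ELEVENLABS_API_KEY!,
        'content-type': 'application/json',
      },
      body: JSON.stringify({ text, model_id: 'eleven_turbo_v2' }),
    }
  )

  if (!upstream.ok || !upstream.body) {
    return new Response('TTS upstream failed', { status: 502 })
  }

  // Pass the MP3 stream straight through so the client can pipe it into MediaSource
  return new Response(upstream.body, {
    headers: { 'content-type': 'audio/mpeg' },
  })
}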
Step 5: The streaming LLM call
Keep this part dumb. SSE over fetch, abort signal for cancellation:
// src/lib/chat-stream.ts
export async function* streamChat(
message: string,
history: { role: string; content: string }[],
signal: AbortSignal
) {
const response = await fetch('/api/chat', {
method: 'POST',
body: JSON.stringify({ message, history }),
headers: { 'content-type': 'application/json' },
signal,
})
if (!response.ok || !response.body) throw new Error('chat stream failed')
const reader = response.body.pipeThrough(new TextDecoderStream()).getReader()
let buffer = ''
while (true) {
const { done, value } = await reader.read()
if (done) return
buffer += value
const lines = buffer.split('\n')
buffer = lines.pop() ?? ''
for (const line of lines) {
if (!line.startsWith('data:')) continue
const data = line.slice(5).trim()
if (data === '[DONE]') return
try {
const parsed = JSON.parse(data) as { delta?: string }
if (parsed.delta) yield parsed.delta
} catch {
// ignore malformed events
}
}
}
}
The /api/chat route can wrap any LLM SDK — Anthropic, OpenAI, whatever. I covered the Claude streaming version in detail in the post on streaming Claude API responses in Next.js, so I won't repeat it here.
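Whatever SDK the route wraps, the parser above only assumes two things: each chunk arrives on a data: line carrying JSON with a delta field, and the stream ends with data: [DONE]. A tiny server-side framing helper that matches it (sseEncode is a hypothetical name, not from any SDK):
// SSE framing matching what streamChat() parses on the client.
const encoder = new TextEncoder()

export function sseEncode(delta: string): Uint8Array {
  return encoder.encode(`data: ${JSON.stringify({ delta })}\n\n`)
}

export const SSE_DONE = encoder.encode('data: [DONE]\n\n')

// In the route handler: write sseEncode(chunk) for every model delta into a
// ReadableStream, then SSE_DONE, and respond with content-type: text/event-stream.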
Step 6: Wiring it all together in React
// src/components/VoiceAssistant.tsx
import { useEffect, useRef, useState } from 'react'
import { ContinuousRecognizer } from '@/src/lib/speech-recognition'
import { WakeWordDetector } from '@/src/lib/wake-word'
import { UtteranceBuffer } from '@/src/lib/utterance-buffer'
import { SpeechOutput } from '@/src/lib/speech-output'
import { streamChat } from '@/src/lib/chat-stream'
type State = 'idle' | 'listening' | 'thinking' | 'speaking'
export function VoiceAssistant() {
const [state, setState] = useState<State>('idle')
const [transcript, setTranscript] = useState('')
const [response, setResponse] = useState('')
const recognizerRef = useRef<ContinuousRecognizer | null>(null)
const wakeDetectorRef = useRef(new WakeWordDetector())
const bufferRef = useRef<UtteranceBuffer | null>(null)
const speechRef = useRef<SpeechOutput | null>(null)
const abortRef = useRef<AbortController | null>(null)
const historyRef = useRef<{ role: string; content: string }[]>([])
const stateRef = useRef<State>('idle')
useEffect(() => {
stateRef.current = state
}, [state])
useEffect(() => {
if (!ContinuousRecognizer.isSupported()) return
speechRef.current = new SpeechOutput()
bufferRef.current = new UtteranceBuffer(async (utterance) => {
setState('thinking')
setTranscript(utterance)
await handleUserMessage(utterance)
})
recognizerRef.current = new ContinuousRecognizer((event) => {
if (event.type === 'error') {
console.warn('recognition error', event.error)
return
}
if (event.type !== 'partial' && event.type !== 'final') return
const current = stateRef.current
// Barge-in: any speech while the bot is speaking interrupts it
if (current === 'speaking') {
speechRef.current?.stop()
abortRef.current?.abort()
setState('listening')
bufferRef.current?.cancel()
}
const detector = wakeDetectorRef.current
if (current === 'idle' && detector.matches(event.transcript)) {
setState('listening')
const stripped = detector.stripWakeWord(event.transcript)
if (stripped && event.type === 'final') {
bufferRef.current?.append(stripped)
}
return
}
if (stateRef.current === 'listening' && event.type === 'final') {
bufferRef.current?.append(event.transcript)
}
})
recognizerRef.current.start()
return () => {
recognizerRef.current?.stop()
speechRef.current?.stop()
abortRef.current?.abort()
}
}, [])
async function handleUserMessage(message: string) {
abortRef.current = new AbortController()
historyRef.current.push({ role: 'user', content: message })
let full = ''
setResponse('')
try {
for await (const delta of streamChat(
message,
historyRef.current,
abortRef.current.signal
)) {
full += delta
setResponse(full)
}
} catch (err) {
if ((err as Error).name === 'AbortError') return
setResponse('Sorry, I lost connection.')
full = 'Sorry, I lost connection.'
}
historyRef.current.push({ role: 'assistant', content: full })
setState('speaking')
speechRef.current?.speak(full, () => {
if (stateRef.current === 'speaking') setState('idle')
})
}
if (!ContinuousRecognizer.isSupported()) {
return <p>Voice mode isn't supported in this browser. Try Chrome or Safari.</p>
}
return (
<div className="rounded-lg border p-4">
<div className="mb-2 text-sm uppercase tracking-wide opacity-60">
{state}
</div>
<div className="mb-3">
<strong>You:</strong> {transcript || <em>say "hey claude..."</em>}
</div>
<div>
<strong>Assistant:</strong> {response}
</div>
</div>
)
}
The stateRef is the load-bearing trick — without it the event handlers close over the stale state value and barge-in stops working after the first conversation turn.
Mobile gotchas you'll hit
speechSynthesis.speak() on iOS Safari requires a user gesture for the first call. After a tap anywhere on the page, subsequent calls work. The fix: render a "Start" button that calls speechSynthesis.speak(new SpeechSynthesisUtterance(' ')) on click. After that one-time gesture, autoplay is unlocked for the session.
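A sketch of that unlock button — the onUnlocked prop is just for illustration; use it to flip whatever state reveals the assistant:
// src/components/UnlockAudioButton.tsx
export function UnlockAudioButton({ onUnlocked }: { onUnlocked: () => void }) {
  return (
    <button
      onClick={() => {
        // A near-silent utterance inside the click handler counts as the
        // user gesture iOS needs; later speak() calls then work unattended.
        speechSynthesis.speak(new SpeechSynthesisUtterance(' '))
        onUnlocked()
      }}
    >
      Start voice mode
    </button>
  )
}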
Mobile Chrome on Android stops recognition aggressively. Even with continuous = true, you'll see onend fire every 5-10 seconds. The auto-restart loop in ContinuousRecognizer.spawn() handles this transparently, but the user may notice a half-second gap where their speech isn't captured. Mention this in your UI — a pulsing dot beats silent dead air.
iOS Safari plays an audible beep every time recognition restarts. There's no way to disable it. If this is unacceptable, your only option is server-side STT (Whisper, Deepgram) streamed from a MediaRecorder capture. That's a different post.
When the Web Speech API isn't enough
Use the browser API if: you want zero infrastructure, your users are on desktop Chrome/Edge, and you're okay with English-heavy language detection.
Skip it if: you need accurate multi-language transcription, you want word-level timestamps, you need silence detection that works in noisy environments, or you're targeting Android-only. In those cases, capture audio with MediaRecorder, send PCM/Opus chunks to Deepgram or AssemblyAI over WebSocket, and you're back in control. The component structure above doesn't change — just swap ContinuousRecognizer for a WebSocketRecognizer.
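Here's a skeleton of that swap, keeping the same RecognitionEvent listener shape. The WebSocket URL and the message format (assumed here to be JSON with transcript and is_final fields) depend entirely on the STT provider you point it at — treat this as a sketch, not a Deepgram or AssemblyAI client:
// src/lib/websocket-recognizer.ts — sketch; adapt the message parsing to your provider
type RecognitionEvent =
  | { type: 'partial'; transcript: string }
  | { type: 'final'; transcript: string }
  | { type: 'error'; error: string }
  | { type: 'end' }

export class WebSocketRecognizer {
  private ws: WebSocket | null = null
  private recorder: MediaRecorder | null = null

  constructor(private listener: (event: RecognitionEvent) => void) {}

  async start() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
    this.ws = new WebSocket('wss://example.com/stt') // placeholder endpoint

    this.ws.onmessage = (msg) => {
      // Assumed server payload: { transcript: string, is_final: boolean }
      const data = JSON.parse(msg.data as string)
      if (!data.transcript) return
      this.listener({
        type: data.is_final ? 'final' : 'partial',
        transcript: data.transcript,
      })
    }
    this.ws.onerror = () => this.listener({ type: 'error', error: 'websocket' })
    this.ws.onclose = () => this.listener({ type: 'end' })

    this.ws.onopen = () => {
      // Send Opus chunks every 250ms; most streaming STT APIs accept webm/opus.
      this.recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' })
      this.recorder.ondataavailable = (e) => {
        if (e.data.size > 0 && this.ws?.readyState === WebSocket.OPEN) {
          this.ws.send(e.data)
        }
      }
      this.recorder.start(250)
    }
  }

  stop() {
    this.recorder?.stop()
    this.recorder?.stream.getTracks().forEach((t) => t.stop())
    this.ws?.close()
    this.recorder = null
    this.ws = null
  }
}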
What's next
Voice is the most demanding latency budget in AI UX — every 100ms of round-trip is noticeable. The fastest cheat is sending easy turns to a small model and only escalating to a big one when needed. I wrote that up in How to Route LLM Requests to Cheap vs Expensive Models Automatically in TypeScript — pair it with this assistant to drop average response latency by half without losing quality on hard questions.