Build a Voice-Enabled AI Assistant in the Browser with TypeScript
Monday 11/05/2026 · 12 min read

You want a voice AI assistant in the browser. You wire up SpeechRecognition, pipe transcripts to an LLM, pipe responses to speechSynthesis.speak(), and ship. Then real users hit it. The assistant talks over itself. Safari refuses to recognize anything until the page is reloaded. The user interrupts mid-answer and the bot keeps droning on for another 12 seconds. On mobile Chrome the mic cuts off every 3 seconds. The "wake word" fires every time someone says "hey" in a podcast playing in the background.
A browser-based voice AI assistant in TypeScript isn't hard to start. It's hard to make it feel like a real conversation. This post walks through the parts the demos skip: barge-in (interruption), wake word gating, restart loops, mobile autoplay rules, and how to fall back gracefully when the Web Speech API isn't available. Full TypeScript, full React, no native SDKs.
What we're building
A React component that:
- Listens continuously for a wake phrase ("hey claude") using the browser's SpeechRecognition.
- After wake, records the user's utterance until they stop talking.
- Sends the transcript to an LLM endpoint and streams the response.
- Speaks the response back using speechSynthesis (with an ElevenLabs fallback path for higher quality).
- Stops speaking the instant the user starts talking again (barge-in).
- Handles Chrome's auto-restart bug, Safari's permission quirks, and mobile autoplay restrictions.
Architecture:
[mic] → SpeechRecognition → wake gate → utterance buffer
↓
/api/chat (SSE stream)
↓
speechSynthesis ← barge-in monitor
No backend dependency for speech — it all runs client-side. The only server call is the LLM stream.
The Web Speech API and why it's quirky
The Web Speech API gives you two pieces: SpeechRecognition (speech-to-text) and SpeechSynthesis (text-to-speech). Both are widely supported but inconsistent in important ways.
SpeechRecognition is prefixed webkitSpeechRecognition in Chrome and Edge. Firefox has no support without a flag. Safari supports it but does the recognition on Apple servers (you'll see a notification on macOS). Mobile Chrome on Android works but auto-stops after ~3 seconds of silence and needs to be restarted manually.
SpeechSynthesis is universal but the available voices vary per device. iOS gives you good voices for free; Android voices sound robotic. voiceschanged fires asynchronously, so reading getVoices() synchronously on page load returns an empty array half the time.
Both APIs are designed around one-shot interactions, not continuous conversation. That's where most tutorials stop. We're going past that.
Step 1: A safe speech recognition wrapper
Start with a typed wrapper that papers over the prefix mess and handles the auto-restart pattern.
pnpm add nanoid
pnpm add -D @types/dom-speech-recognition
// src/lib/speech-recognition.ts
type RecognitionEvent =
| { type: 'partial'; transcript: string }
| { type: 'final'; transcript: string }
| { type: 'error'; error: string }
| { type: 'end' }
type Listener = (event: RecognitionEvent) => void
export class ContinuousRecognizer {
private recognition: SpeechRecognition | null = null
private listener: Listener
private shouldRun = false
private restartTimer: number | null = null
constructor(listener: Listener) {
this.listener = listener
}
static isSupported(): boolean {
return (
typeof window !== 'undefined' &&
('SpeechRecognition' in window || 'webkitSpeechRecognition' in window)
)
}
start() {
if (!ContinuousRecognizer.isSupported()) {
this.listener({ type: 'error', error: 'unsupported' })
return
}
this.shouldRun = true
this.spawn()
}
stop() {
this.shouldRun = false
if (this.restartTimer) window.clearTimeout(this.restartTimer)
this.recognition?.stop()
this.recognition = null
}
private spawn() {
const Ctor =
(window as any).SpeechRecognition ||
(window as any).webkitSpeechRecognition
const rec: SpeechRecognition = new Ctor()
rec.continuous = true
rec.interimResults = true
rec.lang = 'en-US'
rec.onresult = (event) => {
for (let i = event.resultIndex; i < event.results.length; i++) {
const result = event.results[i]
const transcript = result[0].transcript.trim()
if (!transcript) continue
this.listener({
type: result.isFinal ? 'final' : 'partial',
transcript,
})
}
}
rec.onerror = (event) => {
// 'no-speech' and 'aborted' are normal — don't surface them
if (event.error === 'no-speech' || event.error === 'aborted') return
this.listener({ type: 'error', error: event.error })
}
rec.onend = () => {
this.listener({ type: 'end' })
if (this.shouldRun) {
// Chrome stops recognition after ~60s even with continuous=true.
// Restart it after a short delay to avoid a tight loop on errors.
this.restartTimer = window.setTimeout(() => this.spawn(), 250)
}
}
try {
rec.start()
this.recognition = rec
} catch (err) {
// start() throws if called too soon after stop()
this.restartTimer = window.setTimeout(() => this.spawn(), 500)
}
}
}
Two gotchas baked in here: the auto-restart loop (Chrome stops continuous recognition after about 60 seconds) and the try/catch around start(), which throws an InvalidStateError if you call it before the previous recognition fully ended. The 250ms delay is empirical — anything shorter fails on slower devices.
Step 2: Wake word gating without paying for Porcupine
Real wake-word detection uses on-device neural models (Porcupine, Picovoice). For a hobby app or internal tool, you can gate on a phrase in the live transcript instead. It's not robust against noise but it's free.
// src/lib/wake-word.ts
const WAKE_PHRASES = ['hey claude', 'hi claude', 'okay claude']
const COOLDOWN_MS = 2000
export class WakeWordDetector {
private lastTrigger = 0
matches(transcript: string): boolean {
const now = Date.now()
if (now - this.lastTrigger < COOLDOWN_MS) return false
const lower = transcript.toLowerCase()
const hit = WAKE_PHRASES.some((phrase) => lower.includes(phrase))
if (hit) this.lastTrigger = now
return hit
}
stripWakeWord(transcript: string): string {
let result = transcript
for (const phrase of WAKE_PHRASES) {
const idx = result.toLowerCase().indexOf(phrase)
if (idx >= 0) {
result = result.slice(idx + phrase.length).trim()
break
}
}
return result
}
}
The cooldown is the part most demos miss. Without it, every partial event fires the wake word again, because the phrase stays in the transcript buffer for a second or two.
If you want real on-device wake-word detection later, swap this class for a Porcupine Web wrapper with the same matches/stripWakeWord interface. Nothing else has to change.
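To make that swap painless, you can pin down the interface now. A minimal sketch, assuming a hypothetical WakeDetector interface and PorcupineWakeDetector class — the names are illustrative, not part of the Porcupine SDK:
// src/lib/wake-detector.ts — illustrative interface, not from any SDK
export interface WakeDetector {
  matches(transcript: string): boolean
  stripWakeWord(transcript: string): string
}

// A Porcupine-backed detector would listen to raw audio itself and set a flag
// from its keyword callback instead of scanning the transcript.
export class PorcupineWakeDetector implements WakeDetector {
  private triggered = false

  // Wire this up to the Porcupine Web keyword-detection callback (not shown).
  notifyKeyword() {
    this.triggered = true
  }

  matches(_transcript: string): boolean {
    const hit = this.triggered
    this.triggered = false
    return hit
  }

  stripWakeWord(transcript: string): string {
    // The wake phrase never appears in SpeechRecognition's transcript here.
    return transcript
  }
}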
Step 3: Detecting end of utterance
After the wake word fires, you need to know when the user has finished talking. The naive approach is to wait for a final event from SpeechRecognition, but on mobile that can take 5+ seconds. A better signal is a silence timer that resets every time a new word arrives.
// src/lib/utterance-buffer.ts
export class UtteranceBuffer {
private buffer = ''
private silenceTimer: number | null = null
private onComplete: (transcript: string) => void
private silenceMs: number
constructor(onComplete: (transcript: string) => void, silenceMs = 1200) {
this.onComplete = onComplete
this.silenceMs = silenceMs
}
append(text: string) {
this.buffer = this.buffer ? `${this.buffer} ${text}` : text
this.resetSilenceTimer()
}
flush() {
if (this.silenceTimer) window.clearTimeout(this.silenceTimer)
const transcript = this.buffer.trim()
this.buffer = ''
this.silenceTimer = null
if (transcript) this.onComplete(transcript)
}
cancel() {
if (this.silenceTimer) window.clearTimeout(this.silenceTimer)
this.buffer = ''
this.silenceTimer = null
}
private resetSilenceTimer() {
if (this.silenceTimer) window.clearTimeout(this.silenceTimer)
this.silenceTimer = window.setTimeout(() => this.flush(), this.silenceMs)
}
}
1200ms is a reasonable default. Shorter and you cut people off mid-sentence; longer and the assistant feels sluggish.
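Standalone, the buffer hangs off the recognizer's final results. A wiring sketch of the same pattern the Step 6 component uses:
import { ContinuousRecognizer } from './speech-recognition'
import { UtteranceBuffer } from './utterance-buffer'

// The silence timer decides when the utterance is done; only final results
// are appended so interim guesses don't pollute the transcript.
const buffer = new UtteranceBuffer((utterance) => {
  console.log('utterance complete:', utterance)
}, 1200)

const recognizer = new ContinuousRecognizer((event) => {
  if (event.type === 'final') buffer.append(event.transcript)
})
recognizer.start()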
Step 4: Speech synthesis with barge-in
The killer feature for a voice assistant is being interruptible. If the user starts talking while the bot is speaking, the bot needs to shut up immediately. This is the difference between a toy demo and something you'd actually use.
// src/lib/speech-output.ts
export class SpeechOutput {
private utterance: SpeechSynthesisUtterance | null = null
private voicesReady: Promise<void>
constructor() {
// voiceschanged fires async on some platforms
this.voicesReady = new Promise((resolve) => {
if (speechSynthesis.getVoices().length > 0) return resolve()
speechSynthesis.onvoiceschanged = () => resolve()
})
}
async speak(text: string, onEnd?: () => void) {
await this.voicesReady
this.stop()
const utterance = new SpeechSynthesisUtterance(text)
utterance.rate = 1.05
utterance.pitch = 1.0
utterance.volume = 1.0
const preferred = speechSynthesis
.getVoices()
.find((v) => v.lang.startsWith('en') && v.localService)
if (preferred) utterance.voice = preferred
utterance.onend = () => {
this.utterance = null
onEnd?.()
}
utterance.onerror = () => {
this.utterance = null
onEnd?.()
}
this.utterance = utterance
speechSynthesis.speak(utterance)
}
stop() {
if (this.utterance) {
this.utterance.onend = null
this.utterance.onerror = null
this.utterance = null
}
speechSynthesis.cancel()
}
isSpeaking(): boolean {
return speechSynthesis.speaking
}
}
Note the manual onend = null before calling cancel(). Without it, Chrome fires the onend handler after cancellation, which can trigger the next state transition in your component and lead to surprising loops.
If you want higher-quality voices, swap speechSynthesis.speak() for an ElevenLabs streaming call:
// src/lib/elevenlabs-output.ts
export async function speakWithElevenLabs(text: string, signal: AbortSignal) {
const response = await fetch('/api/tts', {
method: 'POST',
body: JSON.stringify({ text }),
signal,
})
if (!response.ok || !response.body) throw new Error('TTS failed')
const audio = new Audio()
const mediaSource = new MediaSource()
// Attach the sourceopen handler before setting src and calling play():
// play() won't resolve until data is buffered, and that data is only
// appended inside this handler.
mediaSource.addEventListener('sourceopen', async () => {
const sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg')
const reader = response.body!.getReader()
try {
while (true) {
const { done, value } = await reader.read()
if (done) break
await new Promise<void>((resolve) => {
sourceBuffer.addEventListener('updateend', () => resolve(), { once: true })
sourceBuffer.appendBuffer(value)
})
}
if (mediaSource.readyState === 'open') mediaSource.endOfStream()
} catch {
// reader.read() rejects when the fetch is aborted by barge-in — nothing to do
}
})
audio.src = URL.createObjectURL(mediaSource)
await audio.play()
return () => {
audio.pause()
audio.src = ''
}
}
The signal lets you abort the stream when barge-in fires. The server route is a thin proxy to ElevenLabs' streaming endpoint — straightforward.
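For reference, here's a minimal sketch of that proxy as a Next.js App Router route handler. The stack is assumed, and the ElevenLabs URL shape, the ELEVENLABS_VOICE_ID env var, and the model name are placeholders to verify against their current docs:
// src/app/api/tts/route.ts — sketch; adjust paths and env vars to your setup
export async function POST(req: Request) {
  const { text } = await req.json()

  const upstream = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${process.env.ELEVENLABS_VOICE_ID}/stream`,
    {
      method: 'POST',
      headers: {
        'xi-api-key': process.env.ELEVENLABS_API_KEY!,
        'content-type': 'application/json',
      },
      body: JSON.stringify({ text, model_id: 'eleven_turbo_v2' }),
    }
  )

  if (!upstream.ok || !upstream.body) {
    return new Response('TTS upstream failed', { status: 502 })
  }

  // Pass the MP3 stream straight through so the client can pipe it into MediaSource
  return new Response(upstream.body, {
    headers: { 'content-type': 'audio/mpeg' },
  })
}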
Step 5: The streaming LLM call
Keep this part dumb. SSE over fetch, abort signal for cancellation:
// src/lib/chat-stream.ts
export async function* streamChat(
message: string,
history: { role: string; content: string }[],
signal: AbortSignal
) {
const response = await fetch('/api/chat', {
method: 'POST',
body: JSON.stringify({ message, history }),
headers: { 'content-type': 'application/json' },
signal,
})
if (!response.ok || !response.body) throw new Error('chat stream failed')
const reader = response.body.pipeThrough(new TextDecoderStream()).getReader()
let buffer = ''
while (true) {
const { done, value } = await reader.read()
if (done) return
buffer += value
const lines = buffer.split('\n')
buffer = lines.pop() ?? ''
for (const line of lines) {
if (!line.startsWith('data:')) continue
const data = line.slice(5).trim()
if (data === '[DONE]') return
try {
const parsed = JSON.parse(data) as { delta?: string }
if (parsed.delta) yield parsed.delta
} catch {
// ignore malformed events
}
}
}
}
The /api/chat route can wrap any LLM SDK — Anthropic, OpenAI, whatever. I covered the Claude streaming version in detail in the post on streaming Claude API responses in Next.js, so I won't repeat it here.
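Whatever SDK the route wraps, the parser above only assumes two things: each chunk arrives on a data: line carrying JSON with a delta field, and the stream ends with data: [DONE]. A tiny server-side framing helper that matches it (sseEncode is a hypothetical name, not from any SDK):
// SSE framing matching what streamChat() parses on the client.
const encoder = new TextEncoder()

export function sseEncode(delta: string): Uint8Array {
  return encoder.encode(`data: ${JSON.stringify({ delta })}\n\n`)
}

export const SSE_DONE = encoder.encode('data: [DONE]\n\n')

// In the route handler: write sseEncode(chunk) for every model delta into a
// ReadableStream, then SSE_DONE, and respond with content-type: text/event-stream.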
Step 6: Wiring it all together in React
// src/components/VoiceAssistant.tsx
import { useEffect, useRef, useState } from 'react'
import { ContinuousRecognizer } from '@/src/lib/speech-recognition'
import { WakeWordDetector } from '@/src/lib/wake-word'
import { UtteranceBuffer } from '@/src/lib/utterance-buffer'
import { SpeechOutput } from '@/src/lib/speech-output'
import { streamChat } from '@/src/lib/chat-stream'
type State = 'idle' | 'listening' | 'thinking' | 'speaking'
export function VoiceAssistant() {
const [state, setState] = useState<State>('idle')
const [transcript, setTranscript] = useState('')
const [response, setResponse] = useState('')
const recognizerRef = useRef<ContinuousRecognizer | null>(null)
const wakeDetectorRef = useRef(new WakeWordDetector())
const bufferRef = useRef<UtteranceBuffer | null>(null)
const speechRef = useRef<SpeechOutput | null>(null)
const abortRef = useRef<AbortController | null>(null)
const historyRef = useRef<{ role: string; content: string }[]>([])
const stateRef = useRef<State>('idle')
useEffect(() => {
stateRef.current = state
}, [state])
useEffect(() => {
if (!ContinuousRecognizer.isSupported()) return
speechRef.current = new SpeechOutput()
bufferRef.current = new UtteranceBuffer(async (utterance) => {
setState('thinking')
setTranscript(utterance)
await handleUserMessage(utterance)
})
recognizerRef.current = new ContinuousRecognizer((event) => {
if (event.type === 'error') {
console.warn('recognition error', event.error)
return
}
if (event.type !== 'partial' && event.type !== 'final') return
const current = stateRef.current
// Barge-in: any speech while the bot is speaking interrupts it
if (current === 'speaking') {
speechRef.current?.stop()
abortRef.current?.abort()
setState('listening')
bufferRef.current?.cancel()
}
const detector = wakeDetectorRef.current
if (current === 'idle' && detector.matches(event.transcript)) {
setState('listening')
const stripped = detector.stripWakeWord(event.transcript)
if (stripped && event.type === 'final') {
bufferRef.current?.append(stripped)
}
return
}
if (stateRef.current === 'listening' && event.type === 'final') {
bufferRef.current?.append(event.transcript)
}
})
recognizerRef.current.start()
return () => {
recognizerRef.current?.stop()
speechRef.current?.stop()
abortRef.current?.abort()
}
}, [])
async function handleUserMessage(message: string) {
abortRef.current = new AbortController()
historyRef.current.push({ role: 'user', content: message })
let full = ''
setResponse('')
try {
for await (const delta of streamChat(
message,
historyRef.current,
abortRef.current.signal
)) {
full += delta
setResponse(full)
}
} catch (err) {
if ((err as Error).name === 'AbortError') return
setResponse('Sorry, I lost connection.')
full = 'Sorry, I lost connection.'
}
historyRef.current.push({ role: 'assistant', content: full })
setState('speaking')
speechRef.current?.speak(full, () => {
if (stateRef.current === 'speaking') setState('idle')
})
}
if (!ContinuousRecognizer.isSupported()) {
return <p>Voice mode isn't supported in this browser. Try Chrome or Safari.</p>
}
return (
<div className="rounded-lg border p-4">
<div className="mb-2 text-sm uppercase tracking-wide opacity-60">
{state}
</div>
<div className="mb-3">
<strong>You:</strong> {transcript || <em>say "hey claude..."</em>}
</div>
<div>
<strong>Assistant:</strong> {response}
</div>
</div>
)
}
The stateRef is the load-bearing trick — without it the event handlers close over the stale state value and barge-in stops working after the first conversation turn.
Mobile gotchas you'll hit
speechSynthesis.speak() on iOS Safari requires a user gesture for the first call. After a tap anywhere on the page, subsequent calls work. The fix: render a "Start" button that calls speechSynthesis.speak(new SpeechSynthesisUtterance(' ')) on click. After that one-time gesture, autoplay is unlocked for the session.
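A sketch of that unlock button — the onUnlocked prop is just for illustration; use it to flip whatever state reveals the assistant:
// src/components/UnlockAudioButton.tsx
export function UnlockAudioButton({ onUnlocked }: { onUnlocked: () => void }) {
  return (
    <button
      onClick={() => {
        // A near-silent utterance inside the click handler counts as the
        // user gesture iOS needs; later speak() calls then work unattended.
        speechSynthesis.speak(new SpeechSynthesisUtterance(' '))
        onUnlocked()
      }}
    >
      Start voice mode
    </button>
  )
}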
Mobile Chrome on Android stops recognition aggressively. Even with continuous = true, you'll see onend fire every 5-10 seconds. The auto-restart loop in ContinuousRecognizer.spawn() handles this transparently, but the user may notice a half-second gap where their speech isn't captured. Mention this in your UI — a pulsing dot beats silent dead air.
iOS Safari plays an audible beep every time recognition restarts. There's no way to disable it. If this is unacceptable, your only option is server-side STT (Whisper, Deepgram) streamed from a MediaRecorder capture. That's a different post.
When the Web Speech API isn't enough
Use the browser API if: you want zero infrastructure, your users are on desktop Chrome/Edge, and you're okay with English-heavy language detection.
Skip it if: you need accurate multi-language transcription, you want word-level timestamps, you need silence detection that works in noisy environments, or you're targeting Android-only. In those cases, capture audio with MediaRecorder, send PCM/Opus chunks to Deepgram or AssemblyAI over WebSocket, and you're back in control. The component structure above doesn't change — just swap ContinuousRecognizer for a WebSocketRecognizer.
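Here's a skeleton of that swap, keeping the same RecognitionEvent listener shape. The WebSocket URL and the message format (assumed here to be JSON with transcript and is_final fields) depend entirely on the STT provider you point it at — treat this as a sketch, not a Deepgram or AssemblyAI client:
// src/lib/websocket-recognizer.ts — sketch; adapt the message parsing to your provider
type RecognitionEvent =
  | { type: 'partial'; transcript: string }
  | { type: 'final'; transcript: string }
  | { type: 'error'; error: string }
  | { type: 'end' }

export class WebSocketRecognizer {
  private ws: WebSocket | null = null
  private recorder: MediaRecorder | null = null

  constructor(private listener: (event: RecognitionEvent) => void) {}

  async start() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
    this.ws = new WebSocket('wss://example.com/stt') // placeholder endpoint

    this.ws.onmessage = (msg) => {
      // Assumed server payload: { transcript: string, is_final: boolean }
      const data = JSON.parse(msg.data as string)
      if (!data.transcript) return
      this.listener({
        type: data.is_final ? 'final' : 'partial',
        transcript: data.transcript,
      })
    }
    this.ws.onerror = () => this.listener({ type: 'error', error: 'websocket' })
    this.ws.onclose = () => this.listener({ type: 'end' })

    this.ws.onopen = () => {
      // Send Opus chunks every 250ms; most streaming STT APIs accept webm/opus.
      this.recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' })
      this.recorder.ondataavailable = (e) => {
        if (e.data.size > 0 && this.ws?.readyState === WebSocket.OPEN) {
          this.ws.send(e.data)
        }
      }
      this.recorder.start(250)
    }
  }

  stop() {
    this.recorder?.stop()
    this.recorder?.stream.getTracks().forEach((t) => t.stop())
    this.ws?.close()
    this.recorder = null
    this.ws = null
  }
}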
What's next
Voice is the most demanding latency budget in AI UX — every 100ms of round-trip is noticeable. The fastest cheat is sending easy turns to a small model and only escalating to a big one when needed. I wrote that up in How to Route LLM Requests to Cheap vs Expensive Models Automatically in TypeScript — pair it with this assistant to drop average response latency by half without losing quality on hard questions.