OpenAI Realtime API vs Gemini Live vs Pipecat: Which for Voice AI in TypeScript

Monday 08/06/2026


You ship a voice agent and the demo sounds magical. Then a real user talks over it, the agent keeps droning on for two more seconds, and the whole illusion of "talking to something intelligent" collapses. Voice AI lives and dies on two numbers that no blog post screenshot can show you: time-to-first-audio and how fast the model shuts up when you interrupt it. Pick the wrong API and you'll spend weeks fighting latency you can't fix.

I built the same voice agent - a restaurant reservation assistant that checks availability and books a table - three times in TypeScript: once on the OpenAI Realtime API (WebRTC), once on Gemini Live (WebSockets), and once on Pipecat-JS, the open framework that wraps both. This post is the honest comparison: real code for the parts that matter, measured latency, interruption behavior, and cost per minute of conversation.

What "real-time voice" actually requires

Before the code, here's the bar. A natural voice conversation needs four things, and most provider demos only nail the first one:

  1. Sub-500ms response latency - from the moment you stop talking to the first audio byte coming back. Above ~800ms it feels like a bad phone line.
  2. Barge-in (interruption) - when the user starts talking, the model must stop immediately and discard its queued audio. This is the single hardest part.
  3. Function calling mid-stream - the model needs to call check_availability() while staying in the voice loop, then speak the result.
  4. Predictable cost - voice models bill audio input and output tokens separately, and the math is not intuitive.

Both provider APIs handle speech-to-speech natively (no separate STT → LLM → TTS pipeline), which is what gets you under 500ms, and Pipecat can sit in front of either one. The differences are in transport, control, and how much each option hides from you.

Option 1: OpenAI Realtime API (WebRTC)

OpenAI's Realtime API is built around WebRTC for browser clients. The key architectural decision: you never put your API key in the browser. Instead, your server mints a short-lived ephemeral token, and the browser uses that to open a peer connection directly to OpenAI. Audio flows over WebRTC's media channel; events flow over a data channel.

Here's the token endpoint:

// app/api/realtime-token/route.ts
import { NextResponse } from 'next/server'

export async function POST() {
    try {
        const res = await fetch('https://api.openai.com/v1/realtime/client_secrets', {
            method: 'POST',
            headers: {
                Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
                'Content-Type': 'application/json',
            },
            body: JSON.stringify({
                session: {
                    type: 'realtime',
                    model: 'gpt-realtime',
                    audio: {
                        output: { voice: 'marin' },
                    },
                },
            }),
        })

        if (!res.ok) {
            const detail = await res.text()
            return NextResponse.json(
                { error: `OpenAI token request failed: ${detail}` },
                { status: res.status }
            )
        }

        const data: { value: string; expires_at: number } = await res.json()
        return NextResponse.json({ token: data.value, expiresAt: data.expires_at })
    } catch (err) {
        const message = err instanceof Error ? err.message : 'Unknown error'
        return NextResponse.json({ error: message }, { status: 500 })
    }
}
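
Because the token is short-lived, the client should not reuse one that is about to expire. A minimal sketch of that check follows. The helper name isTokenUsable and the 10-second safety margin are hypothetical, and it assumes expiresAt is a Unix timestamp in seconds, as passed through by the route above:

```typescript
// Hypothetical helper: decide whether a cached ephemeral token is still worth using.
// Assumes `expiresAt` is a Unix timestamp in seconds.
interface EphemeralToken {
    token: string
    expiresAt: number
}

function isTokenUsable(
    t: EphemeralToken,
    nowMs: number = Date.now(),
    safetyMarginSec = 10 // illustrative margin to cover the SDP round trip
): boolean {
    const nowSec = Math.floor(nowMs / 1000)
    return t.expiresAt - nowSec > safetyMarginSec
}
```

If the check fails, mint a fresh token before creating the peer connection instead of retrying the SDP exchange with a stale one.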

On the client, you grab the mic, create an RTCPeerConnection, and wire OpenAI's audio track to an <audio> element. WebRTC handles jitter buffering, echo cancellation, and packet loss for you - this is why it feels smoother than raw WebSockets:

// lib/realtime-client.ts
export interface RealtimeHandlers {
    onUserSpeechStart: () => void
    onError: (message: string) => void
}

export async function connectRealtime(
    audioEl: HTMLAudioElement,
    tools: object[],
    handlers: RealtimeHandlers
): Promise<RTCPeerConnection> {
    const tokenRes = await fetch('/api/realtime-token', { method: 'POST' })
    if (!tokenRes.ok) throw new Error('Failed to mint ephemeral token')
    const { token }: { token: string } = await tokenRes.json()

    const pc = new RTCPeerConnection()

    // Remote audio from the model
    pc.ontrack = (event) => {
        audioEl.srcObject = event.streams[0]
    }

    // Local mic
    const mic = await navigator.mediaDevices.getUserMedia({ audio: true })
    pc.addTrack(mic.getTracks()[0])

    // Data channel for events (function calls, transcripts, VAD signals)
    const channel = pc.createDataChannel('oai-events')
    channel.onmessage = (e) => {
        const event = JSON.parse(e.data)
        if (event.type === 'input_audio_buffer.speech_started') {
            handlers.onUserSpeechStart()
        }
        if (event.type === 'error') {
            handlers.onError(event.error?.message ?? 'Realtime error')
        }
    }

    // Register tools once the channel opens
    channel.onopen = () => {
        channel.send(
            JSON.stringify({
                type: 'session.update',
                session: { tools, tool_choice: 'auto' },
            })
        )
    }

    const offer = await pc.createOffer()
    await pc.setLocalDescription(offer)

    const sdpRes = await fetch('https://api.openai.com/v1/realtime/calls?model=gpt-realtime', {
        method: 'POST',
        body: offer.sdp,
        headers: {
            Authorization: `Bearer ${token}`,
            'Content-Type': 'application/sdp',
        },
    })

    if (!sdpRes.ok) throw new Error('SDP exchange failed')
    await pc.setRemoteDescription({ type: 'answer', sdp: await sdpRes.text() })

    return pc
}
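
connectRealtime takes a tools array, so here is a sketch of what that array might contain for the reservation agent. The field names date and partySize match the arguments the tool handler reads; the descriptions and the ISO 8601 hint are illustrative assumptions, not something the API requires:

```typescript
// Assumed tool definition for the reservation agent, in the flat
// { type, name, description, parameters } shape used by session.update.
const checkAvailabilityTool = {
    type: 'function',
    name: 'check_availability',
    description: 'Check whether a table is available for a given date and party size.',
    parameters: {
        type: 'object',
        properties: {
            // Illustrative descriptions; tune them to how your users phrase requests
            date: { type: 'string', description: 'Requested date and time, ISO 8601' },
            partySize: { type: 'integer', minimum: 1, description: 'Number of guests' },
        },
        required: ['date', 'partySize'],
    },
}

const tools: object[] = [checkAvailabilityTool]
```

Keeping the definition in one module makes it easier to reuse the same name and parameter schema when you port the agent to another provider.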

Interruption with WebRTC is basically free - this is the big win. Because audio plays through a real media track, when the server-side VAD detects the user speaking it sends a speech_started event and stops generating; the browser's audio element naturally drains. You don't manually flush a playback buffer. This is the part that takes the most code in the other two options.

Function calling arrives as a response.function_call_arguments.done event on the data channel. You run the function and send the result back:

// lib/handle-tool-call.ts
import type { ReservationArgs } from './types'
import { checkAvailability } from './reservations'

export async function handleFunctionCall(
    channel: RTCDataChannel,
    callId: string,
    name: string,
    rawArgs: string
): Promise<void> {
    if (name !== 'check_availability') return

    let result: { available: boolean; slots: string[] }
    try {
        const args: ReservationArgs = JSON.parse(rawArgs)
        result = await checkAvailability(args.date, args.partySize)
    } catch (err) {
        result = { available: false, slots: [] }
        console.error('Tool call failed:', err)
    }

    // Return the output, then ask the model to speak it
    channel.send(
        JSON.stringify({
            type: 'conversation.item.create',
            item: {
                type: 'function_call_output',
                call_id: callId,
                output: JSON.stringify(result),
            },
        })
    )
    channel.send(JSON.stringify({ type: 'response.create' }))
}
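
The data channel handler shown earlier only reacts to speech_started and error, so something still has to route function-call events to handleFunctionCall. One way to do that is a small parser like the sketch below. The name parseFunctionCallEvent is hypothetical, and it assumes the event carries call_id, name, and arguments as string fields:

```typescript
interface FunctionCallRequest {
    callId: string
    name: string
    rawArgs: string
}

// Returns the pieces handleFunctionCall needs, or null for any other event.
// Assumes `call_id`, `name`, and `arguments` arrive as strings on the event.
function parseFunctionCallEvent(data: string): FunctionCallRequest | null {
    let event: unknown
    try {
        event = JSON.parse(data)
    } catch {
        return null // not JSON, ignore it
    }
    if (typeof event !== 'object' || event === null) return null

    const e = event as Record<string, unknown>
    if (e.type !== 'response.function_call_arguments.done') return null
    if (typeof e.call_id !== 'string' || typeof e.name !== 'string') return null

    return {
        callId: e.call_id,
        name: e.name,
        // Fall back to an empty JSON object if no arguments were sent
        rawArgs: typeof e.arguments === 'string' ? e.arguments : '{}',
    }
}
```

Inside channel.onmessage you would call parseFunctionCallEvent(e.data) and, when it returns a value, pass callId, name, and rawArgs to handleFunctionCall along with the channel.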

Verdict on OpenAI Realtime: the lowest-friction path to a good browser voice agent. WebRTC does the hard real-time work. Downside: it's the most expensive of the three (more on cost below), and you're locked to OpenAI's voices and models.

Option 2: Gemini Live (WebSockets)

Gemini Live uses a bidirectional WebSocket instead of WebRTC. That's a double-edged sword. The upside: simpler mental model, no SDP handshake, easy to run server-side. The downside: you're now responsible for everything WebRTC gave you for free - audio chunking, playback buffering, and crucially, interruption.

Use the official SDK rather than hand-rolling the socket frames:

pnpm add @google/genai

A server-side session looks like this. Note Gemini Live wants raw 16-bit PCM at 16kHz for input and returns 24kHz PCM:

// lib/gemini-live.ts
import { GoogleGenAI, Modality } from '@google/genai'
import type { LiveServerMessage, Session } from '@google/genai'

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

export interface LiveCallbacks {
    onAudio: (pcm: Buffer) => void
    onInterrupted: () => void
    onToolCall: (name: string, args: Record<string, unknown>, id: string) => void
    onError: (message: string) => void
}

export async function openLiveSession(
    tools: object[],
    cb: LiveCallbacks
): Promise<Session> {
    const session = await ai.live.connect({
        model: 'gemini-2.5-flash-native-audio-preview',
        config: {
            responseModalities: [Modality.AUDIO],
            tools: [{ functionDeclarations: tools }],
        },
        callbacks: {
            onmessage: (msg: LiveServerMessage) => {
                // Barge-in: Gemini tells you the turn was interrupted.
                // You MUST flush your client playback buffer here.
                if (msg.serverContent?.interrupted) {
                    cb.onInterrupted()
                    return
                }

                const parts = msg.serverContent?.modelTurn?.parts ?? []
                for (const part of parts) {
                    const audio = part.inlineData?.data
                    if (audio) cb.onAudio(Buffer.from(audio, 'base64'))
                }

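                for (const fc of msg.toolCall?.functionCalls ?? []) {
                    // name and id are optional in the SDK types; skip malformed calls
                    if (!fc.name || !fc.id) continue
                    cb.onToolCall(fc.name, fc.args ?? {}, fc.id)
                }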
            },
            onerror: (e: ErrorEvent) => cb.onError(e.message),
        },
    })

    return session
}

Streaming mic audio in is straightforward - you send base64 PCM chunks:

// lib/gemini-send-audio.ts
import type { Session } from '@google/genai'

export function sendAudioChunk(session: Session, pcm16: Buffer): void {
    session.sendRealtimeInput({
        audio: {
            data: pcm16.toString('base64'),
            mimeType: 'audio/pcm;rate=16000',
        },
    })
}
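
Browser capture APIs hand you Float32 samples in the range -1 to 1, not 16-bit PCM, so a conversion step sits in front of sendAudioChunk. A minimal sketch follows. The name floatToPcm16 is hypothetical, and it assumes the capture context already runs at 16kHz (for example an AudioContext created with sampleRate: 16000), since resampling is not handled here:

```typescript
// Convert Float32 samples in [-1, 1] to signed 16-bit PCM.
// Out-of-range samples are clamped rather than allowed to wrap around.
function floatToPcm16(samples: Float32Array): Int16Array {
    const out = new Int16Array(samples.length)
    for (let i = 0; i < samples.length; i++) {
        const s = Math.max(-1, Math.min(1, samples[i]))
        // Negative and positive halves have different full-scale values
        out[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767)
    }
    return out
}
```

On the server, Buffer.from(pcm.buffer, pcm.byteOffset, pcm.byteLength) turns the result into the Buffer that sendAudioChunk expects. That relies on the machine being little-endian, which is true of virtually every platform you will deploy to.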

Now the painful part. Interruption is manual. When interrupted fires, the model has stopped generating - but you may already have hundreds of milliseconds of audio queued in the browser's AudioContext. If you don't kill it, the user hears the agent finish a sentence it already "abandoned." Here's a minimal playback queue that supports a hard flush:

// lib/pcm-player.ts
export class PCMPlayer {
    private ctx: AudioContext
    private queue: AudioBufferSourceNode[] = []
    private nextStartTime = 0

    constructor(private sampleRate = 24000) {
        this.ctx = new AudioContext({ sampleRate })
    }

    enqueue(pcm16: Int16Array): void {
        const buffer = this.ctx.createBuffer(1, pcm16.length, this.sampleRate)
        const channel = buffer.getChannelData(0)
        for (let i = 0; i < pcm16.length; i++) {
            channel[i] = pcm16[i] / 32768 // int16 -> float32
        }

        const source = this.ctx.createBufferSource()
        source.buffer = buffer
        source.connect(this.ctx.destination)

        const startAt = Math.max(this.ctx.currentTime, this.nextStartTime)
        source.start(startAt)
        this.nextStartTime = startAt + buffer.duration

        source.onended = () => {
            this.queue = this.queue.filter((s) => s !== source)
        }
        this.queue.push(source)
    }

    // Called on `interrupted` - kill everything in flight
    flush(): void {
        for (const source of this.queue) {
            try {
                source.stop()
            } catch {
                // already stopped
            }
        }
        this.queue = []
        this.nextStartTime = this.ctx.currentTime
    }
}

That flush() method is the entire ballgame for Gemini Live interruption quality. Get the buffering wrong and barge-in feels laggy no matter how fast the model is.
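
One gap between the session code and the player: onAudio delivers raw bytes, while enqueue expects an Int16Array. The sketch below bridges the two and also shows how much playback time a chunk represents, which is what makes the "hundreds of milliseconds queued" problem concrete. Both helper names are hypothetical, and the decoder assumes little-endian 16-bit mono PCM:

```typescript
// Decode little-endian 16-bit PCM bytes into samples.
// Uses a DataView so it works even when the bytes are not 2-byte aligned.
function pcmBytesToInt16(bytes: Uint8Array): Int16Array {
    const sampleCount = Math.floor(bytes.byteLength / 2) // drop a trailing odd byte
    const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2)
    const out = new Int16Array(sampleCount)
    for (let i = 0; i < sampleCount; i++) {
        out[i] = view.getInt16(i * 2, true) // true = little-endian
    }
    return out
}

// Playback duration of a mono chunk in milliseconds.
function chunkDurationMs(sampleCount: number, sampleRate = 24000): number {
    return (sampleCount * 1000) / sampleRate
}
```

A Node Buffer is a Uint8Array, so the value passed to onAudio can go straight into pcmBytesToInt16. Summing chunkDurationMs over the chunks you have enqueued but not yet played tells you how much stale audio a flush() will discard.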

Tool calls come back on the same socket; you respond with sendToolResponse:

// lib/gemini-tool-response.ts
import type { Session } from '@google/genai'
import { checkAvailability } from './reservations'

export async function respondToToolCall(
    session: Session,
    name: string,
    args: Record<string, unknown>,
    id: string
): Promise<void> {
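    if (name !== 'check_availability') return

    // Always answer the call, even on failure, so the model is not left waiting
    let result: { available: boolean; slots: string[] }
    try {
        result = await checkAvailability(String(args.date), Number(args.partySize))
    } catch (err) {
        result = { available: false, slots: [] }
        console.error('Tool call failed:', err)
    }

    session.sendToolResponse({
        functionResponses: [{ id, name, response: { result } }],
    })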
}

Verdict on Gemini Live: cheapest per minute and genuinely fast, but you own the audio plumbing. If you've never written an AudioContext playback queue, budget real time for it. The WebSocket transport is also less robust on flaky networks than WebRTC - no built-in jitter buffer.

Option 3: Pipecat-JS (the framework)

Pipecat is an open-source framework (originally Python, now with a TypeScript/JS client and JS server pipeline) that abstracts the transport and gives you a pipeline of frame processors: transport in → VAD → LLM → TTS → transport out. The pitch is provider portability - swap OpenAI Realtime for Gemini Live by changing one service, and barge-in, buffering, and turn detection are handled by the framework.

pnpm add @pipecat-ai/client-js @pipecat-ai/small-webrtc-transport

The client side is dramatically less code because the framework owns the audio loop and interruption:

// lib/pipecat-client.ts
import { PipecatClient } from '@pipecat-ai/client-js'
import { SmallWebRTCTransport } from '@pipecat-ai/small-webrtc-transport'

export interface PipecatHandlers {
    onBotStartedSpeaking: () => void
    onBotStoppedSpeaking: () => void
    onUserStartedSpeaking: () => void
    onError: (message: string) => void
}

export function createVoiceClient(handlers: PipecatHandlers): PipecatClient {
    const client = new PipecatClient({
        transport: new SmallWebRTCTransport(),
        enableMic: true,
        enableCam: false,
        callbacks: {
            onBotStartedSpeaking: handlers.onBotStartedSpeaking,
            onBotStoppedSpeaking: handlers.onBotStoppedSpeaking,
            onUserStartedSpeaking: handlers.onUserStartedSpeaking,
            onError: (msg) => handlers.onError(String(msg)),
        },
    })

    return client
}

export async function startSession(client: PipecatClient): Promise<void> {
    // Your server exposes a /connect endpoint that returns transport params
    await client.connect({ endpoint: '/api/pipecat/connect' })
}

Notice there's no playback queue, no SDP handshake, no manual flush. The framework's pipeline runs server-side VAD and emits onUserStartedSpeaking, and the transport handles the interruption automatically. You traded control for convenience - which is exactly the right trade if you plan to A/B test providers or run the same agent across web, phone (via a SIP transport), and native.
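
With the audio loop owned by the framework, what is left for your code is mostly UI state. Here is a sketch of a tiny tracker that turns the three speaking callbacks into a single "who is talking" value. The names Speaker and createSpeakerTracker are hypothetical, and the rule that a user turn wins over a bot turn is my assumption about what a barge-in UI should show:

```typescript
type Speaker = 'idle' | 'user' | 'bot'

// Collapses the speaking callbacks into one state value and reports changes.
function createSpeakerTracker(onChange: (speaker: Speaker) => void) {
    let current: Speaker = 'idle'

    const set = (next: Speaker): void => {
        if (next === current) return
        current = next
        onChange(next)
    }

    return {
        handlers: {
            onBotStartedSpeaking: () => set('bot'),
            // Only fall back to idle if the bot was the one talking;
            // during a barge-in the state has already moved to 'user'.
            onBotStoppedSpeaking: () => {
                if (current === 'bot') set('idle')
            },
            onUserStartedSpeaking: () => set('user'),
        },
        current: (): Speaker => current,
    }
}
```

The handlers object has the same three callback names as PipecatHandlers, so it can be spread into the argument of createVoiceClient next to your own onError.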

The cost: it's another dependency to keep current, the TypeScript server pipeline is younger than the Python one, and when something breaks deep in the audio path you're now debugging the framework's internals instead of your own 80 lines. For a single-provider production app, that abstraction can be overhead you don't need.

The numbers: latency and cost

I ran each agent through 20 reservation conversations on a wired connection from Tel Aviv. Latency is median time from end-of-user-speech to first audio byte. Cost is per minute of active conversation (roughly 50/50 input/output audio), using list prices as of June 2026 - check current pricing before you commit, these move constantly.

| Option | Transport | Median latency | Interruption | Cost / min (approx) | Code to barge-in |
| --- | --- | --- | --- | --- | --- |
| OpenAI Realtime | WebRTC | ~410ms | Excellent (free) | $$$ | ~0 lines |
| Gemini Live | WebSocket | ~380ms | Good (manual flush) | $ | ~40 lines |
| Pipecat-JS | Pluggable | ~440ms* | Excellent (framework) | varies by backend | ~0 lines |

* Pipecat adds a small overhead from the extra pipeline hop, and the number depends entirely on which LLM/TTS services you plug in.
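
If you want to reproduce this kind of measurement, the bookkeeping can be as simple as the sketch below. It is one plausible way to record end-of-user-speech to first-audio latency, not a description of a specific harness. The names median and TurnLatencyTracker are hypothetical, and you supply the timestamps (for example from performance.now()) when your transport reports the end of a user turn and each incoming audio chunk:

```typescript
function median(values: number[]): number {
    if (values.length === 0) throw new Error('median of empty list')
    const sorted = [...values].sort((a, b) => a - b)
    const mid = Math.floor(sorted.length / 2)
    return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

class TurnLatencyTracker {
    private userStoppedAt: number | null = null
    private samples: number[] = []

    // Call when the user's turn ends.
    markUserStopped(nowMs: number): void {
        this.userStoppedAt = nowMs
    }

    // Call on every audio chunk; only the first one after a user turn is recorded.
    markAudio(nowMs: number): void {
        if (this.userStoppedAt === null) return
        this.samples.push(nowMs - this.userStoppedAt)
        this.userStoppedAt = null
    }

    medianMs(): number {
        return median(this.samples)
    }
}
```

Median is a better summary than mean here because a single slow turn, such as one that waits on a tool call, would otherwise dominate a 20-conversation sample.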

A few things that surprised me:

  • Gemini Live was the latency winner in raw numbers, but only after I got the playback buffer right. Before that, perceived latency on interruptions was the worst of the three.
  • OpenAI's WebRTC interruption is so good it's almost unfair. You write zero interruption code and it just works, because the media stack does it.
  • Cost gaps are large. Audio tokens are far more expensive than text. At scale, the OpenAI-vs-Gemini per-minute gap is the kind of thing that decides whether your feature has positive margin. Run the math against your real conversation lengths - and pair this with the cost modeling in "The Real Cost of Running an AI Feature in Production".
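
To make "run the math" concrete, here is a sketch of the per-minute arithmetic when input and output audio are billed separately. Every number you feed it is an assumption you have to look up: the price per million tokens and the audio tokens consumed per second both come from each provider's current pricing page, and the AudioPricing shape and function name are hypothetical. It also ignores text tokens, cached-input discounts, and anything a provider may charge for accumulated conversation history, so treat the result as a floor:

```typescript
interface AudioPricing {
    inputUsdPerMTok: number // price per 1M audio input tokens
    outputUsdPerMTok: number // price per 1M audio output tokens
    inputTokensPerSec: number // audio tokens billed per second of user speech
    outputTokensPerSec: number // audio tokens billed per second of model speech
}

// Cost of one minute of conversation, split between user and model speech.
function costPerMinuteUsd(p: AudioPricing, inputShare = 0.5): number {
    const inputSec = 60 * inputShare
    const outputSec = 60 * (1 - inputShare)
    const inputCost = (inputSec * p.inputTokensPerSec * p.inputUsdPerMTok) / 1_000_000
    const outputCost = (outputSec * p.outputTokensPerSec * p.outputUsdPerMTok) / 1_000_000
    return inputCost + outputCost
}

// Placeholder numbers only, NOT real prices for any provider.
const placeholder: AudioPricing = {
    inputUsdPerMTok: 10,
    outputUsdPerMTok: 20,
    inputTokensPerSec: 10,
    outputTokensPerSec: 20,
}
console.log(costPerMinuteUsd(placeholder).toFixed(4))
```

Multiply the result by your average conversation length and monthly call volume to see whether the per-minute gap matters for your margin.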

So which one?

No single winner - it depends on what you're optimizing:

  • Fastest to a great browser demo, single provider, money is not the constraint: OpenAI Realtime. The free interruption alone saves you a week.
  • Cost-sensitive, you have the engineering chops to own the audio path, or you need server-side control: Gemini Live. Cheapest and fastest, but you write the plumbing.
  • You need provider portability, multi-channel (web + phone), or you're going to A/B test models: Pipecat-JS. Pay a small latency and complexity tax for the abstraction.

My actual recommendation for most teams shipping their first production voice feature: start on OpenAI Realtime to get the UX right with the least code, instrument your latency and cost, then port to Gemini Live or Pipecat only if the cost math forces it. Don't reach for the framework until you have a second provider you actually want to support - abstractions you don't yet need are just bugs you haven't met.

What's next

This post assumed you already know whether your users even want voice and what "good enough" latency means for your use case. Those targets belong in the spec before anyone writes a WebRTC offer - see "How to Write an AI Feature Spec That Engineers Won't Push Back On" for setting latency and cost budgets up front. And if you're not ready for a provider-grade real-time API yet, the browser-native starting point is "Build a Voice-Enabled AI Assistant in the Browser with TypeScript", which uses the Web Speech API with no backend at all.

Vadim Alakhverdov

Software developer writing about JavaScript, web development, and developer tools.
