AI & Machine Learning · Prototype · 2025

Real-Time Voice AI

Engineering sub-second voice agents over telephony: VAD-gated streaming ASR over the non-streaming Whisper model, semantic endpointing, barge-in, and per-clause TTS overlap.

Year: 2025
Status: Prototype
Category: AI & Machine Learning
Role: Architect & Lead

Key metrics

Turn budget: ≤800 ms
Naive baseline: 7 s
Barge-in: Full-duplex

Architecture

Full-duplex voice-agent control loop over telephony: a VAD gate feeds a streaming ASR stage (LocalAgreement-2 partials over Whisper, or managed Deepgram), semantic endpointing decides end-of-turn, the LLM reply is synthesized per-clause and overlapped with generation, and barge-in interrupts playback the moment the caller speaks.

Case study

Real-Time Voice AI Over Telephony: Engineering for Sub-Second Conversation

A case study in solving a genuinely hard, latency-critical problem: making a machine talk to a human over the phone and have it feel like a conversation, not a help desk on a satellite delay.

A spoken conversation is not a request/response API. It is a turn-taking, full-duplex, real-time control loop running over a lossy, jittery network you do not control. The machine has to receive a continuous audio stream, decide the exact moment the human has finished a thought, transcribe speech as it arrives, generate a reply token-by-token, synthesize that reply into audio before the sentence even exists, and play it back — all while staying alert for the human cutting in. Get any one of those wrong and the illusion collapses: the agent talks over people, or it sits in dead air long enough that callers say "hello? are you there?".

The constraint that shapes everything is human perception. People judge a reply as conversational below ~500 ms, acceptable up to ~800 ms, noticeably delayed past ~800 ms, and outright broken past ~1500 ms. That is the budget for the entire round trip — end of the human's speech to the first sound out of the agent's mouth. A naive pipeline blows it in the first stage alone.

This is the body of work I draw on to build voice agents that actually hit that bar.

The hard problem: why naive pipelines feel broken

The intuitive way to build this is a sequence: wait for the caller to finish → transcribe the whole utterance → send it to an LLM → wait for the full answer → synthesize the whole answer → play it. It is correct, it is easy to reason about, and it is unusably slow. A widely cited 2026 optimization report describes a stack built exactly this way that started at 7 seconds voice-to-voice.

Three forces conspire against the naive approach:

  1. Telephony is hostile. Carrier audio is narrowband 8 kHz G.711 — the classic "telephone" sound, with no compression artifacts but zero resilience to loss. The media arrives as RTP packets (20 ms frames) over a network that reorders, delays, and drops them — or, on the Twilio Media Streams leg I use as the worked example through this study, as that same µ-law audio re-wrapped into base64 JSON frames on a WebSocket. Models trained on 16 kHz wideband audio pay a real word-error-rate tax on this band-limited signal either way.
  2. The best models are not streaming models. Whisper, the strongest open ASR model, is fundamentally an offline sequence-to-sequence model built for 30-second chunks containing a complete sentence. It has no native low-latency path. The same "wait for the whole thing" assumption is baked into batch TTS.
  3. Stages are slow individually and catastrophic in series. Endpointing, ASR finalization, LLM time-to-first-token, and TTS time-to-first-byte each cost hundreds of milliseconds. Add them up and you are at multiple seconds before the budget even considers network transit.

The fix is one idea applied relentlessly: overlap the stages, do not sequence them. That single principle is what took the reference stack from 7 seconds to ~500 ms — almost entirely by pipelining and by fixing endpointing, not by buying faster models.

The latency budget is the spine of the discipline

You cannot optimize what you cannot decompose. Here is a realistic voice-to-voice ("mouth-to-mouth") budget for a well-tuned cascade agent, reconciled from two published budgets that both land near ~800 ms worst-case and ~400–500 ms when the stages overlap aggressively. Figures are flagged where they vary materially by hardware or measurement method.

| Stage | What's happening | Typical (approx) | Notes |
| --- | --- | --- | --- |
| Network / RTP ingress | packet transit + jitter buffer hold | 20–60 ms | jitter buffer (NetEQ) trades latency for smoothness |
| VAD + turn endpoint | wait trailing-silence window before "user done" | 50–800 ms | the single biggest controllable term; defaults are 500–800 ms |
| Streaming ASR finalization | partial → final transcript commit | 80–200 ms | varies by engine and tuning |
| LLM TTFT (first token) | prefill + first decode | 150–400 ms | the largest model-side term; hard to compress without hurting answers |
| TTS TTFB / TTFA (first audio) | first audio chunk out | 40–300 ms | vendor best-case ~40–75 ms; third-party P50 ~188–313 ms |
| Network / RTP egress | encode + transit back | 20–60 ms | symmetric with ingress |
| Voice-to-voice total | end of human speech → first agent audio | ~500–800 ms | <500 natural, >800 delayed, >1500 broken |

Four insights fall out of this table, and they drive every architectural decision below:

  • Endpointing dominates the controllable latency. A misconfigured VAD is the single easiest way to add 500 ms without touching a single model, because most implementations default to a 500–800 ms trailing-silence window. Reclaiming that is the highest-leverage change available.
  • The LLM gets the biggest model-side allocation because it is the hardest to compress without degrading answer quality.
  • TTS TTFB is the dominant perceived term once endpointing is fixed — it is literally the last thing standing between "the human stopped talking" and "the human hears something."
  • Overlap collapses the wall-clock total. Because STT partials feed the LLM mid-utterance and TTS starts on the first clause, the perceived gap is roughly turn-detect + LLM-TTFT + TTS-TTFB, not the sum of all rows. That is the entire game.
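As a rough worked example of that last point, the sketch below sums the typical ranges from the table and compares them with the overlapped path. The dictionary keys and the helper name are illustrative, and treating the overlapped gap as exactly three terms follows the approximation above rather than any measurement.

```python
# Typical (low, high) ranges in milliseconds, copied from the budget table above.
BUDGET_MS = {
    "network_ingress": (20, 60),
    "turn_endpoint": (50, 800),
    "asr_finalization": (80, 200),
    "llm_ttft": (150, 400),
    "tts_ttfb": (40, 300),
    "network_egress": (20, 60),
}

# Stages assumed to stay on the critical path once STT partials feed the LLM
# and TTS starts on the first clause.
OVERLAPPED_PATH = ("turn_endpoint", "llm_ttft", "tts_ttfb")


def total_ms(stages):
    """Sum the low and high ends of the given stages."""
    low = sum(BUDGET_MS[name][0] for name in stages)
    high = sum(BUDGET_MS[name][1] for name in stages)
    return low, high


sequential = total_ms(BUDGET_MS)        # every row, one after another
overlapped = total_ms(OVERLAPPED_PATH)  # only the terms the caller still waits on

print("sequential:", sequential)  # (360, 1820)
print("overlapped:", overlapped)  # (240, 1500)
```

The high end of the overlapped path is still dominated by the 800 ms endpoint wait, which is why endpointing is the first term to attack.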

The architecture: three tiers, each scaled on its own axis

The system splits into three layers with sharply different operational characteristics. Conflating them is the most common way these systems fail to scale.

graph TD
    PSTN["PSTN / Carrier<br/>SIP trunk · RTP · G.711 8kHz"]
    subgraph EDGE["TELEPHONY EDGE: real-time media plumbing"]
        SBC["SBC<br/>Kamailio / OpenSIPS"]
        MEDIA["Media edge<br/>Twilio Media Streams · FreeSWITCH / Asterisk / LiveKit SIP<br/>transcode G.711↔PCM · NetEQ jitter buffer · PLC"]
    end
    subgraph ORCH["AGENT ORCHESTRATOR: per-call state machine"]
        VAD["Silero VAD"]
        TURN["Semantic turn detector<br/>transformer"]
        ASR["Streaming ASR<br/>Deepgram / faster-whisper"]
        LLM["Streaming LLM<br/>vLLM · KV + prompt cache"]
        TTS["Streaming TTS<br/>Cartesia / ElevenLabs"]
    end
    subgraph GPU["MODEL SERVICES: stateless, batched, independently scaled"]
        GASR["ASR inference"]
        GLLM["LLM inference<br/>continuous batching"]
        GTTS["TTS inference"]
    end
    PSTN <--> SBC
    SBC <--> MEDIA
    MEDIA -->|20ms PCM frames| VAD
    VAD --> TURN
    TURN --> ASR
    ASR -->|partials| LLM
    LLM -->|clause chunks| TTS
    TTS -->|audio out| MEDIA
    ASR -.-> GASR
    LLM -.-> GLLM
    TTS -.-> GTTS
    VAD -.->|"barge-in: stays HOT during playback"| TTS

Tier 1 — Telephony edge. This owns SIP signaling, RTP de/encode, codec transcoding (G.711 ↔ PCM, optionally Opus on a WebRTC leg), the jitter buffer, and packet-loss concealment. It is not AI — it is hard real-time media plumbing that telephony incumbents (FreeSWITCH, Asterisk, Kamailio/OpenSIPS) have hardened over two decades. LiveKit SIP, Jambonz, and Twilio Media Streams also play this role.

The worked example in this study is Twilio. A Programmable Voice number answers the call; a TwiML <Connect><Stream> verb forks the live audio to your orchestrator as a bidirectional WebSocket carrying base64-encoded 8 kHz µ-law frames (≈20 ms each, media events in, media messages back out to speak). It is the fastest path from "phone number" to "audio frames in my pipeline" — no SBC, no SIP trunk config — at the price of accepting Twilio's G.711 media plane end to end and its WebSocket as your jitter boundary. Where a stage below has a Twilio-specific wrinkle, I call it out inline. When I control more of the stack, I prefer terminating Opus where I can and transcoding to G.711 only at the carrier edge, for the resilience reasons in the jitter section below.
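To make the frame format concrete, here is a minimal sketch of turning one inbound Media Streams message into linear PCM samples, for stages that want PCM rather than raw µ-law (a VAD, for instance). The function names and the sample `streamSid` value are hypothetical. The decoder is the standard G.711 µ-law expansion written in pure Python, and the message shape is limited to the `event` and `media.payload` fields described above.

```python
import base64
import json
from typing import List, Optional


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_twilio_media(message: str) -> Optional[List[int]]:
    """Return PCM samples for a 'media' event, or None for any other event."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None
    payload = base64.b64decode(event["media"]["payload"])
    return [ulaw_byte_to_pcm16(b) for b in payload]


# One 20 ms frame at 8 kHz is 160 mu-law bytes; 0xFF is digital silence.
silence = base64.b64encode(bytes([0xFF]) * 160).decode("ascii")
message = json.dumps(
    {"event": "media", "streamSid": "MZ-example", "media": {"payload": silence}}
)
samples = decode_twilio_media(message)
```

A production path would do this with a vectorized or native codec off the event loop; the pure-Python loop is only there to show what the bytes mean.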

Tier 2 — Agent orchestrator. One per-call session running the AI state machine: VAD → turn detection → STT → LLM → TTS, plus barge-in handling. LiveKit Agents and Pipecat are the modern frameworks here; both make every stage a swappable plugin (Silero VAD, a transformer turn detector, Deepgram or faster-whisper for ASR, Cartesia/ElevenLabs for TTS) so you can replace any component without rewriting the pipeline.

Tier 3 — Model services. ASR, LLM, and TTS are stateless inference endpoints — hosted APIs or self-run on GPUs — scaled independently of call count. This separation is what lets you put media servers near callers (to cut RTT) while keeping GPUs wherever the capacity is, paying the inter-region hop only on the control plane.

A note on speech-to-speech models: end-to-end S2S models (e.g. realtime multimodal APIs) collapse STT+LLM+TTS into one network and preserve prosody beautifully. But the modular cascade remains the production default because it gives you per-stage swapping, observability, and cost control. This case study is about making the cascade fast.


Choosing the stack: LiveKit Agents vs Pipecat

Tier 2 — the orchestrator — is where most of the leverage in the previous section actually gets implemented, and in 2025/2026 two frameworks own that layer. I have built on both, and they are not interchangeable: they make opposite bets about who owns the media plane and how much of the pipeline you assemble yourself. Picking the wrong one means fighting the framework for the lifetime of the project, so this is a decision worth making deliberately.

LiveKit Agents — WebRTC-native, batteries included

What it is. A real-time agent framework built directly on LiveKit's WebRTC media plane, with first-class SIP/telephony (phone numbers without a Twilio bridge), a worker/job model (a long-lived worker process spawns one job — typically one OS process — per call), and a built-in semantic turn detector. You declare an AgentSession of plugins (STT/LLM/TTS/VAD) and the framework runs the VAD → turn detection → STT → LLM → TTS loop, barge-in included.

from livekit import agents
from livekit.agents import AgentServer, AgentSession, Agent, inference, room_io, TurnHandlingOptions, InterruptionOptions
from livekit.plugins import noise_cancellation, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

server = AgentServer()

@server.rtc_session(agent_name="phone-agent")
async def entrypoint(ctx: agents.JobContext):
    # One job == one call. The session is the per-call state machine.
    session = AgentSession(
        vad=silero.VAD.load(),                          # frame-level speech/silence
        stt=inference.STT(model="deepgram/nova-3", language="multi"),
        llm=inference.LLM(model="openai/gpt-5.2-chat-latest"),
        tts=inference.TTS(model="cartesia/sonic-3", voice="<voice-id>"),
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),         # semantic EOT, not a silence timer
            interruption=InterruptionOptions(mode="adaptive", min_duration=0.5),  # barge-in / backchannel guard
        ),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a concise, friendly phone assistant."),
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=noise_cancellation.BVC(),
            ),
        ),
    )
    await session.generate_reply(instructions="Greet the caller.")

if __name__ == "__main__":
    agents.cli.run_app(server)   # boots the worker; it accepts jobs and spawns one per call

The shape to notice: the file is a worker definition, not a script. cli.run_app starts a long-running worker that registers with LiveKit and is handed a JobContext per call; everything inside entrypoint is per-call. TurnHandlingOptions is exactly the "semantic turn detection beats a silence timer" idea from the budget section, shipped as a config knob — MultilingualModel() is the transformer end-of-turn detector, and the interruption block is the debounced barge-in handling from problem #4. Swapping Deepgram for self-hosted faster-whisper, or Cartesia for ElevenLabs, is a one-line change.

Gotchas:

  • Process-per-job memory math. The default worker spawns a subprocess per call for isolation (good — it's problem #5's mitigation for free), but each process carries its own loaded models and Python heap. Budget RAM as base + per_call × num_jobs, cap num_idle_processes/job concurrency per worker, and load shared heavy models (the VAD) in prewarm so they're not re-paid on every job.
  • Turn-detector model: download + latency tax. MultilingualModel() pulls a transformer model on first use and runs an inference per candidate end-of-turn. Pre-download it in your image build (python my_agent.py download-files) so cold starts don't surface in p99, and measure its added latency against the endpoint-wait budget — it buys accuracy but is not free.
  • Worker draining on deploy. A naive redeploy kills the worker and drops every live call. Wire graceful drain (the framework stops accepting new jobs on SIGTERM and lets in-flight calls finish) and set the termination grace period longer than your longest expected call, or set a max-call-duration backstop.
  • SIP trunk + DTMF specifics. Native SIP means you own trunk config (codecs, auth, NAT) and DTMF handling. Keypad digits arrive as RFC 2833 telephone-events (the RFC has since been superseded by RFC 4733), not as transcribable audio, so menu/IVR logic must subscribe to DTMF events explicitly rather than expecting the STT to "hear" the tones.
  • You're tied to LiveKit's media plane. The convenience is also the lock-in: scaling media means LiveKit Cloud or self-hosting the LiveKit server/SFU. That's a real ops surface, separate from your agent workers.
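The memory math in the first gotcha can be sketched as a small capacity calculation. The function name, the headroom fraction, and every number in the usage lines are hypothetical placeholders; measure base and per-call footprints on your own worker image before trusting any result.

```python
def max_concurrent_jobs(worker_ram_mb: float, base_mb: float,
                        per_call_mb: float, headroom: float = 0.2) -> int:
    """Invert RAM = base + per_call * num_jobs, keeping a safety margin.

    headroom is the fraction of RAM left unallocated for spikes (an assumption).
    """
    usable_mb = worker_ram_mb * (1.0 - headroom) - base_mb
    return max(0, int(usable_mb // per_call_mb))


# Hypothetical worker: 8 GiB of RAM, 1.2 GB resident after prewarm, 350 MB per call.
jobs = max_concurrent_jobs(worker_ram_mb=8192, base_mb=1200, per_call_mb=350)
print(jobs)  # 15
```

The result is a ceiling for the per-worker concurrency cap, not a target; autoscaling should trigger well before a worker reaches it.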

Pipecat — a frame pipeline you assemble yourself

What it is. A transport-agnostic pipeline of frame processors: audio, transcripts, LLM tokens, and control signals all flow as typed frames through an ordered list of processors you compose by hand (transport.input() → STT → context → LLM → TTS → transport.output()). It is deliberately lower-level than LiveKit Agents and ships a large catalog of service integrations, with the transport (Daily, Twilio, raw WebSocket/WebRTC, telephony) swapped independently of the pipeline.

import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.worker import WorkerParams, PipelineWorker
from pipecat.workers.runner import WorkerRunner
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair, LLMUserAggregatorParams,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transports.base_transport import BaseTransport, TransportParams

# Transport is decoupled: same pipeline runs over WebRTC, Twilio, or a raw socket.
transport_params = {
    "webrtc": lambda: TransportParams(audio_in_enabled=True, audio_out_enabled=True),
}

async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
    stt = DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])
    llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"])
    tts = CartesiaTTSService(api_key=os.environ["CARTESIA_API_KEY"], voice_id="<voice-id>")

    context = LLMContext()
    aggregators = LLMContextAggregatorPair(
        context,
        # VAD lives on the user aggregator — it gates STT and drives interruptions.
        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
    )

    pipeline = Pipeline([
        transport.input(),       # frames in from the wire
        stt,                     # audio frames -> transcript frames
        aggregators.user(),      # VAD + user-turn aggregation
        llm,                     # transcript -> streamed token frames
        tts,                     # tokens -> audio frames (per clause)
        transport.output(),      # frames out to the wire
        aggregators.assistant(), # record what the agent actually said
    ])

    worker = PipelineWorker(pipeline, params=WorkerParams(enable_metrics=True))
    runner = WorkerRunner(handle_sigint=runner_args.handle_sigint)
    await runner.run(worker)

async def bot(runner_args: RunnerArguments):
    transport = await create_transport(runner_args, transport_params)
    await run_bot(transport, runner_args)

The shape to notice: the pipeline is the architecture, written out as an ordered list. That explicitness is the whole point: you can drop a custom FrameProcessor anywhere (redaction before STT, a guardrail between LLM and TTS, metrics taps) because every stage speaks the same frame protocol. Interruptions ride the same channel: when VAD detects barge-in, Pipecat emits a StartInterruptionFrame that propagates downstream so STT/LLM/TTS all flush in order. (Note the 2026 rename: PipelineTask → PipelineWorker, PipelineRunner → WorkerRunner, and VAD moved onto the context aggregator; older tutorials show the previous names.)

Gotchas:

  • GIL + async backpressure (problem #5, but you own it). The pipeline is one asyncio loop; any CPU-bound processor (resampling, a synchronous model call) blocks every frame behind it. Keep inference off the loop (call services as network endpoints), and watch for queue buildup when a downstream stage (usually TTS) can't keep up — a slow stage silently grows latency rather than erroring.
  • Frame ordering around interruptions. StartInterruptionFrame must reach STT, LLM, and TTS, in order, or fragments of the abandoned reply leak out after the user has taken the floor. If you insert custom processors, make sure they pass control frames through promptly and don't buffer them behind queued audio — this is the #1 way hand-rolled pipelines mishandle barge-in.
  • VAD / endpointing tuning is on you. Pipecat gives you SileroVADAnalyzer and stop-secs/confidence parameters but no opinionated semantic turn detector wired in by default; out of the box you're closer to a silence timer, so tune the endpointing window per use case (or add a turn-detection processor) rather than shipping the defaults.
  • Transports are not uniform. Daily, Twilio Media Streams, and raw telephony differ in sample rate, codec (8 kHz µ-law on Twilio vs wideband on Daily), and interruption fidelity. Code written and tuned against Daily can regress badly on a Twilio phone leg — test endpointing and barge-in on the actual target transport, not the dev one.
  • Scaling is one-pipeline-per-call, and that's your problem to operate. Pipecat runs a pipeline per call but doesn't hand you a worker/job orchestration layer the way LiveKit does — you wire process isolation, autoscaling on an active-calls metric, and graceful drain yourself (Pipecat Cloud exists if you'd rather not).
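The frame-ordering gotcha is easiest to see in a framework-free sketch. The `Frame` and `BufferingStage` types below are hypothetical stand-ins, not Pipecat classes. They only illustrate the rule that a custom stage may batch data frames but must forward a control frame immediately and discard whatever stale audio it was holding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    kind: str            # "audio" or "interrupt" (illustrative frame kinds)
    payload: bytes = b""


class BufferingStage:
    """Batches audio frames, but never delays an interruption behind them."""

    def __init__(self, batch_size: int = 3) -> None:
        self.batch_size = batch_size
        self._pending: List[Frame] = []

    def process(self, frame: Frame) -> List[Frame]:
        if frame.kind == "interrupt":
            self._pending.clear()    # stale audio must not leak out afterwards
            return [frame]           # control frame passes through right away
        self._pending.append(frame)
        if len(self._pending) >= self.batch_size:
            batch, self._pending = self._pending, []
            return batch
        return []
```

If the stage instead appended the interrupt to `_pending` and waited for a full batch, downstream TTS would keep playing the abandoned reply until two more audio frames arrived.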

The ASR seam: Deepgram, the managed default

Both frameworks above name Deepgram as the first STT plugin, and that reflects production reality: it is the most common managed streaming-ASR choice for telephony agents, and on a phone leg it earns the slot. It accepts 8 kHz µ-law natively (encoding=mulaw), so Twilio's frames go in untouched — no transcoding step in front of the transcriber. Its interim results, server-side endpointing, and utterance-end events replace most of the hand-rolled VAD gating from the WhisperLive approach: speech_final is the endpoint signal, partials drive the semantic turn detector, and UtteranceEnd is the trailing-off-speech backstop — with finals typically landing sub-300 ms. The trade-off against self-hosted faster-whisper is the usual one: per-minute cost and a vendor in the hot path, in exchange for not operating GPU ASR yourself.

import asyncio
import base64

from deepgram import AsyncDeepgramClient
from deepgram.core.events import EventType
from deepgram.listen.v1.types import ListenV1Results, ListenV1UtteranceEnd

client = AsyncDeepgramClient()     # reads DEEPGRAM_API_KEY from the environment

async def transcribe_call(twilio_frames):
    async with client.listen.v1.connect(
        model="nova-3",
        encoding="mulaw",          # Twilio's G.711 µ-law goes in untouched
        sample_rate="8000",
        channels="1",
        interim_results="true",    # partials feed the LLM mid-utterance
        endpointing="300",         # ms of trailing silence before speech_final
        utterance_end_ms="1000",   # UtteranceEnd backstop for slow trailing-off speech
        smart_format="true",
    ) as connection:

        def on_message(message):
            if isinstance(message, ListenV1Results):
                text = message.channel.alternatives[0].transcript
                if message.speech_final and text:   # endpoint reached: the turn is over
                    finalize_turn(text)
                elif text:
                    on_partial(text)                # drives the semantic turn detector
            elif isinstance(message, ListenV1UtteranceEnd):
                finalize_turn(None)                 # backstop when speech_final never fires

        connection.on(EventType.MESSAGE, on_message)
        # Assumption: start_listening() runs the receive loop until the socket closes,
        # so it runs as a background task; awaiting it here would block before any
        # caller audio is forwarded.
        listener = asyncio.create_task(connection.start_listening())

        async for frame in twilio_frames:           # Twilio "media" events off the WebSocket
            if frame["event"] == "media":
                await connection.send_media(base64.b64decode(frame["media"]["payload"]))

The shape to notice: endpointing moved into the ASR vendor. endpointing=300 is the trailing-silence window from the budget table, but applied by a model that already knows where the words are, and interim_results means the LLM is reading partials long before the endpoint fires. (Note the 2026 SDK shape: deepgram-sdk v7 dropped the old LiveOptions / listen.websocket.v("1") API — connection parameters go straight to listen.v1.connect(), and events arrive as typed messages on EventType.MESSAGE. Older tutorials show the v3/v4 names.)

Twilio wrinkle: the media payload is already base64 µ-law, so the entire ingest path is decode-and-forward — resist the urge to resample to 16 kHz first; Nova-3 handles the narrowband natively and the resample just adds latency and CPU on the event loop. And Deepgram closes a socket that goes quiet: if you ever stop forwarding caller audio (you shouldn't — barge-in needs the VAD hot during playback), send connection.send_keep_alive() to hold the connection open.
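For readers not using the SDK, the same settings can be sketched at the wire level. This assumes Deepgram's streaming endpoint at `wss://api.deepgram.com/v1/listen` with options passed as query parameters and a JSON `KeepAlive` text message, which is my understanding of the raw WebSocket API; check both against the current API reference. The helper name is hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed raw streaming endpoint; the SDK call above hides this URL.
DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(**params) -> str:
    """Encode connection options as query parameters on the streaming URL."""
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


url = build_listen_url(
    model="nova-3",
    encoding="mulaw",        # 8 kHz mu-law straight from the Twilio leg
    sample_rate=8000,
    channels=1,
    interim_results="true",
    endpointing=300,
    utterance_end_ms=1000,
    smart_format="true",
)

# Text frame sent periodically while no audio is flowing (assumed message shape).
KEEP_ALIVE = json.dumps({"type": "KeepAlive"})
```

Audio itself goes over the same socket as binary frames, so the decode-and-forward path from the Twilio payload stays a single `base64.b64decode` call.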

Verdict: which to pick

Reach for LiveKit Agents when the dominant requirements are WebRTC/SIP at scale and time-to-production: you want native telephony with phone numbers, a managed media plane, a worker/job model with process isolation per call, and a turn detector and barge-in handling that work well out of the box. It is the "batteries-included, opinionated" choice, and it's the faster path to a robust phone agent.

Reach for Pipecat when you need fine-grained control of the pipeline — custom frame processors (redaction, guardrails, domain routing, multimodal branches), transport flexibility (you're not committing to one media plane), or unusual topologies (fan-out, agent handoff). You trade batteries-included convenience for a pipeline you fully own and can shape, and you accept owning more of the scaling and barge-in correctness yourself.

On the Twilio worked example specifically: Pipecat speaks Media Streams natively (a Twilio WebSocket transport drops into the same pipeline), while LiveKit Agents wants the call brought into its own media plane via LiveKit SIP or a trunk — fine if you're committing to LiveKit anyway, an extra hop if Twilio is already your edge.

In practice the deciding question is rarely latency — both can hit the budget from the previous section — it's how much of the media plane and pipeline you want to own. LiveKit owns more for you; Pipecat hands you more control. Everything in the next section (Whisper gating, semantic endpointing, TTS overlap, barge-in, the GIL, draining) has to be solved in either framework — the difference is whether the framework solves it for you or hands you the primitives to solve it yourself.

Hard problems and how I solved them

This is the part that separates a demo from a system that holds up on a real phone line under real load. Each problem below is one where the obvious library default actively works against you.

1. Whisper hallucinates on silence

Whisper emits text for every segment it is handed, including silence and background noise. Fed near-zero audio, it "fills in" plausible-sounding text — the infamous looping "Thanks for watching!" or a repeated last phrase. Its internal no_speech_prob is unreliable, so you cannot simply threshold on it. On a phone call, where there is constant low-level line noise and frequent pauses, this turns the transcript into garbage.

Why the library fails: Whisper has no native VAD. Its design assumption is "here is a 30-second clip that contains speech." That assumption is false on a live call.

The mitigation — gate the model on a real VAD, before inference. This is exactly the design at the heart of my WhisperLive project: a Silero-style VAD front-end classifies each frame as speech or silence, and Whisper never sees the silent frames. No audio in means no hallucinated text out. (WhisperX demonstrates the same effect; a Calm-Whisper fine-tune reports >80% reduction in non-speech hallucination as a complementary approach.)

# VAD-gating: Whisper only ever transcribes confirmed speech.
speech_buffer = []
for frame in audio_frames(frame_ms=20):       # 20 ms PCM frames off the wire
    if vad.is_speech(frame):                  # Silero frame-level classifier
        speech_buffer.append(frame)
    elif speech_buffer and silence_ms() > HANGOVER_MS:  # a real pause, not a breath
        transcript = whisper.transcribe(concat(speech_buffer))
        emit_segment(transcript)               # commit, then reset
        speech_buffer.clear()
    # pure silence with empty buffer -> Whisper is never invoked

The logic is small but the discipline is the point: the model is invoked only on buffered speech, and silence is dropped on the floor before it can cost an inference call or produce a hallucination. As a bonus, gating cuts GPU cost — you run the model only when someone is actually talking.
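A self-contained version of the same gate, for experimenting without a model attached, might look like the sketch below. It is not the WhisperLive implementation: `gate_speech`, the frame-count hangover, and the end-of-stream flush are assumptions, and `is_speech` stands in for whatever frame-level VAD is in use.

```python
from typing import Callable, Iterable, Iterator, List

FRAME_MS = 20  # frame duration, matching the 20 ms frames off the wire


def gate_speech(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    hangover_ms: int = 300,
) -> Iterator[bytes]:
    """Yield one concatenated speech segment per utterance.

    Silent frames are never buffered, so the ASR model only sees speech.
    """
    buffer: List[bytes] = []
    silence_ms = 0
    for frame in frames:
        if is_speech(frame):
            buffer.append(frame)
            silence_ms = 0
        elif buffer:
            silence_ms += FRAME_MS
            if silence_ms > hangover_ms:      # a real pause, not a breath
                yield b"".join(buffer)
                buffer, silence_ms = [], 0
        # silence with an empty buffer: nothing to do, the model is never invoked
    if buffer:                                # stream ended mid-utterance
        yield b"".join(buffer)


# Toy VAD for the demo: a frame counts as speech if its first byte is non-zero.
SPEECH, SILENCE = b"\x01\x01", b"\x00\x00"
stream = [SILENCE] * 5 + [SPEECH] * 3 + [SILENCE] * 2 + [SPEECH] * 2 + [SILENCE] * 20 + [SPEECH]
segments = list(gate_speech(stream, lambda f: f[0] != 0))
```

Each yielded segment is what would be handed to `whisper.transcribe`; the two-frame gap inside the first utterance is shorter than the hangover, so it does not split the segment.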

2. Endpointing is either too eager or too slow

The orchestrator must decide when the human is done. The default mechanism is a trailing-silence timer: after the last word, wait N milliseconds of quiet, then call it a turn. Set N short and you cut people off mid-thought ("I'd like to book a — " agent starts talking). Set N to the typical 500–800 ms default and, per the budget table, you have just handed away the single largest controllable chunk of latency.

Why the library fails: silence is a terrible proxy for "finished a thought." Humans pause mid-sentence to think, and stop crisply at the end of a question. A pure timer cannot tell those apart.

The mitigation — semantic, model-based turn detection. A small transformer reads the partial transcript and predicts whether the utterance is semantically complete, regardless of how long the silence is. Because it judges meaning rather than waiting on the clock, it can fire before the trailing-silence window even begins, reclaiming most of that 500–800 ms.

# Semantic turn detection fires on completeness, not on a silence timer.
async def on_partial(text, silence_ms):
    if turn_model.is_complete(text):          # transformer: "is this a finished turn?"
        finalize_turn(text)                    # fire immediately — don't wait out silence
    elif silence_ms > MAX_SILENCE:             # safety net for trailing-off speech
        finalize_turn(text)
    # else: keep listening; VAD stays hot

The semantic check is the fast path; the silence timer survives only as a backstop for utterances that trail off without a clean grammatical end. In practice this is the highest-ROI change in the whole stack — it costs nothing at the model layer and returns hundreds of milliseconds of perceived latency.
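The decision logic can be exercised without a real turn model. In the sketch below, `Endpointer` and `toy_is_complete` are hypothetical; the toy function is a punctuation heuristic standing in for the transformer, the 800 ms backstop is an illustrative default, and the empty-transcript guard is an addition not present in the snippet above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Endpointer:
    is_complete: Callable[[str], bool]   # plug the real turn model in here
    max_silence_ms: int = 800            # backstop only; illustrative value

    def should_finalize(self, text: str, silence_ms: int) -> bool:
        if not text.strip():
            return False                 # nothing said yet: never end the turn
        if self.is_complete(text):
            return True                  # fast path: meaning, not the clock
        return silence_ms > self.max_silence_ms   # trailing-off speech


def toy_is_complete(text: str) -> bool:
    """Stand-in for the transformer: terminal punctuation means 'done'."""
    return text.rstrip().endswith((".", "?", "!"))


endpointer = Endpointer(is_complete=toy_is_complete)
print(endpointer.should_finalize("What time do you open?", silence_ms=0))   # True
print(endpointer.should_finalize("I'd like to book a", silence_ms=200))     # False
print(endpointer.should_finalize("I'd like to book a", silence_ms=900))     # True
```

The first call is the case that matters for latency: the turn ends with zero trailing silence because the utterance is already complete.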

3. TTS time-to-first-byte spikes

Once endpointing is fixed, TTS TTFB becomes the dominant perceived term — it is the last thing before the caller hears anything. The trap: vendor-reported TTFB is best-case (about 40 ms for Cartesia Sonic, 75 ms for ElevenLabs Flash), but independent P50 benchmarks land at 188–313 ms with wide variance, and a cold connection or a long first sentence makes the tail worse.

Why the library fails: batch TTS waits for the entire response text before it synthesizes a single byte. If the LLM is still generating, the caller hears nothing. And measuring against the marketing TTFB sets you up to miss your SLO under real load.

The mitigation: sentence-chunked synthesis with pre-warmed connections, budgeted against P95. Stream LLM tokens, and the moment the first complete clause exists, hand it to TTS and start playback, so audio is being synthesized and played before the LLM has finished generating the rest of the reply. Keep a warm connection to the TTS provider so the first call of a turn isn't paying connection setup, and keep a fallback voice/provider for when the primary's tail latency spikes.

# Overlap TTS with LLM generation: synthesize per clause, not per response.
buffer = ""
async for token in llm.stream(prompt):
    buffer += token
    if ends_clause(buffer):                    # ". " "? " "! " or a long comma clause
        audio = await tts.synthesize(buffer)   # warm connection, first clause is SHORT
        playback.enqueue(audio)                # caller hears word 1 while word 30 is generated
        buffer = ""
if buffer:
    playback.enqueue(await tts.synthesize(buffer))

Two details matter. First, the first clause is deliberately short, so its TTFB is minimal and the caller hears something fast; later clauses synthesize while earlier ones play. Second, I budget against P95 under my own traffic, not the vendor's P50 — the only number that predicts how the system feels at 3 a.m. under load.
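One way to implement the clause boundary itself is sketched below. It assumes LLM tokens arrive with leading spaces (so a sentence end only becomes visible once the next token starts), splits at the first boundary to keep the first clause short, and leaves out the long-comma-clause case for brevity. `split_clause`, `clauses`, and `fake_llm` are hypothetical names, and abbreviations such as "Dr. " would split early under this rule.

```python
import asyncio
import re
from typing import AsyncIterator, Optional, Tuple

# Sentence-ending punctuation followed by whitespace marks a clause boundary.
BOUNDARY = re.compile(r"[.?!]\s")


def split_clause(buffer: str) -> Tuple[Optional[str], str]:
    """Return (clause, remainder) at the first boundary, or (None, buffer)."""
    match = BOUNDARY.search(buffer)
    if match is None:
        return None, buffer
    return buffer[: match.end()].strip(), buffer[match.end():]


async def clauses(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Regroup a token stream into clauses as soon as each one is complete."""
    buffer = ""
    async for token in tokens:
        buffer += token
        clause, buffer = split_clause(buffer)
        while clause is not None:             # one token can close several clauses
            yield clause
            clause, buffer = split_clause(buffer)
    if buffer.strip():                        # whatever is left when the LLM stops
        yield buffer.strip()


async def fake_llm() -> AsyncIterator[str]:
    for token in ["Sure", ".", " I", " can", " book", " that", ".", " What", " time", "?"]:
        yield token


async def collect() -> list:
    return [clause async for clause in clauses(fake_llm())]


print(asyncio.run(collect()))  # ['Sure.', 'I can book that.', 'What time?']
```

Each yielded clause is what gets handed to the TTS call, ideally from a separate task so that synthesis of one clause does not pause token consumption for the next.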

Twilio wrinkle: outbound audio must go back over the same WebSocket as base64 8 kHz µ-law media messages — so the TTS provider's wideband PCM gets transcoded before it is queued, and that transcode belongs inside the per-clause loop, not after the full reply. Send a mark message after each clause: Twilio echoes it back when that audio has actually played, which is the only reliable playback-position signal you get.
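A minimal sketch of that outbound framing follows. It assumes the TTS audio has already been resampled to 8 kHz 16-bit PCM (resampling is out of scope here), uses the standard G.711 µ-law compression, and emits `media` and `mark` messages in the shape Media Streams expects. The helper names, the `clause-N` mark naming, and the sample `streamSid` are hypothetical.

```python
import base64
import json
from typing import List, Sequence

BIAS, CLIP = 0x84, 32635


def pcm16_to_ulaw_byte(sample: int) -> int:
    """Compress one signed 16-bit sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def clause_messages(stream_sid: str, clause_id: int, pcm_8k: Sequence[int],
                    frame_samples: int = 160) -> List[str]:
    """Frame one clause as 20 ms media messages followed by a mark."""
    messages = []
    for start in range(0, len(pcm_8k), frame_samples):
        ulaw = bytes(pcm16_to_ulaw_byte(s) for s in pcm_8k[start:start + frame_samples])
        messages.append(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(ulaw).decode("ascii")},
        }))
    # The mark is echoed back once the audio queued before it has played.
    messages.append(json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": f"clause-{clause_id}"},
    }))
    return messages


# 50 ms of silence at 8 kHz -> three media messages (160 + 160 + 80 samples) and a mark.
outbound = clause_messages("MZ-example", clause_id=0, pcm_8k=[0] * 400)
```

Keeping the mark name tied to the clause index is what later lets the orchestrator map an echoed mark back to the text the caller actually heard.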

4. Barge-in and double-talk

Real conversation is interruptible. When the caller speaks while the agent is talking, the agent must stop immediately — within ~100–300 ms — or it sounds like it isn't listening. The naive pipeline shuts VAD off during playback (to avoid hearing its own voice), which makes barge-in impossible: stale TTS keeps streaming while the human talks, and now both are talking at once ("double-talk").

Why the library fails: the obvious "don't listen while you speak" instinct is exactly backwards for natural conversation.

The mitigation — keep VAD hot during playback, and on detected user speech, cancel the in-flight TTS and flush the LLM. The subtlety is avoiding false barge-in on backchannels — the "uh-huh," "right," "mm-hmm" that humans emit to signal they're listening, not to take the floor. A short debounce on speech duration separates a real interruption from a backchannel.

# Barge-in: VAD never sleeps during playback; debounce defeats false triggers.
async def on_user_speech(duration_ms):
    if not agent_is_playing:
        return
    if duration_ms < BACKCHANNEL_MS:           # "uh-huh" while agent talks -> ignore
        return
    playback.stop()                             # local + instant: silence NOW
    await tts.cancel()                          # then kill in-flight synthesis
    llm.flush()                                 # local/sync: abandon the half-generated reply

The cancellation has to reach all three places — synthesis, generation, and the playback queue — or fragments of the abandoned reply leak out after the agent should have gone quiet.

Twilio wrinkle: the playback queue is partly inside Twilio — frames you have already written to the WebSocket are buffered on their side and will keep playing after you stop sending. Barge-in on Media Streams therefore means sending a clear message to flush Twilio's buffer the instant real user speech is confirmed, and using the last echoed mark to know exactly how much of the abandoned reply the caller actually heard (which matters when the LLM resumes the conversation).
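
One way to track what the caller actually heard is a small ledger keyed by mark name. This is a sketch under assumptions: `PlaybackLedger` and its methods are hypothetical names, and it relies on my understanding that Twilio may also echo still-pending marks once a `clear` is processed, so any mark that arrives after barge-in is ignored rather than counted as played.

```python
import json

class PlaybackLedger:
    """Tracks which clauses Twilio has confirmed as played, by mark name."""

    def __init__(self, stream_sid: str):
        self.stream_sid = stream_sid
        self._pending = {}   # mark name -> clause text, sent but not yet confirmed
        self._heard = []     # clause texts confirmed as played, in order

    def sent(self, mark_name: str, clause_text: str) -> None:
        """Record a clause whose audio and mark were just written to the socket."""
        self._pending[mark_name] = clause_text

    def on_twilio_event(self, raw: str) -> None:
        """Handle an inbound message; only echoed marks matter here."""
        event = json.loads(raw)
        if event.get("event") != "mark":
            return
        text = self._pending.pop(event["mark"]["name"], None)
        if text is not None:          # unknown or post-clear marks are ignored
            self._heard.append(text)

    def barge_in(self):
        """Return the 'clear' message to send and the text the caller heard."""
        clear = json.dumps({"event": "clear", "streamSid": self.stream_sid})
        heard = " ".join(self._heard)
        self._heard = []
        self._pending.clear()         # whatever was still queued was never heard
        return clear, heard
```

The `heard` string is what gets written into the conversation history in place of the full generated reply, so the LLM resumes from what the caller actually received.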

5. Python's GIL and async backpressure

The orchestrator frameworks are Python and asyncio-based. That is great for I/O concurrency and terrible if you forget how it works: one slow processor stalls the entire pipeline, and any CPU-bound work (audio resampling, a synchronous model call) blocks the event loop, freezing every call that worker is handling. Threads are a weak escape hatch, because in standard CPython the GIL lets only one thread execute Python bytecode at a time.

Why the library fails: the framework gives you a clean async pipeline, but it cannot stop you from blocking the loop or from letting a slow downstream stage build an unbounded queue behind it.

The mitigation — one process (or worker) per call, all heavy inference pushed to separate GPU services, and bounded queues everywhere. Never run model inference inside the event loop; call it as a network service so the loop stays free to shuttle 20 ms audio frames. Process isolation per call means one stuck session cannot take its neighbors down with it.

# Keep the event loop free; bound queues so a slow stage sheds instead of drowning.
import asyncio

asr_q = asyncio.Queue(maxsize=8)               # bounded: backpressure, not infinite buffering

async def feed_asr(frame):
    try:
        asr_q.put_nowait(frame)                # NEVER block the audio loop
    except asyncio.QueueFull:
        asr_q.get_nowait()                     # drop the oldest frame...
        asr_q.put_nowait(frame)                # ...prefer fresh audio over a backlog

# inference runs in a separate GPU service, not here (asr_client is the remote ASR client):
async def transcribe():
    return await asr_client.stream(asr_q)      # await = loop stays responsive

The bounded queue is doing real work: under load it sheds the oldest audio rather than letting latency balloon as a backlog grows. For a real-time system, stale audio is worthless — dropping it is correct.
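
To see the shedding behaviour in isolation, here is a minimal, self-contained sketch. The helper `put_fresh` is a hypothetical name and integers stand in for 20 ms audio frames; the producer deliberately outruns the consumer.

```python
import asyncio

def put_fresh(q: asyncio.Queue, frame) -> int:
    """Enqueue without blocking; when the queue is full, evict the oldest frame first.

    Returns the number of frames dropped (0 or 1).
    """
    try:
        q.put_nowait(frame)
        return 0
    except asyncio.QueueFull:
        q.get_nowait()        # discard the stalest frame
        q.put_nowait(frame)   # room is now guaranteed
        return 1

async def demo():
    q = asyncio.Queue(maxsize=8)
    # Producer pushes 12 frames while nothing consumes.
    dropped = sum(put_fresh(q, n) for n in range(12))
    kept = [q.get_nowait() for _ in range(q.qsize())]
    return dropped, kept

dropped, kept = asyncio.run(demo())
```

After twelve frames into a queue of eight, the four oldest frames are gone and the queue holds frames 4 through 11, so the consumer resumes on the most recent audio instead of working through a backlog.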

6. RTP jitter and packet loss

The network reorders, delays, and drops RTP packets. Dropouts hurt twice: they make the agent hard for the human to understand, and they degrade ASR word-error-rate, because the model is fed audio with holes in it.

Why the library fails: a fixed-size jitter buffer either adds latency (too deep) or stutters (too shallow), and raw G.711 has no loss resilience at all.

The mitigation — an adaptive jitter buffer plus a forward-error-correcting codec. WebRTC's NetEQ is a dynamic jitter buffer and error-concealment algorithm that smooths jitter while keeping latency as low as the conditions allow — a direct, adaptive latency/quality tradeoff rather than a fixed one. Pair it with Opus, which carries built-in FEC (redundant data to reconstruct lost packets) and PLC (synthesizing plausible audio for gaps the ear won't notice). This is the concrete reason I prefer Opus on the WebRTC leg even when the call ultimately terminates to PSTN: I get loss resilience end-to-end and only transcode to brittle G.711 at the very edge.

There is no clever snippet here — this is configuration and codec choice at the media tier — but it is load-bearing: loss handling has to live upstream of the ASR model, because by the time a gap reaches the transcriber, the word is already gone.

7. Scaling to thousands of concurrent calls

A demo handles one call. Production handles thousands, and the gap between those two is where most voice stacks fall over. Independent benchmarking of open-source voice engines "at 100 concurrent calls" consistently shows demoware buckling well before the marketing concurrency number.

Why naive scaling fails: running one LLM request at a time per call serializes GPU work and wastes the hardware; running them naively in parallel spikes per-request TTFT; and scaling down mid-call drops live conversations.

The mitigation — continuous batching with a latency cap, regional media, and graceful draining.

  • Continuous (in-flight) batching on the inference servers — vLLM with PagedAttention — lets new requests join the running batch as decode slots free up, amortizing decode cost across all active calls without serializing them. A published benchmark scaled Qwen3-8B to 100 concurrent calls at ~278 ms TTFT with 100% success and wall time barely moving — the whole point of continuous batching. You set a latency cap so throughput optimization never pushes any single call's TTFT past the budget.
  • Worker-pool concurrency at the orchestrator: each worker holds ~20–40 realtime sessions; add workers roughly linearly with call volume; autoscale on an active_calls metric; pre-warm a pool so cold starts never surface in p99 (and plan for ~5× marketing spikes).
  • Back-pressure before 429s: track upstream provider concurrency (e.g. in Redis) and shed or queue load before the API returns rate-limit errors.
  • Regional media servers near callers to cut RTT, GPUs wherever the capacity is, inter-region hop only on the control plane.
  • Graceful connection draining wired to SIGTERM, so a deploy or scale-down lets in-flight calls finish instead of dropping them.

# Drain on shutdown: stop taking new calls, let live ones finish.
async def on_sigterm():
    global accepting_new_calls                 # module-level flag, not a local
    accepting_new_calls = False                # load balancer stops routing here
    while active_calls:                        # let live conversations complete
        await asyncio.sleep(1)
    await shutdown()                            # only now: tear the worker down

Draining is small code with outsized impact: it is the difference between a routine deploy and hundreds of callers hearing a dead line mid-sentence.
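
The "back-pressure before 429s" item from the list above can be sketched the same way. This is an in-process stand-in, assuming a single worker: `ProviderGate`, `limit`, and `reserve` are hypothetical names, and a real deployment would need the counter shared and updated atomically across workers (the Redis counter mentioned earlier).

```python
class ProviderGate:
    """Admission gate that sheds load before the upstream provider's limit is hit."""

    def __init__(self, limit: int, reserve: int = 1):
        # Stop admitting a little below the provider's hard limit so that
        # races between workers do not tip the account into 429s.
        self.soft_limit = max(0, limit - reserve)
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Claim a slot if one is free; False means shed or queue the call."""
        if self.in_flight >= self.soft_limit:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Return a slot when the upstream request finishes or is cancelled."""
        self.in_flight = max(0, self.in_flight - 1)


gate = ProviderGate(limit=3, reserve=1)
admitted = [gate.try_acquire() for _ in range(3)]  # third caller is turned away
gate.release()                                     # one request completes
readmitted = gate.try_acquire()                    # a slot is free again
```

Rejecting locally means the caller can be queued or routed to a fallback immediately, rather than discovering the limit through a rate-limit error mid-turn.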

Results and impact

Built on these principles, a telephony voice agent delivers what the naive pipeline cannot:

  • Natural-feeling, sub-second turns. By overlapping stages — STT partials into the LLM, LLM tokens into per-clause TTS — and by replacing silence-timer endpointing with semantic turn detection, the perceived gap collapses to roughly turn-detect + LLM-TTFT + TTS-TTFB and lands in the conversational band rather than the "broken" one.
  • Robust under real-world packet loss. Opus FEC/PLC and an adaptive NetEQ jitter buffer keep both human comprehension and ASR accuracy stable when the network misbehaves — the normal condition for phone calls, not the exception.
  • Horizontally scalable to thousands of concurrent calls. Continuous batching with a latency cap, per-call process isolation, regional media placement, and graceful draining let the system grow with call volume without sacrificing the latency budget or dropping live calls during deploys.

The throughline is that none of this came from a single magic model. It came from treating latency as a budget to be decomposed, overlapping every stage that can be overlapped, and fixing the specific places — Whisper's silence hallucination, silence-timer endpointing, TTS cold-start tails, barge-in races, the GIL, jitter, batch-vs-latency — where the convenient default quietly betrays you.


The techniques in this study — VAD gating, semantic endpointing, per-clause TTS overlap, barge-in, and the Pipecat / LiveKit / Deepgram agents above — are assembled into a small companion mini-project, available on request.


Tech stack

Python · Whisper · Silero VAD · Deepgram · Pipecat · LiveKit

Other 2025 work