WhisperLive
Production-grade real-time speech-to-text using OpenAI Whisper with voice activity detection and TensorRT acceleration. Multiple backends, Docker support, mature codebase (235 commits).
Architecture
Real-time speech-to-text pipeline built on OpenAI's Whisper model with voice activity detection (VAD) for low-latency streaming. Multiple inference backends including TensorRT acceleration for production workloads. Docker-deployable.
Case study
Speech-to-text has moved past batch processing. Modern applications need transcription as it happens -- live captions during meetings, real-time subtitles for streams, voice-driven interfaces that respond while the user is still talking. WhisperLive meets that need by wrapping OpenAI's Whisper model in a streaming pipeline tuned for low-latency production use.
How it works
The core pipeline chains three stages. A voice activity detector (VAD) segments incoming audio into speech regions, filtering out silence and background noise before any inference runs. The filtered segments feed into a Whisper model running on one of several backend engines: standard PyTorch for development, or TensorRT for production workloads where latency matters. Results stream back to the client as partial transcripts that refine in real time.
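The VAD gating stage can be illustrated with a minimal sketch. WhisperLive's actual detector is more sophisticated than this; the energy-threshold logic below (a common baseline technique, not the project's implementation) only shows the shape of the stage that decides which audio frames ever reach the Whisper engine.

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame (a list of float samples)."""
    return sum(s * s for s in frame) / len(frame)


def segment_speech(frames, threshold=0.01):
    """Group consecutive frames above the energy threshold into segments.

    Returns a list of (start_index, end_index) pairs, end exclusive.
    Silent frames are dropped, so no inference runs on them.
    """
    segments = []
    start = None
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            if start is None:
                start = i  # speech begins
        elif start is not None:
            segments.append((start, i))  # speech ended at previous frame
            start = None
    if start is not None:
        segments.append((start, len(frames)))
    return segments


# Toy stream: silence, two speech frames, silence, one speech frame.
frames = [
    [0.0] * 4,
    [0.5, -0.5, 0.5, -0.5],
    [0.4, -0.4, 0.4, -0.4],
    [0.0] * 4,
    [0.3, -0.3, 0.3, -0.3],
]
print(segment_speech(frames))  # -> [(1, 3), (4, 5)]
```

Only the frames inside the returned segments would be handed to the inference backend; everything else is discarded before it costs GPU time.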
```mermaid
graph LR
    MIC[Audio Input] --> VAD["Voice Activity<br/>Detection"]
    VAD -->|speech segments| ENGINE[Whisper Engine]
    ENGINE -->|partial transcripts| OUT["Streaming<br/>Output"]
    subgraph "Inference Backends"
        PT["PyTorch<br/>CPU / CUDA"]
        TRT["TensorRT<br/>Optimized"]
    end
    ENGINE --- PT
    ENGINE --- TRT
```

Multiple inference backends let you trade off between flexibility and performance. The PyTorch backend runs anywhere and is useful for development and testing. TensorRT compilation targets NVIDIA GPUs specifically, quantizing and fusing operations for significantly lower latency under load. Switching between backends is a configuration change, not a code change.
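The "configuration change, not a code change" pattern can be sketched as a small dispatch table. The class names, the `load_engine` helper, and the config keys below are hypothetical, not WhisperLive's actual API; the point is that the application code never branches on which backend is in use.

```python
# Hypothetical backend classes standing in for real PyTorch / TensorRT
# engine wrappers. Both expose the same transcribe() interface, so the
# caller is indifferent to which one the config selects.

class PyTorchEngine:
    name = "pytorch"

    def transcribe(self, segment):
        return f"[pytorch] transcript of {len(segment)} samples"


class TensorRTEngine:
    name = "tensorrt"

    def transcribe(self, segment):
        return f"[tensorrt] transcript of {len(segment)} samples"


BACKENDS = {"pytorch": PyTorchEngine, "tensorrt": TensorRTEngine}


def load_engine(config):
    """Instantiate whichever backend the config names (default: pytorch)."""
    return BACKENDS[config.get("backend", "pytorch")]()


engine = load_engine({"backend": "tensorrt"})
print(engine.name)  # -> tensorrt
```

Swapping backends then means editing one config value; the transcription call sites stay identical.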
Deployment
WhisperLive ships as Docker images with all dependencies baked in, including CUDA and TensorRT runtimes for GPU-accelerated inference. The containerized setup means you can run it on a local workstation for development or deploy to GPU-equipped cloud instances for production without changing the application code. With 235 commits across the codebase, the project has matured past the "works on my machine" stage into something you can reliably operate.
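A typical workflow might look like the commands below. The image names, tags, and port are illustrative assumptions, not the project's published values -- check WhisperLive's own documentation for the exact images to pull.

```shell
# GPU-accelerated server (assumes the NVIDIA Container Toolkit is installed);
# image name and port are placeholders for illustration.
docker run --gpus all -p 9090:9090 whisperlive-gpu:latest

# CPU-only server for local development, same application code.
docker run -p 9090:9090 whisperlive-cpu:latest
```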
Where it fits
The target use cases are anywhere transcription latency matters: live meeting captioning, accessibility overlays for video streams, voice-controlled applications that need to act on speech as it arrives, and real-time annotation workflows where a human reviews AI-generated transcripts on the fly. The VAD front-end keeps inference costs down by only running the model when someone is actually speaking.