Building a real-time AI voice agent, one that can listen, think, and speak with sub-second latency, was a major engineering challenge two years ago. In 2026, with the GPT-4o Realtime API, Claude's voice capabilities, and mature TTS/STT models, it is a weekend project for a competent developer. This guide covers the complete technical stack for building voice agents that feel natural.

The Voice Agent Pipeline

| Stage | Technology | Latency Budget | What Happens |
|---|---|---|---|
| 1. Audio input | WebRTC (browser mic) | ~20ms | Raw audio captured from the user's microphone |
| 2. Speech-to-text (STT) | OpenAI Whisper, Deepgram, AssemblyAI | 100-300ms | Audio transcribed to text |
| 3. LLM reasoning | GPT-4o, Claude Sonnet, Gemini | 300-1,000ms | AI understands intent and generates a response |
| 4. Text-to-speech (TTS) | ElevenLabs, OpenAI TTS, Play.ht | 100-500ms | Text converted to natural speech |
| 5. Audio output | WebRTC (browser speaker) | ~20ms | Audio played back to the user |
| **Total (chained)** | | ~540-1,840ms | Target: under 1 second for natural conversation |
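Summing the per-stage budgets is a useful sanity check when you swap vendors. A quick sketch using the ranges from the table above:

```python
# Per-stage latency budgets in milliseconds, as (min, max), from the table above
STAGES = {
    "audio_in": (20, 20),
    "stt": (100, 300),
    "llm": (300, 1000),
    "tts": (100, 500),
    "audio_out": (20, 20),
}

def total_budget_ms(stages):
    """Sum stage minima and maxima to get the end-to-end latency range."""
    lo = sum(mn for mn, _ in stages.values())
    hi = sum(mx for _, mx in stages.values())
    return lo, hi

print(total_budget_ms(STAGES))  # (540, 1840)
```

A naive chained pipeline lands near the top of this range; the Realtime API approach below avoids two extra network round-trips, which is how it gets closer to the bottom.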

Voice Agent Architecture

```python
# Simplified voice agent using the OpenAI Realtime API
# The Realtime API combines STT + LLM + TTS into one WebSocket connection
# Latency: ~500-800ms end-to-end (much faster than chained APIs)

import asyncio
import base64
import json
import os

import websockets

API_KEY = os.environ["OPENAI_API_KEY"]

async def voice_agent():
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        # Note: this parameter is named `additional_headers` in websockets >= 14
        extra_headers={
            "Authorization": f"Bearer {API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful coding mentor. Be concise.",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad"}  # Server-side voice activity detection
            }
        }))

        # Stream audio from browser mic -> ws -> receive audio back
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                # Audio deltas arrive base64-encoded; decode and hand the raw
                # PCM16 chunk to your playback path (play_audio is a placeholder)
                play_audio(base64.b64decode(event["delta"]))

# asyncio.run(voice_agent())
```
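The loop above only handles output; microphone audio flows the other way over the same socket via `input_audio_buffer.append` events. A minimal sketch of the input side, where `get_mic_chunk` is a hypothetical async callable yielding raw PCM16 bytes from your capture layer (and `None` when capture stops):

```python
import base64
import json

async def send_mic_audio(ws, get_mic_chunk):
    """Forward raw PCM16 mic chunks into the Realtime API session.

    get_mic_chunk is a hypothetical async callable returning bytes,
    e.g. wired up to a WebRTC or sounddevice capture callback.
    """
    while True:
        chunk = await get_mic_chunk()
        if chunk is None:  # capture stopped
            break
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
```

In practice you run this concurrently with the receive loop (e.g. `asyncio.gather(send_mic_audio(ws, ...), receive_loop(ws))`) so audio flows both ways at once.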

STT, LLM, and TTS: Breaking Down the Stack

| Component | Best Options | Key Considerations |
|---|---|---|
| STT (speech-to-text) | Deepgram (lowest latency, ~100ms), Whisper v3 (best accuracy), AssemblyAI (best features) | Deepgram for real-time; Whisper for batch/offline; AssemblyAI for diarization + sentiment |
| LLM | GPT-4o Realtime (all-in-one, best latency), Claude (best reasoning), Gemini (cheapest) | The Realtime API eliminates STT/TTS chaining latency; separate STT + LLM + TTS gives more control |
| TTS (text-to-speech) | ElevenLabs (most natural voices), OpenAI TTS (good + integrated), Play.ht (cloning + emotions) | ElevenLabs for quality; OpenAI for simplicity; Play.ht for custom voice cloning |
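For the chained approach, the glue code per conversational turn is essentially three awaited stages. A minimal sketch, where `transcribe`, `complete`, and `synthesize` are hypothetical async wrappers around your chosen vendors' SDKs (e.g. Deepgram, an LLM API, ElevenLabs):

```python
async def handle_turn(audio_in: bytes, transcribe, complete, synthesize) -> bytes:
    """One conversational turn through a chained STT -> LLM -> TTS pipeline.

    The three callables are hypothetical async vendor wrappers; each stage
    contributes its own slice of the latency budget.
    """
    text = await transcribe(audio_in)    # STT: ~100-300ms
    reply = await complete(text)         # LLM: ~300-1,000ms
    audio_out = await synthesize(reply)  # TTS: ~100-500ms
    return audio_out
```

In production you would stream between stages rather than await full completions, feeding the LLM's token stream into TTS sentence by sentence; that pipelining is what keeps chained latency near the bottom of each stage's budget.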

Interruption Handling (Barge-In)

Critical feature: users need to be able to interrupt the AI mid-response, just as in human conversation. Implementation: monitor the microphone's audio level during AI playback; if sustained audio rises above a threshold, immediately stop TTS playback, discard the LLM's partial response, and start listening to the new input. Without interruption handling, the voice agent feels robotic and frustrating.
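A minimal sketch of that barge-in detection over a PCM16 chunk stream. The RMS threshold and chunk counts are illustrative; requiring several consecutive hot chunks filters out clicks and coughs. What happens on detection is up to your stack (stop local playback; with the Realtime API, cancel the in-flight response):

```python
import array

def rms(pcm16: bytes) -> float:
    """Root-mean-square level of a chunk of 16-bit little-endian PCM samples."""
    samples = array.array("h", pcm16)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def check_barge_in(chunk: bytes, *, state: dict,
                   threshold: float = 500.0, needed_chunks: int = 3) -> bool:
    """Return True once `needed_chunks` consecutive chunks exceed the RMS
    threshold (sustained speech, not a transient noise). `state` carries the
    consecutive-hot-chunk count between calls."""
    if rms(chunk) > threshold:
        state["hot"] = state.get("hot", 0) + 1
    else:
        state["hot"] = 0  # silence resets the streak
    return state["hot"] >= needed_chunks
```

When `check_barge_in` fires during playback, stop the speaker output, drop any buffered audio, and abort the pending response (with the Realtime API this maps to sending a `response.cancel` event); server-side VAD can also handle some of this for you.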

Bottom line: The OpenAI Realtime API is the fastest path to a working voice agent — it bundles STT + LLM + TTS into one low-latency WebSocket. For production, consider Deepgram (STT) + your preferred LLM + ElevenLabs (TTS) for more control over each component. Target under 1 second end-to-end latency for a natural conversation feel. See also: AI Agents Guide and Function Calling Guide.