OpenAI Blog

Introducing the Realtime API

Signal: 85 · Hype: 25
In three lines: OpenAI launches the Realtime API, enabling developers to build fast bidirectional speech experiences. The API supports speech input and output with low latency and native function calling integration.

## OpenAI Realtime API: What It Actually Changes for Voice

### 1. What's Being Announced

OpenAI is opening access to its Realtime API — the same infrastructure powering ChatGPT's Advanced Voice Mode. The API enables real-time speech-to-speech exchanges with significantly reduced latency compared to classical pipelines. It natively supports audio input and output, voice activity detection (VAD), mid-sentence interruption handling, and function calling directly from the audio stream — no intermediate transcription step required.
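As a rough illustration of how those features might be switched on, the sketch below builds the JSON events a client could send over the API's WebSocket connection. The event names (`session.update`, `input_audio_buffer.append`) and the `server_vad` turn-detection setting follow the beta documentation as best understood here and should be checked against the current API reference. The `get_weather` tool and the helper function names are hypothetical.

```python
import base64
import json

# Hypothetical tool definition: the name and schema are made up for illustration.
WEATHER_TOOL = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def session_update_event(instructions, tools):
    """Build a `session.update` event that enables server-side VAD and tools.

    Field names are assumptions based on the beta docs; verify before use.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "turn_detection": {"type": "server_vad"},
            "tools": tools,
        },
    })


def audio_append_event(pcm_chunk):
    """Wrap a chunk of raw audio bytes as an `input_audio_buffer.append` event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })
```

Both helpers only produce strings, so they can be inspected or logged without opening a connection. A real client would send them over an authenticated WebSocket session.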

The underlying model is `gpt-4o-realtime-preview`. Two voices are available at launch. Pricing is usage-based: $0.06/min for audio input, $0.24/min for audio output — figures worth comparing against the cumulative cost of a manually assembled STT + LLM + TTS pipeline.
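To make the pricing concrete, here is a minimal sketch that estimates the audio cost of one session from the two per-minute figures above. It treats those figures as exact rates, which is a simplification, and the function name is illustrative.

```python
INPUT_USD_PER_MIN = 0.06   # audio input rate quoted above
OUTPUT_USD_PER_MIN = 0.24  # audio output rate quoted above


def session_cost_usd(input_minutes, output_minutes):
    """Estimate audio cost for one session from minutes of speech each way."""
    return (input_minutes * INPUT_USD_PER_MIN
            + output_minutes * OUTPUT_USD_PER_MIN)
```

For a call with 3 minutes of user speech and 2 minutes of model speech, this gives 3 × 0.06 + 2 × 0.24 = $0.66.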

### 2. Why This Is Structurally Significant

Before this announcement, building a natural voice experience required assembling at least three distinct components: an STT engine (Whisper, Deepgram, AssemblyAI), an LLM for reasoning, and a TTS engine (ElevenLabs, Play.ht, Azure Neural). Each hop between components added latency — typically 800 ms to 2 s of total perceived delay, which breaks the illusion of natural conversation.
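The way per-hop delays add up can be sketched as a simple latency budget. The stage names and millisecond values below are illustrative placeholders, not measurements of any particular provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages):
    """Sum the delay contributed by each sequential hop in the pipeline."""
    return sum(stage.latency_ms for stage in stages)


# Hypothetical budget for a classical STT + LLM + TTS pipeline.
CLASSICAL_PIPELINE = [
    Stage("stt_final_transcript", 300),
    Stage("llm_first_token", 400),
    Stage("tts_first_audio", 300),
    Stage("network_hops", 150),
]
```

With these placeholder values the total is 1,150 ms, inside the 800 ms to 2 s range cited above.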

The Realtime API short-circuits this pipeline by processing audio end-to-end within a single multimodal model. The practical result: latencies comparable to ChatGPT's voice mode, approximately 300–500 ms in public demos. That range is short enough for users to perceive a conversation as fluid.

Native interruption handling is an underappreciated technical point. In a classical pipeline, if the user cuts in, you must detect the interruption, cancel the in-progress TTS generation, and restart the LLM — three asynchronous operations that are bug-prone and latency-heavy. Here, it's handled at the model level.
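To show why the classical approach is fiddly, here is a minimal `asyncio` sketch of the three steps, with stand-ins for the real components. `speak` fakes streaming TTS playback, a timer fakes the voice-activity detector firing, and the LLM restart is reduced to a log entry. All names are hypothetical.

```python
import asyncio


async def speak(text, log):
    """Stand-in for streaming TTS playback: emits one word per tick."""
    try:
        for word in text.split():
            log.append(f"say:{word}")
            await asyncio.sleep(0.02)
    except asyncio.CancelledError:
        log.append("tts_cancelled")
        raise


async def handle_turn(reply, interrupt_after_s, log):
    """Play a reply, but abort and restart if the user barges in first."""
    tts = asyncio.create_task(speak(reply, log))
    # Stand-in for the VAD signalling that the user started talking.
    interrupt = asyncio.create_task(asyncio.sleep(interrupt_after_s))
    done, _ = await asyncio.wait(
        {tts, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if interrupt in done and not tts.done():
        # Step 1 (detection) has happened; step 2: cancel in-progress TTS.
        tts.cancel()
        try:
            await tts
        except asyncio.CancelledError:
            pass
        # Step 3: restart the LLM with the new user input (only logged here).
        log.append("llm_restarted")
        return "interrupted"
    interrupt.cancel()
    return "completed"
```

Even in this toy version, the ordering of cancellation and restart has to be handled explicitly. A production pipeline would also need to flush audio already buffered on the client.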

### 3. Ecosystem Implications — Winners and Losers

**Direct losers:** Specialized STT and TTS providers see their value proposition shrink for conversational use cases. Deepgram, AssemblyAI, ElevenLabs, and Play.ht are exposed. ElevenLabs in particular had positioned low latency as a key differentiator — that advantage erodes if OpenAI delivers the announced performance at scale. Integrators who built abstractions around multi-component pipelines (Vocode, Pipecat) need to revisit their architecture.

**Direct winners:** Developers building voice agents, AI call centers, embedded assistants, and accessibility applications. The reduction in integration complexity is real: a single WebSocket endpoint replaces three APIs with their separate SDKs, error handling, and billing.

**Competitors forced to react:** Google (with Gemini Live, announced at Google I/O 2024 but not yet available through a public API) and Anthropic (no native voice capability announced) are now behind on this specific segment. Hume AI, which had an early lead on empathic voice with its EVI model, remains differentiated on the emotional axis but loses its advantage of early API access.

### 4. What to Watch

The $0.24/min output audio pricing is steep for high-volume use cases (call centers, IVR). For comparison, ElevenLabs charges approximately $0.30/1,000 characters, which can work out cheaper for short responses, though that figure covers TTS only and excludes the STT and LLM legs of a classical pipeline. The economic equation will depend heavily on the input/output ratio and average exchange duration.
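Comparing a per-character price with a per-minute price requires an assumption about speaking rate. The sketch below converts one into the other. The defaults of 150 words per minute and 6 characters per word (including the trailing space) are assumptions, not figures from either vendor.

```python
def tts_usd_per_minute(usd_per_1k_chars, words_per_minute=150, chars_per_word=6):
    """Convert a per-1,000-character TTS price into a per-spoken-minute price.

    The speaking-rate defaults are rough assumptions; adjust for your content.
    """
    chars_per_minute = words_per_minute * chars_per_word
    return usd_per_1k_chars * chars_per_minute / 1000
```

Under these assumptions, $0.30 per 1,000 characters corresponds to about $0.27 per spoken minute (900 characters), so the comparison with $0.24/min is sensitive to the speaking rate chosen.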

Voice customization remains an open question: the two voices available at launch are fixed, with no voice cloning or style adjustment — a gap versus ElevenLabs or Resemble AI for brands that want a proprietary sonic identity.

Finally, availability is currently limited: the API is rolling out gradually as a public beta for paid developers. Capacity constraints at scale remain to be validated in real production environments.


Summary generated by Claude — human-verified