The
OpenAI Realtime API was the first realtime API available and proved an important concept. Developers could call one API with audio via WebSockets, and get audio back from the API, skipping the need to build a full pipeline. The Realtime API collapsed STT, LLM, and TTS into a single connection and set the bar for how voice agents should feel.
But two years later, the OpenAI Realtime API is no longer state of the art. It locks you into one specific architecture with one vendor for the model, the voice, and the transport. You can't swap in a different LLM or choose a leading TTS engine.
The good news: the field has expanded. In the span of about a year, we went from one viable speech-to-speech API to five. At Inworld, we built a
Realtime API powered by the #1-ranked TTS on the
Artificial Analysis Speech Arena, with access to 10+ LLM providers through a single endpoint and full compatibility with the OpenAI Realtime protocol. Google
launched Gemini 3.1 Flash Live in March 2026. xAI shipped the
Grok Voice Agent API in December 2025. Hume released
EVI 3. This guide compares all four alternatives against the original and breaks down which one fits which use case.
What Is a Realtime Voice API?
A realtime voice API accepts audio from a user and returns audio from an AI agent through a persistent streaming connection. Usually a WebSocket.
The simplest way to understand it: you speak into a microphone, and an AI voice speaks back. Behind the scenes, the API handles three jobs: a speech-to-text (STT) model converts your voice to text, an LLM processes the transcript and decides what to say in response, and a text-to-speech (TTS) model turns the LLM's response into spoken audio.
Most voice APIs wait for you to finish speaking, then process your words, then build a full audio response, then play it back. That sequence creates a noticeable pause even if each step is fast. Realtime APIs work differently. They start speaking back to you while they're still figuring out the rest of the answer.
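The difference is easiest to see as back-of-the-envelope arithmetic. The stage timings below are made-up round numbers for illustration, not vendor benchmarks:

```python
# Illustrative latency arithmetic for chained vs. realtime pipelines.
# All stage timings are invented round numbers, not measured figures.
CHAINED_MS = {"stt_final": 200, "llm_full_response": 900, "tts_full_clip": 700}
STREAMED_MS = {"stt_final": 200, "llm_first_token": 150, "tts_first_chunk": 120}

def chained_time_to_audio(stages):
    """A chained pipeline waits for every stage to finish before playback."""
    return sum(stages.values())

def streaming_time_to_first_audio(stages):
    """A realtime API plays audio as soon as the first TTS chunk is ready,
    so only the time to each stage's FIRST output counts."""
    return sum(stages.values())

chained = chained_time_to_audio(CHAINED_MS)              # 1800 ms of silence
streamed = streaming_time_to_first_audio(STREAMED_MS)    # 470 ms to first audio
```

Same stages, but the realtime API pays only for time-to-first-output at each stage, which is why the perceived pause shrinks so dramatically.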
Three trends are shaping this space right now:
- The OpenAI Realtime protocol is becoming a de facto standard. Inworld and xAI both follow OpenAI's event schema. If you've built a client for OpenAI's Realtime API, you can point it at either provider with minimal code changes.
- Modular architectures are winning on flexibility. OpenAI and Google process audio natively inside the model. That means tight coupling and no component swapping. Inworld uses a modular architecture (STT + LLM + TTS) that lets you choose each piece independently. Inworld's approach matches the latency of native models while giving you access to 10+ LLM providers through a single API.
- Pricing is collapsing. xAI charges $0.05 per minute flat. Inworld TTS runs at $10 per million characters (roughly $0.01/min) with LLM costs at provider rates. OpenAI's audio output at ~$0.24/min looks expensive by comparison.
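To make the pricing gap concrete, here is the cost of a hypothetical 10-minute conversation using the per-minute figures quoted above. Note that the Inworld figure covers TTS only; LLM tokens are billed separately at provider rates:

```python
# Per-minute price points quoted in this article (USD/min).
# "inworld_tts" is the TTS component only; LLM usage bills separately.
PRICE_PER_MIN = {
    "xai_grok": 0.05,
    "inworld_tts": 0.01,
    "openai_audio_out": 0.24,
}

def conversation_cost(minutes, price_per_min):
    """Flat per-minute cost of a conversation, rounded to cents."""
    return round(minutes * price_per_min, 2)

costs = {api: conversation_cost(10, rate) for api, rate in PRICE_PER_MIN.items()}
# {'xai_grok': 0.5, 'inworld_tts': 0.1, 'openai_audio_out': 2.4}
```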
Who Needs a Realtime Voice API (and When)?
Teams migrating off OpenAI's Realtime API. You built a prototype on OpenAI, it works, and now you're staring at per-minute costs that scale poorly. Or you want to use Claude or Gemini as the reasoning model instead of being locked to GPT. Or your enterprise customers need data residency options OpenAI doesn't offer.
Teams building voice agents from scratch. If you're starting a new voice project, you can either assemble a pipeline yourself or use a single realtime API that handles everything. Inworld covers both: the Realtime API ships the full pipeline in one endpoint, and the free
Agent Runtime gives you component-level control if you need custom processing between stages.
Enterprise teams with compliance requirements. Healthcare applications need HIPAA. Financial services need SOC2 and data residency. Inworld is the only realtime API on this list with full on-premise deployment, HIPAA with BAAs, SOC2 Type II, GDPR, and EU/India data residency. The other providers are cloud-only.
When a realtime API is the wrong tool: batch audio processing (transcribing recordings, generating audiobooks) and content production workflows where latency doesn't matter. These use cases are better served by standalone TTS and STT APIs.
How We Evaluated These APIs
Every API on this list meets the same baseline: a single WebSocket endpoint that accepts audio input and returns audio output in a streaming, bidirectional connection. Beyond that, we evaluated on six criteria:
TTS quality. We referenced the
Artificial Analysis Speech Arena. Listeners compare speech samples side-by-side without knowing which model produced them. ELO scores from this benchmark are the most objective quality signal available.
Latency. Specifically P90 time-to-first-audio: the delay between the end of your speech and the first audible output frame. Averages hide tail latency. P90 shows what your users actually experience.
Protocol compatibility. Does the API follow the OpenAI Realtime event schema? If so, your existing client code transfers with minimal changes. Proprietary protocols mean starting from scratch.
Model flexibility. Can you swap the underlying LLM, TTS, or STT? Or are you locked to the provider's own models?
Transport options. WebSocket only, or WebSocket plus WebRTC? WebRTC is built for browsers and handles unstable network conditions automatically. WebSocket gives you more control for server-side applications.
Pricing model. Per-minute flat rate, per-token, or per-character? The structure determines how your costs scale.
The 4 Best OpenAI Realtime API Alternatives in 2026
1. Inworld Realtime API
#1-ranked TTS at 1/20th the cost of ElevenLabs. The TTS powering the audio output is
Inworld TTS 1.5 Max. It holds the #1 position on the Artificial Analysis Speech Arena with an ELO of 1,236 as of March 2026. Three of the top four models on that leaderboard belong to Inworld. The ranking comes from blind preference testing across thousands of listener comparisons, not self-reported metrics. At $10 per million characters, that's roughly 20x cheaper than ElevenLabs Multilingual v2 for a higher-ranked model.
Full pipeline coverage in one endpoint. Inworld's
Realtime API handles the full voice agent pipeline in one place: STT, LLM, TTS, VAD, turn-taking, and interruption handling. Audio goes in over WebSocket or WebRTC. Audio comes back. No middleware required. You can
build a working voice agent in minutes.
Model flexibility across 10+ LLM providers. This is the biggest differentiator. Inworld's
Router gives you unified access to OpenAI, Anthropic, Google, Mistral, xAI, Cerebras, Fireworks, Groq, DeepInfra, and Tenstorrent through a single API key. You can swap the reasoning model mid-session without changing your integration code. You can A/B test Claude against GPT against Gemini and measure the impact on user outcomes. No other realtime API on this list offers model-level experimentation built into the platform.
Automatic interruption handling and semantic VAD. Setting interrupt_response: true enables barge-in. The agent stops speaking and begins processing new input when the user talks over it. The semantic VAD listens to what you're saying, not just whether you've gone silent, to decide when you're done talking. You control the tradeoff between fast responses and premature cutoffs with a configurable eagerness parameter (low, medium, high).
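A minimal sketch of what configuring these knobs might look like in a session.update event. The event name follows the OpenAI Realtime schema; the exact nesting of the interruption and semantic VAD fields is an assumption here — check Inworld's documentation for the authoritative shape:

```python
import json

# Hypothetical session.update payload sketching the parameters named above.
# Field nesting is an assumption, not Inworld's documented schema.
session_update = {
    "type": "session.update",
    "session": {
        "interrupt_response": True,      # enable barge-in
        "turn_detection": {
            "type": "semantic_vad",      # decide end-of-turn from meaning,
            "eagerness": "medium",       # not just silence: low | medium | high
        },
    },
}

message = json.dumps(session_update)  # sent as a text frame over the WebSocket
```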
WebSocket and WebRTC as first-class transports. WebRTC is built for voice in the browser. It handles unstable connections, adapts audio quality on the fly, and works through firewalls without extra configuration. WebSocket works for server-side orchestration and telephony bridges. Both transports share the same event model. You can run WebRTC on the client and WebSocket on the backend without maintaining separate codepaths.
Drop-in OpenAI migration. The API follows the
OpenAI Realtime protocol. Events like
session.update,
input_audio_buffer.append,
response.create, and
response.done work with the same semantics. If you're already running on OpenAI's Realtime API, you migrate by changing the endpoint URL and API key. Inworld extends the base protocol with router support, semantic VAD configuration, and dynamic session updates. None of that breaks compatibility with existing OpenAI-shaped clients.
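Because the event schema is shared, a migration mostly amounts to swapping constants. The sketch below builds the same OpenAI-shaped event sequence either endpoint would accept; the Inworld URL shown is a placeholder, not a documented endpoint:

```python
import json

# Migration sketch: the event flow is identical, only the endpoint and key
# change. The Inworld URL below is a hypothetical placeholder.
OPENAI_URL = "wss://api.openai.com/v1/realtime"
INWORLD_URL = "wss://api.example-inworld-endpoint.ai/v1/realtime"  # placeholder

def realtime_events(audio_b64):
    """The same OpenAI-shaped event sequence works against either endpoint."""
    return [
        {"type": "session.update", "session": {"voice": "default"}},
        {"type": "input_audio_buffer.append", "audio": audio_b64},
        {"type": "response.create"},
    ]

frames = [json.dumps(event) for event in realtime_events("UklGRg==")]
```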
Best for: Teams that want the highest-ranked TTS and multi-model flexibility in a single realtime API. Strongest fit for teams migrating from OpenAI, since the protocol is compatible and your client code transfers directly.
Pros:
- #1 ranked TTS on Artificial Analysis (ELO 1,236) at $10/1M characters
- Model-agnostic: 10+ LLM providers via Router with automatic failover and A/B testing
- Semantic VAD with configurable eagerness
- WebSocket and WebRTC with shared event model
- Built-in observability, telemetry, and per-stage latency tracing for production debugging
- OpenAI protocol compatible with documented migration path
- SOC2 Type II, HIPAA, GDPR compliance with on-premise deployment option
- Free Agent Runtime for teams that want pipeline-level control alongside the Realtime API
Cons:
- Realtime API is in research preview, not yet GA. For production-critical deployments with zero tolerance for breaking changes, this matters.
- 15 languages supported versus 50+ (OpenAI) or 90+ (Google). If your application requires broad multilingual coverage today, this is a limitation.
Pricing: TTS at $10/1M characters (~$0.01/min). LLM costs pass through at provider rates. Agent Runtime is free.
Full pricing details.
2. Google Gemini 3.1 Flash Live
Google
launched Gemini 3.1 Flash Live on March 26, 2026. It's natively multimodal. Audio goes directly into the model and audio comes directly out. No separate STT or TTS stage. The model processes speech, reasons over it, and generates a spoken response in one pass.
The architecture is different from modular approaches. Gemini 3.1 Flash Live is built on
Gemini 3 Pro and accepts audio, images, video, and text as inputs with a 128K token context window. The Live API uses a bidirectional WebSocket connection. Raw 16-bit PCM audio at 16kHz goes in. Raw PCM audio comes back.
The benchmark results are strong. It scores 90.8% on ComplexFuncBench Audio (multi-step function calling via voice) and 36.1% on Scale AI's Audio MultiChallenge (instruction following during interruptions and background noise). It supports
90+ languages for realtime conversations.
The trade-off is lock-in. Flash Live uses Google's proprietary Live API protocol, not the OpenAI event schema. If you're migrating from OpenAI, you're rewriting your client code. You're also locked to Google's models for reasoning, TTS, and STT. No swapping in Claude or a different TTS engine.
Best for: Teams already on Google Cloud who want native multimodal speech-to-speech without managing separate pipeline components. Strong for applications requiring broad language coverage or deep GCP integration (Dialogflow, Contact Center AI, Vertex AI).
Pros:
- Natively multimodal: no pipeline stages, lower theoretical latency
- 90+ languages
- 90.8% on ComplexFuncBench Audio (voice-based function calling)
- 128K context window
- Generous free tier (no credit card required for Google AI Studio)
Cons:
- Proprietary protocol. Not OpenAI-compatible. Migration from OpenAI means rewriting client code.
- Locked to Google's models. No swapping LLM, TTS, or STT providers.
- WebSocket only. No WebRTC support documented at launch.
- Preview status. Not yet GA.
- Per-turn billing re-processes accumulated context tokens, which can surprise you on cost for long conversations.
Pricing: Token-based. Audio input at $1.00/1M tokens, text output at $3.00/1M tokens via the
Gemini API. Free tier available for development. Per-turn billing means costs accumulate across conversation turns as context grows.
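The per-turn compounding is worth internalizing. Each turn re-sends the accumulated context, so billed input tokens grow faster than conversation length. The token count per turn below is an assumed illustrative figure:

```python
# Why per-turn billing compounds: every turn re-processes all prior context.
# TOKENS_PER_TURN is an illustrative assumption, not a measured figure.
INPUT_PRICE_PER_TOKEN = 1.00 / 1_000_000  # $1.00 per 1M audio input tokens
TOKENS_PER_TURN = 500                     # assumed audio tokens added per turn

def per_turn_billed_cost(turns):
    """Each turn t bills the full context accumulated so far (t * TOKENS_PER_TURN)."""
    total_tokens = sum(TOKENS_PER_TURN * t for t in range(1, turns + 1))
    return total_tokens * INPUT_PRICE_PER_TOKEN

# A 10-turn conversation bills 27,500 input tokens, not the 5,000 tokens
# actually spoken — a 5.5x multiplier that keeps growing with length.
```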
3. xAI Grok Voice Agent API
xAI
launched the Grok Voice Agent API in December 2025. The same stack powers Grok Voice in millions of Tesla vehicles and the Grok mobile app. It was battle-tested at scale before the API went public.
The API follows the
OpenAI Realtime protocol. WebSocket connection at
wss://api.x.ai/v1/realtime, same event schema. If you've built on OpenAI, your client code transfers with minimal changes.
The pricing is aggressive:
$0.05 per minute flat. No token math. No separate input/output rates. A 10-minute voice agent conversation costs $0.50 total. The same conversation on OpenAI's Realtime API might run $2-3 depending on output length.
Built-in tools are a differentiator. The API natively supports
web search and X (Twitter) search as first-party tools. The model can invoke them mid-conversation without custom function definitions. If your voice agent needs to look up real-time information during a call, you don't need to build that integration yourself.
The voice quality is solid but not independently benchmarked on Artificial Analysis. xAI claims the API ranks #1 on Big Bench Audio, but that measures reasoning capability, not TTS quality. You get 5 voice options with expressive tags for controlling delivery style. No voice cloning.
Best for: Teams that want the cheapest flat-rate realtime voice API with OpenAI protocol compatibility. Strong for applications that benefit from built-in web and X search capabilities during conversations.
Pros:
- $0.05/min flat rate. Simplest, most predictable pricing in this comparison.
- OpenAI protocol compatible
- Built-in web search and X search as native tools
- 100+ languages with automatic detection and mid-conversation switching
- Battle-tested stack (Tesla, Grok mobile app)
Cons:
- Locked to Grok models. No swapping in a different LLM.
- WebSocket only. No WebRTC support.
- 5 preset voices. No voice cloning.
- TTS quality not independently ranked on Artificial Analysis Speech Arena.
- Limited voice customization compared to Inworld's audio markup or Hume's natural language control.
Pricing: $0.05/min connection time, flat rate.
Pricing details.
4. Hume EVI 3
Hume shipped EVI 3 as a speech language model that understands and generates emotion natively. The model doesn't just process words. It analyzes prosody, tone, and emotional cues in your voice and adjusts its delivery accordingly.
The voice control is the standout feature. You describe the voice you want in plain English: "Sound hesitant, like someone delivering bad news." Or: "Speak with warm enthusiasm, like a favorite teacher." No SSML tags. No audio markup syntax. The model interprets the instruction and generates speech that matches. You can also design entirely new voices from text descriptions without providing any audio sample. No other provider offers that.
EVI 3 targets sub-300ms end-to-end latency. It supports voice cloning from under 30 seconds of audio. You can plug in Claude, GPT, Gemini, or your own custom model as a supplementary reasoning engine. EVI 3 generates the initial response while the heavier LLM processes in parallel, then integrates the LLM's output once it's ready.
The trade-off is specialization. Hume is built for applications where emotional nuance is the product, not just a feature. AI companions, therapy bots, coaching apps, social experiences. For transactional voice agents where the goal is speed and accuracy, EVI's emotion processing adds overhead the use case doesn't need.
Best for: Applications where emotional intelligence and voice expressiveness are core differentiators: AI companions, mental health, coaching, social AI, gaming characters.
Pros:
- Natural language voice control and voice design from text descriptions
- Emotional prosody analysis of user input
- Sub-300ms end-to-end latency
- Interoperable with external LLMs (Claude, GPT, Gemini, custom)
- Voice cloning from under 30 seconds of audio
- 200K+ designed voices on platform
Cons:
- 11 languages at launch (20+ expansion announced but not yet shipped)
- Ranks 34th on the Artificial Analysis Speech Arena, well below the leaders in blind TTS quality testing
- Subscription-based pricing with per-minute overage charges adds billing complexity
- For transactional voice agents, the emotion processing adds overhead without proportional value
Pricing: Tiered subscriptions from free (5 EVI minutes/month) to Business ($500/month for 12,500 EVI minutes). Pro plan overage at $0.06/min. Enterprise pricing is custom.
Full pricing.
Overview of OpenAI Realtime API
OpenAI's
Realtime API reached GA in August 2025 with the
gpt-realtime model. It's the most mature option here. Its event schema has become the protocol other vendors build against.
The API is natively multimodal. Audio flows directly into the model without a separate STT step. The model reasons over the audio and generates both text and audio responses. Fewer moving parts. In theory, lower latency.
WebSocket and WebRTC are both supported. The gpt-realtime model improved on its predecessor with better audio quality, stronger instruction following, and more reliable function calling (66.5% on ComplexFuncBench Audio, up from 49.7%). It now supports image inputs alongside audio and MCP server integration for tool calling.
The limitation hasn't changed since launch: you're locked to OpenAI. The LLM is OpenAI's. The TTS is OpenAI's. The STT is OpenAI's. If you want Claude for reasoning, or Inworld's TTS for voice quality, or a specialized STT for domain-specific accuracy, you can't use them. The pricing reflects that bundled nature: roughly $0.06/min for audio input and $0.24/min for audio output. That adds up quickly at scale.
Best for: Teams building directly on OpenAI's models who want the most mature production track record in this category.
Pros:
- Established realtime voice API. GA since August 2025.
- Natively multimodal. No pipeline stages.
- WebSocket + WebRTC
- Image inputs alongside audio (gpt-realtime)
- MCP server support for tool calling
- 50+ languages
- Natural language voice instructions via gpt-4o-mini-tts
Cons:
- Locked to OpenAI models. No LLM, TTS, or STT swapping.
- TTS quality ranks below Inworld on the Artificial Analysis Speech Arena, where Inworld holds 3 of the top 4 positions
- Audio output pricing at ~$0.24/min is the most expensive option in this comparison
- No voice cloning (13 preset voices, custom voices only via enterprise agreement)
- No on-premise deployment
Pricing: $32/1M audio input tokens (~$0.06/min), $64/1M audio output tokens (~$0.24/min). Text tokens billed separately.
Full pricing.
Comparison Table
| API | Architecture | Protocol | Transports | Languages | TTS Ranking (Artificial Analysis) | Pricing | Best For |
|---|---|---|---|---|---|---|---|
| Inworld | Modular (STT+LLM+TTS) | OpenAI-compatible | WebSocket + WebRTC | 15 | #1 (ELO 1,236) | ~$0.01/min TTS + LLM pass-through | Model flexibility + top TTS quality |
| Gemini Flash Live | Natively multimodal | Google proprietary | WebSocket | 90+ | Not ranked | Token-based (~$1-3/1M tokens) | Multilingual + GCP integration |
| Grok Voice Agent | In-house full stack | OpenAI-compatible | WebSocket | 100+ | Not ranked | $0.05/min flat | Cheapest flat rate + built-in search |
| Hume EVI 3 | Speech language model | Hume proprietary | WebSocket | 11 | 34th | $0.06/min (Pro overage) | Emotional AI + companion apps |
| OpenAI Realtime | Natively multimodal | OpenAI (original) | WebSocket + WebRTC | 50+ | 12th | ~$0.30/min blended | Tightest OpenAI model integration |
Why Inworld Is the Strongest Alternative
The OpenAI Realtime API proved that speech-to-speech should be a single API call. Every alternative on this list agrees. The question is which one gives you the most capability per dollar without recreating the lock-in problem.
Inworld's answer rests on three axes. Inworld TTS is ranked #1 on the Artificial Analysis Speech Arena at $10 per million characters. That's 20x cheaper than ElevenLabs and roughly 1.5x cheaper than OpenAI for a higher-quality model. The model flexibility is unmatched: 10+ LLM providers through a single API key, with automatic failover, A/B testing, and intelligent routing built in. And the OpenAI protocol compatibility means you can migrate without rewriting your client code.
Gemini Flash Live is the strongest option for multilingual applications. Grok wins on pricing simplicity. Hume owns the emotional AI niche. OpenAI still has the longest production track record. But if you want top-ranked voice quality, multi-provider model flexibility, and a migration path that doesn't require starting over,
Inworld is the clear choice.
FAQs
What is a realtime voice API?
A realtime voice API accepts audio input and returns audio output through a persistent streaming connection. Usually a WebSocket. It handles speech recognition, turn detection, language model inference, and text-to-speech synthesis server-side. You don't need to wire together separate services for each stage. The pipeline stages overlap rather than running sequentially, so latency stays low enough for the exchange to feel natural.
How is a realtime API different from chaining STT + LLM + TTS?
A chained pipeline processes each stage one at a time. Transcribe the full utterance. Send the text to an LLM. Wait for the complete response. Synthesize the full audio clip. Play it back. Each handoff adds latency. A realtime API overlaps these stages. The TTS engine starts synthesizing from the first tokens of the LLM response while the model is still generating the rest. It also handles VAD, turn detection, and interruption recovery server-side. In a chained pipeline, you'd build all of that yourself.
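The overlap can be sketched with a toy generator pipeline. Everything here is mocked — real TTS consumes token streams and emits audio bytes, but the control flow is the point:

```python
# Toy sketch of stage overlap: TTS starts synthesizing from the first LLM
# tokens instead of waiting for the full response. All values are mocked.
def llm_tokens():
    """Stand-in for a streaming LLM response, one token at a time."""
    yield from ["Sure, ", "your ", "order ", "ships ", "today."]

def tts_chunks(tokens, chunk_words=2):
    """Emit an audio 'chunk' every few tokens, so playback starts immediately."""
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if len(buffer) == chunk_words:
            yield "".join(buffer)  # stand-in for synthesized audio bytes
            buffer = []
    if buffer:
        yield "".join(buffer)

chunks = list(tts_chunks(llm_tokens()))
# First chunk is ready after two tokens — not after the complete reply.
```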
Can I migrate from OpenAI's Realtime API to Inworld?
Yes. Inworld's Realtime API follows the OpenAI event schema. Events like
session.update,
input_audio_buffer.append,
response.create, and
response.done work with the same semantics. Inworld publishes a
migration guide documenting the process. In practice, you change the WebSocket endpoint URL and API key, then optionally configure Inworld-specific extensions like semantic VAD, router settings, and model selection.
Is Inworld's Realtime API better than Gemini Flash Live?
They serve different needs. Gemini Flash Live processes audio natively inside a multimodal model. That eliminates pipeline stages but locks you to Google's models and a proprietary protocol. Inworld uses a modular architecture with the #1-ranked TTS on Artificial Analysis (ELO 1,236), OpenAI protocol compatibility, and access to 10+ LLM providers. If you need more than 15 languages, Gemini's 90+ language support is a clear advantage. If you want model flexibility, top-ranked voice quality, and the ability to migrate without rewriting client code, Inworld is the stronger fit.
What latency should I expect from a realtime voice API?
A well-optimized pipeline typically achieves 500-800ms end-to-end for natural-feeling conversations. Time-to-first-audio is the most important metric. Inworld TTS-1.5 Mini delivers under 130ms P90. Inworld TTS-1.5 Max delivers under 250ms P90. These are end-to-end measurements including network overhead, not inference-only numbers. Ask vendors for P90 benchmarks rather than averages.
When should I use WebRTC vs WebSocket?
Use WebRTC when your voice agent runs in a browser or mobile app over variable network conditions. It adapts to connection quality automatically and works through firewalls without extra setup. Use WebSocket when you're connecting through a telephony bridge, running behind a proxy, or need fine-grained control over every message in the session. Inworld and OpenAI support both transports with the same event model. Gemini, Grok, and Hume are WebSocket only.
What's the cheapest realtime voice API?
Grok at $0.05/min flat has the simplest billing. But cheapest per minute and best value are different questions. Inworld TTS runs at roughly $0.01/min, and it's ranked #1 on Artificial Analysis. Grok's TTS isn't independently ranked on that leaderboard. If you're optimizing for cost per quality minute, Inworld delivers the highest-ranked voice output at the lowest TTS cost in this comparison. OpenAI's blended cost of roughly $0.30/min is the most expensive. Gemini's token-based pricing falls in between but is harder to predict because per-turn billing reprocesses accumulated context.
Do I need a realtime API, or can I build my own pipeline?
Inworld supports both approaches. The
Inworld Realtime API handles STT, LLM, TTS, VAD, interruption handling, and turn-taking in a single WebSocket or WebRTC connection. You send audio, you get audio back. Most teams ship faster this way because they skip the weeks of infrastructure work that go into wiring pipeline stages together.
If you need more control, Inworld's
Agent Runtime is a free orchestration layer that lets you build custom pipelines as directed acyclic graphs. You can wire together specific STT, LLM, and TTS nodes, add custom processing logic between stages, run parallel LLM evaluations, and deploy to cloud with
inworld deploy. The Runtime uses the same Inworld TTS (#1 on Artificial Analysis) and the same Inworld Router (10+ LLM providers) as the Realtime API, so you get component-level flexibility without giving up model quality or multi-provider access.
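Conceptually, a pipeline-as-DAG looks like the sketch below: named nodes with declared dependencies, resolved in order, with custom logic (here, a redaction step) slotted between stages. The node names and execution API are purely illustrative, not Inworld's actual SDK:

```python
# Conceptual pipeline-as-DAG, in the spirit of a graph-based runtime.
# Node names and this mini-executor are illustrative, not a real SDK.
GRAPH = {
    "stt":    {"deps": [],         "fn": lambda x: f"text({x})"},
    "redact": {"deps": ["stt"],    "fn": lambda x: x.replace("ssn", "***")},
    "llm":    {"deps": ["redact"], "fn": lambda x: f"reply({x})"},
    "tts":    {"deps": ["llm"],    "fn": lambda x: f"audio({x})"},
}

def run(graph, node, inp):
    """Resolve this node's dependency chain, then apply its function."""
    deps = graph[node]["deps"]
    upstream = run(graph, deps[0], inp) if deps else inp
    return graph[node]["fn"](upstream)

out = run(GRAPH, "tts", "mic_audio")
# "audio(reply(text(mic_audio)))" — each stage wraps the previous one's output
```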
Frameworks like
Pipecat and
LiveKit Agents also support custom pipelines, but they require you to bring your own models and manage your own infrastructure. Inworld gives you both a managed realtime API and a self-hostable pipeline toolkit from the same vendor.