Realtime API

Controllable speech-to-speech that understands, reasons, and interacts

The only Realtime API where STT voice profile, LLM steering, and Realtime TTS-2 expressive output run in a single WebSocket session. Sub-second latency, hundreds of LLMs, and native [steering] tags with non-verbal cues rendered inline.
<1s Latency · Hundreds of Models · #1 Ranked Quality

Every configuration, one session

Pick any LLM as the conversation engine. Swap providers without changing your integration.

// Configure your realtime session
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "anthropic/claude-sonnet-4-6",
    "instructions": "You are a helpful voice agent.",
    "output_modalities": ["audio", "text"],
    "audio": {
      "output": {
        "model": "inworld-tts-2",
        "voice": "Sarah"
      }
    }
  }
}));
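Swapping providers is a one-field change: only modelId moves, and the rest of the session is untouched. A minimal sketch, assuming the provider/model ID format shown above (the specific OpenAI ID here is illustrative):

// Swap the conversation engine to another provider: only modelId changes
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "openai/gpt-4o", // illustrative ID in the provider/model format above
    "instructions": "You are a helpful voice agent.",
    "output_modalities": ["audio", "text"],
    "audio": { "output": { "model": "inworld-tts-2", "voice": "Sarah" } }
  }
}));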

Sub-second response time

Optimized data flow delivers end-to-end speech-to-speech latency under one second. Voice agents respond with human-level cadence.

  • STT, LLM, and TTS pipeline optimized end to end for latency and quality
  • Full-duplex audio streaming over WebSocket or WebRTC
Experience it live
Realtime API speech-to-speech latency: <1s (STT 200ms · LLM 400ms · TTS 180ms)
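To sanity-check those numbers against your own stack, you can time the gap between committing input audio and the first audio delta coming back. A rough sketch; the event names follow the OpenAI Realtime conventions this API advertises compatibility with, so treat them as assumptions rather than a spec reference:

// Rough speech-to-speech TTFT probe
let committedAt = null;

ws.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);
  // Assumed OpenAI-style event name for streamed output audio
  if (msg.type === "response.output_audio.delta" && committedAt !== null) {
    console.log(`TTFT: ${Math.round(performance.now() - committedAt)}ms`);
    committedAt = null; // only time the first chunk of the response
  }
});

// ...stream microphone audio with input_audio_buffer.append, then:
ws.send(JSON.stringify({ "type": "input_audio_buffer.commit" }));
committedAt = performance.now();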

Intelligent turn-taking

Context-aware semantic VAD with adjustable eagerness. The agent knows when to listen, when to speak, and when a user is interrupting.

  • Semantic VAD detects intent boundaries, not just silence
  • Adjustable eagerness from cautious to aggressive
  • Graceful barge-in handling — no awkward overlaps or cut-offs
Try it in Playground
Live call
User: Hi I’d like to order 12 iced teas…
User: … I mean two taro bobas
Agent: Two taro bubble teas coming up!
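Turn detection is configured on the session. A hedged sketch of tuning eagerness, borrowing the semantic_vad shape from the OpenAI Realtime API this one is drop-in compatible with; the exact field names are assumed:

// Dial eagerness down for deliberate callers, up for snappier agents (fields assumed)
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "audio": {
      "input": {
        "turn_detection": { "type": "semantic_vad", "eagerness": "low" }
      }
    }
  }
}));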

Conversational intelligence

The inworld/inworld-stt-1 model emits a voice profile — emotion, vocal style, accent, age, pitch — alongside every transcript chunk. Those signals land in the LLM as structured context, the LLM emits Realtime TTS-2 [steering] tags inline, and Realtime TTS-2 renders the response with matching prosody and non-verbal cues.

  • inworld/inworld-stt-1 emits 5 paralinguistic signals per audio chunk with confidence scores
  • Voice profile flows into LLM context; LLM emits inline [Speak softly] / [sigh] tags
  • Realtime TTS-2 consumes the tags and renders expressive audio — no prompt engineering required
Hear the difference
Per audio chunk: Emotion: Frustrated (92%) · Age: 25–34 (87%) · Accent: British (94%) · Rate: Fast (89%)
→ Injected into LLM and TTS context
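Reading those signals off the stream could look like the sketch below. The event schema isn't documented on this page, so the event type and voice_profile field are hypothetical stand-ins:

// Hypothetical event shape: transcript deltas carrying the per-chunk voice profile
ws.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "conversation.item.input_audio_transcription.delta" && msg.voice_profile) {
    const { emotion, accent, rate } = msg.voice_profile; // assumed field names
    console.log(`emotion=${emotion.value} (${emotion.confidence})`,
                `accent=${accent.value}`, `rate=${rate.value}`);
  }
});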

Voice profile steers Realtime TTS-2 in real time

The whole pipeline lives in one WebSocket: STT emits a voice profile per chunk, the LLM emits Realtime TTS-2 [steering] tags inline in its response, and Realtime TTS-2 renders that response with matching prosody and non-verbal cues like [sigh] and [laugh]. No prompt engineering, no second model call.

  • Realtime TTS-2 [Speak softly] / [whisper] / [excited] tags interpreted natively
  • Non-verbal cues — [sigh], [laugh], [hmm] — rendered as real audio events
  • End-to-end TTFT stays sub-second; the whole pipeline runs as one round-trip
Hear it on Realtime TTS-2
01 · STT voice profile (inworld/inworld-stt-1): emotion: sad · style: soft · pitch: low
02 · LLM response with Realtime TTS-2 markup: "[Speak softly] I’m so sorry to hear that. [sigh] Let’s figure this out together."
03 · Realtime TTS-2 audio out: voice: Sarah · model: inworld-tts-2 · ~600ms TTFT · one WebSocket round-trip
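The tags ride inline in the assistant's text transcript while the audio already reflects them, so a text-only UI may want to strip them before display. A small sketch; the regex is ours, not an official parser, and assumes the bracket syntax shown above:

// Drop [steering] tags and non-verbal cues from a display transcript
function stripSteeringTags(text) {
  return text.replace(/\[[^\]]*\]/g, "").replace(/\s{2,}/g, " ").trim();
}

stripSteeringTags("[Speak softly] I’m so sorry to hear that. [sigh] Let’s figure this out together.");
// → "I’m so sorry to hear that. Let’s figure this out together."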

Provider agnostic, full control

Route to hundreds of LLMs, choose your STT engine, and access custom Inworld voices — all from a single session. Swap any component at any time.

  • OpenAI, Anthropic, Google, Groq, Mistral, xAI, and more
  • Choose STT provider independently of LLM
  • Access to all Inworld built-in voices as well as your cloned and custom voices
Configure and try
session.json
{
  "type": "realtime",
  "modelId": "anthropic/claude-sonnet-4-6",
  "stt": { "model": "inworld/stt-1" },
  "audio": { "voice": "Sarah" }
}
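Because each component is addressed independently, changing one is a one-field edit. A sketch with illustrative IDs for the swapped-in LLM and the cloned voice:

// Swap the LLM and point at a cloned voice; the STT engine stays put
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "openai/gpt-4o",             // illustrative: different LLM provider
    "stt": { "model": "inworld/stt-1" },
    "audio": { "voice": "my-cloned-voice" } // illustrative cloned-voice ID
  }
}));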

Fluent tool calling for agents

Register tools at session start or add them on the fly. The assistant calls your functions mid-conversation without breaking the audio stream.

  • Declare tools once — the agent invokes them when needed
  • Use our built-in web search as well as any custom tool you define
  • Audio stays open while tools execute and results stream back
Build with it
Voice Stream ⇄ Realtime API ⇄ Your Tools
get_booking() → result streamed back · update_crm() → confirmed · check_weather() → result streamed back
Audio stays open throughout
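Declaring a tool like the get_booking() above could use the OpenAI-style function schema, given the compatibility claim; the field layout below is assumed rather than taken from this page:

// Register a tool once at session start; the agent invokes it when needed
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "tools": [{
      "type": "function",
      "name": "get_booking",
      "description": "Look up a booking by confirmation code.",
      "parameters": {
        "type": "object",
        "properties": { "code": { "type": "string" } },
        "required": ["code"]
      }
    }]
  }
}));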

Use cases

The Realtime API powers any application where voice is the primary interface.

Inworld Realtime API vs OpenAI Realtime API

Drop-in compatible with the OpenAI Realtime API. More flexible, more models, better pricing.
Capability | Realtime API | OpenAI Realtime
OpenAI SDK compatible | ✓ | ✓
Sub-second latency | ✓ | ✓
LLM choice | Hundreds of models | GPT-4o only
TTS quality | #1 ranked TTS on Artificial Analysis | Built-in only
Custom voices | Built-in + cloned + custom | 6 preset voices
Function calling | ✓ | ✓
Semantic turn detection | ✓ | ✓
Conversational intelligence | Emotion, age, accent | —
Transport options | WebSocket, WebRTC | WebSocket, WebRTC
Pricing (per minute) | From $0.015/min | From $0.06/min
Provider lock-in | None (swap models anytime) | OpenAI only

FAQ

How do Realtime TTS-2 [steering] tags work in the Realtime API?
Realtime TTS-2 [steering] tags, like [Speak softly], [whisper], or non-verbal cues like [sigh] and [laugh], are emitted by the LLM inline in its response text and consumed by Realtime TTS-2 in the same response stream. You don't wire anything by hand: when the Realtime session is configured with a Realtime TTS-2 voice, tags in the assistant text are parsed off and rendered as expressive audio events, and the surrounding text is spoken with the requested prosody. See the Realtime TTS-2 launch post for the full tag inventory.

How much latency does Realtime TTS-2 add over TTS 1.5?
Realtime TTS-2 adds a small TTFT delta over TTS 1.5 (typically 50–100ms in our benchmarks) in exchange for native expressive output (steering tags, non-verbals) that would otherwise require a second model call or an SSML preprocessor. Because everything still runs as one Realtime WebSocket, end-to-end TTFT stays sub-second. The exact numbers depend on the LLM and voice you select; see the pricing + perf page for the current matrix.

Can I migrate from the OpenAI Realtime API?
Yes. The Realtime API is fully compatible with the OpenAI Realtime API, so you can migrate by swapping the endpoint and auth credentials. A full migration guide is available here.

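In code, the swap can be as small as the connection line. A sketch using the Node ws package; the endpoint below is a placeholder, since the real URL lives in the migration guide:

import WebSocket from "ws";

// Same OpenAI-style client code after this point; only the endpoint and key change
const ws = new WebSocket("wss://<inworld-realtime-endpoint>", {
  headers: { "Authorization": `Bearer ${process.env.INWORLD_API_KEY}` }
});
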
How does pricing work?
When using the Realtime API, you only pay for the underlying model usage. Rates for all models are available here. Inworld gives you built-in tools to manage costs, like capping response length, canceling responses early, and trimming conversation history, so you stay in full control of your spend.

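Those controls map to session and response settings. A hedged sketch using OpenAI-style names; max_output_tokens and response.cancel are assumptions based on the compatibility claim, not fields documented on this page:

// Cap response length at the session level (field name assumed)
ws.send(JSON.stringify({
  "type": "session.update",
  "session": { "max_output_tokens": 250 }
}));

// Cancel an in-flight response early to stop paying for unneeded output
ws.send(JSON.stringify({ "type": "response.cancel" }));
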
Which languages are supported?
The Realtime API supports the languages available through the underlying models you select.

What are the concurrency and rate limits?
By default, you can run up to 20 concurrent conversations, with up to 1,000 requests per second shared across them. Need more? Contact our team to discuss higher limits for your use case.

Which LLMs can I use?
The Realtime API gives you access to hundreds of models from leading providers, such as OpenAI, Anthropic, Google, Mistral, xAI, and more. You can pick the best model for your application without being locked into a single provider.

Which transports are supported?
WebSocket is currently publicly available, with WebRTC and SIP support in early access. Please reach out to our team if you’d like access.

Start building in minutes

Get an API key, open a WebSocket, stream audio.