Pick any LLM for the conversation engine. Swap providers without changing your integration.
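The snippets on this page assume an open socket named ws. A minimal connection sketch using Node's ws package follows; the endpoint URL, auth header, and INWORLD_API_KEY variable are placeholders, not documented values:

// Minimal sketch: open the realtime session socket.
// NOTE: the URL and auth header are assumptions, not documented values.
import WebSocket from "ws";

const ws = new WebSocket("wss://example.inworld.ai/v1/realtime", {
  headers: { Authorization: `Bearer ${process.env.INWORLD_API_KEY}` },
});

ws.on("open", () => {
  // Socket is ready; send session.update (below) to configure the session.
});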
// Configure your realtime session
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "anthropic/claude-sonnet-4-6",
    "instructions": "You are a helpful voice agent.",
    "output_modalities": ["audio", "text"],
    "audio": {
      "output": {
        "model": "inworld-tts-2",
        "voice": "Sarah"
      }
    }
  }
}));
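Swapping providers is then just another session.update with a different modelId. A sketch; the alternate model id follows the page's provider/model pattern but is illustrative:

// Swap the conversation engine without touching the rest of the pipeline.
// "openai/gpt-4o" is an illustrative id, not a documented value.
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "openai/gpt-4o"
  }
}));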
Optimized data flow delivers end-to-end speech-to-speech latency under one second. Voice agents respond with human-level cadence.
Context-aware semantic voice activity detection (VAD) with adjustable eagerness. The agent knows when to listen, when to speak, and when a user is interrupting.
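The page doesn't show the eagerness knob itself. A plausible shape, assuming an OpenAI-style turn_detection block nested like the audio config above; the field names and values are assumptions:

// Hypothetical sketch: tune how eagerly the agent takes its turn.
// Field names follow OpenAI's semantic VAD config and are assumptions here.
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "audio": {
      "input": {
        "turn_detection": { "type": "semantic_vad", "eagerness": "high" }
      }
    }
  }
}));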
The inworld/inworld-stt-1 model emits a voice profile — emotion, vocal style, accent, age, pitch — alongside every transcript chunk. Those signals land in the LLM as structured context, the LLM emits Realtime TTS-2 [steering] tags inline, and Realtime TTS-2 renders the response with matching prosody and non-verbal cues.
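The page doesn't show the event shape. A sketch of how those per-chunk signals might arrive; the event type and voice_profile field are assumptions, and only the signal list comes from the text above:

// Hypothetical sketch: read the voice profile attached to each transcript chunk.
// The event type and field names are assumptions; the signal list
// (emotion, vocal style, accent, age, pitch) comes from the page.
ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    const { emotion, style, accent, age, pitch } = event.voice_profile ?? {};
    console.log(event.delta, { emotion, style, accent, age, pitch });
  }
});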
The whole pipeline lives in one WebSocket: STT emits a voice profile per chunk, the LLM emits Realtime TTS-2 [steering] tags inline in its response, and Realtime TTS-2 renders that response with matching prosody and non-verbal cues like [sigh] and [laugh]. No prompt engineering, no second model call.
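Concretely, a single LLM turn might look like the string below before TTS-2 renders it; the sentence itself is illustrative:

// Illustrative only: inline steering tags as they'd appear in LLM output.
// TTS-2 renders the tags as prosody and non-verbal audio rather than speaking them.
const llmOutput =
  "[sigh] I checked twice, and the order really is delayed. " +
  "[laugh] On the bright side, your shipping is now free.";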
Route to hundreds of LLMs, choose your STT engine, and access custom Inworld voices — all from a single session. Swap any component at any time.

{
  "type": "realtime",
  "modelId": "anthropic/claude-sonnet-4-6",
  "stt": { "model": "inworld/stt-1" },
  "audio": { "voice": "Sarah" }
}
Register tools at session start or add them on the fly. The assistant calls your functions mid-conversation without breaking the audio stream.
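The page ships no tool example. A sketch assuming the OpenAI-style tool schema and event names implied by the SDK-compatibility claim; the get_order_status tool is hypothetical:

// Hypothetical sketch: register a function tool, then answer a call mid-stream.
// Shapes follow the OpenAI Realtime conventions this API claims compatibility with.
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "tools": [{
      "type": "function",
      "name": "get_order_status",
      "description": "Look up an order by id.",
      "parameters": {
        "type": "object",
        "properties": { "order_id": { "type": "string" } },
        "required": ["order_id"]
      }
    }]
  }
}));

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  if (event.type === "response.function_call_arguments.done") {
    const { order_id } = JSON.parse(event.arguments);
    // Run the real lookup here, then hand the result back to the conversation.
    ws.send(JSON.stringify({
      "type": "conversation.item.create",
      "item": {
        "type": "function_call_output",
        "call_id": event.call_id,
        "output": JSON.stringify({ order_id, status: "shipped" })
      }
    }));
    ws.send(JSON.stringify({ "type": "response.create" })); // resume the audio reply
  }
});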

| Capability | Realtime API | OpenAI Realtime |
|---|---|---|
| OpenAI SDK compatible | ✓ | ✓ |
| Sub-second latency | ✓ | |
| LLM choice | Hundreds of models | GPT-4o only |
| TTS quality | #1 ranked TTS on Artificial Analysis | Built-in only |
| Custom voices | Built-in + cloned + custom | 6 preset voices |
| Function calling | ✓ | ✓ |
| Semantic turn detection | ✓ | ✓ |
| Conversational intelligence | Emotion, age, accent | |
| Transport options | WebSocket, WebRTC | WebSocket, WebRTC |
| Pricing | From $0.015/min | From $0.06/min |
| Provider lock-in | None — swap models anytime | OpenAI only |
