Realtime API

Controllable speech-to-speech that understands, reasons, and interacts

The only Realtime API where STT voice profile, LLM steering, and Realtime TTS-2 expressive output run as one WebSocket call. Sub-second latency, hundreds of LLMs, native [steering] tags and non-verbal cues rendered inline.

Get Started View Docs

Ready to build? Get your API key in under a minute.

<1s

Latency

Hundreds

of Models

Ranked Quality

Every configuration, one session

Pick any LLM for the conversation engine. Swap providers without changing your integration.

// Configure your realtime session
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "anthropic/claude-sonnet-4-6",
    "instructions": "You are a helpful voice agent.",
    "output_modalities": ["audio", "text"],
    "audio": {
      "output": {
        "model": "inworld-tts-2",
        "voice": "Sarah"
      }
    }
  }
}));

// Register tools for agentic use cases
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "openai/gpt-5.2",
    "tools": [{
      "type": "function",
      "name": "get_booking",
      "description": "Look up a reservation",
      "parameters": {
        "type": "object",
        "properties": {
          "confirmation_id": { "type": "string" }
        }
      }
    }]
  }
}));

// Configure your realtime session
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "anthropic/claude-sonnet-4-6",
    "instructions": "You are a helpful voice agent.",
    "output_modalities": ["audio", "text"],
    "audio": {
      "output": {
        "model": "inworld-tts-2",
        "voice": "Sarah"
      }
    }
  }
}));

// Register tools for agentic use cases
ws.send(JSON.stringify({
  "type": "session.update",
  "session": {
    "type": "realtime",
    "modelId": "openai/gpt-5.2",
    "tools": [{
      "type": "function",
      "name": "get_booking",
      "description": "Look up a reservation",
      "parameters": {
        "type": "object",
        "properties": {
          "confirmation_id": { "type": "string" }
        }
      }
    }]
  }
}));

Every configuration, one session

Pick any LLM for the conversation engine. Swap providers without changing your integration.

Sub-second response time

Optimized data flow delivers end-to-end speech-to-speech latency under one second. Voice agents respond with human-level cadence.

Optimized STT, LLM, and TTS pipeline for the best latency and quality.
Full-duplex audio streaming over WebSocket or WebRTC

Experience it live

Realtime API

<1s

Speech-to-speech latency

STT

200ms

LLM

400ms

TTS

180ms

Sub-second response time

Optimized data flow delivers end-to-end speech-to-speech latency under one second. Voice agents respond with human-level cadence.

Optimized STT, LLM, and TTS pipeline for the best latency and quality.
Full-duplex audio streaming over WebSocket or WebRTC

Experience it live

Realtime API

<1s

Speech-to-speech latency

STT

200ms

LLM

400ms

TTS

180ms

Intelligent turn taking

Context-aware semantic VAD with adjustable eagerness. The agent knows when to listen, when to speak, and when a user is interrupting.

Semantic VAD detects intent boundaries, not just silence
Adjustable eagerness from cautious to aggressive
Graceful barge-in handling — no awkward overlaps or cut-offs

Try it in Playground

Live call

User

Hi I’d like to order 12 iced teas…

User

… I mean two taro bobas

Agent

Two taro bubble teas coming up!

Live call

User

Hi I’d like to order 12 iced teas…

User

… I mean two taro bobas

Agent

Two taro bubble teas coming up!

Intelligent turn taking

Context-aware semantic VAD with adjustable eagerness. The agent knows when to listen, when to speak, and when a user is interrupting.

Semantic VAD detects intent boundaries, not just silence
Adjustable eagerness from cautious to aggressive
Graceful barge-in handling — no awkward overlaps or cut-offs

Try it in Playground

Conversational intelligence

The inworld/inworld-stt-1 model emits a voice profile — emotion, vocal style, accent, age, pitch — alongside every transcript chunk. Those signals land in the LLM as structured context, the LLM emits Realtime TTS-2 [steering] tags inline, and Realtime TTS-2 renders the response with matching prosody and non-verbal cues.

inworld/inworld-stt-1 emits 5 paralinguistic signals per audio chunk with confidence scores
Voice profile flows into LLM context; LLM emits inline [Speak softly] / [sigh] tags
Realtime TTS-2 consumes the tags and renders expressive audio — no prompt engineering required

Hear the difference

Per audio chunk

Emotion

Frustrated

92%

Age

25–34

87%

Accent

British

94%

Rate

Fast

89%

→ Injected into LLM and TTS context

Conversational intelligence

inworld/inworld-stt-1 emits 5 paralinguistic signals per audio chunk with confidence scores
Voice profile flows into LLM context; LLM emits inline [Speak softly] / [sigh] tags
Realtime TTS-2 consumes the tags and renders expressive audio — no prompt engineering required

Hear the difference

Per audio chunk

Emotion

Frustrated

92%

Age

25–34

87%

Accent

British

94%

Rate

Fast

89%

→ Injected into LLM and TTS context

Voice profile steers Realtime TTS-2 in Realtime

The whole pipeline lives in one WebSocket: STT emits a voice profile per chunk, the LLM emits Realtime TTS-2 [steering] tags inline in its response, Realtime TTS-2 renders the response with matching prosody and non-verbal cues like [sigh] and [laugh]. No prompt engineering, no second model call.

Realtime TTS-2 [Speak softly] / [whisper] / [excited] tags interpreted natively
Non-verbal cues — [sigh], [laugh], [hmm] — rendered as real audio events
End-to-end TTFT stays sub-second; the whole pipeline runs as one round-trip

Hear it on Realtime TTS-2

STT voice profile (inworld/inworld-stt-1)

emotion

sad

style

soft

pitch

low

LLM response with Realtime TTS-2 markup

[Speak softly] I’m so sorry to hear that. [sigh] Let’s figure this out together.

Realtime TTS-2 audio out

voice: Sarah · model: inworld-tts-2

~600ms TTFT • one WebSocket round-trip

STT voice profile (inworld/inworld-stt-1)

emotion

sad

style

soft

pitch

low

LLM response with Realtime TTS-2 markup

[Speak softly] I’m so sorry to hear that. [sigh] Let’s figure this out together.

Realtime TTS-2 audio out

voice: Sarah · model: inworld-tts-2

~600ms TTFT • one WebSocket round-trip

Voice profile steers Realtime TTS-2 in Realtime

Realtime TTS-2 [Speak softly] / [whisper] / [excited] tags interpreted natively
Non-verbal cues — [sigh], [laugh], [hmm] — rendered as real audio events
End-to-end TTFT stays sub-second; the whole pipeline runs as one round-trip

Hear it on Realtime TTS-2

Provider agnostic, full control

Route to hundreds of LLMs, choose your STT engine, and access custom Inworld voices — all from a single session. Swap any component at any time.

OpenAI, Anthropic, Google, Groq, Mistral, xAI, and more
Choose STT provider independently of LLM
Access to all Inworld built-in voices as well as your cloned and custom voices

Configure and try

session.json

{
"type": "realtime",
"modelId": "anthropic/claude-sonnet-4-6",
"stt": { "model": "inworld/stt-1" },
"audio": { "voice": "Sarah" }
}

Provider agnostic, full control

Route to hundreds of LLMs, choose your STT engine, and access custom Inworld voices — all from a single session. Swap any component at any time.

OpenAI, Anthropic, Google, Groq, Mistral, xAI, and more
Choose STT provider independently of LLM
Access to all Inworld built-in voices as well as your cloned and custom voices

Configure and try

session.json

{
"type": "realtime",
"modelId": "anthropic/claude-sonnet-4-6",
"stt": { "model": "inworld/stt-1" },
"audio": { "voice": "Sarah" }
}

Fluent tool calling for agents

Register tools at session start or add them on the fly. The assistant calls your functions mid-conversation without breaking the audio stream.

Declare tools once — the agent invokes them when needed
Utilize our built-in web search as well as any custom tool you define.
Audio stays open while tools execute and results stream back

Build with it

Voice Stream

Realtime API

Your Tools

get_booking()

result streamed back

update_crm()

confirmed

check_weather()

result streamed back

Audio stays open throughout

Voice Stream

Realtime API

Your Tools

get_booking()

result streamed back

update_crm()

confirmed

check_weather()

result streamed back

Audio stays open throughout

Fluent tool calling for agents

Register tools at session start or add them on the fly. The assistant calls your functions mid-conversation without breaking the audio stream.

Declare tools once — the agent invokes them when needed
Utilize our built-in web search as well as any custom tool you define.
Audio stays open while tools execute and results stream back

Build with it

Use cases

The Realtime API powers any application where voice is the primary interface.

Voice agents & customer support

Build voice agents that handle appointment booking, order status, account management, and escalation. Sub-second responses keep callers engaged and reduce handle time.

Inworld Realtime API vs OpenAI Realtime API

Drop-in compatible with the OpenAI Realtime API. More flexible, more models, better pricing.

Capability	Realtime API	OpenAI Realtime
OpenAI SDK compatible
Sub-second latency
LLM choice	Hundreds of models	GPT-4o only
TTS quality	#1 ranked TTS on Artificial Analysis	Built-in only
Custom voices	Built-in + cloned + custom	6 preset voices
Function calling
Semantic turn detection
Conversational intelligence	Emotion, age, accent
Transport options	WebSocket, WebRTC	WebSocket, WebRTC
Pricing (per minute)	From $0.015/min	From $0.06/min
Provider lock-in	None — swap models anytime	OpenAI only

FAQ

Realtime TTS-2 [steering] tags — like [Speak softly], [whisper], or non-verbal cues like [sigh] and [laugh] — are emitted by the LLM inline in its response text and consumed by Realtime TTS-2 in the same response stream. You don't wire anything by hand: when the Realtime session is configured with a Realtime TTS-2 voice, tags in the assistant text are parsed off and rendered as expressive audio events, and the surrounding text is spoken with the requested prosody. See the Realtime TTS-2 launch post for the full tag inventory.

Realtime TTS-2 adds a small TTFT delta over TTS 1.5 — typically 50–100ms in our benchmarks — in exchange for native expressive output (steering tags, non-verbals) that would otherwise require a second model call or an SSML preprocessor. Because everything still runs as one Realtime WebSocket, end-to-end TTFT stays sub-second. The exact numbers depend on the LLM and voice you select; see the pricing + perf page for the current matrix.

Yes. The Realtime API is fully compatible with the OpenAI Realtime API, so you can migrate by simply swapping the endpoint and auth credentials. A full migration guide is available here.

When using the realtime API, you only pay for the underlying model usage. Rates for all models are available here. Inworld gives you built-in tools to manage costs, like capping response length, canceling responses early, and trimming conversation history, so you stay in full control of your spend.

The Realtime API supports the languages available through the underlying models you select.

By default, you can run up to 20 concurrent conversations, with up to 1,000 requests per second shared across them. Need more? Contact our team to discuss higher limits for your use case.

The Realtime API gives you access to hundreds of models from leading providers, such as OpenAI, Anthropic, Google, Mistral, xAI, and more. You can pick the best model for your application without being locked into a single provider.

WebSocket is currently publicly available, with WebRTC and SIP support in early access. Please reach out to our team if you’d like access.