Realtime conversation is becoming the primary way people engage with agents. To showcase what's possible with Inworld's latest voice model, Realtime TTS-2, Inworld partnered with Stream to build a flagship reference implementation using their open-source Vision Agents framework. The resulting 'Crashout Buddy' watches your face, hears your words, and shapes its delivery in real time based on how you actually feel in the moment.
This reference example can be adapted to many other use cases: professional coaching with realtime guidance, companion apps that notice context and environment, patient intake that reads verbal and non-verbal cues, 1:1 personalized customer experiences, and beyond.
What this means for realtime voice agents
With only a few lines of code, Realtime TTS-2 + Vision Agents lets an agent make users feel heard and keep them engaged:
Emotional context and conversational awareness become first-class inputs.
Whatever your agent knows about the user (sentiment in the transcript, signals from a vision model, sensor data, and so on) can be turned into a steering directive that shapes how the voice delivers the response, as the sketch after this list shows.
Multilingual deployment prevents a fragmented voice identity.
One voice across 100+ languages means a single agent persona for a global user base. No need for model swaps mid-conversation when a user switches languages.
Non-verbal sounds become part of the script.
Laughs, sighs, breaths, and pauses live inline alongside the words. The agent sounds like it's actually listening before responding.
Long conversations build on themselves.
Because Realtime TTS-2 carries conversational context forward, sustained interactions stop feeling like a series of disconnected scripts.
Sub-200ms latency unlocks interruption and barge-in patterns suitable for realtime agent loops.
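Here is a minimal sketch of that steering pattern in Python. The build_directive and steer_response helpers are hypothetical, but the [speak ...] tag format follows the delivery-tag example in the next section:

```python
# Hypothetical helpers: map perception signals to a steering directive
# and prefix it to the agent's reply. The [speak ...] tag format follows
# the delivery-tag example shown later in this post.

def build_directive(mood: str, energy: str) -> str:
    """Turn perception signals into a natural-language delivery tag."""
    return f"[speak {mood}, with {energy} energy]"

def steer_response(reply: str, mood: str, energy: str) -> str:
    """Prefix the reply with a directive so the voice model shapes
    its delivery to match the user's current state."""
    return build_directive(mood, energy) + reply

# e.g. a vision model reports that the user looks tired:
print(steer_response("I missed you. How was today?", "tired but warm", "low"))
# -> [speak tired but warm, with low energy]I missed you. How was today?
```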
“We've had early access to Inworld's Realtime TTS-2 for a few days and we're all blown away. The expressiveness, language steering and multi-lingual support are genuinely impressive. The subtle details like natural pausing make it hard to differentiate between AI and human.”
Neevash Ramdial · Vision Agents Lead, Stream
Voice steering, live
This is the same steering capability that powers Crashout Buddy: a delivery tag prefixed to the script changes how the voice renders the same line. For example, an end-of-day affection read, with lower energy and a gentle smile:

[speak tired but warm, like she just got home from a long day]I missed you. How was today?
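The tagged line travels to the synthesis endpoint as ordinary text, and non-verbal sounds can ride inline the same way. A hedged sketch of the request follows; the endpoint URL, auth scheme, field names, and tag spellings here are assumptions for illustration, so consult Inworld's API reference for the actual contract:

```python
# Assumptions: the endpoint URL, auth scheme, and JSON field names are
# illustrative stand-ins, not Inworld's confirmed API contract.
import requests

script = (
    "[speak tired but warm, like she just got home from a long day]"
    "[sigh] I missed you. How was today?"  # inline non-verbal tag; spelling assumed
)

resp = requests.post(
    "https://api.inworld.ai/tts/v1/voice",          # assumed endpoint
    headers={"Authorization": "Basic <API_KEY>"},   # assumed auth scheme
    json={
        "text": script,              # the delivery tag is part of the text
        "voiceId": "Ashley",         # placeholder voice name
        "modelId": "inworld-tts-2",  # assumed model identifier
    },
)
resp.raise_for_status()
audio = resp.json()  # assumed: response body carries the encoded audio
```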
Start building
Any use case can follow a simple pattern: take the user's video feed, run a lightweight perception model on it, use the results as context, and let an expressive voice model render the response with appropriate delivery. Vision Agents makes the orchestration simple. Inworld Realtime TTS-2 makes the voice interaction believable.
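Here is one turn of that loop as a framework-agnostic sketch. All four helpers are hypothetical stubs standing in for the real components: Vision Agents for perception and orchestration, an LLM for the reply, Realtime TTS-2 for synthesis:

```python
# All four helpers are hypothetical stubs; in a real deployment Vision
# Agents handles perception and orchestration, and Realtime TTS-2 renders
# the audio.

def perceive_frame(frame: bytes) -> str:
    """Stub: a lightweight vision model mapping a frame to a mood label."""
    return "tired but warm"

def transcribe(audio: bytes) -> str:
    """Stub: speech-to-text on the user's utterance."""
    return "I'm finally home."

def generate_reply(text: str, context: dict) -> str:
    """Stub: an LLM call that sees the perception result as context."""
    return "I missed you. How was today?"

def synthesize(script: str) -> bytes:
    """Stub: a Realtime TTS-2 call; the steering tag rides in the script."""
    print(f"TTS input: {script}")
    return b""

def agent_turn(video_frame: bytes, user_audio: bytes) -> bytes:
    mood = perceive_frame(video_frame)              # 1. perception as context
    reply = generate_reply(transcribe(user_audio),  # 2. reply grounded in it
                           context={"user_mood": mood})
    return synthesize(f"[speak {mood}]{reply}")     # 3. delivery matches mood

agent_turn(b"", b"")
# TTS input: [speak tired but warm]I missed you. How was today?
```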