Tavus has launched Raven-1, a multimodal perception system that enables real-time conversational AI to understand emotion, intent, and context by fusing audio and visual signals. Unlike traditional systems that flatten speech into text and emotion into rigid categories, Raven-1 produces natural language descriptions of tone, expression, and posture at sentence-level granularity—allowing AI to perceive not just what users say, but how they say it. The model is now generally available across Tavus conversations and APIs.
Tavus launches Raven-1, a multimodal perception system for conversational AI.
It fuses audio (tone, pacing, prosody) and visual (expression, posture, gaze) signals in real time.
Outputs interpretable natural language descriptions of emotional state and intent.
Audio perception latency is sub-100ms; combined pipeline latency stays under 600ms.
Supports custom tool calling for developer-defined events such as emotional thresholds (see the sketch after this list).
Available immediately across all Tavus APIs and Conversational Video Interface (CVI).
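To make the tool-calling item above concrete, here is a minimal sketch of what a developer-defined emotional threshold could look like. The announcement does not document the actual API shape, so every name here (PerceptionClient, PerceptionEvent, register_event) is a hypothetical stand-in, not the Tavus SDK.

```python
# Hypothetical sketch only: PerceptionClient, PerceptionEvent, and
# register_event are illustrative names, not the actual Tavus API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PerceptionEvent:
    """A natural-language perception observation from the model."""
    description: str   # e.g. "voice tightens, gaze drops; sounds discouraged"
    confidence: float  # 0..1

class PerceptionClient:
    """Toy stand-in for a perception stream that fires custom events."""
    def __init__(self) -> None:
        self._handlers: list[tuple[Callable[[PerceptionEvent], bool],
                                   Callable[[PerceptionEvent], None]]] = []

    def register_event(self, condition, handler) -> None:
        # Developer-defined event: a condition over perception output
        # plus the tool call to make when it trips.
        self._handlers.append((condition, handler))

    def emit(self, event: PerceptionEvent) -> None:
        for condition, handler in self._handlers:
            if condition(event):
                handler(event)

# Example: escalate to a human agent when frustration crosses a threshold.
client = PerceptionClient()
client.register_event(
    condition=lambda e: "frustrat" in e.description.lower() and e.confidence > 0.8,
    handler=lambda e: print(f"tool call -> escalate_to_human: {e.description}"),
)
client.emit(PerceptionEvent("user sounds frustrated; speech clipped and fast", 0.9))
```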
Conversational AI has made significant strides in language generation and speech synthesis, yet understanding remains a critical gap. Most systems rely on speech-to-text transcription, stripping away tone, hesitation, sarcasm, and emotional nuance. Without perception of how something is said, AI is forced to guess at intent—and those guesses fail precisely when accuracy matters most. Raven-1 addresses this by treating audio and visual signals as a unified whole, not separate data streams.
Traditional emotion detection systems flatten human expression into rigid labels like "happy" or "sad." Raven-1 takes a fundamentally different approach: it generates rich, sentence-level natural language descriptions of emotional state and attentional shifts. These outputs are directly aligned with LLMs, requiring no translation layer. This enables AI to reason over nuanced, layered, or even contradictory emotional signals—such as frustration mixed with hope—that categorical systems cannot capture.
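Because the output is plain natural language, "directly aligned with LLMs" has a simple practical reading: the description can be spliced straight into a prompt. The sketch below assumes invented transcript and perception strings; nothing here is actual Raven-1 output.

```python
# Why LLM-aligned output needs no translation layer: the perception
# description is already text, so it drops straight into the prompt.
# The transcript and perception strings are invented for illustration.

transcript = "Fine. Let's just do it your way."
perception = (
    "Tone is clipped and falling; arms crossed, gaze averted. Reads as "
    "frustration mixed with resignation, not agreement."
)

# A categorical system would emit a label like "neutral" and force a
# hand-written mapping from label to prompt text. Here the description
# is the prompt text.
prompt = (
    "You are a voice assistant in a live conversation.\n"
    f"User said: {transcript}\n"
    f"Perception: {perception}\n"
    "Respond to what the user means, not just the words."
)
print(prompt)
```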
Raven-1 was architected from the ground up for real-time operation. Audio perception completes in under 100 milliseconds, and the full multimodal pipeline maintains context in under 600ms. That speed makes it suitable for high-stakes applications such as healthcare, therapy, coaching, and interviews; in clinical settings, up to 75% of the diagnostic signal comes from patient communication rather than tests. The system excels on short, ambiguous inputs: a single "fine" or "sure" carries radically different meaning depending on delivery, and Raven-1 captures that difference.
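The "fine"/"sure" point is easy to show. Below, the same one-word transcript is paired with two invented delivery descriptions; both the descriptions and the readings are illustrative examples, not actual model output.

```python
# Same transcript, different delivery, different meaning. The delivery
# descriptions and readings are invented for illustration, not actual
# Raven-1 output.

utterance = "fine"

deliveries = {
    "warm, rising tone; relaxed posture; steady eye contact":
        "genuine agreement; comfortable proceeding",
    "flat tone; long pause before answering; gaze down":
        "reluctant concession; likely an unresolved objection",
}

for delivery, reading in deliveries.items():
    print(f'"{utterance}" ({delivery}) -> {reading}')
```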
Raven-1 functions as a perception layer that works in concert with Tavus's Sparrow-1 (conversational timing) and Phoenix-4 models, creating a closed loop where perception informs response and response reshapes the moment. This enables AI that doesn't just generate fluent language, but understands when to speak, when to listen, and how to adapt in real time to the human on the other side of the conversation.
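Read as an architecture, the closed loop is: perceive, decide when to respond, render, then perceive the effect of that response. The sketch below is an assumption about the data flow; only the division of roles among Raven-1, Sparrow-1, and Phoenix-4 comes from the announcement.

```python
# One turn of the perception loop. Function bodies are placeholders;
# only the role split (Raven-1 perceives, Sparrow-1 times, Phoenix-4
# renders) comes from the announcement itself.

def perceive(audio_frame, video_frame) -> str:
    """Raven-1's role: fuse signals into a natural-language description."""
    return "user leans in, tone brightens; engaged and ready to continue"

def decide_timing(description: str) -> str:
    """Sparrow-1's role: decide whether to speak now or keep listening."""
    return "speak" if "ready" in description else "listen"

def render_response(description: str) -> str:
    """Phoenix-4's role: deliver the response as a lifelike AI human."""
    return f"response adapted to: {description}"

# Perception informs the response, and the delivered response reshapes
# the next moment the loop perceives.
description = perceive(audio_frame=None, video_frame=None)
if decide_timing(description) == "speak":
    print(render_response(description))
```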
"Raven-1 captures and interprets audio and visual signals together, enabling AI systems to understand not just what users say, but how they say it and what that combination actually means." The model is now generally available, bringing human-like perception one step closer to reality.
About Tavus
Tavus is a San Francisco-based AI research company pioneering human computing, the next era of computing built around adaptive and emotionally intelligent AI humans. Tavus develops foundational models that enable machines to see, hear, respond, and act in ways that feel natural to people.