Speech AI has a bigger job than recognizing words. It has to bridge understanding.
That challenge sits at the heart of Shawn Zhang's vision for Sanas. In this conversation, he discusses how the company is addressing accent bias, reducing friction in live conversations, and building speech intelligence that works in real time. He also shares why the future of communication may depend on treating speech not as a feature, but as infrastructure.
At Stanford, I spent a lot of time working on advanced speech systems. However, what stayed with me was a conversation with a close friend at the start of the COVID-19 pandemic. He left Stanford, moved back home, and took a call center job where his performance was shaped less by what he said and more by how he sounded.
That experience exposed a broader issue. Accent bias shows up in high-volume, communication-heavy roles where clarity is tied directly to outcomes. It creates friction that has nothing to do with skill or knowledge, yet it affects hiring, compensation, day-to-day performance, and a person's sense of professional value.
In those roles, that friction shows up constantly. Companies have invested heavily in automation and AI, but the communication layer itself has remained largely untouched.
The underlying technology to improve clarity in real time already existed. The gap was how and where it was being applied, and whether it could improve understanding without flattening identity.
Most systems are not built to handle how people actually speak. Accents, pacing, background noise, connection quality – these are still common failure points.
When something breaks, the responsibility shifts to the speaker. People repeat themselves, slow down, or change how they talk just to be understood. That’s a limitation in the system, not the person. The other issue is timing. Most tools react after the conversation has already drifted. Once that moment passes, it’s hard to recover.
Voice systems should operate in real time and adapt to people as they are. That’s the gap that still needs to be solved.
Latency was the first constraint we ran into. If there’s even a small delay, it breaks the conversation immediately. Voice AI has to be fast enough that people don’t notice anything at all.
But making speech clearer isn’t just about audio quality. You also have to preserve how someone sounds: their tone, their intent, the emotion behind what they’re saying. Otherwise, it starts to feel unnatural.
Then there’s the reality of how messy real conversations are. Different accents, different environments, background noise, unstable connections. Early models weren’t trained to handle that level of variation.
Much of our early work went into expanding our data and testing in real-world conditions to make sure the models held up.
As Sanas progressed, we knew we were solving more than just an audio problem. It became more about building a real-time system where speech recognition, generation, and infrastructure all have to work together at once.
Building on top of existing platforms introduced too much friction. Those systems weren’t designed for real-time voice transformation, so latency, reliability, and integration issues showed up quickly.
Most approaches that sit on top of those platforms inherit those limitations, including how data moves and how it can be secured.
Embedding speech intelligence directly into the communication layer changes that. It allows us to keep latency under 200 milliseconds, maintain consistent performance, and enforce security controls at the system level rather than relying on external layers.
When speech runs inside the infrastructure, it can meet the same expectations as other core services, including privacy, reliability, and data protection.
At that point, speech intelligence becomes part of how communication works rather than something added on afterward.
As we evolved beyond customer experience (CX), one thing became very clear to us. CX was a strong starting point because the pain was obvious and measurable, but it was never the end game for Sanas.
Once we built the technology and saw it working in real environments, it became clear that the same communication issues exist across the enterprise and a much broader set of industries.
We’re now seeing that play out in areas like healthcare, financial services, retail, and travel: anywhere communication happens in real time and clarity directly affects outcomes.
As we move into telcos, devices, and developer platforms, the requirements change quite a bit. You’re operating across different systems, different environments, and often with higher expectations around security, reliability, and control.
The way we think about it now is less about a single use case and more about where speech shows up across systems. The platform sits inside those interactions and makes sure communication works the way it’s supposed to: clear, natural, and consistent.
Over time, that becomes something those systems rely on directly rather than something that’s added later.
From a technology and infrastructure perspective, Tomato.ai was a strong fit because they’ve been building toward the same reality we see: voice sits at the center of how enterprises and platforms operate, but the underlying real-time communication layer hasn’t kept up.
Their work in low-latency processing, zero-shot voice transformation, and production integrations across VoIP and carrier environments complements what we’ve built.
At scale, voice systems fail in small ways that compound quickly. That’s where their experience stood out. Their technology has already been tested in high-volume environments where consistency and uptime are critical.
Bringing that into Sanas strengthens our ability to operate directly within communication systems instead of layering on top of them.
For me, it always comes back to the actual experience of a conversation.
You can build something that looks great on paper, but if it doesn’t hold up when two people are talking in real time, it doesn’t matter. That’s been a consistent reminder for us.
Sanas’ co-founder and my partner, Sharath Narayana, and I stay very close to what’s happening in production: how people are actually using the system, where things break, where things feel off. That feedback loop matters more than anything.
We still push the technology forward, but it’s always tied to that question of whether the conversation feels right. Not just whether it’s more accurate, but whether it still sounds natural, whether meaning comes through, whether people can stay focused on what they’re saying instead of how they’re saying it.
If you stay grounded in that, the technology doesn’t drift too far from the problem you’re trying to solve.
I think the industry will realize it misunderstood where the real problem was. A lot of today’s conversations are still centered on models and features, but the bigger gap has always been the underlying communication layer itself.
That layer hasn’t been built to support real-time intelligence in live environments at scale. It’s fragmented, inconsistent, and often treated as something systems work around rather than something they rely on.
Over the next five years, that will change. Speech will be built directly into the communication layer and treated the same way enterprises treat core systems today, with clear expectations around reliability, security, and availability.
When we reach that point, speech becomes a dependency. Applications, networks, and devices won’t just include it; they’ll assume it’s there and working. Teams won’t design around communication limitations; they’ll design with the expectation that speech is clear, consistent, and real-time by default.
That shift changes how systems are built and how people interact within them.
Shawn Zhang, CTO and Sanas co-founder, leverages his engineering expertise from Stanford’s AI Lab to pioneer AI-driven solutions. Inspired by a friend's experience with accent bias, Shawn co-founded Sanas. His leadership in AI research and development ensures Sanas’s technology transforms global communication to be more inclusive.
Sanas is a real-time Speech AI platform built to power global enterprise and communications platforms. Founded in 2021 in Palo Alto, California, Sanas enables speech to be understood clearly and naturally across languages, accents, and environments. The platform provides real-time speech enhancement, accent transformation, and language understanding that can be embedded directly into applications, platforms, and carrier networks.
Since going to market in 2023, Sanas has grown from zero to $62 million in annual recurring revenue and is on track to surpass $130 million, supporting large-scale, real-time voice communication across global enterprises and platform providers.
Learn more at Sanas.ai.