aiOla, a voice-AI lab, has announced the launch of Drax, an open-source AI model that applies flow-matching generative methods to speech recognition. The approach lets Drax capture the nuance of real-world audio without the delay that plagues traditional systems: it matches the accuracy of leading models like OpenAI's Whisper while operating at roughly one-fifth the latency and running up to 32 times faster than real time.
Quick Intel
aiOla has open-sourced Drax, a new AI model for speech recognition.
Drax matches the accuracy of top models like OpenAI's Whisper but is up to 5x faster.
It uses a parallel, flow-based approach to output entire transcriptions at once, reducing latency.
The model is robust against background noise, accents, and jargon in real-world settings.
It will be released in three sizes on GitHub and Hugging Face under a permissive license.
This technology is critical for enterprise settings such as call centers and manufacturing.
Solving the Speed-Accuracy Trade-Off in Speech AI
Modern speech systems often force a choice between speed and accuracy. Sequential models like Whisper are accurate but slow for long-form audio, while faster diffusion-based models can struggle with real-world noise. Drax breaks this trade-off by processing speech in parallel and outputting the entire token sequence at once. This dramatically reduces latency and avoids the compounding errors common in long transcriptions, making it suitable for lengthy enterprise conversations where minor errors can impact compliance and customer experience.
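To make the contrast concrete, here is a minimal, self-contained PyTorch sketch of the idea behind parallel, flow-style decoding. It is not aiOla's code: ToyFlowDecoder, parallel_decode, and all dimensions are hypothetical, chosen only to illustrate that the whole token sequence is refined jointly over a small, fixed number of passes rather than one pass per output token.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, AUDIO_DIM = 100, 64, 80

class ToyFlowDecoder(nn.Module):
    """Toy stand-in for a flow-style decoder: predicts a 'velocity' that
    nudges a noisy token-state toward the clean transcription, conditioned
    on the audio features. Purely illustrative, not aiOla's architecture."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(AUDIO_DIM, HIDDEN)
        self.velocity_net = nn.Linear(2 * HIDDEN + 1, HIDDEN)
        self.to_tokens = nn.Linear(HIDDEN, VOCAB)

    def velocity(self, audio, state, t):
        cond = self.audio_proj(audio).mean(dim=1, keepdim=True)      # (B, 1, H)
        cond = cond.expand(-1, state.size(1), -1)                    # (B, T, H)
        t_feat = torch.full_like(state[..., :1], t)                  # (B, T, 1)
        return self.velocity_net(torch.cat([state, cond, t_feat], dim=-1))

def parallel_decode(model, audio, seq_len=50, num_steps=8):
    """Refine the whole token sequence jointly: the number of forward passes
    is fixed (num_steps), not proportional to the transcript length."""
    state = torch.randn(audio.size(0), seq_len, HIDDEN)              # start from noise
    for step in range(num_steps):
        t = step / num_steps
        state = state + model.velocity(audio, state, t) / num_steps  # Euler step
    return model.to_tokens(state).argmax(dim=-1)                     # all tokens at once

if __name__ == "__main__":
    model = ToyFlowDecoder()
    audio_features = torch.randn(1, 300, AUDIO_DIM)      # e.g. a few seconds of log-mel frames
    print(parallel_decode(model, audio_features).shape)  # torch.Size([1, 50])
```

Because the number of refinement steps is fixed, decoding cost in this sketch does not grow with transcript length the way a token-by-token loop does, which is the latency advantage the parallel approach targets.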
A Training Method Built for Real-World Robustness
Drax's performance stems from its unique three-step training path. Similar to image diffusion models, it learns to reconstruct transcriptions from noise, moving through intermediate states that expose it to realistic, acoustically plausible errors. This training enhances its robustness, allowing it to handle accents, background noise, and natural speech variability effectively. In benchmarks, Drax achieved an average word error rate of 7.4% in English, matching Whisper-large-v3, while maintaining comparable or better accuracy across Spanish, French, German, and Mandarin.
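For readers who want to see what "reconstructing from noise" looks like in practice, below is a minimal sketch of a generic flow-matching training step, reusing the hypothetical ToyFlowDecoder interface from the earlier snippet. It reflects the standard recipe (sample a point on the path between noise and the clean target, then predict the direction back toward the target), not Drax's published training procedure; in particular, the corruption process behind Drax's "acoustically plausible errors" may differ from the plain Gaussian noise used here.

```python
import torch
import torch.nn as nn

def flow_matching_step(model, optimizer, audio, target_embeddings):
    """One generic flow-matching training step (sketch, not aiOla's code).
    `model.velocity(audio, state, t)` follows the toy interface above."""
    noise = torch.randn_like(target_embeddings)
    t = float(torch.rand(()))                        # random time in [0, 1]
    # A point partway along the straight path from noise (t=0) to the
    # clean target (t=1) -- the intermediate state the model learns to repair.
    state = (1.0 - t) * noise + t * target_embeddings
    # Along that path, the ideal velocity is simply (target - noise).
    pred_velocity = model.velocity(audio, state, t)
    loss = nn.functional.mse_loss(pred_velocity, target_embeddings - noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with the toy decoder from the previous snippet:
# model = ToyFlowDecoder()
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# flow_matching_step(model, opt,
#                    torch.randn(1, 300, AUDIO_DIM),  # audio features
#                    torch.randn(1, 50, HIDDEN))      # embedded transcript target
```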
By open-sourcing Drax, aiOla aims to spark community-driven innovation and set a new benchmark for automatic speech recognition. The release of this high-performance, low-latency model provides a powerful tool for developers and enterprises, paving the way for voice to become a more practical and reliable interface for large-scale, mission-critical applications across various industries.
aiOla is a deep tech Voice AI lab redefining how enterprises interact with their systems by making voice the natural interface for executing workflows and capturing data. Built on patented foundation models, aiOla's technology goes beyond standard speech-to-text, transforming manual processes by merging voice recognition with intelligent workflow agents to turn natural speech into real-time, structured data. This enables hands-free process automation across CRMs, ERPs, QMS platforms, and more, allowing enterprises in critical industries to embed voice-driven workflow agents directly into operations.
aiOla supports over 100 languages and accurately interprets jargon, accents, abbreviations, and acronyms even in noisy environments. Backed by $58 million in funding from New Era, Hamilton Lane, and United Airlines, and led by a world-class professional and research team, the company is advancing voice as the primary interface for enterprise systems, modernizing workflows, and reinventing how data entry gets done.