2024 Speech Industry Award Winner: aiOla Speaks Your Business’ Unique Language

In many businesses, employees use a significant amount of industry language, such as jargon, abbreviations, and acronyms, throughout their workdays. Speech technologies used in those settings have typically had a hard time understanding this type of language, but Israeli startup aiOla is changing all that.

aiOla, a speech recognition technology provider founded in 2020, has introduced an artificial intelligence model that can instantly adapt to the unique vocabulary of any industry. Its technology supports more than 100 languages and can handle heavily accented speech and significant background noise, all without retraining the models.

“AI speech models are falling short in enterprise settings because they can’t understand industry jargon,” said aiOla cofounder and CEO Amir Haramaty in a statement. “aiOla’s solution can adapt digital and paper processes into AI speech-driven systems that allow businesses to finally tap into unspoken data while enhancing overall workflow efficiency.”

“Enterprises across every industry are acutely aware of the pressing need to adopt AI to maintain a competitive edge, but they don’t know where to begin,” said Mitch Garber, executive chairman of aiOla, in a statement. “While text-based AI solutions are great for office environments, speech interfaces reign supreme for industrial settings because they seamlessly integrate into existing workflows and collect previously uncaptured spoken data. Prior AI speech recognition models couldn’t perform for business use cases because of their inability to grasp jargon. Today, aiOla is changing that by providing instantly tailored AI models that can understand the unique jargon of your specific industry, your organization, or even your team.”

The jargon understanding relies on a two-step process: a keyword spotting model detects the specific terms, and an adaptive layer then automatically trains and retrains the speech recognition models on that jargon. After training, the jargon vocabulary can be hot-swapped for that of a different sector, achieving state-of-the-art performance on both industry-specific language and general speech.
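aiOla has not published the internals of this pipeline, so the Python below is only a conceptual sketch of the two-step flow the company describes. The class names, methods, and vocabulary lists are illustrative assumptions, not aiOla's actual API: a keyword-spotting pass flags jargon terms, and an adaptation step biases the recognizer toward the currently loaded vocabulary, which can be hot-swapped per sector without touching the base model.

```python
# Hypothetical sketch of the two-step jargon-adaptation flow described above.
# Class and method names are illustrative; they are not aiOla's actual API.

class KeywordSpotter:
    """Step 1: detect which jargon terms occur in an utterance."""
    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)

    def spot(self, transcript_hypothesis):
        # A real system would score acoustic frames against keyword models;
        # here we simply match words in a candidate transcript.
        words = transcript_hypothesis.lower().split()
        return [w for w in words if w in self.vocabulary]


class AdaptiveRecognizer:
    """Step 2: bias the recognizer toward the active jargon set."""
    def __init__(self, base_model="whisper-like-asr"):
        self.base_model = base_model
        self.jargon = set()

    def load_vocabulary(self, vocabulary):
        # Hot-swap: replace the active jargon set without retraining
        # the underlying acoustic model.
        self.jargon = set(vocabulary)

    def transcribe(self, audio_path, spotted_terms):
        # Placeholder: a real implementation would decode the audio and
        # boost the spotted terms during decoding.
        return f"<transcript of {audio_path} biased toward {sorted(spotted_terms)}>"


# Swap vocabularies per industry without touching the base model.
aviation_terms = ["notam", "metar", "squawk"]
logistics_terms = ["cross-dock", "lading", "pallet-jack"]

recognizer = AdaptiveRecognizer()
recognizer.load_vocabulary(aviation_terms)
spotter = KeywordSpotter(aviation_terms)

hits = spotter.spot("check the latest metar before departure")
print(recognizer.transcribe("flightline_clip.wav", hits))

# Hot-swap to a different sector's jargon.
recognizer.load_vocabulary(logistics_terms)
```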

As a result of this work, aiOla’s AdaKWS speech recognition engine outperforms a number of other leading systems. On a benchmark of keyword and jargon detection that included 16 languages, OpenAI’s Whisper model yielded 88 percent accuracy, while aiOla’s AdaKWS model achieved 95 percent accuracy. Additionally, in a recent benchmark involving hard-to-detect keywords from English-language audiobooks, Apple’s CED model yielded 92.7 percent accuracy, while aiOla’s AdaKWS reached 95.1 percent accuracy.

For this use case, aiOla expanded on OpenAI’s Whisper model, but that wasn’t the only work the company did with Whisper. In early August, aiOla released Whisper-Medusa, an open-source artificial intelligence model based on a multi-head attention architecture.

aiOla’s Whisper-Medusa is reportedly much faster than Whisper because it alters how the model predicts tokens. While Whisper predicts one token at a time, Whisper-Medusa can predict 10 at a time, yielding a 50 percent improvement in speech prediction speed and generation runtime, according to the company.
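Although Whisper-Medusa itself is open source, the snippet below is only a minimal sketch of the general Medusa-style idea of proposing several tokens from a single decoder pass; the module name, dimensions, and head count here are assumptions for illustration, not aiOla's released code.

```python
import torch
import torch.nn as nn

# Minimal sketch of multi-head token prediction (Medusa-style).
# A standard decoder emits one token per step from a single output head;
# here, K extra heads each predict one of the next K tokens from the same
# decoder hidden state, so one forward pass proposes K tokens at once.

class MultiHeadTokenPredictor(nn.Module):
    def __init__(self, hidden_size=1024, vocab_size=51865, num_heads=10):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)
        )

    def forward(self, decoder_hidden_state):
        # decoder_hidden_state: (batch, hidden_size) at the last position.
        # Head i proposes the token i positions ahead of the current one.
        logits = [head(decoder_hidden_state) for head in self.heads]
        return torch.stack(logits, dim=1)  # (batch, num_heads, vocab_size)


predictor = MultiHeadTokenPredictor()
hidden = torch.randn(1, 1024)                # stand-in for a decoder state
proposed = predictor(hidden).argmax(dim=-1)  # 10 proposed next tokens
print(proposed.shape)                        # torch.Size([1, 10])
```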

Additionally, Whisper-Medusa is trained using weak supervision, a process in which the main components of Whisper are initially frozen while additional parameters are trained. This training process involves using Whisper to transcribe audio datasets and employing these transcriptions as labels for training Medusa’s additional token prediction modules. aiOla currently offers Whisper-Medusa as a 10-head model and is working on a 20-head version with equivalent accuracy.
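The weakly supervised recipe described above can be sketched as follows: freeze the pretrained model, use its own transcriptions as targets, and update only the extra prediction heads. The function, argument names, and returned shapes below are illustrative assumptions, not aiOla's training code.

```python
import torch

# Sketch of the weak-supervision recipe (illustrative only):
# 1) freeze the pretrained Whisper weights,
# 2) use the frozen model's own transcriptions of the audio as labels,
# 3) update only the extra Medusa-style prediction heads.

def train_extra_heads(whisper_model, extra_heads, dataloader, epochs=1):
    # Step 1: freeze every parameter of the base model.
    for p in whisper_model.parameters():
        p.requires_grad = False
    whisper_model.eval()

    optimizer = torch.optim.AdamW(extra_heads.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for audio_features in dataloader:
            with torch.no_grad():
                # Step 2: the frozen model transcribes the audio; its hidden
                # state and the next K transcription tokens (batch, K) act
                # as weak labels for the extra heads.
                hidden, target_tokens = whisper_model(audio_features)

            # Step 3: only the extra heads receive gradient updates.
            logits = extra_heads(hidden)   # (batch, K, vocab)
            loss = loss_fn(
                logits.flatten(0, 1),      # (batch*K, vocab)
                target_tokens.flatten(),   # (batch*K,)
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```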

“Creating Whisper-Medusa was not an easy task, but its significance to the community is profound,” said Gill Hetz, vice president of research at aiOla, in a statement. “Improving the speed and latency of [large language models] is much easier to do than with automatic speech recognition systems. The encoder and decoder architectures present unique challenges due to the complexity of processing continuous audio signals and handling noise or accents. We addressed these challenges by employing our novel multi-head attention approach, which resulted in a model with nearly double the prediction speed while maintaining Whisper’s high levels of accuracy. It’s a major feat, and we are very proud to be the first in the industry to successfully leverage multi-head attention architecture for automatic speech recognition systems and bring it to the public.”
