Speech Model Training Gets Easier with the Latest Tools

In recent years, voice-enabled technologies have become increasingly popular, and the demand for customized speech solutions has grown significantly. Companies require efficient, reliable speech solutions that they can implement themselves without the time or expense of employing developers for many of the basic and some of the advanced uses.

That’s not an easy task in today’s climate, where even the most basic speech applications for converting speech to text, for example, have incorporated very advanced and technical artificial intelligence to detect patterns in sound waves to increase transcription accuracy. Other products rely on AI trained for very specific business language and use cases, according to Katie Kuzin, product lead for Scribe speech recognition AI at Kensho Technologies.

“Technology has gotten a lot better,” says Rebecca Wettemann, CEO and founder of Valoir. “It’s much more usable for business users. More companies are looking at how they can have high-quality interactions using voice, and as more companies move their contact centers to the cloud, they are rethinking the interactions they have with customers via voice and determining how much of it they can do with intelligent, AI-driven interactions.”

Only a few years ago, the market offered just a handful of tools, like TensorFlow, Google’s end-to-end open-source platform for machine learning. Tools like this required a lot of technical knowledge.

But today, the speech-to-text/text-to-speech tools available from companies like Microsoft require much less of that. Even TensorFlow has become easier to use. Other options have also emerged in just the past few years, making speech engines far more accessible and easier to implement.

In recent months alone, the industry witnessed OpenAI’s launch of Whisper in September. Then, just this past March, the company made available the APIs for Whisper, making it cheaper and easier for companies to integrate Whisper into their conversation platforms and third-party software. Whisper accepts files in formats like M4A, MP3, MP4, MPEG, MPGA, WAV, and WEBM and has been trained on 680,000 hours of multilingual and multitask supervised data collected from the web, according to the company.

Facebook, Instagram, and WhatsApp parent company Meta also introduced Wav2Vec 2.0, a new model pretrained to predict the correct speech unit for masked parts of audio. With just 10 minutes of transcribed speech and 53,000 hours of unlabeled speech, Wav2Vec 2.0 enables speech recognition models capable of a word error rate of 8.6 percent on noisy speech and 5.2 percent on clean speech, the company claims.
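
For readers unfamiliar with the metric, word error rate is commonly computed as the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not Meta's evaluation code; the function name and the lowercase, whitespace-based tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference word count."""
    # Assumption: simple lowercase, whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.1%}")
```

In the usage example, one substitution and one deletion against a six-word reference give a word error rate of about 33 percent.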

Wav2Vec 2.0 “opens the door for speech recognition models in many more languages, dialects, and domains that previously required much more transcribed audio data to provide acceptable accuracy,” Meta said in a blog post at the time of the release.

The fact that these models can be fine-tuned on specific domains or setups with small amounts of labeled data could save both computational resources and human capital that previously required months or even years of effort, according to industry experts.

One technical method for training AI models is transfer learning. This method involves taking a pretrained model and fine-tuning it for specific use cases. It can save companies a lot of time and resources because they can build on knowledge already present in pretrained models, Kuzin says.
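
As a toy illustration of that idea, the sketch below keeps a "pretrained" feature extractor frozen and trains only a small classification head on a handful of labeled examples. Everything in it is hypothetical: `pretrained_features` is a stand-in for a real pretrained speech encoder, the data is synthetic, and the logistic-regression head is just one simple choice of what to fine-tune. Real projects would typically use a deep learning framework rather than plain Python.

```python
import math
from typing import List, Sequence, Tuple


def pretrained_features(x: float) -> List[float]:
    """Hypothetical frozen encoder: its behavior is never updated below."""
    return [x, x * x]


def fine_tune_head(
    examples: Sequence[Tuple[float, int]], epochs: int = 200, lr: float = 0.1
) -> Tuple[List[float], float]:
    """Train only a logistic-regression head on top of the frozen features."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, label in examples:
            feats = pretrained_features(x)  # reuse existing knowledge as-is
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            error = pred - label  # gradient of the log loss with respect to z
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(x: float, weights: List[float], bias: float) -> int:
    """Classify one input with the frozen encoder plus the trained head."""
    feats = pretrained_features(x)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if z >= 0 else 0


if __name__ == "__main__":
    # Synthetic task: label is 1 when the input is far from zero.
    data = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (1.5, 1), (2.0, 1)]
    w, b = fine_tune_head(data)
    print([predict(x, w, b) for x, _ in data])
```

The point of the design is that only the head's few parameters are learned, which is why a small labeled dataset can be enough.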

Companies that have been the most successful with this technology have a tuning practice, along with a business dictionary of terms, according to Jaime Meritt, chief product officer at Verint. “They take the technology’s generic capabilities and make it their own.”

“Another exciting approach is self-training using pseudo-labeling, whereby unlabeled audio data is transcribed using an [automatic speech recognition] model and then used as ground truth to train a new model in a supervised manner,” says Tal Rosenwein, vice president of research and development at OrCam Technologies, a maker of voice-enabled wearable assistive technologies. “This method is gaining attention, since unlabeled audio is now easy to acquire as the amount of streaming audio increases through social media, YouTube, and other platforms.”
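
The pseudo-labeling loop Rosenwein describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OrCam's method: the `teacher` callable, the `demo_teacher` stub, the file names, and the confidence threshold used to filter out doubtful transcripts are all hypothetical additions for the example.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: an ASR model that maps an audio clip (here just a
# file name) to a transcript and a confidence score between 0 and 1.
Transcriber = Callable[[str], Tuple[str, float]]


def build_pseudo_labels(
    audio_clips: Iterable[str], teacher: Transcriber, min_confidence: float = 0.9
) -> List[Tuple[str, str]]:
    """Transcribe unlabeled clips and keep confident results as training pairs."""
    pseudo_labeled = []
    for clip in audio_clips:
        transcript, confidence = teacher(clip)
        # Assumption: low-confidence transcripts are dropped so that they
        # do not become noisy "ground truth" for the new model.
        if confidence >= min_confidence:
            pseudo_labeled.append((clip, transcript))
    return pseudo_labeled


def demo_teacher(clip: str) -> Tuple[str, float]:
    """Stub standing in for a real ASR model (hypothetical outputs)."""
    canned = {
        "a.wav": ("hello world", 0.97),
        "b.wav": ("helo wurld", 0.41),
        "c.wav": ("good morning", 0.93),
    }
    return canned[clip]


if __name__ == "__main__":
    pairs = build_pseudo_labels(["a.wav", "b.wav", "c.wav"], demo_teacher)
    print(pairs)  # these pairs would then feed ordinary supervised training
```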

“There are a lot of tools out there today where I just need to drag and drop the user interface,” Wettemann says. “They all have prebuilt libraries and drag-and-drop user interfaces. Business users can take advantage of a lot of the heavy lifting that has already been done.

“There are tons of APIs out there that I can use with Microsoft, Google, or Amazon. And there are tons of capabilities out there,” she adds.

Genesys, for example, offers a user-friendly interface to help companies gather data, transcribe voices, and recognize intent and deliver it back to the business without the need for additional coding.

Many paid APIs, such as the ones offered by Sonantic, a startup that uses AI to produce voices from text (acquired by music streaming service provider Spotify in July), and resemble.ai, a voice generator and voice cloning solutions provider, can generate high-fidelity speech from text, but in many cases, companies want to have their own text-to-speech (TTS) models.

To reduce human effort, companies can train their proprietary models using repositories available on collaborative platforms like Papers with Code and Hugging Face, according to Rosenwein. Hugging Face is a French company that develops tools for building machine learning applications, including a library built for natural language processing and a platform that allows users to share machine learning models and datasets. “However, companies must ensure the training dataset can be commercialized, as the voice being trained is usually cloned,” Rosenwein says.

Rosenwein adds that as generative AI explodes, TTS flourishes accordingly. Microsoft’s release of VALL-E and VALL-E X enables in-context learning, meaning the models generate multilingual speech from the given text in the target speaker’s voice and prosody using only a few seconds of that speaker’s audio.

Samsung has already committed to using it for the “text call” feature on the Samsung Galaxy S23 smartphone.

These highly available technologies enable companies today to create custom automatic speech recognition AI models, either training or customizing them to suit their needs, without having to hire full development teams, Kuzin says.

Companies can also create custom speech solutions using AI models trained with data specific to their desired domains, according to Kuzin. “If a company in the medical industry needs a speech-to-text solution, they should look for an AI model on medical terminology that can prove its ability to recognize medical jargon.”

After finding the model that fits their broader industries, they can give that model a custom dictionary, Kuzin recommends. In the renewable energy industry, for example, a custom dictionary might include terms like hydro DM, biomass conversion, and renewable portfolio standard.

By providing the ASR model with a crafted set of words and phrases, custom dictionaries can greatly improve the accuracy of speech-to-text transcription, Kuzin says. This is particularly important for industries where accurate speech transcription is critical, such as healthcare, education, law, and media and entertainment.
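
One simple way to picture the effect of a custom dictionary is as a correction pass over a finished transcript. The sketch below swaps in that post-processing technique for illustration only; commercial ASR products generally apply custom vocabularies inside the recognizer, and nothing here reflects how Kensho's Scribe does it. The function name, the similarity cutoff, and the sample transcript are assumptions, and the dictionary terms are expected in lowercase.

```python
import difflib
from typing import List


def apply_custom_dictionary(
    transcript: str, terms: List[str], cutoff: float = 0.85
) -> str:
    """Replace near-miss words or phrases with the closest dictionary term."""
    words = transcript.split()
    max_len = max((len(term.split()) for term in terms), default=0)
    result = []
    i = 0
    while i < len(words):
        replaced = False
        # Try the longest phrase first so multiword terms win over fragments.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidates = [term for term in terms if len(term.split()) == n]
            phrase = " ".join(words[i:i + n]).lower()
            match = difflib.get_close_matches(phrase, candidates, n=1, cutoff=cutoff)
            if match:
                result.append(match[0])
                i += n
                replaced = True
                break
        if not replaced:
            result.append(words[i])
            i += 1
    return " ".join(result)


if __name__ == "__main__":
    energy_terms = ["biomass conversion", "renewable portfolio standard"]
    raw = "we discussed biomas conversion and the renewable portfolio standart today"
    print(apply_custom_dictionary(raw, energy_terms))
```

The cutoff controls how aggressive the correction is: a lower value fixes more misrecognitions but risks rewriting ordinary words that merely resemble a dictionary term.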

Though Wettemann says companies can get fairly creative with today’s speech technologies without coders or developers, others say the technologies provide a good start but are far from the stage where they can provide really significant benefits to businesses.

“Given the time and expense associated with building technical solutions in house, organizations eager to capitalize on and launch voice technology must be careful in their approach to this kind of project,” cautions Bill Schwaab, vice president of sales for North America at Boost.ai, a conversational AI systems provider.

“Artificial intelligence is only as strong as the data it is trained on, and there are no guarantees project managers or IT staff will have the capability to overcome any gaps in subject matter expertise,” Schwaab explains. “That’s not to say it is impossible to train models internally, but companies seeking to provide a good voice experience on their own must fully understand how stakeholders behave in the interactions they hope to optimize.”

Industry Specificity

Financial services and telecommunications firms have a slight lead over other industries using the technology. One reason is that they have large data analytics teams that know how to work with the data needed for projects like this, Meritt says.

Speech analytics and conversational AI technologies are completely transforming business in financial services, adds Brian Steele, vice president of product management at Gryphon.ai, a provider of conversational intelligence and coaching solutions.

“Banks, investment firms, credit unions, and other financial organizations are embracing AI-enabled technologies to find greater productivity across the board, especially in an effort to make remote work smoother and better support their customer service initiatives,” he says.

The demand for seamless customer experiences and quick service in financial services has grown in the digital age, putting more pressure on internal employees and reps to deliver this experience, Steele adds. With the ability to rely on conversational systems, call recordings, AI-powered recommendations, and more, financial services employees can create better strategies for customer service and sales while also improving their own performance with the aid of technology.

But doing so requires a lot of data to train the AI models, Meritt says, noting that most businesses have to do “a significant amount of tuning.”

“Think about how much of any business is conducted through conversations. A massive amount of data has remained untapped for a very long period of time,” he says. “The interest in voice model training is directly related to the realization that we are missing out on so much insight.”

The expanding use of the technology also coincides with a growing need to automate manual work for tasks like regulatory compliance monitoring, especially in the financial services and medical fields. Some healthcare firms are even using the technology to ensure that nurses show a sufficient amount of empathy for patients.

Companies are also looking to add real-time guidance into their technologies, a trend that Meritt expects to continue to grow as companies look to train their contact center and sales employees. The technology can also be an efficient way to analyze the contact center’s aggregate calls to identify trends and improve CX efforts, he says.

But not all voice training models offer the same capabilities, Schwaab cautions. “A critical consideration to make in training a voice model is the breadth and accuracy of its training data. Centralizing management of a model through an accessible, easy-to-use tool cuts through any barriers that prevent those in the know from elevating business-critical information.”

Users should also plan to identify the number of ways voice models will need to respond to particular queries, Schwaab advises. “There is no way to ensure a query or prompt is submitted in a uniform manner each time, so training data must incorporate variety and nuance to secure the best outcomes. End users want to feel like the situation or context in which they’re engaging with AI is understood, and that cannot be accomplished with static models.”

Meritt adds that static models can only go so far with the number of languages they understand. Even developer-enhanced models can be challenged by certain languages and dialects. “For global businesses, scale is difficult, and the technology requires a significant amount of training,” he says.

Schwaab adds that the maintenance of these solutions needs to be a continuous process, so businesses should plan to routinely evaluate performance and change course if something is not working as intended.

But regardless of whether they have technical teams in place, companies will find “augmenting speech-to-text solutions using ASR AI models is a viable option for companies looking to create solutions that meet their specific requirements,” Kuzin says. “By leveraging a bespoke model for your use case and tools available through various cloud providers, companies can create customized voice models to easily get valuable insights from unstructured audio data.”

And there will be many opportunities for the technology going forward. Wettemann says there is plenty of room for growth, with only about 30 percent of firms deploying it now, and with many of those in the early stages of use.

“There is still a lot of old stuff out there,” Wettemann says. But “we are reaching a tipping point as ChatGPT is raising the visibility for the technology, and we will see a lot more firms seeing how they can do this.”

Phillip Britt is a freelance writer based in the Chicago area. He can be reached at spenterprises1@comcast.net.
