  • April 1, 2024
  • By Leonard Klie, Editor, Speech Technology and CRM magazines
  • FYI

Apple Proposes Acoustic Model Fusion to Improve Speech Recognition


With automatic speech recognition (ASR) technology, a persistent challenge has been domain mismatch, where internal acoustic models struggle to accurately recognize rare or complex words and phrases, or ones with unusual pronunciations.

Previous efforts to overcome this difficulty have focused on refining models to expand their vocabularies and on consolidating architectures into a single neural network, but those efforts have had only limited success.

Apple has recently proposed a new approach that integrates external acoustic models into end-to-end (E2E) ASR systems. This methodology, which Apple is calling acoustic model fusion (AMF), aims to refine speech recognition by enriching the system with broader acoustic knowledge.

With AMF, the external acoustic model is trained specifically for the desired domain, such as call centers or medical dictation, and incorporated into the E2E system. The predictions from both the E2E system’s internal acoustic model and the external model are combined through a process called log-linear interpolation. By combining the strengths of both models, AMF aims to achieve better recognition accuracy, particularly for words or phrases that are specific to a domain, like medical terminology, or for accents or pronunciations not well represented in the E2E model’s training data.
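Apple's post describes the fusion at a conceptual level; the sketch below is only an illustration of log-linear interpolation itself, written in plain Python with NumPy. The candidate tokens, probabilities, and interpolation weight are invented for the example and are not taken from Apple's system.

```python
import numpy as np

def log_linear_fusion(internal_log_probs, external_log_probs, weight=0.3):
    """Fuse per-token log-probabilities from the E2E system's internal
    acoustic model with those from an external, domain-trained model.

    The fused score is a weighted sum of log-probabilities (equivalently,
    a product of the probabilities raised to powers), renormalized so the
    result is again a valid distribution.
    """
    fused = (1.0 - weight) * internal_log_probs + weight * external_log_probs
    return fused - np.logaddexp.reduce(fused)  # log-softmax renormalization

# Toy example with three candidate tokens, e.g. ["angiogram", "and a gram", "anagram"]:
internal = np.log([0.2, 0.5, 0.3])   # generic E2E model prefers "and a gram"
external = np.log([0.7, 0.1, 0.2])   # medical-domain model prefers "angiogram"
print(np.exp(log_linear_fusion(internal, external, weight=0.5)))
```

With an even weighting, the domain-specific candidate wins out, which is the effect AMF is after for domain-specific terminology and underrepresented pronunciations.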

Apple ran its AMF methodology through a series of experiments using diverse datasets, including virtual assistant queries, dictated sentences, and synthesized audio-text pairs. In this preliminary research, the methodology yielded a 14.3 percent reduction in word error rate and improved recognition of named entities and rare words.

In a blog post detailing the research, Apple called AMF a promising breakthrough, one that paves the way for more accurate, efficient, and adaptable speech recognition systems and opens new avenues for applying ASR technology across a wide range of domains.

But Jim Larson, a senior scientist at Open Voice Network, says acoustic model fusion is just one of many approaches that researchers are pursuing in this area.

“This is a hot topic among LLM researchers,” he says.

Many developers, Larson explains, are using retrieval-augmented generation (RAG), a process in which they submit a query to a traditional database management system or another information source and then embed the results in the prompt sent to the LLM, so the model can draw on the additional data and produce a better overall result.
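Larson's description maps onto a small amount of code. The sketch below is a toy illustration of the RAG pattern, not any particular vendor's implementation: the document store, the keyword-overlap retriever, and the prompt template are all stand-ins (a production system would use vector embeddings, a similarity index, and a real LLM call).

```python
# Toy RAG sketch: retrieve supporting text, then fold it into the LLM prompt.
DOCUMENTS = [
    "Acoustic model fusion combines an external acoustic model with an E2E ASR system.",
    "Word error rate measures the fraction of words an ASR system gets wrong.",
]

def retrieve(query, documents, top_k=1):
    """Rank documents by naive keyword overlap with the query; a real system
    would embed both and use a nearest-neighbor search instead."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Embed the retrieved passages in the prompt so the LLM can use them."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is acoustic model fusion?", DOCUMENTS))
```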

OpenAI, creator of ChatGPT and an early innovator in the large language model space, has been working for the past few years with another methodology called reinforcement learning from human feedback (RLHF), a technique used in machine learning to train models by incorporating human input.

According to OpenAI, RLHF uses human preferences as a reward signal to fine-tune models. OpenAI first collects a dataset of human-written demonstrations on prompts submitted to its application programming interface and uses this to train the supervised learning baselines. Next, it collects a dataset of human-labeled comparisons between two model outputs on a larger set of API prompts. It then trains a reward model (RM) on this dataset to predict which output labelers would prefer. Finally, it uses this RM as a reward function and fine-tunes the model policy to maximize this reward.
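OpenAI's published pipeline is far more involved than a few lines of Python, but the two objectives at its core are compact. The sketch below, with invented scores and an illustrative KL penalty weight, shows the pairwise loss a reward model is trained with and the penalized reward the policy is then fine-tuned to maximize.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model_loss(score_preferred, score_rejected):
    """Pairwise loss for the reward model: push the score of the output the
    human labeler preferred above the score of the output they rejected."""
    return -np.log(sigmoid(score_preferred - score_rejected))

def policy_objective(reward, kl_to_sft_baseline, beta=0.02):
    """Final-stage objective: maximize the learned reward while a KL penalty
    keeps the fine-tuned policy close to the supervised baseline."""
    return reward - beta * kl_to_sft_baseline

print(reward_model_loss(score_preferred=1.8, score_rejected=0.4))  # small loss: RM agrees with the labeler
print(policy_objective(reward=1.8, kl_to_sft_baseline=5.0))
```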

Another technique being explored, according to Larson, is direct preference optimization (DPO). Unlike RLHF, DPO bypasses the need for a separate reward model; instead, it directly optimizes the LLM on a dataset of human preferences. Experiments have shown that DPO can fine-tune LLMs to align with human preferences as well as or better than other methods, notably in its ability to control the sentiment of generated text and to match or improve response quality in summarization and single-turn dialogue, while being substantially simpler to implement and train. DPO has been applied successfully to train models like Zephyr and Intel’s NeuralChat.
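The contrast with RLHF is easiest to see in the loss itself. The sketch below writes out the DPO loss for a single preference pair; the log-probabilities and the beta temperature are illustrative values, not numbers from the Zephyr or NeuralChat training runs.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    There is no separate reward model: the log-probability ratio between the
    policy being trained and a frozen reference model acts as an implicit
    reward, and a logistic loss pushes the chosen response above the rejected one.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -np.log(1.0 / (1.0 + np.exp(-logits)))

# Illustrative summed log-probabilities for one preference pair.
print(dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.5))
```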

And the research doesn’t stop there. Many other companies, academic programs, and software developers are experimenting with other ways to retrain AI models to improve accuracy and reduce the likelihood of hallucinations.

“Which of these techniques is best? Depends on who you ask,” Larson concludes.
