
Audiovisual Speech Recognition Takes ASR to the Next Level


Automatic speech recognition (ASR) technology has made tremendous progress in the past few years and enables a broad range of B2C and B2B use cases. Despite that progress, ASR still has limitations: its accuracy decreases in noisy environments, for example, reducing its utility.

More recently, after ChatGPT gained wide popularity in a short period of time, interest has grown in making artificial intelligence tools like it multimodal. That is, instead of using text-only prompts, can you prompt (i.e., provide inputs to) large language models (LLMs) with images, audio, or video?

We can ask the same question about ASR technology. After all, humans rely on both audio and visual cues to comprehend the spoken word. So can we make ASR better by using multiple inputs? That’s the focus of audiovisual speech recognition (AVSR). Let me note that AVSR is not an entirely new field. But several factors can turbocharge it: advances in AI methods, the increased availability of video and streaming data, and the proliferation of smart devices that can capture and process video.

So how does AVSR work? Simply put, AVSR integrates both audio and visual information using separate deep learning techniques for each. AVSR draws from multiple research disciplines—computer vision, machine learning, signal processing, and speech recognition. The audio component involves the usual speech recognition techniques, analyzing acoustic information and speech patterns to identify words. The visual component involves analyzing the speaker’s lip movements and facial expressions using computer vision methods. By combining these two sources of information, AVSR achieves better performance and higher accuracy, particularly in noisy environments.
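To make the idea concrete, here is a minimal late-fusion sketch in PyTorch. It assumes precomputed per-frame audio features (e.g., filterbanks) and lip-region embeddings sampled at a common frame rate; the AVFusionModel class, its dimensions, and the vocabulary size are illustrative placeholders rather than any specific AVSR system:

```python
# Minimal late-fusion AVSR sketch (illustrative only).
import torch
import torch.nn as nn

class AVFusionModel(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=512, hidden_dim=256, vocab_size=1000):
        super().__init__()
        # One encoder per modality, as described above.
        self.audio_encoder = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.visual_encoder = nn.GRU(visual_dim, hidden_dim, batch_first=True)
        # Late fusion: concatenate per-frame encodings, then predict
        # token logits per frame (e.g., for CTC-style training).
        self.classifier = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, audio_feats, visual_feats):
        audio_out, _ = self.audio_encoder(audio_feats)      # (B, T, hidden_dim)
        visual_out, _ = self.visual_encoder(visual_feats)   # (B, T, hidden_dim)
        fused = torch.cat([audio_out, visual_out], dim=-1)  # (B, T, 2*hidden_dim)
        return self.classifier(fused)                       # (B, T, vocab_size)

model = AVFusionModel()
audio = torch.randn(2, 100, 80)    # 2 clips, 100 frames of audio features
visual = torch.randn(2, 100, 512)  # matching per-frame lip embeddings
logits = model(audio, visual)      # shape: (2, 100, 1000)
```

Late fusion by concatenation is only one design choice; many AVSR systems instead fuse the modalities earlier, for instance with cross-modal attention. The common thread is the two-encoder structure, with one stream per modality.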

AVSR Applications

AVSR enables new use cases and offers improvements over traditional ASR. This is what I call ASR 2.0.

Improving Accessibility

Researchers have mostly focused on visual speech recognition (sometimes also referred to as voiceless speech recognition). Think lip-reading applications, which are of great help to users with hearing impairments; they can also help stroke patients who have lost the capacity for speech to communicate better with those around them.

AVSR can power next-generation versions of these applications and improve accessibility more broadly. For example, it can be integrated with customer service apps to better serve users with hearing disabilities, and it can make workplace communication and collaboration easier for employees with such disabilities.

As the examples that follow demonstrate, AVSR is not just an accessibility technology.

Improving Communication in Noisy Places

Construction sites and factory floors are famously noisy places, with loud machinery and vehicles making speech difficult to hear. AVSR applications can help improve communication and enhance safety.

Enhancing User Experience

In vehicles, AVSR can augment voice command applications, making them more resilient to traffic sounds. Similarly, it can improve the accuracy of many smart home devices and digital assistants that have voice-based user interfaces.

Sports, Media, and Entertainment

AVSR can provide better real-time captioning in live broadcasts; it can also automate script alignment during post-production for films and TV shows.

E-learning and Education

AVSR can be used to enhance the online learner experience with accurate subtitles in lectures. Language learning apps can provide feedback on both student pronunciation and lip movement.

Law Enforcement Scenarios

AVSR can help recognize conversations in silent closed-circuit TV footage; this capability will be handy for law enforcement and investigation teams.

Enhanced Security of Voice Authentication Systems

The security of voice-based authentication systems can be enhanced by matching both the voice and the lip movement patterns. It’s also possible to prevent automated hacks by using visual cues to detect the presence of a “live” person.
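In its simplest form, such a system would accept a login only when both checks pass. The sketch below is purely illustrative; the scoring functions and thresholds are hypothetical, not drawn from any real authentication product:

```python
# Illustrative dual-check authentication; scores and thresholds are hypothetical.

def authenticate(voice_match_score: float, lip_sync_score: float,
                 voice_threshold: float = 0.8,
                 liveness_threshold: float = 0.7) -> bool:
    """Accept the user only when the voiceprint matches AND the visible
    lip movements are consistent with the audio, a simple liveness cue
    against replayed or synthesized recordings."""
    return (voice_match_score >= voice_threshold
            and lip_sync_score >= liveness_threshold)
```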

Integration with LLMs

Another emerging application area is the integration of AVSR with LLMs. There are several interesting possibilities. For example, in an online meeting with international users who speak different languages, an AVSR application could capture input, and then an LLM could process that input for real-time translation. AVSR applications paired with LLMs could also add speech to silent videos.
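As a rough illustration of the plumbing in the translation scenario, the sketch below chains the two stages. Both component functions are placeholders, since no specific AVSR model or LLM is implied here:

```python
# Hypothetical glue code for an AVSR-plus-LLM translation pipeline.
# Both functions below are placeholders, not real library calls.

def avsr_transcribe(audio_samples, video_frames) -> str:
    """Placeholder: run an AVSR model over synchronized audio and video
    and return the recognized text."""
    raise NotImplementedError

def llm_translate(transcript: str, target_language: str) -> str:
    """Placeholder: prompt an LLM to translate the transcript."""
    raise NotImplementedError

def translate_meeting_segment(audio_samples, video_frames, target_language):
    # AVSR handles the (possibly noisy) speech capture;
    # the LLM handles the translation step.
    transcript = avsr_transcribe(audio_samples, video_frames)
    return llm_translate(transcript, target_language)
```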

While research interest in AVSR is growing at both specialist firms and large technology companies, commercial applications are still in the early stages. But as we move toward multimodal AI, AVSR will play an integral role.

Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and co-author of Practical Artificial Intelligence: An Enterprise Playbook.
