Recognizing Atypical Speech Is ASR’s Achilles’ Heel
Automatic speech recognition is a foundational technology of today’s world. The billions of smartphones with voice assistants and hundreds of millions of voice-activated smart devices speak to that. Speech technology offers myriad ways, both big and small, to delight the consumer, increase her convenience, and elevate the user experience. Practically all speech technology applications come with an “ASR inside” tag.
What’s more, ASR is also a key enabler of accessibility and inclusivity. ASR can serve as an alternative input method to control computers, phones, tablets, and other electronic devices. Individuals who have had a stroke or a brain injury, or who are affected by neurological conditions that hamper motor abilities, often can’t use input devices like a mouse or a keyboard. Voice-based control empowers such users to operate their devices on their own terms.
ASR can be truly transformative, but its shortcomings have also received attention. Similar to other machine learning applications, speech recognition does not work well for certain demographic groups and many low-resource languages.
But another ASR limitation does not receive the attention it needs: Mainstream ASR systems struggle to recognize atypical or dysarthric speech. Producing clear speech requires respiration, phonation, articulation, resonance, and prosody. Dysarthria, a very common speech disorder in which the muscles used for speech have been weakened, can affect any of these processes (most commonly articulation) and results in speech that is difficult to understand.
Researchers have found that the word error rate (WER) of ASR systems ranges from 4 to 7 percent for typical speech. The WER shoots up to between 60 and 67 percent for dysarthric speech; it is up to 25 percent when speech has moderate intelligibility and 80 to 90 percent when speech has low intelligibility. Other research has reached a similar conclusion: ASR systems are considerably less useful (or practically useless) for the people who need them the most, since the neuro-motor conditions that make a mouse or keyboard hard to use usually result in speech difficulties as well.
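For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The short Python sketch below shows one common way to compute it, using a word-level edit distance. The function name and the sample transcripts are made up for illustration; they are not drawn from the studies cited above, and production evaluations typically also normalize punctuation and numerals before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative sketch only: text is lowercased and split on whitespace,
    with no further normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented for this example.
reference = "turn on the kitchen lights"
clean_output = "turn on the kitchen lights"
garbled_output = "turn in the chicken light"

print(word_error_rate(reference, clean_output))    # 0.0
print(word_error_rate(reference, garbled_output))  # 0.6 (3 of 5 words wrong)
```

In this made-up case, a WER of 0.6 means that three of the five words in a simple command were misrecognized, which is enough to make the command fail. Note that WER can exceed 100 percent when the system inserts many extra words.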
Alas, this vastly diminishes the potential of ASR to be the user-friendly gateway to greater accessibility. As the adoption of voice controls increases (in home automation software, smart speakers, digital assistants, computers, and TVs), the millions of U.S. adults who have trouble using their voice will struggle to accomplish simple tasks and control their immediate environment.
Why can’t ASR handle atypical speech? The main reason is that ASR systems have not been trained on data containing atypical speech patterns. But thankfully, the leading players have started to focus on this area. Google’s Euphonia project has crowd-sourced non-standard speech samples, and its Relate project is working to improve ASR models. Amazon has invested in the startup Voiceitt and is integrating its technology into Alexa. Apple is building features such as hold-to-talk and automatic stutter detection (voice assistants otherwise stop listening to or interrupt users who stutter) to make it easier for individuals with dysarthria to use Siri.
These are welcome steps. But as ASR systems become embedded in our daily lives, much more needs to be done. ASR should be accessible by design; accessibility should not be a retrofit or an afterthought.
How can we get there? Through participatory design, partnerships for domain knowledge, and data collection informed by clinical practitioners’ perspectives, ASR systems can get to the next level.
First, listen to all voices. Follow a participatory design process where the end users are involved in the solution design from the beginning. AI experts and engineers are already familiar with user-centered design principles, and engaging users during different system life cycle stages will improve quality and ensure that features users need are understood and prioritized.
Solving ASR’s challenges with dysarthria requires deep domain knowledge, so forging partnerships with clinical practitioners and speech pathologists is imperative. Creating a representative data sample of disordered speech is challenging because speech patterns vary widely depending on the actual condition causing the dysarthria.
Creating corpora of dysarthric speech is resource-, time-, and expert-intensive. The large diversity of dysarthric speech is an obstacle to building and testing high-quality ASR systems that can handle disordered speech. In a way, this is similar to the bias that arises in AI and machine learning when the training data is not representative enough. Understanding the different manifestations of dysarthria from a clinical perspective, and how each can affect ASR performance, can guide data collection so that the data is representative and the resulting ASR systems perform better in the real world.
Artificial intelligence has transformed ASR, but now it is time to make ASR more inclusive.
Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and co-author of Practical Artificial Intelligence: An Enterprise Playbook.