Mitigating Bias in Speech Recognition Systems
Image recognition systems today bear little resemblance to those of the pre-deep-learning era. As their performance skyrocketed, so did their adoption. But with increased adoption, their limitations have also become highly visible. One particular area of concern is the accuracy of image recognition systems for minority groups. The systems make many errors, with grave consequences. When a facial recognition system fails to verify a user's identity, it can mean shutting off access to resources or inconveniencing the user in other ways. Thankfully, there is growing awareness of the perils of algorithmic bias, and companies and researchers are figuring out the best ways to mitigate it.
But enough about image recognition. What about speech recognition systems? Do they also suffer from bias?
Sadly, the answer is yes. Of course, there is no doubt that automatic speech recognition (ASR) software has improved significantly in recent years because of machine learning. ASR is used in our smart speakers and in our phones’ virtual assistants. It drives use cases such as speech-to-text conversion, audio captioning, assistive technologies, medical transcription, and more. But the specter of bias haunts all machine learning applications, including speech recognition. Simply put, speech recognition software does not work well for certain demographic groups.
Consider this study published in the Proceedings of the National Academy of Sciences by researchers from Stanford and Georgetown universities. The researchers tested ASR technology from five leading vendors (Amazon, Apple, Google, IBM, and Microsoft) and found these troubling results:
• The word error rate for African-American speakers is about twice as high as it is for white speakers for the five big ASR systems.
• Overall, up to 23 percent of transcripts of audio snippets from African-American speakers were unusable, while the corresponding figure for white speakers was less than 2 percent.
This study did not examine the experience of other ethnic groups, but I'd expect to see similar results for other underrepresented groups as well. Clearly, there is a problem. Let's understand what's causing it and how we can address it.
A machine learning model is only as good as the data used to train it. The training dataset may be large, but if that data comes largely from, or is skewed toward, a particular demographic, the model works well for that group and poorly for others. Bias is simply the manifestation of a higher error rate for groups underrepresented in the training data, which is exactly what the study above found.
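To make the disparity concrete, the word error rate (WER) cited in the study is the number of word-level edits (substitutions, insertions, deletions) needed to turn an ASR transcript into the reference transcript, divided by the length of the reference. The following is a minimal sketch of that standard metric, not the study's exact evaluation pipeline:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Computing this metric separately for each demographic group in a test set, rather than reporting one aggregate number, is what surfaces the kind of two-to-one gap the researchers documented.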
What’s the fix? The training data should include different accents, spoken language variants, and pronunciations of non-native speakers. But the solution to reduce AI bias is not just technical in nature. The teams building the speech tech systems should represent a diversity of voices and experiences. No one sets out to build biased systems, but in the absence of diversity, teams may not be aware of their unconscious biases and blind spots.
The problem of AI bias in image recognition systems has received widespread attention, but speech recognition systems, though they have similar problems, have not yet come under the same level of scrutiny. As their adoption increases and they are deployed for use cases such as recruitment (where the stakes are high and the laws are strict), these issues will assume greater urgency.
When speech recognition works well, user experience improves and customer satisfaction increases. When it misfires or doesn't work, it's a minor annoyance for some, but for others it can be a dehumanizing experience. Companies selling ASR systems and the organizations deploying them need to understand the issues, have a plan for mitigating bias, and help make AI in speech technology more inclusive.
Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and is the co-author of Practical Artificial Intelligence: An Enterprise Playbook.