Memes and Phonemes: “Yanny or Laurel?” Shows Why Accurate Speech Recognition Isn’t Easy
Yanny or Laurel? By now you’ve surely heard the audio clip that incited a viral debate (if not, check it out). It’s a high-profile example of an interesting phenomenon—different people can listen to the same thing, at the same time, and hear completely different things.
This phenomenon has many implications, including some that brands should consider in their day-to-day operations. Understanding why it occurs—and how the speech recognition systems behind customer engagement solutions like IVRs, mobile apps, and virtual assistants can accommodate those differences—helps ensure that brands provide a frustration-free customer experience.
First, a crash course on how speech recognition systems work. Most speech recognition technologies use neural networks that are designed to mimic the way the human brain works. Speech recognition vendors train their networks by feeding them thousands of hours of human speech; this process enables them to learn the diverse ways that people pronounce phonemes, which are the building blocks of all words.
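To make that idea concrete, here is a toy sketch of the core step: a classifier that maps acoustic feature vectors to phoneme labels. Everything here is invented for illustration—the phoneme set, the two-dimensional "features," and the data—and a real recognizer is a deep network trained on thousands of hours of speech, not a one-layer model on fabricated clusters.

```python
import numpy as np

# Hypothetical label set; real systems model dozens of phonemes.
PHONEMES = ["/k/", "/r/", "/ih/", "/ng/"]

# Fabricated training data: each phoneme gets a cluster of 2-D "features"
# (crude stand-ins for the spectral measurements real systems extract).
rng = np.random.default_rng(0)
centers = np.array([[-3.0, -3.0], [-3.0, 3.0], [3.0, -3.0], [3.0, 3.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])
y = np.repeat(np.arange(len(PHONEMES)), 50)

# A one-layer softmax model trained with plain gradient descent.
W = np.zeros((2, len(PHONEMES)))
b = np.zeros(len(PHONEMES))
for _ in range(200):
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1          # gradient of cross-entropy loss
    W -= 0.1 * (X.T @ p) / len(y)
    b -= 0.1 * p.mean(axis=0)

pred = (X @ W + b).argmax(axis=1)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The point of the sketch is only the shape of the problem: given enough labeled examples of each phoneme's acoustic signature, a model can learn boundaries between them—and the more varied the training speech, the more pronunciation variants those boundaries cover.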
Think of phonemes like atoms: changing just one in a word changes its meaning. For example, replace the /k/ phoneme in “king” with the /r/ phoneme, and it becomes “ring.”
But each language has its own phoneme inventory, and a phoneme that changes a word’s meaning in one language might not have the same effect in another. For instance, English has 18 consonant phonemes and another 15 vowel phonemes, while Mandarin Chinese has 22 consonant phonemes and about seven vowel phonemes. Why “about”? Because linguists disagree about exactly how many there are. That’s another example of how nuanced speech is—and why it’s so challenging to get software to learn those nuances.
And it doesn’t end there. There are dialects within each language, and dialects sometimes have different vocabulary (think of what a “bun” is called across dialects of English). But there is nearly always variation at the level of phonemes: the same phoneme is pronounced differently by speakers of different dialects. Again, this presents a practical problem for speech recognition, and again it is addressed by training on lots of data that demonstrates this variability.
So Many Variables
Bottom line: with enough training, a neural network can learn how people pronounce each phoneme. Yet as the “Yanny vs. Laurel” debate shows, there are still differences in how humans hear and interpret sounds. Why?
One reason is that some people’s ears do a better job of hearing lower frequencies, while others are tuned to higher ones. This difference can come from how a person’s native language has tuned their hearing, or from hearing loss that muffles certain high- or low-frequency sounds. Another variable is the quality of the audio, which can be influenced by factors like the equipment people use to play a sound file. For speech recognition to cope with this, the system needs to accurately discern each phoneme even when the person is on a mobile phone that uses a low-quality codec, is in a noisy place, or both.
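A small sketch can show why emphasizing different frequency bands changes what survives of a signal. The tone frequencies and band edges below are illustrative only (not measurements from the actual clip): we build a signal with one low and one high component, then apply a crude FFT band mask standing in for an ear—or playback equipment—that favors one band over the other.

```python
import numpy as np

sr = 16000                      # sample rate in Hz (illustrative)
t = np.arange(sr) / sr          # one second of samples
# A made-up "voice": a 300 Hz component plus a quieter 3000 Hz component.
signal = np.sin(2 * np.pi * 300 * t) + 0.8 * np.sin(2 * np.pi * 3000 * t)

def band_energy(x, lo, hi, sr):
    """Energy of x within the [lo, hi] Hz band, via a simple FFT mask."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    mask = (freqs >= lo) & (freqs <= hi)
    return float(np.sum(np.abs(spectrum[mask]) ** 2))

low = band_energy(signal, 0, 1000, sr)      # what a "low-tuned" ear keeps
high = band_energy(signal, 1000, 8000, sr)  # what a "high-tuned" ear keeps
print(f"low-band energy:  {low:.0f}")
print(f"high-band energy: {high:.0f}")
```

Two listeners effectively receiving different band energies are, in a sense, listening to different signals—which is exactly the opening the manipulated clip exploits.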
So, how does speech recognition cope with the “Yanny vs. Laurel” challenge? We had our speech recognition platform listen to the clip, and it heard “Laurel.” It’s important to note that the audio clip is, of course, atypical; it was specifically manipulated to create the effect it has, and such problems rarely arise in real life. We do sometimes mishear each other, but typically we are mixing up words that sound similar—and speech recognition occasionally faces the same challenge. Could we have manipulated the system to hear “Yanny” instead of “Laurel”? Probably, either by changing the so-called preprocessing (the part that tunes in on specific frequencies) or by making “Yanny” a word that is as common as, or more common than, “Laurel.”
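The second trick—making “Yanny” a more common word—can be sketched with a toy decoder that combines an acoustic score with a word-frequency prior (a unigram language model). The scores and frequencies below are invented for illustration, not taken from any real system.

```python
import math

# Made-up acoustic log-likelihoods: the audio slightly favors "laurel".
acoustic_score = {"laurel": -4.0, "yanny": -4.5}

def decode(word_freq):
    """Pick the word maximizing acoustic log-score plus log word prior."""
    total = sum(word_freq.values())
    return max(
        acoustic_score,
        key=lambda w: acoustic_score[w] + math.log(word_freq[w] / total),
    )

# With "laurel" far more frequent, the decoder hears "laurel"...
print(decode({"laurel": 900, "yanny": 100}))   # prints "laurel"
# ...but inflating "yanny"'s frequency flips the decision.
print(decode({"laurel": 900, "yanny": 2000}))  # prints "yanny"
```

The design point is that recognition is never acoustics alone: the system’s expectations about which words are likely act as a thumb on the scale, much as a human listener primed to expect “Yanny” is more likely to hear it.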
But for now, we’ll stop short of taking this step, and enjoy the “Yanny vs. Laurel” debate along with the rest of the world. It’s a fun example that offers great insights into how we hear and speak.
How do we harness IVR and improve it with AI, using the power of the voice?