Overcoming Accents and Speech Irregularities

Thanks to technology being studied and developed at the Toronto Rehabilitation Institute, people with speech impediments and foreign accents could someday become much more easily understood.

The original purpose of the research, led by Frank Rudzicz, an expert in speech recognition and artificial intelligence, was to help people with pathological speech disorders, but its use could go far beyond that.

Rudzicz, an assistant professor in the computer science department at the University of Toronto and founder and chief science officer at Thotra, has developed a system that transforms the speech signals of people who have trouble communicating clearly into a form that can be more easily understood. It corrects for stuttering, inserts dropped sounds, and adjusts the rhythm of speech.

According to Rudzicz's findings, recognition rates among human listeners nearly doubled when the module that corrects pronunciation errors was used, rising from 21.6 percent for the original speech to 41.2 percent, a relative increase of about 91 percent.
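The arithmetic behind those two rates is easy to check: 41.2 percent is about 191 percent of 21.6 percent, which corresponds to a relative increase of about 91 percent. The snippet below is only a worked check of the quoted figures; the function name is illustrative and not part of the research.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


# Recognition rates quoted in the article, in percent.
before, after = 21.6, 41.2

print(f"relative increase: {relative_increase(before, after):.1f}%")  # 90.7%
print(f"ratio to original: {after / before * 100:.1f}%")              # 190.7%
```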

"This system is a substantial step toward full automation in speech transformation without...expert or clinical intervention," Rudzicz says.

The technology automatically detects where stuttering happens, examining the acoustics to identify "disfluencies," or speech irregularities, then cuts them out and blends together the adjacent segments of speech. It also includes a classifier, a machine-learning component that has learned from other people's speech data what constitutes stuttering. "We use a couple of machine-learning algorithms," Rudzicz says. "Neural networks, for example, look at different examples of stuttering and regular speech and learn the difference between the two. You provide the neural network with the new data, and the neural network automatically identifies the stuttering and removes it."
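The detect-then-cut step can be illustrated with the simplest kind of disfluency, a long pause. The sketch below is not the neural-network classifier Rudzicz describes. It swaps in a plain energy threshold over fixed-length frames, and every name and parameter in it (`find_long_pauses`, `cut_ranges`, `frame_len`, `energy_threshold`, `min_pause_frames`) is an assumption made for illustration.

```python
from typing import List, Tuple


def find_long_pauses(samples: List[float], frame_len: int,
                     energy_threshold: float,
                     min_pause_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges covering runs of quiet frames.

    A frame is "quiet" when its mean squared amplitude is below
    `energy_threshold`; a run counts as a long pause when it spans at
    least `min_pause_frames` consecutive frames.
    """
    pauses = []
    run_start = None
    n_frames = len(samples) // frame_len
    for f in range(n_frames + 1):
        if f < n_frames:
            frame = samples[f * frame_len:(f + 1) * frame_len]
            quiet = sum(x * x for x in frame) / frame_len < energy_threshold
        else:
            quiet = False  # sentinel frame closes a run that reaches the end
        if quiet and run_start is None:
            run_start = f
        elif not quiet and run_start is not None:
            if f - run_start >= min_pause_frames:
                pauses.append((run_start * frame_len, f * frame_len))
            run_start = None
    return pauses


def cut_ranges(samples: List[float],
               ranges: List[Tuple[int, int]]) -> List[float]:
    """Return `samples` with every (start, end) range removed."""
    kept = []
    cursor = 0
    for start, end in ranges:
        kept.extend(samples[cursor:start])
        cursor = end
    kept.extend(samples[cursor:])
    return kept


# Toy signal: speech-like samples, a long silence, then more samples.
signal = [0.5] * 20 + [0.0] * 40 + [0.5] * 20
pauses = find_long_pauses(signal, frame_len=10,
                          energy_threshold=0.01, min_pause_frames=3)
shortened = cut_ranges(signal, pauses)
```

A hard cut like this leaves an abrupt join between the surviving segments, which is why the system blends them afterward.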

Rudzicz says the technology uses synchronous overlap, which blends two segments of speech so the pitch doesn't suddenly jump when a stretch of speech is cut out.
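The idea of blending two segments across a cut can be sketched with a linear crossfade, in which the end of one segment fades out while the start of the next fades in. This is a simplified stand-in rather than the synchronous-overlap method the system uses: it does not align the segments by pitch period, and the function name and `overlap` parameter are assumptions for illustration.

```python
from typing import List


def crossfade_join(left: List[float], right: List[float],
                   overlap: int) -> List[float]:
    """Join two sample sequences, blending `overlap` samples at the seam.

    Across the overlap the weight on `left` falls linearly while the
    weight on `right` rises, so the waveform changes gradually rather
    than jumping at the cut point.
    """
    overlap = max(0, min(overlap, len(left), len(right)))
    head = left[:len(left) - overlap]   # untouched part of the left segment
    tail = right[overlap:]              # untouched part of the right segment
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)     # weight of the right-hand segment
        a = left[len(left) - overlap + i]
        b = right[i]
        blended.append((1.0 - w) * a + w * b)
    return head + blended + tail


# Joining a segment of ones to a segment of zeros with a 2-sample overlap.
joined = crossfade_join([1.0] * 4, [0.0] * 4, overlap=2)
```

With `overlap=0` the function reduces to plain concatenation, which is the abrupt join the blending is meant to avoid.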

To demonstrate how the system works, Rudzicz has applied it to speech from Colin Firth, the actor who famously played King George VI and stammered his way through the movie The King's Speech.

"We used automatic detection of speech disfluencies, such as long pauses and stammering," he says. "We cut that and background noise out, and there was blending between adjacent speech segments."

The system does not need full speech recognition to get many of the speech transformations to work. "Speech recognition can make mistakes, so we want to avoid using it whenever possible," Rudzicz says. "Even though Siri is very expressive, Siri is not your voice, and she will always say the same sentence in exactly the same way, whereas you might put your own inflection on it, which in a way encodes your personality."

Speech processed through the system also does not sound as robotic as, say, Siri. "We can adjust some of the vowels to sound like another vowel," Rudzicz says. "You're modifying the speech signal and adjusting it to remove stuttering or long pauses or background noises; you're not removing things like tempo, the timing and rhythm of their speech."

While mostly geared toward those with speech disorders, the technology could also be a boon for overseas contact centers, where foreign accents have been something of an Achilles' heel for the customer service industry. In this case, a computer could sit between the speaker and the listener over a network. The computer would take the speaker's speech and play the modified version to the listener at the other end.

"There are scripts, so we know that when they say the word fear and it comes out sounding like fair, we know that the 'a' should have been an 'e' and we can modify this," Rudzicz says.

Rudzicz predicts that if customers have an easier time understanding call center employees, the overall experience will be improved. "It is possible that to some extent this could save some money in training employees to sound more intelligible, but this technology cannot correct major syntactic or grammatical errors," he says.
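The scripted correction Rudzicz describes, where a word such as fear comes out sounding like fair, amounts to comparing the sounds the script expects with the sounds actually produced and flagging the mismatches. The sketch below shows only that comparison step under loud assumptions: the phoneme symbols are ARPAbet-style labels chosen for illustration, the one-entry pronunciation table is hypothetical, and a real system would work on the audio signal rather than on ready-made symbol lists.

```python
from typing import List, Tuple


def find_substitutions(expected: List[str],
                       heard: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, heard_sound, expected_sound) for each mismatch.

    This sketch only handles substitutions, so both sequences must be
    the same length; dropped or inserted sounds are out of scope here.
    """
    if len(expected) != len(heard):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, h, e)
            for i, (e, h) in enumerate(zip(expected, heard))
            if e != h]


# Hypothetical pronunciation table for words that appear in the script.
SCRIPT_PRONUNCIATIONS = {"fear": ["F", "IH", "R"]}

# Sounds as produced by the speaker: "fear" coming out like "fair".
heard = ["F", "EH", "R"]

for pos, got, want in find_substitutions(SCRIPT_PRONUNCIATIONS["fear"], heard):
    print(f"position {pos}: heard {got}, script expects {want}")
# prints: position 1: heard EH, script expects IH
```

Because the comparison works sound by sound against a known script, it says nothing about word order or grammar, which matches the limitation Rudzicz notes.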

The technology could also become portable at some point, as the algorithms are very simple and could fit inside a standard smartphone, according to Rudzicz.

"Everything [on the system] runs very quickly," he says. "If this was implemented on a smartphone or a tablet, it could be run just as quickly as it is now in face-to-face situations."