Speech Goes to School

Imagine a device that can detect if someone is lying during a phone conversation or one that interrupts loud talkers, or an algorithm that can detect Parkinson's disease from a voice sample. From the practical to the seemingly fantastical, university speech research is offering up myriad projects that may one day become as commonplace as Siri.

Building a Better User Interface

At the Massachusetts Institute of Technology, Blade Kotelly, lecturer at the Gordon–MIT Engineering Leadership Program and CEO of Storytelling Machines, focuses on helping people understand how to design emotionally and intellectually compelling user interfaces.

"This is the most important aspect of speech technology right now," Kotelly says. "It's how you teach people what they can and can't do when they engage with a speech system—IVR interfaces and iPhones. The system needs to correctly understand the intent [of] what they've said and do the right thing to provide feedback that both helps users with what they want now and…learn how to use other parts of the system later, even when the system changes."

Kotelly teaches Engineering Innovation and Design, where students create hundreds of different designs every semester using software from Angel.com. The software allows students to make apps quickly, even if they don't know anything about speech apps.

For example, students made an application designed to help parents put kids to bed. It asks, "Which superhero does your child like the most?" The parent could respond "Batman." If the system doesn't recognize the superhero, it may say, "I don't know that one right now, but if you want, you can ask for that later." At this point, the parent can understand what the system can do, which keeps the interaction efficient. Once the system understands what the parent wants to do, the mother or father can hand the phone to the child. When the child says "Hello," Batman's voice comes on and asks questions. It doesn't matter what the child says, because the system will say, "Batman likes to go to bed early," which can motivate the child.

"The interesting stuff is about the user interface," Kotelly says. "What does it give you back for an answer? People have been talking for years about multimodal interfaces. You can say something and see a result. But that's not really using much multimodal quality…. When you speak something, it shows you some words you spoke, but at the moment, it's just scratching the surface of how a full multimodal interface could work. It's not using that to its full advantage. It's the response you get that helps you form an understanding of what it can or can't do—this is important because learnability is at the core of what makes a good, large vocabulary speech system. If the system can do everything but people don't know it, then the system isn't being used fully."

Are You Talking to Me?

What if your IVR system could calm an irate caller, or mimic his voice so that his experience is more pleasant? These scenarios are closer than you might think.

At Carnegie Mellon University, Alan Black, an associate professor in the Language Technologies Institute, is working on several projects, including analyzing the emotion behind what's being said. "This is probably where the core future is; it's not just what people say but how they say it," Black says.

Black pointed to similar work showcased in an annual emotion recognition challenge, organized by colleagues at Erlangen University in Germany, whose goal is to automatically identify which of 12 emotions a speaker is expressing.

"We do analysis of speech to find out what the personality of the speaker is, what the emotion is," he says. "We're also able to synthesize that so that we can actually engage people in an appropriate conversation to be able to get the right personality and emotion."

Toward that end, Black is working on trying to detect when people are angry—one indicator might be that they are hyperarticulating—and then trying to de-escalate the conversation. "We've tried a number of techniques to get people to stop doing that, and one of the things we've discovered is having our synthesizer speak more quietly actually calms people down so they stop shouting," he says.

Also working on human behavior and speech is Julia Hirschberg, a professor in the department of computer science at Columbia University. One area of research Hirschberg and her colleagues are focused on is improving dialogue systems so that speaking with them is much more like talking to a real human being than a robotic voice. By taking large samples of conversations, she and her team noticed that people would sometimes start talking like each other, using the same speaking rates and words. This type of unconscious mimicking is known as entrainment or alignment.

"We're now doing experiments to see if this actually makes a difference in people's evaluation of the dialogue system," Hirschberg says. "We've built a spoken dialogue system and bring in subjects and record some of their speech, and note some of their characteristics, such as pitch."

The subjects are asked to perform a task in which they interact with the system, which either entrains, so that its speech is very close to the way they speak, or speaks in a way that's completely different from theirs. The subjects are then questioned about what they like or dislike about the system.
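One way to picture the entraining condition is as a blend between the system's default voice settings and what was measured from the subject. The sketch below is only an illustration under that assumption, not the Columbia team's method: the `Prosody` fields, the `entrain` function, and the `weight` parameter are all hypothetical names, and pitch and speaking rate are used simply because the article mentions them.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    """Hypothetical summary of how a voice sounds."""
    pitch_hz: float
    words_per_minute: float


def entrain(system: Prosody, user: Prosody, weight: float) -> Prosody:
    """Move the system's settings toward the user's.

    weight = 0.0 leaves the system unchanged; weight = 1.0 copies the user.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")

    def blend(system_value: float, user_value: float) -> float:
        # Linear interpolation between the two measurements.
        return system_value + weight * (user_value - system_value)

    return Prosody(
        pitch_hz=blend(system.pitch_hz, user.pitch_hz),
        words_per_minute=blend(system.words_per_minute, user.words_per_minute),
    )


if __name__ == "__main__":
    default_voice = Prosody(pitch_hz=200.0, words_per_minute=150.0)
    caller = Prosody(pitch_hz=100.0, words_per_minute=180.0)
    print(entrain(default_voice, caller, weight=0.5))
```

A non-entraining control would simply skip the blend, or deliberately pick settings far from the subject's.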

"Our hypothesis is that they're going to like it when the system is talking more like them," Hirschberg says. "The research focuses on allowing more humanlike interactions. The subjects indicate that they don't want systems that are menu-based. It will probably be five to ten years [before] we can have a completely natural interaction, but it's a lot closer than it has been."

Whom Do You Trust?

It can be difficult to tell when someone is lying to your face, and even harder to detect untruths over the phone. However, scientists at Nagoya University in Japan have developed what they say is the world's first technology that analyzes phone conversations to automatically detect situations in which one party might "overtrust" a scammer.

Overtrust might involve a situation in which an individual may have some sort of diminished capacity and can't objectively evaluate an explanation being given by another party. The university and researchers at Fujitsu, a partner in the research, have created a function that, by detecting changes in voice pitch and level, is able to infer situations of overtrust on the part of an intended victim overwhelmed by distressing information from someone who is trying to defraud him.

The scientists also have developed basic technology for detecting remittance-soliciting phone phishing scams by combining the technology for detecting situations of overtrust with the detection of characteristic keywords. The technology uses a keyword list provided by the National Police Academy in Japan and recordings of actual remittance-solicitation phone scams.

The organizations said that in cases of remittance-solicitation phone phishing scams, a perpetrator might pretend to be an acquaintance of the victim, or someone in a position of authority, such as a police officer or lawyer. The researchers developed a function using word-spotting voice recognition technology to identify when the suspected perpetrator uses special keywords from a preregistered list, such as indebtedness or compensation. This function, which ignores everything except the keywords on the preregistered list, detects the number of times keywords relating to remittance-solicitation scams are spoken.
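The counting step can be sketched in a few lines. This is a simplified illustration, not the researchers' implementation: their function spots words in audio, while the sketch assumes a text transcript is already available. The two keywords come from the article; the rest of the preregistered list is not public here, and the function names are hypothetical.

```python
import re
from collections import Counter

# Only these two terms are named in the article; the real list is longer.
SCAM_KEYWORDS = frozenset({"indebtedness", "compensation"})


def count_keywords(transcript: str, keywords=SCAM_KEYWORDS) -> Counter:
    """Count each listed keyword, ignoring every other word."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return Counter(word for word in words if word in keywords)


def total_hits(transcript: str, keywords=SCAM_KEYWORDS) -> int:
    """Total number of times any listed keyword was spoken."""
    return sum(count_keywords(transcript, keywords).values())


if __name__ == "__main__":
    call = ("Your son caused an accident and owes compensation. "
            "The compensation will clear his indebtedness.")
    print(count_keywords(call))
    print(total_hits(call))
```

How many hits should raise an alert, and how the count is combined with the overtrust signal, is not described in the article, so the sketch stops at the count.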

"There are limits to human powers of perception and judgment," the researchers said. "When overwhelmed with information that may be distressing, some individuals…may have a diminished capacity to objectively evaluate information provided by another party. In situations of overtrust, there is the risk of believing everything another person is saying, even in cases of remittance-soliciting phone phishing scams, for example. In order to prevent such scams, there is a need to detect such situations and provide appropriate support."

The SpeechJammer

If only Jerry Seinfeld had a SpeechJammer gun to silence all those long, loud, close talkers. Thanks to Japanese researchers, this could one day be a reality.

Scientists at Ochanomizu University and the National Institute of Advanced Industrial Science and Technology are focusing on technologies that can control the properties of people's speech remotely. As a first step in the project, the researchers reported on a system that jams remote speech using Delayed Auditory Feedback (DAF).

"It is thought that when we make utterances, we not only generate sound as output, but also utilize the sound actually heard by our ears (called auditory feedback) in our brains," the researchers wrote in a paper. "Our natural utterances are jammed when the auditory feedback is artificially delayed. It is thought that this delay affects some cognitive processes in our brain. This phenomenon is known as speech disturbance by DAF." DAF has a close relationship with stuttering; it leads physically unimpaired people to stutter, otherwise known as speech jamming.

The group built two prototype devices, dubbed SpeechJammer guns, which deliver the speech back to the speaker and take into consideration the distance between the speaker and the device. Users operate the speech jamming function by sighting the device toward the person speaking (the target) and pulling the trigger switch like a pistol.

"In general, human speech is jammed by giving the speakers back their own utterances at a delay of a few hundred milliseconds," said the researchers. "This effect can disturb people without any physical discomfort, and disappears immediately [when they] stop speaking. Furthermore, this effect does not involve anyone but the speaker."
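The core of that effect is a plain time delay, which can be sketched as follows. This is a minimal illustration assuming the speech is already a list of digital samples; the function name and parameters are hypothetical, and the distance compensation the prototypes perform is left out because the article does not say how it works.

```python
def delayed_feedback(samples, sample_rate_hz, delay_ms):
    """Return the signal shifted later by delay_ms, with silence in front."""
    if delay_ms < 0:
        raise ValueError("delay_ms must not be negative")
    # Convert the delay from milliseconds to a whole number of samples.
    delay_samples = round(sample_rate_hz * delay_ms / 1000)
    return [0.0] * delay_samples + list(samples)


if __name__ == "__main__":
    # One second of placeholder audio at 16 kHz, delayed by 200 ms.
    speech = [0.5] * 16000
    jammed = delayed_feedback(speech, sample_rate_hz=16000, delay_ms=200)
    print(len(jammed) - len(speech))  # number of silent samples added
```

A real device would do this continuously on a live microphone stream rather than on a finished recording.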

I Only Have Ears for You

Have you ever noticed that when you're in a crowded room filled with loud voices, you're still able to focus on one person talking to you and can ignore other simultaneous conversations? Thanks to scientists from the University of California, San Francisco, the mystery of the so-called "cocktail party effect" has been solved.

UCSF neurosurgeon Edward Chang, M.D., a faculty member in the UCSF Department of Neurological Surgery and the Keck Center for Integrative Neuroscience, and UCSF postdoctoral fellow Nima Mesgarani, Ph.D., wanted to get a better understanding of how selective hearing works in the brain.

The scientists selected three patients with severe epilepsy who were undergoing brain surgery. Targeting the parts of the brain that cause seizures, the scientists placed electrodes on the outer surface of the patients' brains, so they could record activity in the temporal lobe, where the auditory cortex is located.

The patients were asked to listen to two speech samples, from two different speakers saying different phrases, played at the same time. They were then tasked with pinpointing words spoken by one of the two speakers. Using a decoding algorithm on the brain recordings, the researchers found that they could determine which speaker and which words the patient had heard.

"The combination of high-resolution brain recordings and powerful decoding algorithms opens a window into the subjective experience of the mind that we've never seen before," Chang said on UCSF's Web site. "The algorithm worked so well that we could predict not only the correct responses, but also even when they paid attention to the wrong word."

The findings could one day improve consumer electronic devices with voice-activated interfaces. Mesgarani said that this, however, could be a long way off, as the engineering required for distinguishing one voice from a sea of speakers is very complex.

Speech recognition, Mesgarani said on the UCSF Web site, "is something that humans are remarkably good at, but it turns out that machine emulation of this human ability is extremely difficult."

Speech Makes Medical Strides

Speech technology has increasingly played a role in medical research, aiding scientists in uncovering and developing a host of helpful solutions that may someday cut medical testing costs and reach remote parts of the world.

Max Little, a Wellcome Trust fellow at the Media Lab at MIT, is working with students as well as other scientists from the University of Oxford on the Parkinson's Voice Initiative research.

Little applied the techniques that he had been developing for voice disorders to data that was blinded as to whether the subject had Parkinson's or not. His techniques correctly separated those who had Parkinson's from those who didn't with an accuracy rate of 86 percent, which has subsequently improved to 99 percent.

"Parkinson's is a movement disorder; you see it in people's limb movements, but you can hear it in their voice as well," Little says. "We look for tremor and weakness in the voice, and then we use that in order to objectively score someone's symptoms on a standard clinical scale."

Currently, the project seeks to collect 10,000 voice samples from around the globe, and has set up phone numbers so that both healthy people and Parkinson's patients can contribute voice samples anonymously. The calls take an average of three minutes, but the actual voice sample is about 10 seconds.

"It's not necessary to have running speech," Little says. "We're not looking at articulations, we're looking at effects on the vocals, and how the dynamics change."

Although Little's team has shown that the technology can detect Parkinson's from voice recordings, the initiative is for scientific research only, and callers don't receive a diagnosis.

"This could enable some radical breakthroughs, because voice-based tests are as accurate as clinical tests, but additionally, they can be administered remotely, and patients can do the tests themselves," Little says. "Also, they are ultra low-cost, as they don't involve expert staff time, so they are massively scalable."

When Snoring Is More Than Annoying

In Israel, Dr. Yaniv Zigel, at Ben-Gurion University of the Negev, is working with other scientists on ongoing research regarding obstructive sleep apnea (OSA) diagnosis using audio signals, such as snores and speech. The research is being performed in cooperation with the Sleep-Wake Disorder Unit at Soroka Medical Center in Israel, headed by Ariel Tarasiuk.

"Our hypothesis is that it is possible to distinguish between OSA and non-OSA subjects by analyzing particular speech signal properties using an automatic computerized system," Zigel says. "In the past, it was already found that OSA is associated with several anatomical abnormalities of the upper airway that are unique to this disorder. From the speech side, it is well known that acoustic parameters of human speech are affected by the physiological properties of the vocal tract, such as vocal tract structure and soft tissue characteristics. Therefore, it was suggested that acoustic speech parameters of an OSA patient may differ from those of a non-OSA subject."

After recording the speech of more than 100 subjects whom different doctors had referred to a sleep clinic as potential OSA patients, the scientists found that a unique set of acoustic features extracted from the speech signals, combined with a method they developed for classifying subjects as OSA or non-OSA, could achieve a 90 percent classification rate.

"These results show that acoustic features from speech signals of awake subjects can predict OSA, and the suggested system can be used as a basis for future development of a tool for initial and noninvasive screening of potential patients," Zigel says. "In our ongoing research, we are developing a method to estimate the severity of OSA using speech signals."

While Ben-Gurion University may have to sleep on this technology a little more, its future—and that of the other universities—in speech research is ripe with promise.

Staff Writer Michele Masterson can be reached at mmasterson@infotoday.com.
