November 10, 2012
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

A New Age for Computer Interactions

This is an exciting time in speech technology, with advances occurring almost daily. Three new technologies with exciting applications are prosody-based speech synthesis, speech classifiers, and brain-computer interfaces.

Can emotion in synthesized speech improve voice dialogues? Many grammar-based speech recognition systems provide a confidence score that indicates the probability of correctness of a recognized response spoken by the user. Speech researchers in Israel (highlighted at this year's Afeka-AVIOS Speech Processing Conference) used confidence scores to add prosody to speech synthesis for presentation to the user. The researchers found that users prefer "sensitive" prosody-based systems to those not sensitive to the users' prosody.

In a prosody-sensitive system, the system asks, "Who are you calling?" Based on the confidence score from the speech recognition engine, hesitations and disfluencies are inserted into the response produced by the speech synthesizer. If the confidence score is high, the speech synthesis generates a traditional response, such as "Dialing Haim Cohen." If it is in the medium range, the speech synthesis generates a response with a hesitation, such as "Dialing...Haim Cohen." If it is in the low range, the speech synthesis generates a response with a disfluency, such as "Dialing, um, Haim Cohen." This helps prepare callers in case the recognition engine makes an error and routes them to the wrong target.
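The three-way mapping from confidence score to response can be sketched in a few lines of Python. This is only an illustration: the function name `confirmation_prompt` and the cutoffs 0.8 and 0.5 are assumptions made for the example, not values reported by the researchers, and a real system would tune them for its own recognition engine.

```python
def confirmation_prompt(name: str, confidence: float,
                        high: float = 0.8, low: float = 0.5) -> str:
    """Return the text to synthesize for a recognized callee name.

    `confidence` is the recognizer's score in [0, 1]. The `high` and `low`
    cutoffs are hypothetical and chosen only for illustration.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        # High confidence: traditional response.
        return f"Dialing {name}."
    if confidence >= low:
        # Medium confidence: insert a hesitation.
        return f"Dialing...{name}."
    # Low confidence: insert a disfluency.
    return f"Dialing, um, {name}."


if __name__ == "__main__":
    for score in (0.92, 0.65, 0.30):
        print(score, confirmation_prompt("Haim Cohen", score))
```

Keeping the cutoffs as parameters makes it easy to experiment with how often callers hear a hesitation or a disfluency.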

Prosody can be improved if it reflects the user's emotion, detected by speech classifiers. Speech classifiers are widely used to route telephone calls to the appropriate department or person and for speaker identification and verification. They can also be used to detect the user's emotion. Imagine how an application could improve its user interface when the speech recognition system detects that a user has become bored, tired, frustrated, or excited.

There are many other applications for speech classifiers. Some call centers in Russia use classifiers to detect when a caller becomes angry and reconnect the caller to a call center supervisor. Speech classifiers also may be used to estimate the user's approximate age, gender, stress level, and degree of intoxication. In the future, law enforcement officers may use speech classifiers to identify drunk drivers.

Another team of researchers in Israel developed a screening technique to determine if people suffer from sleep apnea based on a 30-second recording of their voices. This screening process may save patients the time, expense, and discomfort of a night spent in a medical sleep lab being tested for this disorder.

Is it possible to detect other types of illnesses by the sound of a person's voice? Perhaps. And, taking innovation a step further, can a person control a computer solely through his or her thoughts?

Yes, according to an Austrian company that markets a new brain-computer interface product. The product overlays the PC screen with a mask that contains icons used to control the program running on the screen. The icons flicker at different frequencies. Users wear caps containing tiny monitors. When the user pays attention to one of the icons, its flickering frequency can be detected by the monitors, resulting in the execution of the command assigned to that icon. Icons may represent letters for spelling words for text messages or represent commands to external devices, such as printers, TVs, assistive robots, and games. The product has an accuracy level of up to 98 percent.
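The idea of picking the icon whose flicker frequency shows up most strongly in the recorded signal can be sketched as follows. How the product actually processes its signals is not described here, so everything in this example is an assumption: it uses the Goertzel algorithm, a standard way to measure the power of a single frequency in a sampled signal, and the function names, icon names, frequencies, and sample rate are all hypothetical.

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Estimate the power of `samples` near `target_hz` (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest frequency bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def select_icon(samples, sample_rate, icon_frequencies):
    """Return the icon whose flicker frequency is strongest in the signal.

    `icon_frequencies` maps a hypothetical icon name to its flicker rate in Hz.
    """
    return max(
        icon_frequencies,
        key=lambda icon: goertzel_power(samples, sample_rate, icon_frequencies[icon]),
    )


if __name__ == "__main__":
    rate = 256  # samples per second (assumed)
    icons = {"print": 8.0, "tv": 12.0, "robot": 15.0}  # assumed flicker rates
    # Synthetic two-second signal dominated by the 12 Hz flicker.
    signal = [math.sin(2 * math.pi * 12.0 * t / rate) for t in range(2 * rate)]
    print(select_icon(signal, rate, icons))
```

A practical system would also need to handle noise, decide when the user is not attending to any icon, and combine readings from several sensors; the sketch ignores all of that.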

Brain-computer interfaces could change how people interact with computers. This can be life-changing for people with disabilities, but others will benefit as well. When the cost of brain-computer interfaces becomes economical, there will be many commercial uses by consumers who don't mind wearing a cap studded with monitors. When gamers play a computer game, their brains effectively become a "third hand" for entering instructions. People watching TV will no longer need a remote control; they'll just think of the name of the program they want to watch.

Using classifiers to process speech and other biometric signals in conjunction with prosody could lead to dramatic new interaction techniques between people and computers. The possibilities are limitless.

James A. Larson, PhD, is an independent speech consultant. He is co-program chair for SpeechTEK and teaches courses in speech user interfaces at Portland State University in Portland, Oregon. He can be reached at jim@larson-tech.com.
