TTS (text-to-speech) has become mainstream technology. For proof of that claim, one need only cite the recent parodies of TTS voices by National Public Radio (www.npr.org/ramfiles/me/20020214.me.05.ram) and Saturday Night Live (www.burnsorama.com/SpeechRecoDate.wmv).
As those and other spoofs reveal, today's commercial TTS is still unnaturally uniform in affect, which is why researchers agree that one of the remaining hurdles is creating TTS that can express a range of emotions and attitudes in dialogue (called "expressive TTS"). The fact that I've written two feature articles on expressive TTS (see "Show Some Emotion" in this issue and "Once More with Feeling" coming in the March/April issue) shows how complex and difficult this challenge is. I commend Loquendo for releasing the first commercial expressive concatenative TTS product in 2005 and for stating clearly that it's a first-generation product.
Developers of expressive TTS aren't the only speech professionals seeking to exploit the link between speech and emotion. Developers of speech input systems believe that emotion detection could easily become a central element of corporate customer-retention programs. A tool that could identify irate callers (or callers who are heading in that direction) and then initiate whatever procedures are in place for handling such situations would be extremely powerful. Emotion detectors could be integrated into agent training and could help monitor the performance of agents as well as that of self-service applications. "We tend to focus on self-service operations but there's a lot of turnover in call centers," explains David Nahamoo, head of IBM's Human Language Technology Group. "Some agents may not be trained properly and might not be aware of the nuances of customer care. A tool that provides feedback could provide great assistance."
Mining for Emotions
Commercial tools for detecting emotion in call centers are typically part of speech-analytics systems, but speech-analytics companies disagree about the value of emotion detection and how it should be implemented. "Emotion detection is critical for identifying true customer intent and pre-empting defection," says Ilan Kor, speech analytics product manager at NICE Systems, the first speech-analytics company to introduce emotion detection software. "If a customer expresses dissatisfaction, this needs to be flagged in real time. The issue can then be reviewed and handled almost instantaneously." He adds that NICE emphasizes acoustic indicators of emotion because "word spotting may catch 'yes, you have great service,' but only emotion detection can show that what was really said was an angry 'YES, you have GREAT service!!!'"
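The acoustic approach Kor describes looks at how something is said rather than what is said. The sketch below is not NICE's algorithm, which is proprietary; it is a minimal illustration of the idea, using one crude prosodic cue (sudden jumps in frame-level loudness) on a synthetic waveform. A real detector would model pitch, speaking rate, and many other features with trained classifiers.

```python
import math

def frame_energies(samples, frame_len=160):
    """Split a waveform into fixed-size frames and compute per-frame RMS energy."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        energies.append(rms)
    return energies

def looks_agitated(samples, energy_jump=2.0):
    """Flag a segment whose loudness spikes sharply above its average --
    a toy stand-in for the prosodic cues a real emotion detector models."""
    energies = frame_energies(samples)
    if len(energies) < 2:
        return False
    mean_e = sum(energies) / len(energies)
    if mean_e == 0:
        return False
    return max(energies) / mean_e > energy_jump

# Synthetic 1-second signals at 8 kHz: a steady tone vs. one with a loud burst.
calm = [math.sin(2 * math.pi * 220 * t / 8000) * 0.3 for t in range(8000)]
angry = list(calm)
for t in range(4000, 4800):
    angry[t] *= 6.0  # sudden burst of loudness mid-call

print(looks_agitated(calm))   # False: energy is uniform
print(looks_agitated(angry))  # True: energy spikes well above the mean
```

Even this toy version shows why acoustic detection catches what word spotting misses: the two signals could carry identical words, yet only one is flagged.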
Yochai Konig, chief technology officer of Utopy, says his company's SpeechMiner product performs emotion detection using acoustic information. Utopy found the technology "helpful for identifying some cases where a caller is angry, but not all of those cases." Konig therefore prefers to focus on combining linguistic and non-linguistic content to get the necessary business intelligence.
"Many people think that when you look at emotion you're looking at the acoustic signal or tone of voice. That's not true for Nexidia's approach to emotion detection," says Mark Finlay, Nexidia's head of research. Unlike NICE and Utopy, Nexidia has adopted a purely linguistic technique.
Nexidia's vice president of marketing, Anna Convery, explains, "What we call our 'customer satisfaction' queries, ones which indicate things such as anger, dissatisfaction, happiness, and other emotions, look for specific words, phrases, and statements, such as, 'I want to talk to the manager' or 'This is ridiculous.'" According to Convery, "When our customers see that it's the verbiage that is pinpointing the emotion, they realize that what they really need to do is to identify the people who are saying those words."
An Evolutionary Process
Clearly, the detection of emotion in speech is as difficult and multi-faceted as the expression of emotion. As with expressive TTS, the commercial emotion-detection tools of today are simply the first generation of this evolving technology. Their successes and failures will lead to increasingly accurate measures of emotion that can ultimately be used by the next generation of self-service automation, intelligent agents, and a whole panoply of other systems.