Q&A: Human Emotions and Digital Agents

Bruce Balentine is a Chief Scientist at Enterprise Integration Group specializing in speech, audio, and multimodal user interfaces. In almost three decades of work with speech recognition and related speech technologies, Balentine has designed user interfaces for telecommunications, desktop multimedia, entertainment, language training, medical, in-vehicle, and home automation products. Balentine will moderate the panel "How Can Digital Agents Make Use of User Emotion?" at the SpeechTEK conference in April. Conference chair James A. Larson interviewed Balentine in advance of this year's conference.

Q: What emotions can be automatically detected and classified from a user’s speech?

A: There are two approaches to detecting emotion. The first considers the acoustical signal coming from a human user. The second looks at text, which has been converted from a speech signal via speech recognition.

Acoustical processing: The first method considers what is known as prosodic information—loudness, stressed/unstressed inflection patterns, pitch contours/pitch range, spectral distortions caused by tension in the user’s voice/breath/body, and similar physical characteristics of the acoustical signal. These signals measure the physiology of emotion, and can be used to infer both its intensity and its valence. Valence is a term that describes the positive or negative impact of the user’s subjective feelings.
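As a rough illustration of the kinds of prosodic measurements described above, the sketch below computes two simple cues from a frame of audio samples: RMS energy (a loudness/intensity proxy) and a zero-crossing-based pitch estimate. This is a minimal, hypothetical example; production systems use far richer features (pitch contours, spectral tilt, jitter) and trained models.

```python
import math

def prosodic_features(samples, sample_rate):
    """Compute two simple prosodic cues from one frame of audio.

    RMS energy approximates loudness (a rough intensity cue), and the
    zero-crossing rate gives a crude proxy for pitch height. Real
    systems extract much richer prosodic feature sets than this.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    # Each full pitch period contributes two zero crossings.
    est_pitch_hz = crossings * sample_rate / (2 * len(samples))
    return {"rms_energy": rms, "est_pitch_hz": est_pitch_hz}

# A loud 200 Hz tone stands in for one frame of raised, tense speech.
rate = 8000
frame = [0.8 * math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]
feats = prosodic_features(frame, rate)
```

Raised pitch and energy relative to a user's baseline would then feed an inference about arousal, not a specific emotion class.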

Reliance on the acoustical signal is uncertain and ambiguous—hence the preference for the verb “infer” rather than “detect and classify.” Although we may say that we are “detecting” emotion in a broad general sense, we cannot really say that we are accurately distinguishing between, say, anger and fear, or joy and humor. A better way to think of physiological measurements is to call them “affective” detection and analysis, rather than “emotion” per se. Affect is a term that some use to distinguish the physical components of language from the cognitive components (where specific emotion classes reside).

Not all ASR technologies make these prosodic measurements, but they are well known in the signal-processing industry and have been, and can be, used to infer affective states and the user’s general emotional context.

Semantic processing: The second method looks at speech that has already been recognized and converted to text, applying natural language understanding (NLU) technologies to parse, classify, and understand the meaning (semantics) of the speech. This method is often called “sentiment analysis,” and is more accurate and specific to a broader set of emotional classes. The reason is that users who are feeling emotion tend to use certain language as a natural part of a conversational discourse. Although explicit phrases such as “Boy, am I mad” occur only occasionally, more subtle expressions—swear words, spontaneous declarations (“Siri, you are so stupid!”), and embedded keywords (sorry, sad, annoyed, great)—can point toward distinctions between positive and negative classes including joy, pleasure, surprise, anger, annoyance, and disappointment. These sentiments may then be used to modulate the behavior of the conversation, or to trigger entries into a user profile for future analysis.
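The keyword-spotting idea above can be sketched very simply. The lexicon below is hypothetical and tiny; real sentiment analysis uses trained classifiers over much larger vocabularies and context, but the principle—mapping embedded keywords to valence classes—is the same.

```python
# Hypothetical keyword lexicon; production NLU uses trained classifiers.
NEGATIVE = {"stupid", "sorry", "sad", "annoyed", "mad", "terrible"}
POSITIVE = {"great", "thanks", "wonderful", "perfect", "love"}

def sentiment(utterance: str) -> str:
    """Classify recognized text as positive/negative/neutral by keyword hits."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    neg, pos = len(words & NEGATIVE), len(words & POSITIVE)
    if neg > pos:
        return "negative"
    if pos > neg:
        return "positive"
    return "neutral"
```

The returned class could then modulate dialogue behavior or be logged to a user profile, as described above.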

Note that an ideal system relies on both acoustical measurements and sentiment analysis to generate a more refined picture of the user’s emotional state, allowing detailed user modeling and even a limited Theory of Mind (ToM), wherein the machine is able to act upon a theory of the user’s mind to manage the conversation and subsequent task performance.
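A toy sketch of such fusion might look like the following, where an acoustic arousal score and a text-derived valence score are combined into a coarse emotion label. The thresholds and labels are purely illustrative assumptions, not derived from any real system.

```python
def fused_emotion(acoustic_arousal: float, text_valence: float) -> str:
    """Combine an acoustic arousal score (0..1) with text valence (-1..1).

    High arousal plus negative text suggests anger; high arousal plus
    positive text suggests excitement; low arousal reads as calm.
    Thresholds here are illustrative, not empirically derived.
    """
    if acoustic_arousal < 0.5:
        return "calm"
    return "anger" if text_valence < 0 else "excitement"
```

Combining both channels helps disambiguate cases—such as loud, energetic speech—that the acoustic signal alone cannot classify.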

Q: Can these same emotions be expressed to a user via synthesized speech?

A: Yes, to a degree. Speech synthesis applies the two methods above in reverse—first generating text that contains expressions that imply emotion, and then rendering the text into an acoustical speech signal that contains the prosodic characteristics of the emotions. Many synthesizers today—notably concatenative TTS that is based on acoustical snippets of real recorded human voice sounds—can swap in a different set of acoustical patterns as the base source for phonemes and syllables. These technologies are based on extensive sampling of real people under various emotional circumstances, and so can simulate the subtle acoustical attributes of a stressed physiology.

Note that machines do not feel the emotions, and users know that they don’t, so the users tend to view the expressions as though the application is an actor or actress—simulating feelings for the sake of the authenticity of the conversation. Applications that try to make use of these capabilities sometimes also use other modalities, for example a lip-synching avatar or humanoid robot that gestures or makes facial expressions synergistic with the voice sounds. The goal is generally to create a “character” that is appealing to the user, and to achieve a more “realistic” conversation.

Q: How can knowing a user’s emotions enhance the user-computer dialog?

A: There is considerable controversy here. Some researchers (I can send you some references if you want) support the idea that such knowledge can be used to make real decisions. Two common ones are to respond to user anger by terminating an IVR call and sending the caller immediately to an agent; or, conversely, to respond to user happiness by offering some new product or service as a promotion.

Neither hypothesis holds up to usability testing. In the former case, users learn to feign anger in order to “trick” the IVR into giving them an agent. Even when the anger is real, the behavioral conditioning is all wrong. Instead of rewarding users who choose self-service, the solution rewards resistance to self-service and conditions the user to choose behaviors that are detrimental to application goals. In the latter case, user “happiness” correctly detected is often undone by the unsolicited promotion, leaving a sour taste and a disappointed user—precisely the opposite of the intended effect. User “happiness” that is misclassified (error rates are high with these technologies) confuses users and leads to error amplification.

Better solutions include the following:

  • Building a feedback loop that uses “tension” or “anxiety” detection to calibrate the rate of speech and insertion of white space into the machine’s spoken audio. These two features together represent “pace,” and the effect is to align the machine’s pace with the user’s desired pace. Subsequent detection then measures whether the adaptation has been effective.
  • Responding to user anger by shifting the initiative from a user-directed to a machine-directed mode, using queries to handhold and walk the user back to a stable place. After recovery, such applications easily revert to a mixed-initiative or user-initiated dialogue style.
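The first of these feedback loops can be sketched as a single adaptation step: when detected tension is high, slow the machine's speech rate and widen the pauses ("white space"); otherwise, relax both back toward nominal values. All constants below are hypothetical placeholders for values a real design would tune and validate through subsequent detection.

```python
def adapt_pace(rate_wpm: int, pause_ms: int, tension: float):
    """One step of a pacing feedback loop.

    `tension` is a 0..1 score from the affect detector. High tension
    slows speech rate and widens pauses; low tension nudges both back
    toward nominal pace. All constants are illustrative assumptions.
    """
    NOMINAL_RATE, NOMINAL_PAUSE = 160, 300
    if tension > 0.6:
        rate_wpm = max(120, rate_wpm - 10)   # slow down, floor at 120 wpm
        pause_ms = min(700, pause_ms + 50)   # widen white space, cap at 700 ms
    else:
        # Relax gradually back toward the nominal pace.
        rate_wpm += (NOMINAL_RATE - rate_wpm) // 4
        pause_ms += (NOMINAL_PAUSE - pause_ms) // 4
    return rate_wpm, pause_ms
```

Running this step on each turn, with fresh tension measurements in between, is what closes the loop: the next detection tells the system whether the adaptation worked.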

Q: What new applications will speaker emotion detection and synthesis enable?

A: More research is needed—and better tools from the technology vendors are required—before we’ll see real progress here. But my research points toward a class of application in which a personally- or professionally-used device is tightly coupled to a user who is concentrating closely on a difficult but important task. This class of applications requires a tight turn-taking protocol, must support user-initiated backup moves, and cannot exhibit social or frivolous personality traits—exhibiting instead task-focus, concentration, intelligence, and an excellent Theory of Mind. The users will learn to rely on subtle and multi-leveled signals from such applications, and emotion has its place in that mix.

Such applications include robot assistants, especially in healthcare, manufacturing, and design. Also included are public kiosks, design tools for the arts (especially music), multi-user groupware, and some entertainment systems (games). Notably, the most common present-day application of artificial speech in conversation—customer care and call center front-end systems (IVR)—is unlikely to benefit, for a number of reasons that I will go into in my SpeechTEK session.

Q: Can lies be detected from a user’s speech?

A: Not really. There has been some work on this challenge—expanding on biometric studies in general—that claims to achieve some degree of accuracy. Dr. Judith Markowitz has studied this problem. But realistically, such a scheme is unlikely to lead anywhere useful.
