Speech Emotion Recognition: The Next Step in the User Experience
Speech emotion recognition (SER) is a branch of the larger discipline of affective computing, which is concerned with enabling computer applications to recognize and synthesize a range of human emotions and behaviors. But why do we need SER in the first place? The short answer: SER can greatly enhance the user experience.
Automatic speech recognition (ASR) is all around us: we regularly interact with virtual assistants, electronic devices, and software applications via voice user interfaces built on ASR technology. But despite significant improvements in ASR, it often feels like there is a missing ingredient when we interact with these smart devices and applications. There's a big difference between ASR-mediated interactions and normal human-to-human communication: when we interact with other humans, we account for their emotional states and expressions and adjust our responses, understanding, and behavior accordingly.
Human communications are rich and complex. You might have heard of the oft-quoted 7-38-55 rule of communication. Formulated in the 1960s, this axiom of sorts claims that it’s not words but nonverbal cues that do the heavy lifting of conveying the intent and meaning of our communications. According to this, the actual words spoken account for only 7 percent of the meaning; voice expressions and intonations account for 38 percent; and the rest, 55 percent of meaning, is conveyed through body language.
Let me note that this rule has not held up to scientific scrutiny in terms of the percentages apportioned, but it remains popular and contains a kernel of truth: there are important informational signals to be gleaned from speech patterns. Not just the “what” but also the “how” matters. ASR technology works better when both the semantics of the message and the speaker’s emotional state are considered.
Note also that our focus here is on speech emotion recognition and not as much on making machine-generated speech sound more human-like by injecting intonations and emotions using emotion markup tags. Automatic generation of realistic-sounding speech that closely approximates human speech is a welcome but different topic. If many of the current ASR applications seem too robotic or unrealistic, that’s perhaps because they interpret our spoken words too literally, without a sense of the underlying emotions.
Speech emotion recognition has a variety of applications across a wide range of domains, as described below:
Customer support and employee wellness: Analyzing voice calls to identify the customer’s emotional state can lead to better handling of customer service calls. For example, an angry customer can be directed to a support agent who has been trained to handle such situations. Once the emotion is identified, the software can be programmed to suggest a conversational script tailored for customers who are upset.
Voice analysis of agents’ conversations can provide clues about their stress levels and emotional wellness. Time-series analysis of such data can identify patterns of customer behavior, changes and trends in employee and team motivation levels, and other actionable insights. These insights can help improve both employee engagement and customer satisfaction.
The coronavirus pandemic has led to unprecedented levels of remote work arrangements, and these arrangements can impact employee morale. SER-based analytics applications can help organizations assess how well employees are coping with the sense of isolation these new work conditions can produce.
Healthcare and assistive robotics: There is a lot of interest in companion robots for patients and elders in nursing homes and care centers. These robots can learn the different emotional states of the users they are assisting, and this will go a long way toward increasing their acceptability and adoption. Another use case pertains to autistic individuals who have difficulty recognizing the emotions being expressed by the people with whom they interact. SER applications can provide cues on the emotions behind the words.
E-learning applications: During online learning, students can experience various emotional states, including anxiety, confusion, and boredom. Inputs about the learners’ current states, such as their level of interest, can be used to change the pace of teaching or prompt a different teaching style; all this can help enhance student engagement and lead to better learning outcomes.
Sports and video games: SER can help identify key moments and exciting parts of a sports game or match by analyzing the commentary and generating highlight clips. In video games, users play through digital avatars, and the expressions and actions of these avatars can be modified based on the emotions expressed by the players, making the gaming experience more fun and engaging.
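To make the customer support scenario above concrete, here is a minimal sketch of how an application might act on an SER model’s output to route a call. The emotion labels, confidence threshold, and queue names are illustrative assumptions, not part of any real SER product or API.

```python
# Hypothetical sketch: routing a support call based on a detected emotion.
# Labels, threshold, and queue names are illustrative assumptions.

def route_call(emotion: str, confidence: float, threshold: float = 0.7) -> str:
    """Pick a support queue given an SER model's label and confidence."""
    if confidence < threshold:
        # Low-confidence prediction: treat the caller as neutral.
        return "default_queue"
    if emotion in {"angry", "frustrated"}:
        # Escalate to agents trained in de-escalation, with a tailored script.
        return "escalation_specialists"
    if emotion == "confused":
        return "product_experts"
    return "default_queue"

print(route_call("angry", 0.91))  # escalation_specialists
print(route_call("angry", 0.40))  # default_queue
```

In a real deployment, the emotion label and confidence would come from an SER model scoring the live audio stream, and the routing rules would be tuned to the organization’s own queues and policies.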
This is by no means an exhaustive list of applications. In fact, use cases for SER can be found in a variety of other scenarios of human-computer interaction that involve speech. The foundation of SER is being able to correctly derive the underlying emotion, but how exactly does that work? How does SER complement the sentiment analysis of written text? What are the challenges and limitations?
As a decades-long research field, SER has many traditional techniques, but we are also seeing the application of newer deep learning methods. How are artificial intelligence and deep learning approaches helping us improve SER? We will examine these topics in this space in Speech Technology’s spring issue.
Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and is the co-author of Practical Artificial Intelligence: An Enterprise Playbook.