Interest Mounts for Emotion Detection

With huge technological advances in just a few short years, the debate about whether artificial intelligence can be sentient has moved out of science fiction and Hollywood circles and into the boardrooms of companies big and small all over the world. The real question that business leaders are asking: Is AI emotion detection ready for prime time?

There’s little doubt that having artificial intelligence that can recognize human emotions could be beneficial to contact centers, marketing, sales, recruiting, hiring, and so many other business functions, but do we really want machines to interpret or replace human interactions? And do we trust them to do so accurately?

It would seem so. Emotion detection is starting to make inroads into all sorts of business processes, and research firm MarketsandMarkets projects the global emotion detection and recognition market to grow from $23.6 billion this year to $43.3 billion by 2027, at a compound annual growth rate of 12.9 percent.

MarketsandMarkets’ report says this expected growth is due to a rising demand for speech-based emotion detection systems to analyze emotional states and for socially intelligent artificial agents; this demand is fueled in part by the increased need for operational excellence. But several factors are also working against wider adoption, chief among them a lack of clarity about the terms and technologies involved.

The terms “emotion detection” and “sentiment analysis” are often used interchangeably, but there are differences.

Sentiment analysis is typically a text-based learning classification task, according to officials at Deepgram, a provider of AI-based speech recognition and contact center solutions. It might operate on single sentences, paragraphs, or entire documents. Sentiment analysis has a variety of uses, including analyzing customer feedback, monitoring social media conversations, tracking brand reputation, gauging public opinion on a topic or issue, and evaluating customer satisfaction levels, the Deepgram experts wrote in a recent blog post.

Emotion detection, sometimes referred to as emotion recognition, on the other hand, typically relies on audio, using factors like intonation, volume, and speed to determine which emotion a speaker is feeling, usually coded as one of several categories like happy, sad, angry, etc., according to Deepgram.
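The distinction can be sketched in a few lines of Python. Everything below is hypothetical and for illustration only: the word lists, thresholds, and feature values are made up, and real systems use trained classifiers rather than hand-written rules.

```python
import re

# Hypothetical word lists; production sentiment models are trained classifiers.
POSITIVE = {"great", "thanks", "love", "helpful"}
NEGATIVE = {"broken", "refund", "terrible", "cancel"}

def text_sentiment(text: str) -> str:
    """Sentiment analysis: a text-based classification task."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def audio_emotion(mean_volume_db: float, speech_rate_wps: float) -> str:
    """Emotion detection: classifies acoustic cues (here volume and
    speaking speed) into categories like happy, sad, or angry."""
    if mean_volume_db > 70 and speech_rate_wps > 3.5:
        return "angry"
    if mean_volume_db < 50 and speech_rate_wps < 2.0:
        return "sad"
    return "neutral"

print(text_sentiment("I want a refund, this is terrible"))    # negative
print(audio_emotion(mean_volume_db=75, speech_rate_wps=4.0))  # angry
```

The point of the toy is the inputs, not the logic: sentiment analysis consumes text, while emotion detection consumes acoustic measurements of how that text was spoken.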

Emotion detection provides clues to a customer’s attitudes toward a company in ways that simpler analytic techniques don’t.

“Emotion is part of who we all are,” says Rick Britt, CallMiner’s vice president of AI. “We all feel anger and frustration, joy and pleasure. But we also share those emotions differently. For example, some people get very quiet and soft-spoken when they’re angry, as opposed to talking loud and fast. And as we all know, detecting and understanding emotions is hard.”

Understanding emotion within customer conversations, such as whether a customer is frustrated or satisfied with a company or product, can be extremely powerful for customer-facing organizations, Britt adds. “While detecting emotion is hard for machines, just like it’s hard for humans, advances in deep learning are helping companies identify the ways customers show a broad spectrum of emotions within their interactions.”

Also complicating the issue is the fact that emotions are unique to both the individuals and the organizations with which they’re engaging, according to Britt. Everyone has personal emotional baselines that can be situational. The emotional reactions that customers display when they’re engaging with a debt collector are far different than when they’re interacting with an e-tailer.

“When organizations can understand emotions effectively and accurately, they can better take action on what’s happening in their customer conversations,” Britt says. “This can include helping contact center or customer service agents better handle emotional interactions, such as when they might be interacting with a vulnerable customer who needs additional care and empathy. Or understanding customer emotions over the duration of a conversation so they can pinpoint the successful steps an agent took to de-escalate an interaction that might have started as negative and ended positive. These insights can drive better agent onboarding and coaching efforts.”

This capability is important for contact centers because when organizations can detect emotion in conversations, they can surface issues before they become real problems, empower agents with more data-driven performance feedback, learn from past interactions to improve customer outcomes in the future, and more, Britt says.

“Human communication is complex and contains verbal and nonverbal elements,” adds Kushal Lakhotia, senior staff applied scientist at Outreach. “Emotion is a crucial nonverbal component of how human beings express themselves. It is conveyed via audio and visual cues, e.g., intonations in speech and facial expressions. Emotion recognition technology extracts complementary signals to speech recognition and thus helps fully understand what a person is trying to communicate.”

Such data is especially useful in conversational intelligence applications, summarizing the salient points of a conversation that require deeper understanding of a person’s message beyond the words they are saying, according to Lakhotia.

Linguistics Is the Best Emotion Detector

However, D. Daniel Ziv, Verint’s vice president of go-to-market for speech and text analytics, counters that while technologies from Verint and others can identify volume, increased speaking speed, and similar acoustic indicators of a customer’s satisfaction or dissatisfaction with a company and a particular interaction, the actual words used are a better gauge of customers’ feelings.

“Not all swear words have four letters, and some words just naturally carry more emotion than others,” Ziv explains. “We can statistically identify which words carry more emotion. Because our transcription now is very accurate, that tends to generate very accurate results compared to using tone and pitch and speed and other things that can carry emotion.”
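Ziv’s point about statistically weighting words can be illustrated with a toy lexicon scorer. The words and weights below are hypothetical; as he describes, real systems derive such weights statistically from labeled call transcripts, and the approach depends on accurate transcription.

```python
import re

# Hypothetical emotional weights (0 = neutral, 1 = highly charged);
# production systems learn these statistically from labeled transcripts.
EMOTION_WEIGHTS = {
    "unacceptable": 0.95, "ridiculous": 0.9, "frustrated": 0.8,
    "annoying": 0.6, "delayed": 0.4, "fine": 0.1,
}

def linguistic_emotion_score(transcript: str) -> float:
    """Average emotional weight of the recognized words in a transcript.
    Returns 0.0 when no weighted words appear."""
    words = re.findall(r"[a-z']+", transcript.lower())
    hits = [EMOTION_WEIGHTS[w] for w in words if w in EMOTION_WEIGHTS]
    return sum(hits) / len(hits) if hits else 0.0
```

Because the score comes from the transcript rather than the audio, background noise such as a crying baby or an airport PA system cannot trigger a false positive, which is the advantage Ziv cites below.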

Ziv adds: “If I’m very angry, and I haven’t said a single angry word but [the satisfaction score] is only based on tone, there is a very high likelihood that it’s a false positive. There could be a baby crying in the background, I could be calling from a noisy bus or from an airport. We did a lot of tests and found that it’s more accurate to use linguistic-based sentiment and some acoustic-based evidence.”

Cross-talk—when the customer talks over the agent or vice-versa—is another strong indicator of true sentiment, as are long silences or gaps in the conversation, according to Ziv. “We’ve tested five different types of algorithms that use acoustic analysis for emotion detection. Acoustic-only is very inaccurate. Linguistic-only is very accurate, and it’s more accurate than it used to be because our transcription is now more accurate,” he says.
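Given a diarized transcript with turn timestamps, the cross-talk and long-silence indicators Ziv describes are straightforward to compute. This sketch assumes made-up (start, end, speaker) tuples; the threshold is arbitrary.

```python
def conversation_signals(turns, silence_threshold=3.0):
    """Return (total cross-talk seconds, count of long silences) from a
    list of (start_sec, end_sec, speaker) turns."""
    turns = sorted(turns, key=lambda t: t[0])
    cross_talk = 0.0
    long_silences = 0
    for (s1, e1, spk1), (s2, e2, spk2) in zip(turns, turns[1:]):
        if spk1 != spk2 and s2 < e1:          # next speaker starts before the
            cross_talk += min(e1, e2) - s2    # current one finishes: overlap
        elif s2 - e1 >= silence_threshold:    # a long gap between turns
            long_silences += 1
    return cross_talk, long_silences

# Hypothetical diarized turns for a short call segment:
turns = [
    (0.0, 4.0, "agent"),
    (3.0, 6.0, "customer"),   # talks over the agent for 1 second
    (10.5, 12.0, "agent"),    # follows a 4.5-second silence
]
print(conversation_signals(turns))  # (1.0, 1)
```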

Contact center customers are increasingly seeking emotion detection scores, Ziv says, because they want to replace follow-up surveys with automated sentiment coverage of 100 percent of their interactions, rather than only the small portion for which customers actually complete surveys.

A Better Sentiment Predictor

Emotion detection capabilities are becoming more popular in contact centers because they provide a truer picture of customer sentiment than Net Promoter Scores, according to Ziv. Obtaining NPS data takes effort on the part of customers, many of whom just don’t want to be bothered, he says. “Customers are tired of [NPS surveys] because they are bombarded with them. So there’s a decline in the response rate.”

Even if a customer does respond, the NPS survey doesn’t provide detail as to why customers will or will not refer a company, Ziv adds. “It’s not that helpful. It’s helpful in terms of identifying trends, but it doesn’t really help fix the problem. So the shift is to using the actual information we have from the customer.”

Companies are looking to extract true customer sentiment from emotions displayed via voice and text interactions, as well as the context surrounding those interactions, Ziv explains further. “So now we have a much richer understanding of what’s driving high and low sentiment.”

While solutions have gotten far more accurate in the past few years, that is only one of the recent advances in emotion detection, according to Lakhotia.

“Spoken emotion recognition is an area of research that focuses on the paralinguistics, which, unlike automatic speech recognition, requires capturing prosodic elements of speech,” Lakhotia explains. “The research in this area was focused on designing specific models to capture prosody that could be trained to detect emotion. However, with the evolution of self-supervised learning using neural networks in speech, the area has seen a shift from specialized models to generalized ones.”

Lakhotia adds that self-supervised learning enables a vast amount of unlabeled data to train models that can extract signals from speech. These models are typically trained on thousands of hours of speech, and such pre-trained models can then be adapted for an array of spoken tasks with much less labeled task-specific data.
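The pretrain-then-adapt pattern Lakhotia describes can be caricatured in plain Python. The "encoder" below is a stand-in stub for a model like wav2vec 2.0 or HuBERT, and the nearest-centroid head is just one simple way to adapt frozen features with a small amount of labeled data; none of this reflects any vendor's actual implementation.

```python
def pretrained_encoder(audio_frames):
    """Stand-in for a frozen SSL speech model: maps raw frames to a
    fixed-size representation (here, crude summary statistics)."""
    n = len(audio_frames)
    mean = sum(audio_frames) / n
    energy = sum(x * x for x in audio_frames) / n
    return (mean, energy)

def fit_head(labeled_clips):
    """Adapt with little labeled data: fit a nearest-centroid head over
    the frozen encoder's features, one centroid per emotion label."""
    feats_by_label = {}
    for frames, label in labeled_clips:
        feats_by_label.setdefault(label, []).append(pretrained_encoder(frames))
    return {
        label: tuple(sum(dim) / len(feats) for dim in zip(*feats))
        for label, feats in feats_by_label.items()
    }

def predict(head, frames):
    """Classify a clip by its nearest centroid in feature space."""
    f = pretrained_encoder(frames)
    return min(head, key=lambda lab: sum((a - b) ** 2 for a, b in zip(f, head[lab])))
```

The structure is what matters: the expensive, general-purpose encoder is trained once on unlabeled speech and reused unchanged, while only the tiny task-specific head sees the labeled emotion data.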

“Some popular SSL models that have been adopted widely in the last couple of years for multiple spoken tasks are CPC, wav2vec 2.0, and HuBERT,” Lakhotia says. “This has in turn led to the introduction of standardized benchmarks such as SUPERB, HEAR, and LeBenchmark, which have helped move the area forward by introducing a consistent way of comparing multiple SSL models on a collection of tasks, including spoken emotion recognition.”

While spoken emotion recognition is an active area of research, the datasets used for it are subsets of wider multimodal datasets, such as IEMOCAP, CREMA-D, and RAVDESS, that include vocal data as well as facial expressions, according to Lakhotia. The presence of such datasets is fueling multimodal emotion recognition research that goes beyond speech and incorporates audio-visual signals instead.

Churn Detection

Some companies, particularly those with high churn rates, hope that emotion detection can provide agents with strong, real-time indicators that a particular customer is likely to churn rather than just making an idle comment about going to a competitor, according to Ziv.

“In churn, you have to look at other factors,” Ziv says, noting that many times the customer frustration displayed toward a product might have little to do with the company that sold it.

In some industries, particularly telecom and insurance, customer churn can be extremely high at the end of contract periods.

“And adding speech categories that look for customers at risk typically improves those churn models significantly,” Ziv maintains. “Exactly how accurate it is varies from customer to customer. But we’ve seen things above 90 percent accurate, and we’ve seen sometimes just an improvement of churn from 50 percent to 70 percent, which is a big deal.”

While emotion can be an excellent indicator of churn, other factors, such as the availability of alternative suppliers, also affect it, Ziv points out. This is particularly common in the TV arena, as most locations have only a single cable provider. Even where a satellite provider is also available, the reality is that frustrated customers might have little real choice.

Privacy is another reason for hesitation around computer emotion detection: some observers consider existing emotion detection solutions, particularly those that include facial recognition technology, too personally invasive.

Emotion technology must avoid the dehumanization resulting from the emotional ignorance of many current machine learning systems, according to Rijul Gupta, founder and CEO of DeepMedia.AI. “Soulless technology at present is visible in the state of Google Translate (technically correct but lacking emotion) and TikTok Voice (sounds robotic). The lack of ingrained emotional detection in these systems does not produce consumer delight or even acceptance.”

Zoom Video Communications is reportedly starting to explore emotion detection technology, which has raised the ire of more than 28 human rights organizations. They have urged Zoom to halt its work on an emotion tracking system that aims to analyze the engagement and sentiments of users.

Many industry experts expect the privacy issue to loom large for years to come. But at the same time, the demand for the technology will grow, as evidenced by the MarketsandMarkets projection, and the technology itself will continue to evolve.

“There have been key developments in the last few years in building self-supervised models that can jointly extract signals from audio-visual inputs,” Lakhotia says. “This has enabled modeling spoken and visual inputs with a single model. The combination of development in multimodal modeling and the presence of high-quality audio-visual datasets to conduct experiments will propel the area beyond spoken emotion recognition and establish new state-of-the-art results for emotion recognition.”

Companies will continue to use emotion detection to help drive their voice-of-the-customer efforts, Ziv says. “I think that we’ll see more unique cases of acting on it, and the algorithms behind it will evolve somewhat.”

However, more than the algorithms or the emotion scores, the most important benefit for companies will be the ability to use the analytics to take action in real time rather than waiting until after an interaction occurs, Ziv says. 

Phillip Britt is a freelance writer based in the Chicago area. He can be reached at spenterprises1@comcast.net.
