November 10, 2017
By Dan Miller, Senior Analyst, Opus Research, Inc.
Interact

Speech Processing or Natural Language Understanding? You Be the Judge

Here’s a variant on the “Which came first?” riddle. What’s more important in the age of conversational commerce: speech processing (that is, automated speech recognition, or ASR, and text-to-speech, or TTS) or natural language understanding (NLU)? The two might be inseparable, but I’m very much starting to come down on the side of NLU.

It is gratifying to see that speech technology is finally achieving high visibility in homes and cars through the proliferation of voice-first devices. Google, Amazon, and Apple (along with Microsoft) are investing heavily in making voice a ubiquitous option. This has had a ripple effect as hundreds of companies and tens of thousands of developers now render their services as Alexa Skills, Google Actions, or their own flavor of voice-first apps. They enable millions of individuals to use their own (spoken) words to conduct searches, ask questions, or take charge of their home entertainment options.

The result: billions of voice queries launched each day by saying “Alexa,” “OK Google,” “Hey Siri,” or some other magic phrase.

Yet these initiatives would not go very far were it not for the NLU engines and application development platforms behind them. These are backed by computing power dedicated to assigning sets of utterances to specific categories (a.k.a. “tagging”), building huge databases of utterances that are treated as synonyms of one another (a.k.a. “building grammars”), and constantly determining the best answer or action to take in response to spoken input.
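To make “tagging” and “building grammars” concrete, here is a minimal sketch in Python. It is a toy, not how any commercial NLU engine works: the intent names, the example phrasings in `GRAMMAR`, and the word-overlap scoring are all assumptions chosen for illustration. Each tag owns a handful of phrasings treated as synonyms, and an incoming utterance (already rendered as text) is assigned to the tag whose phrasing it overlaps most.

```python
import re

# Hypothetical "grammar": each intent tag maps to phrasings treated as synonyms.
# The tags and phrases are illustrative assumptions, not taken from any real platform.
GRAMMAR = {
    "play_music": ["play some music", "put on a song", "start my playlist"],
    "lights_on": ["turn on the lights", "switch the lights on", "lights on"],
    "weather": ["what is the weather", "will it rain today", "weather forecast"],
}


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def tag_utterance(utterance, grammar=GRAMMAR):
    """Return (tag, score) for the best-matching intent, or (None, 0.0) if nothing overlaps.

    The score is the Jaccard overlap (shared words / all words) between the
    utterance and the closest phrasing listed under a tag.
    """
    words = tokenize(utterance)
    best_tag, best_score = None, 0.0
    for tag, phrases in grammar.items():
        for phrase in phrases:
            phrase_words = tokenize(phrase)
            union = words | phrase_words
            if not union:
                continue
            score = len(words & phrase_words) / len(union)
            if score > best_score:
                best_tag, best_score = tag, score
    return best_tag, best_score


if __name__ == "__main__":
    print(tag_utterance("Please turn the lights on"))  # ('lights_on', 0.8)
```

Real systems replace the word-overlap score with statistical or neural classifiers trained on very large sets of utterances, but the shape of the task is the same: many different ways of saying something collapse into one category that an application can act on.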

Indeed, when you look at the system architecture behind many voice-first services, you find that the spoken words do not remain in their initial audio form for very long. Amazon, for instance, does not share audio input with the individuals or companies that provide its “skills.” Instead spoken words are immediately rendered as text, and subsequent actions that are required to derive meaning and respond to intent are performed as if they were typed in by the original speaker.

Speech processing is important, of course. The quantum improvement in recognition accuracy that’s happened in just the past year or so (word accuracy now hovers at a little over 90 percent, which is to say a word error rate just under 10 percent) can take some credit for raising confidence that spoken instructions will actually be understood. But the real improvement has been in NLU, where the oft-cited fields of Big Data and analytics have had a tangible positive impact on providing correct results and, in turn, creating a better user experience.
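For readers who want to see how the metric is scored, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the recognizer’s output into the reference transcript, divided by the number of words in the reference. Word accuracy is commonly reported as one minus WER. The sketch below computes it in Python with a standard edit-distance table; the function name and the sample sentences are made up for illustration, and real scoring tools also normalize punctuation and spelling, which this omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra word recognized)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "turn on the living room lights please play some jazz"
    recognized = "turn on the living room light please play some jazz"
    wer = word_error_rate(spoken, recognized)
    print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")  # WER: 10%, word accuracy: 90%
```

One wrong word in a ten-word request gives a WER of 10 percent, or 90 percent word accuracy, which is why the two figures should not be confused.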

Speech Processing That Plays Well with Others

Obviously, there is no rivalry at play here. Improved speech recognition coupled with advancements made in microphones and other sensors just keeps feeding more and better audio input into the Big Data repositories that fuel the conversational commerce continuum. In addition, TTS, the rendering of text output as spoken words, is constantly improving by its own measures. For their part, Siri and Alexa can show empathy, tell jokes, and display other forms of expressiveness that were, shall we say, lacking in past renditions.

In addition to its close ties to NLU, speech recognition is also finding affinity with other elements of a conversational user interface. Intelligent assistants can now perceive facial expression, observe hand gestures, and detect changes in pitch and timbre, all of which are key indicators of an individual’s mood, disposition, or patience with the course that a conversation is taking.

The happy harmonizing of a proliferation of sensors, improved accuracy in speech recognition, the humanizing of spoken output, and the “understanding” of diverse sets of input heralds a golden age in conversational commerce. Speech-based services will always remain a key part, but equal or greater attention will be paid to technologies that support NLU and many other flavors of what the general press refers to as artificial intelligence (AI).

My counsel to longtime “speech geeks” (like myself)? Hang in there. Speech processing just keeps getting better, and there is a very important place for it in offering voice-first services.

Dan Miller is the founder and lead analyst of Opus Research.