The Changing Perception and Reality of Speech Recognition

Most consumers have encountered speech recognition mainly in call center automation, where the dialogue can be annoyingly overstructured. Callers may also feel that the automated system is keeping them from speaking to an agent, a further source of annoyance. The negative view of the technology also stems in part from its limitations compared with human speech recognition.

Today, the technology seems to be at a tipping point, with both its capabilities and the perception of it rapidly moving speech recognition toward an everyday experience. Apple's Siri is a big part of the attitude change. The model of a friendly personal assistant, easily available, seemingly always with you in the form of a mobile phone, and apparently responding to unstructured speech (natural language), has changed perceptions both of how useful the technology can be and of how far speech recognition has come.

The friendly part of the perception is in part due to Apple's marketing genius. When the company emphasized the naturalness of the interaction rather than reminding users they were talking to a computer, I initially thought that the natural language model would encourage users to push the service beyond its capabilities. But Apple foresaw this issue and turned it into an advantage. The company put in clever canned answers to many of the test questions that Siri might be asked (from "What is the meaning of life?" to "Will you marry me?"). As a marketing tool and confidence builder, this insight is proving tremendously effective.

The speech technology in Siri is remarkable. The mobile phone is one of the most difficult environments for speech recognition, with background noise a typical issue. The iPhone includes noise cancellation, which helps. Beyond that, the recognition accuracy for unconstrained speech with very few context restrictions is impressive. What accounts for this apparent quantum leap in the capabilities of speech recognition?

Part of the accuracy can be attributed to the speech recognition itself, and part to the natural language processing of the transcribed speech. The transcription of the speech to text is displayed so one can see what Siri "heard," and it is remarkably accurate (based on personal experience and the reaction of the marketplace). What adds to the experience, however, is the post-processing, which can compensate for recognition errors. For example, in my own experience, the iPhone responded to one spoken request with the text interpretation, "Fries electronics near here," but then, without further interaction, displayed the location of a Fry's Electronics store nearby, the correct interpretation of intent. The natural language processing either made its own match to similar-sounding words or worked from recognizer output that included more than the highest-scoring option (an n-best list).
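The kind of post-processing described above can be illustrated with a minimal sketch. It is not Apple's method, which is not public. It assumes a hypothetical n-best list of (text, score) pairs from a recognizer and a hypothetical three-entry business directory, and it uses plain string similarity from Python's difflib as a stand-in for a real phonetic match.

```python
from difflib import SequenceMatcher

# Hypothetical directory; a real service would query a large, updated database.
KNOWN_BUSINESSES = ["Fry's Electronics", "Five Guys", "Friendly's"]


def normalize(text):
    """Lowercase the text and keep only letters and spaces."""
    return "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ").strip()


def best_business_match(hypotheses, businesses=KNOWN_BUSINESSES, threshold=0.6):
    """Return (business_name, combined_score) for the best match, or None.

    hypotheses is an n-best list of (text, recognizer_score) pairs.
    The threshold and the scoring formula are illustrative assumptions.
    """
    best = None
    for text, rec_score in hypotheses:
        words = normalize(text).split()
        for name in businesses:
            target = normalize(name)
            n = len(target.split())
            # Slide a window with the same word count over the hypothesis.
            for i in range(len(words) - n + 1):
                window = " ".join(words[i:i + n])
                similarity = SequenceMatcher(None, window, target).ratio()
                if similarity >= threshold:
                    combined = similarity * rec_score
                    if best is None or combined > best[1]:
                        best = (name, combined)
    return best


if __name__ == "__main__":
    n_best = [
        ("Fries electronics near here", 0.9),
        ("Fry's electronics near here", 0.7),
    ]
    print(best_business_match(n_best))
```

In this sketch the top-scoring transcription still reads "Fries electronics," yet the lookup resolves to the Fry's Electronics entry, which mirrors the behavior described in the example.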

Another aspect of this performance is the infrastructure used. The speech recognition and natural language processing are done in the network, so they can use the processing power and memory resources of a server rather than being constrained by the small device. The processing also has access to large, constantly updated databases, such as listings of local businesses. The core speech recognition probably uses more than a pure statistical language model, with entries such as business names or street addresses represented by lists referenced from within the model, so that the lists are easy to update without rebuilding the entire model.
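The idea of a list embedded in a statistical language model can be sketched as a class-based bigram model. This is an assumption about one way such a design might work, not a description of Siri's actual recognizer. The class name ClassBasedBigramModel, the <BUSINESS> token, and the uniform probability over list members are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class ClassBasedBigramModel:
    """Toy bigram model in which a class token stands in for a list of entries."""

    def __init__(self):
        self.bigram_counts = defaultdict(lambda: defaultdict(int))
        self.class_members = defaultdict(set)  # class token -> set of phrases

    def train(self, sentences):
        """Count bigrams over token lists that may contain class tokens."""
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            for prev, cur in zip(padded, padded[1:]):
                self.bigram_counts[prev][cur] += 1

    def add_member(self, class_token, phrase):
        """Update a list without touching the bigram counts."""
        self.class_members[class_token].add(phrase.lower())

    def _bigram_prob(self, prev, cur):
        following = self.bigram_counts.get(prev)
        if not following:
            return 0.0
        return following.get(cur, 0) / sum(following.values())

    def score(self, tokens):
        """Return P(tokens); a list entry is passed as a single token."""
        prob = 1.0
        mapped = []
        for tok in tokens:
            tok = tok.lower()
            for class_token, members in self.class_members.items():
                if tok in members:
                    prob *= 1.0 / len(members)  # uniform over list members
                    tok = class_token
                    break
            mapped.append(tok)
        padded = ["<s>"] + mapped + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            prob *= self._bigram_prob(prev, cur)
        return prob


if __name__ == "__main__":
    model = ClassBasedBigramModel()
    model.train([["find", "<BUSINESS>", "near", "here"]])
    model.add_member("<BUSINESS>", "Fry's Electronics")
    print(model.score(["find", "best buy", "near", "here"]))  # unknown entry
    model.add_member("<BUSINESS>", "Best Buy")  # no retraining needed
    print(model.score(["find", "best buy", "near", "here"]))
```

The design point is that adding a business changes only the membership list. The bigram statistics, which are the expensive part to estimate, stay untouched.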

Beyond the significant technology advances, this tipping point is supported by consumer enthusiasm for the personal assistant model of interaction. This attitude change is important because it leads to consumer tolerance of inaccurate responses when they do occur, and a willingness to repeat or restate a request.

A subsidiary effect of the personal assistant model is that call centers will face even more resistance to automation if they don't adopt a less-structured natural language approach, since consumers now know that it is possible. Conversely, call centers that do adopt it will see more acceptance of automation. Companies have a chance to build on this major change in attitudes by recognizing a paradigm shift and adopting the assistant model.

William Meisel, Ph.D., is president of TMA Associates (www.tmaa.com) and publisher and editor of Speech Strategy News. He is a consultant on market and product opportunities created by the maturing of speech technology and is co-organizer of the Mobile Voice Conference with the nonprofit AVIOS, where he serves as executive director.

