The Evolution of Speech Recognition
In her column "Building Smart Systems with Cognitive Computing" (Speech Technology, Spring 2014), Sara Basson argued that what people really want from speech recognition is speech understanding. Understanding implies that the software can take action on what is being said—i.e., interpreting what is intended by the speech rather than just displaying it, sometimes called natural language interpretation (NLI).
Speech understanding is exemplified by mobile personal assistants: Apple's Siri, Google's voice search, Microsoft's Cortana, and others. These virtual assistants try to address much of what we do with smartphones and are moving to other platforms.
Less ambitious objectives can still satisfy the desire for speech understanding. The trick is to find applications with a limited context, allowing a company to create an interaction that feels natural to the user within that context.
Some enterprise applications that are used frequently, such as warehouse picking systems or field workers updating Salesforce.com software by voice, can tolerate a more structured interaction; with regular use, the system comes to feel "natural" over time.
But what about systems that address only occasional interactions with an untrained user, such as customer service calls? In the past, IVR systems have used unnatural interactions, often limiting a caller's choice at each point in a decision tree by a series of prompts. Today, however, there are IVR systems that say, in effect, "Please tell me how I can help you." The key to making this possible is that the interaction is in the limited context of what customers ask a specific company—not general language understanding.
A similar context constraint is implicit in speech recognition that selects from a list of items, such as your contact list when you use voice dialing. Speech recognition company Novauris specializes in recognizing very long lists, and has demonstrated recognizing an item from a list of 243 million distinct items, so "lists" doesn't necessarily imply simplicity. When one can choose any movie available for rental, any song available for purchase, or any entertainment option available on a TV by just saying it, the interaction will certainly seem unconstrained. Again, the key is a context that the user intuitively understands.
Sophisticated NLI techniques can learn from a large database of examples, an approach broadly called machine learning. Other, more linguistic approaches use a database of words or phrases with similar meanings to aid interpretation. Building such systems can be complicated, but some companies are developing tools that reduce the technical challenge.
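The synonym-database idea can be illustrated with a minimal sketch: map words and phrases with similar meanings to one canonical term, so later processing only has to recognize the canonical form. The `SYNONYMS` table and the `normalize` function here are hypothetical, for illustration only.

```python
# Hypothetical synonym table: many surface forms map to one canonical term.
SYNONYMS = {
    "book": "schedule", "arrange": "schedule", "set up": "schedule",
    "cost": "price", "charge": "price", "fee": "price",
}

def normalize(utterance: str) -> str:
    """Rewrite an utterance so downstream rules see only canonical terms."""
    text = utterance.lower()
    # Replace longer phrases first so "set up" is handled before single words.
    # (Naive substring replacement; a real system would match whole words.)
    for phrase, canon in sorted(SYNONYMS.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(phrase, canon)
    return text

print(normalize("Can you book a meeting and tell me the fee?"))
# can you schedule a meeting and tell me the price?
```

Real linguistic NLI systems are far richer than a flat lookup table, but the principle is the same: collapse many ways of saying something into one form the application can act on.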
An alternative approach is writing custom software tailored to the specific task, guided by human intuition. The software might search for specific key phrases to identify context (e.g., "schedule" would suggest making an appointment through the user's calendar application) and then use processing specific to that context (such as searching for a time, day, and contact name for the scheduled calendar item). This is one way to handle the "How can I help you?" prompt in IVRs when the scope of responses is sufficiently understood.
There are widely available tools that support "defined grammars," essentially a list of all the sentences that the speech recognition can understand expressed in a more compact manner. These tools allow for defining subsections of the grammar (for example, a date, an account number, or a contact name). The structure of defined grammars can aid a custom program by allowing quick identification of important information used in taking action on the request.
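A defined grammar can be thought of as a compact description of every sentence the recognizer should accept, with named subsections that can be reused and that directly yield the information needed to act. Here is a toy sketch of that idea; the rule names and matcher are invented for illustration and do not reflect any particular grammar tool's syntax.

```python
# Toy defined grammar: each rule lists its alternatives, and $NAME refers
# to a subsection, mirroring how grammar tools factor out items such as
# dates, account numbers, or contact names. All names are illustrative.
GRAMMAR = {
    "$ROOT": ["call $CONTACT", "pay my $ACCOUNT bill"],
    "$CONTACT": ["alice", "bob"],
    "$ACCOUNT": ["gas", "electric"],
}

def match(tokens, parts, slots):
    """Match a token list against one pattern, filling subsection slots."""
    if not parts:
        return slots if not tokens else None
    head, rest = parts[0], parts[1:]
    if head.startswith("$"):  # subsection: try each of its alternatives
        for alt in GRAMMAR[head]:
            alt_tokens = alt.split()
            if tokens[:len(alt_tokens)] == alt_tokens:
                result = match(tokens[len(alt_tokens):], rest,
                               {**slots, head[1:].lower(): alt})
                if result is not None:
                    return result
        return None
    if tokens and tokens[0] == head:  # literal word must match exactly
        return match(tokens[1:], rest, slots)
    return None

def parse(utterance):
    tokens = utterance.lower().split()
    for alternative in GRAMMAR["$ROOT"]:
        result = match(tokens, alternative.split(), {})
        if result is not None:
            return result
    return None

print(parse("call Bob"))              # {'contact': 'bob'}
print(parse("pay my electric bill"))  # {'account': 'electric'}
```

The matched subsections come back already labeled, which is exactly how defined grammars let a custom program quickly identify the important information needed to act on the request.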
A reality of today's evolving speech technology markets directed at consumers is that perceived natural interaction is beginning to be a requirement. Increasingly, users will expect to be able to just say what makes sense in the application context and have it work. There are huge opportunities for companies to connect with their customers and for apps that provide specialized services. Over time, the NLI part of the task will become easier for nonexperts because of tools and services.
High levels of NLI expertise aren't a requirement to participate in this trend. But companies must at least experiment to understand its opportunities and limitations, or they'll find themselves playing catch-up with competitors that do.
William Meisel is the president of TMA Associates, editor of Speech Strategy News, and executive director of AVIOS. His recent book, The Software Society, examines the tightening human-computer connection through language.