Speech Technology Magazine


I'm Sorry, Dave, I'm Afraid I Can't Do That

Natural Language 2009: A Speech Odyssey
By Adam Boretz - Posted Apr 2, 2009

When looking for the ultimate in natural language (NL) understanding, the speech community need look no further than the HAL 9000 supercomputer or the bridge of the U.S.S. Enterprise.

Both HAL, the sentient computer entity with the soft voice and conversational manner from the 1968 film 2001: A Space Odyssey, and Lt. Cmdr. Data, the sentient android second officer and chief operations officer from Star Trek: The Next Generation, have been described as the holy grails of NL technology. Simply put, their ability to understand, interpret, and communicate with humans remains unparalleled.

With Data seated near him on the bridge, Capt. Jean-Luc Picard is never forced to interact with annoying menu prompts—Press 1 or say photon torpedoes, press 2 or say phasers. Never does Picard have to worry about whether his commands will be correctly recognized and categorized by a call router inside Data’s android brain. If Picard needs to fire at a Romulan warship or launch a counterattack against the Borg, he can say shoot, fire, destroy, eviscerate, annihilate, obliterate, kill, or any other term of his choosing. Regardless of his specific words, Data will understand and carry out the order.

But unfortunately for us living in the here and now, Data is impossible to build and program given the technological realities and limitations facing today’s designers and developers. And while HAL might have represented Stanley Kubrick’s vision for the 21st century, we’re no closer to HAL-type technology either.

But to really assess NL’s current limitations, understand its current sophistication, comprehend the controversy surrounding it, and properly forecast its future, it is vital to examine the past. According to Roberto Pieraccini, chief technology officer at SpeechCycle, the NL saga is a circular one that really began two decades ago.

Back then, Pieraccini says, most speech technology research focused on NL, and few people were considering menu-based systems or directed dialogue. But while NL was preferred and more interesting, it was, in the mid-1990s, deemed “not ready for prime time.” At that point, the industry made what Pieraccini calls “a very smart, very good decision” in shifting its focus to menu-based speech recognition. Were it not for that decision, Pieraccini says, the very basic problems of NL would not have been solved, and the technology could not have seen its first adoption around 2000.

Today that same technology is used by different applications in different forms, and, as Pieraccini notes, “nearly all [toll-free] numbers today have what is called a call router based on natural language technology.” But it is from those varying forms—all of which are labeled as a form of NL technology, processing, or understanding—that much of the controversy and confusion emanate.

According to Pieraccini, several methods exist to create NL technology today. Speech recognizers can transcribe speech almost perfectly: people can speak freely and, within a small margin of error, every word is correctly transcribed. But—unlike Dr. David Bowman’s interactions with HAL—the system only understands the words, not the meaning. Understanding the meaning is where NL technology comes into play. And this, Pieraccini says, is “a very difficult thing.”

The most common and accepted methods for creating NL technology are:

1. Rules-based grammars: the traditional method in which the speaker is constrained by the sentence structure expressed as grammar rules. If a user says something that is not predicted by the grammar, chances are the machine will not understand. 

2. Statistical language models (SLMs): a more sophisticated method in which language statistics, along with a statistical classifier, are used to place recognized utterances into one of several meaning categories. Sentences are not constrained by rigid structures. The trick here is to collect a lot of examples of word-by-word transcriptions of real utterances, tag them with the correct meaning category, and then use statistical machine-learning algorithms to model the language and learn to categorize what users say into one of the precompiled categories.

3. Robust parsing or concept spotting: the most sophisticated method; a way to model, often statistically, the individual concepts of meaningful phrases in the language independently from the whole sentence, allowing the technology to pick out key concepts and respond accordingly.
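The difference between the first and third approaches can be sketched in a few lines of Python. The grammar, category names, and keyword lists below are invented for illustration, not drawn from any vendor's system:

```python
import re

# 1. Rules-based grammar: the utterance must match a rigid pattern exactly.
GRAMMAR = re.compile(r"^(check|pay) my (balance|bill)$")

def rule_based(utterance):
    m = GRAMMAR.match(utterance.lower())
    return (m.group(1), m.group(2)) if m else None  # None = out of grammar

# 3. Concept spotting: look for key concepts anywhere in the sentence,
# independently of the sentence's overall structure.
CONCEPTS = {
    "billing": {"bill", "balance", "payment", "charge"},
    "repair":  {"broken", "outage", "working"},
}

def concept_spotting(utterance):
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {cat: len(words & keywords) for cat, keywords in CONCEPTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(rule_based("check my balance"))    # ('check', 'balance')
print(rule_based("what do I owe you?"))  # None: out of grammar
print(concept_spotting("um, I think there's a charge on my bill"))  # billing
```

The same out-of-grammar sentence that the rigid rule rejects still yields a usable category under concept spotting, which is exactly the trade-off the list above describes.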

“The industry is rediscovering these things, but these things were available in the early ’90s,” Pieraccini says. “We are rediscovering [them] today because now it makes sense for commercial applications.”

And while most companies, vendors, consultants, and users see the importance and benefits of making use of NL—particularly in interactive voice response (IVR) systems—there is little agreement about what exactly NL is and how to best define and use it.  

“Vendors have used natural language to mean everything under the sun, and it has kind of lost its meaning now,” says Jim Larson, an independent consultant, VoiceXML trainer, and co-chair of the World Wide Web Consortium’s Voice Browser Working Group. “I just wish that we would stop using this word. It doesn’t help anybody. It’s a slippery term.”

According to Larson, customers who buy speech recognition technology are often confused about NL. Most people, he says, assume it amounts to something like HAL 9000.

“When a person hears the term natural language, that’s what he thinks,” Larson says, noting that many vendors market their products as NL, leading customers to think they can say anything to the system and be understood. “From a vendor’s point of view, it’s a good sales spiel to label something as natural language.”

In Search of Standards

Part of Larson’s issue with the ambiguity surrounding NL technology is the lack of industry standards—particularly for statistical grammars. “Several years ago the Voice Browser [Working] Group of the World Wide Web Consortium attempted to standardize a format for statistical grammars,” he says. “The Voice Browser [Working] Group abandoned that effort because its members were not able to agree upon a format that was acceptable to all of the vendors at the time.”

The Voice Browser Working Group has no plans to work on a standard format for statistical grammars. “Right now the Voice Browser Working Group is just ignoring the whole issue of natural language, primarily because it’s such an ill-defined thing,” Larson continues. “And I don’t think any of the members could agree upon what natural language means, let alone standardize anything.”

To Larson—who could be called an NL purist—true NL is the “holy grail” of research; systems like HAL 9000 or Data are being worked on in university labs and may never come to fruition. However—for better or worse—most people in the industry hold slightly more malleable views.

“We’ve found in the industry that everybody uses the term natural language to refer to so many different things that it actually really doesn’t mean anything anymore sometimes,” says Jeff Foley, senior manager of solutions marketing at Nuance Communications. He offers a more fluid definition of NL.

“It’s anything that lets you better recognize what a caller is saying when they’re interacting with the system,” he says. “It doesn’t necessarily mean going away from directed dialogue. There are ways to apply natural language’s magical abilities to directed dialogue as well. But what it’s basically designed to do is catch anything that a customer might say that might otherwise be out of vocabulary, out of grammar, that the system isn’t expecting, and pull it into the constraints the system understands so that the conversation can continue.”

And while Foley’s broader interpretation of NL might contribute to the confusion about the term, his definition is echoed by many industry insiders.

Deborah Dahl, principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group, offers a similarly flexible summary.

“The way we use NL in speech applications is that it’s technology that lets users express their concerns, interests, and what they need in their own natural words without having any constraints,” she says. “So the system isn’t going to be telling them what they can say; they can say whatever they need. And then the technology has techniques for figuring out what the users intended from what they said.”

Offering a similar perspective is Manish Sharma, director of unified communications architecture and design services at Nortel and marketing and communications chair of the VoiceXML Forum.

“[NL] would mean that the person is able to speak naturally what their request is without paying attention to or being constrained by specific words or phrases that the system expects,” he says, adding there is still an expectation that the conversation be relevant and on-topic. “I don’t expect them to talk about their cat or their shopping spree last night. But I do expect them to talk within the context of what the application is, but not be constrained by specific words and phrases in the grammar that the system expects.”

Daniel Hong, lead analyst at Datamonitor, calls NL understanding “a statistical language and semantic-based approach that interprets the meaning of a string of words rather than using the traditional finite state grammar approach which lists possible utterances. NL understanding enables callers to speak in infinitely varied ways using disfluencies, such as ‘um,’ and get understood.”
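Hong's point about disfluencies suggests a simple preprocessing picture: strip filler words before interpreting the utterance. A minimal sketch follows; the filler list is illustrative, and real systems handle disfluencies with far more sophistication than a word filter:

```python
import re

# Illustrative single-word filler list; a production recognizer would
# model disfluencies rather than keep a hand-written list.
FILLERS = {"um", "uh", "er", "ah"}

def strip_disfluencies(utterance):
    """Drop filler words so the NL interpreter sees only content words."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return " ".join(w for w in words if w not in FILLERS)

print(strip_disfluencies("Um, I want to, uh, pay my bill"))
# i want to pay my bill
```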

Pieraccini has his own definition. “[NL] is a technology that to varying degrees, to different degrees, tries to capture some information from unconstrained or mildly constrained utterances. And the degree of the type of information it can capture depends on how sophisticated this technology is,” he says. “So today and to that respect, I agree with the statement…there is no true natural language technology in the sense that there is not a technology today that understands free-form language.”

Not a HAL 9000

With so many definitions, conflicting viewpoints, and different methods of creating NL technology, it’s easy to understand why the concept engenders some degree of confusion. But despite the controversy and misunderstandings, NL technology is being designed, developed, and implemented with greater frequency. And while the industry may not be able to offer enterprises a Data or HAL 9000, companies can do plenty with NL technology.

Leading the way in the development of cutting-edge NL technology are Pieraccini and his team at SpeechCycle. And according to Pieraccini, much of that success is due to one key ingredient: real speech scientists with extensive backgrounds in statistical machine learning.

“They know how to handle data, and they know how to handle a lot of data,” Pieraccini says. “One of my mantras is there is no data like more data. I like to say that all the time.  The more data you have, the better the performance of the system. And of course you have to know how to use data.”

Pieraccini is referring to his high-resolution SLM, for which he and his team collect, transcribe, and annotate millions of user utterances. He describes high-resolution SLM as similar to regular SLM, but with patented changes that allow him to “model the different degrees of specificity that people have when they speak,” he says.

Typically with SLM, incoming sentences are routed to a number of categories. But this technology, Pieraccini says, doesn’t allow for the acknowledgment of different degrees of specificity in user language. With high-resolution SLM, the system responds to the level of specificity of each user and, depending on that level, might ask a disambiguating question. 
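SpeechCycle's high-resolution SLM is patented and its details are not public, but the idea of responding to a caller's level of specificity can be pictured with a toy category hierarchy. Every name and prompt below is invented:

```python
# Toy category hierarchy: a vague category has specific children,
# a specific category routes directly. Invented for illustration only.
CATEGORY_TREE = {
    "internet": ["internet.slow", "internet.no_connection"],
    "video":    ["video.pixelation", "video.no_signal"],
}

def respond(category):
    children = CATEGORY_TREE.get(category)
    if children:
        # Vague utterance ("my internet is acting up"): disambiguate.
        options = " or ".join(c.split(".")[1] for c in children)
        return f"Is the problem {options}?"
    # Specific utterance ("my internet is slow"): route directly.
    return f"Routing to handler for {category}"

print(respond("internet"))       # asks a disambiguating question
print(respond("internet.slow"))  # routes without further questions
```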

Working with high-resolution SLM, Pieraccini has built call routers with close to 300 categories—typical call routers have only a few dozen. And because his system has so many categories built on so much data, he can achieve accuracy that approaches 90 percent—a very high number for this type of application.

Nuance is also working with NL technology to address what Foley considers one of the technology’s biggest problems: out-of-grammar statements.

“People say things, they offer information, they talk to someone behind them, they forget that they’re talking to a computer that’s expecting something from a limited list of responses,” he says. “And in some cases, the out-of-grammar rates were outnumbering the actual misrecognition errors by factors of 5 to 1 in some of the worst cases. We’ll see some applications will have a 20 percent to 30 percent out-of-grammar rate, which is just deadly because you talk about what’s the accuracy of this system—and maybe it’s 90 percent. But that’s the accuracy of the in-vocabulary responses.”
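Foley's numbers show why an in-grammar accuracy figure overstates real performance. Taking a 25 percent out-of-grammar rate (the middle of his 20 to 30 percent range) and assuming, pessimistically, that every out-of-grammar utterance fails:

```python
# In-grammar accuracy of 90% with a 25% out-of-grammar rate, assuming
# (pessimistically) that every out-of-grammar utterance is mishandled:
in_grammar_accuracy = 0.90
oog_rate = 0.25

overall = (1 - oog_rate) * in_grammar_accuracy + oog_rate * 0.0
print(f"{overall:.1%}")  # 67.5%
```

The caller experiences 67.5 percent, not the quoted 90 percent, which is why Foley calls high out-of-grammar rates "deadly."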

One way Nuance is addressing out-of-grammar statements is with SmartListener, a solution that can be applied to a directed dialogue to give callers choices, yet can also accept responses—out-of-grammar statements—that aren’t among the choices offered.

“Rather than have to go in and manually put in all those things, [we] use a combination of that and some robust parsing capabilities to come up with a better way of understanding what customers say when they don’t say something that’s on the list,” Foley says. “With SmartListener you can learn to recognize that [a] person responded about payments and ignore the stuff at the beginning and the end. Some people want to call it word spotting. It’s a little more than that, but that may be the best way to describe it easily: picking out what the person’s intent was.”
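Foley's "a little more than word spotting" can be sketched as spotting an intent anywhere in the utterance while also pulling out a slot value. The patterns below are invented for illustration and say nothing about SmartListener's actual internals:

```python
import re

# Spot the payments intent anywhere in the utterance, ignoring chatter
# at the beginning and end, and also extract a dollar amount if present.
INTENT = re.compile(r"\b(pay|payment|paid)\b")
AMOUNT = re.compile(r"\$?(\d+)\s*dollars?|\$(\d+)\b")

def robust_parse(utterance):
    text = utterance.lower()
    if not INTENT.search(text):       # search, not match: position-free
        return None
    m = AMOUNT.search(text)
    amount = int(next(g for g in m.groups() if g)) if m else None
    return {"intent": "payments", "amount": amount}

print(robust_parse("uh, hang on honey -- I want to pay 50 dollars, sorry"))
# {'intent': 'payments', 'amount': 50}
```

The surrounding chatter no longer derails recognition, and the slot value goes beyond what plain word spotting would return.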

Nuance has seen amazing results with SmartListener, Foley says. “In the lab and with the few folks that we’ve deployed to…they’re seeing uplifts in recognition accuracy that are larger than we see when we introduce a new version of the software in raw accuracy—in some cases as much as a 20 [percent] to 40 percent uplift in reduction in error rate.”

Other NL offerings from Nuance include the SpeakFreely and One-Step Correction solutions. Rather than offering callers a list of choices or using a traditional grammar, SpeakFreely provides a flat menu structure that lets an IVR ask How may I help you? and lets callers give a free response.

To that end, Foley says the system—based on thousands of captured and tagged utterances—can map callers to the right destination and “use some of our statistical modeling so that if a new caller says something that wasn’t on [the] list of trained responses but is close enough, then the computer can still figure it out and route them to the right place.”
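One rough way to picture "close enough" routing is word-overlap similarity against trained examples; this is a crude stand-in for the statistical models Foley describes, and the destinations and example utterances are invented:

```python
# Invented trained examples per destination; a real system would use
# thousands of tagged utterances and a statistical classifier.
TRAINED = {
    "billing":   ["i have a question about my bill", "pay my bill"],
    "technical": ["my internet is not working", "the service is down"],
}

def route(utterance):
    words = set(utterance.lower().split())
    def best_overlap(examples):
        # Jaccard similarity to the closest trained example.
        return max(len(words & set(e.split())) / len(words | set(e.split()))
                   for e in examples)
    return max(TRAINED, key=lambda dest: best_overlap(TRAINED[dest]))

# Not on the trained list, but close enough to a billing example:
print(route("i need to pay this bill"))  # billing
```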

With One-Step Correction, a caller who makes a mistake and tries to correct herself won’t immediately be rejected by the system.

“Being able to catch that the person is responding to a previous part of the conversation and go back and make that change without simply rejecting it…that sort of thing is really helpful in keeping the dialogue going and eliminating retries and confirmations,” Foley says.

Despite varying definitions of NL and the many applications and solutions bearing its name in the market today, the advantages and disadvantages of the technology are clear and—to a large degree—agreed on.

Let’s Be Realistic

On the plus side, NL interactions feel natural, callers prefer them to navigating annoying menus, they boost customer satisfaction, and, under the right circumstances, they can work better and save companies money. On the other hand, the technology is expensive, complex, and notoriously difficult and time-consuming to design and deploy because of the thousands of utterances that must be collected and tagged. It is also difficult to update, and the term itself can be misleading.

So when and under what circumstances should an enterprise go with NL, directed dialogue, or a combination of the two?

Nortel’s Sharma says NL is advisable when users have a broad distribution of requests, whereas directed dialogue is best when there are limited choices. He advises customers to look at their distribution of call requests.  

“I don’t think that we should just blindly apply [NL] to all verticals, all solutions,” he says. “We have to look at the particular distribution, at a particular menu level, to say what are we trying to capture. And if it’s a very lopsided distribution…then I would say directed dialogue still is a good approach for it. If you come across a user request distribution where the user requests are distributed over lots of different reasons, then natural language would be a good funnel.”
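Sharma's lopsided-versus-distributed test can be made concrete by measuring the entropy of the call-reason distribution. The counts and the 0.8 threshold below are invented heuristics for illustration, not industry figures:

```python
import math

def normalized_entropy(counts):
    """0.0 = all calls for one reason (lopsided); 1.0 = evenly spread."""
    total = sum(counts)
    probs = [c / total for c in counts if c]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(counts)) if len(counts) > 1 else 0.0

# Invented call-reason counts for a five-option menu level:
lopsided = [900, 40, 30, 20, 10]    # one dominant reason
broad    = [120, 110, 95, 105, 90]  # requests spread across reasons

for dist in (lopsided, broad):
    h = normalized_entropy(dist)
    choice = "natural language" if h > 0.8 else "directed dialogue"
    print(f"{h:.2f} -> {choice}")
```

The lopsided distribution scores low and keeps directed dialogue; the broad one scores near 1.0, where natural language becomes, in Sharma's words, the better funnel.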

According to Pieraccini, the decision should be based on how the caller interacts with the system. “You need to use natural language only when it is needed,” he says. “If there is a kind of knowledge model between the system and the caller, there is no need to talk a lot.” 

As an example, Pieraccini cites ordering a pizza: The choices and options are simple, the caller will likely understand what he wants and how to get it, and so there is no need for a complex, natural system. 

But, “when you go to different applications [where] the model is not so clear, then it’s very hard to construct a menu that allows people to simply use directed dialogue to express what they want,” he continues, “either because the menu’s too long or because it doesn’t make much sense to the caller.”

In such cases, he says “the best thing is to let the caller explain in natural language what the problem is and, in the background, map what they say to one or more different problems because the model of the problem is not available to the user.”  

However, Pieraccini warns this is not as easy as it sounds. “Building effective natural language systems requires expertise in handling lots of data, which can be very tricky,” he says. “If I said there is no data like more data, I would add the second corollary to this: There is no data like good data.” To have good data, he says, one needs expertise.

But despite the difficulties and differing definitions, most people in the industry agree that the future will bring more and more NL.  

“In the next two to three years we will be able to use more and more data. And it will be used to…get rid of rules-based grammars even if we use a directed dialogue prompt,” Pieraccini says.

Foley agrees, noting that as costs go down, NL adoption will increase. “We’re finally to the point where people realize that speech works,” he says. “It’s hard to say that it doesn’t work because there are so many systems out there that use it to some degree or another and get better automation rates than touch-tone alone.” 

Foley does, however, note that it’s “still in that ‘is it going to work for me’ area, and both the general populace and people making decisions in contact centers may not be aware of how good the technology is. And so they may be reluctant to try it.”

Not surprisingly, Larson is less optimistic—both about NL’s future and the possibility of actualizing it in its ultimate form. “There’s no free lunch,” he says. “You get what you pay for. And if it sounds inexpensive, you probably don’t get much with natural language.”

“I don’t think we’re ever going to reach that holy grail, where we can chat with a computer like HAL or Data, unless we can narrow it down to a very specific domain,” he adds. “At least not in the next 10 years or so.”
