I inwardly wince every time a client announces that he wants me to design a natural language voice user interface. What follows is often an awkward series of questions that is intended to find out just what the client means by natural language. The answers clients provide can represent a range of possibilities that span, on a scale of complexity, from a basic verbal command and control system all the way up to an unbounded conversational dialog with a machine possessing the verbal skills of William F. Buckley, Jr.
Given the dramatic range of expectations, the subsequent discussion of the existing abilities of speech technologies can take the form of good news or bad news. Those who think command and control-type interactions are sufficient to qualify as natural language are going to get the good news because current speech technologies are capable of much, much more. But it's mostly bad news for those who are longing to match wits with the Buckley Machine.
What Gave You That Idea?
Where do folks get their ideas about natural language and natural language systems? There are many possible answers to that question, but most people seem to bring some preconceived notions about computers and language to the discussion of speech technologies. These notions usually reveal the influences of television and motion picture portrayals of boundlessly knowledgeable and conversationally competent computers. The source of these expectations, and the circumstances under which they are established, are important. They afford a valuable insight into the client's psyche and expectations.
Suspend Your Disbelief
A person visits a movie theater to watch a science fiction film. Obviously, he knows perfectly well that what he is seeing is not real. In fact, all he is really seeing is some colored lights projected across the theater onto a screen. In order to respond to the film as if it were real, he must temporarily suspend his disbelief about what he is seeing. In an entertainment-oriented culture, in an age of almost constant video stimulation, this is something that most of us do many, many times throughout the day.
Thus we are used to suspending our disbelief about what we see and hear around us. Put another way, we have become perfectly comfortable with pretending that something that we see or hear is real.
A Recipe for False Expectations
- Take the average speech technology shopper
- Add a personal history of pretending that audio and video experiences are real
- Toss in a dash of Miracle Demo*
Note that only a dash of Miracle Demo is required. Given the power of the second ingredient, even the tiniest amount of Miracle Demo is sufficient to create a false expectation.
You've Got to Admit, It's Getting Better
I would like to believe that things are getting better within the speech industry. Just a few years ago, there seemed to be a lot more talk about conversational dialog systems and advanced dialog management strategies. For a while, there seemed to be a resurgence of interest in the knowledge-based methods of the 1980s to support conversational dialog systems and natural language processing. Two years ago, several speakers at a SpeechTEK presentation predicted that fundamental breakthroughs in our ability to do natural language processing were just around the corner. These breakthroughs would follow, it was claimed, the incorporation of deep linguistic and other knowledge sources to solve the NLP problem, once and for all.
During the question-and-answer follow-up discussion, I posed this question:
The use of knowledge- and rule-based systems was energetically explored during the mid-to-late 1980s. This resulted in a number of impressive text-based NLP systems that were almost exclusively created for highly specific, restricted domains. None of these systems enjoyed any substantial success because their users, over time, would discover the brittleness of their knowledge, come to distrust the systems and eventually decline to use them. What breakthroughs in our knowledge of human-to-human conversational communication have occurred since the 1980s that will enable the great step forward that you predict?
After an initial awkward silence, the panelists responded with some vague generalities but eventually, if indirectly, gave the answer: "None."
Can't We All Just Get Along (without talking about NLP)?
For me, the goal of true NLP implies nothing less than the conversational abilities of the Buckley Machine. Such a machine would be a miraculous achievement that would have a huge influence on subsequent world history. That we are really nowhere near such a goal should be obvious to people who know anything about the field.
I would like to ask an open question to the speech community: Why do we persist in talking about NLP and unrestricted conversational dialog systems? Such talk only encourages unrealistic expectations of speech technologies among those outside of the field. And such unrealistic expectations have the potential to greatly harm our nascent industry.
Automatic speech recognition and text-to-speech technologies themselves do not have anything to do with natural language per se. ASR and TTS can wonderfully provide for verbal inputs and outputs, but neither affords any particular ability to simulate whatever happens within the human skin (presumably the stuff of natural language processing) between the social acts of hearing and speaking.
Stop the Insanity
The fact is, our species has relatively little collective experience using any kind of language to talk to machines. Simply put, there is nothing natural about talking to machines. It is therefore a mistake for us in the speech industry to pretend that ASR is somehow a significant step forward in NLP or that what we can currently (and quite effectively) do with ASR and TTS in any way constitutes conversational natural language.
*Miracle Demo: Typically a multimodal presentation that includes an idealized, error-free demonstration of the capabilities of speech technologies. Users and the machines with which they interact in such demos are often depicted saying and responding to highly variable and considerably complicated utterances.
Dr. Walter Rolandi is the founder and owner of The Voice User Interface Company in Columbia, S.C. Dr. Rolandi provides consultative services in the design, development and evaluation of telephony-based voice user interfaces (VUI) and evaluates ASR, TTS and conversational dialog technologies. He can be reached at firstname.lastname@example.org.