Beyond Recognition, to Understanding

You could make a pretty good case that the birthplace of the speech recognition industry is the Massachusetts Institute of Technology (MIT) Laboratory for Computer Science. Several leading speech recognition companies owe their core technology to developments at MIT, and it remains an “idea factory” for the speech recognition industry.

Dr. Victor Zue, an associate director at the MIT Lab, is widely recognized as one of the pioneers of speech research. He heads the Spoken Language Systems Group, a community of researchers and students at MIT devoted to developing interactive conversational systems.

These systems interact with users in natural, spoken language to solve problems such as travel planning and geographic navigation. As such, they are on the cutting edge of the next generation of speech technology: natural language applications.

A number of projects at MIT address issues related to speech recognition and language understanding, including feature extraction, lexical access, new word detection, search strategies, dialogue management, and language modeling. From that research, the MIT lab formulates and tests models for human/computer interaction.

For example, MIT has recently developed Jupiter, a conversational system that gives up-to-date weather information over the phone for 500 cities throughout the world.

Dr. Zue was the keynote speaker at the ASAT ’98 (Advanced Speech Applications and Technologies) Conference and Exposition in San Jose, where he demonstrated Jupiter and gave his views on the future of speech technology. That presentation is the basis of this article.

Though generally enthusiastic about speech recognition and natural language understanding, Dr. Zue also acknowledges that solving speech-related problems is a daunting task.

He joked that the research effort required to create conversations between people and machines “will keep me and my family fed for years to come.”

“High performance, speaker-independent speech recognition is now possible,” said Dr. Zue. “Commercial recognition systems are now available. But system error rates are still more than 10 times higher than those of humans, even for simple tasks.

“Instead of recognizing speech, we should strive towards understanding speech, even for transcription and document preparation. Speech understanding is also an important ingredient for conversational interfaces,” he said.

Conversational Interfaces

The term conversational interface refers, not surprisingly, to an interface that allows people to converse with machines in much the same way humans communicate with one another. It augments speech recognition technology with natural language technology to allow the machine to “understand” the verbal input. It becomes possible for a machine to engage in a dialogue with the user and to speak a desired response in natural language. At least that is how it works in theory.

However, Dr. Zue reports that conversational systems are emerging that can do the following:

Deal with continuous speech by unknown users.

Understand the meaning of the utterances (in narrow domains), take appropriate actions, and respond.

Operate in real domains and over the telephone.

Handle multiple languages.

Deliver those capabilities in real time, using standard workstations or PCs with no additional hardware.

Systems with conversational interfaces differ in the degree to which the human or the computer takes the initiative. The research at MIT focuses on “mixed initiative” systems, in which both the human and the computer play an active role in accomplishing the goal through conversation. An important part of pursuing this goal is dialogue modeling.

Dialogue Modeling

“Effective spoken language interfaces must include extensive and complex dialogue modeling,” said Dr. Zue. “Displayless systems depend heavily on text planning and generation to communicate. System-initiated dialogue exchanges have greater predictability. But too much control is annoying to users.”

In dialogue modeling, programmers prepare the system’s half of the conversation, including verbal, tabular, and pictorial responses, as well as any clarification requests.

Dialogue modeling can resolve ambiguities, and inform and guide the user. An effective dialogue model may suggest sub-goals to a caller. (For example, if a caller has asked for flight information, the system could respond by asking, “What time do you want to leave?”) Effective dialogue models can also provide plausible alternatives when the requested information is not available.
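To make the idea concrete, here is a minimal sketch in Python of how a dialogue model might suggest a sub-goal or offer a plausible alternative. It is an illustration only, not a description of how any MIT system is built: the slot names, prompts, and the functions `to_minutes` and `next_system_turn` are all hypothetical, and departure times are assumed to be simple "HH:MM" strings.

```python
from dataclasses import dataclass, field

# Hypothetical pieces of information a flight request needs (an assumption,
# not taken from any MIT system).
REQUIRED_SLOTS = ("origin", "destination", "departure_time")

# The system's half of the conversation for each missing piece.
PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where do you want to go?",
    "departure_time": "What time do you want to leave?",
}


@dataclass
class DialogueState:
    """What the caller has told the system so far, in any order."""
    slots: dict = field(default_factory=dict)


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def next_system_turn(state, available_times):
    """Choose the system's next utterance from the current dialogue state."""
    # Suggest a sub-goal: ask only for the first piece still missing.
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return PROMPTS[slot]

    requested = state.slots["departure_time"]
    route = f"from {state.slots['origin']} to {state.slots['destination']}"
    if requested in available_times:
        return f"There is a flight {route} at {requested}."

    # Offer a plausible alternative when the exact request cannot be met.
    if available_times:
        target = to_minutes(requested)
        nearest = min(available_times, key=lambda t: abs(to_minutes(t) - target))
        return f"There is no flight at {requested}. The closest departure is at {nearest}."
    return f"I could not find any flights {route}."
```

A caller who opens with "I need to get from Boston to Denver" fills two slots at once, so the system skips straight to asking for a departure time. That is the mixed-initiative behavior in miniature: the caller volunteers whatever is known, and the system steers only where information is missing.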

Much of what the Spoken Language Systems Group at MIT has learned about dialogue modeling is put to use in the MIT Jupiter System, a conversational interface for weather information.

The group also is working on the MIT Pegasus System, which is a conversational interface for flight status information.

Future Challenges

“Speech-based interfaces are inevitable,” Dr. Zue believes. He expects the trend to be driven by the miniaturization of computers, increased connectivity, and the ever-present human desire to communicate. “To be truly useful, these interfaces must be conversational in nature. They must embody linguistic competence and help people solve real problems.” He points out that “systems with limited capabilities are beginning to emerge and will soon be commercially viable.”

Weather from Jupiter

Jupiter is a conversational system that provides up-to-date weather information over the phone. It knows about 500 cities worldwide (of which about 350 are in the US) and gets its weather information from four different Web-based sources.

Jupiter is based on the client-server architecture of Galaxy, which forms the core of all the MIT Spoken Language Systems Group’s conversational systems.

Users can access the information by calling on a telephone (1-888-573-8255) and speaking naturally. (Users outside the USA can call 1-617-258-0300.)

Both English and Spanish versions of the system exist. Approximately 60,000 sentences have been collected from 10,000 calls to create the system.

“Jupiter is the closest thing we have to a real application domain,” said Dr. Victor Zue, director of the Spoken Language Systems Group at MIT. “We are dealing with both interface and content. We have telephone-based interaction and are keeping the system up for general use.”

Many other potential applications exist, he points out. These include flight status, stock quotes, traffic conditions, and sports information.

Some of these are under active development at MIT.

For more information, go to http://www.sis.lcs.mit.edu/jupiter.
