What's New in Government-sponsored Speech Recognition Research
Automatic Speech Recognition (ASR) has been applied to many commercial tasks, including dictation, command and control, and a range of telephone-based services. Current ASR technology is the result of 50 years of research, both industrial and academic, and not all of it was initially aimed at speech processing. For instance, the Hidden Markov Model statistical approach was originally proposed in government-sponsored cryptography research in the 1960s. Programs to explicitly support speech recognition were later established by the Defense Advanced Research Projects Agency (DARPA), the same organization that funded the original development of the Internet. In the late 1990s, however, DARPA funding for speech research declined. A few years ago it was common to hear that the speech recognition problem was essentially "solved", a view that may have been encouraged by exaggerated claims suggesting that a "Star Trek" level of performance was right around the corner.
While current recognition engines are adequate for many tasks, the range of plausible applications could be much broader if the basic technology were significantly improved. In particular, recognition systems continue to perform poorly with microphones that are not placed close to the talker's mouth. For instance, recognition in the car without a headset is still typically limited to speaker-trained recognition and a small, acoustically distinct vocabulary. Even for speech with little noise or room reverberation, recognition error rates are an order of magnitude worse for conversations than for these simpler tasks. Recognition of conversational speech could be useful for many applications, such as data mining or tracking the decision process of a meeting. Word error rates for conversational telephone speech recognition still hover in the 25-30% region, as measured in recent tests conducted by the National Institute of Standards and Technology (NIST). Performance is significantly worse for recognition of conversational speech during multiparty meetings recorded by microphones a few feet away.
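The word error rate cited above is conventionally computed by aligning the recognizer's output against a reference transcript and counting substitutions, insertions, and deletions. As a rough illustration (not the scoring code used in the NIST evaluations, which normalizes transcripts in additional ways), the metric can be sketched with a word-level edit distance:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference transcript."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub,            # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(r)][len(h)] / len(r)
```

Note that because insertions are counted, the rate can exceed 100% on badly misrecognized utterances.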
Given these limitations, DARPA has initiated two new programs to promote significant new reductions in error rate for the remaining problems in speech recognition. One of these, which has just begun this year, is called Effective Affordable Reusable Speech-to-text (EARS). An extremely ambitious goal of this program is to reduce the word error rate on conversational speech by a factor of five within five years. Research will also be directed at making the output significantly more readable than previous systems have achieved, incorporating automatic punctuation, capitalization, excision of disfluencies, and speaker tracking. The research will be conducted for several languages. Another important component of the program funds approaches that radically depart from the underlying algorithms of current systems. Successes from this component will be ported to the mainstream effort at a later point.
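To put the factor-of-five goal in concrete terms, the arithmetic is simple (the 28% baseline below is just an illustrative midpoint of the 25-30% range cited above, not an official program figure):

```python
baseline_wer = 0.28           # illustrative midpoint of the 25-30% NIST range
target_wer = baseline_wer / 5 # EARS goal: a fivefold reduction
print(f"{target_wer:.1%}")    # prints 5.6%
```

In other words, conversational telephone speech would have to be recognized roughly as accurately as today's best systems handle much easier read-speech tasks.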
The second new related DARPA program is called Speech In Noisy Environments (SPINE). This program should start late in 2002 and is expected to focus on command and control tasks under noisy conditions. The mainstream effort is aimed at short-term progress, but the program will also encourage more radical, long-term efforts that may significantly depart from current algorithms. The SPINE program follows earlier work in which a number of research labs tested their systems on data prepared by the Naval Research Laboratory.
DARPA programs have been emphasized here since they are large, focused efforts. However, there are a variety of other programs throughout the US, in particular those sponsored by the National Science Foundation; one example of the latter is the Malach project, which is studying speech from 116,000 hours of video interviews held by the Survivors of the Shoah Visual History Foundation. Additionally, there are significant speech research programs elsewhere in the world, such as projects funded by the European Union. A European project called "m4", for instance, is focused on multimodal meeting recognition tasks. There is a German "SmartKom" program, which investigates multimodal query systems. And the Swiss government has recently funded a network of labs to study interactive multimodal information management (IM2), with a strong speech research component.
If these and other related projects are successful, many more speech recognition applications will become feasible. Still, other advances will remain for the future, such as accent-independence and language-independence. Even within a single language and accent, we are far from being able to determine the meaning of spoken language outside of a restricted domain. Fulfilling such goals will require major gains in the basic science and technology, and the effort must continue throughout the coming century. Industry has the most to gain from the algorithmic advances that will fuel the applications of the future, but unfortunately the trend has been to reduce industrial support for the requisite research. Government support will be critical for new research advances, but it will never have the impact of a joint industry-government program.
Nelson Morgan is the director of the International Computer Science Institute and a professor in EECS at UC Berkeley. He can be reached at firstname.lastname@example.org.