Automatic speech recognition still falls short after 60 years of research, but there is progress.
For many years, popular reports about automatic speech recognition (ASR) have taken one of two common forms: it's solved, or it doesn't work. New approaches sometimes provide surprising gains, only to be overly hyped. After limitations are observed in the technology, public enthusiasm wanes, funding recedes, and the prestige of the discipline declines.
What is the reality? Speech recognition works well for some tasks, and often fails under conditions that pose little difficulty for human beings (at least when we pay attention). In fact, the machine processing of spoken language is likely to require many more years of study to reach the level of our expectations, a statement unlikely to find favor with any typical funding source. And yet, even now there are well-designed applications that can be useful despite an imperfect underlying technology.
Despite occasional press reports implying the contrary, there is currently no silver bullet that has provided a quantum leap forward in performance. On the contrary, progress in speech recognition over the last quarter century has comprised a series of incremental improvements, some of which are only modestly transferable from one task to another. Taken together, these increments have yielded a technology that is greatly improved in practice compared with what we had in the late 1980s. And yet, we still find that an attentive human being can recognize speech with far fewer errors (particularly in the presence of noise and/or reverberation) than any machine algorithm. We still have a long way to go.
There are a number of promising approaches to future gains currently being investigated in the speech research community. Without attempting anything like a comprehensive list, I can suggest a few that have gained some prominence. Certainly the field of machine learning has had many successes in recent years; for instance, some approaches to neural network processing, most recently in what is called "deep learning," look interesting and have shown some success. For ASR on noisy speech, signal processing approaches inspired by simplified models of the auditory system are making a comeback. In a number of labs, including our own, techniques inspired by models of sensory cortex have had some success. Still other approaches based on sparse representations of signals have been attracting a lot of attention.
Research directions like these are generated by the intuition and interests of their adherents. This is a time-honored approach to progress: the best researchers push forward with their approaches, often despite many failures and limited results. They have a mental model of how speech recognition should work, and generate methods that match that model as much as possible. The best of the improvements in speech recognition have begun with inspiration of some sort, often from an analogy with another field (for instance, the migration of hidden Markov model (HMM) technology from cryptography to speech).
But there is another approach to scientific progress: conducting a quantitative and qualitative study of the shortcomings of a current model. This has been the way forward in many areas (e.g., physics or astronomy). And yet such a diagnostic approach has been surprisingly lacking in the field of automatic speech recognition.
Some of this perspective is now being explored in our lab at ICSI. Steven Wegmann is working (along with postdoc Hari Parthasarathi and student Shuo-Yiin Chang) to discover which common ASR acoustic model characteristics are actually hurting us. Other diagnostic techniques could similarly show us how our language or pronunciation models need to be improved, and further work could potentially show us how our initial signal processing fails us under many conditions. Consideration of the community's experience with the many limitations of the current technology can lend breadth to such an exploration; Jordan Cohen is leading such a study at our lab.
More than 40 years ago, John Pierce of Bell Labs wrote a letter to The Journal of the Acoustical Society of America in which he bemoaned what he viewed as the arbitrary manipulation of recognizer parameters to obtain the best performance, characterizing it as the work of a "mad scientist" rather than that of a serious researcher. This was an extreme view; his remark may even have been harmful, since Pierce's influence sharply cut back speech recognition research in the 1970s, at least at Bell Labs. Many current avenues of research are likely to have great merit, and exploring many ideas (most of which won't work) is necessary to find a few good ones. But an important message from his letter still holds: a scientific approach is likely to be the only feasible long-term path to a real solution to ASR.
ASR providers have promised much in the past, and there have been many disappointments. Despite this, the promise of a scientific approach to ASR research, particularly given the existence proof of robust human speech recognition, gives us hope for a true solution. In the meantime, we can still take advantage of application scenarios that are useful despite an imperfect core technology.
Nelson Morgan is head of the Speech Department at the International Computer Science Institute, an independent research center.