Since joining AT&T in 1977, Jay G. Wilpon, Director of Speech Processing Software and Technology Research within AT&T Labs, has focused his research on problems in automatic speech recognition.
His current research interests include several of the key problems whose solution will promote the ubiquitous use of speech recognition technology within the telecommunications industry. Specifically, the focus of this work is on robust speech recognition algorithms, spoken language understanding, and dialogue control, with emphasis on their application to support innovative user interfaces for services such as voice-controlled intelligent personal agents.
He served as the chair of the IEEE Signal Processing Society's (SPS) Speech Processing Technical Committee from 1993 through 1995 and is currently an elected member of the IEEE SPS Board of Governors. In 1987, he received the IEEE Acoustics, Speech, and Signal Processing Society's Paper Award for his work on clustering algorithms used in training speech recognition systems.
In 1998, Mr. Wilpon was made an IEEE Fellow for his leadership in the development of automatic speech recognition algorithms.
He recently took the time to talk with Speech Technology magazine about his vision for the future of the technology.
Is speech recognition now at a point where it can be called a mainstream product?
Let's see if we can separate this into three areas. There is the research promise, the marketing hype, and the reality of the consumer experience.
In terms of marketing, it certainly has arrived. When I can pick up my local paper and see a whole page devoted to speech recognition, it would seem to be a mainstream product from the marketing side. Speech recognition is a phrase even CEOs use now, without having to explain what it means.
That implies, at least from the research side, that we are at a point where people are starting to see the true vision of what speech recognition technologies can bring to society. There are a lot of successful companies out there that are proving it. Companies that build speech recognition products for telecommunications applications and desktop dictation have been born, merged, and grown.
You might ask what has allowed this industry to get to the point where it is now. The industry's existence is the result of many people capitalizing on several decades of basic research by major industrial companies like AT&T, IBM, and TI, government funding for companies such as BBN and SRI, and university research at places like Carnegie Mellon, MIT, Cambridge University, and others.
It has been a concerted effort over many years by a lot of folks who have worried about speech processing. Now, many of these same organizations (or individuals from these organizations) are trying to capitalize on their own research. The technology is ripe for many low-hanging-fruit applications.
In 1992, AT&T used speech recognition with a vocabulary of only five words to automate portions of operator-assisted calls. This service generates over a billion speech recognition attempts per year and saves AT&T an estimated $200 million per year. This is an example of a low-hanging-fruit application.
Researchers tend to think their technology can do everything. But that is not the case for the consumer reality. And I think some companies are trying to force applications that people don't need or want. It seems as if people are trying to force speech onto Windows. I personally don't think it is that great an idea. Windows was created as a point-and-click user interface using a mouse, not voice. A number of years ago, Apple began to look at generating a new multi-modal desktop operating system, one that considers speech, keyboards, mice, and touch as key input devices that each have a particular purpose.
I am not sure where that project stands, but at some point in time, someone will develop such a user experience and it will drastically change the way we think about and use computers and telecommunications.
Where do you see the technology in 3 to 5 years?
I think speech recognition is currently going through a fairly normal technology cycle - basic research, applied research, productization, use in applications, and back to research for improvements. ASR-based products and services are starting to come out; some good, some not so good.
Over the next several years, the market will figure out what works and what doesn't, just like any technology-based industry. Everyone is learning from these early apps. I would expect a new round of emphasis on basic research to address the key technology needs where current algorithms are clearly deficient - speech recognition over wireless and IP networks, speech understanding, and human-machine dialog control. The technology will continue to get better.
In order for speech to become ubiquitous, speech recognition has to get to the point where people forget they are using technology. When this happens we will truly see speech technology measuring up to the marketing hype. We have been experimenting with a service we call simply "How May I Help You?" where a customer can say anything he or she wants and the computer carries on an almost human-like conversation to address the user's needs. At some points in the dialog, the user is asked to hold for an operator.
During our last trial, we heard many instances of "Gee, I thought I was talking to an operator." This is what I am looking for and, if the stars are aligned, I expect to see this in our network in the next three to five years.
The speech industry still needs a few more killer apps that will take the computer-telephony industry by storm to get ASR beyond the market hype phase. AT&T's automation of operator services is one killer app. Voice-controlled dialing and access to unified messaging is clearly another.
Using speech recognition for IVR-like applications is rapidly gaining acceptance, but we need more applications. We're beginning to see speech recognition being deployed with what I'll call advanced IVR-based apps, like getting stock quotations and news.
In the next few years, I hope to see a strong focus on user interface research and development to determine what the customer really wants out of the speech industry. You could look at picture phones as an example of a technology that was rejected for social reasons that had little to do with the technology.
We must understand where speech is the appropriate and preferred user interface modality.
Can speech and voice dialing completely replace the telephone keypad?
"Replace" is too strong a word for me. Speech recognition will augment the keypad. Speech is just one mode of input. It will always play an essential role in new services, but it is not always the panacea. Speech has to be part of a well-thought-out multi-modal interface design strategy. This is where AT&T will continue to take its place as a leader in the field.
When you go back 100 years, telephones had no buttons on them. You just picked up a phone and were automatically connected to an operator. Let's call her Mabel. Mabel would connect you with your party, but she might also tell you whether there was a storm coming, or pass along other town gossip. Mabel was the true intelligent agent.
With the introduction of DialTone and touch-tone keypads, you are able to reach your party much more easily, but there are fewer things you can do on the phone. Twelve buttons do not allow for much.
So in a sense, speech recognition and synthesis technologies are allowing us to go back to the future.