Speech Profiles: A Window on Speech at L&H
Lernout & Hauspie has been an international leader in the development of speech technology for numerous commercial applications, and since its September 1997 alliance with Microsoft it has been one of the most closely watched companies in the industry. As part of the alliance, L&H develops applications for current and future versions of Microsoft’s speech application programming interface (SAPI). The alliance was designed to accelerate development of the next generation of speech-enabled computing on Microsoft Windows, as well as voice-enabled computing that goes beyond today’s speech dictation products. The company offers products in several core speech areas, including automatic speech recognition, text-to-speech and digital speech compression. Speech Technology recently spoke with Bob Kutnick, CTO of Lernout & Hauspie, to get his views on the speech industry.

Has speech become a mainstream product? Are we approaching a time when the price of speech will be zero?
We probably ought to define what we mean when we say speech. Obviously, if I thought the price of speech was truly zero, I wouldn’t be in this business. But if we mean things like speech recognition being built into the Windows operating system, then yes, the day will come when speech is free within the operating system. What will happen over the long haul is that you will be able to dictate into products like Word. The technology will be embedded. And what does it mean to say speech is in the mainstream? A lot of text-to-speech applications are already there. I can call the bank, or make flight arrangements, by talking to a machine. Many people are now doing their banking and making travel arrangements by telephone to call centers, using speech. However, speech recognition still has a way to go before it reaches an everyday level.

What are some of the biggest obstacles speech needs to overcome?
There are numerous obstacles to a speech application becoming commonplace, and they vary with the application. Strong dialogue systems are needed, and there is the whole issue of microphone use in open spaces. There is also the psychological issue of talking to a machine. A lot of people are not used to the idea of talking to a PC. It may be easier with the telephone, where you are talking to an inanimate object in the first place. But I think over the next couple of years we will see this hurdle cleared. A lot of the Auto PC applications, especially those with navigation tools, will help. You tell the system where you want to go, and it gives you directions. Speech is the only interface for this; nothing else makes sense. Reading maps while you are driving is not safe.

What sort of changes do you see in the speech industry in the next 3-5 years?
There will be a lot of new applications in the next 3 to 5 years, as people become more used to talking to machines. Once you talk to the Auto PC, you may be more willing to talk to the PC on your desk. You’ll see a lot of speech products coming into the home, where you can surf the Internet, with tie-ins to Web TV. You’ll have dialogue systems that will let you ask the TV what to watch. You could say "When is the next Clint Eastwood movie on?" It will solve the whole problem of having 500 channels and never being able to find something you want.

And I think within that time frame you will definitely see speech integrated into the operating system. If you look at how Windows evolved, you’ll get some clues as to how speech will advance. The first version of Windows was a graphical system: it took existing text and put it on a screen. It was not until 3.0 that it was really a graphical program. When they added the mouse, it was just to enhance the system. But now, if you take away the mouse, most people would have a hard time getting the PC to function.

With speech, we are going to have to create dialogue mechanisms within a speech environment that make sense. A lot of people start with a speech menu. To me, a speech menu is an oxymoron. Speech creates an opportunity to break through menus. If you say "Print three copies of this on my LaserJet," you want the machine to know, or be able to find out, what you mean by "this," and whether you mean the LaserJet at your desk or the one on the network. Think of how humans use speech: you need an interactive conversation to really be able to understand one another. With the operating system speech enabled, each program is able to understand voice commands directed at it. That means if I say "Open my financials," it will know to look for a spreadsheet and open Excel. Speech will evolve toward that over time, to achieve the concept of understanding.
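Kutnick's "Open my financials" example amounts to mapping a free-form utterance to the application that should handle it. As a minimal, purely illustrative sketch (the keyword rules and application names here are hypothetical, not L&H's or SAPI's actual interfaces), that kind of command routing can be as simple as scoring keyword overlap:

```python
# Hypothetical sketch of utterance-to-application routing.
# The keyword sets and app names are illustrative assumptions,
# not any real speech API.

INTENT_RULES = [
    ({"financials", "spreadsheet", "budget"}, "Excel"),
    ({"letter", "memo", "document"}, "Word"),
]

def route_command(utterance: str) -> str:
    """Pick the application whose keyword set best matches the utterance."""
    words = set(utterance.lower().replace(",", " ").split())
    best_app, best_score = "unknown", 0
    for keywords, app in INTENT_RULES:
        score = len(words & keywords)  # count of matching keywords
        if score > best_score:
            best_app, best_score = app, score
    return best_app

print(route_command("Open my financials"))  # -> Excel
```

Real systems of the kind described in the interview would of course add dialogue, context, and clarification ("Did you mean the spreadsheet or the report?") on top of this bare mapping.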
For the system to be speech enabled, there is a lot of conceptual activity that needs to be understood by the operating system. These are not just dictation products; they are document generation products, and the computer should be able to perform its tasks in a modeless manner.

What vertical markets make the most sense for speech?
We believe very strongly that medical is an important vertical. And we see a lot happening on the legal side, both from lawyers who are using speech dictation products and from law enforcement. For example, there are applications where a parking inspector can, after spotting a car parked illegally, recite the license number and have information about the driver read back by text-to-speech. All the inspector has to do to give a ticket is stop and put it on the windshield. There are a lot of handheld applications out there that use OCR (optical character recognition) and that can also use speech, not with the idea of replacing OCR so much as working in conjunction with it. Likewise, there will be a lot of cases where speech will not replace the keyboard, but will work in conjunction with it.

Do end users notice the difference between 95% and 97% recognition rates? Ultimately, what do end users want?
End users want something that enhances their productivity and helps them work better. Most users probably don’t notice the difference between three errors on a screen and five, and I don’t think it matters that much. What is more important is that the dialogue is enhanced by feedback mechanisms. This is where dialogue comes in. With dictation to a human, if the assistant has a question, he or she asks for clarification, or makes a best guess and asks, "Did you mean this, or that?"

Brian Lewis is the editor of Speech Technology magazine.