Hype vs. Effective Speech Technology

My first real experience with speech technology was in the research group at WordPerfect in 1994. I was given the task of adding speech recognition capabilities to WordPerfect 6. After surveying the available technology and business opportunities, we inked a deal with Dragon Systems to include a version of DragonDictate in the box. Unfortunately, the deal was signed so close to the release date that the technology was not mentioned in any of the marketing material. Further, to use the technology, the user had to send away to buy a microphone. This was, in our view at the time, a great technology breakthrough that remained hidden from users due to poor marketing. The effort made no impact on either WordPerfect or DragonDictate sales.
Subsequently, WordPerfect was sold to Corel. Corel wanted to make a splash with its newly acquired product, and it did so using speech technology. The company repackaged WordPerfect 6, included a microphone in the box, and added a big sticker on the front of the box: “FREE Version of DragonDictate Inside!” This was the same technology that had shipped in the earlier version, with a microphone added. The result exceeded all expectations, and consumer sales of WordPerfect literally tripled overnight.
Would this trigger the transition of speech recognition from a niche technology to a broad-based productivity tool? Unfortunately, it was not to be. Few users went further than trying out the new gimmick. The hype behind the technology was more valuable than the technology itself. Three key factors kept users from widespread use: 1) the length of time required to “train” the system, 2) a required change in modality, and 3) poor system performance due to poor-quality sound input devices on computers. Over the last 10 years, significant technological strides have been made in reducing system training requirements. Yet the other two barriers to the success of that first widespread deployment of speech technology remain.
Today’s users have grown up with a keyboard and mouse. The process of writing is closely linked to having their hands on the keyboard. It’s what comes naturally. Changing the modality of input means that many of the thought processes and timing associated with writing change. There is little that speech recognition vendors can do to help users overcome the change-in-modality barrier. For users to be successful with large-vocabulary speech recognition, they must be highly motivated. This barrier to widespread adoption has not changed.
Last year my employer purchased a beautiful new laptop computer for me. The computer was from a major manufacturer and included the best processor available, lots of RAM, a super-cool display, a huge hard disk, and full multimedia support. “Finally,” I thought, “a portable computer that I can use speech recognition on.” In great anticipation, I installed a commonly used speech engine, plugged in my microphone, and started training. In the end, my recognition rate was no better than 75%. It was totally unusable. The audio input on the device was so noisy that it failed miserably with speech recognition software. Ten years have passed, and still manufacturers refuse to spend more than pennies on the audio input.
Yet, while some areas of speech technology have stalled in their effort to gain broad-based acceptance, other areas have achieved significant success. Speech recognition and synthesis are commonly used in broad-based service centers. Huge savings are being realized by companies that use speech technology as their initial customer interface. From my perspective, as an eager user of speech technology, it seems that the more focused the application, the more likely it is to be successful. More specifically, the productivity gains seem to be very small when measured on a per-transaction basis. Only when you can apply the gain to millions of transactions does the technology become compelling.
Thus, I believe the key to a successful speech application is being able to target a very well-defined problem. History has shown us that efforts to produce general-purpose solutions with speech technology are unlikely to achieve a viable business model. When defining projects or products that will leverage speech technology, it is imperative to clearly identify where the productivity gains will be realized. Developers should shy away from the temptation to implement a function because it is “cool.” The focus needs to be on features that are quantifiably useful.
Recently, a local bank has been advertising its new “Speech Access” heavily on radio and television, with an incredible amount of hype. “It’s the talk of the town,” claim the hawkers. With sound bites of bank customers asking both simple and complex questions, listeners are encouraged to believe that a truly automated, intelligent teller is now available. The hype continues.
Bruce Armstrong is the VP of Marketing at Adept Systems, a provider of building automation networks infrastructure. He has been on the AVIOS board of directors since 1998. He can be reached at Bruce_Armstrong@MyRealBox.com.
