The Next Level For Speech

Advancing speech recognition to the next level has always been the goal at IBM's research labs in upstate New York. Once, that meant simply getting machines to recognize speech at all.

Now, Dr. David Nahamoo, director of the human language technologies research division at IBM, heads a department that is moving "beyond speech, to conversational systems."

IBM's research facility in Yorktown Heights is where the company creates the base engine technology that goes into all of its speech products, including ViaVoice, SimplySpeaking and ViaVoice Gold, its general-purpose continuous speech offering. For over 25 years, IBM researchers have been developing speech technology in the labs, and that work is finally bearing fruit in the marketplace.

Data Driven

Of course, IBM as a company has been data driven from its inception, and its approach to speech reflects that. From the start of its research into speech, IBM has used a statistical approach to recognition, one that uses vast amounts of data to correlate features in speech with the basic sounds of a language.

The company has been researching speech recognition since 1972, when it started with "7 or 8" people in the speech department, Nahamoo said. In the 1970s, IBM and other research labs "proved it could work," developing algorithms, Hidden Markov Models, and statistical language models.
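At the heart of the Hidden Markov Model approach is a decoding step: given a sequence of acoustic observations, find the most likely sequence of hidden states (phonemes, in speech). The sketch below is a minimal, illustrative Viterbi decoder; the two "phoneme" states and all the probabilities are invented for the example and bear no relation to IBM's actual models.

```python
# Toy Viterbi decoder: the decoding core of HMM-based speech recognition.
# States stand in for phonemes; observations for coarse acoustic labels.
# All probabilities below are invented purely for illustration.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for an observation sequence."""
    # Each layer maps state -> (probability of best path ending here, that path).
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            # Extend the best previous path into state s.
            prob, path = max(
                (V[-1][p][0] * trans_p[p][s] * emit_p[s][o], V[-1][p][1] + [s])
                for p in states
            )
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

# Hypothetical two-phoneme model.
states = ["AH", "T"]
start_p = {"AH": 0.6, "T": 0.4}
trans_p = {"AH": {"AH": 0.7, "T": 0.3}, "T": {"AH": 0.4, "T": 0.6}}
emit_p = {"AH": {"low": 0.8, "high": 0.2}, "T": {"low": 0.1, "high": 0.9}}

print(viterbi(["low", "low", "high"], states, start_p, trans_p, emit_p))
# → ['AH', 'AH', 'T']
```

A production recognizer works the same way in principle, but over thousands of states, real acoustic feature vectors, and probabilities estimated from the kind of massive data collections described in this article.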

By about 1980, the research effort was able to produce prototypes that could do "large vocabulary" discrete speech recognition. However, Dr. Nahamoo noted, "at that time, when you said large vocabulary, you meant 5,000 words."

Over the next five years, IBM and other speech related companies confronted "the challenges of the interface," as Nahamoo puts it. "First, we needed to learn how people dictate. To learn, for example, that they do not record and playback right away, but rather, almost invariably would like to dictate through to the end and then come back to fix any errors."

Such research eventually began to pay off in the form of real products, such as the VoiceType and discrete voice products and MedSpeak, the first continuous-dictation product on the market, developed with Memorial Sloan Kettering in New York and Massachusetts General Hospital in Boston.

It was immediately apparent that MedSpeak's specialized medical vocabulary would pose special challenges. MedSpeak has a vocabulary of over 25,000 words, many of them highly technical, descriptive terms that would not normally be used in other fields of medicine, let alone in general conversation.

Traditionally, the radiologist (the physician responsible for interpreting X-rays, MRIs, and CAT scans) dictated patient information onto a tape, which was then transcribed. After transcription, the report was returned to the radiologist for review and signature. The process typically took several days, sometimes even weeks.

However, a radiologist using MedSpeak can dictate, edit and electronically sign reports on a PC in real time. The system is not just faster; it also increases the confidentiality of the reports by reducing reliance on outside transcription.

It results in a report that goes directly from the radiologist to a printed version, with few, if any, opportunities for misinterpretation.

"With MedSpeak," said Nahamoo, "we had to be able to essentially model acoustically the pronunciation of people with different accents and dialects. To do this, we had to collect a lot of speech data from a lot of radiologists. We analyzed and processed more than a million radiologists' reports for our language model."

ViaVoice Gold

MedSpeak's specialized vocabulary means it is not for everyone. It is intended for a select group of medical professionals, and while it is possible to use it for general purposes, doing so would be impractical from a cost standpoint.

Until quite recently, large-vocabulary continuous speech products were produced only for special niche markets, requiring many speech samples, which drove up cost. General-purpose users had a couple of choices: they could use products with a limited vocabulary, or they could learn to use discrete systems.

In a discrete system, the user must pause between each word in order for the system to successfully recognize each individual word. Discrete language products have been available for some time, and some users have gotten to a point where they can speak very fast, while still pausing between words to allow the machine to recognize each individual word.

But many people found learning to speak discretely too inconvenient. When speech technology failed to go "mainstream," many observers attributed the failure to the awkward nature of speaking discretely.

So the importance of the continuous speech breakthroughs in 1997 can hardly be overstated. Last spring, Dragon Systems released NaturallySpeaking with a price tag of $695 to considerable fanfare. Within months, IBM rolled out its ViaVoice product, initially priced at $199, and triggered a price war that has brought continuous dictation within easy reach, at prices under $100. The companies are estimated to be selling tens of thousands of units per month, with IBM taking a lead toward the end of 1997, after initially trailing Dragon.

In 1998, a third competitor is expected to enter the fray: Lernout & Hauspie plans to launch its VoiceExpress product, a direct competitor to ViaVoice Gold.

ViaVoice Gold allows users to speak to their computers at speeds up to 100 words per minute, without pausing between words. Users can dictate directly into their word processing applications, without needing to cut and paste.

Besides replacing the keyboard, the continuous speech product can handle functions typically handled by the mouse and task bar. Simply say "print," for example, to print out completed work. ViaVoice Gold also features "context recognition," which allows the machine to distinguish between similar-sounding words (two, too, to) and determine the correct one to use. It also offers text-to-speech, allowing users to listen to their own or other imported documents, and even have e-mail messages read to them. The base vocabulary of 22,000 words can be expanded to 64,000 words to accommodate personal preferences.
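The "context recognition" idea can be illustrated with a toy language model: when several words sound alike, the recognizer picks the one most probable given the surrounding words. The sketch below uses made-up bigram counts to choose among to/too/two; a real system estimates such statistics from enormous text corpora, and the details here are assumptions, not IBM's actual method.

```python
# Toy "context recognition": picking among homophones (to/too/two)
# with a bigram word model. The counts are invented for illustration;
# a real recognizer estimates them from large text corpora.

BIGRAM_COUNTS = {
    ("want", "to"): 90, ("want", "too"): 2, ("want", "two"): 1,
    ("buy", "two"): 40, ("buy", "to"): 3,  ("buy", "too"): 2,
    ("me", "too"): 30,  ("me", "to"): 25,  ("me", "two"): 1,
}

def pick_homophone(prev_word, candidates):
    """Return the candidate word most likely to follow prev_word."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))

print(pick_homophone("want", ["to", "too", "two"]))  # "to"
print(pick_homophone("buy", ["to", "too", "two"]))   # "two"
```

The same principle, scaled up to much larger contexts and vocabularies, is what lets a continuous dictation product transcribe "I want to buy two tickets" correctly.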

IBM's research into the "challenges of the interface" has paid off in a product that allows users to complete their thoughts intuitively, make corrections at a convenient time or even defer the corrections to someone else.

First of a Kind

Speech has been one of the areas to benefit from an IBM procedure, started in 1994, called "First of a Kind." In this approach, IBM Research works with customers and other IBM divisions to identify specific needs, with the goal of enhancing product usefulness. Rather than designing a product, rolling it out, and then hoping there is a market for it, First of a Kind brings potential customers into the design phase. As a result, IBM gets a clearer picture of just what the customer needs, and can rapidly design a product that is appropriate and has the assurance of at least some market acceptance.

The MedSpeak product is a good example of first of a kind in action. Massachusetts General staff radiologists were instrumental in designing the product, giving IBM speech researchers the necessary understanding of the large and complicated vocabulary required to produce a radiology report. In effect, the radiologists defined the specifications of the product.

It was this focus on "user centered design," that allowed IBM to get a useful product out the door in less than one year, even though teaching the system to understand the data required acoustic training on approximately one million documents. The success has led IBM to plan a similar unit for pathology in the near future.

Future of Speech

Even though many of the hurdles of continuous speech have been overcome, the IBM research team is still seeking to provide more leading-edge products.

Current projects include work with Sun Microsystems on the Java Speech API and a "name dialing" application, in which users say the name of the person they are calling rather than dialing a keypad or saving the number. Also in the works are an application for the airline travel industry and a client-server product that features text independence: users could enroll in one language and have information played back in another. Still other applications in progress we saw during the tour were, unfortunately, "off the record."

It makes one wonder whether IBM thinks there is anything speech cannot do.

"To be fair," Nahamoo said, "there are some privacy issues that need to be addressed. And obviously there are a lot of heavily graphical applications where speech might not be practical." In spite of this, "the future still holds considerable promise for speech," in Nahamoo's view.

"I think in the future you will see many more 'bottom up' applications. That is, applications where speech is built into the machine from the start and not added on later."

And of course, the Internet is a source of excitement.

"In my view the revolutionary power of the Internet can be fully realized when it harnesses the power of speech," said Nahamoo. "The future will be revolutionized through conversational technology. You will be able to access a database any way you like, at any time. You can't do that today.

"We have had a lot of fun here over the years developing speech. But there is more fun ahead in the next few years."

Brian Lewis is the editor of Speech Technology magazine.

Gaining Consumer Acceptance

While the Research Labs are pushing the envelope of speech, IBM's sales and marketing efforts are focusing on gaining consumer acceptance of the technology.

William H. (Ozzie) Osborne Jr., who manages the speech division of IBM, characterizes the industry as just in its infancy and about to break through to become a major part of many consumers' lives throughout the world.

"We're getting speech products into areas where speech can be used well," Osborne said. Versions of ViaVoice and other IBM speech products are out in Spanish, French, Italian, German and Chinese.

"The product is doing very well in China," Osborne reported. "The Chinese language sometimes requires five keystrokes for a character, so mechanical input is very time consuming. The Chinese-language keyboard is particularly unfriendly."

In the near term, Osborne expects trends such as speech-enabling the Internet (a trend IBM is capitalizing on by bundling a speech-enabled explorer into Aptiva) and the continued move toward hands-free computing.

Beyond that, he will be, like the rest of us, looking to see what comes out of Dr. Nahamoo's labs. Of course, he has an advantage. He gets to peek.
