Despite Great Progress, Engine and Interface Challenges Remain

Speech-to-Text (STT), the automatic creation of text by voice, has always been the flagship of automatic speech recognition (ASR) applications. Converting any string of spoken words to readable text holds out the promise of eliminating the tedium of typing or the cost and delay of human transcription. To fulfill this promise, automatic speech recognition for transcription should be as easy and casual as asking your assistant to take some notes. The idea is to just add a microphone to a PC and begin speaking.
Tremendous progress has been made in the development of automatic speech recognition since 1989, when the first DragonDictate STT system for the PC was introduced. That system had a recognition rate of 12 words per minute, spoken discretely, and cost $10,000 plus a plug-in board. Now, in 1999, Dragon Systems and several other vendors, including IBM, Lernout and Hauspie, and Philips, offer continuous ASR software packages that run on standard PCs for under $50, with effective text input rates faster than all but the best typists.
However, despite this great progress in automatic speech recognition, most people who create text are still typing or using human transcriptionists. Computers, including many small handheld devices, are still being sold with keyboards as integral components. This article explores some of the important remaining challenges that must be met to make the generation of text from speech effective and truly universal in current and potential new applications.
There are currently two major applications for STT systems. The first is desktop dictation for personal and business use. Low cost software for this high volume, horizontal market application is widely available through retail computer channels. The other significant STT application is in vertical markets, principally in medical report generation. The potential for STT in Radiology, for example, has long been recognized. Strong financial incentives (quicker turn-around times, reduced transcription costs) and relatively benign dictation conditions (quiet environment, limited vocabulary and syntax) have made this a good candidate for initial market penetration.
However, despite recent robust sales of retail software for STT applications and excellent marketing incentives for vertical applications, the actual use of STT systems remains relatively low. Consumers are buying the systems out of curiosity but are not using them. Computer hardware developers are still basing their designs on keyboards and mice. Before addressing the speech recognition challenges that account for this low penetration, it is worthwhile to review some current and potential opportunities for STT systems.
Table 1 describes some speech-to-text applications and the system features needed to perform them effectively. Both required and desirable features are listed. Real-time operation (where required) and acceptable recognition accuracy have been assumed.

Table 1: Speech-to-Text Applications and System Features
Application	Required Features	Desirable Features
Current Applications
Dictation (vertical markets; office, desktop)	none listed	remote microphone, SIR, automatic punctuation, non-keyboard editing
Possible New Applications
Dictation (handheld, mobile computers)	remote microphone, non-keyboard editing	SIR, automatic punctuation
Transcription of Meetings	remote microphone, SIR, disfluent speech understanding	automatic punctuation
Live Transcription (closed-captioning; speech-to-speech language translation)	SIR, automatic punctuation	remote microphone, disfluent speech understanding
SIR = speaker independent recognition

The applications are arranged by technical difficulty, with the current, easiest dictation applications at the top. In all the applications listed, live speech must be transformed into text, but the system requirements vary. Remote microphone capability and speaker independent recognition, which eliminate the need for close-talking microphones and system training, respectively, are at least desirable for all applications. Automatic punctuation, which is a requirement in closed captioning, is also desirable in other applications to reduce editing. The recognition of disfluent speech is a requirement for the transcription of meetings and highly desirable for closed captioning and speech translation systems. Editing without a keyboard is a major requirement in handheld mobile systems, which may ultimately replace desktop PCs.
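As a rough illustration of how Table 1 can be read, the sketch below encodes each application's required features as a set and lists the applications a given system could serve. The dictionary name, the function name, and the shortened application labels are hypothetical and chosen only for this example; real-time operation and acceptable accuracy are assumed, as in the table.

```python
# Required features per application, transcribed from Table 1.
# Names and labels here are illustrative, not part of any product.
REQUIRED_FEATURES = {
    "desktop and vertical-market dictation": set(),
    "handheld and mobile dictation": {"remote microphone", "non-keyboard editing"},
    "meeting transcription": {
        "remote microphone",
        "SIR",
        "disfluent speech understanding",
    },
    "live transcription": {"SIR", "automatic punctuation"},
}


def supported_applications(system_features):
    """Return the applications whose required features are all present."""
    available = set(system_features)
    return [
        application
        for application, required in REQUIRED_FEATURES.items()
        if required <= available  # subset test: every requirement is met
    ]


# A hypothetical system with a remote microphone and pen-based editing.
example = supported_applications({"remote microphone", "non-keyboard editing"})
```

Under these assumptions, such a system would cover both dictation rows but neither meeting transcription nor live transcription, because speaker independent recognition is missing.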

Engine and Interface Challenges

Two areas in speech-to-text present challenges that must be addressed to increase the number of actual users of STT systems and to make speech recognition possible in new, even more difficult applications such as meeting transcription and live closed-captioning. The first area involves the development of a more robust ASR engine. The second area concerns the development of a user interface system. This user interface must be speech-centric and reflect not only the power, but also the limitations, of speech input.

The ASR Engine
The challenge for the ASR engine relates to an important but generally unspoken requirement of speech recognition - that it be casual. Just as speech is casual for humans in most circumstances, speech recognition should require no special user effort or intrusive equipment.
Despite recent advances in desktop dictation systems, at least several minutes of training are still required to achieve acceptable accuracy. It has been suggested that this is a small price for a serious user to pay, but psychologically it represents a significant barrier to the adoption and use of ASR systems even for personal use. It is a major barrier for applications involving multiple users, like telephone service bureaus. For meeting transcription, live closed-captioning, and speech-to-speech translation applications, the requirement for training is clearly prohibitive.
Another immediately apparent barrier to the use of speech recognition is the need for a close-talking microphone in a relatively quiet environment. Except for private desktop use, few users will tolerate a close-talking microphone strapped to their heads.
Finally, the interpretation of naturally spoken speech, which includes new, out-of-vocabulary words and the normal disfluencies of everyday speech, must be addressed in order to produce a meaningful transcript in the intended application. Natural language processing that draws on several levels of knowledge about speech content, syntax, and pragmatics must be integrated into the ASR engine.

The User Interface
The limitations of the ASR engine described above are well known, but the user interface challenge is less recognized. Current ASR systems are not at all "speech-centric." They are simply software layered on a PC designed for a keyboard and mouse. But there is an even more fundamental problem: systems have not been designed to take advantage of the capabilities of speech, accommodate its limitations, or provide an integrated, complementary input means. No system based on speech alone can offer a natural, intuitive way of creating text documents. A new interface, which successfully combines speech and gesture input, is required.
There is not a one-to-one correspondence between live speech and text. For applications requiring well-formatted documents with accurate spelling, punctuation, and capitalization, STT systems must provide easy formatting and editing capability. In current speech recognition systems, the practical production of text still involves the keyboard and mouse. In addition, any computer with a graphical user interface needs display navigation capability. Although it is theoretically feasible to correct words, capitalize, or move the cursor by speech command, speech-only text creation is impractical and tedious for all but the physically disabled. An effective speech-centric ASR system must recognize the need for an additional input modality.

Some Possible Solutions

Research and development at Fonix Corporation has targeted both the engine and system challenges in automatic speech recognition for various speech-to-text applications. Table 2 summarizes the R&D challenges for STT systems posed above.

Table 2
Speech-to-Text Applications Development Challenges	Potential Technology Solution
Remote microphone	Phase-sensitive speech signal processing
Speaker independent recognition	Neural net phoneme determination
Automatic punctuation	Multilevel Constraint Satisfaction Networks
Disfluent speech understanding	Multilevel Constraint Satisfaction Networks
Non-keyboard editing	Pen/Voice user interface

An Improved Speech Engine
In researching the problem of the remote microphone it appears that in comparison with the human ear, current ASR systems perform a relatively crude analysis of the speech signal. Some systems are insensitive to the phase of the speech signal, which is important information for human beings, allowing them to pick out individual speakers in a crowded room (the so-called cocktail party effect). It also allows discrimination against high levels of background noise and normal reverberations.
A more detailed speech wave analysis will allow the elimination of the close-talking microphone and make possible such applications as meeting transcription and dictation for mobile devices.
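The role of phase can be illustrated with a small sketch. It is not the Fonix signal processing method, which the article does not describe; it only shows, under simple assumptions, that a front end that keeps spectral magnitudes alone cannot distinguish a signal from a delayed copy of itself, while the phase still carries the delay. The test tone, the five-sample delay, and all function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)); adequate for a short demo."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def circular_delay(signal, samples):
    """Delay a signal by rotating it, a stand-in for a longer acoustic path."""
    samples %= len(signal)
    return signal[-samples:] + signal[:-samples] if samples else list(signal)


# Hypothetical test tone: two sinusoids over 64 samples.
N = 64
tone = [
    math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.sin(2 * math.pi * 7 * t / N)
    for t in range(N)
]
delayed = circular_delay(tone, 5)

spec_tone, spec_delayed = dft(tone), dft(delayed)

# Magnitude spectra agree, so magnitude-only features see no difference.
max_magnitude_diff = max(
    abs(abs(a) - abs(b)) for a, b in zip(spec_tone, spec_delayed)
)

# The phase difference at frequency bin 3 encodes the delay in samples.
phase_shift = cmath.phase(spec_delayed[3] / spec_tone[3])
estimated_delay = (-phase_shift * N / (2 * math.pi * 3)) % N
```

Here `max_magnitude_diff` is zero up to rounding error, while `estimated_delay` recovers the five-sample shift. Timing differences of this kind are one of the cues that help separate a speaker from reverberation and competing sources.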
One such ASR development approach has been based on neural net technology for both phoneme determination and natural language processing. Neural nets are closer to human structures and methods for processing speech signals.
Fonix has also developed a unique neural net architecture called the MULTCONS (for MULTilevel CONstraint Satisfaction) network, which allows the incorporation of multiple levels of knowledge about speech in the natural language processing portion of its ASR engine. These levels include phoneme and word sequences as well as syntactic, semantic, and pragmatic information, and they contain the information necessary to automate punctuation and correct naturally disfluent speech.
For these and other computing devices, we expect a user interface employing advanced ASR engine technology and a pen/voice interface will finally displace the century-old typewriter keyboard as the principal user interface.

John A. Oberteuffer, Ph.D., is Vice President, Technology, at Fonix Corporation and the publisher of ASR News. He can be reached at jober24@aol.com.
