Seeing the "double disability" endured by female cerebral palsy victims sparked one speech executive's career choice. Giving machines a female voice is still a motivating factor for Dr. Caroline Henton, vice president of Strategic Technology at fonix Corp., who is also the British English voice talent for Pulse Point Communications and Centigram Communications, as well as the author of 44 technical and research publications. Despite a busy schedule, Dr. Henton recently found time to talk with Speech Technology magazine.

Has speech technology become a mainstream product? If so, what development made it possible? If not, what is holding it back?
The acceptability of speech technology in so-called "mainstream" products depends on the particular speech application and how well it is integrated into the overall product. The time and effort required to train dictation systems restricts the number of potential users. Fully comfortable, transparent, unrestricted mainstream ASR is probably three to five years away. The key breakthrough in speech synthesis has been concatenative speech using non-uniform units (larger than phonemes or diphones) with variable, natural prosody, and the ability to provide a variety of voices (e.g., male and female, young and old) in different languages. There has been no quantum breakthrough in ASR; rather, progress has been incremental and grinding. Generally, ASR systems should aim lower, but be more accurate and offer more intuitive use.

Why do you consider concatenative speech to be such a breakthrough, and how does it differ from other TTS systems?
There are two approaches to TTS: concatenative and parametric, or rule-driven, synthesis. Parametric synthesis suffers from the fact that we do not yet have the acoustic knowledge of what is actually going on in speech to produce a natural sound by rule. Concatenative synthesis, by definition, splices together recorded pieces of varying size to produce more natural-sounding voices.

What do you mean when you say ASR systems should aim lower?
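The concatenative approach Dr. Henton describes can be sketched in miniature: pre-recorded units of speech are spliced end to end to form the output waveform. The toy inventory below (unit names and sample values alike) is invented purely for illustration; real systems select among many candidate units of non-uniform size and smooth the joins.

```python
# Toy concatenative synthesis: splice pre-recorded units end to end.
# The unit names and sample values here are invented for illustration;
# the lists stand in for recorded waveform snippets.

UNIT_INVENTORY = {
    "h-e": [0.1, 0.3, 0.2],   # hypothetical diphone recording
    "e-l": [0.2, 0.4],
    "l-o": [0.4, 0.1, 0.0],
}

def synthesize(unit_sequence):
    """Concatenate the recorded samples for each requested unit."""
    waveform = []
    for name in unit_sequence:
        waveform.extend(UNIT_INVENTORY[name])  # splice this unit's samples
    return waveform

waveform = synthesize(["h-e", "e-l", "l-o"])  # a "hello"-like target
```

Because the output is stitched from real recordings rather than generated by acoustic rules, naturalness comes largely for free; the hard problems move to unit selection and join smoothing.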
I think on the ASR side some companies have rolled out products that are not accurate enough. There are products out there that I would not use; I don't have six hours to get the machine to work. Some companies have promised too much too soon. Speech can be most effective in a situation where there is a predictable vocabulary and a well-honed dialogue.

What is the near-term future of speech technology?
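Dr. Henton's point about predictable vocabularies can be illustrated with a small sketch: when the dialogue admits only a handful of legal phrases, even a noisy recognition hypothesis can be snapped to the closest valid command. The command set and function names below are invented for this example.

```python
# Illustration of a constrained-vocabulary dialogue: map a possibly
# misrecognized hypothesis to the closest legal command. The command
# list is invented for this sketch.

import difflib

COMMANDS = ["check balance", "transfer funds", "speak to an agent"]

def interpret(hypothesis, cutoff=0.5):
    """Return the closest legal command, or None if nothing is close."""
    matches = difflib.get_close_matches(hypothesis, COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

interpret("chek balance")  # snaps to "check balance" despite the error
```

With an open vocabulary, every recognition error reaches the user; with a closed grammar, most errors are absorbed before they matter, which is why constrained applications were viable well before unrestricted dictation.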
The future of speech technology lies in mobile communication, whereby users' lives are enhanced and made more efficient by seamless, transparent, speech-driven interaction with machines. The human should never be blamed for the failings of the machine or computer. A robust speech interface will lead to greater work efficiency through natural and accelerated communication. Business contacts will be reinforced more quickly, and documents will be transmitted with richer content and in more malleable forms, giving rise to more effective, unambiguous, multimodal communication. Embedded speech systems (such as on-board, in-dashboard chips coupled with GPS, mobile navigation, and intelligent information systems) add security, safety, and speed of arrival and, if necessary, can ensure the rapid dispatch of emergency rescue teams. Interactive agents and avatars will be client-specific, customized, available in many languages, and easily integrated into all distributed multimedia applications.

What are the most important factors driving the market?
Mobility is the single most powerful driving force in the growth of speech technology. Smaller handheld devices that incorporate cell-phone links, PDA applications, handwriting entry, and infrared connectivity will be physical necessities for communication in the next millennium.

What first attracted you to speech recognition issues?
In 1982, while a graduate student at the University of Oxford, I watched a "Discovery" channel documentary about cerebral palsy sufferers in the United States. Their intelligence and determined cheer in overcoming physical obstacles was impressive and moving. The epiphany for me occurred when I watched young female cerebral palsy victims complaining that it was alienating to have to talk with a mechanical voice, but to have to talk with the wrong voice was doubly disabling and de-humanizing. I realized then that I wanted to research female speech, and create the highest quality female synthetic speech.