Innovations - Speech Technology With Impact
Innovative Research in the Labs Part III - Nuance
This month's company, Nuance, differs from those we profiled previously in that its roots are in speech technology rather than computing. In fact, a list of all the companies that contributed to Nuance's current incarnation, whether through mergers or through company or asset acquisitions, would read like a "Who's Who" of speech technology. Some of the more notable names in the Nuance melting pot include ScanSoft, SpeechWorks, Dragon Systems, Lernout & Hauspie, Rhetorical, Phonetic Systems, and LocusDialog.
Areas of Focus
Nuance's research, experience, and product portfolio are as diverse as any in the industry, with a focus on three areas: network, embedded, and dictation. In the traditional model, each of these core areas feeds specific markets. Network speech is deployed in IVR and contact center self-service applications for transactions and call routing, and in directory service applications. Embedded speech addresses the mobile market through voice dialing, telematics, command and control, and navigation. Finally, dictation is aimed at the desktop market, medical transcription, and audio indexing. In these areas, Nuance has considerable experience in developing and deploying critical applications across all industries.
Innovation of the User Experience
As the market matures, Nuance's R&D continues to refine the core technology of its products, but it is also innovating well beyond such basics as increased vocabulary size, improved accuracy, and faster retrieval rates by developing new areas and creatively combining core technologies. In addition to advances in technology, two other trends contribute to the vision that drives Nuance R&D: the convergence of existing technology in different markets, and the rapid proliferation of multimodal devices and low-latency packet-based wireless networks. Together, these are enabling Nuance to focus on wholly transforming the user experience, from siloed applications in the core markets above to multimodal applications that draw on multiple technologies and offer a more intuitive, efficient, conversational interface.
For example, Nuance is working to make the telephony user interface reflect a user's natural speaking tendencies, adding flexibility to the conversation flow and moving away from plain directed dialogs. The company does this by drawing on techniques used in the dictation market, such as reducing dependence on handcrafted "closed" grammars by using statistical language models (SLMs), and by applying research done in speaker adaptation. Borrowing from one market in this way allows advances in another: an application can support multi-concept utterances, and the system can adapt to the caller's input to create further prompts tailored to the information given or to the caller's history.
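To make the contrast between a closed grammar and an SLM concrete, here is a minimal sketch. It is not Nuance's implementation: the training utterances, the grammar, and the function names are all hypothetical, and a simple bigram model with add-one smoothing stands in for whatever statistical model a production system would use. The point it illustrates is that a closed grammar rejects any phrasing it does not list, while an SLM assigns every word sequence a score and prefers the ones that resemble what callers actually say.

```python
import math
from collections import Counter

# Hypothetical training utterances; a real SLM is trained on far more caller data.
TRAINING = [
    "i want to check my balance",
    "i want to pay my bill",
    "check my balance please",
    "pay my bill please",
]

# A handcrafted "closed" grammar accepts only the exact phrasings it lists.
CLOSED_GRAMMAR = {"check my balance", "pay my bill"}


def grammar_accepts(utterance):
    """Return True only if the utterance is listed in the closed grammar."""
    return utterance in CLOSED_GRAMMAR


def train_bigram(sentences):
    """Count word-pair (bigram) and history-word frequencies."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])           # words that act as a history
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams, len(vocab)


def log_prob(utterance, model):
    """Log probability of an utterance under the add-one smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    words = ["<s>"] + utterance.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        # Add-one smoothing keeps unseen word pairs from scoring zero.
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


MODEL = train_bigram(TRAINING)
```

With this sketch, `grammar_accepts("i want to pay my balance")` is false because that exact phrase was never listed, yet `log_prob` still gives it a finite score, and a higher one than the same words in scrambled order.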
For speaker adaptation and personalization, Nuance is borrowing from speaker identification technology to select a custom caller model, either one built from known caller information or one representative of a caller type. Personalization is accomplished by adapting aspects of the call, such as the prompting level, in response to dialog patterns or to stored user preferences such as language and speed. Applications are moving toward information-driven dialogs, which use rules to dynamically compute the appropriate response based on multiple pieces of information in an utterance or from caller history.
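An information-driven dialog of the kind described above can be sketched in a few lines. This is only an illustration under stated assumptions: the slot names, the `CallerState` class, the `next_prompt` function, and the prompt wording are all hypothetical, and the rules are deliberately simple. The idea shown is that the next response is computed from whatever the caller has already supplied, in the utterance or in stored history, rather than from a fixed sequence of questions.

```python
from dataclasses import dataclass, field

# Hypothetical pieces of information a travel application needs before it can act.
REQUIRED_SLOTS = ("origin", "destination", "date")


@dataclass
class CallerState:
    slots: dict = field(default_factory=dict)    # extracted from the current utterance
    history: dict = field(default_factory=dict)  # stored from earlier calls


def next_prompt(state):
    """Compute the next prompt from everything known about the caller so far."""
    # Start from caller history, then let the current utterance override it.
    known = dict(state.history)
    known.update(state.slots)

    missing = [slot for slot in REQUIRED_SLOTS if slot not in known]
    if not missing:
        # Everything is known: confirm instead of asking more questions.
        return (f"Booking {known['origin']} to {known['destination']} "
                f"on {known['date']}. Is that right?")
    if len(missing) == 1:
        # Only one item is missing: ask for exactly that item.
        return f"What is your {missing[0]}?"
    # Several items are missing: invite a multi-concept answer.
    return "Please tell me your " + ", ".join(missing) + "."
```

A caller who says where and when they want to travel is asked only for the origin; if the origin is already in their stored history, the application skips straight to confirmation.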
Such advances are being combined and applied to multimodal applications, so that users can interact with an application from their device of choice and have the application adapt. For example, from a wireless phone with a screen, the caller can speak or key in a request and get a visual display, text, or text-to-speech back. Further, the availability of screen devices is driving new applications such as dictation of email and SMS on mobile phones, done either in an embedded fashion or in a distributed fashion, with the bulk of the processing done on a network server.
Finally, in addition to improving the naturalness of dialog flow and prompting, Nuance is improving output delivery by making TTS sound more natural, through prosodic modeling and fine-grained control of the speaking style.
We are starting to see a blending of technologies and applications from different markets, resulting in new capabilities. It won't be long before extraneous information given by a caller results in the application presenting a new option from a different service, in anticipation of the caller's need. So don't be surprised if, when you say, "I want to go to Boston on Thursday, but I might want to fly to New York for the weekend," you find yourself with ticketing, New York weather information, and Broadway tickets all in one call.
Have any innovative research news from R&D? Please email me at email@example.com