SpeechTEK Hands-On: The Latest and Greatest
At SpeechTEK 2009, held in August in New York, a number of speech technology vendors submitted products for review by attendees as part of the SpeechTEK Lab group of sessions. The hands-on, interactive product evaluations involved technologies in five categories: speaker identification and verification, text-to-speech, enterprise solutions, mobile solutions, and In Beta, a new area that allowed companies to demonstrate experimental technologies not yet ready for commercial release. Judith Markowitz, president of J. Markowitz Consultants, moderated the speaker identification and verification session. Bill Scholz, president of the Applied Voice Input/Output Society (AVIOS), and Thomas Schalk, vice president of voice technologies at ATX, moderated the session on mobile devices. Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group, moderated the enterprise solutions session. Caroline Henton, founder and chief technology officer at Talknowledgy, ran the session on TTS, and Moshe Yudkowsky, president of Disaggregate Consulting, moderated the “In Beta” session.
Speaker Identification & Verification (By Judith Markowitz)
In the speaker identification and verification lab, seven companies staged demonstrations involving enrollment plus verification, followed by a demo of their own choosing. The small room became overcrowded, and the noise reached exhibit-hall levels when attendees remained to talk with vendors. The crowd and noise prevented our judges from effectively testing all of the products.
Avaya (securing conference calls)
The verification dialogue depends on whether the participant calls from a registered or nonregistered (e.g., hotel) phone.
One judge was unable to enroll, likely due to the ambient noise conditions. Combining the identity claim and verification into a single utterance (e.g., “My name is John Doe”) makes this system useful for frequent-use, moderate-security applications, including teleconferencing. Prompting was ambiguous and didn’t match the expected response (i.e., “My name is ___”). The error correction scheme assumed the caller was uncooperative or not listening, rather than in a noisy environment. Such assumptions quickly become annoying to users. These issues are forgivable in a demo, but cause serious problems in deployments.
Convergys (multifactor authentication on Convergys’ hosted platform)
Convergys had a good story, but judges didn’t see anything working. They liked the company’s approach to multifactor authentication, and it appears the system is platform- and vendor-independent (for vendors supporting Web services interfaces). The VA Deployment Portal appeared to have the advertised functions: user management, app configurations, and analytics.
Nuance Communications (multifactor authentication and “pseudo text-independent” technology that generalizes enrollment on the digits zero through nine to acoustically similar proper names, e.g., “seven” to “Kevin,” for verification)
Nuance included date knowledge in the demo. Judges enrolled successfully using their telephone numbers and authenticated on name strings. The idea of using text-dependent enrollment and “near text” authentication is clever, although some name strings were hard to pronounce. This application is practical, easy to use, and should be feasible assuming the authentication strings can be pronounced by all callers.
PerSay (MobileSV embedded on a BlackBerry)
Judges thought this product was conceptually excellent and the technology was robust.
MobileSV worked surprisingly well considering the noise level in the room. The user interface and workflow of the Evaluation Studio product for developing tests of SIV technology were quite good, and wizards were helpful in stepping developers through some of the processes. Both products seem practical and useful.
Recognition Technologies (enrollment on a telephone and classification/identification on a headset microphone)
As raw technology, this delivered what it promised and was interesting, but it isn’t a solution. The requirement for one minute of enrollment on read text and for 10 seconds of free speech to verify felt cumbersome, but the vendor stated it is working to reduce the total time to about 30 seconds. The lack of a user interface made this worse. Unless this technology is combined with a liveness test (e.g., speech recognition), it would be vulnerable to recording attacks.
Speech Technology Center (VoiceGrid speaker identification for very large voice databases)
Judges walked away unclear about how VoiceGrid works and how the company quantifies its SIV metrics.
VoiceVault (VoiceSign, for vocally “signing” a statement)
Customers like random digits for high-security, voiceprint applications. The separation of the identity claim from the verification utterances can make it clear to callers that a biometric is being applied. That is helpful. Unfortunately, the way this demo used the same four digits (0, 5, 7, and 9) in rotation is problematic. During enrollment, repeating the same four digits in multiple permutations was a perceptually difficult task. Sometimes it made users unsure if they had indeed repeated the correct digits. The ambient distractions made this worse. During verification, one judge deliberately presented different digits from those requested (substituting 5 and 9) and was still accepted. The vendor said the application was meant to work like that, but the same judge said it lowered his confidence in the technology, especially since, in one case, it was used as a single-token security input without challenge.
The electronic-signature demo essentially applied the same method and dialogue to a different task. This was clearly a demo, and the usability of an actual end system is hard to assess given the demonstration alone.
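The judge’s concern about substituted digits can be made concrete. A random-digit prompt only serves as a liveness check if the digits the caller actually speaks are compared against the prompt, in addition to the voiceprint match. The Python sketch below illustrates that idea only; it is not VoiceVault’s implementation. The function names, the `recognized` and `voice_score` inputs (standing in for the output of a speech recognizer and a speaker-verification engine), and the 0.8 threshold are all assumptions made for illustration.

```python
import secrets

DIGITS = "0123456789"


def make_challenge(length=4):
    """Return a random digit string to prompt the caller with."""
    return "".join(secrets.choice(DIGITS) for _ in range(length))


def verify(prompted, recognized, voice_score, threshold=0.8):
    """Accept only if the prompted digits were spoken AND the voice matches.

    prompted    -- digit string the system asked for
    recognized  -- digit string a speech recognizer heard (assumed input)
    voice_score -- similarity in [0, 1] from a verification engine (assumed input)
    threshold   -- arbitrary illustrative cutoff, not a recommended value
    """
    if recognized != prompted:
        # Wrong digits: could be a replayed recording or a confused caller.
        return False
    return voice_score >= threshold


# A caller who says different digits is rejected even with a strong voice match.
print(verify("0579", "0579", 0.93))  # True
print(verify("0579", "5959", 0.93))  # False
```

Drawing the prompt from all ten digits, rather than rotating the same four, also makes a recorded utterance less likely to match the next challenge.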
Mobile Devices (By Bill Scholz and Tom Schalk)
Four vendors demonstrated products for mobile handheld devices, illustrating the breadth and depth of functionality now achievable on devices no larger than the palm of a hand.
Novauris, developers of recognition technology that identifies complete phrases by matching them against a massive inventory of possible utterances, illustrated three mobile device applications that use its technology.
U.S. Address Input lets users speak a complete address, such as “1216 State Street, Boston, Massachusetts.” In the demo, the text of the address appeared on the handset screen within seconds, followed by a map with the target address highlighted.
Music Access lets users speak a song title and/or artist. In the demo, the title and artist appeared as text within seconds, and an excerpt from the song played. Typical requests, such as “‘Jailhouse Rock’ by Elvis Presley” or “‘Satisfaction’ by the Rolling Stones,” retrieved the songs and excerpts from a library of more than 5,000 tracks.
Tokyo Train Routing lets a Japanese speaker identify the route from one Tokyo subway station to another by speaking only the end points.
All three applications demonstrated well (although the absence of Japanese speakers reduced the impact of the Tokyo Train Routing application). Novauris excels in its ability to accurately search massive databases in seconds.
Nuance Communications demonstrated its Mobile Speech Platform on a number of mobile devices. The platform is an architecture of tools and components that lets mobile application developers add speech to their offerings. It allows users to speak SMS or email messages using large-vocabulary speech recognition, the same technology that powers Nuance’s Dragon NaturallySpeaking desktop dictation products. For mobile search queries, applications built using the Nuance Mobile Speech Platform allow users to simply say, “Um, find the Starbucks on Main Street, please,” or “I’m looking for, ah, Justin Timberlake ringtones,” and be directed to the desired content. Users do not need structured phrasing when speaking queries, and the application filters out “um” and “ah” utterances before recognition.
The Nuance Mobile Navigation Service Package has been designed specifically to leverage Nuance Mobile Search Service Package technology for the unique needs of navigation and location-oriented applications. The package includes street and address grammars that allow mobile navigation application providers to add voice destination entry (including street addresses, businesses, and points of interest) and natural spoken, turn-by-turn directions to mobile navigation applications for mobile phones and PDAs.
Vlingo has developed hosted speech recognition capabilities that have been integrated into numerous mobile device applications. A number of mobile user interfaces were demonstrated, and ease of use was highly apparent. Vlingo views the user interface as key to unleashing the power of mobile devices. The Vlingo UI enables mobile navigation, Web search, email, text messaging, social networking, and more.
The company has established a large base of users, and accuracy continues to improve through proprietary adaptation techniques. Millions of users have downloaded Vlingo to their BlackBerrys or iPhones, and have spoken tens of millions of times to it to send text messages, surf the Web, update their Facebook or Twitter status, and more. Vlingo’s UI solutions remove the need for menus by allowing users to simply say what they want without narrowing their intent.
Voxeo demonstrated an architecture that aims to provide a consistent service offering across multiple channels (SMS, IM, mobile Internet, and video). Cross-modality interoperability between the VoiceObjects Server and Voxeo’s hosting platform for IM was demonstrated. The solution illustrated an approach to multichannel integration that allows the creation of services across different phone channels in a consistent, readily maintainable manner. Customer success stories using the new interoperation architecture were highlighted.
In summary, the mobile laboratory demonstrated that the technology for presenting sophisticated applications on mobile devices is rapidly maturing, and a variety of cross-modality, multilanguage applications are already bringing to mobile devices solutions previously seen only on the desktop. The days of limited vocabulary coverage are gone, and natural speech input capabilities are maturing rapidly. Speech-enabled typing has proved to be a highly desirable feature on any mobile device that allows searching and messaging.
Enterprise Solutions (By Deborah Dahl)
From voice control of computers to multimodal applications supporting the mobile workforce, enterprises are recognizing the value of speech applications where the users are employees rather than customers. Compared with call center applications, enterprise applications, including dictation, office productivity software, warehouse picking applications, and field force automation, are more diverse and wide-ranging. Design considerations and requirements for enterprise applications also differ from those of call center applications. For example, employees can be specifically trained to use the application, speaker-dependent recognition is possible, microphones can be controlled, and specialized devices can be provided to employees. All of these considerations make enterprise voice applications an important area of speech technology.
This session showcased inspiring and innovative enterprise speech applications by vendors whose applications fell into five categories: desktop dictation and computer control, warehouse picking, authentication, support for a mobile workforce, and speech analytics.
1. Desktop Dictation and Computer Control
Microsoft demonstrated using voice for controlling the computer in Windows 7 and for dictation into Microsoft Office applications.
2. Warehouse Picking
Datria Systems and Cisco Systems demonstrated their Voice over IP-based warehouse picking application. Warehouse picking is an excellent application for voice for several reasons: The warehouse is usually hands-busy and eyes-busy, and users are constantly moving around and might even be in a cold environment where they need to wear gloves. These factors make it difficult for workers to use conventional technologies for obtaining and completing their assignments. The unique aspect of the Datria/Cisco solution is its use of server-based speech recognition rather than the more common use of speech on a dedicated mobile device. This results in a significant reduction in the cost of the devices used in the application because, instead of expensive mobile computers, users can speak to VoIP-enabled smartphones.
3. Authentication
MicroAutomation Employee Services demonstrated Loquendo speech technology used to authenticate employees through speaker verification. MicroAutomation enterprise applications include password reset and access to employee benefits information.
Speech Technology Center demonstrated two applications. VoicePin is a biometric application that provides for data security (corporate and personal) on mobile devices, which are vulnerable to loss and theft. VoicePin includes a voice interface, as well as speaker verification. The second application was VoiceKey Service, aimed at enterprise IT security. VoiceKey Service uses speaker verification to control users’ access to enterprise information.
4. Support for a Mobile Workforce
LumenVox/Incendonet demonstrated Incendonet’s SpeechBridge, a speech-driven auto attendant that provides mobile email access, calendaring, and an IVR speech platform.
Lyrix demonstrated Mobiso 6.0, which uses handheld software with cloud-based services to support mobile business functions. Users download the Mobiso application and then synchronize contacts, social networks, and corporate address book entries with the Mobiso address book. Users can then call Mobiso and use speech recognition to make calls, send messages, and conduct conferences. An interesting capability is that Mobiso helps enterprises manage telephony costs by tracking expenses from business use of personal devices. For example, calls made with a personal device to a contact in the address book are automatically tracked and compiled into expense reports.
Openstream demonstrated its multimodal Smart Assistant product for mobile workers. Smart Assistant handles incoming calls, text messages, email, schedules, and events, and allows users to respond using voice, touch, or key press. Smart Assistant also synchronizes with enterprise data. While Smart Assistant helps users with generic tasks, like handling email, Openstream also demonstrated more specific solutions for field force automation.
Speech Technology Center’s VoicePin authentication application, described above, is another solution aimed at the mobile worker.
5. Speech Analytics
Autonomy etalk demonstrated its Qfiniti Explore system, which provides analytics based on conceptual content as well as literal speech. Applications include conceptual search, automated clustering, hot and breaking topics, and script adherence. Speech analytics enables organizations to make use of voice interactions to obtain business intelligence for purposes like customer analysis and legal compliance.
Text-to-Speech (By Caroline Henton)
Of 10 text-to-speech (TTS) vendors invited, four submitted their latest products for evaluation. The lab was organized as a mix of structured and informal exploration of the capabilities of each vendor’s offerings. For the first part, we created 10 utterances to test TTS performance in pronunciation accuracy, text normalization, pausing, and other aspects of prosody:
- McCain called Obama a liberal, and then he insulted him.
- Jenny gave Peter instructions to follow.
- I want doors I can shut!
- Was the red book read or do we have to read it?
- Bring me a blue towel, and a red one.
- Cumin, fenugreek, bouillabaisse, rouille, Riesling.
- 124th Avenue, 120 4th Avenue, 100 24th Avenue.
- Suisun City, Poughkeepsie, Coeur d’Alene, Streatham, Guildford.
- Barrasso, Boustany, Faleomavaega, Grijalva, Kratovil, Radanovich, Sebelius.
- Ralph Vaughn Williams, Ralph Fiennes, Nicolas Sarkozy, Nicholas Nickleby, Maria Callas, Black Maria.
This list was given to each vendor at the start of the lab. After each synthesized the phrases, attendees were encouraged to further sample the synthesizers. Accuracy and naturalness of the 10 utterances are detailed, in turn, for each participant:
IVO Software offers two U.S. English voices and two U.K. English voices as part of its Ivona TTS offering. Utterances were synthesized in both varieties with very good, natural voice quality and speech rate. Comments on the pronunciation and prosodic accuracy focus on Ivona’s U.S. voices. Generally, intonation and (de)accenting were natural in all 10 phrases, except for Nos. 1, 3 (“can” accented wrongly), and 5, where “one” was accented and mispronounced as “wan.” With the more exacting spices, gastronomic items, and proper nouns, Ivona did not fare so well: “Bouillabaisse” and “rouille” were pronounced incorrectly; all place names in No. 8 were wrong except “Guildford”; and all current members of the U.S. Congress in No. 9 were mispronounced, as were all personal names in No. 10, except for “Nicolas Sarkozy,” “Nicholas,” and “Maria Callas.” Ivona could improve its prosody and text normalization. Other shortcomings can be addressed by an exceptions lexicon and/or keeping its recorded database of words more up-to-date.
Lessac Technologies’ synthesis is under development, with a focus on prosody. The company’s Web site states, “[It] has developed a new automated method for producing human-quality expressive speech from plain text; the synthesized speech produced by a prototype demonstrator sounds quite similar to speech from a skilled newscaster.” These are worthy goals, but Lessac performed disappointingly, with unnatural prosody and word-accenting in all utterances except No. 2. “Obama” was embarrassingly unintelligible; all items in Nos. 6 and 9 were mispronounced. The addresses in No. 7 were all pronounced identically. In Nos. 8 and 10, the only place/personal names pronounced correctly were “Poughkeepsie,” “Nicolas Sarkozy,” and “Maria Callas.” R&D focus and ambitions aside, Lessac’s synthesis is clearly not ready for release as a commercial product.
Loquendo is an established TTS vendor with a history of providing quality U.S. English synthesis. It was surprising that the latest version produced utterances with many suboptimal acoustic and prosodic artifacts. The speech rate was too fast, so syllables were lost (e.g., “a” in No. 1 and “or” in No. 4). In No. 2 “Peter” was produced with an unaspirated P, and the prosody of the final fall was inappropriate, as was the monotone in No. 3. In No. 5, “towel” had an odd echo, and “one” was accented incorrectly. All items in No. 6 were badly segmented, with an additional schwa-onglide audible in “Riesling.” Pronunciation of names was unimpressive: Only “Poughkeepsie,” “Guildford,” and “Maria Callas” were correct in Nos. 8, 9, and 10. In general, it was disappointing to hear how poorly Loquendo did.
Tellme Networks was a unique participant because it does not offer “general-purpose” TTS. Rather, the Zira voice “was created with a stringent casting process, Tellme’s best audio practices, and expert linguistic design.” Tellme designed Zira TTS “to deliver more natural pronunciation of words most commonly used by its callers.” Given this focus on fulfilling customers’ branding and (niche) data domains, it was impressive how well Zira fared with the materials provided. There were some unnatural segmental errors, such as infelicitous pops and intrusive glottal stops in Nos. 1 and 2, and a schwa-onglide audible in “Riesling.” Prosody was absent in Zira, except for declarative statements, so commas, queries, and exclamation points were all ignored, which affected Nos. 3 and 4 most noticeably. For Nos. 6 and 9, only “fenugreek” and “Barrasso” were correct, and in No. 10 all were wrong except “Ralph Vaughn Williams” and “Maria Callas.” For Tellme’s TTS to deliver even more natural pronunciation, it is essential to implement punctuation-sensitive prosody and to update the dictionary.
In summary, the utterances provided served their purpose well. Vendors and participants were able to compare and contrast the strengths and weaknesses of the four TTS systems. Two areas for improvement were apparent for all vendors: better prosodic rules and more current exceptions dictionaries. For future progress in speech synthesis, more attention to general purpose TTS (and less to, say, in-vehicle navigation) and the semantic disambiguation of homographs (read/read; Maria/Maria) would be universally beneficial.
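Utterance No. 7 shows why text normalization and pausing have to work together. Once the numbers are spelled out, the three addresses contain nearly the same words, so a phrase break between house number and street name is what keeps them apart. The Python sketch below is a simplified illustration under stated assumptions, not any vendor’s front end: the `|` break marker, the function names, and the choice to read house numbers as plain cardinals (a real system might say “one twenty”) are all hypothetical, and the number speller only covers 0 to 999.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR = {"one": "first", "two": "second", "three": "third",
             "five": "fifth", "eight": "eighth", "nine": "ninth",
             "twelve": "twelfth"}


def cardinal(n):
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")


def ordinal(n):
    """Spell out an ordinal by converting the last word of the cardinal."""
    prefix, last = re.match(r"^(.*[ -])?(\w+)$", cardinal(n)).groups()
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return (prefix or "") + last


def normalize(address):
    """Expand numbers; mark a phrase break '|' after a bare house number."""
    out = []
    for tok in address.split():
        m = re.fullmatch(r"(\d+)(?:st|nd|rd|th)", tok)
        if m:
            out.append(ordinal(int(m.group(1))))
        elif tok.isdigit():
            out.append(cardinal(int(tok)) + " |")
        else:
            out.append(tok)
    return " ".join(out)


for text in ["124th Avenue", "120 4th Avenue", "100 24th Avenue"]:
    print(normalize(text))
# one hundred twenty-fourth Avenue
# one hundred twenty | fourth Avenue
# one hundred | twenty-fourth Avenue
```

Without the break marker, the first and third addresses would normalize to the same word string, which is consistent with a synthesizer pronouncing the addresses identically when pausing is not modeled.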
In Beta (By Moshe Yudkowsky)
Five companies joined the “In Beta” session to demonstrate technology that wasn’t quite ready for full-production release but was certainly interesting, enlightening, and useful. These products ran the gamut, from in-the-weeds technology to business management. Some products were in pre-alpha release, while others were in beta, but all received a great deal of attention from show attendees.
Deutsche Telekom Laboratories showed its pre-beta tool, Multimodal Application Builder. If you’ve ever tried to build a multimodal application, you know just how difficult it can be to juggle all of the moving parts.
The goal of this tool is to generate 80 percent of a multimodal application’s code automatically, and to create output that is compatible with the standards work of the World Wide Web Consortium’s Multimodal Interaction Working Group. The tool is built on top of the industry-standard Eclipse development environment and accepts drag-and-drop graphics and/or XML descriptions as input; the output is CCXML, SCXML, speech recognition grammars, HTML and/or Flash for graphical interactions, etc. Developers then add the back-end integration that lets the application manipulate the data.
Loquendo showed a pre-alpha tool—actually, a research-in-progress tool—that it hopes will be able to add emphasis into its Loquendo TTS output. Adding emphasis to TTS can provide important cues to the listener. Emphasis Director, working in conjunction with Loquendo’s TTS Director, gives voice user interface designers an easy way to experiment with TTS emphasis.
The tool itself is very simple to use: Users enter text, highlight the portion they want emphasized, adjust the emphasis level, and then the tool will play audio with and without added emphasis. The ultimate output of the tool is marked-up text in Loquendo’s format. In the future, Loquendo intends to add other emotions to the tool.
Lyrix provides an interesting business service, currently in beta. Limitations of cell phone technology mean most people who use cell phones for business use their personal phones, and their companies reimburse them for business calls. This creates a significant bookkeeping burden on employees and the company, and employee errors generate significant costs.
Lyrix Mobiso offers a software-as-a-service solution for use with data-capable mobile phones. It uses a central directory of phone numbers and a central “reach” number. The system compensates employees for inbound and outbound business calls; personal calls remain private. The system also integrates with customer relationship management software applications, such as SugarCRM.
RebelVox showed its pre-alpha re-engineering of the basic telephone call. Instead of dialing a number and then waiting for the other person to either pick up or for the call to go to voicemail, the user simply selects the other person’s telephone number from his smartphone’s screen and begins to speak. The recipient can break in and listen to the call live; or, if he chooses not to take the call, the call is saved as a recording.
RebelVox intends to integrate the underlying platform into other products and to offer a full multimodal application programming interface. Some interesting unanswered questions about this platform still linger: For example, if someone calls me, and I break into the call toward the end, then is it possible to hear what was already said so the person who calls me doesn’t have to repeat himself? Because this platform transforms how we think about phone calls, I think it’s safe to say we will see some interesting emergent behavior and new telephone manners in the future.
Voxeo’s Tropo platform is aimed squarely at ordinary Internet developers who want to add telephony, automatic speech recognition (ASR), and text-to-speech (TTS) capabilities to their applications. To reel them in, Tropo provides Internet-friendly technical and business interfaces.