The 2009 Speech Luminaries


Luminaries are the kinds of people who make the speech technology industry move forward, whether through technological innovation, visionary ideas, thought leadership, or pushing, expanding, or redefining current boundaries. The four luminaries chosen this year by the editors of Speech Technology have helped define the current state of the market, but more importantly, their influence will continue to shape the industry as a whole, and the world beyond, well into the future.


Head of the Class

HCCL’s Juan Gilbert, Ph.D.

He hasn’t revolutionized the way Americans vote for their favorite candidates, at least not yet, but that might all change if Juan Gilbert, Ph.D., keeps pushing along the way he has this past year. As head of the Human Centered Computing Lab (HCCL) at Auburn University, where he has also been an associate professor of computer science and software engineering for the past nine years, Gilbert was instrumental in the design and development of Prime III, a voting machine with a multimodal user interface and automatic speech recognition technologies that allow voters to cast ballots either on a touchscreen or via a headset and microphone. Voters using the headset and microphone receive speech prompts indicating ballot options. Each option is randomly assigned a number, and voters simply speak the number associated with their particular choice, so that anyone who overhears them learns nothing about the selection, ensuring privacy and anonymity.
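The random numbering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Prime III's actual code: the function names, the candidate labels, and the use of a plain shuffle are all assumptions made for the example.

```python
import random


def assign_spoken_numbers(options, rng=None):
    """Pair each ballot option with a randomly chosen number from 1..N.

    Hypothetical helper: the shuffle stands in for whatever random
    assignment a real voting system would use.
    """
    rng = rng or random.Random()
    numbers = list(range(1, len(options) + 1))
    rng.shuffle(numbers)
    # Map the number the voter will hear and speak to the option it stands for.
    return dict(zip(numbers, options))


def resolve_choice(prompt_map, spoken_number):
    """Return the option for a recognized number, or None if it is not on the ballot."""
    return prompt_map.get(spoken_number)


# Placeholder ballot; the labels are invented for illustration.
candidates = ["Candidate A", "Candidate B", "Candidate C"]
prompt_map = assign_spoken_numbers(candidates, random.Random(7))

# These prompts would be played privately through the headset.
for number in sorted(prompt_map):
    print(f"Say {number} for {prompt_map[number]}")
```

Because the pairing is reshuffled for each voter in this sketch, a bystander who hears "two" cannot tell which candidate was chosen.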

Many Auburn students and faculty members who had the chance to try out Prime III during mock elections in late October expressed an interest in having access to the technology in actual elections. That has yet to happen, but Gilbert is doing all he can to make sure the system gets the right exposure. In July 2008, he even testified before Congress about the Bipartisan Electronic Voting Reform Act of 2008. He stated at the time that Prime III would improve election accessibility, security, and accuracy, even if the mock election itself got the outcome wrong. (Had the mock election been any indication, John McCain would have won the presidential election with more than 52 percent of the vote compared to Barack Obama’s 38 percent.)

Gilbert recently accepted a teaching position at Clemson University in South Carolina, and will be bringing the HCCL with him. The HCCL does research in spoken language systems, usability, user interfaces, advanced databases, data mining, and advanced distributed learning systems, all with the goal of creating innovative solutions to real-world problems. His motto throughout these endeavors: “If you could build a system that resulted in world peace, but no one could use it, it would be useless. Usability matters.”

In addition to his scholarly efforts, Gilbert has been active in the speech community as a member of the Applied Voice Input/Output Society (AVIOS) and the VoiceXML Forum, and has obtained $9 million in research funding.

Living up to a Set of Standards

Loquendo's Paolo Baggia

Text-to-speech systems have not always had the best track records when it comes to reading documents out loud. Systems have generated some laughable mispronunciations when they came across words they didn’t recognize. And it has been the same for speech recognition systems, which have generated some equally comical spellings for complex words or acronyms spoken by a user.

To address this, speech systems include a dictionary of words whose pronunciations aren’t predictable from their spellings. But these dictionaries aren’t standard from one vendor to the next, so compiling them has often been an arduous task.

For Paolo Baggia, director of international standards at Loquendo, an easier way had to exist. And so he set out, under the aegis of the World Wide Web Consortium (W3C), to author the Pronunciation Lexicon Specification (PLS) 1.0, a framework for specifying word pronunciations used by speech systems in any language via phonetics and transliteration.
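To give a sense of what such a lexicon looks like, here is a small, hedged sketch: a two-entry PLS document and a few lines of Python that read it into a lookup table. The sample entries follow the style of the examples in the W3C specification; the `load_lexicon` helper and the shape of the table it returns are assumptions made for this illustration, not part of the standard or of any vendor's product.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C Pronunciation Lexicon Specification 1.0.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# A tiny sample lexicon: one entry with an IPA pronunciation shared by two
# spellings, and one entry that expands an acronym through an alias.
LEXICON_XML = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>judgment</grapheme>
    <grapheme>judgement</grapheme>
    <phoneme>ˈdʒʌdʒ.mənt</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""


def load_lexicon(xml_text):
    """Hypothetical helper: map each written form to its phonemes and aliases."""
    root = ET.fromstring(xml_text)
    ns = {"pls": PLS_NS}
    entries = {}
    for lexeme in root.findall("pls:lexeme", ns):
        phonemes = [p.text for p in lexeme.findall("pls:phoneme", ns)]
        aliases = [a.text for a in lexeme.findall("pls:alias", ns)]
        # Every spelling listed in the lexeme shares the same pronunciation data.
        for grapheme in lexeme.findall("pls:grapheme", ns):
            entries[grapheme.text] = {"phonemes": phonemes, "aliases": aliases}
    return entries


lexicon = load_lexicon(LEXICON_XML)
print(lexicon["judgement"]["phonemes"])
print(lexicon["W3C"]["aliases"])
```

Because the format is the same regardless of vendor, a lexicon like this one can in principle be handed to any text-to-speech or speech recognition engine that supports the specification.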

Baggia also co-authored VoiceXML, Call Control XML, Extensible Multimodal Annotation, Semantic Interpretation for Speech Recognition, Speech Synthesis Markup Language, Media Resource Control Protocol, and Speech Recognition Grammar Specification, and worked tirelessly on the W3C Voice Browser Working Group’s Speech Interface Framework.

Formerly, Baggia led VoiceXML browser development for Loquendo’s VoxNauta platform. Prior to joining Loquendo in 2001, he took part in a speech application that led to the automation of the Trenitalia railway company’s call centers. In 1989, he joined CSELT, Telecom Italia’s R&D lab, where he conducted research into natural language, speech parsing, spoken dialogue design, and language modeling.

Baggia is also a member of the W3C’s Voice Browser and Multimodal Interface working groups, the board of directors of the VoiceXML Forum, and the SALT Forum. He also teaches graduate-level courses in human language technology and interfaces at Italy’s University of Trento.

Paolo Coppo, vice president of marketing and business development at Loquendo, calls many of Baggia’s research endeavors “pioneering...in realizing the true potential of speech interaction.”

“The current state of speech would today be substantially poorer without the key role played by Paolo Baggia,” he adds.

Leading a Conversion

SpinVox's Daniel Doulton

Voicemail-to-text service provider SpinVox’s past 12 months have seen a steady drumbeat of global carrier partnership announcements and expansion. More than just public relations huff-and-puffery though, this year SpinVox has gained access to nearly the entire Australian mobile market with partners Telstra (the once government-owned state monopoly) and Optus (Telstra’s largest competitor); inked deals with North America’s Cincinnati Bell, Telus, and Bell Canada; and forged partnerships with 13 Latin American regional Movistar carriers.

The company’s success this past year, and during its entire six-year run, has been due in no small part to its co-founder and chief strategy and marketing officer, Daniel Doulton. Prior to his work with SpinVox, Doulton built a strong background in business. He worked as a consultant for Arthur D. Little in London and Madrid, worked in foreign exchange options trading at Citibank, and did a stint in product development and international marketing at Psion.

As for SpinVox, company lore has it that in 2003 SpinVox’s other co-founder, Christina Domecq, received 14 voicemail messages in one morning and asked Doulton why she couldn’t get them as text messages. Doulton was sure that a technological solution existed and spent hours looking for such a service. When he couldn’t find one, the wheels began to turn, and he developed a systems approach that would become the first SpinVox Message Conversion System.

Since then, the company has tapped talent from Cambridge University to build a proprietary automated recognition engine that now handles the company’s traffic with some human quality checking. Under Doulton’s direction, the system has grown to support more than 30 million users—with a 600 percent increase in just the past 12 months. This past February, in fact, SpinVox processed its 100-millionth message conversion.

Through that rapid expansion and a number of high-profile talks, Doulton has been evangelizing the use of speech recognition technology for mobile transcription. 

Historically, speech has had trouble gaining traction in the consumer market, relegated primarily to business segments. Doulton and SpinVox are changing that, making inroads into the wide space of consumer spending and building consumer appreciation of the benefits of speech technology.

Giving Speech to One and All

Ifbyphone's Irv Shapiro

The interactive voice response (IVR) space has long been the cash cow of speech. Because automated systems cost less than live agents, contact centers have been implementing increasingly sophisticated speech-enabled systems for years, but, for the most part, advanced deployments have been confined to midmarket and enterprise-level firms. The up-front capital investment has been too high to make advanced technology an appetizing pitch to smaller businesses.

Through lower-cost, hosted, cloud-based telephony solutions, however, Irv Shapiro and his company, Ifbyphone, have forged a way for IVR and speech into untapped spaces. Ifbyphone’s self-service plans cost less than $50 a month, with usage charges of pennies per minute. For those prices, users have access to voice-enabled IVR, speech-to-text-based virtual agents, call routing, voice broadcasting, and more.

Low-cost plans have netted the company some 14,000 subscribers—almost 90 percent of whom are getting their first exposure to IVR through Ifbyphone.

The company’s proposition is the brainchild of Shapiro, its founder, CEO, and chief technology officer.

Shapiro began his career as a programmer at Digital Equipment, working his way up the internal ladder to U.S. area data communications expertise center manager. He left the company to found Metamor Technologies, a systems integration firm that quickly grew to 500 employees and $32 million in revenue. After selling Metamor to Corestaff for $38 million, he founded Edventions, an educational technology firm that he also sold. From there, he was on to voice and phone mashups with the founding of Ifbyphone.

Shapiro designed the architecture for Ifbyphone’s proprietary automated VoiceXML generation technology, which combines Asterisk Open Source and carrier-grade VoiceXML/CCXML solutions, and participated in the alpha coding. He also charted the company’s unorthodox bid to attract smaller business clients and telephone providers.

While its services are not limited to small and midsize businesses, Ifbyphone’s proposition makes it an early innovator within the growing trend toward hosted and networked speech solutions. With leaders like Shapiro at the helm, we might expect an increasing democratization of the use of advanced speech solutions.
