A View from the Voice Search Conference
The Voice Search Conference, held in San Diego in March, was a significant success as measured by both presentation quality and attendance, particularly in these troubling economic times. Voice search, as a topic, covers voice queries of varying complexity for Web or corporate database access, and also audio indexing to find content in audio media. The main themes of the conference were how mobile devices can leverage voice search and how call center agents can benefit from using it.
The Applied Voice Input/Output Society (AVIOS) used the show to announce the winners of its annual Student Voice Application Contest. Antonio Rico-Sulayes of Georgetown University was judged the overall winner for his robust and easy-to-use clinic appointment manager. With this application, clinics can schedule, confirm, and cancel appointments, review personal data, and get general information. Other student applications included an interactive voice response system that allows the disabled to vote, a voice browser with spoken URL links, and a service for leaving voice messages that are emailed to the target destination. Prizes included a cash award, provided by vlingo, that covered conference attendance; a laptop computer from Opera loaded with the latest version of its browser; and Visual Studio software from Microsoft.
The keynote address by Marc Davis, chief scientist at Yahoo!, described a visionary road map of voice search and how 3G devices will lead the way for many new applications on the mobile network. Since search often deals with unstructured speech, pragmatics helps identify the context. The 4 Ws (what, when, where, and who) map into a model that makes it possible to predict the content of a query, support searches broader than those on the Web, and even make inquiries using a network of people around the world.
A panel discussion on intellectual property gave insight into the U.S. patent process. Patents are highly valued during corporate acquisitions and cross-licensing, though less valuable in lawsuits from companies that do not practice any technology. It currently takes four to five years, which is extremely long for computer technology, for the backlogged patent office to process patent requests. New congressional regulations are encouraging accelerated examination, provided the patent filer identifies prior art from at least a year before filing. However, this trades issuance time against cost and potential coverage. (Watch for a feature on patents in the next issue of Speech Technology magazine.)
A demo track included engaging presentations. Loquendo, for example, showed a new text-to-speech (TTS) tool kit supporting prosodic and emotional aspects of human dialogue, and demonstrated how emotion makes TTS conversations sound more natural. Convergys demonstrated speaker verification using the multimodality of a 3G phone for multifactor authorization in a healthcare claims inquiry; the capability is soon to be available on demand for on-premises and hosted environments. Novauris demonstrated that multiple grammars with similar base phrases were very successful in voice search of the frequently asked questions portions of a Web site. Novauris contends that statistical language models don’t work well for long utterances, and that slot-filling procedures are more powerful than previously thought. Vlingo also noted that most unconstrained commands fall into a few reasonable patterns for phrasing a search. SpeechCycle used multimodality to display alternative interpretations of a voice query based on semantic rankings. Microsoft also leveraged multimodality by using a voice palette to list the n-best interpretations. Melodis expanded voice search for song titles and artists by letting users sing or (nonverbally) hum the song to drive the search.
A wrap-up panel agreed that the iPhone and 3G telephones will be a driving force in new multimodal applications that will lean heavily on voice search capabilities. The 2G texters are likely to migrate to voice search applications on their phones. Improvements are still occurring in core speech recognition technology, with linguistic frameworks able to handle longer sentences. Feature-rich phones will change call center interactions, given that 70 percent of calls will soon be made from mobile phones. Innovations in multimodal applications will ease the migration to mobile voice search. Currently, a set of do’s and don’ts for multimodal design does not exist, but that will soon be remedied by more trial applications.
Matt Yuschik, Ph.D., is a human factors specialist at Convergys. He designed and evaluated multimodal applications for call center agents, as well as voice-activated voicemail. He is on the AVIOS board of directors and the Student Voice Application Contest Committee. He can be reached at email@example.com.