The Voice Web
Telephones have been like turtles watching the Internet rabbit run. Although telephony infrastructure is changing rapidly, the way the telephone interacts with the user - and what the user does with the telephone - has barely changed for decades. The growth in wireless phones has made telephone service available almost anywhere, but the telephone is still used mostly for contacting specific phone numbers to talk to a person. When the telephone is used for contacting automated systems, the touch-tone interface is notoriously inconvenient and frustrating.
What you can do on the Voice Web
Communications management and personal assistants: Communications management usually includes dialing by name using a personal directory. Personal-assistant functionality includes call screening, taking and accessing voice messages, and one-number access to the subscriber (scanning several subscriber numbers based on subscriber instructions). Other personalized features include maintaining a schedule and delivering reminders. Unified messaging includes features such as reviewing e-mail or fax headers by phone using text-to-speech. Since subscribers will make calls through their personal assistants, the voice portal can potentially get additional revenues from providing bundled local and/or long-distance service.
Enterprise applications, such as voice-activated auto attendants that direct calls by name, can form a corporate voice portal. Corporate voice portals can also provide such services as reservations for a conference, location of a local store outlet or a connection to customer service.
General information includes weather, sports scores, horoscopes, general news, financial news, stock quotes, traffic conditions and driving directions. Such information is intended to make a voice-enabled service part of a subscriber's daily habit. Information can be customized, using, for example, the user's personal stock portfolio or the user's current location. As voice portals evolve, the caller will be able to "voicemark" specialized voice-equipped Web sites.
V-commerce supports a variety of transactions that can result in product or service sales. These include transactions similar to ordering from a Web site or telephone catalog service. They also include finding a business by saying its trade name or its category.
Entertainment is part of e-commerce, and it will be part of the Voice Web. For example, the caller can use speech recognition to choose audio channels to listen to.
Telephone speech recognition creates a voice user interface (VUI). It changes the telephone in the same way that the graphical user interface changed PCs - an average user can now handle complex applications efficiently. The improved functionality makes any telephone into a personal assistant - an information appliance. In another key advantage, the VUI makes a wireless phone safer to use in a vehicle by providing a hands-free option. Wireless application protocol is an alternative approach to upgrading the telephone, but it has shortcomings. (See sidebar, "What about WAP?") The VUI opens up to telephony the resources of the Internet as well as the advantages of the Internet model. From any telephone, one can obtain information and services from sites equipped with speech recognition; together, these sites create the Voice Web.
THE VOICE WEB
Sites that support speech recognition constitute the Voice Web. Most sites have individual phone numbers (typically toll-free), and the caller must remember these numbers or store them in the phone. Some sites are attempting to aggregate information, so that the caller can call one number to get multiple services. A site that is phoned for services and that contains speech recognition software is a "voice portal," the full implications of which this article will address in the next section.
Note that the "voice browser" application software (analogous to a Web browser) is in the voice portal, unlike the World Wide Web, where the browser is in the user's device (usually a PC). A voice portal thus has more control over the user experience than a Web portal. Having the browser in a remote server simplifies users' lives: they need not acquire or maintain browser software or run it on their phones. No one has to upgrade a phone because it runs too slowly!
Currently, most services are implemented locally at the voice portal. While a user can "surf" the services available at the portal site, most voice portals do not yet support going to other sites. The ability to access other sites, however, will become more common as (1) protocols for switching a call to another site, but returning it to the portal, are perfected; and (2) a new standard, VoiceXML (analogous to the visual Web standard HTML), is fully developed and released by the World Wide Web Consortium (see www.voicexml.org). VoiceXML will allow Web sites to support telephone access without having to support speech recognition engines; instead, voice portals will download "pages" of VoiceXML code, and voice browser software at the portal will manage the ensuing dialog.
The Voice Web is the confluence of a number of trends:
- The telephony voice user interface: The most fundamental enabling factor is the maturing of telephone speech recognition (and, to a lesser degree, text-to-speech synthesis). With many complex, high-usage applications deployed, some for years, the feasibility of the technology is no longer a question.
- A drop in telephony costs and changes in billing practices: The cost of a telephone call is dropping and there is a strong trend toward charging a single monthly rate. Eventually, the cost of using the telephone will be equivalent to the cost of using the Internet.
- The availability of Internet databases: The Internet has caused companies to create and centralize data that can now be accessed automatically over a telephone. The existing data infrastructures can support telephone applications with minimal incremental investment.
- The acceptance of over-the-telephone and over-the-Internet purchasing: Companies with existing telephone sales or telephone customer service need to reduce costs. Pure e-business companies need an automated telephone service to improve profitability by broadening their customer bases and automating telephone-based customer service.
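The division of labor that VoiceXML makes possible, in which a site supplies only a dialog "page" and the voice browser at the portal interprets it, can be illustrated with a small sketch. The following Python is not a conformant VoiceXML interpreter: the page is a simplified fragment loosely modeled on VoiceXML, the function name `run_dialog` is hypothetical, and speech recognition is simulated by a list of already-recognized utterances.

```python
import xml.etree.ElementTree as ET

# Hypothetical dialog "page" that a site might serve to a voice portal.
# Simplified and only loosely modeled on VoiceXML (an assumption).
PAGE = """
<vxml>
  <form id="weather">
    <field name="city">
      <prompt>Which city?</prompt>
      <option>boston</option>
      <option>chicago</option>
    </field>
    <field name="day">
      <prompt>Today or tomorrow?</prompt>
      <option>today</option>
      <option>tomorrow</option>
    </field>
  </form>
</vxml>
"""

def run_dialog(page_xml, utterances):
    """Visit each field in order, 'play' its prompt, and accept the next
    utterance only if it matches one of that field's options."""
    root = ET.fromstring(page_xml.strip())
    heard = iter(utterances)          # stands in for the recognition engine
    transcript, results = [], {}
    for field in root.iter("field"):
        prompt = field.findtext("prompt").strip()
        options = {o.text.strip().lower() for o in field.findall("option")}
        while True:
            transcript.append(prompt)
            said = next(heard, None)
            if said is None:          # no more input: treat as a hang-up
                return results, transcript
            if said.lower() in options:
                results[field.get("name")] = said.lower()
                break
            transcript.append("Sorry, I did not understand.")
    return results, transcript

results, transcript = run_dialog(PAGE, ["Boston", "yesterday", "tomorrow"])
print(results)
for line in transcript:
    print("PORTAL:", line)
```

In this sketch the site contributes only the text in `PAGE`; everything that touches the caller (prompting, matching, re-prompting after an unrecognized answer) runs at the portal, which is the point the article makes about sites not needing telephone lines or a recognition engine of their own.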
What about WAP?
WAP, or wireless application protocol, and similar protocols are designed to allow Web access, largely through text on small screens of compatible wireless phones, handheld computers, or Personal Digital Assistants. It is one way of implementing the "Wireless Web."
Companies are adding support of these protocols to enable access to the information on conventional Web sites. Most of the early reviews of such services have been skeptical, with reviewers criticizing the small screens, difficulty of entering text and limited content available. Future generations of wireless devices will have more bandwidth and perhaps better ways of choosing options or entering information, but that hope itself may cause users to wait before they adopt the technology. Adoption of specialized phones has been more rapid outside the U.S.
Current implementations of Web phones do not give the same experience as a Web browser, creating frustration. The experience is much more like that of a touch-tone Interactive Voice Response system, with short menus, push-button navigation, complex layering in order to keep choices to a minimum and difficulty in entering information. These problems have led to the growing realization that telephone speech recognition is a powerful alternative - or supplement - to text-based wireless solutions. Speech recognition works from any wireless phone, no matter how small. In fact, it works from any telephone, so a caller has access to exactly the same user interface from a home phone, business phone, or hotel telephone. And no one has to be persuaded to buy a new device.
As noted, a voice portal is a site that can be reached by telephone and supports speech recognition processing. "Voice portal" evokes an association with "Web portal," which usually implies a popular site to start navigating the Web, such as Yahoo or AOL. A voice portal with ambitions to be the single place a user calls can be conveniently designated a "primary voice portal." This designation distinguishes it from a speech-enabled site on the Voice Web with more specialized objectives (e.g., a brokerage company providing stock quotes and trading).
Not all sites that are speech-enabled are voice portals. A site may support speech recognition by having VoiceXML code that defines a dialog enabling access to that site's information or services. The VoiceXML protocol makes it unnecessary for that site to have either telephone lines or a speech recognition engine. The VoiceXML code can be downloaded and run on a voice portal located elsewhere and run by a different company if the voice browser software at that voice portal contains a VoiceXML interpreter.
There are three key types of service that a voice portal can supply: (1) information; (2) e-commerce transactions; and (3) telecommunications and personal-data management. A sidebar on "What you can do on the Voice Web" provides examples of these services. There are a number of primary voice portals available nationwide. Other companies are helping to create the applications and often hosting them as ASPs, or application service providers (see sidebar on voice portal companies). Telephone service providers and today's Web portals will enter the picture soon. Voice portals may be free calls, may require a monthly subscription fee, may bundle in telephone services, and/or may charge for premium services.
Other opportunities for revenue include selling keywords (e.g., a trademark that directs the user to information on the product), preferred placement in prompts, audio ads, sponsorship of a feature and commissions for calls transferred or sales completed during the call.
Telephone service providers that offer voice portals can save money by more easily retaining subscribers who invest time in learning the portal and customizing it. The cost of acquiring a new subscriber is typically $500, so reduced "churn" can rapidly pay for the speech recognition software, applications and additional hardware.
There are several companies supplying the core speech recognition technologies used in speech-enabled sites; these include Conversa, IBM Voice Systems, InfoTalk, Locus Dialogue, Lernout & Hauspie, Lucent Technologies, Natural Speech Communications, Nuance Communications, Philips Speech Systems, SpeechWorks International, TEMIC and Vocalis.
CONCLUSION
The Voice Web is very early in its evolution. It is likely, however, to grow more quickly in users than the World Wide Web did in its early stages. As soon as a voice-enabled site is created, every one of hundreds of millions of conventional telephones is enabled to use that application. Further, organizations have already created services and databases for the World Wide Web that can be used by a Voice Web service. There will certainly be some struggles to discover the best services and appropriate business models, but the solid fundamentals underlying the Voice Web will motivate its rapid acceptance.
The language of the Voice Web
Telephone speech recognition: A caller, using any form of telephone device, calls a location where a system interprets the caller's speech, performs appropriate actions and responds to that speech, often with a recorded prompt. This distinguishes it from PC speech recognition, where the microphone is attached directly to the PC.
Speech-enabled sites: Locations that can be called and that support telephone speech recognition.
Voice User Interface: The primary means of dealing with the system is by speaking (and the most common feedback is also voice).
Voice Web: The Voice Web is an analogy to the "visual" Web. The term summarizes the potential of using speech recognition to make any telephone usable for functions we now associate with the Internet, in ways that are analogous both in terms of utility to users and economies to providers.
Voice portal: In the broadest sense, a location that can be called and that offers a variety of speech recognition services. A primary voice portal seeks to be the first (and perhaps only) number called. A corporate voice portal, in some cases reached through a primary voice portal, can consolidate a number of call centers and/or provide voice access to the corporate Web home page.
Voice ASP: An application service provider that hosts telephone speech recognition services.
Voice tone: Instead of hearing a dial tone when the phone is activated, the user hears an indication that he or she can simply say what he or she wants.
Natural language processing: A speech application where the user has substantial flexibility in responding within the context of the application, as opposed to simply responding to a list of alternatives.
Dialog: In the context of the Voice Web, an automated conversation using telephone speech recognition. Dialog is the means by which a system or caller's request is clarified, confirmed or acknowledged.
VoiceXML: A standard being readied by the World Wide Web Consortium for telephone- and speech-recognition-based services. A program written in this language is downloaded to a voice portal and interpreted.
Text-to-speech: Software in the called location that generates synthetic speech from text sources, such as e-mail, news reports or a database, when recorded speech is not available or economical.
Speaker verification: A supplemental technology that can validate a claimed identity from the characteristics of the caller's voice. The claim is often made through a spoken account number, which is recognized for the content and then analyzed for validation.
Source: TMA Associates
William Meisel (mailto:firstname.lastname@example.org) is president of TMA Associates (www.tmaa.com), a speech industry consulting firm, and publisher and editor of Speech Recognition Update newsletter. He holds an annual conference on the Voice Web, the Telephony Voice User Interface Conference. Meisel holds a Ph.D. in Electrical Engineering and ran a speech recognition company for ten years.