Speech On The Web: Speech Combines Telephone and Internet

Lucent Technologies is driving the convergence of the Internet and telephone networks to make accessing a business's services a better experience for customers through a choice of natural interfaces. The Natural Information Interface, demonstrated at the Council on Competitiveness' National Innovation Summit at MIT in March, combines advanced speech technologies with flexible Web and phone interfaces to create a true convergence of telephone and Internet applications, making it possible for users to converse naturally with an automated system.

The demonstration - developed by Bell Laboratories - includes a sample application that provides access to a variety of financial information and transaction services through natural speech, numbers keyed in on a Touch-Tone telephone, or a Web browser. It illustrates how businesses can extend the reach of the Internet to customers using a range of different devices.

Of the many elusive capabilities brought within reach by the merging of communications and computing, one of the most appealing and useful is natural interaction between people and the machines that serve them. Even today, the interface to a typical computer application or network-based information service is much closer to the machine than it is to the human. People are used to going more than half-way to meet the machine, or they avoid such applications and services altogether. Any advance toward enabling the machine to adapt to the human moves us closer to the real promise of high technology, which is to become as transparent as it is pervasive.

The Natural Information Interface demonstration offers a taste of what is possible, showcasing a small set of Bell Labs innovations in speech, software, and networking technologies.

Researchers have long sought to develop interfaces exploiting the most natural means of communication, and recent advances in the science and technology of speech - coupled to exponential leaps in processing power - are bringing the effort to fruition. A personal agent, a natural language call router, and a banking application demonstrate the power of an interface based on natural speech. The banking application also illustrates the ease of creating a Web browser interface and a telephone interface to the same application.

Beyond the example of personal banking, it is easy to imagine many commercial, educational, civic, and social applications that can be served using these interfaces. In any such application, an important new benefit is that the user chooses the method of interaction, whether it is a wired or wireless phone, a computer, personal digital assistant or other device.

The natural language user interface grew out of decades of innovative, multidisciplinary research. Physicists, electrical engineers, computer scientists, computational linguists, statisticians, medical researchers, and behavioral scientists all contributed to the project.

Recent achievements in speech technology have made it possible to use the telephone to search the web. They include:

  • Natural language and interactive dialogue processing
  • Speaker-independent speech recognition
  • Speaker authentication
  • Multilingual text-to-speech synthesis
  • Smart barge-in
  • Keyword and phrase spotting

Suiting the interface to the individual also means enabling access to any particular service by a variety of means. One communication device does not fit any individual's need in every situation. For example, a Web-based information service is even more valuable to a user who also has the option of calling into it by phone and interacting by voice. A software platform for device-independent service creation that supports multiple modes of access is the focus of a research project called Tardis.

In the Natural Information Interface, we use a simple banking application, AnyTime Teller, to illustrate this capability.

Communications middleware and domain-specific programming languages are the key technologies employed in the Tardis platform. One of its major components is TelePortal, an access architecture and middleware enabling telephone access to Web services. The other two components are programming languages called Mawl and PML, designed to address service creation and the user interface, respectively. Mawl, used to program form-based Web services, expresses the abstract logic of a service, independent of the access mode. PML, or Phone Markup Language, is a dialect of HTML specialized to describe content for interpretation over a telephone. Together TelePortal, Mawl, and PML offer new flexibility to providers as well as users of information and transaction services.

Today, a service provider wishing to give users access to a Web service by any means other than a computer and a browser would have to develop and maintain a new, separate software program. The technology in the Natural Information Interface permits one-time creation of a Web service that can be accessible, with minor modification, by telephone or other means as well as by computer.
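
The separation described here, one body of service logic with a different presentation for each access mode, can be sketched in a few lines of Python. This is an illustration only, not Mawl or PML syntax: the names BalanceView, balance_service, render_html and render_phone, the account number and the balance are all hypothetical, and the "phone" output is simply text that a text-to-speech engine might read aloud.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class BalanceView:
    """Access-independent result of one service step (hypothetical)."""
    account: str
    balance_cents: int


def balance_service(accounts: dict[str, int], account: str) -> BalanceView:
    """Service logic: knows nothing about browsers or telephones."""
    return BalanceView(account=account, balance_cents=accounts[account])


def render_html(view: BalanceView) -> str:
    """Presentation for a Web browser."""
    dollars = view.balance_cents / 100
    return (f"<p>Balance for account {escape(view.account)}: "
            f"<b>${dollars:,.2f}</b></p>")


def render_phone(view: BalanceView) -> str:
    """Presentation for a telephone: plain text for text-to-speech."""
    dollars, cents = divmod(view.balance_cents, 100)
    spoken_account = " ".join(view.account)  # read the digits one at a time
    return (f"The balance for account {spoken_account} is "
            f"{dollars} dollars and {cents} cents.")


# Made-up sample data: account "4021" holds 152375 cents.
accounts = {"4021": 152375}
view = balance_service(accounts, "4021")
print(render_html(view))
print(render_phone(view))
```

Adding another access mode in this sketch means writing one more render function; balance_service itself does not change, which is the property the article attributes to the platform.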

Several applications within and outside Lucent provide "click-to-dial" call control from within a Web service. Mawl can take advantage of back-end and browser technology in the same way as any other Web application. The idea of extending HTML to include phone-specific markup in the style of PML has been proposed. TelePortal supports these uses and also provides a convenient way to develop enhancements to these basic capabilities. Combining Mawl and TelePortal provides enhanced service programming possibilities not found in existing products.

Commercial services require monitoring, modification, and other operational management. Services important to business must reliably provide operations, administration, and maintenance features such as logging, performance data collection, error reporting, online administration and data update, and software upgrades of running services. Because Mawl is built on an infrastructure that maintains the complete service state and its applications are systematically generated from high-level service specifications, we have been able to use Mawl to automate these capabilities significantly.
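
As a rough analogy for how such features can be added systematically rather than by hand, the Python sketch below wraps each service step so that call counts, timing, and error reports (including the session state) are collected automatically. It is not how Mawl is implemented. The managed decorator, the deposit step, and the session dictionary are hypothetical names invented for this illustration.

```python
import logging
import time
from collections import defaultdict
from functools import wraps

logger = logging.getLogger("service")
call_counts = defaultdict(int)      # how often each step has run
total_seconds = defaultdict(float)  # cumulative time spent in each step


def managed(step):
    """Wrap a service step with logging, timing, and error reporting."""
    @wraps(step)
    def wrapper(state, *args, **kwargs):
        start = time.perf_counter()
        try:
            return step(state, *args, **kwargs)
        except Exception:
            # The wrapper sees the full session state, so it can report it.
            logger.exception("step %s failed; state=%r", step.__name__, state)
            raise
        finally:
            call_counts[step.__name__] += 1
            total_seconds[step.__name__] += time.perf_counter() - start
    return wrapper


@managed
def deposit(state, amount_cents):
    """A hypothetical service step operating on explicit session state."""
    if amount_cents <= 0:
        raise ValueError("deposit must be positive")
    state["balance_cents"] += amount_cents
    return state["balance_cents"]


session = {"balance_cents": 1000}
deposit(session, 500)
```

Because every step passes through the same wrapper, the author of deposit writes no logging or measurement code, which loosely mirrors the automation the article describes.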

Future Lucent products using the technology underlying the Natural Information Interface - and the services that they enable - will move us closer to the goal of the machine adapting to the human, providing a truly transparent human-machine interface.

For more information, visit the Bell Labs web site at http://www.bell-labs.com/ConC/.

Free Speech

An Interview with Esther Dyson, Internet Guru

Esther Dyson's recent book, Release 2.0: A Design for Living in the Digital Age, examines implications arising from the age of the Internet: business implications, security and privacy matters, intellectual property rights, education issues and much more. Her spartan, clear style and refreshingly simple approach to breaking down complex problems make Release 2.0 a great read.

Interest in the Internet and a high-tech background are not required to enjoy the book, which, while addressing the main topics above, also manages to include philosophy of life, etiquette, morality and gentle humor.

Question: Your book highlights the leveling quality of the Internet. The "Net changes the balance of power among companies ... between employers and employees ... between merchants and customers." Some individuals argue that without speech technology, the Internet shuts out those individuals who a) do not have access to hardware or higher education b) are overwhelmed by new technology or are from the "wrong" culture c) are unable to speak, type or read either through disability or lack of literacy. To what extent do you agree or disagree with this view and why?

Answer: It's true that the Internet mostly shuts out such people, but that is not the "fault" of the Internet. Overall, the Net is a tool for upward mobility, something for people to use rather than for the rich to have. Obviously, speech recognition would help more people have convenient access to the Internet, but it does not eliminate the need for education. Lack of education is the greatest impediment to success in general, and education is a valuable asset for using the Net.

It can also be a tool in helping people to get education, in teaching tools that combine speech and reading and writing.

Question: You predict "companies will need to have a real personality online." How does speech technology offer companies a way to achieve personality online?

Answer: Speech technology is a useful way to communicate through the Net, both into (speech recognition) and out-from (text-to-speech or translation). But I must say that it is like a teller terminal; it makes your money more accessible, but in smaller amounts. Much of the power of the Net depends on the ability to manipulate large amounts of text, and it is quicker to read or type than to speak lucidly. Nonetheless, speech technology, like ATMs, is likely to help greatly in bringing some of the value of the Net to a larger public. Many will move on to other interfaces.

Question: Speech recognition technology might be able to be used by developers with only superficial knowledge of code to create new products. How might you see this affecting the software industry overall?

Answer: I don't see this happening in any serious way. It will allow a much broader range of people to create content, however. In that way, it will contribute to the diversity of voice the Net will foster.

Question: Speech would seem to be a natural interface for the Internet. But voice packets are not easy to transfer. Could the Net become too crowded? Is there a point at which the material being moved on the Net actually slows it down so much it is no longer a viable business tool?

Answer: No, I think that the bandwidth of the Net will expand, albeit in fits and starts, to meet the demand. Moreover, the price will go down as technology improves and use increases. I don't think you'll have one mode replacing another; they will all be competing for people's time.

The problem with a voice-based chat (although, of course the original chat was voice) is that you can't have everyone speaking at the same time, whereas that more or less works with typed "chat." So I think you'll end up with a variety of formats.

Question: Then, would not speech be the next logical step for the Internet to take?

Answer: Chat rooms have always been modeled on discussion groups.

Question: There are some opposing forces developing around the Internet. Among the most discussed is the opposition between privacy/security vs. freedom of access and anonymity. Voice identification and other biometrics are starting to play a role in this. They are being used to protect web pages and data. Ironically, because they identify and/or verify a person, they have been described as both the best form of security and privacy protection and the greatest danger to privacy on the Internet. Are biometrics a value or a threat to free discussion on the Internet?

Answer: I think that like most such things, they can be used or mis-used. They are valuable in establishing identity, but I think anonymity should be permitted on the Net. That is, it's a valuable tool for people who want to establish their identity (or for people who want to keep intruders out), but I think anonymous communications should be legal.

Obviously, if I want a loan, I have the obligation to identify myself, but if I just want to say something, I should be able to do so anonymously (outside communities that require membership). I support communities that have rules for their members. On the other hand, I do not think governments should outlaw anonymous communications.

Question: How important is language translation to the facilitation of the global economy? How would a machine that could translate a spoken message simultaneously into several different languages and send it out across the world impact a multi-national corporation - or a small local enterprise?

Answer: Translation will be very important, and it will facilitate commerce immensely (for order forms, product specs and the like). However, it will take a long time before automatic translation can handle casual conversations or literary prose. I think there will be a huge market for human translators using translation tools.

Question: In chapter 1 you discuss the fun you had during the early days of the PC industry. "The industry flourished away from the spotlight, away from government interference, away from social responsibility. Personal computers were still largely novelties for hobbyists; serious mainframe computer folk considered PCs basically toys ... the PC industry remained a haven for freewheeling, free-market thinking." Subsequently, as the PC industry stabilized, you moved on to following the software industry, and again most recently turned your attention to the Internet. Is it possible that speech technology, which is in a similar "startup - about-to-explode" phase, will attract your attention next - or more than it has until now?

Answer: To be honest, I spent a fair amount of time writing about translation and text-analysis tools in the past, and I think they are an important part of the market. But I think they are a part of the broader world of the Net and human-to-machine-to-human communications rather than a separate market in themselves. They are going to be so ubiquitous that we don't even notice them.

Esther Dyson is chair of EDventure Holdings in Manhattan. As leader of the elite emerging market/emerging technology conferences PC Forum and High-Tech Forum in Europe, and publisher of the Releast and Release 1.0 newsletters, Ms. Dyson has become a key figure of the high-tech age. Her new book, Release 2.0: A Design for Living in the Digital Age, is published by Broadway Books in the United States. Visit the Website at www.Release2-0.com.
