VoiceXML 2.1 has long been the standard for creating interactive voice response (IVR) applications, and indeed, VoiceXML systems continue to process thousands of phone calls every day. But this older technology is beginning to show cracks in the foundation. Some of the more common user concerns include (1) lack of graphics and video; (2) time spent listening to prompts; (3) difficulty in being understood, especially in noisy environments; (4) noise pollution; and (5) privacy and security.
Fortunately, new solutions tackle these problems. There’s multichannel IVR, which solves problem 1 by letting users download and view graphics on one channel while listening to audio or watching video on a second channel.
Then there’s visual IVR, which solves problems 2 through 5 by displaying verbal prompts as graphical user interface menus; users click or touch to select options, or type their responses. Smartphone users are no longer misunderstood, unless they mistype their responses. Many users can read prompts faster than they can listen to them, which speeds up dialogues. Because nothing is spoken aloud, visual IVR works in noisy environments, creates no noise pollution, and eases the privacy and security concerns of speaking aloud. But there’s one drawback: Visual user interfaces don’t work on non-smartphones.
Finally, we have interactive text response (ITR), which holds immense promise as an IVR update. It lets users interact by sending and receiving messages: each prompt is presented as text, and users respond by typing. Users can read (or skip) long prompts, avoid being misunderstood (unless they mistype a response), and avoid noise pollution and concerns about verbal privacy and security.
When implemented as a software agent in the cloud, ITR offers a number of advantages:
Familiarity. Almost all users send and receive messages to communicate with other people. Without needing to learn new skills, users can send and receive messages easily to interact with an application, a robot, or a virtual agent.
Versatility. ITR systems work with any device that can send and receive messages, including smartphones, dumb phones, game stations, and virtual reality headsets.
Convenience. Users may pause and resume sending and receiving messages at any point in the dialogue, letting them switch between applications or do real-world tasks.
Adaptability. A speech recognition engine can take verbal messages recorded by the user and convert them to text. A speech synthesis system converts text to speech, which is spoken to users in situations where their eyes and/or hands are busy (as when driving a car). Ideally, users should be able to switch between text and voice easily.
Reusability. ITR software agents can reuse the application logic found in VoiceXML applications. Developers will need to insert send and receive commands instead of invoking speech engines, as well as tweak the wording of some of the prompts.
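As a rough illustration of the reuse described above, the sketch below drives one form-style dialogue through interchangeable send and receive functions, so the same application logic could sit behind a text channel or a speech channel. Everything in it is hypothetical: the Field structure only loosely mirrors a VoiceXML form field, and run_form, send, and receive are names invented for this example, not part of VoiceXML or any ITR product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Field:
    """One slot to fill, loosely analogous to a VoiceXML form field (assumption)."""
    name: str
    prompt: str
    options: Optional[List[str]] = None  # None means any non-empty reply is accepted


def run_form(
    fields: List[Field],
    send: Callable[[str], None],   # hypothetical: delivers one message to the user
    receive: Callable[[], str],    # hypothetical: blocks until the user replies
    max_tries: int = 3,
) -> Dict[str, str]:
    """Ask each field's prompt in order and collect the validated replies."""
    answers: Dict[str, str] = {}
    for field in fields:
        for _ in range(max_tries):
            send(field.prompt)
            reply = receive().strip().lower()
            if reply and (field.options is None or reply in field.options):
                answers[field.name] = reply
                break
            # Typed input can still be wrong (a mistyped response), so reprompt.
            send("Sorry, I did not understand that.")
        else:
            raise ValueError(f"no valid answer for {field.name!r}")
    return answers
```

In a text deployment, send and receive would wrap whatever messaging service is in use; in a voice deployment, they would wrap the speech synthesis and recognition engines. The dialogue structure itself stays the same, and only the prompt wording may need tweaking.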
ITR systems also will need extensions to handle situations that do not occur within VoiceXML applications, including the following:
• Text abbreviations as typing shortcuts, such as the commonly used FYI, BTW, and LOL.
• Emoticons that convey users’ emotions.
• Arbitrary pauses and resumptions that allow users to switch to other applications or real-world activities.
• Interaction with other applications to enable users to copy and paste information from one to the other.
• Natural language processing (NLP) to help users deviate from structured dialogue to a more conversational dialogue. New techniques enable designers to train NLP systems with phrases actually entered by users.
• Rich data to convey additional information that supplements text, including pictures, graphics, and audio and video clips.
• Human-assisted dialogue where agents listen to user conversations and replace computer-generated responses with responses they choose. The human agent may even replace the system if the user becomes upset or has difficulty communicating with it.
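Several of these extensions amount to preprocessing a typed message before the dialogue logic sees it. The sketch below is one minimal way to expand abbreviations, pull out emoticons as mood signals, and flag a conversation for a human agent. The lookup tables, the function names normalize and needs_human, and the threshold of two negative emoticons are all assumptions made for illustration; a real system would need much larger tables and a better measure of user frustration.

```python
import re
from typing import List, Tuple

# Tiny illustrative tables (assumptions, not a standard vocabulary).
ABBREVIATIONS = {
    "fyi": "for your information",
    "btw": "by the way",
    "lol": "laughing out loud",
}
EMOTICONS = {":)": "positive", ":-)": "positive", ":(": "negative", ":-(": "negative"}


def normalize(message: str) -> Tuple[str, List[str]]:
    """Return the message with abbreviations expanded, plus detected moods."""
    words: List[str] = []
    moods: List[str] = []
    for token in message.split():
        if token in EMOTICONS:
            moods.append(EMOTICONS[token])  # keep the mood, drop the symbol
            continue
        # Split a word from any trailing punctuation, e.g. "BTW," -> "BTW" + ","
        match = re.fullmatch(r"(\w+)(\W*)", token)
        if match and match.group(1).lower() in ABBREVIATIONS:
            words.append(ABBREVIATIONS[match.group(1).lower()] + match.group(2))
        else:
            words.append(token)
    return " ".join(words), moods


def needs_human(moods: List[str], threshold: int = 2) -> bool:
    """Suggest a human takeover once enough negative emoticons accumulate."""
    return moods.count("negative") >= threshold
```

For example, normalize("BTW, my order is late :(") yields the text "by the way, my order is late" along with one negative mood, which the application could accumulate across turns before calling needs_human.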
How do enterprises update their VoiceXML systems? First, carefully analyze enterprise goals to determine whether the existing system needs maintenance, enhancement, or replacement. Work with the VoiceXML vendor to determine which new product offerings it supports.
Next, review new guidelines for developing voice applications that may also apply to text dialogues. Leverage developers’ years of experience and reuse IVR dialogue structure where possible. No sense in reinventing the wheel.
Finally, if a new conversational system is chosen, use the new techniques for training NLP systems with phrases entered by users.
James A. Larson, Ph.D., is an independent speech consultant who also teaches courses in speech interfaces for Portland State University and is co–program chair for SpeechTEK 2017.