July 1, 2002
Q & A

Sunil Soares, Program Director of Product Marketing, IBM Pervasive Computing

Q What is happening at IBM with regard to its Voice Systems business?
A We're very optimistic about our voice business. Customers have started to see how voice technology can improve their ROI and are deploying voice not only in the server/call center space, but also in embedded and desktop areas. Telcos are deploying voice-enabled enhanced services. We're starting to get interest in multimodal technology as well. We've recently announced a set of tools, as well as a plan to add multimodal capabilities to WebSphere Everyplace Server, an extension to the WebSphere platform for mobile applications. Based on the XHTML and Voice (X+V) mark-up language, the toolkit, available in the fall, will allow developers to rapidly turn voice and Web applications into multimodal applications. Adding these capabilities to our recently announced WebSphere Everyplace Access (WEA) can also give a wide variety of wireless devices multimodal access to back-end businesses.

Q What does it mean for your customers and the speech technology industry that IBM has moved Voice Systems to Pervasive Computing?
A We see voice as a key part of an enterprise's overall infrastructure - not merely as an add-on technology. Voice, and other related emerging interfaces, will be a key part of how we'll be interacting with our technology as we move into the next phase of e-business - which is why voice fits so neatly into pervasive computing. Pervasive computing is about accessing information anytime, at any place, and it's about using technology for working and living in new ways - like paying for drinks from a vending machine via your cell phone; having your fleet of trucks automatically notify you of their location and condition so that you know exactly where they are and what maintenance they require; asking your car for directions or having it tell you how to avoid a jam on 684N on a Friday night. We're not going to be accessing these technologies solely via our keyboards anymore - rather, we'll be using a combination of voice, stylus, pen, touchscreen and input methods that haven't even been invented yet. Because of that, voice was a natural fit for a group that is focused on how we can move the dial on the way we interact with technology, lowering technology barriers.

Q What is your outlook concerning speech technology?
A We're very optimistic about the technology. VoiceXML has gained a lot of momentum, especially in the past two years. According to analysts, there are more than 50,000 VoiceXML developers in the US - that's a large number. There are also very active forums, and some strong applications being written today. We're seeing voice apps being implemented in three main areas at the moment: call center/voice portals, directory assistance, and voice dialing on an enterprise level.

These days, being able to prove ROI is key. How much money can a new application save a company? With the three applications above, you can very quickly see the returns. Having a well-designed voice application in a call center frees up live agents for more complex calls and also leads to fewer abandoned calls. Using voice technology for directory assistance and having automated voice dialing in a company allows you to free up the operators and use resources more effectively.

The embedded market is also an area where we're seeing a lot more interest in voice. IBM works with Johnson Controls, which supplies systems to car manufacturers. Using IBM technology, Johnson Controls has built a truly integrated, speech-enabled, in-vehicle communications system. Think of the car as a large device - right now we're seeing embedded speech technologies within the car. But in the near future, the car will be a computer accessing information off the server - where the nearest gas station is, for instance. One of the most practical ways to access that information would be through voice commands. Again, voice is an integral part of the infrastructure, not technology for mere technology's sake.

Q What do you believe will be key market drivers for this technology in the short-term? Long-term?
A These days, ROI is king. When I speak to customers, the first thing they want to know is - how cost effective is it and how much money will it save me?

In the three types of applications I mentioned in my answer above - voice portals/voice-enabled contact centers, voice dialing on an enterprise level and automated directory assistance - the ROI is very clear.

We'll be seeing more voice technology being embedded in devices, whether it's ASR or TTS or a combination of both. We've already bundled Embedded ViaVoice in the Compaq iPAQ and, in China, Legend handhelds. As computing power improves and the handheld platforms get stronger, handhelds are practically becoming very small computers - you'll be seeing more speech technology there. Embedded voice makes a lot of sense in this emerging area of pervasive computing, especially as input moves away from keyboards and we need alternatives.

Another issue is standards - having one open standard, especially for emerging technologies like multimodal interaction, will be key in driving the industry forward.

Q What vertical market segments do you see supplying the most growth for speech technology developers and why?
A The medical and legal areas are quite strong in terms of dictation and transcription. (More on that in my answer to the final question.) The financial sector and telcos are also strong, because of the large numbers of calls their call centers need to handle. For simple queries, it's a lot more efficient, as well as cost-effective, to direct the callers to a voice-enabled, automated system. That's where voice has a clear value proposition. We're seeing the same thing in pockets of the travel industry. Again, we're talking about lots of requests for information that can be obtained over the phone. And, rather than having to engage a live agent for all the calls, directing the simpler calls to an automated, voice-enabled system frees up the agents for more complex calls.

Q In what geographical markets do you expect to see the most growth over the next three-to-five years and why?
A The U.S. market is poised to take hold of the speech technology market first, since U.S. companies are already engaging in transactions and correspondence using IVR technology. Americans have shown a willingness to use automated call centers to obtain information, whether they're gathering flight times from an airline or stock quotes from a financial institution.

Western Europe and Asia Pacific are slower in adoption - and culture plays a part there. People are not as willing to interact with automated systems - they much prefer dealing with live agents.

On the other hand, there is a wide adoption of wireless devices in Asia and Europe. Once 3G has taken hold in these markets, we could see an increase in voice technology being used, especially on the device side.

Q What should the speech technology industry as a whole be doing to increase the growth rate of speech technology deployments?
A Two things: prove ROI, and make it easier to deploy speech.

Businesses are looking to see how they can make more effective use of their current resources, and how technology can help them save money.

The industry also needs to make it easier for developers to build speech applications. VoiceXML has played a large part in simplifying voice technology, but it's still not as simple as it should be. Easy-to-use tools, as well as prebuilt modules like the Reusable Dialog Components found in IBM's toolkits, aim to simplify this task for developers.

Standards such as VoiceXML will also be key, as there are already a host of applications built on VoiceXML. As voice moves into its next phase - multimodal interaction - it will be important that developers are able to leverage their existing skills to extend their current applications. VoiceXML and XHTML, a language that web developers are already familiar with, together build a very strong foundation for new, multimodal technologies.

Q Who are some of your partners in providing speech technology and why did you choose those companies?
A IBM has made several key partnerships in various markets within the speech technology industry. Most notably, IBM, along with VoiceXML Forum members Opera and Motorola, worked together to submit to the W3C a framework based on VoiceXML and XHTML, called XHTML+Voice (X+V). This framework leverages two standards that web and voice developers are already familiar with and have built a great number of applications on.

Another partner of note: Johnson Controls. In the area of telematics, Johnson Controls has deployed IBM technology in its upcoming telematics offerings to the auto industry. Its first implementation is a voice-enabled mobile communications system for the Chrysler Group. An industry first, the system requires only the push of a button to make a call - all other functions are engaged via voice commands. It consists of a receiver module behind the dashboard, an embedded microphone in the rearview mirror and the driver's own mobile phone. The phone will synchronize with the receiver module to create a wireless connection via Bluetooth technology with the car's audio system. When a call is placed, audio is suspended, and the call comes through the speakers. IBM's software will allow drivers to use spoken commands (in English, French or Spanish) to place calls or access the system's audio address book, customizable by the owner. We'll be seeing this in DaimlerChrysler's 2003 model-year cars.

Q Provide us with your thoughts on the various standards that are being implemented and discussed.
A VoiceXML is a key standard for developing voice applications. According to analysts, there are currently around 50,000 VoiceXML developers. Furthermore, a cursory glance at chat forums on the Internet shows that developers are having detailed conversations about VoiceXML - indicating that there are a number of VoiceXML apps.

In terms of multimodal technology, IBM last year submitted to the W3C a mark-up language called XHTML+Voice (X+V), which builds on two existing standards: XHTML and VoiceXML, both of which have a large installed base of customers and applications. The capabilities provided in the tools allow the developer to speech-enable existing or new visual applications with simple X+V speech tags. Because X+V is built on standards that web and voice developers are already familiar with, developers can turn existing web applications into voice and web applications and deploy them using existing web infrastructure.

Q Describe a successful speech technology implementation and why you thought it was successful. Please include any benchmark statistics that support your thoughts.
A Austin, Texas-based medical solutions provider Expresiv Technologies has successfully integrated IBM's WebSphere Voice Server for Transcription within its MD One solution. MD One is one of the first commercially available telephone-based medical transcription solutions whereby physicians can dictate their records or reports over practically any device, including a phone, and have those records transcribed electronically and sent back to them via e-mail. While a medical transcriptionist still must listen to the doctor's voice to correct any errors, this takes less time than typing the average medical report from scratch. This is an important development since skilled transcription resources (specifically in the medical field) are expensive and hard to find.
