June 1, 2009
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

On-Demand Services and Mashups

Two new trends are appearing in the development of speech applications: on-demand services and mashups.

On-demand services: Sometimes called platform as a service (PaaS), software as a service (SaaS), or service from the cloud, on-demand services have been available for nonspeech technology for some time. Salesforce.com features customer relationship management applications, and Google Apps features several Web applications with functionality similar to traditional office suites. Both are available on demand from the Web.

In the area of speech technologies, several companies provide hosting services, some of which are on demand and enable developers to access specific speech technologies, such as speech synthesis, speech recognition, and speaker verification. Examples include:

VoiceForge, with more than 50 synthesized voices;
AT&T Research’s WATSON speech recognition and natural language speech synthesis;
Voxeo’s Tropo platform for speech recognition, speech synthesis, and call management functions for use in JavaScript, Ruby, Groovy, PHP, and Python applications; and
VoiceVerified’s speaker enrollment and speaker verification applications.

Rather than forcing developers to pay up front for licenses, these vendors allow the user to “rent” the software for use as a service on demand. Developers pay a monthly subscription fee or per-use fee, or use the software for free with additional cost options available. For a small monthly fee, for example, developers can access Cepstral’s text-to-speech engine by sending text to a Web site that hosts it. Developers can prototype and refine their applications, even on a shoestring budget. When the application is deployed, developers continue to pay for the service based on how many times it is used.
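
As a rough illustration of what sending text to a hosted engine might look like from the developer's side, the Python sketch below builds (but does not send) an HTTP request to a text-to-speech service. The endpoint URL, the form fields `text` and `voice`, and the function name are all hypothetical; any real vendor's interface will differ.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute the address your vendor documents.
TTS_ENDPOINT = "https://tts.example.com/synthesize"


def build_tts_request(text: str, voice: str = "default") -> Request:
    """Build a POST request asking a hosted engine to synthesize `text`.

    The form fields `text` and `voice` are assumptions for illustration.
    """
    body = urlencode({"text": text, "voice": voice}).encode("utf-8")
    return Request(
        TTS_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_tts_request("Your order has shipped.")
# Passing `request` to urllib.request.urlopen() would actually send it;
# a real service would typically answer with the synthesized audio.
```

Because the engine lives on the vendor's server, the application carries no speech software of its own; each call like this one is also a natural unit for per-use billing.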

On-demand services have significant advantages. The technology provider no longer needs to distribute and manage code; instead, it updates its Web site. Application developers don’t need to install updates to take advantage of the latest features.

But there are also disadvantages. Application response time could be slow due to Internet delays. Developers could easily be locked in to one vendor, because switching to another usually means modifying applications. Relying on data and services on the Web could introduce security and operational risks. And finally, the long-term rental cost might be greater than the up-front payment of complete licensing fees.
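
The rental-versus-license trade-off comes down to simple arithmetic. The sketch below, which uses made-up prices and ignores maintenance contracts, hardware, and per-use charges, finds the first month in which cumulative subscription fees exceed a one-time license fee.

```python
import math


def break_even_months(license_fee: float, monthly_fee: float) -> int:
    """Return the first month in which total rental exceeds the license fee."""
    if monthly_fee <= 0:
        raise ValueError("monthly_fee must be positive")
    # Rental after m months is m * monthly_fee; take the smallest m
    # for which that total is strictly greater than license_fee.
    return math.floor(license_fee / monthly_fee) + 1


# Hypothetical prices: a $3,000 license versus a $125 monthly subscription.
months = break_even_months(3000, 125)
print(months)  # 25
```

With these assumed figures, renting costs the same as the license at 24 months and more from month 25 onward, so an application expected to run for several years might favor the license.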

Mashups: Developers create mashups by integrating functionality from multiple sources. A developer could, for example, combine speech synthesis technology from Vendor A, speech recognition technology from Vendor B, and other technology from Vendor C.
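
One way to picture the Vendor A, B, and C combination is as three small components behind common interfaces. The Python sketch below is only an illustration: the class names, the `recognize` and `synthesize` methods, and the business-hours lookup are invented, and the "vendors" are local stubs rather than real services.

```python
from typing import Protocol


class Recognizer(Protocol):
    def recognize(self, audio: bytes) -> str: ...


class Synthesizer(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VendorBRecognizer:
    """Stand-in for Vendor B's speech recognition service."""

    def recognize(self, audio: bytes) -> str:
        # A real adapter would send the audio to the vendor; this stub
        # pretends the audio bytes already contain the spoken words.
        return audio.decode("utf-8")


class VendorASynthesizer:
    """Stand-in for Vendor A's speech synthesis service."""

    def synthesize(self, text: str) -> bytes:
        # A real adapter would return audio; this stub returns tagged text.
        return f"<audio>{text}</audio>".encode("utf-8")


def lookup_hours(query: str) -> str:
    """Stand-in for the 'other technology' from Vendor C."""
    if "hours" in query.lower():
        return "Open until 9 p.m."
    return "Sorry, I did not understand."


def voice_mashup(audio: bytes, asr: Recognizer, tts: Synthesizer) -> bytes:
    """Recognize a spoken query, answer it, and speak the answer."""
    query = asr.recognize(audio)
    answer = lookup_hours(query)
    return tts.synthesize(answer)


reply = voice_mashup(b"What are your hours?", VendorBRecognizer(), VendorASynthesizer())
```

Because `voice_mashup` depends only on the two interfaces, replacing a vendor means writing a new adapter class rather than rewriting the application.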

Some developers download and embed new functionality into local applications. For example, Tazti (pronounced “tasty”) offers speech recognition to provide command and control for other applications, such as Facebook, Web browsers, MySpace, and iTunes. When combined with on-demand services, mashups might integrate different services into new and exciting applications. AT&T has mashups integrating voice services on iPhone clients with:

an IP-based TV service to enable users to find movies-on-demand rapidly. Users speak queries like “Romance movies with Cary Grant” or “Movies directed by Ron Howard”;
a pizza order entry service to provide speech and graphical interaction for ordering pizzas; and
a directory service to access local business listings with natural language queries. An example would be “Show the nearest Federal Express,” via a GPS interface.

Imagine integrating speech with Facebook, MySpace, YouTube, Google Maps, Amazon.com, MySQL databases, iTunes, shopping-cart software, and hundreds of other Web-based services. Speech would let users reach these services without typing requests on small keyboards, and interact with mobile devices while jogging or when their hands are busy. Imagine integrating simple verbal commands, such as “yes,” “no,” “next,” and “prior,” for navigating through Web-based instructions for assembling a bicycle, diagnosing car trouble, fixing a water faucet, or preparing a meal.
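
The navigation idea can be sketched in a few lines of Python. The example assumes a speech recognizer has already turned the user's utterance into a single word; the class name and the choice to treat “yes” as “go on” and “no” as “repeat” are illustrative assumptions, not part of any particular product.

```python
from typing import Sequence


class InstructionNavigator:
    """Step through a list of instructions using one-word spoken commands."""

    def __init__(self, steps: Sequence[str]) -> None:
        if not steps:
            raise ValueError("steps must not be empty")
        self.steps = list(steps)
        self.index = 0

    def handle(self, command: str) -> str:
        """Apply a recognized word and return the step to read aloud."""
        word = command.strip().lower()
        if word in ("next", "yes"):
            # Advance, but never past the final step.
            self.index = min(self.index + 1, len(self.steps) - 1)
        elif word == "prior":
            # Go back, but never before the first step.
            self.index = max(self.index - 1, 0)
        # "no" and unrecognized words leave the position unchanged,
        # so the current step is simply repeated.
        return self.steps[self.index]


nav = InstructionNavigator(
    ["Attach the front wheel.", "Tighten the axle nuts.", "Inflate the tires."]
)
spoken = nav.handle("next")  # moves to "Tighten the axle nuts."
```

A vocabulary this small is easy for a recognizer to handle reliably, which is part of what makes hands-busy tasks a good fit for speech.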

Today’s phones are like Swiss Army knives, with a tool or application for almost any situation. Oddly, many of these applications don’t allow the user to speak or listen. Using on-demand voice technology and a mashup strategy to create voice-enabled applications for mobile devices will enable the user to speak and listen in addition to seeing and selecting. After all, phones were originally designed to enable users to speak and listen.

James A. Larson, Ph.D., is co-chair of the World Wide Web Consortium’s Voice Browser Working Group and author of the home-study guide, The VoiceXML Guide. He can be reached at jim@larson-tech.com.
