Are Speech Vendors About to Do the Mash?

Historically, application development has been a tedious process. Programmers worked with large, autonomous, complex blocks of code that were difficult to change and maintain. A vendor, for example, would develop a CRM system in a process that typically took many years. Once up, it was difficult to change the system, so major updates would arrive every few years.

Since the dawn of the Internet, software development has changed quite dramatically. Rather than large, autonomous blocks of code, vendors now deliver small, specialized elements that can be woven together. As a result, applications can be built in short periods of time—in some cases as little as a few hours. These changes have given birth to a new type of program, dubbed a mashup (see "What Exactly Is a Mashup?"). Unlike traditional applications, which essentially were built from the ground up in a silo fashion, a mashup usually has little original code, and instead mixes and matches elements generated by other sources.

Although mashups have garnered limited traction in the speech market to date, interest in them has been rising for a variety of reasons. Application development has been especially vexing in this market because the underlying technology is complex and often based on proprietary application programming interfaces (APIs), which increase development time and maintenance requirements. Mashups allow developers to work with simpler, higher-level interfaces than were available in the past, and "are quite popular with mobile applications, an area where speech is poised to make significant headway," Dan Miller, senior analyst and founder of Opus Research, says.

Mashup interest is growing, but barriers to adoption remain. Many companies do not understand how to leverage the technology. Also, this approach offers so many possible options that it might not be easy for corporations to find mashups suited to their needs. Voice vendors are developing and promoting their own interfaces, so it can be difficult to support the wide—and ever-growing—array of interfaces. Finally, there is no assurance that the underlying code or the vendor will be robust enough to support business applications. Despite those barriers, the future seems bright, and the expectation is that mashups will become a cornerstone in new speech solutions.

The Rise of APIs

"In software development, a recent change has been the growing importance of APIs," says Jeffrey Kaplan, managing director of technology at the consulting firm ThinkStrategies. Previously, developers could only work with large blocks of cement-like code, and, traditionally, programmers wrote all of the software, for instance, security functions, for each application. APIs are basically entries and exits into and out of programs. With the emergence of Web services and standards, such as VoiceXML, the underlying application infrastructure has become simpler to manipulate because it has been broken down into small, discrete, pliable pieces. Now, developers can focus on their areas of interest and let someone else worry about issues like security. In effect, programming has become more like linking patches in a quilt rather than knitting every stitch.

As these changes have taken place, mashups, which essentially consist of two or more third-party pieces of code sewn together, have become quite popular. In fact, in July 2011, Programmable Web (an aggregator of APIs as well as a clearinghouse for information about mashups) found close to 6,400 mashups in use. Integrating maps and location-based information is one popular category. Here, applications connect geographic data about business establishments with map images so individuals can find places, like a restaurant or a public bathroom.

Mashups are attracting interest for a variety of reasons. Because they are simpler to build than traditional approaches, they provide companies with a quick and inexpensive way to enhance their applications. "With mashups, it becomes much easier for enterprises to add text-to-speech features to their applications than in the past," says Dan York, director of conversations at Voxeo. Also, rapid development meshes with current business drivers: In recent years, such successful organizations as Google, Apple, and Best Buy have embraced new technologies as a means to innovate faster, to empower their employees and supply chains, and to become more market-oriented and customer-driven.

Let's Dump the Hardware

The mashup movement also coincides with the recent decoupling of hardware and software functions, a trend quite evident in the speech market. Traditionally, vendors packaged their solutions as turnkey systems that customers could drop into their networks. However, hardware has recently been commoditized with the advent of inexpensive Intel x86 systems and the development of cloud computing. Consequently, traditional market heavyweights, such as Avaya and Siemens, have been moving away from hardware development and focusing more on software.

"If mashups gain popularity, it will make it even more difficult for vendors that make items like voice processing boards to build viable businesses," says Thomas Howe, a partner at Embrase Management Consulting.

In the voice arena, the key to using mashups is the development and publicizing of APIs that would connect applications to voice systems, so companies can take advantage of features such as distributed call processing, voice processing, or message management. Mashups that take advantage of voice recognition technology could come from a wide range of vendors. Speech recognition suppliers have been building up software ecosystems to embellish their systems. Mashups have been quite popular with mobile devices, such as smartphones and tablets. Wireless network suppliers want to utilize them to encourage more use of their network services. Unified communications vendors see them as a natural extension to their business. Vendors like Amazon and Google view mashups as an element that will move enterprises away from premises-based applications to cloud solutions.

While there are many options, a handful of possibilities have begun to percolate and could influence the speech recognition market as it moves forward. AT&T has been developing a Web-based platform called Speech Mashups; it sends speech from users' handsets, such as the Apple iPhone, to a remote server that translates their commands. The platform works with Web-based browsers in selected handsets, other wireless devices, and TV set-top boxes.

Voxeo's Tropo, a cloud-based application platform, enables developers to build voice mashups using JavaScript, Groovy, PHP, Ruby, or Python programming languages. One example is Phone 2 Directions, a mashup that lets users call in, enter their phone numbers, and receive spoken directions based on their GPS locations.

Looking Up to the Cloud

Twilio provides a cloud API for voice communications that enables Web applications to interact with phone callers. With Twilio, Web developers can add voice functions to their existing code without having to learn a telecom programming language or set up an entire stack of PBX software.
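To make the idea concrete, here is a minimal sketch of what a Web application might hand back when a call arrives. It assumes the application answers an incoming-call request with a small XML document; the Response and Say element names follow Twilio's TwiML markup, while the function name and greeting text are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_greeting_twiml(message: str) -> str:
    """Return a TwiML-style XML string that asks the platform to speak `message`."""
    response = ET.Element("Response")    # root element of the reply
    say = ET.SubElement(response, "Say")  # text-to-speech instruction
    say.text = message                    # ElementTree escapes special characters
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical greeting; a real application would build this from its own data.
    print(build_greeting_twiml("Thanks for calling. Your order has shipped."))
```

The point of the sketch is that the developer writes ordinary Web code and leaves call handling and speech synthesis to the hosted service.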

Siemens' OpenScape platform links voice, video, texting, email, and conferencing messages to business applications. The company's mashup reads tweets, decides if they indicate that communications actions are needed, and, if so, completes them. For example, if a person tweets that he has just arrived in San Francisco, the application keys in on the words "arrived" and "San Francisco" and triggers actions within Siemens' OpenScape unified communication environment. The person's location status changes, and the system performs actions based on established policies, such as forwarding all of that person's calls to voicemail between midnight and 8 a.m. Pacific time rather than between midnight and 8 a.m. in the time zone he just left.
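The keyword-trigger logic in that example can be sketched in a few lines. This is an illustration only, not Siemens' implementation: the city table, class, and function names are all hypothetical, and a real system would need far more robust language handling.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table; a real system would use a proper place-name service.
CITY_TIME_ZONES = {
    "san francisco": "America/Los_Angeles",
    "new york": "America/New_York",
}

ARRIVAL_PATTERN = re.compile(r"\barrived\b", re.IGNORECASE)


@dataclass
class PresenceUpdate:
    city: str
    time_zone: str


def parse_arrival(tweet: str) -> Optional[PresenceUpdate]:
    """Return a presence update if the tweet announces arrival in a known city."""
    if not ARRIVAL_PATTERN.search(tweet):
        return None
    lowered = tweet.lower()
    for city, zone in CITY_TIME_ZONES.items():
        if city in lowered:
            return PresenceUpdate(city=city.title(), time_zone=zone)
    return None


def should_send_to_voicemail(local_hour: int) -> bool:
    """Policy from the example: voicemail from midnight until 8 a.m. local time."""
    return 0 <= local_hour < 8
```

Once the update is known, the policy check runs against the hour in the new time zone rather than the one the traveler left.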

Ifbyphone is a hosted voice application and platform designed to help small and medium-sized businesses add voice functionality to their systems. Through a combination of telephony and Web services, Ifbyphone supports programmable APIs, so customers can route inbound or outbound calls and perform IVR functions.

Mashing Up Workflow and Voice

Jaduka developed a Simple Object Access Protocol–based Web service interface that enables companies to blend voice into their workflow activities. The company claims that its API is accessed 1 million times daily on a platform supporting more than 1 billion accounts and 20 million daily transactions.

Pioneer developed Zypr, a cloud-based, voice-control portal. The mashup, which runs on automobiles, tablets, smartphones, and televisions, supports voice access to maps, local search, social networking, music, video, contacts, calendars, and weather applications.

Google has cast a formidable shadow on the voice mashup market with its Google Voice service. 2lingual has used it to develop two voice-search tools that work with speech-to-text-capable browsers. Google Multilingual Voice Search works with 51 languages, and Twitter Multilingual Voice Search provides similar capabilities for tweets.

Launching a New Movement

Many of these initiatives are in a nascent stage of development. "Right now, only early adopters are using voice mashups," Howe says. Some speech mashups are shipping, but they are largely being used in test environments or with limited numbers of users.

However, that might change—and perhaps quite quickly. "Speech is becoming an attractive interface for mobile devices," Kaplan says. Today's cell phones are like Swiss Army knives, with a tool or application for almost any situation, but oddly, they often do not allow the user to speak or listen to messages. Using on-demand voice technology and mashups to create voice-enabled applications for mobile devices will enable the user to speak and hear commands as well as see and select them. Moreover, speech is a direct, intuitive interface that requires no learning and is safe for multitasking users, so handset vendors would like to use it more.

But the mobile mashup market faces a number of challenges. Though mobile is a hot area, the devices themselves lack the computational capabilities needed to perform speech processing tasks like speech recognition and text-to-speech conversion, especially when large vocabularies and high-quality synthesis are involved. An emerging solution is moving the speech processing resources into the network, where they can be supported by large cloud server farms, which represent another emerging area. To date, many cloud speech services perform just a single specific task, and it is unclear if the vendors will be able to expand their portfolios. Another question is whether they will be able to scale up and accommodate large deployments. Finally, many of the mobile mashups have been Internet-browser based. While they work with Web-coded applications, they won't run on nonbrowser-based mobile systems, like the iPhone.

The Power of Inertia

Concerns about this emerging application development approach also center on usability, integration, and complexity. Since mashups are new, few organizations are familiar with how to exploit them. "Inertia is a major force; individuals and corporations are often slow to change how they do business," Howe says. Consequently, education as well as training will be needed as the market matures.

The mashup's democratization of application development comes with a few caveats. As evidenced by the growing number of mobile marketplaces, once open interfaces are published, a plethora of third parties quickly emerges to build add-on software. In some cases, the functionality that the applications offer is quite limited; how many iPhone applications appear to be useless? So finding the best fit among tens of thousands or even hundreds of thousands of possibilities becomes challenging.

In addition, some of these applications are developed by one-person operations, individuals fiddling with their favorite hobby. Consequently, there are questions about the strength of their code; it might lack the security, integrity, and scalability needed to support business usage. "Businesses can take a risk using some mashups because the vendor may not be able to provide enterprise-level support," Kaplan says.

Mashup Monetization

Monetization is another challenge. The emergence of mobile application stores turned traditional pricing on its head. Many of the vendors provide free products, and paid software is inexpensive, running from as little as $1 to maybe $5. Consequently, few mashup companies are making money at this stage. Whether customers will be willing to pay more as the market evolves is unclear. Past experience with similar models has shown a reluctance by customers to pay for items that previously were distributed for free or at a nominal cost. Perhaps Google and others will find a way to tie advertisements into the mashup model, but as of now it is unclear what business model mashup vendors will use to satisfy shareholders.

Those business models could take shape quickly. Application development has been undergoing a major transformation, making it simpler, faster, and less expensive than in the past. The building blocks to transform the voice mashup market are still being put in place, but once they are laid, the impact that mashups will have is expected to be wide ranging and quite significant.

What Exactly Is a Mashup?

The word mashup started out in the music industry and made its way into the technology space. Originally, the term described two or more songs being mashed, basically put together to create an entirely new song. Many digital DJs gained fame if their mixes were particularly entertaining.

Nowadays, the term also describes the act of taking data from multiple sources and putting it into one application. The term implies easy, fast integration, frequently using open application programming interfaces and data sources, to produce enriched results that the original sources did not necessarily envision. An example is HousingMaps.com, which takes information about available housing rentals from Craigslist and combines it with information about location from Google Maps. People searching for home rentals do not need to go back and forth between the two sites; they can look at everything with one click.
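The pattern behind a site like that is a join between two independent sources. The sketch below is a simplified illustration under stated assumptions: the listings and coordinates are made-up sample data held in memory, and the function name is hypothetical. A real mashup would fetch both from the providers' published interfaces.

```python
# Hypothetical sample data standing in for two independent sources.
RENTAL_LISTINGS = [
    {"title": "Sunny 1BR", "address": "12 Oak St", "rent": 1450},
    {"title": "Loft near park", "address": "88 Pine Ave", "rent": 2100},
    {"title": "Studio", "address": "5 Unknown Rd", "rent": 900},
]

ADDRESS_COORDINATES = {
    "12 Oak St": (37.7793, -122.4193),
    "88 Pine Ave": (37.8044, -122.2712),
}


def build_map_pins(listings, coordinates):
    """Join listings to coordinates by address, skipping any that cannot be placed."""
    pins = []
    for listing in listings:
        location = coordinates.get(listing["address"])
        if location is None:
            continue  # no known position, so nothing to plot
        latitude, longitude = location
        label = f'{listing["title"]} (${listing["rent"]}/mo)'
        pins.append({"label": label, "lat": latitude, "lng": longitude})
    return pins


if __name__ == "__main__":
    for pin in build_map_pins(RENTAL_LISTINGS, ADDRESS_COORDINATES):
        print(pin)
```

Almost none of this is original logic; the value comes from putting the two data sets side by side.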

Voice mashups are a subset of the broad category. In them, a voice service is combined with data from other applications to create either an entirely new service or a new way of interacting with an existing service. Since speech can be integrated with virtually any application (Facebook, MySpace, YouTube, Google Maps, Amazon.com, MySQL databases, iTunes, shopping-cart software, and hundreds of other Web-based services), an infinite number of mashups is possible, and these are only a few:

  • A carrier could develop an IP-based TV service to help users find movies-on-demand rapidly. Here, customers enter interests, like romance movies with Cary Grant or movies directed by Ron Howard, and the system finds the desired titles.
  • A directory service could offer access to local business listings via natural language queries, for instance, asking about the nearest FedEx drop-off box and then guiding the user to it via a GPS service.
  • A speech interface would allow users to interact with mobile devices while jogging or at times when their hands are not free.
  • A vendor could integrate simple verbal commands, such as yes, no, next, and prior, for navigation through Web-based instructions for assembling a bicycle, diagnosing car trouble, fixing a faucet, or preparing a recipe.
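The last scenario, simple verbal commands for stepping through instructions, is easy to sketch once a speech service has already turned the audio into a word. The class below is hypothetical, and so is the interpretation of the commands: it assumes "next" and "yes" advance, "prior" goes back, and "no" or anything unrecognized repeats the current step.

```python
class InstructionNavigator:
    """Step through a list of instructions using one-word recognized commands."""

    def __init__(self, steps):
        if not steps:
            raise ValueError("at least one instruction step is required")
        self.steps = list(steps)
        self.index = 0

    def handle(self, command: str) -> str:
        """Apply a recognized word and return the step to read aloud."""
        word = command.strip().lower()
        if word in ("next", "yes"):
            # Advance, but stay on the final step once it is reached.
            self.index = min(self.index + 1, len(self.steps) - 1)
        elif word == "prior":
            # Go back, but never before the first step.
            self.index = max(self.index - 1, 0)
        # "no" and unrecognized words leave the position unchanged.
        return self.steps[self.index]


if __name__ == "__main__":
    # Hypothetical instruction set for illustration.
    guide = InstructionNavigator(["Attach the seat.", "Fit the wheels.", "Tighten the bolts."])
    for spoken in ["next", "next", "prior", "no"]:
        print(spoken, "->", guide.handle(spoken))
```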

Paul Korzeniowski is a freelance writer who specializes in technology issues. He can be reached at paulkorzen@aol.com.
