Speech applications don't exist in a vacuum.
We usually talk about standards that are closely related to speech processing in this column, but speech applications don't exist in a vacuum. They need to actually do something to be useful. Using speech to access information and accomplish tasks for users is where its real power--and that of natural language understanding technology--comes in. Let's look at standards that can be used to link speech applications to the outside world.
What do I mean by the outside world? Two things. The first is connecting speech applications to the vast amount of structured information available on the Internet, and the second is connecting speech applications to devices in the physical world.
There's a massive amount of structured information accessible to applications on the Internet in the form of application programming interfaces (APIs) to Web services. Standards for Web services (including speech processing services) include protocols such as Simple Object Access Protocol (SOAP) and architectural principles such as Representational State Transfer (REST), which builds on HTTP. Web services make it possible to access structured information of all kinds to create mashup applications. The Web site ProgrammableWeb.com provides information on more than 8,500 APIs to a huge variety of Web services, many of which would make extremely useful mashups when combined with a speech interface. They include:
- ParkMe, which helps users find available parking in more than 500 cities;
- OpenEMI Music, which offers access to music content from EMI; and
- SeatGeek, which combines information from a variety of ticket services about concerts, sports, and theater tickets. It even knows about the seating options in various venues.
There are also APIs for better-known types of information--shopping, mapping, and weather, for example.
To use these services in a speech-enabled application, the developer has to employ speech technology to determine what the user wants to do, translate that intent into an API call to the appropriate service, and then send an HTTP request to the service. When the server sends its response back to the mobile device, the app must determine how to present the data, whether as graphics, text, spoken output, or even vibration.
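The steps in that paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the endpoint URL, the `city` parameter, and the `available_spaces` field are hypothetical and do not describe any real parking service, and the keyword rule in `parse_intent` stands in for a real speech recognition and natural language understanding component. The HTTP request itself is left out so the sketch runs without network access.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint: a real service defines its own URL and parameters.
PARKING_ENDPOINT = "https://api.example.com/parking/search"


def parse_intent(utterance: str) -> Optional[dict]:
    """Toy intent detection: a keyword rule standing in for real NLU."""
    match = re.search(r"\bparking (?:in|near) ([a-z ]+)", utterance.lower())
    if match is None:
        return None
    return {"intent": "find_parking", "city": match.group(1).strip()}


def build_request_url(intent: dict) -> str:
    """Translate the recognized intent into an HTTP GET URL for the service."""
    return PARKING_ENDPOINT + "?" + urlencode({"city": intent["city"]})


def render_response(payload: dict) -> str:
    """Turn the (assumed) JSON payload into a sentence for spoken output."""
    count = payload.get("available_spaces", 0)
    return f"I found {count} available spaces in {payload['city']}."


intent = parse_intent("Find parking near Boston")
if intent is not None:
    # A real app would now send this URL with an HTTP client and parse the reply.
    print(build_request_url(intent))
```

Keeping intent detection, request building, and presentation in separate functions means the same intent could be sent to a different service, or the same response rendered as text instead of speech, without touching the other steps.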
Speech apps can connect not only to structured information on the Internet, but also to devices. The simplest case is controlling and obtaining data from the mobile device itself. For example, if you ask Apple's Siri, "Where am I?" it will show your current location on a map by using the device's geolocation service and a mapping service. Standard APIs include the W3C Geolocation Working Group's API to device geolocation services and the W3C Device APIs Working Group's APIs to device capabilities such as vibration and media capture. How about taking a picture using the media capture API, combined with speech, and saying, "Take a picture from the front camera in five seconds"?
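To make the camera example concrete, here is a rough sketch of how a recognized utterance might be turned into structured parameters before any device API is called. The names `CaptureCommand` and `parse_capture_command` are made up for illustration and are not part of any W3C API. Defaulting to the rear camera and to no delay is also an assumption, and the sketch stops short of calling a real media capture API.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Small vocabulary of spoken numbers; a real app would need a fuller mapping.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


@dataclass
class CaptureCommand:
    """Hypothetical structure handed to whatever code drives the camera."""
    camera: str          # "front" or "rear"
    delay_seconds: int   # how long to wait before capturing


def parse_capture_command(utterance: str) -> Optional[CaptureCommand]:
    """Extract the camera and delay from a spoken picture-taking request."""
    text = utterance.lower()
    if "take a picture" not in text:
        return None
    # Assumption: use the rear camera unless the user names the front one.
    camera = "front" if "front camera" in text else "rear"
    delay = 0
    match = re.search(r"\bin (\w+) seconds?\b", text)
    if match:
        token = match.group(1)
        delay = int(token) if token.isdigit() else NUMBER_WORDS.get(token, 0)
    return CaptureCommand(camera=camera, delay_seconds=delay)


print(parse_capture_command("Take a picture from the front camera in five seconds"))
```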
We can go beyond interacting with a single device to interacting with many devices over the Internet. This becomes more interesting as things become more interconnected. The so-called "Internet of Things," a theme at this year's Consumer Electronics Show, is a vision in which many ordinary objects are addressable on the Internet. This is made possible by IPv6, the addressing scheme defined by the Internet Engineering Task Force, whose 128-bit addresses leave room for a vastly larger number of addressable items. A lot of the infrastructure for controlling and getting data from devices such as home appliances, entertainment systems, and medical devices is now available. The W3C's Multimodal Architecture specification provides a framework for coordinating the interaction among devices.
Unfortunately, these devices are typically controlled by apps that each have their own user interface that must be learned and remembered. More apps are appearing as more devices become connected. A natural spoken interface would be much easier to use.
It's easy to imagine how combining speech with Web services and device control can build amazing applications. How about an app that uses a weather service to monitor the weather, finds out that it's going to rain, knows the windows are open in your car, and then asks you if you want to close them by remote control? How about an app that knows when you're nearing home (through your phone's GPS and a mapping API), knows that it's dark, and asks if you want the lights turned on? The possibilities are limitless.
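The two scenarios above boil down to simple rules over data gathered from Web services and devices. The sketch below assumes that data has already been collected into a hypothetical `Context` record. Its field names and the wording of the prompts are illustrative only, and a real app would fill them in from a weather API, the car, the phone's GPS, and so on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Context:
    """Hypothetical snapshot of what the app has learned from services and devices."""
    rain_expected: bool      # from a weather service
    car_windows_open: bool   # from the car
    near_home: bool          # from the phone's GPS and a mapping API
    is_dark: bool            # from the time of day or a light sensor


def suggest_prompts(ctx: Context) -> List[str]:
    """Return the questions the app should ask the user, if any."""
    prompts = []
    if ctx.rain_expected and ctx.car_windows_open:
        prompts.append("Rain is expected and your car windows are open. Close them?")
    if ctx.near_home and ctx.is_dark:
        prompts.append("You're almost home and it's dark. Turn on the lights?")
    return prompts


for prompt in suggest_prompts(Context(True, True, False, False)):
    print(prompt)
```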
Deborah Dahl, Ph.D., is principal at speech and language consulting firm Conversational Technologies and chair of the World Wide Web Consortium's Multimodal Interaction Working Group. She can be reached at firstname.lastname@example.org.