HTML5 — The Last Best Hope for Voice and Video


Apparently I’m not the only fan of the old but superb TV science-fiction show Babylon 5; members of the W3C must love it, too, because they chose a logo for HTML5 that resembles the TV show’s logo. Perhaps they meant to adopt the show’s motto of “last best hope” as well. 

Indeed, HTML5 is the last best hope for open, non-proprietary voice and video because of Skype’s dominant mind share. Skype has captured the consumer market and has insinuated its service into that other behemoth: Facebook. For the next couple of years, Facebook will dominate marketing organizations and people’s spare time. Eventually Facebook will be eclipsed by something else. But, in the meantime, Skype has an opportunity to rule the roost unchallenged. Even after Facebook has faltered, Skype users will likely continue to use the service that retains their entire contact list—the network effect.

So, what is HTML5? It is the latest version of the World Wide Web Consortium’s specification for HTML, the language of Web pages. When you request a Web page, your browser receives instructions written in HTML and turns them into the page you see.

HTML5 seems to be just about ready; some browsers already support it, and it finally provides a few pieces of the audiovisual puzzle. Until now, audio and video in browsers have been displayed using plug-ins: proprietary programs supplied by software vendors to encode and decode video and, on rare occasions, accept audio input. Because plug-ins are proprietary, all sorts of problems occur, from technical (crashes and poor computer performance) to financial (to push video from your Web page, you must pay to encode) to political (to date, Apple refuses to allow Flash on its phones and tablets).

Now we have a standard that promotes non-proprietary, license-free methods to play audio and video, along with a supplementary HTML5 draft specification that takes the radical step of letting the browser seize control of your computer’s microphone and camera. Until now, browsers have had only the most limited access to peripherals, to limit the damage that a malicious Web page could do to your computer. Browser access to the camera and microphone strikes me as a risky proposition, given the prevalence of Web-based malware. While the specification briefly mentions security, I expect continual browser security problems for at least five years, perhaps indefinitely.
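
To make the camera-and-microphone idea concrete, here is a minimal TypeScript sketch. It assumes the promise-based `navigator.mediaDevices.getUserMedia` call that browsers eventually shipped, which may differ from the draft discussed here. The `CaptureDevices` interface and the helper names `canCapture` and `requestCapture` are hypothetical, introduced only for illustration. The browser, not the Web page, shows the permission prompt, and a refusal surfaces as a rejected promise.

```typescript
// Hypothetical minimal shape of the capture API this sketch relies on.
// In a browser, navigator.mediaDevices is assumed to match this shape.
interface CaptureDevices<Stream> {
  getUserMedia(constraints: { audio: boolean; video: boolean }): Promise<Stream>;
}

// Outcome of a capture request: a live stream, or the reason it failed.
type CaptureOutcome<Stream> =
  | { ok: true; stream: Stream }
  | { ok: false; reason: string };

// Feature detection: true only if the environment exposes getUserMedia.
function canCapture(nav: unknown): boolean {
  const devices = (nav as { mediaDevices?: { getUserMedia?: unknown } } | null | undefined)
    ?.mediaDevices;
  return typeof devices?.getUserMedia === "function";
}

// Ask for camera and microphone together. The user may refuse, in which
// case the promise rejects and we report the error's name instead.
async function requestCapture<Stream>(
  devices: CaptureDevices<Stream>
): Promise<CaptureOutcome<Stream>> {
  try {
    const stream = await devices.getUserMedia({ audio: true, video: true });
    return { ok: true, stream };
  } catch (err) {
    const reason = err instanceof Error ? err.name : String(err);
    return { ok: false, reason };
  }
}
```

In a browser, a page would first check `canCapture(navigator)` and then call `requestCapture(navigator.mediaDevices)`. Passing the device object in as a parameter, rather than reaching for the global, keeps the sketch usable outside a browser and makes the failure path easy to exercise.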

Still, two-way audio/video is far too useful to accentuate the negative. Say you want to compete against Skype. The heavy lifting at the user interface level—access to the camera and microphone, playback of video/audio, and a canvas to provide a user interface—is done by the browser. That does leave a tremendous amount of work to be done if you want to create a conferencing service. But, off in the distance, I hear the sound of dozens of companies revving up to provide services to any company with a clever idea of how to incorporate one-way, two-way, or conferencing services into a Web site. Best of all, because your service runs in the browser, you don’t have to persuade someone to download a program or plug-in; just persuade that person to visit your Web page.

You can broadcast video (one to many) or create a conference service (many to many). An intriguing use lies in the middle ground: one-to-one communication between a customer and someone in a service center. 

That opportunity comes in two variations. The first is the elusive video-based call center, which lets you see the agent to whom you are speaking. While I can offer a dozen arguments why video call centers raise costs and provide little benefit, HTML5 will change the cost/benefit balance. It won’t be long before a video offering becomes the difference between a large, high-end company and a mom-and-pop shop.

The second opportunity relates to the first but provides a more rational output. When I shop online, I sometimes need another view of the gizmo I’m about to buy; when I can’t figure out how to assemble it, I would like someone to demonstrate how the pieces fit together. HTML5 makes this possible. A call agent provides an interactive video of the product he supports. “See this little tab here on the side of the gizmo? You have to rotate it this way—here.” And let’s not forget a more personal touch for higher-end consumer goods: “Here’s what the shoe looks like with this color stocking.”

It’s a short hop to speech recognition–based services. A browser can become hands-free (controlled by voice), which would benefit kiosks. Voice search may migrate from the phone to the desktop. Self-service Web pages might become easier to use, or at least more interactive, if they employ two-way speech. HTML5 offers speech technology another niche, and one that is currently unoccupied.

Moshe Yudkowsky, Ph.D., is president of Disaggregate Consulting and author of The Pebble and the Avalanche: How Taking Things Apart Creates Revolutions. He can be reached at speech@pobox.com.

