Video Interactive Services with VoiceXML

By Dave Burke and Scott McGlashan


New developments in handsets and networks are enabling a new world of video interactive services. Self-service interaction is evolving from Interactive Voice Response (IVR) to Interactive Voice and Video Response (IVVR). Video adds a new channel and dimension to human-computer interaction, resulting in new possibilities, and a richer and more intuitive interface.

Examples of new application possibilities include:

  • Video mail
  • Video information services
  • Video entertainment services
  • Video conferencing
  • Video call-center

VoiceXML is a W3C language designed for creating audio dialogs that feature speech synthesis, DTMF and speech recognition, audio recording, and basic telephony. VoiceXML was not originally designed to include a video channel but it is capable of being used to create powerful video interactive services without any modification to the language itself.

This article summarizes recent advances in modern video telephony and in particular how to make use of video media in VoiceXML. We give a brief overview of features planned for future releases of VoiceXML and give two examples of platforms which offer video capabilities through VoiceXML today.

Modern Video Telephony

While the idea of video telephony is certainly not new, there are two major developments fuelling a renewed interest in it today:

  1. Deployment of third generation (3G) mobile networks
  2. Video-over-IP telephony

3G networks have been rolled out across Europe and parts of Asia. In Japan, the FOMA-based system is growing rapidly. These systems leverage the 3G-324M recommendation for video telephony - a circuit-switched approach for providing conversational video over a low-bandwidth, wireless network.

3G-324M requires a standard 64 kbits/s duplex bearer (commonplace in the PSTN, though full 64 kbits/s right up to the mobile device was only introduced in third generation networks). The 64 kbits/s channel is established using normal means, for example using ISDN or ISUP signaling. An H.223 multiplexer combines the control messages, the audio, and the video into a single bit stream provided at a constant rate of 64 kbits/s. Video is encoded using the H.263 codec and optionally the MPEG-4 codec. Both of these codecs have been designed specifically for low bit rate communication. Audio is encoded using the adaptive multirate (AMR) codec and optionally using the G.723.1 codec. Control messages are specified in H.245, which provides features such as capability exchange and opening of logical media channels. In particular, H.245 carries user input indication messages such as DTMF key presses, a key enabler for basic interactivity. Figure 1 summarizes the components of a 3G-324M terminal.

Figure 1: Components of a 3G-324M terminal

3G-324M presents several benefits for delivering video interactive services including:

  • Ubiquitous access without requiring a special "client" to be downloaded
  • Standard telephony revenue collection mechanisms apply (e.g. premium rate)
  • Fast, immediate interface
  • No digital rights management issues (media is real-time and transient)

Video-over-IP telephony is appearing in a variety of different user agents designed for IP networks. These user agents range from hardware phones, to softphones, to set-top boxes. Services such as Skype and SonyIVE are putting video-over-IP on the map. Furthermore, terminals in the upcoming 3GPP IP Multimedia Subsystem (IMS) will support video and audio over packet-based networks using protocols such as SIP for signaling.

Unlike 3G-324M, video-over-IP telephony is being enabled on packet-based networks through existing technologies. Both SIP and H.323 signaling are independent of the media sessions they set up and thus can readily initiate video calls. The Real-time Transport Protocol (RTP), which carries voice over IP, is also used to transport video. Specifications for mapping popular video codec bit streams onto RTP are available as IETF standards.

Using Video Media with VoiceXML 2.0

The VoiceXML 2.0 language can be used to develop powerful video applications today. The language itself does not require extensions for video playback and recording, since it uses URIs to reference media resources, and media types to specify the content of these resources.  With this approach, video resources are no different from audio resources from the language perspective. However, they are different from the platform perspective: VoiceXML 2.0 only requires a platform to support playback and recording with certain audio formats (e.g. G.711) but not other audio formats (e.g. MP3) and not video formats (e.g. 3GPP). So to deploy VoiceXML video applications, a VoiceXML platform capable of supporting video media is required.

To play video to the end user, the URI of the video resource can be referenced using the <audio> element; for example,

<audio src="http://www.example.com/myvideo.3gp"/>

When the resource is fetched using HTTP, the web server returns its authoritative media type - in this example, "video/3gpp" for a 3GPP file (note: the web server may need additional configuration to map files ending in ".3gp" to the media type "video/3gpp"). If the VoiceXML platform supports the video media type, then the media is queued for playback. If the platform does not support the media type, then - just like other non-mandatory media types such as MP3 audio files - the media is ignored.
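For context, the one-line <audio> reference above could sit in a complete document along the following lines. This is a minimal sketch, assuming a platform that supports the "video/3gpp" media type; the form id is illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="welcome">
    <block>
      <!-- The platform fetches the resource and relies on the media type
           returned by the web server (video/3gpp here) to decide whether
           it can queue the resource for playback. -->
      <prompt>
        <audio src="http://www.example.com/myvideo.3gp"/>
      </prompt>
    </block>
  </form>
</vxml>
```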

The standard syntax and semantics of the <audio> element are applicable for video media; for example, the video URI can be specified using the 'expr' attribute, and fetching/caching properties apply as normal. Within the <audio> element, fallback content can be specified; for example,

  <audio src="http://www.example.com/myvideo.3gp">
      <audio src="http://www.example.com/standardvideo.3gp"/>
  </audio>

If "myvideo.3gp" cannot be found or played, then the fallback video "standardvideo.3gp" is played.
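The 'expr' attribute and nested fallback content can be combined. The sketch below assumes a hypothetical document-level variable "clipUri" holding the video URI, and adds a final text fallback that a platform would render with speech synthesis if neither video can be played.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hypothetical variable; in practice the URI might be computed
       per caller or returned by an earlier request. -->
  <var name="clipUri" expr="'http://www.example.com/myvideo.3gp'"/>
  <form id="play">
    <block>
      <prompt>
        <!-- First choice: the URI held in clipUri. -->
        <audio expr="clipUri">
          <!-- Second choice: a standard video. -->
          <audio src="http://www.example.com/standardvideo.3gp">
            <!-- Last resort: synthesized speech. -->
            Sorry, the video is not available right now.
          </audio>
        </audio>
      </prompt>
    </block>
  </form>
</vxml>
```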

Video can also be used in other VoiceXML 2.0 constructs which reference media resources; for example, the "fetchaudio" property, and the "transferaudio" attribute of <transfer> can reference a video resource URI.
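As an illustration of these two constructs, the sketch below sets "fetchaudio" for the whole document and plays a video while a bridged transfer is being set up. The destination number, the video file names and the "/after-transfer" URL are hypothetical, and the behaviour assumes a platform that supports video media.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Played while the platform waits on a document fetch. -->
  <property name="fetchaudio"
            value="http://www.example.com/pleasewait.3gp"/>
  <form id="connect">
    <!-- transferaudio is played while the transfer attempt is in
         progress, before the far end answers. -->
    <transfer name="agent" dest="tel:+15555550100" bridge="true"
              transferaudio="http://www.example.com/holdvideo.3gp">
      <filled>
        <log>Transfer result: <value expr="agent"/></log>
        <!-- This fetch is covered by the fetchaudio property above. -->
        <submit next="http://www.example.com/after-transfer"
                namelist="agent"/>
      </filled>
    </transfer>
  </form>
</vxml>
```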

For video recording, the <record> element is used with its "type" attribute set to a video media type such as "video/3gpp"; for example, 

  <record name="myrecording" type="video/3gpp" beep="true" maxtime="30s">
     <audio src="http://www.example.com/recordprompt.3gp"/>
  </record>

This example prompts the user with the video file "recordprompt.3gp" and subsequently records audio and video for up to 30 seconds. As with other non-mandatory record media types, if the video media type is not supported by the VoiceXML platform, the platform throws the appropriate error event.
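A fuller version of the recording example might handle the unsupported-format case and submit the result. This is a sketch only: it assumes the platform signals an unsupported record type with the standard "error.unsupported.format" event, and the "/store" URL and form ids are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="videomail">
    <record name="myrecording" type="video/3gpp" beep="true" maxtime="30s">
      <audio src="http://www.example.com/recordprompt.3gp"/>
      <!-- Assumption: an unsupported record media type is reported
           with this event. -->
      <catch event="error.unsupported.format">
        <prompt>Sorry, video messages cannot be recorded on this service.</prompt>
        <exit/>
      </catch>
      <filled>
        <!-- Upload the recording; the target URL is hypothetical. -->
        <submit next="http://www.example.com/store" method="post"
                enctype="multipart/form-data" namelist="myrecording"/>
      </filled>
    </record>
  </form>
</vxml>
```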

The standard syntax and semantics of the <record> element apply to video recording. One optional <record> feature - using voice activity detection to begin and end recordings - is not always appropriate when recording video. For most use cases, such as videomail, video recording should begin immediately after the prompts are played and end only when speech/DTMF hotword input is received, the timeout expires, or the call is terminated. Consequently, VoiceXML platforms that support video tend not to support voice activity detection for video recording and so will ignore any value set for the "finalsilence" attribute.
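The termination conditions described above can be sketched with standard VoiceXML 2.0 attributes and shadow variables: "dtmfterm" lets a key press end the recording, "maxtime" bounds its length, and the "maxtime" and "duration" shadow variables report how the recording ended. The field name "msg" and the prompt file are hypothetical, and whether "finalsilence" has any effect on video is platform-dependent.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="leave_message">
    <!-- Recording ends on a DTMF key press, after 30 seconds,
         or when the caller hangs up. -->
    <record name="msg" type="video/3gpp" beep="true"
            maxtime="30s" dtmfterm="true">
      <audio src="http://www.example.com/recordprompt.3gp"/>
      <filled>
        <!-- msg$.maxtime is true if the time limit ended the recording. -->
        <if cond="msg$.maxtime">
          <prompt>Your message reached the maximum length.</prompt>
        </if>
        <!-- msg$.duration is the recording length in milliseconds. -->
        <log>Recorded <value expr="msg$.duration"/> ms</log>
      </filled>
    </record>
  </form>
</vxml>
```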

Features Planned for Future VoiceXML Versions

While VoiceXML 2.0 can already address the fundamental aspects of video media interaction, work is ongoing in the W3C Voice Browser Working Group to extend VoiceXML with additional media features, while retaining its fundamental focus as a voice-oriented application authoring language.

Note that the features described in this section are tentative and not yet committed features of VoiceXML 3.0!

Although VoiceXML provides signaling information about the attached connection in its "session.connection" object, it does not currently specify media information. Media information can be exposed for both the "local" and "remote" parts of the connection, where each part specifies the media streams it supports in terms of their type (e.g. "audio" or "video") and format (e.g. for video, H.263 encoding, 10 frames per second, etc). Such information can then be used to determine, for example, whether the video or audio prompts are to be used with an incoming connection.
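To make the idea concrete, the sketch below chooses between a video and an audio prompt. It is purely illustrative: the property "session.connection.remote.media" and its array-of-objects shape are hypothetical names invented for this example, not part of VoiceXML 2.0 or of any published draft, and the prompt file names are also assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <var name="hasVideo" expr="false"/>
  <script><![CDATA[
    // HYPOTHETICAL: assumes session.connection.remote.media is an array
    // of objects such as {type: 'video', format: ...}. VoiceXML 2.0 does
    // not define this property, so on current platforms the list is empty.
    var streams = session.connection.remote.media || [];
    for (var i = 0; i < streams.length; i++) {
      if (streams[i].type == 'video') { hasVideo = true; }
    }
  ]]></script>
  <form id="greeting">
    <block>
      <if cond="hasVideo">
        <prompt><audio src="http://www.example.com/welcome.3gp"/></prompt>
      <else/>
        <prompt><audio src="http://www.example.com/welcome.wav"/></prompt>
      </if>
    </block>
  </form>
</vxml>
```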

Another area for enhancement is support for playing video media. Using <audio> to play back non-audio media resources isn't intuitive. Rather than introduce more media-specific elements, e.g. <video>, it is preferable to introduce an extensible element which can be typed for various media resources - a new <media> element.

The <media> element contains a "type" attribute indicating the preferred media type of the resource; for example,

   <media type="video/3gpp" src="http://www.example.com/generator"/>

The preferred media type can be used to indicate to a web server which media file to select when the media is available in various file formats. Furthermore, since a 3GPP file is a format containing one or more media tracks, the parameter "Codecs" can be added to the media type so developers can unambiguously identify the media tracks required (http://www.ietf.org/rfc/rfc4281.txt). For example,

  <media type="video/3gpp; Codecs='s263'" src="http://www.example.com/generator"/>

indicates that a media file in the 3GPP format containing only H.263 video is required.

This approach also has the benefit that the preferred type can be used when the protocol does not provide a media type. For example, when a resource is referenced with a file URI, the file protocol does not provide an authoritative media type, so the developer can specify the media type themselves using the "type" attribute. Another benefit is that SSML documents can be referenced directly; for example,

  <media type="application/ssml+xml" src="http://www.example.com/myssml.ssml"/>
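For the file-protocol case mentioned above, a reference might look like the following sketch. The <media> element is tentative syntax, and the file path is hypothetical, chosen only to show a resource for which no authoritative media type is available.

```xml
<!-- Tentative <media> syntax; the file path is hypothetical. -->
<prompt>
  <!-- No server supplies a media type for a file URI, so the
       "type" attribute tells the platform how to treat the resource. -->
  <media type="video/3gpp" src="file:///var/media/welcome.3gp"/>
</prompt>
```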

Another feature is support for simultaneous media playback from multiple resources. With connections which support multiple media streams, the audio and video tracks from a 3GPP resource can be simultaneously played back on the connection's media streams. Using a single resource has the benefit that audio and video can be tightly synchronized during transmission. However, there are also use cases for simultaneous media playback where the media is specified in separate resources:

  • Videomail: an audio message has been left using a conventional audio only system. For playback on a system with video support, a video resource can be played simultaneously with an image of the person, or an avatar.
  • Enterprise: a video stream resource from a security camera with TTS voiceover providing additional information.
  • Education: a video resource showing a medical procedure with commentary provided by a lecturer in the student's language.
  • Talking heads: an animated avatar together with audio or TTS voiceover.

To support these use cases, the <prompt> element is enhanced with a "par" attribute to indicate whether the <media> elements are to be played sequentially (the default) or in parallel. For example,

  <prompt par="true">
     <media type="application/ssml+xml" src="commentary.ssml"/>
     <media type="video/3gpp; Codecs='s263'" src="avatar.3gp"/>
  </prompt>

where the synthesized commentary would be played on the audio stream while the animated avatar is played on the video stream of the connection.  

On the video recording side, features are being developed to provide more control over recording and to allow prompt playback during a recording.

In cases where a container media format, such as "video/3gpp", is set as the media type for <record>, it is possible to record only the video stream, only the audio stream or both streams from the connection. To provide developer control over this, a new property "recordmodes" is specified. The property can have one or more types (e.g. "audio" or "video") as its value (the default value is "audio"). For example,

  <record name="myrecording" type="video/3gpp" beep="true" maxtime="30s">
     <property name='recordmodes' value='audio video'/>
  </record>

records both the audio and video streams. Other new properties will give developers more fine-grained control over the use of voice activity detection during recording. The properties "record.vad.initial" and "record.vad.final", each with boolean values, indicate whether voice activity detection may be used to initiate and terminate recording.
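Taken together, the proposed recording properties might be used as in the sketch below, which records both streams and disables voice activity detection at the start and end of the recording. The property names are the tentative ones described above and are not supported by VoiceXML 2.0; the prompt file name is hypothetical.

```xml
<record name="myrecording" type="video/3gpp" beep="true" maxtime="30s">
  <!-- Tentative property: record both audio and video streams. -->
  <property name="recordmodes" value="audio video"/>
  <!-- Tentative properties: do not wait for speech before starting,
       and do not stop on trailing silence. -->
  <property name="record.vad.initial" value="false"/>
  <property name="record.vad.final" value="false"/>
  <audio src="http://www.example.com/recordprompt.3gp"/>
</record>
```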

There are use cases for extending <record> to allow a prompt to be played during recording:

  • Audio 'warning' feedback to indicate that the recording is about to terminate.
  • Entertainment applications like karaoke where the user is recorded while they sing along to a music prompt
  • Visual feedback, e.g. an egg timer or animated countdown, indicating recording time left  

To address these, <media> elements have an additional boolean attribute "playonrecord" to indicate whether they are to be played during recording. For example,

  <record name="myrecording" type="video/3gpp" beep="true" maxtime="30s">
     <media playonrecord="true" type="video/3gpp" src="eggtimer.3gp"/>
  </record>

where "eggtimer.3gp" would be played at the onset of the recording. The "playonrecord" attribute is ignored if the <media> element is not a descendant of a <record> element.

Video Capabilities Available Today

Voxpilot offers both 3G-324M and SIP video-over-IP capabilities in its VoiceXML Media Server and Call Server products. Voxpilot's customers have deployed video services in several countries across Europe. A good example of a VoiceXML-powered interactive media service is the award-winning "Label Studio TV" service developed by Universal Music Mobile and deployed on France's SFR network (see http://www.labelstudio.fr/lstv/).

Hewlett-Packard's OpenCall Media Platform Video product for 3G networks provides video support in CCXML and VoiceXML. Hewlett-Packard partners have developed and deployed video services, such as video mail, video-enabled 3G contact centers, video portals and video blogging. One of the most innovative VoiceXML deployments is a service nominated for a 3GSM 2006 Award where fans can make a video call to BBC's 'Football Focus' and record a message - selected messages are then played back and discussed during the program (see http://www.voxsurf.com/news/060117-3GSMnominee.htm).


In this article, we discussed advances in modern video telephony and how they are being used to deliver richer, more intuitive IVVR services. VoiceXML, as it is available today, is sufficiently expressive to create compelling video services. New features are being planned for future versions of VoiceXML, which will enable even more compelling applications to be delivered.
