July 3, 2006
By James A. Larson, program co-chair, SpeechTEK 2021
Forward Thinking

Multimodal Applications' Architectures

Multimodal applications promise to dramatically change the speech industry once mobile devices and high-bandwidth communication channels are widely available. Developers use a variety of architectures to implement multimodal applications that combine visual and verbal components:

Voice API: VoiceSignal[1]

Developers write multimodal applications using sets of application programming interfaces (APIs) for the voice and visual components. Currently, VoiceSignal provides the most widely used speech recognition and speech synthesis on mobile devices. Developers use VoiceSignal's proprietary APIs to access automatic speech recognition (ASR), text-to-speech (TTS), keypad, and display functions (Figure 1a).

Extend HTML with voice tags: SALT[2], X+V[3], Conversay[4]

Microsoft, IBM, and Conversay have provided new tags that developers embed into XHTML to add voice input to a traditional graphical user interface (GUI) (Figure 1b). Users must download plug-ins for Internet Explorer (and Opera for X+V) so they can browse HTML documents using voice commands. There are only about six SALT tags, making it easy for experienced Web developers to use them. X+V (short for XHTML+Voice) consists of VoiceXML 2.0 modules that developers embed into XHTML. When using Conversay, if a link does not have text associated with it, the Conversay plug-in generates a digit to represent the link and displays that digit on the screen. Users only need to say the name of the link, or the digit shown for it, to jump to a new page. The Conversay tags are proprietary.
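The link-labeling behavior described above can be illustrated with a short Python sketch. This is not Conversay's code or API, which is proprietary; the function names build_spoken_labels and resolve, the example.com URLs, and the rule of numbering unlabeled links from 1 are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical link representation: (visible text, target URL).
# The visible text is empty for links such as unlabeled images.
Link = Tuple[str, str]


def build_spoken_labels(links: List[Link]) -> Dict[str, str]:
    """Map each speakable label to a URL, numbering links that have no text."""
    labels: Dict[str, str] = {}
    next_digit = 1
    for text, url in links:
        name = text.strip().lower()
        if name:
            labels[name] = url
        else:
            # No text to speak, so assign a digit the page could display.
            labels[str(next_digit)] = url
            next_digit += 1
    return labels


def resolve(utterance: str, labels: Dict[str, str]) -> Optional[str]:
    """Return the URL for a recognized utterance, or None if nothing matches."""
    return labels.get(utterance.strip().lower())


links = [
    ("News", "https://example.com/news"),
    ("", "https://example.com/banner"),
    ("Contact", "https://example.com/contact"),
    ("", "https://example.com/logo"),
]
labels = build_spoken_labels(links)
```

Here the two unlabeled links receive the labels "1" and "2", so resolve("2", labels) would select the logo link while resolve("News", labels) selects the news page.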

Provide a voice agent on top of XHTML: Vangard[5]

Vangard provides a tool for developers to add a verbal agent on top of an existing visual form (Figure 1c). The user can enter values into fields using the traditional approach (mouse and keypad), or speak a navigation command to locate a field and then speak the value to be entered into it. Users may switch between mouse/keypad and speech input at any time. With this approach, the underlying HTML code remains unchanged, making it easy to add a verbal interface to a visual form. However, dynamically generated forms may not fit a static verbal interface specification. The code that dynamically generates a form must be extended to also generate the verbal agent for that form.
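A minimal Python sketch of the idea, and of the dynamic-form limitation, follows. It does not reflect Vangard's tool; the FormVoiceAgent class, the "go to" command phrase, and the dictionary standing in for the visual form are hypothetical.

```python
from typing import Dict, Optional


class FormVoiceAgent:
    """Hypothetical verbal agent layered over a form it does not modify."""

    def __init__(self, form: Dict[str, str]) -> None:
        self.form = form  # the same data the mouse and keypad update
        # Static verbal specification: field names known when the agent is built.
        self.vocabulary = frozenset(form)
        self.focus: Optional[str] = None

    def hear(self, utterance: str) -> bool:
        """Handle one utterance; return True if it was understood."""
        spoken = utterance.strip()
        command = spoken.lower()
        if command.startswith("go to "):
            # Navigation command: locate a field by name.
            target = command[len("go to "):]
            if target in self.vocabulary:
                self.focus = target
                return True
            return False
        if self.focus is not None:
            # Any other utterance is the value for the focused field.
            self.form[self.focus] = spoken
            return True
        return False


form = {"name": "", "city": ""}
agent = FormVoiceAgent(form)
agent.hear("go to city")
agent.hear("Portland")
form["name"] = "Ada"  # keypad entry, interleaved with speech

# A field added after the agent was built is unknown to the static agent.
form["phone"] = ""
stale_result = agent.hear("go to phone")
# Regenerating the agent along with the form restores coverage.
fresh_result = FormVoiceAgent(form).hear("go to phone")
```

The stale agent rejects "go to phone" because its vocabulary was fixed when it was created, while an agent rebuilt from the updated form accepts it.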

W3C Interaction Manager[6],[7]

The W3C Multimodal Interaction Working Group has taken a different approach. Rather than inserting speech tags into HTML code, the Working Group uses a separate Interaction Manager (Figure 1d) to control the flow of multimodal applications. Developers will use a new W3C language, State Chart XML (SCXML), to specify the sequence of actions and invoke voice and GUI functions. State charts, an extension of state transition systems, have been widely used in software design. Using SCXML, developers specify when to send events to individual modalities such as voice (implemented by VoiceXML 3.0) and GUI (implemented by XHTML) and what to do when SCXML receives events from the modalities. Thus, control is factored out of the modalities (VoiceXML 3.0 and XHTML) and placed in SCXML, which uses the modalities only to present output to the user and collect input from the user. A variety of modalities will be possible beyond VoiceXML and XHTML, including SVG, InkML, and special supermodalities such as lip reading, which uses both ASR and vision as input.
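A rough Python sketch of this factoring follows. It is not SCXML and does not follow any W3C API; the class names Modality and InteractionManager, the state and event names, and the prompt strings are all hypothetical. It only shows control living in one transition table while the voice and GUI components present output.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]  # (modality name, command to send)


class Modality:
    """Presentation component: it carries out commands and holds no dialog logic."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[str] = []

    def handle(self, command: str) -> None:
        # A real component would speak a prompt or redraw a page here.
        self.log.append(command)


class InteractionManager:
    """Holds all dialog control as (state, event) -> (next state, actions)."""

    def __init__(self, initial: str, modalities: Dict[str, Modality]) -> None:
        self.state = initial
        self.modalities = modalities
        self.transitions: Dict[Tuple[str, str], Tuple[str, List[Action]]] = {}

    def on(self, state: str, event: str, target: str, actions: List[Action]) -> None:
        self.transitions[(state, event)] = (target, actions)

    def receive(self, event: str) -> str:
        key = (self.state, event)
        if key not in self.transitions:
            return self.state  # no transition for this event in this state
        target, actions = self.transitions[key]
        for modality_name, command in actions:
            self.modalities[modality_name].handle(command)
        self.state = target
        return self.state


voice = Modality("voice")
gui = Modality("gui")
im = InteractionManager("idle", {"voice": voice, "gui": gui})
im.on("idle", "start", "ask_city",
      [("voice", "prompt: Which city?"), ("gui", "show: city field")])
im.on("ask_city", "city_filled", "confirm",
      [("voice", "prompt: Is that correct?"), ("gui", "show: confirm button")])
im.on("confirm", "yes", "done", [("gui", "show: thank you")])

for event in ["start", "city_filled", "yes"]:
    im.receive(event)
```

Because the transition table names modalities only by key, a different presentation component could be registered under "gui" without touching the dialog itself.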

Conclusion

The VoiceSignal, SALT, X+V, and Conversay approaches are available today, while the W3C approach is still under development. The VoiceSignal, SALT, and X+V approaches use an existing language to implement the Interaction Manager, which controls the dialog between the computer and user, and embed it in the user interface presentation specifications. The W3C approach factors the Interaction Manager out of the presentation and writes it separately in a control language such as SCXML or CCXML, or in a programming language. This separation of functions provides flexibility: developers can easily change the presentation for alternative devices without modifying the dialog.

James A. Larson is manager of advanced human input/output at Intel Corporation and is author of the home study guide and reference "The VXMLGuide" (www.vxmlguide.com). His Web site is www.larson-tech.com.

[1] http://www.voicesignal.com/

[2] http://www.saltforum.org/

[3] http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb

[4] http://www.conversay.com/

[5] http://www.vangardvoice.com/

[6] http://www.w3.org/2002/mmi/

[7] http://www.w3.org/Voice/
