
Cross-Platform, Cross-Vendor Access to Speech

Speech technology is becoming widely available for use in real applications in personal and enterprise computing. The Java™ Speech API, developed by Sun Microsystems, Inc. and industry partners, defines a software interface for speech recognizers and speech synthesizers on the Java platform. The Java Speech API enables developers of speech-enabled applications to incorporate more sophisticated and natural user interfaces into Java applications and deploy them on a wide range of computing platforms.

Sun's primary goal in developing the Java Speech API is to provide an easy-to-use, cross-vendor, cross-platform interface to state-of-the-art speech technology. The API defines a standard software interface for "Write Once, Run Anywhere(tm)" access to speech synthesizers, command-and-control recognizers, and dictation systems.

The Java Speech API is one of the Java Media and Communications APIs. Other media APIs provide capabilities such as playback of streaming media (audio, video etc.), computer telephony, 2D and 3D graphics, animation and advanced imaging.

In this article we explore the Java Speech API by providing an overview of its design and its use in deploying speech applications.

Design Goals

Platform Overview

Here is a brief overview of the major components of the Java platform.

  • The Java programming language is simple, object-oriented and easy to learn, particularly for C/C++ programmers. It has built-in garbage collection, strong type checking and a well-defined set of platform APIs which, combined with the language's simplicity, facilitate faster development and more reliable code. Its built-in security, network support, multi-threading and portability make it well-suited to Internet and intranet applications and client-server computing.
  • The Java virtual machine provides a standard environment in which Java applications are run. More precisely, it runs Java byte-code, which is produced by compiling Java programs. Compliance testing for VMs ensures "Write Once, Run Anywhere(tm)". The Java VM has been ported to computing devices ranging from small embedded devices and handhelds to PCs, workstations, mainframes and supercomputers. Java VMs are included in browsers and many other software packages. The widespread availability of Java VMs gives Java applications unprecedented portability.
  • The Java APIs provide standard functionality on the VM. There are three Java Application Environments that specify the minimum API sets for different computing environments. Java, PersonalJava(tm), and EmbeddedJava(tm) define API sets for successively smaller computing devices. Standard extensions, including the Java Speech API, are Java APIs that are optionally included with a VM.

In addition to the primary design goals, the Java Speech API needed to be internationalized, needed to enable access to existing state-of-the-art recognizers and synthesizers, and needed to be appropriate to a range of different computing environments.

Internationalization is supported in a number of ways. First, with native support for the Unicode character set, which includes over 35,000 characters, the Java programming language provides coverage of nearly all living languages. Second, the software interfaces and data formats defined by the Java Speech API are designed for multilingual use. Finally, the API allows concurrent use of speech recognizers and synthesizers of different languages (when supported by speech vendors).

Access to existing speech technology is important to customers who have already installed speech products on their computers. It is also important to speech vendors to minimize the work required to support the API. Existing speech recognizers and synthesizers with appropriate functionality can support the Java Speech API by using a combination of Java code and calls to C/C++ through the Java Native Interface. Even when a recognizer or synthesizer uses native code, the Java application can remain portable.

The Java Speech API was designed to enable speech input and output for (1) PCs and workstations, (2) computer telephony systems, and (3) small, personal devices. With the Java platform, the API can meet the application requirements of these environments despite often substantial differences in computing power, operating system, type of speech recognition and synthesis technology, and diverse speech applications.

A client/server architecture, in which the compute-intensive speech technology runs on a server, is also a desirable way to deliver speech I/O to these computing environments. Java's built-in networking capabilities simplify this task.

For Developers

With the Java Speech API, we aim to open new possibilities for speech application development and new markets for speech technology.

The three key developer audiences are:

• Java application developers who want to improve user interfaces and product appeal;
• Speech application developers who will be able to develop and deploy their applications in Java with its benefits of reduced development time and greater portability;
• Speech technology companies who want to deploy their speech recognition and synthesis technology to the large market provided by the Java platform.

To ensure that the API meets the needs of these three groups, Sun established collaborations to define the draft specification. The partners were Apple Computer, Inc., AT&T, Dragon Systems, Inc., IBM Corporation, Novell, Inc., Philips Speech Processing, and Texas Instruments Incorporated.

By placing the draft specification of the Java Speech API on the Internet, Sun has made the API available for review and comment by other speech technology companies, by application development companies and by the public. As with other Java APIs, this open review process is important to ensuring a robust specification.

Portability

To ensure a high level of portability of speech-enabled Java applications, the Java Speech API specifies, in detail, the functionality of a compliant speech recognizer or speech synthesizer. In addition to the definition of the software interfaces, specifications are provided for recognition grammars and for text input to speech synthesizers.

The advantages of a standard specification include improved cross-vendor and cross-platform portability for applications. Application portability is particularly important in Internet applications.

A limitation of the compliance model is that it restricts vendor-specific functions, though not to the extent that speech vendors cannot compete on the performance or price of their speech products.

In vertically integrated applications and applications tied to a particular platform, developers may choose to use vendor-specific extensions to the Java Speech API. The application will no longer be portable and the speech technology will not be interchangeable.

Synthesis Markup

Effective use of speech synthesizers often requires text markup to ensure correct pronunciation and natural phrasing, emphasis and speaking rate. Unfortunately, today's synthesizers use proprietary, non-portable tags to mark such information. For example, some tags used to mark emphasis for existing synthesizers include \em, \", and ["].

Working with our partners, Sun developed and released for public comment the Java Speech Markup Language. JSML provides the necessary features for control of speech synthesis and uses a standard markup style (similar to HTML). The markup can be reasonably translated into existing synthesizer tags, minimizing the effort required for existing speech synthesizers to support it, while providing the advantages of a single portable representation.

The following simple JSML extract illustrates its use. The text, "Message from John Doe regarding magazine article," uses emphasis and pausing to more clearly convey the message.

  <PARA>Message from <EMP>John Doe</EMP> regarding <BREAK/>
  <PROS RATE="-20%">magazine article</PROS>.</PARA>

The PARA tags indicate the start and end of a paragraph. The EMP tags indicate that "John Doe" should be emphasized. The BREAK tag inserts a pause to give additional emphasis to the message subject. The PROS (prosody) tags indicate that the subject, "magazine article," should be spoken 20% slower to make it more understandable.

JSML was designed for cross-platform, cross-vendor control of speech synthesizers. With an open specification process involving many speech technology companies and a public review, JSML has been designed for widespread use. Its use is not restricted to the Java platform or the Java Speech API.

Internet Speech Applications

For many companies, their web site is integral to their image and to their business. The ability to effectively incorporate speech input/output into a web site could benefit both image and customer satisfaction.

The availability of Java in major browsers and Java's built-in security make Java ideal for downloading applications. For the Java Speech API to enable spoken interaction in downloaded applets, client machines must either have pre-installed speech technology or an Internet connection fast enough for the recognition to be run from a remote server. The downloaded applet must also have security clearance to use a recognizer (would you like an unknown applet listening to your office through a dictation product?).

It may be several years before all these conditions are met, but when they are, the way in which people interact with computers could be dramatically enhanced.

Recognizer Grammars

As with speech synthesizers, existing speech recognizers use a range of proprietary grammar formats. To ensure cross-vendor, cross-platform control of speech recognition, Sun and its partners developed the Java Speech Grammar Format and released it for public review and comment.

JSGF has a common heritage with other grammar formats, for example, standard BNF, the Speech Recognition Command Language from the SRAPI committee, and Sun-proprietary grammar formats. Thus, it will be familiar to developers of speech applications.

What JSGF adds is precise definitions of how to write and interpret grammars (including semantics of tokenization, recursion, and import/export) to ensure a high level of compatibility between recognizers. JSGF incorporates the style of the Java programming language for importing and exporting of rules to simplify the task of building complex grammars from smaller grammar components.
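As an illustration, a small command grammar in JSGF might look like the following sketch (the grammar name, imported rule and vocabulary are invented for this example):

  #JSGF V1.0;
  grammar com.acme.commands;
  import <com.acme.politeness.startPolite>;

  // A public rule can be activated by an application or
  // imported by other grammars.
  public <command> = [<startPolite>] <action> <object>;
  <action> = open | close | delete;
  <object> = [the | a] (window | file | menu);

Square brackets mark optional items, the vertical bar separates alternatives, and the import statement lets this grammar reuse a rule exported by another grammar, in the style of a Java import.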

With JSGF we hope to encourage the development of grammar tools that improve developer productivity and facilitate the re-use and sharing of grammar components.

In addition to being the default grammar format of the Java Speech API, JSGF can be deployed in other environments and can be automatically translated into most proprietary grammar formats.

Software Interfaces

In this section we explore briefly the functionality of the Java Speech API as defined by the software interface. The full specification and an accompanying Programmer's Guide are available for your review and comment from: http://java.sun.com/products/java-media/speech/

The Java Speech API is divided into three packages. In the Java programming language, a package is a named collection of classes and interfaces.

The three packages are shown below. The root package, javax.speech, defines the features and capabilities of a generic speech engine. The sub-packages, javax.speech.recognition and javax.speech.synthesis, define the features of specialized types of speech engines, specifically recognizers and synthesizers. (The "x" in "javax" signifies that the Java Speech API is a standard extension to the Java platform.)

[Figure 1: The three packages of the Java Speech API (javax.speech, javax.speech.recognition and javax.speech.synthesis).]

javax.speech

Any speech engine must provide a minimum feature set. These features are defined by the classes and interfaces of the javax.speech package. Any speech engine must provide (a) the ability to be created, deallocated, paused and resumed, (b) an object to describe itself, (c) the ability to connect to audio input or output, (d) notifications of engine events, and (e) an optional vocabulary management system.
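As a minimal sketch, the lifecycle of a speech engine under the draft API looks roughly like this (the names follow the draft javax.speech.Engine interface and may change before the final release):

  import javax.speech.*;

  void useEngine(Engine engine) throws Exception {
      engine.allocate();                         // acquire engine resources
      engine.waitEngineState(Engine.ALLOCATED);  // allocation may complete asynchronously
      engine.pause();                            // suspend audio input/output
      engine.resume();                           // resume audio input/output
      engine.deallocate();                       // release the engine
  }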

The javax.speech package also provides a centralized mechanism through which engines register themselves and through which applications can select engines. Since applications and users have different functional requirements, the engine selection mechanism is based on defined characteristics.

For all engines, a locale is required. A locale defines a language, country and optional variant using ISO codes. For example, "de_CH" indicates Swiss German. Both the speech recognition and speech synthesis packages extend the basic engine selection system. A recognizer is defined by additional features, including whether it supports dictation and the names of speakers that have trained the recognizer (if it is trainable). A synthesizer has additional features such as descriptions of all its speaking voices (with gender, age, name and style).
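For example, under the draft specification an application might request a Swiss German synthesizer roughly as follows:

  import java.util.Locale;
  import javax.speech.Central;
  import javax.speech.synthesis.*;

  // Ask the central registry for a synthesizer matching the locale.
  SynthesizerModeDesc required = new SynthesizerModeDesc(new Locale("de", "CH"));
  Synthesizer synth = Central.createSynthesizer(required);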

javax.speech.synthesis

A speech synthesizer is a specialized type of speech engine. In object-oriented terminology, a speech synthesizer extends the interface of a speech engine. Thus, the speech synthesis package of the Java Speech API provides the classes and interfaces that are required for a speech synthesizer.

The core function of a speech synthesizer is speaking text. A synthesizer that supports the Java Speech API provides the ability to speak text provided in the Java Speech Markup Language and to speak plain text. The synthesizer provides a speaking queue for text to be spoken and provides mechanisms to monitor and control that queue.
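Continuing the earlier sketch, queuing text for output and waiting for the queue to empty might look like this (draft names; error handling omitted):

  // Queue JSML text and plain text, then wait for the queue to empty.
  synth.allocate();
  synth.resume();
  synth.speak("<PARA>Message from <EMP>John Doe</EMP>.</PARA>", null);
  synth.speakPlainText("A plain text message.", null);
  synth.waitEngineState(Synthesizer.QUEUE_EMPTY);
  synth.deallocate();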

Any Java object can be spoken if it implements a simple interface to provide JSML text. For example, an application can make database entries, spreadsheet cells, text editing windows and email messages speakable.
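For example, a hypothetical email class might implement the interface along these lines (the single getJSMLText method follows the draft specification; the EmailMessage fields are invented):

  import javax.speech.synthesis.Speakable;

  class EmailMessage implements Speakable {
      String sender, subject;

      // Return the message header as JSML for the synthesizer to speak.
      public String getJSMLText() {
          return "<PARA>Message from <EMP>" + sender + "</EMP> regarding "
                  + subject + ".</PARA>";
      }
  }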

As text is spoken, the synthesizer is required to provide events that notify the application when audio output commences and completes, when each word is spoken, when markers placed in the text are reached and when output is paused, resumed or canceled.
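A sketch of listening for those events, again using draft class names:

  import javax.speech.synthesis.*;

  SpeakableListener listener = new SpeakableAdapter() {
      public void wordStarted(SpeakableEvent e)    { /* e.g. highlight the current word */ }
      public void markerReached(SpeakableEvent e)  { /* a marker in the text was reached */ }
      public void speakableEnded(SpeakableEvent e) { /* audio output has completed */ }
  };
  synth.speakPlainText("Message from John Doe.", listener);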

A speech synthesizer is also required to allow the pitch, speaking rate, volume and speaking voice to be modified.
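Using the draft names, such run-time adjustments might look like this:

  import java.beans.PropertyVetoException;
  import javax.speech.synthesis.*;

  void adjustVoice(Synthesizer synth) throws PropertyVetoException {
      SynthesizerProperties props = synth.getSynthesizerProperties();
      props.setSpeakingRate(180.0f);  // approximate words per minute
      props.setPitch(120.0f);         // baseline pitch in hertz
      props.setVolume(0.8f);          // 0.0 (silent) to 1.0 (loudest)
  }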

javax.speech.recognition

The primary function of a speech recognizer is processing audio input to determine what a user has said. To achieve this, an application provides the recognizer with grammars and the recognizer returns results to the application that indicate the words it heard.

A grammar defines what words a user may say and the patterns in which those words occur. The Java Speech API supports two types of grammar. A rule grammar defines words and patterns of words using the Java Speech Grammar Format or a programmatic equivalent. Rule grammars are defined by applications according to the current context and the application's expectation of what the user might say next.
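Loading a JSGF rule grammar might look like the following sketch (draft API; "commands.gram" is a hypothetical grammar file, and error handling is omitted):

  import java.io.FileReader;
  import javax.speech.recognition.*;

  // Load a rule grammar from a JSGF file and make it active.
  RuleGrammar grammar = recognizer.loadJSGF(new FileReader("commands.gram"));
  grammar.setEnabled(true);    // eligible for recognition when the recognizer has focus
  recognizer.commitChanges();  // apply the pending grammar changes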

A dictation grammar is closer to the ideal of allowing free-form spoken input. Dictation grammars are built into a recognizer and typically use statistical models developed by analyzing large amounts of text.

A result is provided to an application each time the recognizer hears spoken input matching one of the currently active grammars. The result includes the list of words spoken, a reference to the matched grammar and may include information on alternative guesses, audio data, and timing information.
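A sketch of receiving results through a listener, once more with draft class names:

  import javax.speech.recognition.*;

  recognizer.addResultListener(new ResultAdapter() {
      // Called when spoken input matches an active grammar.
      public void resultAccepted(ResultEvent e) {
          Result result = (Result) e.getSource();
          ResultToken[] tokens = result.getBestTokens();
          for (int i = 0; i < tokens.length; i++)
              System.out.print(tokens[i].getSpokenText() + " ");
      }
  });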

The speech recognition package includes other advanced and sometimes complex features which are necessary to fully utilize the power of current speech recognizers. These features are beyond the scope of this article.

The Java Speech API allows Java application developers to use speech technology for the development of advanced user interfaces. It provides a standard, cross-vendor, cross-platform interface to speech synthesizers, command-and-control recognizers and dictation systems for use in a range of computing environments. The API supports state-of-the-art speech technology capabilities, but has been designed to be compact and easy to use. By actively involving speech companies and application developers in the specification process and opening the draft specification to public review and comment, Sun aims to provide an API that meets a wide range of needs.

Andrew Hunt is a Staff Engineer at Sun Microsystems Laboratories and has led the development of the Java Speech API. Further details are available at http://java.sun.com/products/java-media/speech/ or by email at andrew.hunt@east.sun.com.
