How Developers Can Choose the Right API Standard

While speech technology today still falls short in many ways, it is clear that there is no shortage of "standard" APIs. Application developers need to take the time to choose a development environment that will fit both current and future needs.

So far, no general-purpose recognition engine has managed to offer an effective interface into multiple applications.

Thus, while speech technology has been a godsend for many disabled users, general-purpose solutions tend to be ignored by the masses. The greatest productivity and usability gains have come from targeted, tightly integrated applications, which is why more applications are beginning to invest in speech integration. The more tightly integrated a solution is, the more likely end users are to accept it. Application developers prefer programming to standard APIs because doing so does not tie them to a specific technology vendor. So the question remains: which API should a developer write to?

This article reviews some of the commonalities and differences among the major speech API efforts available today, including (are you hungry for some acronyms?) MS SAPI, SRAPI, JSAPI, ECTF S.100 and MacinTalk.

Origins of Speech APIs

It turns out that none of the more common APIs available today came from a standards body. Instead, consortiums have formed to address common goals and needs.

Work on speech standards began in 1993, when IBM, Dragon Systems and Kurzweil AI announced their intention to produce a common API for dictation technology. Soon after, WordPerfect (and later Novell) took the lead in the effort and founded the Speech Recognition API (SRAPI) Committee. The SRAPI Committee thus has its roots in dictation technology, but it has since expanded to include support for Speech Synthesis, Command Technology, Speaker Verification and Speaker Identification.

Novell has developed network extensions to SRAPI that allow SRAPI applications to take advantage of speech technology running anywhere on the network. Speech clients connecting to vendor speech technology over an IP connection have been demonstrated on Windows 95, Windows NT and the Macintosh. OS/2 and Unix clients will be demonstrated in the near future.

Simultaneously with, and independently of, the SRAPI effort, Microsoft introduced the Microsoft Speech API (MS SAPI) for Windows. Though concentrating on Command Technology and Speech Synthesis, the specification also included support for dictation, and it was designed to work with simple telephony applications. MS SAPI is built on Microsoft's Component Object Model (COM).

In the telephony arena, Dialogic was actively promoting broad telephony standards, including multimedia extensions for Speech Recognition and Speech Synthesis, as part of SCSA (Signal Computing System Architecture).

To promote SCSA as a non-proprietary standard, Dialogic presented it to the Enterprise Computer Telephony Forum (ECTF). Interestingly, the ECTF is not constituted as a standards body; its stated goal is to address interoperability issues for CTI applications. ECTF therefore does not aim to compete with other telephony standards such as TSAPI (Telephony Services API), TAPI (Telephony API), or Java Telephony (JTAPI). It meets its stated goals by supporting both TAPI and TSAPI in its approved specifications.

The first software specification approved by ECTF is S.100 ("S" for Software, 100 for release 1). The S.100 specification is designed for application developers, while the soon-to-be-released S.300 specification is designed for technology vendors: the application developer writes to S.100, and the technology vendor writes to S.300. Even though the S.300 specification is not yet final, Dialogic has begun to make systems available to application developers. S.100 is broad-based, with 14 working groups contributing to the specification. Two of these groups are dedicated to speech technology: the Speech Technology Working Group and the Time Varying Media Working Group.

Apple has been showing speech technology as part of the Mac OS for several years, yet it was not until the recently released System 7.5 that the technology was opened to application developers. The API for the Speech Recognition Manager and the Speech Synthesis Manager now allows applications to take advantage of the technology embedded in the OS.

Sun Microsystems recently announced that it will produce standard multimedia extensions for the popular Java platform, including the Java Speech API (JSAPI) to support both speech recognition and speech synthesis. A major advantage of Java is its "Write once, run anywhere" promise: developers can write software that runs on multiple platforms without rewriting the code for each one. Speech-enabled Java applets will be able to make Web pages speak and listen, and Java application developers will be able to add full speech integration to their cross-platform applications. Speech vendors will be able to implement speech technology in native Java code, or they can use Java wrappers and the Java Native Interface (JNI) to access existing speech technology. Using this approach, they can access technology that implements SRAPI, MS SAPI, the Apple Speech Managers, and vendor-specific APIs.
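The wrapper idea is not specific to Java. The following Python sketch, in which every class and engine name is hypothetical, shows its general shape: the application is written once against a single neutral interface, and a per-vendor wrapper is chosen at run time. The wrappers here only return strings; a real one would forward each call to a SRAPI, MS SAPI or vendor engine through native bindings.

```python
from abc import ABC, abstractmethod


class Synthesizer(ABC):
    """Hypothetical vendor-neutral interface that application code is written against."""

    @abstractmethod
    def speak(self, text: str) -> str:
        """Render text as speech; in this sketch, return a description instead."""


class SrapiWrapper(Synthesizer):
    # Hypothetical: a real wrapper would forward the call to a SRAPI engine
    # through native bindings (JNI, in the Java case).
    def speak(self, text: str) -> str:
        return f"SRAPI engine says: {text}"


class MsSapiWrapper(Synthesizer):
    # Hypothetical: a real wrapper would call the engine's COM interfaces.
    def speak(self, text: str) -> str:
        return f"MS SAPI engine says: {text}"


# Registry mapping a configuration name to the wrapper class that implements it.
WRAPPERS = {"srapi": SrapiWrapper, "mssapi": MsSapiWrapper}


def load_synthesizer(name: str) -> Synthesizer:
    """Pick a wrapper at run time; the application code below never changes."""
    try:
        return WRAPPERS[name]()
    except KeyError:
        raise ValueError(f"no wrapper installed for {name!r}") from None


def announce(synth: Synthesizer) -> str:
    # Application logic: depends only on the Synthesizer interface.
    return synth.speak("You have new mail")


if __name__ == "__main__":
    for engine in ("srapi", "mssapi"):
        print(announce(load_synthesizer(engine)))
```

Because the application only sees `Synthesizer`, supporting another engine means adding one wrapper and one registry entry rather than rewriting application code.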

Standards That Work Together

Most of the speech API efforts strive to work with each other. The ECTF Speech Recognition Working Group has adopted SRAPI's grammar specification for use with S.100, and SRAPI has in turn announced that it intends to be S.100 compliant in future versions. Sun has announced that it will provide integration layers for SRAPI, MS SAPI and MacinTalk. SRAPI is working on a solution that will use java.speech and CORBA to provide a cross-platform solution for SRAPI developers. Novell has demonstrated SRAPI for Java clients on Windows 95, Windows NT and Macintosh accessing technology over an IP connection.

These are just a few examples of how the various API efforts have pursued interoperability. Over time, the APIs will likely grow closer in functionality. This seems inevitable, since the consortiums tend to be made up of many of the same companies. For instance, Novell chairs the SRAPI Committee, is an active participant in the java.speech Working Group, and is also active in the ECTF effort. IBM, Dragon Systems, Philips and Kurzweil AI each sit on the MS SAPI Working Group, the SRAPI Committee and the java.speech Review Group. Many other examples of cross-pollination exist.

Each API committee or working group was formed to fill a specific need. The needs of an enterprise telephony system are vastly different from those of a desktop command technology system, so a single committee or working group that meets the needs of both is extremely unlikely. Each group will continue to move forward to meet the needs for which it was formed. We will see more collaboration and more robust compatibility layers between the efforts, but it is neither likely nor necessary for them to merge into a single API.

Choosing a Development Platform

Your choice of speech API should be based on several factors, including your target platform, your development environment, whether you will use telephony, and the type of speech technology you want to use. The table accompanying this article outlines the announced support targets for each of the API efforts.

Announced Support of the Various API Efforts (X = announced support, - = none announced)
Feature	SRAPI	MS SAPI	Mac	ECTF S.100	JSAPI
Platforms
Windows 95	X	X	-	X	X
Windows NT	X	X	-	X	X
Macintosh	X	-	X	X	X
OS/2	X	-	-	X	-
Unix	X	-	-	X	X
Environment
C++	X	X	X	X	-
COM	-	X	-	-	-
Java	X	-	-	-	X
Apple Events	-	-	X	-	-
Client/Server	X	-	-	X	X
Visual Basic	-	X	-	-	-
Telephony
TAPI	-	X	-	X	X
TSAPI	-	-	-	X	X
ECTF	X	-	-	X	-
Java Telephony API (JTAPI)	-	-	-	-	X
Technology
Command Recognition	X	X	X	X	X
Isolated Dictation	X	X	-	-	X
Continuous Dictation	X	-	-	-	X
Speech Synthesis	X	X	X	X	X
Speaker Verification	X	-	-	-	-
Speaker Identification	X	-	-	-	-
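One way to apply the table is to treat each row you care about as a requirement and keep only the APIs with announced support for all of them. The Python sketch below transcribes the table as printed; the names `SUPPORT` and `candidate_apis` are hypothetical, and announced support is not the same as shipping support, so the result is a shortlist rather than a final answer.

```python
APIS = ("SRAPI", "MS SAPI", "Mac", "ECTF S.100", "JSAPI")

# Announced support transcribed from the table: one character per API, in APIS order.
# "X" means announced support, "-" means none announced.
SUPPORT = {
    "Windows 95": "XX-XX",
    "Windows NT": "XX-XX",
    "Macintosh": "X-XXX",
    "OS/2": "X--X-",
    "Unix": "X--XX",
    "C++": "XXXX-",
    "COM": "-X---",
    "Java": "X---X",
    "Apple Events": "--X--",
    "Client/Server": "X--XX",
    "Visual Basic": "-X---",
    "TAPI": "-X-XX",
    "TSAPI": "---XX",
    "ECTF": "X--X-",
    "Java Telephony API": "----X",
    "Command Recognition": "XXXXX",
    "Isolated Dictation": "XX--X",
    "Continuous Dictation": "X---X",
    "Speech Synthesis": "XXXXX",
    "Speaker Verification": "X----",
    "Speaker Identification": "X----",
}


def candidate_apis(requirements):
    """Return the APIs with announced support for every requirement, in table order."""
    unknown = [r for r in requirements if r not in SUPPORT]
    if unknown:
        raise ValueError(f"not in the table: {unknown}")
    return [
        api
        for i, api in enumerate(APIS)
        if all(SUPPORT[r][i] == "X" for r in requirements)
    ]


if __name__ == "__main__":
    # A Unix telephony application that needs TSAPI and speech output.
    print(candidate_apis(["Unix", "TSAPI", "Speech Synthesis"]))
```

For the Unix/TSAPI/synthesis example, only ECTF S.100 and JSAPI survive, which matches reading the three rows by eye. An empty result, such as asking for COM together with Speaker Verification, signals that no single API covers the combination.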
