The Evolution of Global Speech Technology

The voice user interface is evolving into a standard means of communication between humans and technology, and it is having a profound influence on the way people live. As a result, the market for speech applications is growing worldwide. Following the necessary research and development, multilingual product offerings continue to expand, particularly in countries where advanced telecommunications technology is common. While voice interfaces span a wide range of markets, the two primary global markets for speech applications are telephony and telematics (automobiles). Other markets include PC-based dictation and embedded consumer electronics. Within these markets, common speech technologies include automatic speech recognition (ASR), text-to-speech (TTS) and speaker verification.

Perhaps the most significant business opportunities for speech applications are in the wireless market, which is well developed in the U.S. and even more so in several other countries. Managing personal needs with well-developed user interfaces is clearly the direction being taken in "mobile" speech application deployments. Other telephony applications (wireless and landline) include speech-enabled directory assistance, call center automation, voice commerce and content provisioning.

Outside the U.S., speech applications are quite prominent in Scandinavian countries. Italy and France are rich with deployments, followed by the UK and Germany. Applications in South America are also becoming prominent, with new deployments of voice-activated dialing and speech-enabled directory assistance. Without doubt, there is considerable market potential for speech in Asia. In fact, Japanese, Korean, Mandarin and Cantonese speech applications have been deployed for years. It is interesting to note that Microsoft has devoted significant resources to Japanese dictation development.
In general, PC-based dictation offers numerous advantages, and it is especially beneficial for Asian languages, where typing is more cumbersome. Since touch-tone keypads are global, "equivalent" digit recognition has been commercially available in dozens of languages for many years. Large vocabulary telephony ASR capabilities have continued to evolve, making the telephone a global, multi-purpose appliance. Some would argue the point, but TTS has only recently become "good"; over the past two years, technology advances have been quite impressive.

Among speech technology vendors, it is common to see product offerings in a wide variety of languages, with some languages more developed than others. For each language, the "product quality" is highly correlated with the development effort, which is often driven by business opportunity. For example, a vendor's English TTS may sound better than its Portuguese TTS simply because more effort was put into developing English. In addition, some languages are inherently more difficult to model than others.

Speech recognition products have been available for over three decades. Early products were, for the most part, speaker-dependent and could be used in any language. As time progressed, speaker-independent technology became viable, but every new language required special development efforts. With the exception of certain tonal languages (e.g., Mandarin and Cantonese), developing a new language involves training a language-agnostic ASR engine with appropriate speech data. Speech data is collected to model the phonetic sounds of the target language and the environment associated with the target applications. Starting from scratch, a new ASR language needs data from about two thousand different speakers. As a rule, the speech data should represent a wide range of accents and environmental conditions.
TTS products also require special development efforts for each language offering (actually, for each TTS voice). Beyond modeling each new language, building a TTS voice requires an acoustic inventory (a collection of speech audio). In contrast to speaker-independent recognition, a new TTS voice requires a significant amount of speech data from one speaker (as opposed to the population of speakers needed for ASR).

With the advent of standards like VoiceXML, application portability is becoming more common. However, application portability must be approached with caution because user interface design is culture dependent. For example, a speech application protocol that works well in Germany may not work at all in China. The SALT (Speech Application Language Tags) initiative should stimulate the Web development community to include the voice interface in multimodal applications. SALT is expected to become a global standard that will stimulate the global market for speech applications.

Technology in the global market continues to expand. From the telephone and the Internet to automobiles and handheld devices, usage is growing at unprecedented rates. Along with that growth, the use and deployment of speech recognition and the voice user interface will continue to expand, providing better means for humans and technology to connect.

Tom Schalk is an active board member and former president of AVIOS and currently serves as chief technical officer of Wirenix Corporation.
