Matching Technology and Application

If the Star-Trek Communicator existed, communicating with machines would not require learning a programming language or the 'pidginization' of language it often feels like we have to endure today. Communication ease is limited by the state of core technologies; improving these technologies can lead to more speech applications. Improved speech applications can also come from rethinking the voice user interface. However, the greatest bottleneck to widespread adoption of speech technologies is the difficulty of balancing the possibilities and limitations of speech technologies and matching them to compelling applications.

To use document preparation as a metaphor, you can improve the speed and ease of document preparation by building a better keyboard, as was long the goal of typewriter designs. User interface improvements led to development of easier carriage-returns, correction keys and changeable fonts. However, the most dramatic change, one that made typewriters obsolete, was recognizing the more abstract goal of document preparation and pulling in new technologies to enable word processing.

A deep knowledge of core technologies is not required to balance technology and applications appropriately, any more than a deep knowledge of the physics of car mechanisms is needed to improve gas mileage or to choose a car for a race or for pulling a trailer. However, knowing a few of the factors that affect performance can help greatly. Knowing that these factors can be traded off against each other, and against factors such as cost, speed, and storage, can also help in designing effective applications. Greatly simplified, these factors include:

For speech recognition: noisier speech is more difficult than quiet speech, casual speech is harder than careful speech, more variability is harder, and distinguishing more potential items and actions is harder.
For speaker identification, the tradeoffs are typically between security (letting no false users in) and convenience (not rejecting legitimate users).
For speech synthesis, the tradeoffs are typically between the naturalness of recorded speech samples and the flexibility of more robotic-sounding formant-based synthesis.
For natural language understanding, the tradeoffs are often between clarity (ensuring the user knows what the system can and cannot do) and efficiency (allowing the user to perform a great variety of tasks as quickly as possible).
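
To make the speaker identification tradeoff concrete, here is a minimal sketch in Python. It assumes a system that produces a match score for each attempt and accepts the speaker when the score reaches a threshold. The scores, the threshold values, and the function name `error_rates` are all hypothetical, chosen only for illustration; they do not come from any real system.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) for one threshold.

    A speaker is accepted when the match score is >= threshold.
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Hypothetical match scores (higher means "sounds more like the enrolled user").
genuine = [0.91, 0.84, 0.77, 0.68, 0.95, 0.59, 0.88, 0.73]   # legitimate users
impostor = [0.12, 0.35, 0.48, 0.62, 0.27, 0.71, 0.40, 0.19]  # false users

# Raising the threshold improves security but costs convenience.
for threshold in (0.5, 0.65, 0.8):
    far, frr = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold:.2f}  false accepts={far:.3f}  false rejects={frr:.3f}")
```

With these made-up scores, the lowest threshold lets in a quarter of the false users but rejects no legitimate ones, while the highest threshold lets in no false users but rejects half of the legitimate ones. Where to set the threshold is an application decision, not a technology decision.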

Although it's possible for technology to inspire applications, this is not usually the most fruitful approach. The creativity of technology developers usually arises from a deep knowledge of the core science and engineering. The tendency is to focus on one area of technical expertise and to improve the core technology there incrementally. This approach can lead to great science and/or great engineering, but rarely results in great applications unless it meets with a deep understanding of an application. In fact, it is useful for scientists and engineers to observe applications regularly, so that the factors that limit technology use can inspire research.

For example, 20 years ago, speech recognition focused on very careful speech in very quiet rooms with very expensive microphones; almost no work was conducted on noise robustness. Taking speech out of the lab required backing off on factors such as vocabulary size, but led to many important new results in, e.g., noise robustness and adaptation. Thus, balancing technology understanding (knowing to back off even to only 'yes/no' in the face of a wide variety of users and telephone noise) with application understanding ('yes/no' recognition can save millions of dollars in costs for collect calls) broadened the field of speech applications.

For the goal of changing the world by creating the next 'killer app,' more depth in more technologies might be relevant, but it will pay to focus on knowing how to avoid the edges where the technology becomes fragile. Knowing current applications can't hurt unless it leads to the assumption that current approaches are the only ones, or limits imagination. For example, in word processing, the goal is NOT to type on a keyboard efficiently, but to enter text, or to record and organize thoughts to share with others.
As another example, although there are probably gains to be made in searching for what might be shared across various DTMF applications, thinking about these as 'DTMF applications', or as 'voice applications', loses sight of the important fact that the goal is NOT likely to be poking numbered buttons or recording voice. These actions are a means to another end, which may vary with each particular application.

Ten years ago, speech applications appeared to be, for the most part, dictation applications. Carving out a relatively new application area (call centers), rather than playing catch-up in existing areas, proved to be successful for the next wave of speech applications. This high-risk, high-payoff approach can create new markets rather than simply divide up existing markets. Today, however, the success of speech in call center applications has led many new startups to assume that speech applications must be call center applications.

In sum, try this recipe for changing the world:
Gather as many technologies (or technology experts) as possible.
Abstract out an understanding of what is possible and how to stay in the areas of reliability.
Select an application and become an expert in what it accomplishes and by whom. (Here's where your social relevance goals, monetary goals, etc. might come into play.)
Abstract out the application and user goals and pull in all technology and user interface strategies that can accomplish the goals while meeting your other constraints (such as time, money, etc.).
Iterate, because you won't get it right the first time, and because all the parameters may change.

(This article is a summary of part of a tutorial I gave at the spring 2004 AVIOS/SpeechTEK meeting in San Francisco. I thank Mike Cohen and Alex Rudnicky for collaborating on the tutorial and for sharing thoughts that affected this article.)

Dr. Patti Price has over 20 years of experience in developing and transferring speech and language technology, including co-founding three companies (Nuance, BravoBrava, Soliloquy Learning). She specializes in speech interfaces, especially for applications in education and training. See www.pprice.com for more information.
