Is There a Future for Speech in Vehicles?

Today, speech recognition technology is becoming an important component in how people use and interact with their cars. It offers a natural interface to in-car navigation systems, media, information and remote communication functions. However, core challenges must be addressed to provide state-of-the-art speech solutions for rich in-vehicle telematics services and value-added solutions.

Many people associate speech in cars with science fiction movies and television shows where the cars act like R2D2 robots on wheels. In today’s world the main reason for using speech is less Hollywood and more pragmatic. In fact, it usually boils down to safety. Ask any automotive manufacturer and they will tell you that improving safety is all about decreasing the distraction level for the driver and increasing consistency of use regardless of a wireless network connection. They've found that driver distraction can be reduced quite effectively by the intelligent use of speech recognition technology.

The emphasis is on intelligent. Throwing a few features into a car for the gee-whiz factor is not what this business is about. Rather, it is about choosing the right way to embed reliable and useful speech functions that help drivers keep their concentration on the road and their passengers away from harm.

Given these goals, it’s not surprising to learn that voice dialing is the predominant catalyst for introducing voice recognition to cars. One of the most important applications is hands-free dialing, which is already mandated in many states in the US and countries worldwide. Taken a step further, you have the OnStar paradigm – a service designed to help drivers with everything from emergency assistance to directions. And finally, speech can be used to control the basic operations in the car – navigation control (e.g. destination specification), climate control, radio/CD controls – basically anything that would take your eyes off the road or hands from the wheel can be done with voice commands.

Safety is paramount but once car makers feel that they have met those needs they can work with their technology partners, ISVs and developers to break into new applications that provide overall ease of use and new services. Today's cars are phoning home to auto manufacturers, suppliers and service industries more frequently, and new emergency services let drivers obtain fast, relevant roadside assistance if there's a breakdown. A recent JD Power and Associates survey, rating customer satisfaction with in-car navigation systems, found three of the top five cars were from Honda and Acura and all contain voice recognition technology.

Challenges facing speech in cars
The car represents a very challenging environment for voice technologies. The challenges range from creating optimal operation in an unpredictable and noisy environment to dealing with very limited system resources, such as memory/CPU. Balancing the trade-offs between the demanding requirements of voice technologies and the available system capacity of the automotive environment is very complex. Also, drivers spend long hours in their cars and the quality of the conversation with the dashboard is critical to user acceptability. A distracting voice or a repetition of prompts might result in the customer returning the car to the dealership and asking to get the "annoying" voice disabled.

All speech recognition is highly dependent on the quality of the input audio representing what a person is saying. In a car there are many environmental attributes that can adversely affect the quality of the input audio. One of the major factors in good speech recognition accuracy is the signal-to-noise ratio (SNR) of the input audio. In this case, the “signal” is the user’s speech, and the SNR is the ratio of speech energy to background noise energy in the incoming audio. High SNR is best, when the speech signal is clearly more powerful than the background noise, but this is often not the case in the automotive environment. The general din in a vehicle - caused by the road, wind, air conditioning (A/C fan speed), windows, etc. - results in a very dynamic and noisy environment. Also, some speakers simply speak quietly, adding to the difficulty of recognition in such a harsh environment.
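To make the SNR definition concrete, here is a minimal Python sketch that computes the ratio in decibels from two short lists of audio samples. The function name, the sample values and the idea of having separate "speech" and "noise" segments are illustrative assumptions, not part of any particular recognition engine.

```python
import math

def snr_db(speech_samples, noise_samples):
    """Return the speech-to-noise energy ratio in decibels (illustrative helper)."""
    # Mean energy = average squared amplitude of each segment.
    speech_energy = sum(s * s for s in speech_samples) / len(speech_samples)
    noise_energy = sum(n * n for n in noise_samples) / len(noise_samples)
    return 10.0 * math.log10(speech_energy / noise_energy)

# Made-up sample values: the same utterance against two noise levels.
speech = [0.5, -0.5, 0.5, -0.5]
quiet_cabin = [0.05, -0.05, 0.05, -0.05]   # e.g. parked, fan off
highway = [0.25, -0.25, 0.25, -0.25]       # e.g. road and wind noise

print(round(snr_db(speech, quiet_cabin), 1))  # about 20 dB
print(round(snr_db(speech, highway), 1))      # about 6 dB
```

The same speech level yields a much lower SNR once cabin noise rises, which is the situation the recognizer faces at highway speed.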

The variable nature of noise in an automotive environment (type and intensity) dictates that the acoustic model’s training process (data and algorithms) and the recognition system cannot assume a single noise condition, but rather must account for a wide range of possibilities. Another aspect of the noise environment is the transient or burst noises that can occur. These are typically short in duration and high in energy content and can be confused with speech sounds. Clicks, bumps, horns and wipers, for example, are transient noises that are difficult for the speech recognition system to identify and discriminate during the recognition process. All of these environmental components contribute to reducing the SNR of the input signal and thus adversely affect speech recognition performance.

One way that environmental influences can be controlled is through directional, noise-reducing microphones positioned close to the speakers. The basic design of the microphones allows them to be more sensitive to sounds within a limited area chosen by the designers, thus reducing the sounds outside of this area. When these microphones are positioned optimally, the speaker is located in a more sensitive area and the noises generated outside of it are minimized, resulting in higher SNR for the input audio signal. Unfortunately, the SNR-improving characteristics of these microphones cannot always be fully exploited. Due to vehicle cabin design, manufacturing limitations, and cost, the microphone cannot always be placed close to the user, resulting in lower SNR.

In addition to noise, the user population of any given vehicle is very diverse and offers its own set of unique challenges. Speech applications in the past were created for a small, target user population but applications in a car must address a very large and varied user population. This factor leads to the most strenuous test of speech recognition in the area of speaker characteristics. The wide array of dialects and accents along with speaking styles (loud, soft, slow, fast, etc.) challenge the acoustic model and speech recognition engine design. People are not typically capable of, or willing to, change their own speech so recognition systems must instead adapt to a voice and a car’s audio characteristics.

There are also user expectation challenges. From Hollywood movies such as 2001: A Space Odyssey, Star Trek, and many more, users have been led to believe that when they speak to computers, the level of complexity and accuracy shown on screen can be achieved. Due to these many influences - speaking styles and preconceived notions of speech functionality - users bring extremely high expectations that have to be addressed by any speech application developer or system provider.

Overcoming the challenges
Many of these issues can be addressed by more advanced processing of the input audio before the speech recognition engine processing occurs, and by strengthening the engine itself. Examples are advanced microphones, algorithms that pre-process the input audio, more complex speech recognition algorithms, and larger acoustic models. These techniques can improve overall recognition accuracy and account for some of the challenges, but are ultimately limited by system constraints in the car environment. The computing platforms in vehicles are limited by cost and general automotive requirements for size and reliability. Therefore, in many cases it is not feasible to use multi-element microphones with beamforming technology, more complex algorithms that increase search dynamics of the speech engine, or larger acoustic models trained on more diverse data sets. Speech application and engine developers struggle over the balance between available system resources and the complexity of the solutions that can be applied in order to achieve the highest level of speech recognition accuracy for the largest population possible.

One method to address the trade-off between system requirements and the complexity of the solutions is through speech application design. Good speech application design can also greatly improve overall perceived recognition accuracy. Directed dialogs and use of context can limit the variability of commands. Also, with feedback from the system on signal levels and recognition confidence, the application can use confirmation prompts, which mimic human dialog while again directing and limiting the choice of commands.
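As a rough illustration of confidence-driven confirmation, the Python sketch below decides whether to execute a command, confirm it, or ask the driver to repeat. The `Recognition` type, the threshold values and the prompt wording are all hypothetical; a real engine would expose its own confidence scores and signal-level measurements.

```python
from dataclasses import dataclass

@dataclass
class Recognition:
    command: str       # best hypothesis from the recognizer
    confidence: float  # assumed score between 0.0 and 1.0

# Assumed tuning values, chosen only for illustration.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.50
MIN_SNR_DB = 5.0

def next_step(result, snr_db):
    """Return (action, prompt) for one dialog turn."""
    if snr_db < MIN_SNR_DB:
        # Signal-level feedback: the audio is too noisy to trust.
        return "reprompt", "Sorry, it is noisy in here. Please repeat that."
    if result.confidence >= ACCEPT_THRESHOLD:
        return "execute", f"OK, {result.command}."
    if result.confidence >= CONFIRM_THRESHOLD:
        # Mimic human dialog: check the uncertain hypothesis with a yes/no question.
        return "confirm", f"Did you say {result.command}?"
    return "reprompt", "Sorry, please repeat the command."

print(next_step(Recognition("call home", 0.92), snr_db=15.0))
print(next_step(Recognition("call home", 0.60), snr_db=15.0))
print(next_step(Recognition("call home", 0.92), snr_db=2.0))
```

The confirmation step narrows the next utterance to "yes" or "no", which is far easier to recognize in a noisy cabin than an open-ended command.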

By making systems more intuitive, it is possible to free drivers from the frustration of having to memorize rigid phrases and instead let them simply express what they want. Whether asking for directions to the nearest Chinese restaurant or changing the radio station, the driver should be able to get in the car and be understood immediately. This can be achieved by adding a voice biometric solution to the car so it will recognize the driver and automatically re-set preferences like radio pre-sets, seat position or mirror angles.

It is also important for the driver's experience to be enriched by the system, providing access to new services without requiring new buttons on the dashboard. These services could include location-based offerings, like a notification for a coupon offer as the driver approaches a favorite restaurant. The voice interface should also be able to learn a driver's preferences. For example, if the driver consistently asks for a country western radio station, the system should realize that and ask if that should be a pre-set.
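The preference-learning idea can be sketched in a few lines of Python. The class name, the threshold of three requests and the prompt text are assumptions made for illustration only; a production system would likely weigh recency, time of day and per-driver profiles as well.

```python
from collections import Counter

class PresetSuggester:
    """Counts spoken station requests and offers a preset once (hypothetical helper)."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # requests needed before suggesting
        self.counts = Counter()
        self.presets = set()

    def record_request(self, station):
        """Log one request; return a suggestion prompt or None."""
        self.counts[station] += 1
        # Suggest exactly once, when the count first reaches the threshold.
        if station not in self.presets and self.counts[station] == self.threshold:
            return f"You often ask for {station}. Save it as a preset?"
        return None

    def accept(self, station):
        """Store the station after the driver says yes."""
        self.presets.add(station)

suggester = PresetSuggester()
for _ in range(3):
    prompt = suggester.record_request("country western")
print(prompt)
```

Asking only once avoids the repeated prompts that, as noted earlier, drivers tend to find annoying.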

Perhaps one of the greatest breakthroughs in telematics safety and ease of use is the development of conversational telematics, and the Conversational Interface for Telematics (CIT) is a leading example.

The CIT is an in-vehicle, voice-interactive system using conversational language for driver to computer communications. The prototype system enables hands-free operation of vehicle functions, such as e-mails, navigation, audio and climate control, while minimizing driver distraction. The technology is also designed to help detect driver drowsiness and respond by engaging the driver in interactive discussions or games. In addition, the system enables Web-based services such as traffic alerts, weather and flight information.

The Audio Visual Speech Recognition (AVSR) project gives a whole new meaning to the phrase, “read my lips.”

Whether it is navigation information, faxes, phone calls, or Web content that a driver requires, AVSR will boost the accuracy of the speech recognition engine and help to eliminate the need to repeat information. Cameras focused on drivers’ mouths and “trained” to read lips will vastly improve the accuracy of speech recognition in noisy environments -- clearly a challenge for the most advanced current systems. This technology will significantly increase the probability that drivers will be understood when giving voice commands in their cars, even where background noise is present. AVSR will also monitor for drowsiness – for example, eyes open versus eyes closed – and assist the driver accordingly with interactive responses.

To make telematics more readily available to car buyers – and more cost effective for automakers to develop, deploy and manage – there is also an “off board” approach. By reducing the cost and bulk of the technology, automakers can expand telematics functionality across all price classes – not just luxury models -- and increase profitability before and after the sale of their vehicles. The Telematics Resource Manager enables the on-demand retrieval of live data from heterogeneous sources outside the car. Examples include information from traffic-reporting centers, other running vehicles and road sensors. What’s more, the framework facilitates real-time applications like “shortest-time” routing based on live road conditions, remote diagnostics, and “proximity based” e-coupon delivery. For example, you can access notification and delivery of e-coupons on consumer product goods at the grocery store that you just might have passed.

In the coming months, expect to see the following:

In-car navigation systems that use advanced speech recognition and text-to-speech capabilities identifying spoken street and city names, so drivers can speak the actual street address and receive turn-by-turn guidance to their destination. This new type of speech synthesis captures the characteristics of the human voice - a result of extensive work recording voices and digitally segmenting speech and intonations, so that a vehicle can communicate naturally with a driver.
Nationwide dining information from the most recognized names in restaurant guides, so that not only can drivers ask for the name of and directions to, for instance, the nearest Italian or French restaurant in the area, but also view and listen to a review of the restaurant included in the guide.
Real-time traffic navigation systems that integrate live traffic data into the navigation display. Additional integrated features will include a link that communicates information between the dealer and the driver, and another that uses Bluetooth™ technology to synchronize personal cell phone data within the car environment for hands-free, speech-enabled dialing (on phones with built-in Bluetooth capability).

With increasing numbers of auto parts and devices featuring embedded chips, today's cars are phoning home to auto manufacturers, suppliers and service industries more frequently. This, in turn, is translating into potential new offerings for customers in a variety of industries. Speech technology has gone mainstream, gaining consumer acceptance through innovation and technical quality. From customer service applications to PDAs to cars, it's demonstrated its utility and acceptance throughout a wide range of environments. Given the broad and growing range of in-vehicle applications now being installed, having a conversation with your car should seem natural.

Kenneth White is a senior software engineer for IBM's Pervasive Computing Division working as a solution architect on Embedded ViaVoice. He spent the last four years in the Embedded ViaVoice Organization.

Harvey Ruback is a senior software engineer for IBM's Pervasive Computing Division. He is currently the lead Embedded ViaVoice Architect.

Dr. Ing Sicconi has been with the Human Language Technology group at IBM T.J. Watson Research Center since 2000. He currently manages development of exploratory prototypes of conversational and multimodal user interfaces in smartphones, consumer devices and particularly in cars, both in stand-alone and in network-connected configurations.
