Your Voice Tells All
The sound and intonation of the human voice tell keen listeners a great deal about a speaker's feelings and condition. Using artificial intelligence, speech app developers are finding ways to interpret these cues and build applications that can tell these things about you:
Who you are. Most people immediately recognize James Earl Jones by his clear, deep voice even before seeing his face. His voice is exceptional, but not really an exception; most of us can recognize familiar speakers by their voices. Computers recognize voices too. Speaker identification and authentication technology is widely used today to recognize and authenticate users.
Speech recognition systems often include technology that not only identifies a speaker but also verifies that the speech is being produced live rather than played from a recording. For example, one technique to detect “liveness” asks users to repeat themselves. If the two utterances are similar but not identical, the speech is likely neither synthesized nor recorded. Even in noisy environments with multiple speakers, speaker identification technologies are fast and non-intrusive. Speaker identification, verification, and liveness tests are applied at the start of each conversation and can be repeated during the conversation to verify that the person speaking has not been replaced by a second person.
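The repetition-based liveness test described above can be sketched as a simple similarity comparison. In this illustrative Python sketch, each utterance is represented by a pre-extracted feature vector; the feature extraction step and both thresholds are assumptions made for illustration, and production systems use far more sophisticated acoustic models:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def liveness_check(utt1, utt2, match_floor=0.80, replay_ceiling=0.98):
    """Compare two repetitions of the same phrase.

    Natural repetitions are similar but never identical: a similarity
    at or above replay_ceiling suggests a replayed or synthesized copy,
    while one below match_floor suggests a different phrase or speaker.
    Both thresholds are illustrative, not validated values.
    """
    sim = cosine_similarity(utt1, utt2)
    if sim >= replay_ceiling:
        return "replay suspected"
    if sim >= match_floor:
        return "live"
    return "mismatch"
```

Two utterances with identical feature vectors would be flagged as a suspected replay, while moderately similar vectors pass as live speech.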
What language you speak. Most of us quickly realize when a speaker uses a foreign language, but few of us can instantly recognize precisely which of the roughly 7,000 languages spoken worldwide it is. Fortunately, computers can identify a language within three or four words and quickly switch to that tongue if an automatic speech recognition (ASR) capability exists for it.
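As a toy illustration of identifying a language from just a few words, the sketch below scores candidate languages by counting common function words in a transcript. The word lists, and the entire text-based approach, are simplifications assumed here for illustration; real spoken-language identification works directly on the acoustic signal and covers far more languages:

```python
# Tiny, assumed function-word lists; real systems model acoustics, not text.
FUNCTION_WORDS = {
    "en": {"the", "is", "and", "of"},
    "es": {"el", "la", "de", "y"},
    "fr": {"le", "la", "de", "et"},
}

def guess_language(words):
    """Return the language whose function words best match the input words."""
    scores = {
        lang: sum(1 for w in words if w.lower() in vocab)
        for lang, vocab in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Even this crude scorer settles on Spanish for a phrase like "el perro de Juan" after only four words, which mirrors how quickly statistical language ID converges.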
Your emotions. Actors are experts at conveying emotions with their voices. By using inflections and changing the speed and pitch of the words they speak, they convey happiness, sadness, anger, fear, stress, and more. Participants in a conversation detect one another's emotions not only from the choice of words but also from the manner in which they are uttered. Computers can likewise detect emotions by analyzing user speech patterns. Applications that listen to users speak can adjust the user-computer dialogue to reflect user emotions. For example, if a user sounds disappointed, the application modifies the dialogue appropriately.
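The idea of reading emotion from pitch and speaking rate can be illustrated with a crude rule-based sketch. The thresholds and labels below are assumptions chosen purely for illustration; real emotion recognition relies on classifiers trained over many acoustic features:

```python
def rough_emotion_cue(mean_pitch_hz, pitch_range_hz, words_per_minute):
    """Map two prosodic cues to a coarse emotional label.

    Wide pitch variation plus fast speech often signals high arousal
    (excitement, anger); low pitch plus slow speech often signals low
    arousal (sadness, fatigue). All thresholds are illustrative.
    """
    if pitch_range_hz > 100 and words_per_minute > 180:
        return "aroused (excited or angry)"
    if mean_pitch_hz < 120 and words_per_minute < 110:
        return "subdued (sad or tired)"
    return "neutral"
```

A dialogue manager could use such a label to, say, slow down and simplify its prompts when the user sounds subdued.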
Your truthfulness. Can speech apps determine whether the user is lying? The answer is yes, indirectly. They can detect user stress and sudden changes in voice patterns, both of which may indicate that a user is being untruthful. Skilled speakers, however, may be able to hide these particular signs of untruthfulness.
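Detecting the "sudden changes in voice patterns" mentioned above can be framed as anomaly detection against a speaker's own baseline. The sketch below flags a pitch measurement that deviates strongly from earlier samples; the z-score threshold is an assumption, and real stress detection would combine many cues, not pitch alone:

```python
import math

def stress_flag(baseline_pitches, current_pitch, z_threshold=2.5):
    """Flag a pitch reading far outside the speaker's own baseline.

    baseline_pitches: pitch samples (Hz) from earlier, calm speech.
    Returns True when current_pitch lies more than z_threshold
    standard deviations from the baseline mean.
    """
    n = len(baseline_pitches)
    mean = sum(baseline_pitches) / n
    variance = sum((p - mean) ** 2 for p in baseline_pitches) / n
    std = math.sqrt(variance) or 1.0  # guard against a zero-variance baseline
    z = abs(current_pitch - mean) / std
    return z > z_threshold
```

With a baseline hovering around 120 Hz, a sudden jump to 160 Hz trips the flag, while ordinary variation does not.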
Your medical conditions. It is easy to tell whether someone has a cold or allergies by the sound of their voice. Speech analysis can also detect changes in voice patterns typical of Parkinson's disease, multiple sclerosis, Alzheimer's disease, and other conditions that affect speech production. Phone apps may soon perform preliminary screening for these and other diseases, including COVID-19 and the flu.
If you are intoxicated. Speech recognition systems can detect word slurring and other signs of intoxication caused by alcohol or other substances. Whether this can be considered legal proof of intoxication has yet to be determined.
Your location. Although background noises are not spoken, they can contain information about the speaker—where the speaker is and what the speaker might be doing. Ambient noises like those that occur on farms, at the seashore, in the city, or inside a warehouse can be captured as you speak and provide clues to your location. Specific sounds such as the distinctive bong of Big Ben indicate where you are (in this case, London).
But with all this progress come concerns about privacy and about bad actors misusing information gleaned from voices for nefarious purposes. Bad actors may also use recordings of your voice to train a speech synthesis application to produce a voice with your characteristics, which can be used to misrepresent you in business transactions or informational messages, or to make you appear foolish on social media.
Both the European Data Protection Board (EDPB) and the Open Voice Network (OVON) have published privacy protection guidelines. The EDPB document (https://edpb.europa.eu/our-work-tools/our-documents/guidelines/guidelines-022021-virtual-voice-assistants_en) provides guidance in the form of definitions, use cases, and guidelines. The OVON document outlines 10 guidelines for maintaining user privacy. Neither document is legally binding.
Voice application developers should carefully review the EDPB and OVON guidelines. Join the various discussion groups addressing voice privacy and lobby for the guidelines, procedures, and laws you feel are necessary to protect user privacy.
James A. Larson, Ph.D., is senior advisor to the Open Voice Network and is the co-program chair of the SpeechTEK Conference. He can be reached at firstname.lastname@example.org.