I just completed a paper for the book Speaker Classification, edited by Christian Müller and Suzanne Schötz, in which I describe ways that speaker identification and verification (SIV) need and use classification. Preparing the article reminded me of how central the classification of speech and speakers is to speech processing as a whole. Emotional/affective speech, for example, is widely recognized as a factor affecting all kinds of speech-processing technologies and systems. In fact, there is a great deal of active research on emotion and stress, some of which is presented in the classification book cited above and some of which appeared in articles I wrote or cowrote for Speech Technology magazine last year in the January/February and May/June issues.
Emotion is just one of many aspects of spoken behavior that affect the performance of speech-processing systems and also lend themselves to being partitioned into useful categories.
Classifications based on biological characteristics, such as a speaker’s sex and age, are among the most potent. Knowing the sex of a speaker is useful for both core technologies and applications. Some speaker identification and automated speech recognition (ASR) systems use gender classification to reduce the search space, which, in turn, enhances the speed and accuracy of the system. For text-to-speech (TTS) synthesis, the choice of voices that will be used in the application often begins with a decision about whether the voices are to be female or male. The globalization of ASR and TTS has added gender-linked cultural factors to these considerations, and I expect that as personalization advances, it will incorporate gender into its criteria as well.
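The search-space reduction described above can be sketched as follows. This is a minimal illustration, not a production technique: the pitch threshold, the model names, and the single-feature classifier are all hypothetical stand-ins for the statistical classifiers real systems use.

```python
# Sketch: routing speech to a gender-dependent acoustic model.
# The threshold and model identifiers below are illustrative only.

def classify_gender(mean_pitch_hz: float) -> str:
    """Crude single-feature illustration: adult male voices average
    roughly 85-180 Hz, adult female voices roughly 165-255 Hz."""
    return "female" if mean_pitch_hz >= 165.0 else "male"

# Hypothetical gender-specific model identifiers.
MODELS = {
    "male": "acoustic_model_male",
    "female": "acoustic_model_female",
}

def select_model(mean_pitch_hz: float) -> str:
    # Searching only the matching gender-specific model roughly halves
    # the search space, which is what speeds up recognition.
    return MODELS[classify_gender(mean_pitch_hz)]
```

In practice the classification itself is probabilistic, but the routing idea is the same: commit early to one partition of the model space and search only that partition.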
Discussions about age recognition generally center on its use for adult-only back-end applications, but according to the above-mentioned book, it can be a potent personalization factor and appears to lend itself to automation. Knowing the age category of a caller could help select the right agent for that caller or the most effective targeted ads. Age categorization can enhance usability by indicating, for example, how to organize menu selections so that the most likely options for a specific age group are spoken first or by slowing the rate of speech in prompts to elderly callers.
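The two usability adaptations mentioned above can be sketched in a few lines. The age-group labels, menu options, and the 15 percent rate reduction are assumptions for illustration; real systems would get the age category from a classifier and tune the prompt rate empirically.

```python
# Sketch: adapting an IVR menu and TTS prompt rate to an estimated
# age group. Categories, options, and rates are hypothetical.

MENU = ["mobile app support", "account balance", "speak to an agent"]

# Assumed most-likely option per age group.
PREFERRED_FIRST = {
    "young": "mobile app support",
    "senior": "speak to an agent",
}

def order_menu(age_group: str) -> list:
    # Speak the most likely option for this age group first.
    first = PREFERRED_FIRST.get(age_group)
    if first is None:
        return list(MENU)
    return [first] + [opt for opt in MENU if opt != first]

def prompt_rate(age_group: str) -> float:
    # Slow the prompt slightly for elderly callers (1.0 = normal speed).
    return 0.85 if age_group == "senior" else 1.0
```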
Selection of a good TTS voice or voice talent can also be strongly influenced by knowing the age group of likely callers. A savvy company whose customers are teens and young adults will select a voice that is quite different from the one chosen by a company that serves seniors. In both cases, the voice will sound and talk like a member of the dominant age group.
Automatic language identification classifies speakers based on the language they are using. It enables ASR and TTS systems to listen and speak effectively. It is also critical for language-dependent SIV systems because they use language-specific phonetics and phonemics to make their SIV decisions.
Speakers with strong regional and non-native accents represent a challenge to speaker-independent ASR technology as well. Historically, ASR developers have incorporated small amounts of these outlier accents into their generic language models or larger amounts for systems that will be deployed in locations with a sizeable population of dialect speakers. Classification could enhance ASR robustness by signaling the presence of a strong dialect or non-native accent when there are recognition problems. This could lead to a language-model switch, transfer to an agent (perhaps one who speaks the right language), or another customer-service solution.
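The fallback flow described above, recognition trouble triggering a model switch or an agent transfer, amounts to a short decision procedure. The confidence threshold and the return values here are hypothetical; they stand in for whatever signals and actions a deployed dialogue system exposes.

```python
# Sketch of an accent-aware fallback policy. Threshold and action
# names are illustrative assumptions, not a real API.

CONFIDENCE_THRESHOLD = 0.6  # assumed cutoff for "recognition problems"

def handle_turn(confidence: float, detected_accent=None) -> str:
    # Normal path: recognition succeeded with acceptable confidence.
    if confidence >= CONFIDENCE_THRESHOLD:
        return "continue"
    # Recognition trouble plus a detected strong dialect or non-native
    # accent: try switching to a matching language model first.
    if detected_accent is not None:
        return "switch_model:" + detected_accent
    # Otherwise escalate to a human agent (ideally one who speaks
    # the caller's language) or another customer-service solution.
    return "transfer_to_agent"
```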
Forensic speaker identification looks for speakers of interest based on dialect and foreign accent, as well as language, but full automation of these technologies is still evolving, according to several contributors to the book. Ulrike Gut even found that non-native speakers tend to speak more slowly than natives and share other common attributes that could be used by computers to automatically separate native from non-native speakers.
The classification categories discussed here are not the only ones of interest to speech systems, but they are the ones receiving the most attention from both researchers and developers. The publication of the two-volume book on speaker classification reveals how important those categories are to the systems we build. It also highlights the challenges that still remain.
Judith Markowitz is the technology editor of Speech Technology and is a leading independent analyst in the speech technology and voice biometric fields. She can be reached at (773) 769-9243 or firstname.lastname@example.org.