Speech Recognition at KAIST

A group of researchers at the Brain Science Research Center of KAIST (Korea Advanced Institute of Science and Technology) in Taejon, Korea, has been modeling the human central auditory nervous system for noise-robust speech recognition. The research is funded as part of the Brain Neuroinformatics Research Program, one of three major national brain research programs that the Korean Ministry of Science and Technology launched in November 1998 for a 10-year term. Some of the results have already reached industry through a Korean venture company, Extell Technology Corporation, and include speech recognition chips featuring auditory models and speech enhancement technology based on binaural signal processing.
The Brain Neuroinformatics Research Program has two goals: to understand the information processing mechanisms of biological brains, and to develop intelligent machines with human-like functions based on those mechanisms. It is a joint effort of researchers from many academic disciplines, including cognitive neuroscience, mathematics, electrical engineering and computer science. The program divides human brain functions into four modules: vision, auditory, cognition and behavior. Humans use five “sensors” to receive information from the outside environment, process these sensor signals, and produce motor controls for interaction. Of the five sensors, the visual and acoustic sensors provide the richest information and allow complex information processing. All sensory information is integrated and processed in the cognition module, which provides learning, memory and decision-making functions. The last module, behavior, plans human reactions and generates the signals required for sensory-motor control.
Research on the auditory module is based on the simplified diagram of the human central auditory nervous system shown in Figure 1.
The mechanical vibration of the eardrum is converted into neural signals at the inner hair cells (IHCs) by way of the basilar membrane in the cochlea. Each IHC signal represents the acoustic input after frequency-specific filtering and a nonlinear transformation. The “front end” of the human ear is therefore believed to be a nonlinear spectral analyzer that detects and captures the spectral information in the audio input. The IHC signals from the left and right cochleae are combined at the superior olivary complexes (SOCs) and then travel to the auditory cortices through the inferior colliculus (IC) and the medial geniculate body (MGB). The binaural signal processing at the SOCs performs sound localization and noise reduction. The early stages of auditory processing, at the cochlea and possibly up to the SOC level, are relatively well understood; the processing between the SOC and the auditory cortex is less well understood and is therefore drawn as dotted lines. It is also known that some neurons in the auditory cortex respond to specific sound components with complex time-frequency characteristics. Speech recognition and language understanding take place at higher levels of the brain.
In addition to the forward signal paths there are also backward signal paths, which many existing auditory models do not incorporate. These backward paths are responsible for top-down attention, which filters irrelevant components out of noisy input speech.
The detailed functions currently being modeled are summarized in Figure 2. The object path, or “what” path, includes nonlinear feature extraction, time-frequency masking and complex feature formation from the cochlea to the auditory cortex. These are the basic components of speech feature extraction for speech recognition. The spatial path, or “where” path, consists of sound localization and noise reduction with binaural processing.
The attention path includes both bottom-up (BU) and top-down (TD) attention. All of these components are coupled, however. In particular, BU and TD attention together control the object and spatial signal paths.
The nonlinear feature extraction model is based on a cochlear filter bank and a logarithmic nonlinearity. The cochlear filter bank consists of many bandpass filters whose center frequencies are spaced uniformly on a logarithmic scale. The quality factor Q of the bandpass filters, i.e., the ratio of center frequency to bandwidth, is quite low, so the frequency responses of neighboring filters overlap. The logarithmic nonlinearity provides a wide dynamic range and robustness to additive noise.
Time-frequency masking is a psychoacoustic phenomenon in which a stronger signal suppresses weaker signals that are nearby in time and frequency. Frequency masking is modeled by lateral inhibition in the frequency domain, which also helps to increase frequency selectivity despite the overlapping filters. Time masking is also implemented as lateral inhibition, but only forward (progressive) time masking is incorporated. Modeling of complex features, such as onset/offset and frequency modulation features, is regarded as a future research topic.
For binaural processing in the spatial path, conventional models use cross-correlation to estimate the interaural time delay, i.e., the time delay between the signals at the left and right ears, and use that delay for sound localization and noise reduction. More advanced models also use the interaural intensity difference. However, these models assume only direct sound paths from a sound source to the two ears, an assumption that fails in many real-world environments with multipath reverberation and multiple sound sources (e.g., speech inside an automobile with external road and wind noise, plus reverberant speech mixed with music from the audio system).
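As a rough illustration of the feature-extraction front end described above, the Python sketch below spaces center frequencies uniformly on a logarithmic axis, derives bandwidths from Q, applies a logarithmic nonlinearity, and sharpens a spectral profile by lateral inhibition across neighboring channels. The frequency range, channel count, Q value, inhibition weight and all function names are illustrative assumptions, not values or code from the KAIST models.

```python
import numpy as np

def log_spaced_centers(f_low, f_high, n_channels):
    """Center frequencies spaced uniformly on a logarithmic axis."""
    return np.geomspace(f_low, f_high, n_channels)

def log_compress(energy, floor=1e-6):
    """Logarithmic nonlinearity; the floor avoids taking log(0)."""
    return np.log(np.maximum(energy, floor))

def lateral_inhibition(channels, weight=0.25):
    """Subtract a fraction of both neighbors' values from each channel."""
    padded = np.pad(channels, 1, mode="edge")  # repeat the edge channels
    return channels - weight * (padded[:-2] + padded[2:])

# Hypothetical settings: 7 channels from 100 Hz to 6.4 kHz, Q = 2.
centers = log_spaced_centers(100.0, 6400.0, 7)
bandwidths = centers / 2.0  # Q = center frequency / bandwidth

# A made-up channel-energy profile with one dominant channel.
energy = np.array([1.0, 1.0, 4.0, 1.0, 1.0])
sharpened = lateral_inhibition(energy)
compressed = log_compress(energy)
```

With these settings neighboring center frequencies differ by a constant ratio, and the dominant channel stands out more clearly from its neighbors after inhibition than before.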
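The conventional cross-correlation estimate of interaural time delay can be sketched in a few lines of Python. This is a minimal example that assumes a single direct sound path, which is exactly the limitation noted above; the function name, sample rate and synthetic signals are hypothetical.

```python
import numpy as np

def estimate_itd(left, right, sample_rate):
    """Estimate how long `right` lags `left`, in seconds.

    A positive result means the sound reached the left ear first.
    """
    corr = np.correlate(right, left, mode="full")
    # Lag (in samples) that corresponds to each entry of `corr`.
    lags = np.arange(-(len(left) - 1), len(right))
    return lags[np.argmax(corr)] / sample_rate

# Synthetic check: white noise, with the right channel delayed by 5 samples.
rng = np.random.default_rng(0)
sample_rate = 16000  # Hz, hypothetical
left = rng.standard_normal(1000)
right = np.concatenate([np.zeros(5), left[:-5]])
itd = estimate_itd(left, right, sample_rate)
```

Once reverberation or a second source is added, the correlation peak no longer maps cleanly to one delay, so a single-peak estimate of this kind becomes unreliable.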
Binaural processing for reverberant, multi-source environments must therefore incorporate deconvolution and separation algorithms. Because the time-delayed components add many parameters, a simple correlation measure is no longer sufficient, and an extended binaural processing model has been developed based on information theory.
For the attention path, a model is being developed that combines the BU and TD attention mechanisms. BU attention is usually triggered by strong sound intensity and/or rapid changes in intensity over time, and is closely related to time-frequency masking. TD attention, by contrast, comes from the familiarity and importance of a sound, and relies on each person’s existing knowledge. For example, a particular word or a particular person’s voice may trigger TD attention only in the people for whom it is relevant. TD attention therefore originates in the higher-level brain, which is modeled by a speech recognition system. A simple yet efficient TD attention model has been developed based on the error back-propagation algorithm with multilayer perceptron classifiers. TD attention also provides a reference signal to the extended binaural processor, which then works much like an active noise canceller and performs considerably better.
Many auditory models are computationally intensive, so special hardware has been developed for real-time applications. A speech recognition chip has been developed as a system-on-chip solution consisting of circuit blocks for A/D conversion, nonlinear speech feature extraction, a programmable processor for recognition, and D/A conversion. The extended binaural processing model has also been implemented on an FPGA, and an ASIC version will be introduced in the near future.
All three functional paths are coupled, and a model of any single path on its own is expected to contribute only slightly to the overall performance improvement.
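The article does not spell out how error back-propagation drives top-down attention. One plausible reading, assumed here purely for illustration, is that the classifier weights stay frozen while the output error for an expected class is propagated back to a set of per-feature input gains. In the Python sketch below the network sizes, random weights, learning rate and all names are hypothetical, and the random weights merely stand in for a trained multilayer perceptron.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 8, 6, 3

# Stand-in for a trained classifier: random weights that stay frozen.
W1 = rng.normal(scale=0.5, size=(n_hidden, n_in))
W2 = rng.normal(scale=0.5, size=(n_out, n_hidden))

def forward(x, gain):
    """Run the frozen MLP on the gain-weighted input features."""
    hidden = np.tanh(W1 @ (gain * x))
    output = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))
    return hidden, output

def attend(x, target, steps=200, lr=0.05):
    """Adapt per-feature gains by back-propagating the output error."""
    gain = np.ones_like(x)
    losses = []
    for _ in range(steps):
        hidden, output = forward(x, gain)
        err = output - target
        losses.append(0.5 * np.sum(err ** 2))
        delta_out = err * output * (1.0 - output)             # sigmoid layer
        delta_hid = (W2.T @ delta_out) * (1.0 - hidden ** 2)  # tanh layer
        gain = gain - lr * (W1.T @ delta_hid) * x             # input gains
    return gain, losses

x = rng.uniform(0.0, 1.0, size=n_in)  # made-up feature vector
target = np.array([0.0, 1.0, 0.0])    # class the listener "expects"
gain, losses = attend(x, target)
```

Under this reading, features whose gains shrink are the ones the expected class treats as irrelevant, which matches the idea of top-down attention filtering a noisy input.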
Given the coupling among the three paths, the functional models developed so far will eventually be integrated into a single model whose parameters are all optimized simultaneously. Even with the current results, however, these auditory models demonstrate significantly better recognition performance in real-world noisy environments and are good enough for commercialization.
For more on the research conducted at KAIST on auditory processing and brain neuroinformatics for noise-robust speech processing, please visit http://bsrc.kaist.ac.kr/English/eng-main.htm.
