Speech Science in the Era of Big Data
When we refer to natural language processing (NLP) systems that process speech or text, we are generally talking about the building of adequate language models and the deployment of reliable statistical heuristics. Big Data is a blessing for NLP systems in general, and for speech in particular. But it does require managing, filtering, and categorizing data into usable components. This challenge is not at all trivial and poses important questions about tools and processing procedures.
The potential for using data culled from the Internet to enhance speech processing is huge, and there is ample reason to conduct this kind of research. Creating large databases of speech data is essential for training algorithms and getting large quantities of segments needed for statistical models within a short period of time.
Today we can buy classical speech databases ready-made and preprocessed, usually for the more dominant languages. These databases are well designed but usually expensive. From the Internet, however, we can recover voice files representing a host of languages, styles, and dialects that are not necessarily found in mainstream databases, including data for the less common languages. Many researchers are eager to explore the potential of building databases of less commonly available languages and dialects, for less money and in less time, using sources available on the Internet.
Multiple types of speech files can be processed from Internet sources. These include audio files of speech, as from radio programs, audio lectures, and voice messages. We can also access speech from video sources, like movies, lectures, and presentations. Some of the speech in video appears with text transcription in one or more languages. And speech records are available in different language styles: slang (stand-up comedy, mass media), formal speech (politics, literature), speech from social networks, and lectures (from universities, TED talks, classrooms, and so on).
Unlike classical speech databases, Internet sources can supply speech data that occurs in conjunction with modalities such as gesture, lip movement, face movement, and emotions. This opens up possibilities for crossing between modalities, for purposes of alignment and disambiguation. In addition, the metadata of Internet speech files can provide information about sources, speakers, topics, and more; this data can be used to enhance segmentation and classification accuracy. Finally, speech databases from Internet sources could be beneficial for speech machine translation, one of the hardest tasks in the area of NLP.
Though we recognize the potential of speech databases built from the Internet, we must also consider the challenges posed by data collected from such wide and potentially chaotic sources. What features should be used to categorize the collected speech files, in order to form a high-quality database? What filters and identifiers will be required to prune the data into relevant, high-quality material? The data must be filtered with care so that the value and relevancy of what remains are clearly evident. To estimate the resources needed, we need to establish how much effort is required to determine the data's reliability, and how much of that effort can be automated.
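To make these questions concrete, here is a minimal sketch in Python of what metadata-based filtering and categorization might look like. Everything in it is an assumption made for illustration: the `SpeechClip` fields, the `keep` and `categorize` helpers, the five-second threshold, and the sample records are hypothetical and do not describe any existing tool or the research mentioned in this article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechClip:
    # Hypothetical metadata fields; a real collection would define its own.
    name: str
    language: str
    style: str            # e.g. "formal", "slang", "lecture"
    duration_s: float
    has_transcript: bool


def keep(clip, min_duration_s=5.0, require_transcript=False):
    """Return True if the clip passes the illustrative quality filters."""
    if clip.duration_s < min_duration_s:
        return False
    if require_transcript and not clip.has_transcript:
        return False
    return True


def categorize(clips, **filter_options):
    """Group the clips that pass the filters by (language, style)."""
    groups = defaultdict(list)
    for clip in clips:
        if keep(clip, **filter_options):
            groups[(clip.language, clip.style)].append(clip)
    return dict(groups)


# Made-up sample records, for demonstration only.
clips = [
    SpeechClip("clip-a", "he", "lecture", 120.0, True),
    SpeechClip("clip-b", "he", "slang", 2.5, False),
    SpeechClip("clip-c", "am", "formal", 45.0, False),
]
database = categorize(clips)
```

With the default settings, the very short clip is dropped and the remaining two are grouped by language and style. Passing `require_transcript=True` would keep only clips that arrive with a transcription. A real pipeline would need far richer features, and deciding which of them can be checked automatically is exactly the open question.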
We believe that deeper use of speech data from the Internet will create new sources of valuable information for speech and NLP engines. Multimodal information and metadata from these speech files can be employed for additional NLP processes like disambiguation. In research currently under way at the Holon Institute of Technology, we are exploring options to better leverage Internet speech data, and we are conducting experiments to address challenges in data cleansing and reliability assessment.
Nava Shaked is the CEO of Brit Business Technologies Ltd., a call center optimization consulting practice specializing in speech technologies. She serves on the AVIOS board and is the chairperson of AVIOS Israel. Some of the issues mentioned in the article are part of ongoing research conducted at the Holon Institute of Technology in Israel by Shaked, Nissim Harel, and their graduating students. Shaked can be reached at firstname.lastname@example.org.