The impressive progress in speech technologies over the last decades is undeniable. However, despite this progress, which has led to numerous voice-enabled technologies in widespread use, most of these services are available in only a limited number of languages. With the availability of speech processing technologies and reasonably accurate speech-to-text transcription systems, interest has been growing in using them for linguistic studies, in particular as an aid for linguistic exploration and documentation. In this talk, I will present some of the work my colleagues and I have carried out to explore linguistic phenomena such as language change, languages in contact, and dialectal variation (lexical and phonological), as well as the description and validation of linguistic properties of so-called rare or low-resourced languages.
Lori Lamel is a senior research scientist at the CNRS, which she joined in 1991. Her research interests include large vocabulary continuous speech recognition; acoustic-phonetic studies; lexical and phonological modeling; speaker and language identification; and speech recognition and keyword search in low-resourced languages. She has also contributed to the design, analysis, and realization of large speech corpora, most notably TIMIT, BREF and TED. She is a member of the Editorial Board of Speech Communication, the Editorial Board of the Journal of Natural Language Engineering, and the IEEE James L. Flanagan Speech & Audio Processing Award Committee, and has served on the scientific/program committees of workshops/conferences. She was named a Fellow of the International Speech Communication Association (ISCA) in 2015 and currently serves on the ISCA board.
The cocktail party problem, or speech separation, has evaded a solution for decades in speech and audio processing. I have been advocating a new formulation of this old challenge that estimates an ideal time-frequency mask (binary or ratio). This formulation turns the classical signal processing problem into a machine learning problem, and deep neural networks (DNNs) are particularly well-suited for this task due to their representational capacity. I will describe recent algorithms that employ deep learning for supervised speech separation, including speech enhancement and speaker separation. DNN-based mask estimation elevates speech separation performance to new levels, and produces the first demonstration of substantial speech intelligibility improvements for both hearing-impaired and normal-hearing listeners in background interference. These advances represent big strides towards solving the cocktail party problem.
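As a rough illustration of the mask-based formulation described above, the following Python sketch computes an ideal binary mask and an ideal ratio mask from per-unit speech and noise energies. It assumes the commonly used definitions (binary: 1 where the local SNR exceeds a criterion in dB; ratio: the speech share of the total energy raised to an exponent). The function names, the `lc_db` and `beta` parameters, and the toy energies are assumptions made for this example and are not taken from the talk.

```python
import math

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Return 1 for each time-frequency unit whose local SNR (dB) exceeds lc_db, else 0.

    speech_power, noise_power: nested lists [frame][channel] of non-negative energies.
    lc_db: hypothetical name for the local SNR criterion (0 dB assumed by default).
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            if s == 0.0:
                row.append(0)          # no speech energy: unit is noise-dominated
            elif n == 0.0:
                row.append(1)          # speech with no noise: unit is speech-dominated
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Return (S / (S + N)) ** beta for each unit, a soft mask in [0, 1].

    beta: assumed exponent; 0.5 is a common choice.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            row.append((s / total) ** beta if total > 0.0 else 0.0)
        mask.append(row)
    return mask

# Toy energies for two frames and two frequency channels (made-up values).
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 4.0], [1.0, 0.0]]

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
```

In a supervised setting, masks of this kind would serve as training targets: a network sees features of the noisy mixture and learns to predict the mask, which is then applied to the mixture to recover the speech.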
DeLiang Wang received the B.S. degree and the M.S. degree from Peking (Beijing) University and the Ph.D. degree in 1991 from the University of Southern California, all in computer science. Since 1991, he has been with the Department of Computer Science & Engineering and the Center for Cognitive and Brain Sciences at The Ohio State University, where he is a Professor and University Distinguished Scholar. He also holds a visiting appointment at the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University. He received the U.S. Office of Naval Research Young Investigator Award in 1996, the 2005 Best Paper Award from IEEE Transactions on Neural Networks, and the 2008 Helmholtz Award from the International Neural Network Society. He is an IEEE Fellow and Co-Editor-in-Chief of Neural Networks.
Speech synthesis for an unwritten language requires a non-text input. An image2speech system satisfies this requirement: it observes an image and generates a spoken description of the activity portrayed in the image. An image2speech system has three components: an image understanding front end embeds the image into a sequence of semantic vectors, a neural machine translator converts the semantic vectors into a sequence of phones, and a speech synthesizer generates audio from the phone sequence. Image2speech generates intelligible and meaningful speech output if it uses a phone set derived from an ASR system in the same language. Unfortunately, ASR is usually not available for unwritten languages. One alternative is cross-language ASR: an ASR system is trained in a different (high-resourced) language, then used to phonetically transcribe a corpus of speech audio in the unwritten language. Unfortunately, the high error rate of cross-language ASR results in phone strings that are hard to use in image2speech. Another alternative is mismatched transcription. A mismatched transcript is a transcription of the speech by a person (or ASR system) who doesn't understand the language; since transcribers don't understand what they are listening to, they can correctly transcribe only the phonetic features shared between their own language and the language they are transcribing. When multiple transcribers are available (e.g., one who speaks English, one who speaks Mandarin), it's possible to infer the phoneme inventory of the unwritten language by cross-checking the phonetic features transcribed by the different transcribers. Lexical tones of the unwritten language can also be inferred, though with greater variance, because tonal universals are often masked by transcriber-language biases. This talk will define image2speech and provide examples using same-language and cross-language ASR.
I will then explore the use of mismatched crowdsourcing to infer the phone set and lexical tone set of an unwritten language, and to infer probabilistic transcriptions of untranscribed audio. I'll conclude the talk by describing the setup for an image2speech system that uses mismatched transcription, cross-checked across multiple transcriber languages, to define the phone units for image2speech synthesis in an unwritten language.
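The cross-checking idea can be pictured with a deliberately simplified Python sketch. It assumes each transcriber reports, per speech segment, only the phonetic features their own language distinguishes, and it merges the reports by a plain per-feature vote. This is a toy stand-in rather than the probabilistic method discussed in the talk; the function name `cross_check`, the feature names, and the sample transcripts are all hypothetical.

```python
from collections import Counter

def cross_check(transcripts):
    """Merge per-segment feature reports from several mismatched transcribers.

    transcripts: one list per transcriber, each holding one dict per segment
    that maps a feature name to the value that transcriber reported. A feature
    the transcriber's language does not distinguish is simply absent.
    Returns one merged dict per segment, keeping the most-voted value of each
    feature (ties go to the value reported first).
    """
    n_segments = len(transcripts[0])
    merged = []
    for i in range(n_segments):
        votes = {}
        for transcriber in transcripts:
            for feature, value in transcriber[i].items():
                votes.setdefault(feature, Counter())[value] += 1
        merged.append({f: c.most_common(1)[0][0] for f, c in votes.items()})
    return merged

# Hypothetical reports for two segments of the same utterance.
english_speaker = [
    {"voiced": True, "place": "labial"},
    {"voiced": False, "place": "alveolar"},
]
mandarin_speaker = [
    {"place": "labial", "tone": "rising"},
    {"place": "alveolar", "tone": "falling"},
]

merged = cross_check([english_speaker, mandarin_speaker])
```

Here the merged description of each segment carries voicing from one transcriber and tone from the other, while the place feature is confirmed by both. A realistic system would weight each report by how reliably that transcriber's language encodes the feature instead of counting votes equally.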
Mark Hasegawa-Johnson has been on the faculty at the University of Illinois since 1999, where he is currently a Professor of Electrical and Computer Engineering. He received his Ph.D. in 1996 at MIT, with a thesis titled "Formant and Burst Spectral Measures with Quantitative Error Models for Speech Sound Classification," after which he was a post-doc at UCLA from 1996 to 1999. Prof. Hasegawa-Johnson is a Fellow of the Acoustical Society of America, and a Senior Member of IEEE and ACM. He is currently Treasurer of ISCA, and Senior Area Editor of the IEEE Transactions on Audio, Speech and Language. He has published 280 peer-reviewed journal articles and conference papers in the general area of automatic speech analysis, including machine learning models of articulatory and acoustic phonetics, prosody, dysarthria, non-speech acoustic events, audio source separation, and under-resourced languages.