Abstract: Statistical language models (LMs), which estimate the joint probabilities of natural sentences, form a crucial component in many artificial intelligence applications, such as speech recognition and machine translation. In terms of probabilistic graphical modeling, language modeling methods can be categorized into two classes. One class is directed graphical models (DGMs), in which the joint probability of a word sequence is factored into a product of local conditional probabilities. The other is undirected graphical models (UGMs), in which the joint probability of the whole sentence is defined to be proportional to a product of local potential functions. For DGM-based LMs, this tutorial introduces the classic n-gram LMs and the neural network LMs with typical network structures, and then presents methods for reducing the computational cost and handling the out-of-vocabulary (OOV) problem. In the second part of this tutorial, we will first introduce some typical UGM-based LMs, including whole-sentence maximum entropy (WSME) LMs, trans-dimensional random field (TRF) LMs, and whole-sentence neural LMs, and then present their training algorithms, including the augmented stochastic approximation (AugSA) method and the noise-contrastive estimation (NCE) method. In addition, we will provide open-source code (https://github.com/wbengine/SPMILM) and hands-on exercises to help the audience get familiar with state-of-the-art language modeling techniques.
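The two factorizations above can be illustrated with a toy bigram example. All probabilities and the potential function below are invented for illustration; a real LM would estimate them from a corpus:

```python
import math

# Hypothetical bigram conditional probabilities p(w_i | w_{i-1}).
bigram_prob = {
    ("<s>", "the"): 0.6, ("the", "cat"): 0.3,
    ("cat", "sat"): 0.4, ("sat", "</s>"): 0.5,
}

def dgm_log_prob(sentence):
    """DGM view: the joint probability factors into local conditionals,
    p(w_1..w_n) = prod_i p(w_i | w_{i-1}) for a bigram model."""
    tokens = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(bigram_prob[(u, v)])
               for u, v in zip(tokens, tokens[1:]))

def ugm_score(sentence, potential):
    """UGM view: an unnormalized score prod_i phi(w_{i-1}, w_i);
    the joint probability is this score divided by a global
    normalizing constant Z summed over all sentences."""
    tokens = ["<s>"] + sentence + ["</s>"]
    score = 1.0
    for u, v in zip(tokens, tokens[1:]):
        score *= potential(u, v)
    return score

p = math.exp(dgm_log_prob(["the", "cat", "sat"]))
# p = 0.6 * 0.3 * 0.4 * 0.5 = 0.036
s = ugm_score(["the", "cat", "sat"], lambda u, v: 2.0)
```

Note that the DGM probabilities are locally normalized and sum to one by construction, whereas the UGM score is meaningful only relative to the global constant Z, which is what makes UGM training (e.g., AugSA, NCE) the harder problem.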
Abstract: Until recently, the goal of developing open-domain dialogue systems that not only emulate human conversation but also fulfill complex tasks, such as travel planning, seemed elusive. In the last few years, however, we have started to observe promising results, as large amounts of conversation data have become available for training and breakthroughs in deep learning and reinforcement learning have been applied to dialogue. In this tutorial, we start with a brief introduction to the history of dialogue research. Then, we describe in detail the deep learning and reinforcement learning technologies that have been developed for two types of dialogue systems. First is the task-oriented dialogue system, which can help users accomplish tasks ranging from meeting scheduling to vacation planning. Second is the social bot, which can converse seamlessly and appropriately with humans. In the final part of the tutorial, we review attempts to develop open-domain neural dialogue systems by combining the strengths of task-oriented dialogue systems and social bots.
Abstract: The generative adversarial network (GAN) is a new approach to training generative models, in which a generator and a discriminator compete against each other to improve the quality of generation. Recently, GANs have shown amazing results in image generation, and a wide variety of new ideas, techniques, and applications have been developed based on them. Although there are only a few successful cases so far, GANs have great potential to be applied to text and speech generation to overcome limitations of conventional methods. There are three parts in this tutorial. In the first part, we will give an introduction to the generative adversarial network (GAN) and provide a thorough review of this technology. In the second part, we will focus on applications of GANs to speech signal processing, including speech enhancement, voice conversion, speech synthesis, speech and speaker recognition, and lip reading. In the third part, we will describe the major challenge of sentence generation with GANs and review a series of approaches dealing with this challenge. We will also present algorithms that use GANs to improve the quality of the sentences generated by chat-bots, to achieve unsupervised machine translation, and to perform text style transformation without paired data.
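The adversarial objective underlying the generator/discriminator competition can be sketched numerically. The snippet below is a toy illustration of the standard GAN losses (with the common non-saturating generator variant), not any specific system from the tutorial; the logit values are invented:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(logits_real, logits_fake):
    """The discriminator maximizes log D(x) + log(1 - D(G(z)));
    written here as a loss to be minimized."""
    d_real = sigmoid(logits_real)
    d_fake = sigmoid(logits_fake)
    return float(-(np.log(d_real) + np.log(1.0 - d_fake)).mean())

def generator_loss(logits_fake):
    """Non-saturating generator loss: minimize -log D(G(z)),
    i.e., try to make the discriminator accept generated samples."""
    return float(-np.log(sigmoid(logits_fake)).mean())

# At the theoretical equilibrium the discriminator outputs D = 0.5
# everywhere (logits = 0), giving a discriminator loss of 2*log 2.
zeros = np.zeros(4)
d_loss = discriminator_loss(zeros, zeros)
g_loss = generator_loss(zeros)
```

For sentence generation, the difficulty the tutorial refers to is that sampling discrete words breaks the gradient path from discriminator to generator, which motivates the reinforcement-learning and continuous-relaxation workarounds reviewed in the third part.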
Abstract: Streaming automatic speech recognition (ASR) systems are composed of a set of separate components: an acoustic model (AM), a pronunciation model (PM), a language model (LM), and an endpointer (EP). The AM takes acoustic features as input and predicts a distribution over subword units, typically context-dependent phonemes. The PM, which is traditionally a hand-engineered lexicon, maps the sequence of subword units produced by the acoustic model to words. The LM assigns probabilities to the resulting word hypotheses. Finally, the EP determines when the user of the system has finished speaking. In traditional ASR systems, these components are trained independently on different datasets, with a number of independence assumptions made for tractability.
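How the separately trained components combine at decoding time can be shown with a deliberately tiny sketch. The lexicon, phoneme hypotheses, and LM probabilities below are all hypothetical, and a real decoder searches a weighted graph rather than a handful of candidates:

```python
import math

# PM: a hand-engineered lexicon mapping phoneme sequences to words.
lexicon = {("k", "ae", "t"): "cat", ("k", "aa", "t"): "cot"}
# LM: unigram word probabilities (invented for illustration).
lm_prob = {"cat": 0.02, "cot": 0.001}

def decode(phoneme_hyps):
    """phoneme_hyps: list of (phoneme sequence, AM log-score) pairs.
    Combine the AM score with the LM score in the log domain and
    return the best-scoring word hypothesis."""
    best_word, best_score = None, float("-inf")
    for phones, am_score in phoneme_hyps:
        word = lexicon.get(tuple(phones))
        if word is None:
            continue  # pronunciation not in the lexicon
        score = am_score + math.log(lm_prob[word])
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# The AM slightly prefers "cot", but the LM prior flips the decision.
best = decode([(("k", "ae", "t"), -1.0), (("k", "aa", "t"), -0.9)])
```

The point of the sketch is the seam between components: each score comes from a model trained in isolation, which is exactly the separation that end-to-end systems remove.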
Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly in a single system. Examples of such systems include attention-based models [1, 6], the recurrent neural network transducer [2, 3], the recurrent neural aligner [4], and connectionist temporal classification with word targets [5]. A common feature of all of these models is that they consist of a single neural network which, given input acoustic frames, directly outputs a probability distribution over grapheme or word hypotheses. In fact, as has been demonstrated in recent work, such end-to-end models can surpass the performance of conventional ASR systems [6].
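To make "directly outputs a probability distribution over graphemes" concrete, here is the standard best-path (greedy) decoding rule used with CTC-style models: take the most likely symbol per frame, then collapse repeats and drop blanks. The frame-wise distributions below are invented toy numbers, not the output of any of the cited systems:

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax grapheme per frame, then
    collapse consecutive repeats and remove blank symbols."""
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], blank
    for idx in path:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

# Per-frame distributions over (blank, 'h', 'i'); toy values.
probs = [
    [0.1, 0.8, 0.1],  # argmax 'h'
    [0.1, 0.7, 0.2],  # 'h' again -- collapsed as a repeat
    [0.8, 0.1, 0.1],  # blank
    [0.1, 0.1, 0.8],  # 'i'
]
word = ctc_greedy_decode(probs, ["-", "h", "i"])
```

Greedy decoding is only an approximation to the most probable label sequence; in practice these models are typically decoded with beam search, often combined with an external LM.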
In this tutorial, we will provide a detailed introduction to the topic of end-to-end modeling in the context of ASR. We will begin by charting out the historical development of these systems, while emphasizing the commonalities and the differences between the various end-to-end approaches that have been considered in the literature. We will then discuss a number of recently introduced innovations that have significantly improved the performance of end-to-end models, allowing them to surpass the performance of conventional ASR systems. The tutorial will then describe some of the exciting applications of this research, along with possible fruitful directions to explore.
Finally, the tutorial will discuss some of the shortcomings of existing end-to-end modeling approaches and describe ongoing efforts to address these challenges.
[1] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell,” in Proc. ICASSP, 2016.
[2] A. Graves, “Sequence transduction with recurrent neural networks,” in Proc. ICASSP, 2012.
[3] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer,” in Proc. ASRU, 2017.
[4] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence-to-sequence mapping,” in Proc. Interspeech, 2017.
[5] H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” in Proc. Interspeech, 2017.
[6] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art Speech Recognition with Sequence-to-Sequence Models,” in Proc. ICASSP, 2018.