Automatic speech recognition (ASR) is a core technology of the information age. ASR systems have evolved from recognizing isolated digits to transcribing telephone-quality, spontaneous speech, enabling a growing number of practical applications across many sectors. Nevertheless, serious challenges remain that require major improvements at almost every stage of the speech recognition process. Until very recently, the standard approach to ASR had remained largely unchanged for many years: Hidden Markov Models (HMMs) modeled the sequential structure of speech signals, with each HMM state using a mixture of diagonal-covariance Gaussians (a Gaussian mixture model, GMM) to model a spectral representation of the sound wave.
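For concreteness, a minimal numpy sketch of the per-state emission score in such a system, i.e. the log-likelihood of one feature frame under a diagonal-covariance GMM (the function name and array shapes are illustrative, not taken from the thesis):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature frame x under a diagonal-covariance GMM.

    x:         (D,)   acoustic feature vector for one frame
    weights:   (K,)   mixture weights, summing to 1
    means:     (K, D) per-component means
    variances: (K, D) per-component diagonal variances
    """
    D = x.shape[0]
    # Per-component diagonal Gaussian log-densities.
    log_comp = -0.5 * (D * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1)
                       + np.sum((x - means) ** 2 / variances, axis=1))
    # Log-sum-exp over components, weighted by the mixture priors.
    a = np.log(weights) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())
```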
This thesis describes new acoustic models based on Deep Neural Networks (DNNs) that have begun to replace GMMs. For ASR, the deep structure of a DNN, together with its distributed representations, allows learned features to generalize better to new situations, even when only small amounts of training data are available. In addition, DNN acoustic models scale well to large-vocabulary tasks, significantly improving upon the best previous systems.
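A hybrid DNN-HMM acoustic model of this kind takes a window of spliced feature frames and predicts posteriors over tied HMM states. A minimal PyTorch sketch, with all layer sizes chosen for illustration rather than taken from the thesis:

```python
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    """Feed-forward DNN mapping a window of spliced feature frames to
    logits over tied HMM states; trained with cross-entropy."""

    def __init__(self, feat_dim=40, context=11, hidden=2048,
                 num_layers=5, num_states=3000):
        super().__init__()
        layers, in_dim = [], feat_dim * context
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, hidden), nn.Sigmoid()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, num_states))  # per-state logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):           # x: (batch, feat_dim * context)
        return self.net(x)          # softmax/cross-entropy applied outside
```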
Different input feature representations are analyzed to determine which is most suitable for DNN acoustic models. Mel-frequency cepstral coefficients (MFCCs) prove inferior to log Mel-frequency spectral coefficients (MFSCs), which help DNN models marginalize out speaker-specific information while focusing on discriminative phonetic features. Various speaker adaptation techniques are also introduced to further improve DNN performance.
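The two representations differ only by a final discrete cosine transform: MFSCs are the log Mel-filterbank energies themselves, while MFCCs apply a DCT that decorrelates the bands but also mixes away their frequency locality. A sketch of that relationship in numpy, assuming a precomputed Mel filterbank matrix (the function and shapes are illustrative, not code from the thesis):

```python
import numpy as np

def mfsc_and_mfcc(power_spec, mel_fb, num_ceps=13):
    """power_spec: (F,)   power spectrum of one frame
       mel_fb:     (M, F) Mel filterbank matrix
    MFSCs keep the Mel-frequency locality that the MFCC DCT mixes away."""
    # MFSC: log Mel-filterbank energies (small floor avoids log(0)).
    mfsc = np.log(mel_fb @ power_spec + 1e-10)
    M = mfsc.shape[0]
    # Unnormalized DCT-II matrix applied to the log energies gives MFCCs.
    n = np.arange(M)
    dct = np.cos(np.pi * np.outer(np.arange(num_ceps), 2 * n + 1) / (2 * M))
    mfcc = dct @ mfsc
    return mfsc, mfcc
```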
Another deep acoustic model, based on Convolutional Neural Networks (CNNs), is also proposed. Rather than using fully connected hidden layers as in a DNN, a CNN uses a pair of convolutional and pooling layers as its building block. The convolution operation scans the frequency axis with learned local spectro-temporal filters. The pooling layer then applies a maximum operation to the learned features, exploiting the smoothness of the input MFSC features to eliminate speaker variations expressed as shifts along the frequency axis, in a manner similar to vocal tract length normalization (VTLN) techniques.
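A minimal PyTorch sketch of one such convolution-plus-pooling block over the frequency axis of MFSC inputs, with max pooling along frequency only; the filter counts, kernel sizes, and layer widths are assumptions for illustration, not the thesis's configuration:

```python
import torch
import torch.nn as nn

class FreqCNNAcousticModel(nn.Module):
    """Convolution over local frequency bands of MFSC input, followed by
    max pooling along frequency to absorb small spectral shifts."""

    def __init__(self, context=15, num_bands=40, num_states=3000):
        super().__init__()
        # Filters span the full time context and a local band of frequencies.
        self.conv = nn.Conv2d(1, 64, kernel_size=(context, 8))  # (time, freq)
        self.pool = nn.MaxPool2d(kernel_size=(1, 3))  # pool along frequency only
        conv_out = 64 * ((num_bands - 8 + 1) // 3)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(conv_out, 1024), nn.Sigmoid(),
            nn.Linear(1024, num_states))  # per-state logits

    def forward(self, x):  # x: (batch, 1, context, num_bands)
        h = torch.sigmoid(self.conv(x))
        return self.fc(self.pool(h))
```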
We show that the proposed DNN and CNN acoustic models achieve significant improvements over GMMs on various small and large vocabulary tasks.
Identifier | oai:union.ndltd.org:TORONTO/oai:tspace.library.utoronto.ca:1807/44123
Date | 01 April 2014
Creators | Mohamed, Abdel-rahman
Contributors | Penn, Gerald; Hinton, Geoffrey
Source Sets | University of Toronto
Language | en_ca
Detected Language | English
Type | Thesis