421 |
A comparison of Gaussian mixture variants with application to automatic phoneme recognition. Brand, Rinus (2007)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2007. / The diagonal covariance Gaussian Probability Density Function (PDF) has been a very
popular choice as the base PDF for Automatic Speech Recognition (ASR) systems. The
only choices thus far have been between the spherical, diagonal and full covariance Gaussian
PDFs. These classic methods have been used for some time, but no single document could be found that compares them in the context of Pattern Recognition (PR).
There is also a gap in complexity and speed between the diagonal and full covariance Gaussian implementations: the two methods differ drastically in accuracy, speed and size. One or more models are needed to cover the area between these two classic methods.
The objectives of this thesis are to evaluate three new PDF types that fit into the area
between the diagonal and full covariance Gaussian implementations to broaden the choices
for ASR, to document a comparative study on the three classic methods and the newly
implemented methods (from previous work) and to construct a test system to evaluate these
methods on phoneme recognition.
The three classic density functions are examined, and issues regarding the theory, implementation and usefulness of each are discussed. A visual example of each is given to show the impact of the assumptions (if any) that each one makes.
The three newly implemented PDFs are the Sparse-, Probabilistic Principal Component
Analysis- (PPCA) and Factor Analysis (FA) covariance Gaussian PDFs. The theory, implementation
and practical usefulness are shown and discussed. Again visual examples are
provided to show the difference in modelling methodologies.
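To make the distinction between these covariance structures concrete, here is a minimal numpy sketch (not the thesis code; the feature and latent dimensions are assumed) that evaluates a Gaussian log-likelihood under diagonal, PPCA and FA covariance models. The point of interest is the parameter count: a diagonal covariance has d free parameters and a full covariance d(d+1)/2, while PPCA (W W^T + sigma^2 I) and FA (W W^T + Psi, with Psi diagonal) sit in between with roughly d*q + 1 and d*q + d parameters for latent dimension q.

```python
import numpy as np

def gaussian_loglik(x, mu, cov):
    """Log-density of x under N(mu, cov); cov is a full DxD matrix."""
    d = len(mu)
    diff = x - mu
    # Cholesky factorisation gives a stable log-determinant and solve.
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, diff)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2 * np.pi) + logdet + z @ z)

d, q = 39, 5                      # feature dim, latent dim (assumed values)
rng = np.random.default_rng(0)
mu = rng.normal(size=d)
x = rng.normal(size=d)

# Diagonal covariance: d free parameters.
cov_diag = np.diag(rng.uniform(0.5, 2.0, size=d))

# PPCA covariance: W W^T + sigma^2 I, with d*q + 1 free parameters.
W = rng.normal(size=(d, q))
cov_ppca = W @ W.T + 0.1 * np.eye(d)

# FA covariance: W W^T + Psi, Psi diagonal, with d*q + d free parameters.
Psi = np.diag(rng.uniform(0.1, 0.5, size=d))
cov_fa = W @ W.T + Psi

for name, cov in [("diag", cov_diag), ("PPCA", cov_ppca), ("FA", cov_fa)]:
    print(name, gaussian_loglik(x, mu, cov))
```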
The construction of a test system using two speech corpora is shown and includes issues
involving signal processing, PR and evaluation of the results. The NTIMIT and AST speech
corpora were used in initialisation and training the test system. The usage of the system to
evaluate the PDFs discussed in this work is explained.
The testing results of the three new methods confirmed that they indeed fill the gap between the diagonal and full covariance Gaussians. In our tests the newly implemented methods produced a relative improvement in error rate of 0.3–4% over a similarly implemented diagonal covariance Gaussian, but took 35–78% longer to evaluate. Compared with the full covariance Gaussian, their error rates were 18–22% worse, but their evaluation times were 61–70% faster. When all the methods were scaled to approximately the same accuracy, the above methods (excluding the spherical covariance method) were 29–143% slower than the diagonal covariance Gaussian.
|
422 |
Graphical Models for Robust Speech Recognition in Adverse Environments. Rennie, Steven J. (01 August 2008)
Robust speech recognition in acoustic environments that contain multiple speech sources and/or complex non-stationary noise is a difficult problem, but one of great practical interest. The formalism of probabilistic graphical models constitutes a relatively new and very powerful tool for better understanding and extending existing
models, learning, and inference algorithms; and a bedrock for the creative, quasi-systematic development of new ones. In this thesis a collection of new graphical models and inference algorithms for robust speech recognition is presented.
The problem of speech separation using multiple microphones is first treated. A family of variational algorithms for tractably combining multiple acoustic models of speech with observed sensor likelihoods is presented. The algorithms recover high quality estimates of the speech sources even when there are more sources than microphones, and have improved upon the state-of-the-art in terms of SNR gain by over 10 dB.
Next the problem of background compensation in non-stationary acoustic environments is treated. A new dynamic noise adaptation (DNA) algorithm for robust noise compensation is presented, and shown to outperform several existing state-of-the-art
front-end denoising systems on the new DNA + Aurora II and Aurora II-M extensions of the Aurora II task.
Finally, the problem of recognizing speech in the presence of competing speech using a single microphone is treated. The Iroquois system for multi-talker speech separation and recognition is presented. The system won the 2006 Pascal International Speech Separation Challenge and, amazingly, achieved super-human recognition performance on a majority of test cases in the task. The result marks a significant first in automatic speech recognition, and a milestone in computing.
|
423 |
Prédiction de performances des systèmes de Reconnaissance Automatique de la Parole / Performance prediction of Automatic Speech Recognition systems. Elloumi, Zied (18 March 2019)
In this thesis, we focus on the performance prediction of automatic speech recognition (ASR) systems. This task is useful for measuring the reliability of transcription hypotheses on a new data collection when the reference transcription is unavailable and the ASR system used is unknown (black box). Our contribution covers several areas. First, we propose a heterogeneous French corpus for training and evaluating both ASR systems and performance prediction systems. We then compare two prediction approaches: a state-of-the-art (SOTA) approach based on explicitly engineered features, and a new strategy based on features learnt implicitly by convolutional neural networks (CNNs). While the joint use of textual and signal features brings no gain for the SOTA system, combining these inputs in the CNNs yields the best WER prediction performance. We also show that the CNNs accurately predict the shape of the WER distribution over a collection of speech recordings, whereas the SOTA approach generates a distribution far from the observed one.
We then analyze factors impacting both prediction approaches. We also assess the impact of the amount of training data for the prediction systems, as well as the robustness of systems trained on the outputs of one particular ASR system and used to predict performance on a new data collection. Our experimental results show that both prediction approaches are robust, and that the prediction task is harder on short speech turns and on spontaneous speech. Finally, we investigate which information is captured by our neural model and how it relates to different factors. Our experiments show that intermediate representations in the network implicitly encode information on speech style, speaker accent and broadcast program type. To take advantage of this analysis, we propose a multi-task system that is slightly more effective on the performance prediction task.
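As an illustration of the second approach, the following PyTorch sketch shows the general shape of a CNN that maps an utterance's acoustic features to a scalar WER estimate. The architecture (channel counts, kernel sizes, mel dimension) is an assumption for illustration, not the network described in the thesis, which also consumes textual inputs.

```python
import torch
import torch.nn as nn

class WERPredictor(nn.Module):
    """Toy CNN that maps a sequence of acoustic frames to a scalar WER estimate.
    Architecture details are illustrative assumptions, not the thesis model."""
    def __init__(self, n_mels=40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # pool over time -> fixed-size vector
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, feats):                  # feats: (batch, n_mels, frames)
        return self.head(self.conv(feats))     # predicted WER in [0, 1]

model = WERPredictor()
dummy = torch.randn(8, 40, 300)                # 8 utterances, 300 frames each
print(model(dummy).shape)                      # torch.Size([8, 1])
```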
|
424 |
Dynamic Time Warping baseado na transformada wavelet / Dynamic Time Warping based on wavelet transform. Barbon Júnior, Sylvio (31 August 2007)
Dynamic Time Warping (DTW) is a pattern matching technique for speech recognition, based on the temporal alignment of an input signal with a set of reference templates. One drawback of this technique is its high computational cost. This work presents a modified version of DTW that uses the Discrete Wavelet Transform (DWT) to reduce the complexity of the original algorithm. The performance obtained with the proposed algorithm is very promising, improving recognition speed and memory consumption while leaving the precision of DTW unaffected. Tests were performed with phonemes extracted from the TIMIT corpus provided by the Linguistic Data Consortium (LDC).
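The complexity reduction rests on a simple observation: each level of the DWT halves the length of the sequences being aligned, and DTW's dynamic-programming cost is proportional to the product of the two lengths. The sketch below (numpy only; a plain Haar approximation stands in for whatever wavelet family the thesis used) matches two signals after two levels of decomposition, shrinking the DP grid roughly sixteen-fold.

```python
import numpy as np

def haar_approx(x, levels=2):
    """Keep only the Haar DWT approximation coefficients: each level halves
    the sequence length (pairwise averages scaled by sqrt(2))."""
    x = np.asarray(x, dtype=float)
    for _ in range(levels):
        if len(x) % 2:                 # pad odd-length signals
            x = np.append(x, x[-1])
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    return x

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

sig = np.sin(np.linspace(0, 6, 256))
ref = np.sin(np.linspace(0, 6, 300) + 0.1)
# Two DWT levels shrink each sequence 4x, so the DP grid is ~16x smaller.
print(dtw_distance(haar_approx(sig), haar_approx(ref)))
```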
|
425 |
Traitement de l'incertitude pour la reconnaissance de la parole robuste au bruit / Uncertainty learning for noise robust ASR. Tran, Dung Tien (20 November 2015)
This thesis focuses on noise robust automatic speech recognition (ASR). It includes two parts. First, we focus on better handling of uncertainty to improve the performance of ASR in a noisy environment. Second, we present a method to accelerate the training process of a neural network using an auxiliary function technique.
In the first part, multichannel speech enhancement is applied to the input noisy speech. The posterior distribution of the underlying clean speech is then estimated, as represented by its mean and its covariance matrix, or uncertainty. We show how to propagate the diagonal uncertainty covariance matrix in the spectral domain through the feature computation stage to obtain the full uncertainty covariance matrix in the feature domain. Uncertainty decoding exploits this posterior distribution to dynamically modify the acoustic model parameters in the decoding rule. The uncertainty decoding rule simply consists of adding the uncertainty covariance matrix of the enhanced features to the variance of each Gaussian component. We then propose two uncertainty estimators, based on fusion and on nonparametric estimation, respectively. To build a new estimator, we consider a linear combination of existing uncertainty estimators or kernel functions. The combination weights are generatively estimated by minimizing some divergence with respect to the oracle uncertainty. The divergence measures used are weighted versions of the Kullback-Leibler (KL), Itakura-Saito (IS), and Euclidean (EU) divergences. Due to the inherent nonnegativity of uncertainty, this estimation problem can be seen as an instance of weighted nonnegative matrix factorization (NMF). In addition, we propose two discriminative uncertainty estimators based on a linear or nonlinear mapping of the generatively estimated uncertainty. This mapping is trained so as to maximize the boosted maximum mutual information (bMMI) criterion. We compute the derivative of this criterion using the chain rule and optimize it using stochastic gradient descent.
In the second part, we introduce a new learning rule for neural networks that is based on an auxiliary function technique without parameter tuning. Instead of minimizing the objective function, this technique consists of minimizing a quadratic auxiliary function which is recursively introduced layer by layer and which has a closed-form optimum. Based on the properties of this auxiliary function, the monotonic decrease of the objective function under the new learning rule is guaranteed.
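The uncertainty decoding rule quoted above is easy to state in code: per frame, the estimated uncertainty variance of the enhanced features is added to each Gaussian component's variance before the likelihood is computed. The numpy sketch below uses a diagonal uncertainty for simplicity (the thesis propagates a full covariance matrix) and invented GMM parameters.

```python
import numpy as np

def uncertain_loglik(x, var_x, means, variances, weights):
    """Per-frame GMM log-likelihood under uncertainty decoding: the enhanced
    feature's uncertainty variance var_x (diagonal) is added to each Gaussian
    component's variance before evaluation."""
    total_var = variances + var_x          # broadcast: (K, D) + (D,)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * total_var), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / total_var, axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_norm + log_exp)

K, D = 4, 13                               # components, feature dim (assumed)
rng = np.random.default_rng(1)
means = rng.normal(size=(K, D))
variances = rng.uniform(0.5, 1.5, size=(K, D))
weights = np.full(K, 1.0 / K)
x = rng.normal(size=D)                     # enhanced feature frame
var_x = rng.uniform(0.0, 0.3, size=D)      # its estimated uncertainty (diagonal)

print(uncertain_loglik(x, var_x, means, variances, weights))
print(uncertain_loglik(x, np.zeros(D), means, variances, weights))  # standard decoding
```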
|
426 |
Measuring Speech Intelligibility in Voice Alarm Communication Systems. Geoffroy, Nancy Anne (04 May 2005)
Speech intelligibility of voice alarm communication systems is extremely important for proper notification and direction of building occupants. Currently, there is no minimum standard to which all voice alarm communication systems must be held. Tests were conducted to determine how system and room characteristics, and the addition of occupants, affect the intelligibility of a voice signal. This research outlines a methodology for measuring the speech intelligibility of a room and describes the impact of numerous variables on these measurements. Eight variables were considered for this study: speaker quantity and location, speaker power tap, sound pressure level (SPL), number and location of occupants, presence of furniture, location of intelligibility measurements, data collection method, and floor covering. All room characteristics had some effect on room intelligibility; the sound pressure level of the signal and the number and location of occupants had the greatest overall impact on the intelligibility of the room. It is recommended, based on the results of this study, that further investigation be conducted in the following areas: floor finishes, speaker directivity, various population densities, furniture packages and room sizes.
|
427 |
A computational model for studying L1's effect on L2 speech learning (January 2018)
Much evidence has shown that the first language (L1) plays an important role in the formation of the L2 phonological system during second language (L2) learning. Together with the fact that different L1s have distinct phonological patterns, this points to diverse L2 speech learning outcomes for speakers from different L1 backgrounds. This dissertation hypothesizes that phonological distances between accented speech and speakers' L1 speech are correlated with perceived accentedness, and that the correlations are negative for some phonological properties. Moreover, contrastive phonological distinctions between L1s and the L2 will manifest themselves in the accented speech produced by speakers from these L1s. To test these hypotheses, this study develops a computational model to analyze accented speech properties in both the segmental (short-term speech measurements at the short-segment or phoneme level) and suprasegmental (long-term speech measurements at the word, long-segment, or sentence level) feature space. The benefit of using a computational model is that it enables quantitative analysis of L1's effect on accent in terms of different phonological properties. The core parts of this computational model are feature extraction schemes that extract pronunciation and prosody representations of accented speech based on existing techniques from the speech processing field. Correlation analysis on both segmental and suprasegmental feature spaces is conducted to examine the relationship between L1-related acoustic measurements and perceived accentedness across several L1s. Multiple regression analysis is employed to investigate how the L1's effect impacts the perception of foreign accent, and how accented speech produced by speakers from different L1s behaves distinctly in the segmental and suprasegmental feature spaces. The results demonstrate the potential of this methodology to provide quantitative analysis of accented speech and to extend current studies in L2 speech learning theory to large scale. Practically, this study further shows that the proposed computational model can benefit automatic accentedness evaluation systems by adding features related to speakers' L1s. / Doctoral Dissertation, Speech and Hearing Science, 2018.
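The analysis pipeline described here (per-speaker distance features, Pearson correlations against perceived accentedness, then multiple regression) can be sketched in a few lines of numpy. The data below are synthetic stand-ins; the actual features and ratings come from the dissertation's corpus.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120                                   # speakers (synthetic stand-in data)
# Hypothetical per-speaker measurements: segmental and suprasegmental
# phonological distances to native productions (assumed features).
X = rng.normal(size=(n, 3))               # e.g. phone distance, rhythm, pitch range
beta_true = np.array([0.9, 0.4, 0.2])
accent = X @ beta_true + rng.normal(scale=0.5, size=n)  # perceived accentedness

# Pearson correlation of each feature with perceived accentedness.
for k in range(X.shape[1]):
    r = np.corrcoef(X[:, k], accent)[0, 1]
    print(f"feature {k}: r = {r:+.2f}")

# Multiple regression via least squares (intercept column added manually).
A = np.hstack([np.ones((n, 1)), X])
coef, *_ = np.linalg.lstsq(A, accent, rcond=None)
print("intercept and weights:", np.round(coef, 2))
```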
|
428 |
Perceptual features for speech recognition. Haque, Serajul (January 2008)
Automatic speech recognition (ASR) is one of the most important research areas in the field of speech technology and research. It is also known as the recognition of speech by a machine or by some artificial intelligence. However, in spite of focused research in this field over the past several decades, robust speech recognition with high reliability has not been achieved, as performance degrades in the presence of speaker variabilities, channel mismatch conditions, and noisy environments. The superb ability of the human auditory system has motivated researchers to include features of human perception in the speech recognition process. This dissertation investigates the roles of perceptual features of human hearing in automatic speech recognition in clean and noisy environments.
Methods of simplified synaptic adaptation and two-tone suppression by companding are introduced via temporal processing of speech using a zero-crossing algorithm. It is observed that a high-frequency enhancement technique such as synaptic adaptation performs better in stationary Gaussian white noise, whereas a low-frequency enhancement technique such as two-tone suppression performs better in non-Gaussian, non-stationary noise types. The effects of static compression on ASR parametrization are investigated, as observed in the psychoacoustic input/output (I/O) perception curves. A method of frequency-dependent asymmetric compression, that is, higher compression in the higher frequency regions than in the lower frequency regions, is proposed. By asymmetric compression, degradation of the spectral contrast of the low-frequency formants due to the added compression is avoided.
A novel feature extraction method for ASR based on auditory processing in the cochlear nucleus is presented. The processing for synchrony detection, average discharge (mean rate) processing and two-tone suppression are segregated and processed separately at the feature extraction level, following the differential processing scheme observed in the AVCN, PVCN and DCN, respectively, of the cochlear nucleus. It is further observed that improved ASR performance can be achieved by separating synchrony detection from synaptic processing.
A time-frequency perceptual spectral subtraction method based on several psychoacoustic properties of human audition is developed and evaluated with an ASR front-end. An auditory masking threshold is determined based on these psychoacoustic effects. It is observed that in speech recognition applications, spectral subtraction utilizing psychoacoustics may be used for improved performance in noisy conditions. Performance may be further improved if masking of noise by the tonal components is augmented by spectral subtraction in the masked region.
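For reference, a baseline magnitude spectral subtraction front-end, the starting point that the thesis augments with psychoacoustic masking thresholds, can be sketched as follows in numpy. The over-subtraction factor, spectral floor, and frame parameters are illustrative assumptions, and no masking model is included here.

```python
import numpy as np

def spectral_subtraction(noisy, noise, frame=256, hop=128, alpha=2.0, beta=0.01):
    """Basic magnitude spectral subtraction (no psychoacoustic masking model):
    subtract alpha * average noise magnitude, floor the result at beta of the
    noisy magnitude, resynthesise with overlap-add."""
    win = np.hanning(frame)
    # Average noise magnitude spectrum from a noise-only recording.
    noise_mag = np.mean([np.abs(np.fft.rfft(noise[i:i + frame] * win))
                         for i in range(0, len(noise) - frame, hop)], axis=0)
    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame, hop):
        spec = np.fft.rfft(noisy[i:i + frame] * win)
        mag = np.abs(spec) - alpha * noise_mag         # over-subtraction
        mag = np.maximum(mag, beta * np.abs(spec))     # spectral floor
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out

rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = 0.3 * rng.normal(size=16000)
# Toy demo: the same noise realisation serves as the noise-only estimate.
enhanced = spectral_subtraction(clean + noise, noise)
print(enhanced.shape)
```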
|
429 |
Spectro-Temporal Features For Robust Automatic Speech Recognition. Suryanarayana, Venkata K
The speech signal is inherently characterized by its variations in time, which are reflected as variations in frequency. These spectro-temporal changes are due to changes in the vocal tract, intonation, co-articulation and the successive articulation of different phonetic sounds. In this thesis we seek to improve speech recognition performance through better feature parameters derived from a non-stationary model of speech.
One effective means of modeling a general non-stationary signal is the AM-FM model, which can be extended to speech through a sub-band analysis that mimics auditory analysis. In this thesis, we explore new methods for estimating AM and FM parameters based on non-uniform samples of the signal. The non-uniform sampling approach, along with adaptive window estimation, provides an important advantage because of multi-resolution analysis. We develop several new methods based on zero-crossing (ZC) intervals, local extrema intervals and the signal derivative at ZCs as different sample measures of the signal, and explore their effectiveness for instantaneous frequency (IF) and instantaneous envelope (IE) estimation.
To apply this to automatic speech recognition, we explore the use of auditory-motivated spectro-temporal information through an auditory filter bank; signal parameters (or features) are derived from the instantaneous energy in each band using the non-linear energy operator over a larger window length. The temporal correlation present in the signal is exploited by applying the DCT and keeping its lower few coefficients to capture the energy trend in each band. The DCT coefficients from different frequency bands are concatenated, and further spectral decorrelation is achieved through a KLT (Karhunen-Loeve Transform) of the concatenated feature vector (this pipeline is sketched below).
Changes in the vocal tract are well captured by changes in the formant structure, and to emphasize these details for ASR we define a temporal formant using the AM-FM decomposition of sub-band speech. Uniform wideband non-overlapping filters are used for the sub-band decomposition. The temporal formant is defined using the AM-FM parameters of each sub-band signal; the temporal evolution of a formant is represented by the lower-order DCT coefficients of the temporal formant in each band, and its use for ASR is explored.
To address the robustness of ASR performance in noisy environmental conditions, we use a hybrid approach of enhancing the speech signal using statistical models of speech and noise. The use of GMMs for statistical speech enhancement has been shown to be effective. We find that the spectro-temporal features derived from enhanced speech provide further improvement to ASR performance.
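A sketch of the band-energy feature pipeline referenced above follows: per-band non-linear (Teager) energy, log compression, a low-order DCT over time to keep the energy trend, concatenation across bands, and a KLT learned from data. The auditory filter bank itself is omitted (random signals stand in for sub-band outputs), and all dimensions are assumptions, not the thesis configuration.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def dct_matrix(n_out, n_in):
    """Orthonormal DCT-II basis, keeping the first n_out coefficients."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    M = np.sqrt(2.0 / n_in) * np.cos(np.pi * k * (n + 0.5) / n_in)
    M[0] /= np.sqrt(2.0)
    return M

def spectro_temporal_features(subbands, n_dct=6):
    """subbands: (n_bands, n_samples) outputs of an auditory filter bank.
    Per band: Teager energy -> log -> low-order DCT over time; concatenate."""
    feats = []
    for band in subbands:
        e = np.log(np.maximum(teager_energy(band), 1e-10))
        feats.append(dct_matrix(n_dct, len(e)) @ e)   # keep the energy trend
    return np.concatenate(feats)

# Decorrelate a set of such feature vectors with a KLT (PCA) learned on data.
rng = np.random.default_rng(4)
vecs = np.stack([spectro_temporal_features(rng.normal(size=(8, 400)))
                 for _ in range(50)])
cov = np.cov(vecs, rowvar=False)
_, eigvecs = np.linalg.eigh(cov)
klt = (vecs - vecs.mean(0)) @ eigvecs[:, ::-1]        # leading components first
print(klt.shape)
```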
|
430 |
Speech Signal Classification Using Support Vector Machines. Sood, Gaurav
Hidden Markov Models (HMMs) are, undoubtedly, the most employed core technique for Automatic Speech Recognition (ASR). Nevertheless, we are still far from achieving high-performance ASR systems. Some alternative approaches, most of them based on Artificial Neural Networks (ANNs), were proposed during the late eighties and early nineties. Some of them tackled the ASR problem using predictive ANNs, while others proposed hybrid HMM/ANN systems. However, despite some achievements, the dependency on Hidden Markov Models remains a fact. During the last decade, however, a new tool appeared in the field of machine learning that has proved able to cope with hard classification problems in several fields of application: the Support Vector Machine (SVM). SVMs are effective discriminative classifiers with several outstanding characteristics: their solution is the one with maximum margin; they can deal with samples of very high dimensionality; and their convergence to the minimum of the associated cost function is guaranteed.
In this work a novel approach based upon probabilistic kernels in support vector machines has been attempted for speech data classification. The classification accuracy of support vector classification depends upon the kernel function used, which in turn depends upon the data set at hand, and there is still no way to know a priori which kernel will give the best results.
The kernel used in this work normalizes the time dimension inherent to speech signals by fitting a probability distribution over each utterance; this facilitates the use of support vector machines, which act on static data only. The divergence between the probability distributions fitted over individual speech utterances is used to form the kernel matrix. Vowel classification and isolated word recognition (digit recognition) have been attempted, and the results are compared with state-of-the-art systems.
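The kernel construction described above can be sketched as follows: each variable-length utterance is summarised by a probability distribution (here a diagonal Gaussian with a symmetric KL divergence, an assumption for illustration), and the kernel matrix is formed from the pairwise divergences. Note that exp(-gamma * KL) is not guaranteed to be positive definite, a known caveat of such probabilistic kernels; the data below are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

def fit_diag_gaussian(frames):
    """Summarise a variable-length utterance (frames x dims) by a diagonal Gaussian."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def sym_kl(p, q):
    """Symmetric KL divergence between two diagonal Gaussians (mu, var)."""
    (m1, v1), (m2, v2) = p, q
    kl = lambda ma, va, mb, vb: 0.5 * np.sum(va / vb + (mb - ma) ** 2 / vb
                                             + np.log(vb / va) - 1.0)
    return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

rng = np.random.default_rng(5)
# Synthetic stand-in: two "classes" of variable-length utterances.
utts = [rng.normal(loc=c, size=(rng.integers(50, 120), 13)) for c in (0.0, 1.0)
        for _ in range(20)]
labels = np.array([0] * 20 + [1] * 20)
models = [fit_diag_gaussian(u) for u in utts]

gamma = 0.05
K = np.array([[np.exp(-gamma * sym_kl(a, b)) for b in models] for a in models])
clf = SVC(kernel="precomputed").fit(K, labels)
print("training accuracy:", clf.score(K, labels))
```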
|