Global ETD Search

1	Accounting for Individual Speaker Properties in Automatic Speech Recognition Elenius, Daniel January 2010 In this work, speaker characteristic modeling has been applied in the fields of automatic speech recognition (ASR) and automatic speaker verification (ASV). In ASR, a key problem is that acoustic mismatch between training and test conditions degrades classification performance. In this work, a child exemplifies a speaker not represented in training data, and methods to reduce the spectral mismatch are devised and evaluated. To reduce the acoustic mismatch, predictive modeling based on spectral speech transformation is applied. Following this approach, a model suitable for a target speaker, not well represented in the training data, is estimated and synthesized by applying vocal tract predictive modeling (VTPM). In this thesis, the traditional static modeling on the utterance level is extended to dynamic modeling. This is accomplished by operating also on sub-utterance units, such as phonemes, phone-realizations, sub-phone realizations and sound frames. Initial experiments show that adaptation of an acoustic model trained on adult speech significantly reduced the word error rate of ASR for children, but not to the level of a model trained on children’s speech. Multi-speaker-group training provided an acoustic model that performed recognition for both adults and children within the same model at almost the same accuracy as speaker-group dedicated models, with no added model complexity. In the analysis of the cause of errors, body height of the child was shown to be correlated with word error rate. A further result is that the computationally demanding iterative recognition process in standard VTLN can be replaced by synthetically extending the vocal tract length distribution in the training data. A multi-warp model is trained on the extended data and recognition is performed in a single pass.
The accuracy is similar to that of the standard technique. A concluding experiment in ASR shows that the word error rate can be reduced by extending a static vocal tract length compensation parameter into a temporal parameter track. A key component to reach this improvement was provided by a novel joint two-level optimization process. In the process, the track was determined as a composition of a static and a dynamic component, which were simultaneously optimized on the utterance and sub-utterance level respectively. This had the principal advantage of limiting the modulation amplitude of the track to what is realistic for an individual speaker. The recognition error rate was reduced by 10% relative compared with that of a standard utterance-specific estimation technique. The techniques devised and evaluated can also be applied to other speaker characteristic properties that exhibit a dynamic nature. An excursion into ASV led to the proposal of a statistical speaker population model. The model represents an alternative approach for determining the reject/accept threshold in an ASV system instead of the commonly used direct estimation on a set of client and impostor utterances. This is especially valuable in applications where a low false reject or false accept rate is required. In these cases, the number of errors is often too small to estimate a reliable threshold using the direct method. The results are encouraging but need to be verified on a larger database. / Pf-Star / KOBRA MAP MLLR VTLN speaker characteristics dynamic modeling child Information and language technology Informations- och språkteknologi
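The abstract above does not say which frequency warping function or track parameterisation the thesis uses. As a rough illustration only, the sketch below assumes a piecewise-linear warp (a common choice in VTLN) and builds a per-frame warp track as a static factor plus clipped dynamic offsets. The function names, the knee position and the clipping limit are all hypothetical.

```python
def warp_frequency(freq_hz, alpha, f_max_hz=8000.0, knee_ratio=0.8):
    """Piecewise-linear warp (assumed form, not taken from the thesis).

    Frequencies below the knee are scaled by alpha; above it, the warp is
    interpolated linearly so that f_max_hz maps onto itself.
    alpha * knee_ratio must stay below 1 for the warp to remain monotonic.
    """
    knee = knee_ratio * f_max_hz
    if freq_hz <= knee:
        return alpha * freq_hz
    slope = (f_max_hz - alpha * knee) / (f_max_hz - knee)
    return alpha * knee + slope * (freq_hz - knee)


def warp_track(static_alpha, offsets, max_offset=0.05):
    """Per-frame warp factors: a static value plus clipped dynamic offsets.

    Clipping stands in for the idea of limiting the modulation amplitude;
    the limit of 0.05 is an arbitrary illustrative value.
    """
    return [static_alpha + max(-max_offset, min(max_offset, d)) for d in offsets]


# Made-up offsets for four frames; the last two exceed the limit and are clipped.
track = warp_track(1.1, [0.0, 0.02, 0.2, -0.2])
for alpha in track:
    print(round(alpha, 3), round(warp_frequency(1000.0, alpha), 1))
```

Clipping the offsets is only one simple way to keep the track close to its static component; the joint two-level optimisation described in the abstract is not reproduced here.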
2	Utilisation des coefficients de régression linéaire par maximum de vraisemblance comme paramètres pour la reconnaissance automatique du locuteur Ferràs Font, Marc 10 July 2009 The goal of this thesis is to find new and efficient features for speaker recognition. We are mostly concerned with the use of the Maximum-Likelihood Linear Regression (MLLR) family of adaptation techniques as features in speaker recognition systems. MLLR transform coefficients are able to capture speaker cues after adaptation of a speaker-independent model using speech data. The resulting supervectors are high-dimensional, and no underlying model guiding their generation is assumed a priori, which makes them suitable for SVM classification. This thesis brings some contributions to the speaker recognition field by proposing new approaches to feature extraction and studying existing ones via experimentation on large corpora: 1. We propose a compact yet efficient system, MLLR-SVM, which tackles the issues of transcript- and language-dependency of the standard MLLR-SVM approach by using single-class Constrained MLLR (CMLLR) adaptation transforms together with Speaker Adaptive Training (SAT) of a Universal Background Model (UBM). 1- When fewer data samples than dimensions are available. 2- We propose several alternative representations of CMLLR transform coefficients based on the singular value and symmetric/skew-symmetric decompositions of transform matrices. 3- We develop a novel framework for feature-level inter-session variability compensation based on compensation of CMLLR transform supervectors via Nuisance Attribute Projection (NAP). 4- We perform a comprehensive experimental study of multi-class (C)MLLR-SVM systems along multiple axes including front-end, type of transform, type of model, model training and number of transforms. 5- We compare CMLLR and MLLR transform matrices based on an analysis of properties of their singular values.
6- We propose the use of lattice-based MLLR as a way to cope with erroneous transcripts in MLLR-SVM systems using phonemic acoustic models. traitement du langage MLLR reconnaissance du locuteur adaptation du locuteur
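The symmetric/skew-symmetric decomposition mentioned in this abstract is a standard identity: any square matrix A equals S + K with S = (A + Aᵀ)/2 and K = (A − Aᵀ)/2. The sketch below shows only that identity on a toy matrix. How the thesis turns the two parts into features is not stated in the abstract, and the function name and matrix values are made up.

```python
def sym_skew_decompose(a):
    """Split a square matrix (list of lists) into symmetric and skew-symmetric parts."""
    n = len(a)
    sym = [[(a[i][j] + a[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    skew = [[(a[i][j] - a[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    return sym, skew


# Toy 2x2 matrix standing in for a (C)MLLR transform; the values are invented.
transform = [[1.0, 0.5], [-0.25, 2.0]]
sym, skew = sym_skew_decompose(transform)

# The two parts add back up to the original matrix.
rebuilt = [[sym[i][j] + skew[i][j] for j in range(2)] for i in range(2)]
print(sym, skew, rebuilt == transform)
```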
3	A Study of the Automatic Speech Recognition Process and Speaker Adaptation Stokes-Rees, Ian James January 2000 This thesis considers the entire automated speech recognition process and presents a standardised approach to LVCSR experimentation with HMMs. It also discusses various approaches to speaker adaptation such as MLLR and multiscale, and presents experimental results for cross-task speaker adaptation. An analysis of training parameters and data sufficiency for reasonable system performance estimates is also included. It is found that Maximum Likelihood Linear Regression (MLLR) supervised adaptation can result in a 6% absolute reduction in word error rate given only one minute of adaptation data, as compared with an unadapted model set trained on a different task. The unadapted system performed at 24% WER and the adapted system at 18% WER. This is achieved with only 4 to 7 adaptation classes per speaker, as generated from a regression tree. Electrical & Computer Engineering automatic speech recognition speaker adaptation HTK HMM MLLR LVCSR
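The absolute and relative reductions implied by the WER figures in this abstract can be checked with a few lines of arithmetic. This is only a sketch; the function name is illustrative and not from the thesis.

```python
def wer_reduction(baseline_wer, adapted_wer):
    """Return (absolute, relative) WER reduction, both in percent units."""
    absolute = baseline_wer - adapted_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Figures quoted in the abstract: 24% WER unadapted, 18% WER adapted.
absolute, relative = wer_reduction(24.0, 18.0)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
```

The 6-point absolute reduction corresponds to a 25% relative reduction against the 24% baseline.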
5	Comparison of the distribution of combined immunological and virological responses in adult HIV positive patients across Antiretroviral Therapy (ART) providers in Tshwane : a multilevel analysis Wandai, Elia Muchiri January 2014 Background: Immunological and virological responses to ART are important outcome indicators that are mostly used to evaluate the success of an ART program. A comparative performance between ART providers based on the two outcomes can be useful in optimising resources to underperforming providers and advising quality improvement plans. Aim: To compare immunological and virological responses of ART for adult HIV positive patients between providers in Tshwane District, Gauteng Province, South Africa. Methodology: This study was an analytical observational study that retrospectively compared patient treatment outcomes on immunological and virological responses between 16 Antiretroviral Therapy (ART) providers. The analysis compared patients’ baseline status on these two outcomes with their status after 6 and 12 months on ART. Ordinary logistic regression was used to calculate Standardised Incidence Ratios (SIR), while multilevel model analysis was used to calculate specific provider random effects of poor immunological and virological responses. Results: After 6 months of treatment, the SIR of poor immunological outcome for all clinics under study, as predicted by the unadjusted logistic regression models, was 0.29 (95% CI: 0.27-0.31), but varied from a low of 0.14 (95% CI: 0.00-0.40) to a high of 0.66 (95% CI: 0.13-1.20) between the clinics. Two clinics had a Standardised Incidence Ratio (SIR) of poor immunological response that was significantly below 1 (poor immunological rate below average), while three clinics had an SIR above 1 (poor immunological rate above average) under the unadjusted logistic models.
After adjusting for the effects of gender, age, drug combination, religion and present virological status, no clinic had an SIR that was significantly below 1, but two clinics had an SIR that was significantly above 1. Under the logistic multilevel (MLLR) analysis, the unadjusted model flagged two clinics whose clinic specific effects were below zero (a lower rate of poor immunological outcome than that of the total sample) and one clinic whose clinic specific effect was above zero (a higher rate of poor immunological outcome than the total sample rate). The adjusted model showed that no clinic had residual effects that were significantly below or above zero. The confidence intervals for the MLLR models were found not to be wider than those of the logistic regression (LR) models, particularly for clinics with small sample sizes. A number of clinics changed the relative order of their SIR/random effects after case-mix adjustments under both the LR and MLLR modelling. For poor virological response, both the LRD and MLLR models indicated no clinic specific effects. The predicted poor virological response rate by the case-mix unadjusted LR model was 0.12 (95% CI: 0.11-0.13). All clinics except one had SIRs that were not significantly different from 1. After adjusting for CD4 count and age, no clinic had an SIR that was significantly different from 1. Conclusions: Case-mix or patients’ baseline characteristics explained much of the variation in the Standardised Incidence Ratios (SIR) of poor immunological outcome after 6 months of patient treatment, while provider (clinic) specific effects explained much of the variation after 12 months of treatment. After 6 months of treatment, the results also showed that there were significant differences in the SIR between the clinics before case-mix adjustments, but the differences disappeared after case-mix adjustments.
This shows that comparison of treatment outcomes between providers (clinics) can be misleading if no proper adjustments are made for confounding factors. Differences in the SIRs for poor virological outcome after 6 months of patient treatment were no longer significant between clinics after taking account of CD4 count and age. / Dissertation (MSc)--University of Pretoria, 2014. / gm2014 / School of Health Systems and Public Health / unrestricted Immunological response Virological response Case-mix adjustment SIR Clinic specific effects LR LRD MLLR UCTD
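The abstract does not spell out how the SIR was computed from the logistic regression. A minimal sketch, assuming the usual indirect-standardisation form (observed poor outcomes in a clinic divided by the sum of model-predicted probabilities for that clinic's patients), is given below. The function name and all numbers are invented for illustration and are not data from the study.

```python
def standardised_incidence_ratio(outcomes, predicted_probs):
    """Observed count of poor outcomes divided by the model-expected count.

    outcomes: 0/1 indicators of a poor response for one clinic's patients.
    predicted_probs: per-patient probabilities from a reference model
    (assumed here to be a logistic regression fitted on all clinics).
    """
    observed = sum(outcomes)
    expected = sum(predicted_probs)
    return observed / expected


# Invented example: four patients, two poor outcomes, one expected.
clinic_outcomes = [1, 0, 1, 0]
clinic_predicted = [0.25, 0.5, 0.125, 0.125]
sir = standardised_incidence_ratio(clinic_outcomes, clinic_predicted)
print(sir)
```

Under this form, a value above 1 means more poor outcomes than the reference model expects for that clinic's case mix, and a value below 1 means fewer.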
6	Accounting for Individual Speaker Properties in Automatic Speech Recognition Elenius, Daniel January 2010 In this work, speaker characteristic modeling has been applied in the fields of automatic speech recognition (ASR) and automatic speaker verification (ASV). In ASR, a key problem is that acoustic mismatch between training and test conditions degrades classification performance. In this work, a child exemplifies a speaker not represented in training data, and methods to reduce the spectral mismatch are devised and evaluated. To reduce the acoustic mismatch, predictive modeling based on spectral speech transformation is applied. Following this approach, a model suitable for a target speaker, not well represented in the training data, is estimated and synthesized by applying vocal tract predictive modeling (VTPM). In this thesis, the traditional static modeling on the utterance level is extended to dynamic modeling. This is accomplished by operating also on sub-utterance units, such as phonemes, phone-realizations, sub-phone realizations and sound frames. Initial experiments show that adaptation of an acoustic model trained on adult speech significantly reduced the word error rate of ASR for children, but not to the level of a model trained on children’s speech. Multi-speaker-group training provided an acoustic model that performed recognition for both adults and children within the same model at almost the same accuracy as speaker-group dedicated models, with no added model complexity. In the analysis of the cause of errors, body height of the child was shown to be correlated with word error rate. A further result is that the computationally demanding iterative recognition process in standard VTLN can be replaced by synthetically extending the vocal tract length distribution in the training data. A multi-warp model is trained on the extended data and recognition is performed in a single pass.
The accuracy is similar to that of the standard technique. A concluding experiment in ASR shows that the word error rate can be reduced by extending a static vocal tract length compensation parameter into a temporal parameter track. A key component to reach this improvement was provided by a novel joint two-level optimization process. In the process, the track was determined as a composition of a static and a dynamic component, which were simultaneously optimized on the utterance and sub-utterance level respectively. This had the principal advantage of limiting the modulation amplitude of the track to what is realistic for an individual speaker. The recognition error rate was reduced by 10% relative compared with that of a standard utterance-specific estimation technique. The techniques devised and evaluated can also be applied to other speaker characteristic properties that exhibit a dynamic nature. An excursion into ASV led to the proposal of a statistical speaker population model. The model represents an alternative approach for determining the reject/accept threshold in an ASV system instead of the commonly used direct estimation on a set of client and impostor utterances. This is especially valuable in applications where a low false reject or false accept rate is required. In these cases, the number of errors is often too small to estimate a reliable threshold using the direct method. The results are encouraging but need to be verified on a larger database. / QC 20110502 / Pf-Star / KOBRA MAP MLLR VTLN speaker characteristics dynamic modeling child
7	Automated phoneme mapping for cross-language speech recognition Sooful, Jayren Jugpal 11 January 2005 This dissertation explores a unique automated approach to map one phoneme set to another, based on the acoustic distances between the individual phonemes. Although the focus of this investigation is on cross-language applications, this automated approach can be extended to same-language but different-database applications as well. The main goal of this investigation is to be able to use the data of a source language, to train the initial acoustic models of a target language for which very little speech data may be available. To do this, an automatic technique for mapping the phonemes of the two data sets must be found. Using this technique, it would be possible to accelerate the development of a speech recognition system for a new language. The current research in the cross-language speech recognition field has focused on manual methods to map phonemes. This investigation has considered an English-to-Afrikaans phoneme mapping, as well as an Afrikaans-to-English phoneme mapping. This has been previously applied to these language instances, but utilising manual phoneme mapping methods. To determine the best phoneme mapping, different acoustic distance measures are compared. The distance measures that are considered are the Kullback-Leibler measure, the Bhattacharyya distance metric, the Mahalanobis measure, the Euclidean measure, the L2 metric and the Jeffreys-Matusita distance. The distance measures are tested by comparing the cross-database recognition results obtained on phoneme models created from the TIMIT speech corpus and a locally-compiled South African SUN Speech database. By selecting the most appropriate distance measure, an automated procedure to map phonemes from the source language to the target language can be done.
The best distance measure for the mapping gives recognition rates comparable to a manual mapping process undertaken by a phonetic expert. This study also investigates the effect of the number of Gaussian mixture components on the mapping and on the speech recognition system’s performance. The results indicate that the recogniser’s performance increases up to a limit as the number of mixtures increases. In addition, this study has explored the effect of excluding the Mel Frequency delta and acceleration cepstral coefficients. It is found that the inclusion of these temporal features helps improve the mapping and the recognition system’s phoneme recognition rate. Experiments are also carried out to determine the impact of the number of HMM recogniser states. It is found that single-state HMMs deliver the optimum cross-language phoneme recognition results. After having done the mapping, speaker adaptation strategies are applied on the recognisers to improve their target-language performance. The models of a fully trained speech recogniser in a source language are adapted to target-language models using Maximum Likelihood Linear Regression (MLLR) followed by Maximum A Posteriori (MAP) techniques. Embedded Baum-Welch re-estimation is used to further adapt the models to the target language. These techniques result in a considerable improvement in the phoneme recognition rate. Although a combination of MLLR and MAP techniques has been used previously in speech adaptation studies, the combination of MLLR, MAP and EBWR in cross-language speech recognition is a unique contribution of this study. Finally, a data pooling technique is applied to build a new recogniser using the automatically mapped phonemes from the target language as well as the source language phonemes. This new recogniser demonstrates moderate bilingual phoneme recognition capabilities.
The bilingual recogniser is then further adapted to the target language using MAP and embedded Baum-Welch re-estimation techniques. This combination of adaptation techniques together with the data pooling strategy is uniquely applied in the field of cross-language recognition. The results obtained using this technique outperform all other techniques tested in terms of phoneme recognition rates, although it requires a considerably more time-consuming training process. It displays only slightly poorer phoneme recognition than the recognisers trained and tested on the same language database. / Dissertation (MEng (Computer Engineering))--University of Pretoria, 2006. / Electrical, Electronic and Computer Engineering / unrestricted Map Mllr Data pooling Embedded baum-welch re-estimation Transformation-based adaptation Cross-language Phoneme mapping Acoustic distance measures UCTD
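The distance-based mapping idea in this abstract can be sketched as a nearest-neighbour search over phoneme models. The example below is a simplification and not the dissertation's procedure: it assumes each phoneme is summarised by a single diagonal-covariance Gaussian (the abstract refers to Gaussian mixtures) and uses the Bhattacharyya distance, one of the measures the abstract lists. The phoneme labels, function names and numbers are all made up.

```python
import math


def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = (v1 + v2) / 2.0  # averaged variance for this dimension
        dist += 0.125 * (m1 - m2) ** 2 / v
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return dist


def map_phonemes(target_models, source_models):
    """Map each target phoneme to the acoustically closest source phoneme.

    Both arguments are dicts of name -> (mean vector, variance vector).
    """
    mapping = {}
    for t_name, (t_mean, t_var) in target_models.items():
        mapping[t_name] = min(
            source_models,
            key=lambda s_name: bhattacharyya_diag(t_mean, t_var, *source_models[s_name]),
        )
    return mapping


# Invented two-dimensional models; real systems would use cepstral features.
source = {"a": ([0.0, 0.0], [1.0, 1.0]), "i": ([3.0, 3.0], [1.0, 1.0])}
target = {"A": ([0.2, -0.1], [1.0, 1.0]), "I": ([2.8, 3.1], [1.0, 1.0])}
print(map_phonemes(target, source))
```

Swapping in another measure from the abstract's list only requires replacing the distance function; the search itself stays the same.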
