161 |
Evaluating the Effects of Automatic Speech Recognition Word Accuracy. Doe, Hope L. 10 August 1998 (has links)
Automatic Speech Recognition (ASR) research has focused primarily on large-scale systems and industry, while other areas that require attention are often overlooked. For this reason, this research examined automatic speech recognition at the consumer level. Individual consumers purchase and use speech recognition software for different purposes than the military or commercial industries such as telecommunications: consumers who purchase the software for personal use will mainly use ASR to dictate correspondence and documents. Two ASR dictation software packages were used to conduct the study. The research examined the relationships between (1) speech recognition software training and word accuracy, (2) error-correction time by the user and word accuracy, and (3) correspondence type and word accuracy. The correspondences evaluated resembled Personal, Business, and Technical Correspondences. Word accuracy was assessed after initial system training, five minutes of error-correction time, and ten minutes of error-correction time.
Results indicated that the word recognition accuracy achieved does affect user satisfaction. It was also found that word accuracy improved with increased error-correction time. Additionally, Personal Correspondence achieved the highest mean word accuracy rate for both systems, and Dragon Systems achieved the highest mean word accuracy for the correspondence types explored in this research. Results were discussed in terms of subjective and objective measures and the advantages and disadvantages of speech input, and design recommendations were provided. / Master of Science
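As context for the accuracy figures this entry discusses, word accuracy is conventionally defined as one minus the word error rate obtained from a Levenshtein alignment of the reference and recognised transcripts. The sketch below illustrates that standard computation; it is not necessarily the exact scoring procedure used in the thesis.

```python
# A minimal sketch, using the conventional definition (accuracy = 1 - WER):
# word accuracy from a Levenshtein alignment of reference and hypothesis.
def word_accuracy(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

print(word_accuracy("please save the document", "please saved a document"))  # 0.5
```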
|
162 |
Speech coding and transmission for improved automatic recognition in communication networks. Zhong, Xin. 01 December 2003 (has links)
Acknowledgements section and vita redacted from this digital version.
|
163 |
Using observation uncertainty for robust speech recognition. Arrowood, Jon A. 01 December 2003 (has links)
No description available.
|
164 |
A study on acoustic modeling and adaptation in HMM-based speech recognition. Ma, Bin, 馬斌. January 2000 (has links)
published_or_final_version / Computer Science and Information Systems / Doctoral / Doctor of Philosophy
|
165 |
Cross-lingual automatic speech recognition using tandem features. Lal, Partha. January 2011 (has links)
Automatic speech recognition requires many hours of transcribed speech recordings in order for an acoustic model to be effectively trained. However, recording speech corpora is time-consuming and expensive, so such quantities of data exist only for a handful of languages; there are many languages for which little or no data exist. Given the acoustic similarities between different languages, it may be fruitful to use data from a well-supported source language for the task of training a recogniser in a target language with little training data. Since most languages do not share a common phonetic inventory, we propose an indirect way of transferring information from a source language model to a target language model. Tandem features, in which class posteriors from a separate classifier are decorrelated and appended to conventional acoustic features, are used to do that. They have the advantage that the language used to train the classifier, typically a Multilayer Perceptron (MLP), need not be the same as the target language being recognised. Consistent with prior work, positive results are achieved for monolingual systems in a number of different languages. Furthermore, improvements are also shown for the cross-lingual case, in which the tandem features were generated using a classifier not trained for the target language. We examine factors which may predict the relative improvements brought about by tandem features for a given source and target pair. We also examine some cross-corpus normalisation issues that naturally arise in multilingual speech recognition and validate our solution in terms of recognition accuracy and a mutual information measure. The tandem classifier in the work up to this point in the thesis has been a phoneme classifier. Articulatory features (AFs), represented here as a multi-stream, discrete, multi-valued labelling of speech, can be used as an alternative task; the motivation is that since AFs are a set of physically grounded categories that are not language-specific, they may be more suitable for cross-lingual transfer. Then, using either phoneme or AF classification as the MLP task, we look at training the MLP using data from more than one language; again we hypothesise that AF tandem will result in greater improvements in accuracy. We also examine performance where only limited amounts of target language data are available, and see how our various tandem systems perform under those conditions.
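As a rough illustration of the tandem recipe this abstract describes (class posteriors, decorrelated and appended to conventional acoustic features), here is a minimal sketch. The MLP is assumed to be already trained; the array names, dimensionalities, and the PCA-based decorrelation step are illustrative assumptions, not the thesis's exact configuration.

```python
# A minimal sketch of tandem feature generation, assuming a pre-trained MLP has
# already produced phone posteriors `post` (frames x classes) for an utterance
# whose conventional features are `mfcc` (frames x coefficients).
import numpy as np

def tandem_features(mfcc: np.ndarray, post: np.ndarray, n_keep: int = 25) -> np.ndarray:
    logp = np.log(post + 1e-10)        # log-posteriors are closer to Gaussian
    logp -= logp.mean(axis=0)          # centre before decorrelation
    _, _, vt = np.linalg.svd(logp, full_matrices=False)  # PCA via SVD
    decorrelated = logp @ vt.T[:, :n_keep]
    return np.hstack([mfcc, decorrelated])  # append to conventional features

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(500, 39))            # 500 frames of MFCCs with deltas
post = rng.dirichlet(np.ones(40), size=500)  # dummy 40-phone posteriors
print(tandem_features(mfcc, post).shape)     # (500, 64)
```

Nothing in the decorrelation step depends on the language of the recognised speech, which is what makes the cross-lingual transfer described above possible.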
|
166 |
Automatic Speech Recognition for ageing voices. Vipperla, Ravichander. January 2011 (has links)
With ageing, human voices undergo several changes, typically characterised by increased hoarseness, breathiness, changes in articulatory patterns, and slower speaking rate. The focus of this thesis is to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and to improve ASR accuracy for older voices. Baseline results on three corpora indicate that word error rates (WER) for older adults are significantly higher than those of younger adults, and that the decrease in accuracy is greater for male speakers than for females. Acoustic parameters such as jitter and shimmer, which measure glottal source disfluencies, were found to be significantly higher for older adults. However, the hypothesis that these changes explain the differences in WER between the two age groups proved incorrect: experiments with artificial introduction of glottal source disfluencies into speech from younger adults showed no significant impact on WERs. Changes in fundamental frequency, observed quite often in older voices, have a marginal impact on ASR accuracy. Analysis of phoneme errors between younger and older speakers shows that certain phonemes, especially lower vowels, are more affected by ageing; these changes, however, vary across speakers. Another factor strongly associated with ageing voices is a decrease in the rate of speech. Experiments analysing the impact of slower speaking rate on ASR accuracy indicate that insertion errors increase when decoding slower speech with models trained on relatively faster speech. We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms and observe that speakers (especially males) can be segregated by age with reasonable accuracy. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age. Significant improvements in word accuracy are achieved over the baseline results with such models. The idea is then extended to construct unsupervised hierarchical models, which also outperform the baseline models by a good margin. Finally, we hypothesize that ASR accuracy can be improved by augmenting the adaptation data with speech from acoustically closest speakers, and a strategy to select the augmentation speakers is proposed. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation data is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults.
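For reference, jitter and shimmer, the glottal source measures this abstract reports as elevated in older voices, are conventionally defined as cycle-to-cycle variation in pitch period and peak amplitude respectively. The sketch below uses those standard "local" definitions, not the thesis's exact extraction settings.

```python
# A minimal sketch using the standard "local" definitions of jitter and
# shimmer from per-cycle pitch periods (seconds) and peak amplitudes.
import numpy as np

def local_jitter(periods: np.ndarray) -> float:
    # mean absolute difference of consecutive pitch periods / mean period
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes: np.ndarray) -> float:
    # the same cycle-to-cycle measure applied to peak amplitudes
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

periods = np.array([0.0102, 0.0099, 0.0103, 0.0101, 0.0098])  # ~100 Hz voice
amps = np.array([0.81, 0.78, 0.83, 0.80, 0.79])
print(f"jitter={local_jitter(periods):.2%}, shimmer={local_shimmer(amps):.2%}")
```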
|
167 |
Automatic Speech Recognition Using Finite Inductive Sequences. Cherri, Mona Youssef, 1956- 08 1900 (has links)
This dissertation addresses the general problem of recognition of acoustic signals, which may be derived from speech, sonar, or other acoustic phenomena. The specific problem of recognizing speech is the main focus of this research. The intention is to design a recognition system for a definite number of discrete words; for this purpose, eight isolated words from the TIMIT database are selected. Four medium-length words are used: "greasy," "dark," "wash," and "water." In addition, four short words are considered: "she," "had," "in," and "all." The recognition system addresses the following issues: filtering or preprocessing, training, and decision-making. The preprocessing phase uses linear predictive coding of order 12. Following the filtering process, a vector quantization method is used to further reduce the input data and generate a finite inductive sequence of symbols representative of each input signal. The sequences generated by the vector quantization process for the same word are factored, and a single ruling, or reference template, is generated and stored in a codebook. This system introduces a new modeling technique which relies heavily on the basic concept that all finite sequences are finitely inductive; this technique is used in the training stage. In order to accommodate the variabilities in speech, a large number of training speakers from eight different dialect regions is used; hence, a speaker-independent recognition system is realized. The matching process compares the incoming speech with each of the stored templates, and a closeness ratio is computed. A ratio table is generated, and the matching word that corresponds to the smallest ratio (i.e., indicating that the ruling has removed most of the symbols) is selected. Promising results were obtained for isolated words, with recognition rates ranging between 50% and 100%.
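The front end described here (order-12 LPC analysis followed by vector quantization into a symbol sequence) can be sketched as follows; the codebook contents and sizes are illustrative, and the factoring of the finitely inductive sequence into a ruling is not reproduced.

```python
# A minimal sketch of the symbolisation front end only: 12th-order LPC frame
# vectors are mapped to their nearest codebook entry, giving the finite symbol
# sequence. The codebook here is random for illustration; a real one would be
# trained on speech data.
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray) -> list[int]:
    # frames: (n_frames, 12) LPC vectors; codebook: (n_codes, 12) centroids
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1).tolist()  # one symbol per frame

rng = np.random.default_rng(1)
codebook = rng.normal(size=(64, 12))  # e.g. a 64-symbol codebook
frames = rng.normal(size=(40, 12))    # one utterance, 40 analysis frames
symbols = quantize(frames, codebook)  # sequence compared against stored rulings
```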
|
168 |
Using commercial-off-the-shelf speech recognition software for conning U.S. warships. Tamez, Dorothy J. 06 1900 (has links)
Approved for public release; distribution is unlimited / The U.S. Navy's Transformation Roadmap is leading the fleet in a smaller, faster, and more technologically advanced direction. Smaller platforms and reduced manpower resources create opportunities to fill important positions, including ship-handling control, with technology. This thesis investigates the feasibility of using commercial-off-the-shelf (COTS) speech recognition software (SRS) for conning a Navy ship. Dragon NaturallySpeaking Version 6.0 software and a SHURE wireless microphone were selected for this study. An experiment with a limited number of subjects was conducted at the Marine Safety International ship-handling simulation facility in San Diego, California. It measured the software error rate during conning operations. Data analysis sought to determine the types and significant causes of error, considering factors such as iteration number, subject, scenario, setting, and ambient noise; their significance provides key insights for future experimentation. The selected COTS technology proved promising in overcoming speech irregularities particular to conning, but the software's vocabulary and grammar were problematic. The use of SRS for conning ships merits additional research using a limited lexicon and a modified grammar that supports conning commands. Cooperative research between the Navy and industry could produce the "Helmsman" of the future. / Lieutenant, United States Navy
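To illustrate the closing recommendation of a limited lexicon and a grammar restricted to conning commands, here is a minimal sketch; the command patterns are hypothetical examples of standard conning phraseology, not the grammar actually evaluated in the thesis.

```python
# A minimal sketch of a restricted conning lexicon/grammar check: recogniser
# output is accepted only if it parses as a known conning command. The
# patterns below are illustrative assumptions.
import re

DIGIT = r"(?:zero|one|two|three|four|five|six|seven|eight|niner)"
CONNING_GRAMMAR = [
    r"(left|right) (standard|full|hard) rudder",
    r"all engines ahead (one third|two thirds|standard|full|flank)",
    r"all engines back (one third|two thirds|full)",
    rf"steady on course {DIGIT} {DIGIT} {DIGIT}",
]

def is_valid_command(text: str) -> bool:
    text = text.lower().strip()
    return any(re.fullmatch(p, text) for p in CONNING_GRAMMAR)

print(is_valid_command("Right standard rudder"))            # True
print(is_valid_command("steady on course two seven zero"))  # True
print(is_valid_command("increase speed a little"))          # False: out of grammar
```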
|
169 |
Learning representations for speech recognition using artificial neural networks. Swietojanski, Paweł. January 2016 (has links)
Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment, and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANN). The first contribution concerns acoustic model adaptation. This comprises two new adaptation transforms operating in ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed-basis functions, from which one can later derive variants tailored to the specific distribution present in adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), depends on learning distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. We found that the proposed adaptation techniques have many desirable properties: they are relatively low-dimensional, do not overfit, and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, 5-25% relative word error rate (WERR) reductions are obtained in an unsupervised two-pass adaptation setting. The second contribution concerns building acoustic models in low-resource data scenarios. In particular, we are concerned with insufficient amounts of transcribed acoustic material for estimating acoustic models in the target language, while assuming that resources like lexicons or texts to estimate language models are available. First we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERR reductions of up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC reduces WERs by an additional 13.6%, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation, and find that pre-training is largely language-independent. Up to 14.4% WERR reductions are observed, depending on the amount of the available transcribed acoustic data in the target language. The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach is able to match the performance of comparable models trained on beamform-enhanced signals.
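For the central LHUC idea (rescaling each hidden unit's output with a speaker-dependent amplitude learned from adaptation data), a minimal sketch follows. It assumes the common 2*sigmoid(r) re-parameterisation that constrains amplitudes to (0, 2); the layer sizes are illustrative.

```python
# A minimal sketch of an LHUC-adapted hidden layer, assuming the 2*sigmoid(r)
# re-parameterisation described in the LHUC literature.
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_layer(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    # h: hidden activations (batch, units)
    # r: speaker-dependent LHUC parameters (units,)
    return (2.0 * sigmoid(r)) * h  # rescale each hidden unit's contribution

units = 256
h = np.maximum(0.0, np.random.randn(8, units))  # e.g. ReLU hidden-layer output
r = np.zeros(units)          # r = 0 gives amplitude 1.0: the unadapted model
adapted = lhuc_layer(h, r)   # during adaptation only r is optimised
```

Because only one scalar per hidden unit is speaker-specific, the adapted parameter set stays low-dimensional, consistent with the abstract's observation that the transforms do not overfit and can work unsupervised.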
|
170 |
Chromosome classification and speech recognition using inferred Markov networks with empirical landmarks. January 1993 (has links)
by Law Hon Man. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1993. / Includes bibliographical references (leaves 67-70). / Contents: 1. Introduction; 2. Automated Chromosome Classification (2.1 Procedures in Chromosome Classification; 2.2 Sample Preparation; 2.3 Low Level Processing and Measurement; 2.4 Feature Extraction; 2.5 Classification); 3. Inference of Markov Networks by Dynamic Programming (3.1 Markov Networks; 3.2 String-to-String Correction; 3.3 String-to-Network Alignment; 3.4 Forced Landmarks in String-to-Network Alignment); 4. Landmark Finding in Markov Networks (4.1 Landmark Finding without a priori Knowledge; 4.2 Chromosome Profile Processing; 4.3 Analysis of Chromosome Networks; 4.4 Classification Results); 5. Speech Recognition using Inferred Markov Networks (5.1 Linear Predictive Analysis; 5.2 TIMIT Speech Database; 5.3 Feature Extraction; 5.4 Empirical Landmarks in Speech Networks; 5.5 Classification Results); 6. Conclusion (6.1 Suggested Improvements; 6.2 Concluding Remarks); Appendix A; References.
|