  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
251

Improved MFCC Front End Using Spectral Maxima For Noisy Speech Recognition

Sujatha, J 11 1900 (has links) (PDF)
No description available.
252

A novel lip geometry approach for audio-visual speech recognition

Ibrahim, Zamri January 2014 (has links)
By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world in recent years to incorporate lip movements into speech recognition; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend the knowledge of relationships between visual and speech information, specifically using lip geometry information, due to its robustness to head rotation and the smaller number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate visual and speech modalities. This thesis makes several contributions. First, this work presents a new method to extract lip geometry features using a combination of a skin colour filter, a border following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared to existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these features were evaluated to determine which performs best when representing speech in the visual domain. Second, a novel template matching technique able to adapt to dynamic differences in the way words are uttered by speakers has been developed, which determines the best fit of an unseen feature signal to those stored in a database template. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual or the speech modality is chosen by measuring the quality of the audio through kurtosis and skewness analysis, driven by white noise confusion. Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V data corpora under a range of different signal-to-noise ratio conditions using the NOISEX-92 dataset.
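The pipeline described above (skin colour filter, border following, convex hull, then geometric measurements) can be illustrated with a minimal OpenCV sketch. The colour thresholds, function names and the assumption that the mouth region is the largest remaining contour are illustrative choices for this sketch, not values taken from the thesis.

```python
import cv2
import numpy as np

def lip_geometry_features(bgr_frame):
    """Illustrative lip-geometry feature extraction (height, width, ratio,
    area, perimeter). Thresholds and the largest-contour heuristic are
    assumptions for this sketch, not parameters from the thesis."""
    # Rough skin/lip colour filter in YCrCb space (illustrative thresholds).
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 140, 85], np.uint8)
    upper = np.array([255, 180, 135], np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)

    # Border following: OpenCV's findContours (OpenCV 4.x return signature).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    lip = max(contours, key=cv2.contourArea)   # assume the largest blob is the mouth
    hull = cv2.convexHull(lip)                 # convex hull of the lip contour

    x, y, w, h = cv2.boundingRect(hull)
    return {
        "width": w,
        "height": h,
        "ratio": h / w if w else 0.0,
        "area": cv2.contourArea(hull),
        "perimeter": cv2.arcLength(hull, True),
    }
```

The returned dictionary corresponds to the height, width, ratio, area and perimeter features evaluated in the abstract; combinations of these can then be stacked per frame into a visual feature signal.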
253

A study on acoustic modeling and adaptation in HMM-based speech recognition

Ma, Bin, 馬斌 January 2000 (has links)
Published or final version / Computer Science and Information Systems / Doctoral / Doctor of Philosophy
254

Combining games and speech recognition in a multilingual educational environment / M. Booth

Booth, Martin January 2014 (has links)
Playing has been part of people's lives since the beginning of time. However, play does not take place in silence (isolated from speech and sound). The games people play allow them to interact and to learn through experiences. Speech often forms an integral part of playing games. Video games also allow players to interact with a virtual world and learn through those experiences. Speech input has previously been explored as a way of interacting with a game, as talking is a natural way of communicating. By talking to a game, the experiences created during gameplay become more valuable, which in turn facilitates effective learning. In order to enable a game to "hear", some issues need to be considered. A game that will serve as a platform for speech input has to be developed. If the game is to contain learning elements, expert knowledge regarding the learning content needs to be obtained. The game needs to communicate with a speech recognition system, which will recognise players' speech inputs. To understand the role of speech recognition in a game, players need to be tested while playing the game. The players' experiences and opinions can then be fed back into the development of speech recognition in educational games. This process was followed with six Financial Management students on the NWU Vaal Triangle campus. The students played FinMan, a game which teaches the fundamental concepts of the "Time value of money" principle. They played the game with the keyboard and mouse, as well as via speech commands. The students shared their experiences through a focus group discussion and by completing a questionnaire. Quantitative data was collected to back the students' experiences. The results show that, although the recognition accuracies and response times are important issues, speech recognition can play an essential part in educational games. By freeing learners to focus on the game content, speech recognition can make games more accessible and engaging, and consequently lead to more effective learning experiences. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
255

Cross-lingual automatic speech recognition using tandem features

Lal, Partha January 2011 (has links)
Automatic speech recognition requires many hours of transcribed speech recordings in order for an acoustic model to be effectively trained. However, recording speech corpora is time-consuming and expensive, so such quantities of data exist only for a handful of languages — there are many languages for which little or no data exist. Given that there are acoustic similarities between different languages, it may be fruitful to use data from a well-supported source language for the task of training a recogniser in a target language with little training data. Since most languages do not share a common phonetic inventory, we propose an indirect way of transferring information from a source language model to a target language model. Tandem features, in which class-posteriors from a separate classifier are decorrelated and appended to conventional acoustic features, are used to do that. They have the advantage that the language used to train the classifier, typically a Multilayer Perceptron (MLP), need not be the same as the target language being recognised. Consistent with prior work, positive results are achieved for monolingual systems in a number of different languages. Furthermore, improvements are also shown for the cross-lingual case, in which the tandem features were generated using a classifier not trained for the target language. We examine factors which may predict the relative improvements brought about by tandem features for a given source and target pair. We examine some cross-corpus normalization issues that naturally arise in multilingual speech recognition and validate our solution in terms of recognition accuracy and a mutual information measure. The tandem classifier in the work up to this point in the thesis has been a phoneme classifier. Articulatory features (AFs), represented here as a multi-stream, discrete, multivalued labelling of speech, can be used as an alternative task. The motivation for this is that since AFs are a set of physically grounded categories that are not language-specific, they may be more suitable for cross-lingual transfer. Then, using either phoneme or AF classification as our MLP task, we look at training the MLP using data from more than one language — again we hypothesise that AF tandem will result in greater improvements in accuracy. We also examine performance where only limited amounts of target language data are available, and see how our various tandem systems perform under those conditions.
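As described above, tandem features append decorrelated classifier posteriors to conventional acoustic features. A minimal numpy sketch of that combination step is shown below; the log-then-PCA decorrelation recipe, the feature dimensions and the helper name are stand-ins for illustration, since the abstract does not fix them.

```python
import numpy as np

def tandem_features(acoustic_feats, mlp_posteriors, n_components=25):
    """Append decorrelated log-posteriors to conventional acoustic features.

    acoustic_feats : (frames, d_acoustic), e.g. MFCCs or PLPs
    mlp_posteriors : (frames, n_phones), per-frame class posteriors from an MLP
    The log + PCA decorrelation below is one common recipe; the exact
    transform used in the thesis may differ.
    """
    log_post = np.log(mlp_posteriors + 1e-10)            # compress dynamic range
    centred = log_post - log_post.mean(axis=0)
    # PCA decorrelation via SVD of the centred log-posteriors.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    decorrelated = centred @ vt.T[:, :n_components]
    return np.hstack([acoustic_feats, decorrelated])      # tandem feature vector

# Example: 100 frames, 13-dim MFCCs, 40 phone posteriors -> 38-dim tandem features
mfcc = np.random.randn(100, 13)
post = np.random.dirichlet(np.ones(40), size=100)
print(tandem_features(mfcc, post).shape)   # (100, 38)
```

The cross-lingual advantage is that `mlp_posteriors` can come from a classifier trained on a well-resourced source language while `acoustic_feats` and the downstream acoustic model belong to the target language.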
256

Automatic Speech Recognition for ageing voices

Vipperla, Ravichander January 2011 (has links)
With ageing, human voices undergo several changes which are typically characterised by increased hoarseness, breathiness, changes in articulatory patterns and slower speaking rate. The focus of this thesis is to understand the impact of ageing on Automatic Speech Recognition (ASR) performance and improve the ASR accuracies for older voices. Baseline results on three corpora indicate that the word error rates (WER) for older adults are significantly higher than those of younger adults, and the decrease in accuracies is higher for male speakers than for female speakers. Acoustic parameters such as jitter and shimmer that measure glottal source disfluencies were found to be significantly higher for older adults. However, the hypothesis that these changes explain the differences in WER for the two age groups is proven incorrect. Experiments with artificial introduction of glottal source disfluencies in speech from younger adults do not display a significant impact on WERs. Changes in fundamental frequency, observed quite often in older voices, have a marginal impact on ASR accuracies. Analysis of phoneme errors between younger and older speakers shows a pattern of certain phonemes, especially lower vowels, being more affected by ageing. These changes, however, are seen to vary across speakers. Another factor that is strongly associated with ageing voices is a decrease in the rate of speech. Experiments to analyse the impact of slower speaking rate on ASR accuracies indicate that insertion errors increase while decoding slower speech with models trained on relatively faster speech. We then propose a way to characterise speakers in acoustic space based on speaker adaptation transforms and observe that speakers (especially males) can be segregated with reasonable accuracy based on age. Inspired by this, we look at supervised hierarchical acoustic models based on gender and age. Significant improvements in word accuracies are achieved over the baseline results with such models. The idea is then extended to construct unsupervised hierarchical models which also outperform the baseline models by a good margin. Finally, we hypothesize that ASR accuracies can be improved by augmenting the adaptation data with speech from acoustically closest speakers. A strategy to select the augmentation speakers is proposed. Experimental results on two corpora indicate that the hypothesis holds true only when the amount of available adaptation data is limited to a few seconds. The efficacy of such a speaker selection strategy is analysed for both younger and older adults.
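Jitter and shimmer, the glottal source measures mentioned above, are commonly defined as the mean relative cycle-to-cycle variation of pitch period and peak amplitude respectively. The sketch below uses those common "local" definitions; the thesis may use one of the other standard jitter/shimmer variants, and the example values are made up.

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive pitch periods,
    relative to the mean period (the common 'local jitter' definition)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The same measure applied to peak amplitudes of successive cycles."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Example: slightly irregular ~10 ms pitch periods, as might occur in an older voice
periods = [0.0100, 0.0103, 0.0097, 0.0105, 0.0099]
amps = [0.82, 0.78, 0.85, 0.74, 0.80]
print(f"jitter={local_jitter(periods):.3f}, shimmer={local_shimmer(amps):.3f}")
```

Higher values of both measures indicate greater cycle-to-cycle irregularity of the glottal source, which is the property the thesis tests (and rejects) as the explanation for the WER gap between age groups.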
257

Using commercial-off-the-shelf speech recognition software for conning U.S. warships

Tamez, Dorothy J. 06 1900 (has links)
Approved for public release; distribution is unlimited / The U.S. Navy's Transformation Roadmap is leading the fleet in a smaller, faster, and more technologically advanced direction. Smaller platforms and reduced manpower resources create opportunities to fill important positions, including ship-handling control, with technology. This thesis investigates the feasibility of using commercial-off-the-shelf (COTS) speech recognition software (SRS) for conning a Navy ship. Dragon NaturallySpeaking Version 6.0 software and a SHURE wireless microphone were selected for this study. An experiment, with a limited number of subjects, was conducted at the Marine Safety International ship-handling simulation facility in San Diego, California. It measured the software error rate during conning operations. Data analysis sought to determine the types and significant causes of error. The analysis includes factors such as iteration number, subject, scenario, setting and ambient noise. Their significance provides key insights for future experimentation. The COTS technology selected for this study proved promising in overcoming irregularities particular to conning, but the software's vocabulary and grammar were problematic. The use of SRS for conning ships merits additional research, using a limited lexicon and a modified grammar that supports conning commands. Cooperative research between the Navy and industry could produce the "Helmsman" of the future. / Lieutenant, United States Navy
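The conclusion that a limited lexicon and a conning-specific grammar would help can be illustrated with a toy command matcher. The patterns below are illustrative examples of standard conning-style orders, not the lexicon or grammar evaluated in the thesis, and the function name is made up for this sketch.

```python
import re

# Illustrative, deliberately small grammar of conning-style orders.
# These patterns are examples only, not the vocabulary used in the study.
CONNING_GRAMMAR = [
    re.compile(r"(left|right) (standard|full|hard) rudder"),
    re.compile(r"all engines (ahead|back) (one third|two thirds|standard|full)"),
    re.compile(r"steady on course \d{3}"),
    re.compile(r"rudder amidships"),
]

def parse_conning_order(utterance: str):
    """Accept an utterance only if it matches the restricted grammar;
    constraining the recogniser this way shrinks the space of confusable outputs."""
    text = utterance.strip().lower()
    for pattern in CONNING_GRAMMAR:
        if pattern.fullmatch(text):
            return text
    return None

print(parse_conning_order("Right standard rudder"))   # accepted
print(parse_conning_order("Write standard ladder"))   # rejected -> None
```

Restricting recognition to such a grammar is the kind of modification the thesis recommends for follow-on research, rather than relying on a general-purpose dictation vocabulary.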
258

Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos

Zylich, Brian Matthew 25 April 2019 (has links)
We explore how to automatically detect specific phrases in audio from noisy, multi-speaker videos using deep neural networks. Specifically, we focus on classroom observation videos that contain a few adult teachers and several small children (< 5 years old). At any point in these videos, multiple people may be talking, shouting, crying, or singing simultaneously. Our goal is to recognize polite speech phrases such as "Good job", "Thank you", "Please", and "You're welcome", as the occurrence of such speech is one of the behavioral markers used in classroom observation coding via the Classroom Assessment Scoring System (CLASS) protocol. Commercial speech recognition services such as Google Cloud Speech are impractical because of data privacy concerns. Therefore, we train and test our own custom models using a combination of publicly available classroom videos from YouTube, as well as a private dataset of real classroom observation videos collected by our colleagues at the University of Virginia. We also crowdsource an additional 1152 recordings of polite speech phrases to augment our training dataset. Our contributions are the following: (1) we design a crowdsourcing task for efficiently labeling speech events in classroom videos, (2) we develop a neural network-based architecture for speech recognition, robust to noise and overlapping speech, and (3) we explore methods to synthesize new and authentic audio data, both to increase the training set size and reduce the class imbalance. Finally, using our trained polite speech detector, (4) we investigate the relationship between polite speech and CLASS scores and enable teachers to visualize their use of polite language.
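One of the stated contributions is synthesizing new audio to enlarge the training set and reduce class imbalance. A common way to do that is to mix clean phrase recordings with background noise at a chosen signal-to-noise ratio, as in the sketch below; this mixing-at-SNR recipe is an assumption for illustration, and the thesis may combine it with other synthesis methods.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a clean phrase recording with background noise at a target SNR (dB).
    Both inputs are 1-D float arrays sampled at the same rate."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: augment one recording at several SNRs to grow the positive class.
rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)   # stand-in for a "Thank you" clip
babble = rng.standard_normal(8000)       # stand-in for classroom background noise
augmented = [mix_at_snr(utterance, babble, snr) for snr in (20, 10, 5, 0)]
```

Generating several noisy copies of each crowdsourced phrase in this way both increases the number of positive training examples and exposes the detector to the overlapping-speech conditions found in real classroom recordings.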
259

Learning representations for speech recognition using artificial neural networks

Swietojanski, Paweł January 2016 (has links)
Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANN). The first contribution concerns acoustic model adaptation. This comprises two new adaptation transforms operating in ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed-basis functions, from which one can later derive variants tailored to the specific distribution present in adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), depends on learning distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. We found the proposed adaptation techniques have many desirable properties: they are relatively low-dimensional, do not overfit and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, 5-25% relative word error rate (WERR) reductions are obtained in an unsupervised two-pass adaptation setting. The second contribution concerns building acoustic models in low-resource data scenarios. In particular, we are concerned with insufficient amounts of transcribed acoustic material for estimating acoustic models in the target language, while assuming that resources such as lexicons or texts for estimating language models are available. First, we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERR reductions of up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC gives a further 13.6% WERR, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation, and find that pre-training is largely language-independent. Up to 14.4% WERR reductions are observed, depending on the amount of available transcribed acoustic data in the target language. The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach is able to match the performance of comparable models trained on beamform-enhanced signals.
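LHUC, as summarised above, keeps the trained network fixed and learns one per-hidden-unit amplitude per speaker that rescales that unit's output. A minimal numpy sketch of the forward pass follows; the 2*sigmoid re-parameterisation of the amplitudes is the form reported in the published LHUC papers, and the layer sizes in the example are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_layer(x, W, b, r_speaker):
    """One hidden layer with LHUC adaptation.

    W, b are the speaker-independent (frozen) weights; r_speaker is a small
    vector of per-hidden-unit parameters learned from the adaptation data.
    Amplitudes 2*sigmoid(r) lie in (0, 2), so adaptation can only rescale,
    not arbitrarily rewrite, the canonical hidden units.
    """
    h = sigmoid(x @ W + b)                 # canonical hidden activations
    amplitude = 2.0 * sigmoid(r_speaker)   # distribution-dependent scaling
    return amplitude * h

# Example shapes: 40-dim input features, 512 hidden units (arbitrary sizes).
rng = np.random.default_rng(1)
x = rng.standard_normal((8, 40))           # a batch of 8 frames
W = rng.standard_normal((40, 512)) * 0.1
b = np.zeros(512)
r = np.zeros(512)                          # r = 0 -> amplitude 1 -> unadapted model
print(lhuc_layer(x, W, b, r).shape)        # (8, 512)
```

Because only the small vector r_speaker is estimated per speaker or environment, the transform is low-dimensional and resists overfitting, which is consistent with the properties claimed in the abstract.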
260

Chromosome classification and speech recognition using inferred Markov networks with empirical landmarks.

January 1993 (has links)
by Law Hon Man. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1993. / Includes bibliographical references (leaves 67-70). / Contents:
Chapter 1 -- Introduction (p.1)
Chapter 2 -- Automated Chromosome Classification (p.4)
  2.1 Procedures in Chromosome Classification (p.6)
  2.2 Sample Preparation (p.7)
  2.3 Low Level Processing and Measurement (p.9)
  2.4 Feature Extraction (p.11)
  2.5 Classification (p.15)
Chapter 3 -- Inference of Markov Networks by Dynamic Programming (p.17)
  3.1 Markov Networks (p.18)
  3.2 String-to-String Correction (p.19)
  3.3 String-to-Network Alignment (p.21)
  3.4 Forced Landmarks in String-to-Network Alignment (p.31)
Chapter 4 -- Landmark Finding in Markov Networks (p.34)
  4.1 Landmark Finding without a priori Knowledge (p.34)
  4.2 Chromosome Profile Processing (p.37)
  4.3 Analysis of Chromosome Networks (p.39)
  4.4 Classification Results (p.45)
Chapter 5 -- Speech Recognition using Inferred Markov Networks (p.48)
  5.1 Linear Predictive Analysis (p.48)
  5.2 TIMIT Speech Database (p.50)
  5.3 Feature Extraction (p.51)
  5.4 Empirical Landmarks in Speech Networks (p.52)
  5.5 Classification Results (p.55)
Chapter 6 -- Conclusion (p.57)
  6.1 Suggested Improvements (p.57)
  6.2 Concluding remarks (p.61)
Appendix A (p.63)
Reference (p.67)
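Chapter 3's string-to-string correction step is the classic edit-distance dynamic programme; a minimal sketch under standard unit costs is shown below. The operation costs are illustrative, and extending the same recurrence to string-to-network alignment, as the thesis does, means aligning against network states rather than a second symbol string.

```python
def edit_distance(source, target, sub_cost=1, ins_cost=1, del_cost=1):
    """String-to-string correction by dynamic programming (Wagner-Fischer).
    Unit costs are illustrative; the thesis may weight operations differently."""
    m, n = len(source), len(target)
    # d[i][j] = cost of transforming source[:i] into target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j - 1] + match,   # substitution / match
                          d[i - 1][j] + del_cost,    # deletion
                          d[i][j - 1] + ins_cost)    # insertion
    return d[m][n]

# Example on two phone-like label strings: similar strings get a small edit cost.
print(edit_distance("aehlow", "ahlouw"))
```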
