311

An Evaluation Framework for Adaptive User Interface

Noriega Atala, Enrique January 2014
With the rise of powerful mobile devices and the broad availability of computing power, Automatic Speech Recognition (ASR) is becoming ubiquitous. A flawless ASR system is still far from existence, so interactive applications that rely on ASR technology do not always recognize speech perfectly; when they do not, the user must be engaged to repair the transcriptions. We explore a rational user interface that uses machine learning models to make its best effort at presenting the best available repair strategy, reducing the time spent in the interaction between the user and the system as much as possible. A study is conducted to determine how different candidate policies perform, and the results are analyzed. After the analysis, the methodology is generalized into a decision-theoretic framework that can be used to evaluate the performance of other rational user interfaces that try to optimize an expected cost or utility.
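To make the decision-theoretic framing above concrete, the following is a minimal sketch of how a rational interface might pick a repair strategy by minimizing expected interaction cost. The strategy names, probabilities, and time costs are illustrative assumptions, not values from the study; in the thesis the success probabilities would come from the learned models.

```python
from dataclasses import dataclass

@dataclass
class RepairStrategy:
    name: str
    success_prob: float   # estimated by a learned model for the current utterance
    cost_success: float   # expected seconds of user effort if the repair succeeds
    cost_failure: float   # expected seconds if it fails and a fallback is needed

def expected_cost(s: RepairStrategy) -> float:
    # Expected interaction time under a simple two-outcome cost model.
    return s.success_prob * s.cost_success + (1.0 - s.success_prob) * s.cost_failure

def choose_strategy(strategies):
    # Rational policy: pick the repair strategy with minimum expected cost.
    return min(strategies, key=expected_cost)

if __name__ == "__main__":
    candidates = [
        RepairStrategy("re-speak", success_prob=0.70, cost_success=4.0, cost_failure=12.0),
        RepairStrategy("n-best list", success_prob=0.85, cost_success=6.0, cost_failure=14.0),
        RepairStrategy("spell word", success_prob=0.95, cost_success=9.0, cost_failure=15.0),
    ]
    best = choose_strategy(candidates)
    print(best.name, expected_cost(best))
```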
312

Deep Neural Network Acoustic Models for ASR

Mohamed, Abdel-rahman 01 April 2014
Automatic speech recognition (ASR) is a core technology for the information age. ASR systems have evolved from discriminating among isolated digits to recognizing telephone-quality, spontaneous speech, allowing for a growing number of practical applications in various sectors. Nevertheless, there are still serious challenges facing ASR which require major improvement in almost every stage of the speech recognition process. Until very recently, the standard approach to ASR had remained largely unchanged for many years. It used Hidden Markov Models (HMMs) to model the sequential structure of speech signals, with each HMM state using a mixture of diagonal-covariance Gaussians (GMM) to model a spectral representation of the sound wave. This thesis describes new acoustic models based on Deep Neural Networks (DNN) that have begun to replace GMMs. For ASR, the deep structure of a DNN as well as its distributed representations allow for better generalization of learned features to new situations, even when only small amounts of training data are available. In addition, DNN acoustic models scale well to large-vocabulary tasks, significantly improving upon the best previous systems. Different input feature representations are analyzed to determine which one is more suitable for DNN acoustic models. Mel-frequency cepstral coefficients (MFCC) are inferior to log Mel-frequency spectral coefficients (MFSC), which help DNN models marginalize out speaker-specific information while focusing on discriminant phonetic features. Various speaker adaptation techniques are also introduced to further improve DNN performance. Another deep acoustic model based on Convolutional Neural Networks (CNN) is also proposed. Rather than using fully connected hidden layers as in a DNN, a CNN uses a pair of convolutional and pooling layers as building blocks. The convolution operation scans the frequency axis using a learned local spectro-temporal filter, while in the pooling layer a maximum operation is applied to the learned features; by exploiting the smoothness of the input MFSC features, this pooling eliminates speaker variations expressed as shifts along the frequency axis, in a way similar to vocal tract length normalization (VTLN) techniques. We show that the proposed DNN and CNN acoustic models achieve significant improvements over GMMs on various small- and large-vocabulary tasks.
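As an illustration of the convolution-plus-pooling building block described in the abstract, the sketch below assembles a toy CNN acoustic model in PyTorch that convolves learned local filters over a window of MFSC features and max-pools along the frequency axis. The filter sizes, layer widths, context length, and number of senone targets are assumptions chosen for readability, not the thesis's actual configuration.

```python
import torch
import torch.nn as nn

class ConvPoolAcousticModel(nn.Module):
    """Toy CNN acoustic model: convolution over the frequency axis of log
    Mel-frequency spectral coefficients (MFSC), followed by max-pooling to
    absorb small spectral shifts, then fully connected layers to senone targets."""
    def __init__(self, n_mel=40, context=11, n_targets=2000):
        super().__init__()
        # Input treated as a 1 x context x n_mel "image" (time x frequency).
        self.conv = nn.Conv2d(1, 64, kernel_size=(5, 8))   # local spectro-temporal filters
        self.pool = nn.MaxPool2d(kernel_size=(1, 3))        # pool along frequency only
        conv_t = context - 5 + 1
        conv_f = (n_mel - 8 + 1) // 3
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * conv_t * conv_f, 1024), nn.ReLU(),
            nn.Linear(1024, n_targets),
        )

    def forward(self, x):            # x: (batch, 1, context, n_mel)
        h = torch.relu(self.conv(x))
        h = self.pool(h)
        return self.fc(h)            # unnormalized senone scores

if __name__ == "__main__":
    model = ConvPoolAcousticModel()
    frames = torch.randn(8, 1, 11, 40)    # batch of 8 context windows
    print(model(frames).shape)             # torch.Size([8, 2000])
```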
314

The investigation into an algorithm based on wavelet basis functions for the spatial and frequency decomposition of arbitrary signals.

Goldstein, Hilton. January 1994
The research was directed toward the viability of an O(n) algorithm that could decompose an arbitrary signal (sound, vibration, etc.) into its time-frequency space. The well-known Fourier Transform uses sine and cosine functions (having infinite support on t) as orthonormal basis functions to decompose a signal i(t) in the time domain to F(w) in the frequency domain, where the Fourier coefficients F(w) are the contributions of each frequency in the original signal. Due to the non-local support of these basis functions, a signal containing a sharp localised transient does not have localised coefficients, but rather coefficients that decay slowly. Another problem is that the coefficients F(w) do not convey any time information. The windowed Fourier Transform, or short-time Fourier Transform, does attempt to resolve the latter, but has had limited success. Wavelets are basis functions, usually mutually orthonormal, having finite support in t, and are therefore spatially local. Using non-orthogonal wavelets, the Dominant Scale Transform (DST), designed by the author, decomposes a signal into its approximate time-frequency space. The associated Dominant Scale Algorithm (DSA) has O(n) complexity and is integer-based. These two characteristics make the DSA extremely efficient. The thesis also investigates the problem of converting a music signal into its equivalent music score. The old problem of speech recognition is also examined. The results obtained from the DST are shown to be consistent with those of other authors who have utilised other methods. The resulting DST coefficients are shown to render the DST particularly useful in speech segmentation (silence regions, voiced speech regions, and frication). Moreover, the Spectrogram Dominant Scale Transform (SDST), formulated from the DST, was shown to approximate the Fourier coefficients over fixed time intervals within vowel regions of human speech. / Thesis (Ph.D.)-University of Natal, Durban, 1994.
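The Dominant Scale Transform itself is not reproduced here; as a hedged illustration of how a finite-support, integer-friendly, O(n) wavelet decomposition localizes transients that the Fourier basis smears across all coefficients, the sketch below implements the unnormalized Haar transform, which stands in for (and is not identical to) the author's DST.

```python
def haar_decompose(signal):
    """Full multi-level (unnormalized) Haar wavelet decomposition.

    Each level splits the current approximation into pairwise sums
    (coarse approximation) and pairwise differences (local detail),
    so a sharp transient stays localized in a few detail coefficients.
    Total work is n + n/2 + n/4 + ... < 2n, i.e. O(n), and the
    arithmetic is integer-only for integer inputs.
    """
    assert len(signal) and len(signal) & (len(signal) - 1) == 0, "length must be a power of two"
    approx = list(signal)
    details = []                      # detail coefficients, finest scale first
    while len(approx) > 1:
        nxt, det = [], []
        for a, b in zip(approx[0::2], approx[1::2]):
            nxt.append(a + b)         # local average (up to scaling)
            det.append(a - b)         # local difference: finite-support detail
        details.append(det)
        approx = nxt
    return approx[0], details

if __name__ == "__main__":
    coarse, details = haar_decompose([4, 4, 4, 4, 4, 20, 4, 4])  # one sharp transient
    print(coarse, details)   # the spike appears only in a few detail coefficients
```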
315

Blind source separation of the audio signals in a real world

Choi, Hyung Keun 08 1900
No description available.
316

Lecture transcription systems in resource-scarce environments / Pieter Theunis de Villiers

De Villiers, Pieter Theunis January 2014
Classroom note taking is a fundamental task performed by learners on a daily basis. These notes provide learners with valuable offline study material, especially in the case of more difficult subjects. The use of class notes has been found not only to provide students with a better learning experience, but also to lead to overall higher academic performance. In a previous study, an increase of 10.5% in student grades was observed after these students had been provided with multimedia class notes. This is not surprising, as other studies have found that the rate of successful transfer of information to humans increases when both visual and audio information are provided. Note taking might seem like an easy task; however, students with hearing impairments, visual impairments, physical impairments, learning disabilities, or even non-native listeners find this task very difficult to impossible. It has also been reported that even non-disabled students find note taking time-consuming and that it requires a great deal of mental effort while also trying to pay full attention to the lecturer. This is illustrated by a study which found that college students were only able to record ~40% of the data presented by the lecturer. It is thus reasonable to expect an automatic way of generating class notes to be beneficial to all learners. Lecture transcription (LT) systems are used in educational environments to assist learners by providing them with real-time in-class transcriptions, or recordings and transcriptions for offline use. Such systems have already been successfully implemented in the developed world, where all required resources were easily obtained. These systems are typically trained on hundreds to thousands of hours of speech, while their language models are trained on millions or even hundreds of millions of words. These amounts of data are generally not available in the developing world. In this dissertation, a number of approaches toward the development of LT systems in resource-scarce environments are investigated. We focus on different approaches to obtaining sufficient amounts of well-transcribed data for building acoustic models, using corpora with few transcriptions and of variable quality. One approach uses a dynamic programming phone string alignment procedure to harvest as much usable data as possible from approximately transcribed speech data. We find that target-language acoustic models are optimal for this purpose, but encouraging results are also found when using models from another language for alignment. Another approach entails using unsupervised training methods, where an initial low-accuracy recognizer is used to transcribe a set of untranscribed data. From this poorly transcribed data, correctly recognized portions are extracted based on a word confidence threshold. The initial system is retrained along with the newly recognized data in order to increase its overall accuracy. The initial acoustic models are trained using as little as 11 minutes of transcribed speech. After several iterations of unsupervised training, a noticeable increase in accuracy was observed (47.79% WER to 33.44% WER). Similar results were, however, found (35.97% WER) after using a large speaker-independent corpus to train the initial system. Usable LMs were also created using as few as 17,955 words from transcribed lectures; however, this resulted in large out-of-vocabulary rates. This problem was solved by means of LM interpolation. LM interpolation was found to be very beneficial in cases where subject-specific data (such as lecture slides and books) was available. We also introduce our NWU LT system, which was developed for use in learning environments and was designed using a client/server-based architecture. Based on the results found in this study, we are confident that usable models for use in LT systems can be developed in resource-scarce environments. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
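A minimal sketch of the unsupervised (self-training) loop described above: an initial recognizer transcribes untranscribed audio, only words above a confidence threshold are kept as pseudo-transcriptions, and the model is retrained on the enlarged set. The `recognize` and `train_acoustic_model` callables and the 0.9 threshold are hypothetical placeholders, since the dissertation's actual toolchain is not specified here.

```python
def select_confident_segments(hypotheses, threshold=0.9):
    """Keep only recognized words whose confidence exceeds the threshold.

    `hypotheses` is assumed to be a list of (utterance_id, [(word, confidence), ...]).
    Returns (utterance_id, pseudo_transcript) pairs usable as extra training data.
    """
    selected = []
    for utt_id, words in hypotheses:
        confident = [w for w, c in words if c >= threshold]
        if confident:
            selected.append((utt_id, " ".join(confident)))
    return selected

def unsupervised_training(seed_data, untranscribed_audio, recognize, train_acoustic_model,
                          iterations=5, threshold=0.9):
    """Iterative self-training: recognize, filter by confidence, retrain, repeat."""
    model = train_acoustic_model(seed_data)      # e.g. as little as ~11 minutes of transcribed speech
    for _ in range(iterations):
        hyps = [(utt, recognize(model, utt)) for utt in untranscribed_audio]
        pseudo_labels = select_confident_segments(hyps, threshold)
        model = train_acoustic_model(seed_data + pseudo_labels)
    return model
```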
317

Improving Grapheme-based speech recognition through P2G transliteration / W.D. Basson

Basson, Willem Diederick January 2014
Grapheme-based speech recognition systems are faster to develop, but typically do not reach the same level of performance as phoneme-based systems. Using Afrikaans speech recognition as a case study, we first analyse the reasons for the discrepancy in performance, before introducing a technique for improving the performance of standard grapheme-based systems. It is found that by handling a relatively small number of irregular words through phoneme-to-grapheme (P2G) transliteration – transforming the original orthography of irregular words to an ‘idealised’ orthography – grapheme-based accuracy can be improved. An analysis of speech recognition accuracy based on word categories shows that P2G transliteration succeeds in improving certain word categories in which grapheme-based systems typically perform poorly, and that the problematic categories can be identified prior to system development. An evaluation is offered of when category-based P2G transliteration is beneficial and methods to implement the technique in practice are discussed. Comparative results are obtained for a second language (Vietnamese) in order to determine whether the technique can be generalised. / MSc (Computer Science) North-West University, Vaal Triangle Campus, 2014
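As a rough illustration of category-based P2G transliteration, the sketch below rewrites the orthography of a small set of irregular lexicon entries into an 'idealised' spelling derived from their phoneme strings. The phoneme symbols, mapping table, and example words are invented for illustration and are not the Afrikaans or Vietnamese mappings used in the thesis.

```python
# Hypothetical phoneme-to-grapheme map used to generate an 'idealised'
# orthography for irregular words; the entries are illustrative only.
P2G_MAP = {
    "S": "sj",    # sh-like fricative rendered with a regular grapheme pair
    "x": "g",
    "@": "e",
    "i": "ie",
    "k": "k",
    "t": "t",
}

def p2g_transliterate(phonemes):
    """Map a space-separated phoneme string to an idealised grapheme string."""
    return "".join(P2G_MAP.get(p, p) for p in phonemes.split())

def rebuild_lexicon(lexicon, irregular_words):
    """Replace the orthography of irregular words so their grapheme
    'pronunciations' become regular; regular words keep their original spelling."""
    rebuilt = {}
    for word, phonemes in lexicon.items():
        surface = p2g_transliterate(phonemes) if word in irregular_words else word
        rebuilt[word] = surface
    return rebuilt

if __name__ == "__main__":
    lexicon = {"tjek": "t S @ k", "kat": "k a t"}     # invented entries
    print(rebuild_lexicon(lexicon, irregular_words={"tjek"}))
```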
318

The evaluation of the stability of acoustic features in affective conveyance across multiple emotional databases

Sun, Rui 20 September 2013
The objective of the research presented in this thesis was to systematically investigate the computational structure for cross-database emotion recognition. The research consisted of evaluating the stability of acoustic features, particularly the glottal and Teager Energy-based features, and investigating three normalization methods and two data fusion techniques. One of the challenges of cross-database training and testing is accounting for the potential variation in the types of emotions expressed as well as in the recording conditions. In an attempt to alleviate the impact of these types of variation, three normalization methods for the acoustic data were studied. Motivated by the lack of a sufficiently large and diverse emotional database for training the classifier, multiple databases were used for training, which posed another challenge: data fusion. This thesis proposed two data fusion techniques, pre-classification SDS and post-classification ROVER, to study the issue. Using the glottal, TEO, and TECC features, whose stability in distinguishing emotions has been demonstrated on multiple databases, the systematic computational structure proposed in this thesis could improve the performance of cross-database binary-emotion recognition by up to 23% for neutral vs. emotional and 10% for positive vs. negative.
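The thesis's three normalization methods are not detailed in this abstract; as one plausible example of reducing corpus-dependent recording differences before cross-database training, the sketch below applies per-database z-score normalization to each acoustic feature dimension. The corpus names and feature dimensionality are illustrative assumptions.

```python
import numpy as np

def normalize_per_database(features_by_db):
    """Z-score normalize each feature dimension separately within each database.

    `features_by_db` maps a database name to an (n_utterances, n_features) array.
    Normalizing per corpus removes gross recording/channel offsets before the
    corpora are pooled for cross-database training.
    """
    normalized = {}
    for db, feats in features_by_db.items():
        mean = feats.mean(axis=0)
        std = feats.std(axis=0) + 1e-8        # avoid division by zero
        normalized[db] = (feats - mean) / std
    return normalized

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = {"corpus_a": rng.normal(5.0, 2.0, (100, 12)),   # offset, wider spread
            "corpus_b": rng.normal(0.0, 1.0, (80, 12))}
    norm = normalize_per_database(data)
    print({db: (round(f.mean(), 3), round(f.std(), 3)) for db, f in norm.items()})
```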
319

Voice query-by-example for resource-limited languages using an ergodic hidden Markov model of speech

Ali, Asif 13 January 2014
An ergodic hidden Markov model (EHMM) can be useful in extracting underlying structure embedded in connected speech without the need for a time-aligned transcribed corpus. In this research, we present a query-by-example (QbE) spoken term detection system based on an ergodic hidden Markov model of speech. An EHMM-based representation of speech is not invariant to speaker-dependent variations due to the unsupervised nature of the training. Consequently, a single phoneme may be mapped to a number of EHMM states. The effects of speaker-dependent and context-induced variation in speech on its EHMM-based representation have been studied and used to devise schemes to minimize these variations. Speaker invariance can be introduced into the system by identifying states with similar perceptual characteristics. In this research, two unsupervised clustering schemes have been proposed to identify perceptually similar states in an EHMM. A search framework, consisting of a graphical keyword modeling scheme and a modified Viterbi algorithm, has also been implemented. The EHMM-based QbE system has been compared to the state of the art and demonstrated to achieve higher precision than systems based on static clustering schemes.
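The two clustering schemes proposed in the thesis are not specified in this abstract; as an illustrative assumption, the sketch below groups EHMM states whose Gaussian emission densities are close under symmetrized KL divergence, which is one plausible way to operationalize "perceptually similar" states for introducing speaker invariance.

```python
import numpy as np
from itertools import combinations

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL divergence between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def cluster_states(means, variances, threshold=2.0):
    """Greedy clustering of EHMM states: states whose symmetrized KL divergence
    falls below the threshold are merged into the same perceptual cluster.
    Uses a simple union-find over state pairs."""
    n = len(means)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        d = 0.5 * (gaussian_kl(means[i], variances[i], means[j], variances[j]) +
                   gaussian_kl(means[j], variances[j], means[i], variances[i]))
        if d < threshold:
            parent[find(i)] = find(j)

    return [find(i) for i in range(n)]     # cluster label per state

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    means = [rng.normal(0, 1, 13), rng.normal(0, 1, 13)]
    means.append(means[0] + 0.05)          # a state nearly identical to state 0
    variances = [np.ones(13) * 0.5 for _ in means]
    print(cluster_states(np.array(means), np.array(variances)))
```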
320

The Variable Pronunciations of Word-final Consonant Clusters in a Force Aligned Corpus of Spoken French

Milne, Peter 23 May 2014
This thesis project examined both schwa insertion and simplification following word-final consonant clusters in a large corpus of spoken French. Two main research questions were addressed. Can a system of forced alignment reliably reproduce pronunciation judgements that closely match those of a human researcher? How do variables such as speech style, following context, motivation for simplification, and speech rate affect the variable pronunciations of word-final consonant clusters? This project describes the creation and testing of a novel system of forced alignment capable of segmenting recorded French speech. The results of comparing the pronunciation judgements between automatic and manual methods of recognition suggest that a system of forced alignment using speaker-adapted acoustic models performed better than other acoustic models; produced results that are likely to be similar to the results produced by manual identification; and that the results of forced alignment are not likely to be affected by changes in speech style or speech rate. This project also described the application of forced alignment to a corpus of natural-language spoken French. The results presented in this large sample corpus analysis suggest that the dialectal differences between Québec and France are not as simple as "simplification in Québec, schwa insertion in France". While the results presented here suggest that the process of simplification following a word-final consonant cluster is similar in both dialects, the process of schwa insertion is likely to be different in each dialect. In both dialects, word-final consonant cluster simplification is more frequent in a preconsonantal context, is most likely in a spontaneous or less formal speech style, and in that speech style is positively associated with higher speaking rates. Schwa insertion following a word-final consonant cluster displays much stronger dialectal differences. Schwa insertion in the dialect from France is strongly affected by following context and possibly speech style. Schwa insertion in the dialect from Québec is not affected by following context and is strongly predicted by a lack of consonant cluster simplification.
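Since the project validates forced alignment by comparing its pronunciation judgements with a human researcher's, a chance-corrected agreement measure is a natural summary statistic. The sketch below computes raw agreement and Cohen's kappa over paired judgements; the label set and example values are invented and are not results from the corpus.

```python
from collections import Counter

def agreement_and_kappa(manual, automatic):
    """Compare manual and forced-alignment pronunciation judgements.

    `manual` and `automatic` are equal-length lists of labels such as
    "schwa", "simplified", or "full_cluster" for the same tokens.
    Returns raw agreement and Cohen's kappa (chance-corrected agreement).
    """
    assert len(manual) == len(automatic)
    n = len(manual)
    observed = sum(m == a for m, a in zip(manual, automatic)) / n
    # Expected agreement by chance from each annotator's label distribution.
    pm, pa = Counter(manual), Counter(automatic)
    labels = set(pm) | set(pa)
    expected = sum((pm[l] / n) * (pa[l] / n) for l in labels)
    kappa = (observed - expected) / (1.0 - expected) if expected < 1.0 else 1.0
    return observed, kappa

if __name__ == "__main__":
    manual = ["schwa", "simplified", "full_cluster", "simplified", "schwa"]
    auto = ["schwa", "simplified", "full_cluster", "full_cluster", "schwa"]
    print(agreement_and_kappa(manual, auto))   # illustrative labels only
```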
