1

Linear Dynamic Model for Continuous Speech Recognition

Ma, Tao 30 April 2011 (has links)
In the past decades, statistics-based hidden Markov models (HMMs) have become the predominant approach to speech recognition. Under this framework, the speech signal is modeled as a piecewise stationary signal (typically over an interval of 10 milliseconds). Speech features are assumed to be temporally uncorrelated. While these simplifications have enabled tremendous advances in speech processing systems, for the past several years progress on the core statistical models has stagnated. Since machine performance still significantly lags human performance, especially in noisy environments, researchers have been looking beyond the traditional HMM approach. Recent theoretical and experimental studies suggest that exploiting frame-to-frame correlations in a speech signal further improves the performance of ASR systems. This is typically accomplished by developing an acoustic model which includes higher order statistics or trajectories. Linear Dynamic Models (LDMs) have generated significant interest in recent years due to their ability to model higher order statistics. LDMs use a state space-like formulation that explicitly models the evolution of hidden states using an autoregressive process. This smoothed trajectory model allows the system to better track the speech dynamics in noisy environments. In this dissertation, we develop a hybrid HMM/LDM speech recognizer that effectively integrates these two powerful technologies. This hybrid system is capable of handling large recognition tasks, is robust to noise-corrupted speech data and mitigates the ill-effects of mismatched training and evaluation conditions. This two-pass system leverages the temporal modeling and N-best list generation capabilities of the traditional HMM architecture in a first pass analysis. In the second pass, candidate sentence hypotheses are re-ranked using a phone-based LDM model. The Wall Street Journal (WSJ0) derived Aurora-4 large vocabulary corpus was chosen as the training and evaluation dataset. This corpus is a well-established LVCSR benchmark with six different noisy conditions. The implementation and evaluation of the proposed hybrid HMM/LDM speech recognizer is the major contribution of this dissertation.
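As a point of reference for the state-space formulation mentioned above, a generic LDM can be written as follows; the notation is a textbook sketch and is not taken from the dissertation itself.

```latex
% Generic linear dynamic model (illustrative notation, not the dissertation's):
% the hidden state x_t evolves autoregressively, and the observed speech
% features y_t are a noisy linear projection of that state.
\begin{aligned}
  x_t &= F\,x_{t-1} + w_t, & w_t &\sim \mathcal{N}(0, Q) \\
  y_t &= H\,x_t + v_t,     & v_t &\sim \mathcal{N}(0, R)
\end{aligned}
```

Inference over the hidden state (e.g., Kalman filtering and smoothing) is what produces the smoothed trajectory behaviour described in the abstract.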
2

Support Vector Machines for Speech Recognition

Ganapathiraju, Aravind 11 May 2002 (has links)
Hidden Markov models (HMM) with Gaussian mixture observation densities are the dominant approach in speech recognition. These systems typically use a representational model for acoustic modeling which can often be prone to overfitting and does not translate to improved discrimination. We propose a new paradigm centered on principles of structural risk minimization using a discriminative framework for speech recognition based on support vector machines (SVMs). SVMs have the ability to simultaneously optimize the representational and discriminative ability of the acoustic classifiers. We have developed the first SVM-based large vocabulary speech recognition system that improves performance over traditional HMM-based systems. This hybrid system achieves a state-of-the-art word error rate of 10.6% on a continuous alphadigit task, a 10% improvement relative to an HMM system. On SWITCHBOARD, a large vocabulary task, the system improves performance over a traditional HMM system from 41.6% word error rate to 40.6%. This dissertation discusses several practical issues that arise when SVMs are incorporated into the hybrid system.
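As a rough sketch of the discriminative building block described here, the following trains an SVM on fixed-length acoustic feature vectors. It uses scikit-learn purely for illustration; the dissertation's actual hybrid HMM/SVM system, its segment-level features, and its N-best rescoring machinery are not shown.

```python
# Minimal sketch: SVM classification of fixed-length acoustic feature vectors.
# Data and dimensions are stand-ins (200 segments x 39 MFCC-like features,
# two phone classes); not the dissertation's setup.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 39))          # stand-in acoustic feature vectors
y = rng.integers(0, 2, size=200)        # stand-in phone-class labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, probability=True))
clf.fit(X, y)

# Posterior-like scores from the SVM could then be used to rescore
# N-best hypotheses produced by an HMM first pass.
print(clf.predict_proba(X[:5]))
```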
3

Representative Environments for Reduced Estimation Time of Wide Area Acoustic Performance

Fabre, Josette 14 May 2010 (has links)
Ocean modeling has advanced (Barron et al., 2006) such that ocean forecasts, and even ensembles representing ocean uncertainty (e.g., Coelho et al., 2009), are becoming more widely available. This facilitates nowcasts (current-time ocean fields / analyses) and forecasts (predicted ocean fields) of acoustic propagation conditions in the ocean, which can greatly improve the planning of acoustic experiments. Modeling of acoustic transmission loss (TL) provides information about how the environment impacts acoustic performance for various systems and system configurations of interest. It is, however, very time consuming to compute acoustic propagation to and from many potential source and receiver locations on an area-wide grid for multiple analysis / forecast times, ensembles and scenarios of interest. Currently, to make such wide-area predictions, an area is gridded and acoustic predictions for multiple directions (or radials) at each grid point, for a single time period or ensemble, are computed to estimate performance on the grid. This grid generally does not consider the environment and can neglect important environmental-acoustic features or can overcompute in areas of environmental-acoustic isotropy. This effort develops two methods to pre-examine the area and time frame in terms of the environmental acoustics in order to prescribe an environmentally optimized computational grid, one that takes advantage of environmental-acoustic similarities and differences to characterize an area, time frame and ensemble with fewer acoustic model predictions and thus less computation time. Such improvement allows for a more thorough characterization of the time frame and area of interest. The first method is based on critical factors in the environment that typically indicate acoustic response, and the second method is based on a more robust, full-waveguide, mode-based description of the environment. Results for the critical factors method show that it is a viable solution for most cases studied; its limitations arise in areas of high loss, which may not be of concern for exercise planning. The mode-based method is developed for range-independent environments and shows significant promise for future development.
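One way to picture the environmentally optimized grid idea is sketched below: cluster grid points by assumed environmental "critical factors" and run the transmission-loss model only at one representative point per cluster. The feature set, cluster count, and use of k-means are illustrative assumptions, not the method actually developed in the dissertation.

```python
# Sketch: group grid points by environmental features so the TL model is run
# once per cluster instead of at every grid point. Features and values are
# hypothetical (water depth, sonic-layer depth, bottom sound speed).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_points = 500
features = np.column_stack([
    rng.uniform(50, 4000, n_points),    # water depth (m)
    rng.uniform(10, 150, n_points),     # sonic-layer depth (m)
    rng.uniform(1500, 1800, n_points),  # bottom sound speed (m/s)
])

# Normalize and cluster; each cluster gets one representative TL calculation.
normed = (features - features.mean(0)) / features.std(0)
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(normed)

representatives = []
for k in range(kmeans.n_clusters):
    members = np.flatnonzero(kmeans.labels_ == k)
    # Pick the member closest to the cluster centroid as the point to model.
    d = np.linalg.norm(normed[members] - kmeans.cluster_centers_[k], axis=1)
    representatives.append(members[np.argmin(d)])

print(f"{n_points} grid points -> {len(representatives)} acoustic model runs")
```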
4

Evaluation of modern large-vocabulary speech recognition techniques and their implementation

Swart, Ranier Adriaan 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / In this thesis we studied large-vocabulary continuous speech recognition. We considered the components necessary to realise a large-vocabulary speech recogniser and how systems such as Sphinx and HTK solved the problems facing such a system. Hidden Markov Models (HMMs) have been a common approach to acoustic modelling in speech recognition in the past. HMMs are well suited to modelling speech, since they are able to model both its stationary nature and temporal effects. We studied HMMs and the algorithms associated with them. Since incorporating all knowledge sources as efficiently as possible is of the utmost importance, the N-Best paradigm was explored along with some more advanced HMM algorithms. The way in which sounds and words are constructed has been studied extensively in the past. Context dependency on the acoustic level and on the linguistic level can be exploited to improve the performance of a speech recogniser. We considered some of the techniques used in the past to solve the associated problems. We implemented and combined some chosen algorithms to form our system and reported the recognition results. Our final system performed reasonably well and will form an ideal framework for future studies on large-vocabulary speech recognition at the University of Stellenbosch. Many avenues of research for future versions of the system were considered.
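Since the HMM algorithms studied in the thesis centre on decoding, a toy log-domain Viterbi pass is sketched below; the model sizes and probabilities are made up for illustration and the code is not drawn from the thesis's implementation.

```python
# Toy Viterbi decoder in the log domain: finds the most likely HMM state
# sequence for a short observation sequence. Model values are invented.
import numpy as np

log_pi = np.log(np.array([0.6, 0.4]))              # initial state probabilities
log_A = np.log(np.array([[0.7, 0.3],               # state transition matrix
                         [0.4, 0.6]]))
log_B = np.log(np.array([[0.5, 0.4, 0.1],          # emission probabilities
                         [0.1, 0.3, 0.6]]))
obs = [0, 1, 2, 2]                                  # observed symbol indices

n_states, T = log_A.shape[0], len(obs)
delta = np.full((T, n_states), -np.inf)             # best log-score ending in each state
backptr = np.zeros((T, n_states), dtype=int)

delta[0] = log_pi + log_B[:, obs[0]]
for t in range(1, T):
    for j in range(n_states):
        scores = delta[t - 1] + log_A[:, j]
        backptr[t, j] = np.argmax(scores)
        delta[t, j] = scores.max() + log_B[j, obs[t]]

# Backtrace the best path.
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(backptr[t, path[-1]]))
path.reverse()
print("best state sequence:", path)
```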
5

Lecture transcription systems in resource-scarce environments / Pieter Theunis de Villiers

De Villiers, Pieter Theunis January 2014 (has links)
Classroom note taking is a fundamental task performed by learners on a daily basis. These notes provide learners with valuable offline study material, especially in the case of more difficult subjects. The use of class notes has been found not only to provide students with a better learning experience, but also to lead to overall higher academic performance. In a previous study, an increase of 10.5% in student grades was observed after these students had been provided with multimedia class notes. This is not surprising, as other studies have found that the rate of successful transfer of information to humans increases when provided with both visual and audio information. Note taking might seem like an easy task; however, students with hearing impairments, visual impairments, physical impairments, learning disabilities or even non-native listeners find this task very difficult to impossible. It has also been reported that even non-disabled students find note taking time consuming and that it requires a great deal of mental effort while also trying to pay full attention to the lecturer. This is illustrated by a study which found that college students were only able to record ~40% of the data presented by the lecturer. It is thus reasonable to expect an automatic way of generating class notes to be beneficial to all learners. Lecture transcription (LT) systems are used in educational environments to assist learners by providing them with real-time in-class transcriptions or recordings and transcriptions for offline use. Such systems have already been successfully implemented in the developed world, where all required resources were easily obtained. These systems are typically trained on hundreds to thousands of hours of speech, while their language models are trained on millions or even hundreds of millions of words. These amounts of data are generally not available in the developing world. In this dissertation, a number of approaches toward the development of LT systems in resource-scarce environments are investigated. We focus on different approaches to obtaining sufficient amounts of well-transcribed data for building acoustic models, using corpora with few transcriptions and of variable quality. One approach uses a dynamic programming phone-string alignment procedure to harvest as much usable data as possible from approximately transcribed speech data. We find that target-language acoustic models are optimal for this purpose, but encouraging results are also found when using models from another language for alignment. Another approach entails using unsupervised training methods where an initial low-accuracy recognizer is used to transcribe a set of untranscribed data. From this poorly transcribed data, correctly recognized portions are extracted based on a word confidence threshold. The initial system is retrained along with the newly recognized data in order to increase its overall accuracy. The initial acoustic models are trained using as little as 11 minutes of transcribed speech. After several iterations of unsupervised training, a noticeable increase in accuracy was observed (47.79% WER to 33.44% WER). Similar results (35.97% WER) were, however, found after using a large speaker-independent corpus to train the initial system. Usable LMs were also created using as few as 17,955 words from transcribed lectures; however, this resulted in large out-of-vocabulary rates. This problem was solved by means of LM interpolation.
LM interpolation was found to be very beneficial in cases where subject specific data (such as lecture slides and books) was available. We also introduce our NWU LT system, which was developed for use in learning environments and was designed using a client/server based architecture. Based on the results found in this study we are confident that usable models for use in LT systems can be developed in resource-scarce environments. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
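The confidence-thresholded self-training loop described above can be sketched as follows. The recognize/retrain callables and the 0.9 word-confidence threshold are hypothetical stand-ins for the actual ASR toolkit calls and threshold used in the study.

```python
# Sketch of confidence-based unsupervised (self-)training. The recognizer and
# retraining step are injected as callables; the mock versions below are toy
# stand-ins so the sketch runs end to end.

def unsupervised_training(model, untranscribed, recognize, retrain,
                          threshold=0.9, iterations=3):
    for _ in range(iterations):
        harvested = []
        for utt_id, hyp, confidences in recognize(model, untranscribed):
            # Keep only the words the current recognizer is confident about.
            kept = [w for w, c in zip(hyp, confidences) if c >= threshold]
            if kept:
                harvested.append((utt_id, kept))
        model = retrain(model, harvested)   # fold harvested data back in
    return model

# Toy usage with mock recognizer/retrainer (illustration only).
def mock_recognize(model, utts):
    return [(u, ["hello", "world"], [0.95, 0.42]) for u in utts]

def mock_retrain(model, harvested):
    return {"version": model["version"] + 1,
            "harvested_words": sum(len(words) for _, words in harvested)}

final = unsupervised_training({"version": 0}, ["utt1", "utt2"],
                              mock_recognize, mock_retrain)
print(final)
```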
6

Automatic Identification System of Merchant Shipping in the Application of the Kaohsiung Harbor Protection

Wu, Cheng-Feng 24 July 2012 (has links)
Kaohsiung Harbor is one of the major commercial ports in Taiwan, located at the hub of northeastern and southeastern Asia shipping lanes. A considerable number of commercial shipping channels are therefore distributed around Kaohsiung Harbor, and the complexity of these channels makes the harbor more difficult to defend than others. In this study, the Automatic Identification System (AIS) is used to collect ship information from June 1, 2010 to June 30, 2011. The collected AIS data were decoded, converted, corrected, integrated and analyzed systematically, forming the basis of a future database. The AIS information includes Maritime Mobile Service Identity (MMSI), latitude and longitude, heading, course, speed, and other fields. Because ship activities can be monitored by AIS, the density and distribution of ships on each major channel can be obtained by grid computing. From one year of AIS data, three major shipping channels of Kaohsiung Harbor can be identified: north-western, north-southern, and east-western. Based on such long-term shipping statistics, novel harbor security defenses may be developed. Although AIS was designed to monitor ship activities, it can be deliberately shut down, or its signal can fall out of range, creating a potential security breach. Nevertheless, ships at sea generate characteristic noise, such as from engines and propellers. Because sound waves propagate efficiently in water, acoustic technology may compensate for the limitations of AIS and provide a feasible method of detecting unknown ships. In this study, the acoustic modeling code "Acoustic Module for Sea-surface Noise" (AMSN) is applied, using ship position information from AIS, to calculate the underwater noise sound field of Kaohsiung Harbor. The dependence of noise-level variation on ship density is discussed. In conclusion, with sufficient understanding of the harbor's sound-field statistics, any anomaly in noise level can indicate a hostile intrusion, and harbor security can thus be further assured.
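A minimal sketch of the grid-density step is shown below: decoded AIS position reports are binned onto a latitude/longitude grid and counted per cell. The record fields, bounding box, and 0.05-degree cell size are illustrative assumptions rather than the values used in the study.

```python
# Sketch: bin decoded AIS position reports onto a lat/lon grid to estimate
# ship density around a harbor. Records, bounds, and cell size are invented.
import numpy as np

# Stand-in decoded AIS records: (MMSI, latitude, longitude)
reports = [
    (412000001, 22.55, 120.27),
    (412000002, 22.60, 120.20),
    (412000001, 22.56, 120.28),
    (416000003, 22.45, 120.10),
]

lat_min, lat_max = 22.3, 22.8        # rough box around the harbor approaches
lon_min, lon_max = 120.0, 120.5
cell = 0.05                           # grid resolution in degrees

lats = np.array([r[1] for r in reports])
lons = np.array([r[2] for r in reports])

density, lat_edges, lon_edges = np.histogram2d(
    lats, lons,
    bins=[np.arange(lat_min, lat_max + cell, cell),
          np.arange(lon_min, lon_max + cell, cell)],
)
print("reports per grid cell:\n", density.astype(int))
```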
7

Simulated and laboratory models of aircraft sound transmission

Thomas, Ashwin Paul 27 August 2014 (has links)
With increased exposure to transportation noise, there have been continued efforts to help insulate homes from aircraft noise. Current aircraft noise guidelines are based primarily on outdoor sound levels. As people spend the majority of their time indoors, however, human perception is evidently more related to indoor sound levels. Investigations are being made to provide further insight into how typical residential constructions affect indoor response. A pilot study has built a single-room "test house", according to typical construction for mixed-humid climate regions, and has directly measured outdoor-to-indoor transmission of sound - with specific focus on continuous commercial aircraft signatures. The results of this study are being used to validate and improve modelling software that simulates a wide range of construction types and configurations for other US climate regions. The improved models will allow for increased flexibility in simulating the impacts of acoustic and energy retrofits. Overall, the project intends to improve the ability to predict acoustic performance for typical US construction types as well as for any possible design alterations for sound insulation.
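To make the outdoor-to-indoor idea concrete, the toy calculation below applies a hypothetical per-octave-band facade noise reduction to a hypothetical outdoor aircraft spectrum and A-weights the result; none of the numbers come from the test house study described above.

```python
# Toy outdoor-to-indoor level estimate: subtract a per-band noise reduction
# from an outdoor spectrum, then A-weight and sum. All spectra are invented.
import math

bands_hz        = [63, 125, 250, 500, 1000, 2000, 4000]
outdoor_db      = [78, 76, 74, 72, 70, 66, 60]      # hypothetical aircraft spectrum
noise_reduction = [14, 18, 24, 30, 34, 38, 42]      # hypothetical facade NR per band
a_weighting     = [-26.2, -16.1, -8.6, -3.2, 0.0, 1.2, 1.0]  # standard A-weights

indoor_db  = [o - nr for o, nr in zip(outdoor_db, noise_reduction)]
indoor_dba = 10 * math.log10(sum(10 ** ((level + a) / 10)
                                 for level, a in zip(indoor_db, a_weighting)))
print(f"Estimated indoor level: {indoor_dba:.1f} dBA")
```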
8

Nonparametric Bayesian Approaches for Acoustic Modeling

Harati Nejad Torbati, Amir Hossein January 2015 (has links)
The goal of Bayesian analysis is to reduce the uncertainty about unobserved variables by combining prior knowledge with observations. A fundamental limitation of a parametric statistical model, including a Bayesian approach, is the inability of the model to learn new structures. The goal of the learning process is to estimate the correct values for the parameters. The accuracy of these parameters improves with more data but the model’s structure remains fixed. Therefore new observations will not affect the overall complexity (e.g. number of parameters in the model). Recently, nonparametric Bayesian methods have become a popular alternative to Bayesian approaches because the model structure is learned simultaneously with the parameter distributions in a data-driven manner. The goal of this dissertation is to apply nonparametric Bayesian approaches to the acoustic modeling problem in continuous speech recognition. Three important problems are addressed: (1) statistical modeling of sub-word acoustic units; (2) semi-supervised training algorithms for nonparametric acoustic models; and (3) automatic discovery of sub-word acoustic units. We have developed a Doubly Hierarchical Dirichlet Process Hidden Markov Model (DHDPHMM) with a non-ergodic structure that can be applied to problems involving sequential modeling. DHDPHMM shares mixture components between states using two Hierarchical Dirichlet Processes (HDP). An inference algorithm for this model has been developed that enables DHDPHMM to outperform both its hidden Markov model (HMM) and HDP HMM (HDPHMM) counterparts. This inference algorithm is shown to also be computationally less expensive than a comparable algorithm for HDPHMM. In addition to sharing data, the proposed model can learn non-ergodic structures and non-emitting states, something that HDPHMM does not support. This extension to the model is used to model finite length sequences. We have also developed a generative model for semi-supervised training of DHDPHMMs. Semi-supervised learning is an important practical requirement for many machine learning applications including acoustic modeling in speech recognition. The relative improvement in error rates on classification and recognition tasks is shown to be 22% and 7% respectively. Semi-supervised training results are slightly better than supervised training (29.02% vs. 29.71%). Context modeling was also investigated and results show a modest improvement of 1.5% relative over the baseline system. We also introduce a nonparametric Bayesian transducer based on an ergodic HDPHMM/DHDPHMM that automatically segments and clusters the speech signal using an unsupervised approach. This transducer was used in several applications including speech segmentation, acoustic unit discovery, spoken term detection and automatic generation of a pronunciation lexicon. For the segmentation problem, an F-score of 76.62% was achieved which represents a 9% relative improvement over the baseline system. On the spoken term detection tasks, an average precision of 64.91% was achieved, which represents a 20% improvement over the baseline system. Lexicon generation experiments also show automatically discovered units (ADU) generalize to new datasets. In this dissertation, we have established the foundation for applications of non-parametric Bayesian modeling to problems such as speech recognition that involve sequential modeling.
These models allow a new generation of machine learning systems that adapt their overall complexity in a data-driven manner and yet preserve meaningful modalities in the data. As a result, these models improve generalization and offer higher performance at lower complexity. / Electrical and Computer Engineering
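For context, the standard hierarchical Dirichlet process construction that underlies the HDPHMM/DHDPHMM family can be written as below; this is the generic textbook formulation, not the dissertation's specific DHDPHMM construction.

```latex
% Generic hierarchical Dirichlet process, shown for context: the groups j
% share atoms through the common random base measure G_0, which is what lets
% HMM states share mixture components in HDP-based acoustic models.
\begin{aligned}
  G_0         &\sim \mathrm{DP}(\gamma, H) \\
  G_j         &\sim \mathrm{DP}(\alpha, G_0), & j &= 1, \dots, J \\
  \theta_{ji} &\sim G_j, \qquad y_{ji} \sim F(\theta_{ji})
\end{aligned}
```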
9

Geometric Acoustic Modeling of the LDS Conference Center

Smith, Heather 09 November 2004 (has links) (PDF)
This thesis discusses the process of modeling a 21,000 seat fan-shaped auditorium using methods of geometric acoustics. Two commercial geometric acoustics software packages were used in the research: CATT-Acoustic™ 8.0 and EASE™ 4.1. The process first included creating preliminary models of the hall using published absorption coefficients for its surfaces and approximate scattering coefficients based on current best-known techniques. A detailed analysis determined the minimum numbers of rays needed in both packages to produce reliable results with these coefficient values. It was found that 100,000 rays were needed for CATT™ and 500,000 rays were needed for EASE™. Analysis was also done to determine whether the model was sensitive to the scattering coefficients of the seating areas. It was found that most acoustic parameters were not significantly affected by scattering coefficient variation. The models were subsequently refined by including measured absorption coefficients of dominant surfaces in the hall: the seats, audience and suspended absorptive panels. Comparisons were made between measurements made in the hall and results from the computer models with impulse responses, acoustic parameters, and auralizations. The results have shown that the models have been successful at representing characteristics of the hall at some positions but less successful at representing them at other positions. Comparisons have shown that positions on the rostrum were especially difficult positions to model in this hall. Significant differences were not found between the preliminary models and the refined models. There was not significant evidence showing that either the EASE™ or the CATT™ model was more successful in accurately representing the acoustical conditions of the hall. The results from this research suggest that more work must be done to improve the modeling capabilities of these packages for this application.
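As a reminder of how the absorption coefficients discussed above enter room-acoustic estimates, a toy Sabine reverberation-time calculation is sketched below; the volume, surface areas, and coefficients are invented numbers, not data from the Conference Center model.

```python
# Toy Sabine reverberation-time estimate from surface areas and absorption
# coefficients. Values are illustrative only.

def sabine_rt60(volume_m3, surfaces):
    """RT60 = 0.161 * V / A, where A is the total absorption in metric sabins."""
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / total_absorption

# (surface area in m^2, absorption coefficient at some octave band)
surfaces = [
    (12000.0, 0.70),   # upholstered seating area
    (8000.0, 0.05),    # painted drywall / plaster
    (3000.0, 0.60),    # suspended absorptive panels
]
print(f"Estimated RT60: {sabine_rt60(150000.0, surfaces):.2f} s")
```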
