1

Nonlinear compensation and heterogeneous data modeling for robust speech recognition

Zhao, Yong 21 February 2013 (has links)
The goal of robust speech recognition is to maintain satisfactory recognition accuracy under mismatched operating conditions. This dissertation addresses the robustness issue from two directions. In the first part of the dissertation, we propose the Gauss-Newton method as a unified approach to estimating noise parameters for use in prevalent nonlinear compensation models, such as vector Taylor series (VTS), data-driven parallel model combination (DPMC), and the unscented transform (UT), for noise-robust speech recognition. While iterative estimation of noise means in a generalized EM framework is widely known, we demonstrate that such approaches are variants of the Gauss-Newton method. Furthermore, we propose a novel noise variance estimation algorithm that is consistent with the Gauss-Newton principle. The Gauss-Newton formulation reduces the noise estimation problem to determining the Jacobians of the corrupted speech parameters. For sampling-based compensation, we present two methods, sample Jacobian average (SJA) and cross-covariance (XCOV), to evaluate these Jacobians. The Gauss-Newton method is closely related to another noise estimation approach that views model compensation from a generative perspective, giving rise to an EM-based algorithm analogous to ML estimation for factor analysis (EM-FA). We demonstrate a close connection between these two approaches: both belong to the family of gradient-based methods but converge at different rates. Convergence behavior can be crucial in applications where model compensation must be carried out frequently in changing noisy environments to retain the desired performance. Several techniques are further explored to improve the nonlinear compensation approaches. To remove the need for clean speech data when training acoustic models, we integrate nonlinear compensation with adaptive training. We also investigate fast VTS compensation to improve noise estimation efficiency, and combine VTS compensation with acoustic echo cancellation (AEC) to mitigate interference from background speech. The proposed noise estimation algorithm is evaluated for various compensation models on three tasks: fitting a GMM to artificially corrupted samples, speech recognition on the Aurora 2 database, and recognition on a speech corpus simulating meetings of multiple competing speakers. The significant performance improvements confirm the efficacy of the Gauss-Newton method in estimating the noise parameters of the nonlinear compensation models. The second part of the dissertation is devoted to developing more effective models that take full advantage of heterogeneous speech data, which are typically collected from thousands of speakers in various environments via different transducers. The proposed synchronous HMM, in contrast to conventional HMMs, introduces an additional layer of substates between the HMM state and the Gaussian component variables. The substates can register long-span non-phonetic attributes, such as gender, speaker identity, and environmental condition, which are collectively called speech scenes in this study. This hierarchical modeling scheme allows an accurate description of the probability distributions of speech units in different speech scenes.
To address the data sparsity problem in estimating the parameters of multiple speech scene sub-models, a decision-based clustering algorithm is presented to determine the set of speech scenes and to tie the substate parameters, allowing an excellent balance between modeling accuracy and robustness to be achieved. In addition, by exploiting the synchronous relationship among the speech scene sub-models, we propose the multiplex Viterbi algorithm, which efficiently decodes the synchronous HMM within a search space of the same size as for the standard HMM. The multiplex Viterbi can also be generalized to decode an ensemble of isomorphic HMM sets, a problem that often arises in multi-model systems. Experiments on the Aurora 2 task show that the synchronous HMMs produce a significant improvement in recognition performance over the HMM baseline at the expense of a moderate increase in memory requirements and computational complexity.
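To make the role of the Jacobians in the Gauss-Newton formulation concrete, the following is a minimal sketch, assuming a simplified log-Mel-domain VTS mismatch function with no channel term, unit observation covariances, and diagonal Jacobians; the function names, arguments, and the omission of the cepstral transform and of variance compensation are illustrative simplifications, not the dissertation's implementation.

    import numpy as np

    def vts_mismatch(mu_x, mu_n):
        """Log-Mel VTS mismatch function: mean of noisy speech for y = x (+) n."""
        return mu_x + np.log1p(np.exp(mu_n - mu_x))

    def gauss_newton_noise_mean(clean_means, gammas, noisy_means, mu_n0, iters=5):
        """Gauss-Newton estimate of the additive-noise mean (simplified sketch).

        clean_means : (K, D) clean-speech Gaussian means in the log-Mel domain
        gammas      : (K,)   occupancy counts (responsibilities) of the components
        noisy_means : (K, D) per-component means of the observed noisy features
        mu_n0       : (D,)   initial noise-mean estimate
        """
        mu_n = mu_n0.astype(float).copy()
        for _ in range(iters):
            num = np.zeros_like(mu_n)
            den = np.zeros_like(mu_n)
            for mu_x, gamma, y_bar in zip(clean_means, gammas, noisy_means):
                mu_y = vts_mismatch(mu_x, mu_n)
                # Jacobian d(mu_y)/d(mu_n); diagonal in the log-Mel domain
                J = 1.0 / (1.0 + np.exp(mu_x - mu_n))
                num += gamma * J * (y_bar - mu_y)   # weighted residual
                den += gamma * J * J
            mu_n += num / np.maximum(den, 1e-8)     # diagonal Gauss-Newton step
        return mu_n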
2

Robust Speech Recognition by Combining Short-Term and Long-Term Spectrum Based Position-Dependent CMN with Conventional CMN

KITAOKA, Norihide, NAKAGAWA, Seiichi, WANG, Longbiao 01 March 2008 (has links)
No description available.
3

Data-Driven Rescaling of Energy Features for Noisy Speech Recognition

Luan, Miau 18 July 2012 (has links)
In this paper, we investigate rescaling of energy features for noise-robust speech recognition. The performance of a speech recognition system degrades quickly under environmental noise, so speech robustness has long been an important research issue. We propose data-driven energy feature rescaling (DEFR) to adjust the features and reduce the difference between noisy and clean speech features. The method consists of three parts: voice activity detection (VAD), a piecewise log rescaling function, and a parameter searching algorithm. We apply the method to Mel-frequency cepstral coefficients (MFCC) and Teager energy cepstral coefficients (TECC), and compare it with mean subtraction (MS) and mean and variance normalization (MVN). The Aurora 2.0 and Aurora 3.0 databases are used to evaluate performance. The experimental results show that the proposed method effectively improves recognition accuracy.
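For reference, the mean and variance normalization (MVN) baseline mentioned above can be sketched per utterance as follows; this is a minimal sketch (the function name and epsilon guard are illustrative), and the proposed DEFR rescaling itself is not shown.

    import numpy as np

    def mean_variance_normalize(feats, eps=1e-8):
        """Per-utterance mean and variance normalization (MVN).

        feats: (num_frames, num_dims) feature matrix, e.g. MFCCs or TECCs.
        Mean subtraction (MS) corresponds to skipping the division step.
        """
        mu = feats.mean(axis=0)
        sigma = feats.std(axis=0)
        return (feats - mu) / (sigma + eps)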
4

Towards robust conversational speech recognition and understanding

Weng, Chao 12 January 2015 (has links)
While significant progress has been made in automatic speech recognition (ASR) during the last few decades, recognizing and understanding unconstrained conversational speech remains a challenging problem. In this dissertation, five methods/systems are proposed towards a robust conversational speech recognition and understanding system. I. A non-uniform minimum classification error (MCE) approach is proposed which achieves consistent and significant keyword spotting performance gains on both English and Mandarin large-scale spontaneous conversational speech tasks (Switchboard and HKUST Mandarin CTS). II. A hybrid recurrent DNN-HMM system is proposed for robust acoustic modeling, together with a new way of performing backpropagation through time (BPTT). The proposed system achieves state-of-the-art performance on two benchmark datasets, the 2nd CHiME challenge (track 2) and Aurora-4, without front-end preprocessing, speaker adaptive training or multiple decoding passes. III. To study the specific case of conversational speech recognition in the presence of competing talkers, several multi-style training setups of DNNs are investigated and a joint decoder operating on multi-talker speech is introduced. The proposed combined system improves upon the previous state-of-the-art IBM superhuman system by 2.8% absolute on the 2006 speech separation challenge dataset. IV. Latent semantic rational kernels (LSRKs) are proposed for spotting semantic notions in conversational speech. The proposed framework is generalized using tf-idf weighting, latent semantic analysis, WordNet, probabilistic topic models and neural-network-learned representations, and is shown to achieve substantial topic spotting performance gains on two conversational speech tasks, Switchboard and the AT&T HMIHY initial collection. V. Non-uniform sequential discriminative training (DT) of DNNs with LSRKs is proposed, which directly links the information of the proposed LSRK framework to the objective function of the DT. Experimental results on a subset of Switchboard show that the proposed method leads to acoustic models that are more robust with respect to the semantic decoder.
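As an illustration of the tf-idf weighting used to generalize the LSRK framework, here is a minimal sketch; the matrix layout, smoothing choices, and function name are assumptions for illustration, not the thesis implementation.

    import numpy as np

    def tfidf_weights(counts):
        """Classic tf-idf weighting of a document-term count matrix.

        counts: (num_docs, num_terms) raw term counts per document/utterance.
        Returns a matrix of the same shape with tf-idf weights.
        """
        tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
        df = np.count_nonzero(counts, axis=0)                 # document frequency
        idf = np.log(counts.shape[0] / np.maximum(df, 1))     # inverse document frequency
        return tf * idf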
5

En undersökning och jämförelse av två röststyrningsramverk för Android i bullriga miljöer / An examination and comparison of two speech recognition frameworks for Android in noisy environments

Sandström, Rasmus, Renngård, Jonas January 2017 (has links)
Voice control is a technology that most people encounter or use daily. It can be used to interpret voice commands and execute tasks based on the command spoken. According to previous studies, precision problems arise when voice control technologies are used in noisy environments. This study was conducted as an experiment in which the precision of two voice control frameworks for Android was examined. The purpose of the study is to examine the precision of these frameworks in order to support decision making for an organisation that has developed an application to be used by midwives in low- and middle-income countries. Two prototypes were developed using the voice control frameworks PocketSphinx and iSpeech, and their precision was tested in three different environments with sound levels of 25 dB, 60 dB and 80 dB. The results show that the number of correctly registered voice commands decreases considerably depending on the sound level at which the frameworks are tested. The framework that correctly registered the most voice commands was PocketSphinx, but even it had a large margin of error.
6

An integrated approach to feature compensation combining particle filters and Hidden Markov Models for robust speech recognition

Mushtaq, Aleem 19 September 2013 (has links)
The performance of automatic speech recognition systems often degrades in adverse conditions where there is a mismatch between training and testing conditions. This is true for most modern systems, which employ Hidden Markov Models (HMMs) to decode speech utterances. One strategy is to map the distorted features back to clean speech features that correspond well to the features used for training the HMMs. This can be achieved by treating the noisy speech as a distorted version of the clean speech of interest. Under this framework, we can track and consequently extract the underlying clean speech from the noisy signal and use this derived signal to perform utterance recognition. The particle filter is a versatile tracking technique that can be used where conventional techniques such as the Kalman filter often fall short. We propose a particle-filter-based algorithm that compensates the corrupted features according to an additive noise model, incorporating both the statistics from clean speech HMMs and the observed background noise to map noisy features back to clean speech features. Instead of using specific knowledge at the model and state levels from the HMMs, which is hard to estimate, we pool model states into clusters as side information. Since each cluster encompasses more statistics than the original HMM states, there is a higher possibility that the newly formed probability density function at the cluster level can cover the underlying speech variation and generate appropriate particle filter samples for feature compensation. Additionally, a dynamic joint tracking framework that monitors the clean speech signal and noise simultaneously is introduced to obtain good noise statistics. In this approach, the information available from clean speech tracking can be used effectively for noise estimation. The availability of dynamic noise information enhances the robustness of the algorithm in case of large fluctuations in noise parameters within an utterance. Testing the proposed PF-based compensation scheme on the Aurora 2 connected digit recognition task, we achieve an error reduction of 12.15% over the best multi-condition trained models using this integrated PF-HMM framework to estimate cluster-based HMM state sequence information. Finally, we extend the PFC framework and evaluate it on a large-vocabulary recognition task, showing that PFC works well for large-vocabulary systems as well.
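A bootstrap particle filter for this kind of feature compensation can be sketched as follows. This minimal version assumes a random-walk clean-speech model, a fixed Gaussian noise estimate, and an additive log-Mel interaction, and it omits the HMM state-cluster priors and dynamic noise tracking that the thesis actually uses; all names and parameters are illustrative.

    import numpy as np

    def particle_filter_compensate(noisy_feats, noise_mean, noise_var,
                                   num_particles=200, drift_var=0.01,
                                   rng=np.random.default_rng(0)):
        """Track clean log-Mel features frame by frame under y = x + log(1 + exp(n - x)).

        noisy_feats: (num_frames, dim) observed noisy features
        noise_mean, noise_var: (dim,) fixed Gaussian noise statistics
        Returns an MMSE estimate of the clean features, same shape as noisy_feats.
        """
        num_frames, dim = noisy_feats.shape
        clean_est = np.zeros_like(noisy_feats, dtype=float)
        # initialize particles around the first observation
        particles = noisy_feats[0] + rng.normal(0.0, 1.0, size=(num_particles, dim))
        for t in range(num_frames):
            # propagate with a simple random-walk state model
            particles += rng.normal(0.0, np.sqrt(drift_var), size=particles.shape)
            # predicted noisy observation for each particle
            pred = particles + np.log1p(np.exp(noise_mean - particles))
            # observation log-likelihood (diagonal Gaussian), normalized weights
            logw = -0.5 * np.sum((noisy_feats[t] - pred) ** 2 / noise_var, axis=1)
            w = np.exp(logw - logw.max())
            w /= w.sum()
            clean_est[t] = w @ particles          # MMSE estimate of the clean feature
            # multinomial resampling
            idx = rng.choice(num_particles, size=num_particles, p=w)
            particles = particles[idx]
        return clean_est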
7

Graphical Models for Robust Speech Recognition in Adverse Environments

Rennie, Steven J. 01 August 2008 (has links)
Robust speech recognition in acoustic environments that contain multiple speech sources and/or complex non-stationary noise is a difficult problem, but one of great practical interest. The formalism of probabilistic graphical models constitutes a relatively new and very powerful tool for better understanding and extending existing models, learning, and inference algorithms, and a bedrock for the creative, quasi-systematic development of new ones. In this thesis a collection of new graphical models and inference algorithms for robust speech recognition are presented. The problem of speech separation using multiple microphones is treated first. A family of variational algorithms for tractably combining multiple acoustic models of speech with observed sensor likelihoods is presented. The algorithms recover high-quality estimates of the speech sources even when there are more sources than microphones, and have improved upon the state of the art in terms of SNR gain by over 10 dB. Next, the problem of background compensation in non-stationary acoustic environments is treated. A new dynamic noise adaptation (DNA) algorithm for robust noise compensation is presented, and shown to outperform several existing state-of-the-art front-end denoising systems on the new DNA + Aurora II and Aurora II-M extensions of the Aurora II task. Finally, the problem of recognizing speech in the presence of competing speech using a single microphone is treated. The Iroquois system for multi-talker speech separation and recognition is presented. The system won the 2006 Pascal International Speech Separation Challenge and, remarkably, achieved super-human recognition performance on a majority of test cases in the task. The result marks a significant first in automatic speech recognition, and a milestone in computing.
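One interaction model commonly used in such multi-talker graphical models is the max approximation, in which each log-spectral band of the mixture is taken to be the maximum of the two sources. The sketch below computes the frame likelihood under that model for two single-Gaussian sources; it is a simplified illustration under stated assumptions, not a description of the thesis's Iroquois system.

    import numpy as np
    from scipy.stats import norm

    def max_model_loglik(y, mu1, var1, mu2, var2):
        """Log-likelihood of a mixed log-spectrum frame y under y = max(x1, x2),
        with independent Gaussian sources per frequency band (illustrative sketch).
        """
        s1, s2 = np.sqrt(var1), np.sqrt(var2)
        # p(y) = N(y; mu1, var1) * Phi((y - mu2)/s2) + N(y; mu2, var2) * Phi((y - mu1)/s1)
        p = (norm.pdf(y, mu1, s1) * norm.cdf((y - mu2) / s2)
             + norm.pdf(y, mu2, s2) * norm.cdf((y - mu1) / s1))
        return np.sum(np.log(np.maximum(p, 1e-300)))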
8

Spectro-Temporal Features For Robust Automatic Speech Recognition

Suryanarayana, Venkata K 01 1900 (has links)
The speech signal is inherently characterized by its variations in time, which are reflected as variations in frequency. These spectro-temporal changes are due to changes in the vocal tract, intonation, co-articulation and the successive articulation of different phonetic sounds. In this thesis we seek to improve speech recognition performance through better feature parameters based on a non-stationary model of speech. One effective means of modeling a general non-stationary signal is the AM-FM model, which can be extended to speech through a sub-band analysis that mimics auditory analysis. We explore new methods for estimating AM and FM parameters based on non-uniform samples of the signal. The non-uniform sample approach, along with adaptive window estimation, provides an important advantage because of its multi-resolution analysis. We develop several new methods based on zero-crossing (ZC) intervals, local extrema intervals and the signal derivative at ZCs as different sample measures of the signal, and explore their effectiveness for instantaneous frequency (IF) and instantaneous envelope (IE) estimation. For automatic speech recognition, we explore the use of auditory-motivated spectro-temporal information through an auditory filter bank; signal parameters (features) are derived from the instantaneous energy in each band using the non-linear energy operator over a larger window length. The temporal correlation present in the signal is exploited by applying a DCT and keeping the lower few coefficients to capture the trend of the energy in each band. The DCT coefficients from different frequency bands are concatenated, and further spectral decorrelation is achieved through a KLT (Karhunen-Loeve Transform) of the concatenated feature vector. Changes in the vocal tract are well captured by changes in the formant structure, and to emphasize these details for ASR we define a temporal formant using the AM-FM decomposition of sub-band speech. Uniform wideband non-overlapping filters are used for the sub-band decomposition. The temporal evolution of a formant is represented by the lower-order DCT coefficients of the temporal formant in each band, and its use for ASR is explored. To address the robustness of ASR performance in noisy environmental conditions, we use a hybrid approach that enhances the speech signal using statistical models of the speech and noise. The use of GMMs for statistical speech enhancement has been shown to be effective, and we find that spectro-temporal features derived from enhanced speech provide further improvement to ASR performance.
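The non-linear (Teager) energy operator and the DCT-based temporal trend described above can be sketched roughly as follows for a single auditory band; the frame length, hop size, number of coefficients, and function names are illustrative assumptions rather than the thesis's exact configuration.

    import numpy as np
    from scipy.fftpack import dct

    def teager_energy(x):
        """Discrete Teager energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1).
        Clamped at zero so the log of the framed energy below stays defined."""
        e = x[1:-1] ** 2 - x[:-2] * x[2:]
        return np.maximum(e, 0.0)

    def band_temporal_dct(band_signal, frame_len=400, hop=160, num_coeffs=3):
        """Log-energy trajectory of one auditory band, compressed by keeping
        the first few DCT coefficients as a temporal-trend feature (sketch)."""
        energy = teager_energy(band_signal)
        frames = [np.log(np.sum(energy[i:i + frame_len]) + 1e-10)
                  for i in range(0, len(energy) - frame_len, hop)]
        return dct(np.asarray(frames), norm='ortho')[:num_coeffs]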
9

CIAIR In-Car Speech Database Collected During Real-World Driving

ITAKURA, Fumitada, TAKEDA, Kazuya, YAMAGUCHI, Yukiko, MATSUBARA, Shigeki, KAWAGUCHI, Nobuo, 板倉, 文忠, 武田, 一哉, 山口, 由紀子, 松原, 茂樹, 河口, 信夫 18 December 2003 (has links)
IPSJ SIG Technical Report, SLP (Spoken Language Processing); 2003-SLP-49-24, 5th Spoken Language Symposium
