Global ETD Search

1	Noisy Speech Recognition Based on Integration/Selection of Multiple Noise Suppression Methods Using Noise GMMs NAKAGAWA, Seiichi, HAMAGUCHI, Souta, KITAOKA, Norihide 01 March 2008 No description available. CENSREC-1 noise suppression method selection noisy speech recognition
2	AURORA-2J: An Evaluation Framework for Japanese Noisy Speech Recognition ENDO, Toshiki, FUJIMOTO, Masakiyo, MIYAJIMA, Chiyomi, MIZUMACHI, Mitsunori, SASOU, Akira, NISHIURA, Takanobu, KITAOKA, Norihide, KUROIWA, Shingo, YAMADA, Takeshi, YAMAMOTO, Kazumasa, TAKEDA, Kazuya, NAKAMURA, Satoshi 01 March 2005 No description available. evaluation categories performance differences over speakers evaluation platform noisy speech recognition
3	CENSREC-3: An Evaluation Framework for Japanese Speech Recognition in Real Car-Driving Environments NAKAMURA, Satoshi, TAKEDA, Kazuya, FUJIMOTO, Masakiyo 01 November 2006 No description available. CENSREC-3 in-car speech database common evaluation framework noisy speech recognition
4	Nonstationary Techniques For Signal Enhancement With Applications To Speech, ECG, And Nonuniformly-Sampled Signals Sreenivasa Murthy, A January 2012 For time-varying signals such as speech and audio, short-time analysis is necessary to compute specific signal attributes and to track their evolution. The standard technique is the short-time Fourier transform (STFT), which decomposes a signal in terms of windowed Fourier bases. An advancement over the STFT is wavelet analysis, in which a function is represented in terms of shifted and dilated versions of a localized function called the wavelet. A specific modeling approach, particularly in the context of speech, is based on short-time linear prediction or short-time Wiener filtering of noisy speech. In most nonstationary signal processing formalisms, the key idea is to analyze the properties of the signal locally, either by first truncating the signal and then performing a basis expansion (as in the case of the STFT), or by choosing compactly supported basis functions (as in the case of wavelets). We retain the same motivation as these approaches, but use polynomials to model the signal on a short-time basis (“short-time polynomial representation”). To emphasize the local nature of the modeling, we refer to it as “local polynomial modeling (LPM).” We pursue two main threads of research in this thesis: (i) short-time approaches for speech enhancement; and (ii) LPM for enhancing smooth signals, with applications to ECG, noisy nonuniformly-sampled signals, and voiced/unvoiced segmentation in noisy speech. Improved iterative Wiener filtering for speech enhancement: A constrained iterative Wiener filter solution for speech enhancement was proposed by Hansen and Clements. Sreenivas and Kirnapure improved the performance of the technique by imposing codebook-based constraints in the process of parameter estimation. 
The key advantage is that the optimal parameter search space is confined to the codebook. These enhancement solutions assume stationary noise. However, in practical applications, noise is not stationary, and hence updating the noise statistics becomes necessary. We present a new approach to reliable noise estimation based on spectral subtraction. We first estimate the signal spectrum and perform signal subtraction to estimate the noise power spectral density. We then smooth the estimated noise spectrum to ensure reliability. The key contributions are: (i) adaptation of the technique to nonstationary noise; (ii) a new initialization procedure for faster convergence and higher accuracy; (iii) experimental determination of the optimal LP-parameter space; and (iv) objective criteria and speech recognition tests for performance comparison. Optimal local polynomial modeling and applications: We next address the problem of fitting a piecewise-polynomial model to a smooth signal corrupted by additive noise. Since the signal is smooth, it can be represented using low-order polynomial functions provided that they are locally adapted to the signal. We choose the mean-square error as the criterion of optimality. Since the model is local, it preserves the temporal structure of the signal and can also handle nonstationary noise. We show that there is a trade-off between the adaptability of the model to local signal variations and robustness to noise (the bias-variance trade-off), which we solve using a stochastic optimization technique known as the intersection of confidence intervals (ICI) technique. The key trade-off parameter is the duration of the window over which the optimum LPM is computed. Within the LPM framework, we address three problems: (i) signal reconstruction from noisy uniform samples; (ii) signal reconstruction from noisy nonuniform samples; and (iii) classification of speech signals into voiced and unvoiced segments. 
The generic signal model is $x(t_n) = s(t_n) + d(t_n)$, $0 \le n \le N-1$. In problems (i) and (iii) above, $t_n = nT$ (uniform sampling); in (ii) the samples are taken at nonuniform instants. The signal $s(t)$ is assumed to be smooth; i.e., it should admit a local polynomial representation. The problem in (i) and (ii) is to estimate $s(t)$ from $x(t_n)$; i.e., we are interested in optimal signal reconstruction on a continuous domain starting from uniform or nonuniform samples. We show that, in both cases, the bias, the variance, and hence the mean square error (MSE) are determined by three quantities: $L$, the length of the window over which the polynomial fitting is performed; $f$, a function of $s(t)$ that typically comprises the higher-order derivatives of $s(t)$, the order itself depending on the order of the polynomial; and $g$, a function of the noise variance. The bias and variance have complementary characteristics with respect to $L$. Directly minimizing the MSE would give a value of $L$ that involves the functions $f$ and $g$. The function $g$ may be estimated, but $f$ is not known since $s(t)$ is unknown. Hence, it is not practical to compute the minimum MSE (MMSE) solution. Therefore, we obtain an approximate result by solving the bias-variance trade-off in a probabilistic sense using the ICI technique. We also propose a new approach to optimally select the ICI technique parameters, based on a new cost function that is the sum of the probability of false alarm and the area covered over the confidence interval. In addition, we address issues related to optimal model-order selection, the search space for window lengths, the accuracy of noise estimation, etc. The next issue addressed is voiced/unvoiced segmentation of the speech signal. Speech segments show different spectral and temporal characteristics depending on whether the segment is voiced or unvoiced. Most speech processing techniques process the two kinds of segments differently. 
The challenge lies in making detection techniques robust in the presence of noise. We propose a new technique for voiced/unvoiced classification that exploits the fact that voiced segments have a certain degree of regularity, whereas unvoiced segments do not possess any smoothness. To capture the regularity in voiced regions, we employ the LPM. The key idea is that regions where the LPM is inaccurate are more likely to be unvoiced than voiced. Within this framework, we formulate a hypothesis testing problem based on the accuracy of the LPM fit and devise a test statistic for performing V/UV classification. Since the technique is based on LPM, it is capable of adapting to nonstationary noise. We present Monte Carlo results to demonstrate the accuracy of the proposed technique. Signal Processing Local Polynomial Modeling Iterative Wiener Filtering Speech Enhancement Speech Signals ECG Signals Time-varying Signals Signal Enhancement Speech Processing Noisy Non-Stationary Signals Noisy Speech Local Polynomial Model (LPM) Communication Engineering
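The window-length selection described in the abstract of entry 4 can be illustrated with a minimal sketch of the intersection of confidence intervals (ICI) rule. It makes several assumptions that do not come from the thesis: the local model is restricted to a straight line (order 1), the noise standard deviation `sigma` is taken as known, the threshold `gamma` is a fixed hand-picked value rather than one chosen by the cost function the abstract mentions, and the function names are hypothetical. Because the fit uses the sample instants directly, the same code applies to uniform and nonuniform sampling.

```python
import math


def local_linear_fit(t, x, t0, half_width, sigma):
    """Least-squares line through the samples within half_width of t0.

    Returns (estimate of s(t0), standard deviation of that estimate),
    or None when the window holds too few distinct samples.
    """
    pts = [(ti - t0, xi) for ti, xi in zip(t, x) if abs(ti - t0) <= half_width]
    n = len(pts)
    s1 = sum(u for u, _ in pts)
    s2 = sum(u * u for u, _ in pts)
    det = n * s2 - s1 * s1
    if n < 2 or det <= 0.0:
        return None
    # The fitted intercept is a weighted sum of the samples: sum(w_i * x_i).
    weights = [(s2 - s1 * u) / det for u, _ in pts]
    estimate = sum(w * xi for w, (_, xi) in zip(weights, pts))
    # For white noise of standard deviation sigma (assumed known here).
    std = sigma * math.sqrt(sum(w * w for w in weights))
    return estimate, std


def ici_select(t, x, t0, half_widths, sigma=1.0, gamma=2.0):
    """ICI rule: keep widening the window while all confidence intervals
    [estimate - gamma*std, estimate + gamma*std] still share a common point.

    half_widths must be in increasing order. Returns (half_width, estimate)
    for the largest accepted window, or None if no window was usable.
    """
    lower, upper = -math.inf, math.inf
    best = None
    for h in half_widths:
        fit = local_linear_fit(t, x, t0, h, sigma)
        if fit is None:
            continue
        estimate, std = fit
        lower = max(lower, estimate - gamma * std)
        upper = min(upper, estimate + gamma * std)
        if lower > upper:  # intervals no longer intersect: bias dominates
            break
        best = (h, estimate)
    return best
```

On a signal that is exactly linear the intervals always overlap, so the rule accepts the largest window; near a discontinuity it stops at the last window that does not straddle the jump, which is the bias-variance behaviour the abstract describes.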
5	Pitch tracking and speech enhancement in noisy and reverberant environments Wu, Mingyang 07 November 2003 No description available. Artificial Intelligence channel selection correlogram hidden Markov model (HMM) multipitch tracking noisy speech pitch detection pitch strength reverberant speech inverse filtering signal-to-reverberant energy ratio (SRR) reverberation reverberation time
6	Advances in deep learning methods for speech recognition and understanding Serdyuk, Dmitriy 10 1900 This work presents several studies in the areas of speech recognition and understanding. 
The semantic understanding of speech is an important sub-domain of the broader field of artificial intelligence. Speech processing has long interested researchers because language is one of the defining characteristics of a human being. With the development of neural networks, the domain has seen rapid progress both in terms of accuracy and human perception. Another important milestone was achieved with the development of end-to-end approaches. Such approaches allow co-adaptation of all parts of the model, which improves performance and simplifies the training procedure. End-to-end models became feasible with the increasing amount of available data and computational resources and, most importantly, with many novel architectural developments. Nevertheless, traditional, non-end-to-end approaches are still relevant for speech processing because of challenging data in noisy environments, accented speech, and the wide variety of dialects. In the first work, we explore hybrid speech recognition in noisy environments. We propose to treat recognition under unseen noise conditions as a domain adaptation task. For this, we use adversarial domain adaptation, a technique that was novel at the time. In a nutshell, this prior work proposed to train features in such a way that they are discriminative for the primary task but non-discriminative for the secondary task. This secondary task is constructed to be the domain recognition task. Thus, the trained features are invariant to the domain at hand. In our work, we adopt this technique and modify it for the task of noisy speech recognition. In the second work, we develop a general method for regularizing generative recurrent networks. It is known that recurrent networks frequently have difficulty staying on the same track when generating long outputs. 
While it is possible to use bi-directional networks for better sequence aggregation in feature learning, this is not applicable to the generative case. We developed a way to improve the consistency of long sequences generated with recurrent networks. We propose a way to construct a model similar to a bi-directional network. The key insight is to use a soft L2 loss between the forward and the backward generative recurrent networks. We provide an experimental evaluation on a multitude of tasks and datasets, including speech recognition, image captioning, and language modeling. In the third paper, we investigate the possibility of developing an end-to-end intent recognizer for spoken language understanding. Semantic spoken language understanding is an important step towards developing human-like artificial intelligence. We have seen that end-to-end approaches show high performance on tasks including machine translation and speech recognition. We draw inspiration from prior work to develop an end-to-end system for intent recognition. Deep learning Machine learning Speech recognition Neural networks Domain adaptation Noisy speech recognition Adversarial learning Recurrent neural networks Sequence generation Spoken language understanding End-to-end learning
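The L2 loss between forward and backward generative recurrent networks mentioned in the abstract of entry 6 can be sketched as follows. This is only an illustration under stated assumptions: the networks are reduced to hypothetical scalar tanh RNNs, states are matched by plain time index, and the function names are invented here. The abstract does not state how the thesis aligns or transforms the states, so none of this should be read as the author's formulation.

```python
import math


def run_rnn(xs, w_x, w_h):
    """Scalar tanh RNN (illustrative stand-in for a real recurrent network).

    Returns the hidden state after each input, in reading order.
    """
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def forward_backward_l2_penalty(xs, fwd_params, bwd_params):
    """Mean squared distance between forward and backward hidden states.

    fwd_params and bwd_params are (w_x, w_h) pairs. The backward network
    reads the sequence right to left; its states are reversed afterwards so
    that index t refers to the same position in both passes (an assumed
    alignment).
    """
    if not xs:
        return 0.0
    fwd = run_rnn(xs, *fwd_params)
    bwd = run_rnn(xs[::-1], *bwd_params)[::-1]
    return sum((f - b) ** 2 for f, b in zip(fwd, bwd)) / len(xs)
```

In training, a penalty of this kind would typically be added with a small weight to the two networks' own generation losses, so that the forward network is nudged towards states that agree with a network that has already seen the rest of the sequence.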
