341

Redução de ruído em sinais de voz usando curvas especializadas de modificação dos coeficientes da transformada em co-seno. / Speech denoising by softsoft thresholding.

Antunes Júnior, Irineu 24 April 2006 (has links)
Many noise-reduction methods are based on the possibility of representing the clean signal with a reduced number of coefficients of a block transform, so that cancelling coefficients below a suitably chosen threshold produces an enhanced reconstructed signal. It is necessary to assume that the clean signal has a sparse representation, while the noise energy is spread over all coefficients. The main drawback of these methods, when applied to speech, is the distortion introduced by eliminating small-magnitude coefficients, together with artifacts ("musical noise") produced by isolated noisy coefficients that randomly cross the threshold. Based on the observation that, for the transforms usually employed, the histogram of speech coefficients has many perceptually important coefficients close to the origin, we propose a custom thresholding function designed specifically for reducing additive white Gaussian noise (AWGN) in speech signals. This function, called SoftSoft, has two threshold levels: a lower level adjusted to reduce speech distortion, and a higher level adjusted to remove noise. The jointly optimal threshold values are determined by minimizing an estimate of the mean square error (MSE): directly, when the clean signal is assumed known, or indirectly, via an interpolation function for the MSE, which leads to a practical method. The SoftSoft function achieves a lower MSE than the well-known Soft and Hard thresholding operations, which employ only the higher threshold. Although the improvement in terms of MSE is not very expressive, the gain in perceptual quality was confirmed both by a listener and by a perceptual distortion measure (the log-spectral distance).
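The two-threshold rule described above can be sketched as a piecewise shrinkage curve applied to transform coefficients. The exact shape between the two thresholds is an assumption here (a linear ramp joining zero at the lower threshold to the identity at the higher one); the abstract specifies only the role of each level:

```python
import numpy as np

def softsoft(c, t_low, t_high):
    """Hypothetical SoftSoft curve: zero out coefficients below t_low,
    shrink linearly between t_low and t_high, pass through above t_high."""
    a = np.abs(c)
    ramp = np.sign(c) * (a - t_low) * t_high / (t_high - t_low)
    return np.where(a < t_low, 0.0, np.where(a < t_high, ramp, c))
```

The curve is continuous at both thresholds, avoiding the abrupt jump of Hard thresholding that contributes to musical noise.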
342

Determinadores de pitch [Pitch determination algorithms] / not available

Razera, Daniel Espanhol 05 May 2004 (has links)
Several studies in digital speech processing validate the use of acoustic parameters of the voice in diagnostic and therapeutic processes. Perturbation parameters require knowledge of all the periods of the analyzed stretch of voice signal before their values can be calculated. This task is carried out by pitch determination algorithms (pitch trackers), and their precision determines the reliability of the evaluated parameters. The purpose of this work is to study several methods proposed over the years and to establish which algorithm has the best precision and robustness when used with pathological voices. Pitch estimation algorithms are also studied as an aid for the correction and adjustment of the pitch trackers. The results demonstrate the need for modifications, both external and internal to the tracker and estimator algorithms, to reach the desired robustness and precision. Two determination algorithms, based on autocorrelation and on harmonic extraction, met the established goals and are confirmed as the most promising for obtaining acoustic parameters of the voice.
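A minimal sketch of an autocorrelation pitch determinator of the kind compared above, operating on a single voiced frame; the default lag search range (60–400 Hz) is an assumption, and a real tracker would add a voicing decision and period-by-period refinement:

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch of one voiced frame via the autocorrelation peak inside the
    lag range corresponding to [fmin, fmax] Hz."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag
```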
343

Évaluation expérimentale d'un système statistique de synthèse de la parole, HTS, pour la langue française / Experimental evaluation of a statistical speech synthesis system, HTS, for French

Le Maguer, Sébastien 05 July 2013 (has links)
The work presented in this thesis concerns text-to-speech (TTS) synthesis and, more particularly, statistical parametric speech synthesis for French. We analyze the impact of the linguistic contextual factors used to characterize a speech signal on the modeling performed by the HTS statistical speech synthesis system. To conduct the experiments, two objective evaluation protocols are proposed. The first uses Gaussian mixture models (GMMs) to represent the acoustic space produced by HTS for a given contextual feature set; using a constant reference set of natural speech stimuli, the GMMs, and hence the acoustic spaces generated by different HTS configurations, can be compared with one another. The second protocol is based on pairwise distances between time-aligned acoustic frames of natural speech and synthetic speech generated by HTS, allowing a more local evaluation of the modeling and finer control over the data sets generated and evaluated. Results obtained with both protocols, and confirmed by subjective evaluations, show that using a large set of contextual factors does not necessarily improve the modeling and can even be counter-productive for the quality of the synthesized speech.
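The second evaluation protocol, pairwise distances between time-aligned frames, can be sketched as a per-utterance average; the plain Euclidean metric here is an assumption (a cepstral or log-spectral distance could be substituted):

```python
import numpy as np

def paired_frame_distance(ref_frames, syn_frames):
    """Average distance between time-aligned acoustic frames (one row per
    frame) of a natural utterance and its synthetic counterpart."""
    ref_frames, syn_frames = np.asarray(ref_frames), np.asarray(syn_frames)
    return float(np.mean(np.linalg.norm(ref_frames - syn_frames, axis=1)))
```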
345

Real-time adaptive noise cancellation for automatic speech recognition in a car environment : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at Massey University, School of Engineering and Advanced Technology, Auckland, New Zealand

Qi, Ziming January 2008 (has links)
This research is mainly concerned with a robust method for improving real-time speech enhancement and noise cancellation for Automatic Speech Recognition (ASR) in a car environment. The thesis therefore presents a beamforming technique combined with ASR, addressing the question: how can the driver's voice control the car through ASR? The proposed solution is a hybrid system combining an acoustic beamformer, a Voice Activity Detector (VAD), and an adaptive Wiener filter. The beamforming approach is based on the normalized least-mean-squares (NLMS) algorithm and improves the Signal-to-Noise Ratio (SNR). The microphone array is combined with a VAD that uses time-delay estimation together with the magnitude-squared coherence (MSC). An experiment clearly shows the ability of the composite system to reduce noise originating outside a defined active zone. In a real car environment, a speech recognition system has to receive the driver's voice only, while suppressing background noise such as speech from the radio. The proposed hybrid real-time adaptive filter therefore operates within a geometric zone defined around the head of the desired speaker; any sound from outside this zone is considered noise and suppressed. Because the zone is small, it is assumed that only the driver's speech comes from within it. The technique uses three microphones to build a geometry-based VAD that cancels unwanted speech coming from outside the zone. When unwanted speech alone arrives from outside the desired zone, it is muted at the output of the hybrid noise canceller. When unwanted and desired speech arrive at the same time, the VAD cannot tell them apart, and the adaptive Wiener filter is switched on for noise reduction, improving the SNR by as much as 28 dB. To assess the quality of the Wiener-filtered signal, a template-matching speech recognition system was designed for testing; a commercial speech recognition system was also used to evaluate the proposed beamforming-based noise canceller and adaptive Wiener filter.
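The NLMS adaptation at the core of the beamformer can be sketched for a single adaptive filter; the tap count and step size below are illustrative, and a real beamformer would run one such filter per microphone pair:

```python
import numpy as np

def nlms_filter(x, d, taps=8, mu=0.5, eps=1e-8):
    """Normalized LMS: adapt weights w so that w @ (recent x samples)
    tracks the desired signal d; returns the error (residual) signal."""
    w = np.zeros(taps)
    e = np.zeros(len(x))
    for n in range(taps, len(x)):
        xw = x[n - taps:n][::-1]          # most recent samples first
        y = w @ xw                        # filter output
        e[n] = d[n] - y                   # estimation error
        w = w + mu * e[n] * xw / (eps + xw @ xw)  # normalized update
    return e
```

With a stationary relationship between the reference and desired channels, the residual decays toward zero as the weights converge, which is what lets the canceller suppress correlated interference.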
346

Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System

Jeon, Woojay 13 November 2006 (has links)
It is well known that machines perform far worse than humans in recognizing speech and audio, especially in noisy environments. One method of addressing this issue of robustness is to study physiological models of the human auditory system and to adopt some of their characteristics in computers. As a first step in studying the potential benefits of an elaborate computational model of the primary auditory cortex (A1) in the central auditory system, we qualitatively and quantitatively validate the model under existing speech processing and recognition methodology. Next, we develop new insights and ideas on how to interpret the model, and reveal some of the advantages of its dimension expansion that may potentially be used to improve existing speech processing and recognition methods. This is done by statistically analyzing the neural responses to various classes of speech signals and forming empirical conjectures on how cognitive information is encoded in a category-dependent manner. We also establish a theoretical framework that shows how noise and signal can be separated in the dimension-expanded cortical space. Finally, we develop new feature selection and pattern recognition methods to exploit the category-dependent encoding of noise-robust cognitive information in the cortical response. Category-dependent features are proposed as features that "specialize" in discriminating specific sets of classes, and as a natural way of incorporating them into a Bayesian decision framework, we propose methods to construct hierarchical classifiers that perform decisions in a two-stage process. Phoneme classification tasks using the TIMIT speech database are performed to quantitatively validate all developments in this work, and the results encourage future work in exploiting high-dimensional data with category- (or class-) dependent features for improved classification or detection.
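A two-stage hierarchical decision with category-dependent features can be sketched as follows; the diagonal-Gaussian class models and the dictionary layout are illustrative assumptions, not the thesis's actual models:

```python
import numpy as np

def loglik(x, mean, var):
    # Diagonal-Gaussian log-likelihood over a selected subset of feature dims.
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def two_stage_classify(x, hierarchy):
    # Stage 1: pick the broad category using its own feature subset;
    # stage 2: pick the class inside that category, again with
    # category-dependent feature dimensions.
    cat = max(hierarchy, key=lambda c: loglik(x[c["dims"]], c["mean"], c["var"]))
    cls = max(cat["classes"], key=lambda k: loglik(x[k["dims"]], k["mean"], k["var"]))
    return cls["name"]
```

The point of the hierarchy is that each stage only needs features that discriminate within its own set of candidates, rather than one feature set that separates all classes at once.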
347

Physiologically Motivated Methods For Audio Pattern Classification

Ravindran, Sourabh 20 November 2006 (has links)
Human-like performance by machines in tasks of speech and audio processing has remained an elusive goal. In an attempt to bridge the gap in performance between humans and machines there has been an increased effort to study and model physiological processes. However, the widespread use of biologically inspired features proposed in the past has been hampered mainly by either the lack of robustness across a range of signal-to-noise ratios or the formidable computational costs. In physiological systems, sensor processing occurs in several stages. It is likely the case that signal features and biological processing techniques evolved together and are complementary or well matched. It is precisely for this reason that modeling the feature extraction processes should go hand in hand with modeling of the processes that use these features. This research presents a front-end feature extraction method for audio signals inspired by the human peripheral auditory system. New developments in the field of machine learning are leveraged to build classifiers to maximize the performance gains afforded by these features. The structure of the classification system is similar to what might be expected in physiological processing. Further, the feature extraction and classification algorithms can be efficiently implemented using the low-power cooperative analog-digital signal processing platform. The usefulness of the features is demonstrated for tasks of audio classification, speech versus non-speech discrimination, and speech recognition. The low-power nature of the classification system makes it ideal for use in applications such as hearing aids, hand-held devices, and surveillance through acoustic scene monitoring.
348

Investigation Of The Significance Of Periodicity Information In Speaker Identification

Gursoy, Secil 01 April 2008 (has links) (PDF)
In this thesis, general feature selection methods, and especially the use of periodicity and aperiodicity information in the speaker verification task, are investigated. A software system is constructed to obtain periodicity and aperiodicity information from speech. This information is obtained by passing the signal through a 16-channel filterbank and analyzing the channel outputs frame by frame according to the pitch of each frame. The pitch value of a frame is itself found using periodicity algorithms. A Parzen window (kernel density estimation) is used to represent each person's selected phoneme. The constructed method is tested on different phonemes in order to assess its usability across phonemes. Periodicity features are also combined with MFCC features to find out their contribution to the speaker identification problem.
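The Parzen-window (kernel density estimation) step can be sketched as follows, here with a Gaussian kernel over one-dimensional feature samples; the kernel choice and the bandwidth h are assumptions:

```python
import numpy as np

def parzen_density(x, samples, h):
    """Kernel density estimate at point x: average of Gaussian kernels of
    width h centered on each training sample."""
    k = np.exp(-0.5 * ((x - samples) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return float(np.mean(k))
```

At verification time, each speaker model is just such a density over that speaker's feature samples, and a test frame is scored by the density it attains under each model.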
349

Estimation of glottal source features from the spectral envelope of the acoustic speech signal

Torres, Juan Félix 17 May 2010 (has links)
Speech communication encompasses diverse types of information, including phonetics, affective state, voice quality, and speaker identity. From a speech production standpoint, the acoustic speech signal can be mainly divided into glottal source and vocal tract components, which play distinct roles in rendering the various types of information it contains. Most deployed speech analysis systems, however, do not explicitly represent these two components as distinct entities, as their joint estimation from the acoustic speech signal becomes an ill-defined blind deconvolution problem. Nevertheless, because of the desire to understand glottal behavior and how it relates to perceived voice quality, there has been continued interest in explicitly estimating the glottal component of the speech signal. To this end, several inverse filtering (IF) algorithms have been proposed, but they are unreliable in practice because of the blind formulation of the separation problem. In an effort to develop a method that can bypass the challenging IF process, this thesis proposes a new glottal source information extraction method that relies on supervised machine learning to transform smoothed spectral representations of speech, which are already used in some of the most widely deployed and successful speech analysis applications, into a set of glottal source features. A transformation method based on Gaussian mixture regression (GMR) is presented and compared to current IF methods in terms of feature similarity, reliability, and speaker discrimination capability on a large speech corpus, and potential representations of the spectral envelope of speech are investigated for their ability to represent glottal source variation in a predictable manner. The proposed system was found to produce glottal source features that reasonably matched their IF counterparts in many cases, while being less susceptible to spurious errors.
The development of the proposed method entailed a study into the aspects of glottal source information that are already contained within the spectral features commonly used in speech analysis, yielding an objective assessment regarding the expected advantages of explicitly using glottal information extracted from the speech signal via currently available IF methods, versus the alternative of relying on the glottal source information that is implicitly contained in spectral envelope representations.
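Gaussian mixture regression predicts the conditional mean of the glottal features given a spectral feature vector, as a responsibility-weighted sum of per-component linear regressions. A minimal sketch, assuming the mixture parameters have already been trained (e.g. by EM on joint spectral-glottal vectors):

```python
import numpy as np

def gmr_predict(x, weights, means_x, means_y, cov_xx, cov_yx):
    """E[y | x] under a joint Gaussian mixture: responsibilities from the
    marginal p(x), then a responsibility-weighted sum of linear predictions."""
    logr = []
    for w, mx, cxx in zip(weights, means_x, cov_xx):
        d = x - mx
        quad = d @ np.linalg.solve(cxx, d)          # Mahalanobis term
        logdet = np.linalg.slogdet(cxx)[1]
        logr.append(np.log(w) - 0.5 * (quad + logdet))
    resp = np.exp(np.array(logr) - max(logr))       # stable softmax
    resp /= resp.sum()
    y = 0.0
    for r, mx, my, cxx, cyx in zip(resp, means_x, means_y, cov_xx, cov_yx):
        y = y + r * (my + cyx @ np.linalg.solve(cxx, x - mx))
    return y
```

With a single component this reduces to ordinary linear-Gaussian regression; multiple components let the mapping vary across regions of the spectral space.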
350

Probabilistic space maps for speech with applications

Kalgaonkar, Kaustubh 22 August 2011 (has links)
The objective of the proposed research is to develop a probabilistic model of speech production that exploits the multiplicity of mapping between the vocal tract area functions (VTAF) and speech spectra. Two thrusts are developed. In the first, a latent variable model that captures uncertainty in estimating the VTAF from speech data is investigated. The latent variable model uses this uncertainty to generate many-to-one mapping between observations of the VTAF and speech spectra. The second uses the probabilistic model of speech production to improve the performance of traditional speech algorithms, such as enhancement, acoustic model adaptation, etc. In this thesis, we propose to model the process of speech production with a probability map. This proposed model treats speech production as a probabilistic process with many-to-one mapping between VTAF and speech spectra. The thesis not only outlines a statistical framework to generate and train these probabilistic models from speech, but also demonstrates its power and flexibility with such applications as enhancing speech from both perceptual and recognition perspectives.
