11 |
Rozpoznání emočního stavu člověka z řeči / Automatic vocal-oriented recognition of human emotions Houdek, Miroslav January 2009 (has links)
This master's thesis deals with the recognition of emotional states and gender based on speech signal analysis. We used various prosodic and cepstral features to describe the speech signal. The text also describes non-invasive methods for estimating glottal pulses. The described speech features were implemented in MATLAB. For classification we used a GMM classifier, which models the feature space with Gaussian mixture probability distributions. Furthermore, we constructed a system for recognizing the emotional state of the speaker and a system for gender recognition from speech. We tested the performance of the created systems with several feature sets on speech signal segments of various lengths and compared the results. In the last part we tested the influence of the speaker and of gender on the accuracy of emotional state recognition.
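As a minimal sketch of the GMM-based classification described above (one Gaussian mixture model per emotion or gender class, decision by the highest average log-likelihood), the following Python code uses MFCCs as the cepstral features. The thesis itself was implemented in MATLAB; the feature choice, mixture size, and helper names here are illustrative assumptions, not the author's code.

```python
# Hedged sketch: GMM classifier over frame-level cepstral (MFCC) features.
# One GMM is fitted per class; an utterance is assigned to the class whose
# model gives the highest average log-likelihood. Feature set and parameters
# are assumptions, not those of the thesis.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Frame-level MFCCs for one utterance, shape (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_gmms(files_by_class, n_components=16):
    """Fit one diagonal-covariance GMM per class on pooled frame features."""
    models = {}
    for label, paths in files_by_class.items():
        X = np.vstack([mfcc_features(p) for p in paths])
        models[label] = GaussianMixture(n_components=n_components,
                                        covariance_type="diag",
                                        max_iter=200).fit(X)
    return models

def classify(models, path):
    """Return the class label whose GMM scores the utterance highest."""
    X = mfcc_features(path)
    return max(models, key=lambda label: models[label].score(X))
```

The same two-model setup (for example "male" / "female") would cover the gender recognition system; only the training labels change.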
|
12 |
Robust Single-Channel Speech Enhancement and Speaker Localization in Adverse Environments Mosayyebpour, Saeed 30 April 2014 (links)
In speech communication applications such as voice-controlled systems, hands-free mobile telephones, and hearing aids, the received signals are degraded by room reverberation and background noise. This degradation can reduce the perceived quality and intelligibility of the speech and decrease the performance of speech enhancement and source localization. These problems are difficult to solve due to the colored and nonstationary nature of speech signals and to properties of the Room Impulse Response (RIR) such as its long duration and non-minimum-phase character. In this dissertation, we focus on two topics: speech enhancement and speaker localization in noisy reverberant environments.
A two-stage speech enhancement method is presented to suppress both early and late reverberation in noisy speech using only one microphone. It is shown that this method works well even in highly reverberant rooms. Experiments under different acoustic conditions confirm that the proposed blind method is superior in reducing early and late reverberation effects and noise compared to other well-known single-microphone techniques in the literature.
Time Difference of Arrival (TDOA)-based methods usually provide the most accurate source localization in adverse conditions. The key issue for these methods is to estimate the TDOA accurately using the smallest possible number of microphones.
Two robust Time Delay Estimation (TDE) methods are proposed that use the information from only two microphones. The first is based on adaptive inverse filtering and provides superior performance even in highly reverberant and moderately noisy conditions; it also has a negligible rate of estimation failures, which makes it reliable in realistic environments. However, its computational complexity is high because of the inverse-filter estimation performed in the first stage for the first microphone, so it cannot be applied in time-varying environments or real-time applications. The second method addresses this problem by introducing two effective preprocessing stages for conventional Cross Correlation (CC)-based methods. Results obtained in different noisy reverberant conditions, including a real, time-varying environment, demonstrate that the proposed methods are superior to conventional TDE methods.
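The dissertation's own preprocessing stages are not detailed in the abstract; as context only, here is a minimal sketch of the conventional cross-correlation baseline with PHAT weighting (GCC-PHAT) that such two-microphone TDE methods typically build on. The function name and the regularization constant are assumptions.

```python
# Hedged sketch: conventional GCC-PHAT time delay estimation between two
# microphone signals. This is the cross-correlation baseline, not the
# dissertation's proposed method.
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Estimate the relative delay (in seconds) between x1 and x2."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:               # limit the search to physical delays
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

In practice max_tau is set from the microphone spacing (spacing divided by the speed of sound), so that spurious correlation peaks outside the physically possible range are ignored.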
|
14 |
How accuracy of estimated glottal flow waveforms affects spoofed speech detection performance Deivard, Johannes January 2020 (links)
In the domain of automatic speaker verification, one of the challenges is to keep malicious users out of the system. One way to do this is to create algorithms that detect spoofed speech. There are several types of spoofed speech and several ways to detect them, one of which is to look at the glottal flow waveform (GFW) of a speech signal. This waveform is usually estimated using glottal inverse filtering (GIF), since creating the ground-truth GFW requires special invasive equipment. To the author's knowledge, no research has investigated the correlation between GFW accuracy and spoofed speech detection (SSD) performance. This thesis investigates whether such a correlation exists. First, the performance of different GIF methods is evaluated; then simple SSD machine learning (ML) models are trained and evaluated based on their macro-averaged precision. The ML models use different datasets composed of parametrized GFWs estimated with the GIF methods from the previous step. The results from these tasks are then combined in order to spot any correlations. The evaluation showed that the different GIF methods produced GFWs of varying accuracy. The machine learning models also showed varying performance depending on which dataset was used. However, when the results were combined, no obvious correlation between GFW accuracy and SSD performance was detected. This suggests that the overall accuracy of a GFW is not a substantial factor in the performance of machine learning-based SSD algorithms.
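The abstract does not name the GIF methods that were compared, so the sketch below only illustrates the general idea behind glottal inverse filtering: estimate an all-pole vocal-tract filter, inverse-filter the speech with it, and integrate to cancel lip radiation. The LPC order, pre-emphasis coefficient, and use of librosa are assumptions; established methods such as IAIF add further iterations and refinements.

```python
# Hedged sketch of basic LPC-based glottal inverse filtering (GIF).
# Parameters are illustrative; this is not one of the evaluated methods.
import numpy as np
import librosa
from scipy.signal import lfilter

def estimate_gfw(frame, lpc_order=18):
    """Rough glottal flow waveform estimate for one voiced frame."""
    # Pre-emphasis roughly removes the glottal spectral tilt before
    # estimating the vocal tract.
    pre = lfilter([1.0, -0.99], [1.0], frame)
    a = librosa.lpc(pre, order=lpc_order)      # all-pole vocal-tract model
    residual = lfilter(a, [1.0], frame)        # inverse filtering by A(z)
    # Leaky integration undoes the differentiation due to lip radiation.
    gfw = lfilter([1.0], [1.0, -0.99], residual)
    return gfw - np.mean(gfw)
```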
|
15 |
Pitch tracking and speech enhancement in noisy and reverberant environments Wu, Mingyang 07 November 2003 (links)
No description available.
|
16 |
Analyse de la qualité vocale appliquée à la parole expressive / Voice quality analysis applied to expressive speech Sturmel, Nicolas 02 March 2011 (links)
Analysis of speech signals helps us understand how the voice is produced, but it also makes it possible to define new parameters that qualify and quantify the perception of voice quality. This study focuses on expressive speech, where voice quality varies widely and is explicitly linked to the expressivity or intention of the speaker. To describe these links, one must be able to estimate the parameters of the speech production model and to decompose the speech signal into the parts that contribute to this model. The work presented in this thesis therefore addresses the segmentation and decomposition of speech signals and the estimation of the voice production model parameters. First, multi-scale analysis of speech signals is studied. Using the LoMA method, which traces lines of maximum amplitude across scales in the time-domain response of a wavelet filter bank, several features of voiced speech can be detected: the glottal closing instants, the energy associated with each glottal cycle and its spectral distribution, and the open quotient of the glottal cycle (from the phase delay of the first harmonic). This method is then tested on both synthetic and real speech. Secondly, harmonic-plus-noise decomposition of speech signals is studied. An existing method (PAPD, Periodic/Aperiodic Decomposition) is adapted to variations of the fundamental frequency by dynamically adjusting the analysis window length; the resulting method is called PAP-A. It is then tested on a database of synthetic signals, with particular attention to its sensitivity to errors in the estimation of the fundamental frequency. Decompositions of real speech, along with the corresponding audio examples, are also discussed. The results show that PAP-A yields higher-quality decompositions than PAPD. Thirdly, the problem of source/filter deconvolution is addressed. Source/filter separation by ZZT (zeros of the Z-transform) is compared with the usual methods based on linear prediction. The ZZT is then used to estimate the parameters of the glottal source model with a simple but robust method that jointly estimates two glottal flow parameters: the open quotient and the asymmetry coefficient. This method is tested and combined with the wavelet-based estimation of the open quotient. Finally, the three estimation methods developed in this thesis are applied to a large number of files from a database covering different speaking styles. The results of this analysis are discussed in order to characterize the link between style, the values of the voice production parameters, and voice quality. In particular, clear groups of speaking styles emerge.
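As a rough illustration of the ZZT idea mentioned above (not the thesis's implementation), the sketch below computes the zeros of a windowed, GCI-synchronous speech frame and splits them at the unit circle: zeros outside the circle are associated with the anticausal glottal-source contribution and zeros inside with the causal vocal-tract contribution. The Blackman window and the two-period frame length are assumptions.

```python
# Hedged sketch of the ZZT (zeros of the Z-transform) split of a speech frame.
import numpy as np

def zzt_split(frame):
    """Zeros of the frame's Z-transform, split at the unit circle."""
    frame = frame * np.blackman(len(frame))
    frame = np.trim_zeros(frame)     # leading/trailing zeros would break np.roots
    zeros = np.roots(frame)          # polynomial coefficients are x[0], x[1], ...
    radii = np.abs(zeros)
    outside = zeros[radii > 1.0]     # anticausal part: glottal source
    inside = zeros[radii <= 1.0]     # causal part: vocal tract
    return outside, inside
```

The frame is expected to span roughly two pitch periods and to be centred on a glottal closure instant, which is what makes the split meaningful.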
|
17 |
Detekce nemocí pomocí analýzy hlasu / Voice Analysis for Detection of Diseases Chytil, Pavel January 2008 (links)
This dissertation focuses on the analysis of the speech signal for the purpose of detecting diseases that affect the structure of the vocal organs, in particular those that change the structural character of the vocal folds. An overview of current techniques is provided. The sources of the recordings used for healthy and pathological speakers are also described. The main aim of this dissertation is to describe a computational procedure for estimating the parameters of a voice-source model that enables the subsequent detection and classification of vocal-fold diseases. We provide a detailed description of the analysis of speech signals that can be derived from parametric vocal-fold models.
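The abstract does not specify which voice-source parameters are estimated. Purely as an illustration of the kind of cycle-level measures commonly used when screening for vocal-fold pathology, and not as the thesis's procedure, the sketch below computes local jitter and shimmer from per-cycle periods and peak amplitudes assumed to come from an upstream pitch or GCI tracker.

```python
# Hedged illustration only: classic period and amplitude perturbation measures.
import numpy as np

def local_jitter(periods):
    """Mean absolute period-to-period difference relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute cycle-to-cycle amplitude difference relative to the mean."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Example with hypothetical cycle data (periods in seconds, amplitudes linear):
print(local_jitter([0.0100, 0.0103, 0.0097, 0.0101]))
print(local_shimmer([0.82, 0.79, 0.85, 0.80]))
```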
|
18 |
The Aerodynamic, Glottographic, and Acoustic Effects of Clear Speech. Tahamtan, Mahdi 06 September 2022 (links)
No description available.
|