41

Speech Enhancement Utilizing Phase Continuity Between Consecutive Analysis Windows

Mehmetcik, Erdal 01 September 2011 (has links) (PDF)
It is commonly accepted that noise induced on the DFT phase spectrum has a negligible effect on speech intelligibility for short analysis windows, as early intelligibility studies pointed out. This finding is confirmed by recent intelligibility studies as well. Based on this phenomenon, classical speech enhancement algorithms do not modify the DFT phase spectrum and only make changes in the DFT magnitude spectrum. However, recent studies also indicate that these classical speech enhancement algorithms are not capable of improving the intelligibility scores of noise-degraded speech signals. In other words, the information contained in a noise-degraded signal cannot be increased by classical enhancement methods; instead, the ease of listening, i.e. the quality, can be improved. Hence, additional effort can be made to increase the amount of quality improvement by using both the DFT magnitude and the DFT phase. Therefore, if the performance of the classical methods is to be improved in terms of speech quality, the effect of the DFT phase on speech quality needs to be studied. In this work, the contribution of the DFT phase to speech quality is investigated through simulations using an objective quality assessment criterion. It is concluded from these simulations that the phase spectrum has a significant effect on speech quality for short analysis windows. Furthermore, the phase values of low-frequency components are found to have the largest contribution to this quality improvement. Motivated by these results, a new enhancement method is proposed that modifies the phase of certain low-frequency components as well as the magnitude spectrum. The proposed algorithm is implemented in the MATLAB environment. The results indicate that the proposed system improves the performance of the classical methods in terms of speech quality.
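A minimal sketch of the kind of simulation described in this abstract (not the author's actual code): the clean magnitude spectrum is combined with either the clean or the noisy phase before resynthesis, so that any difference measured by an objective quality criterion is attributable to phase alone. The 20 ms window length and the use of scipy are illustrative assumptions.

```python
# Hedged sketch: isolating the contribution of the DFT phase to speech quality.
import numpy as np
from scipy.signal import stft, istft

def split_spectrum(x, fs, win_ms=20):
    nperseg = int(fs * win_ms / 1000)              # short analysis window (assumed 20 ms)
    _, _, Z = stft(x, fs, nperseg=nperseg)
    return np.abs(Z), np.angle(Z), nperseg

def reconstruct(mag, phase, fs, nperseg):
    _, x = istft(mag * np.exp(1j * phase), fs, nperseg=nperseg)
    return x

def phase_contribution(clean, noisy, fs):
    mag_c, ph_c, n = split_spectrum(clean, fs)
    _, ph_n, _ = split_spectrum(noisy, fs)
    ref = reconstruct(mag_c, ph_c, fs, n)          # clean magnitude + clean phase
    hyp = reconstruct(mag_c, ph_n, fs, n)          # clean magnitude + noisy phase
    # Any objective quality measure can now compare ref against hyp.
    return ref, hyp
```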
42

Probabilistic space maps for speech with applications

Kalgaonkar, Kaustubh 22 August 2011 (has links)
The objective of the proposed research is to develop a probabilistic model of speech production that exploits the multiplicity of mapping between the vocal tract area functions (VTAF) and speech spectra. Two thrusts are developed. In the first, a latent variable model that captures uncertainty in estimating the VTAF from speech data is investigated. The latent variable model uses this uncertainty to generate many-to-one mapping between observations of the VTAF and speech spectra. The second uses the probabilistic model of speech production to improve the performance of traditional speech algorithms, such as enhancement, acoustic model adaptation, etc. In this thesis, we propose to model the process of speech production with a probability map. This proposed model treats speech production as a probabilistic process with many-to-one mapping between VTAF and speech spectra. The thesis not only outlines a statistical framework to generate and train these probabilistic models from speech, but also demonstrates its power and flexibility with such applications as enhancing speech from both perceptual and recognition perspectives.
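A toy illustration of what a discrete "probability map" between a vocal-tract parameter and a spectral feature could look like (not the thesis' actual parameterization): paired scalar observations are binned into a joint histogram, which is normalized per VTAF bin to give a conditional distribution over spectral bins. The one-dimensional features and bin count are assumptions made for illustration.

```python
# Hedged sketch of a histogram-based probability map p(spectral bin | VTAF bin).
import numpy as np

def probability_map(vtaf, spec, n_bins=32):
    """vtaf, spec: aligned 1-D arrays of paired observations (assumed scalar features)."""
    joint, v_edges, s_edges = np.histogram2d(vtaf, spec, bins=n_bins)
    cond = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
    return cond, v_edges, s_edges          # cond[i, :] = p(spec bin | vtaf bin i)

def sample_spectrum_bin(cond, v_edges, s_edges, vtaf_value, rng=np.random):
    i = np.clip(np.searchsorted(v_edges, vtaf_value) - 1, 0, cond.shape[0] - 1)
    row = cond[i]
    if row.sum() == 0:                      # no training pairs fell in this VTAF bin
        row = np.full(cond.shape[1], 1.0 / cond.shape[1])
    j = rng.choice(cond.shape[1], p=row)
    return 0.5 * (s_edges[j] + s_edges[j + 1])
```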
43

Nie-destruktiewe klankonttrekking, restourasie en spraakverheffing van Edison-fonograafsilinders / Non-destructive audio extraction, restoration and speech enhancement of Edison phonograph cylinders

Van der Westhuizen, Ewald 12 1900 (has links)
Thesis (MScEng)--University of Stellenbosch, 2003. / ENGLISH ABSTRACT: Two non-destructive methods of extracting audio from Edison phonographic cylinders were investigated. A recording device with high-accuracy positioning was designed and manufactured. A microscopic image method was investigated first. Surface images of the cylinder were obtained using a web camera. An audio signal was then extracted from the width modulation. Results were not satisfactory, as echoes caused by intergroove modulation were perceptible. The audio also lacked resolution. The true modulation of the audio is not embedded in the width, but in the depth of the groove. The second audio extraction method involved using a laser pick-up from a compact disc player to measure the depth of the groove. Three laser recording methods were investigated. The first was forward recording, which measured the depth modulation in the recording direction of the groove. The second method, backward recording, was identical to forward recording with the mechanical system moving in reverse. Four recordings from different positions in the groove were combined to create an audio signal. This combination of recordings showed a substantial improvement in the signal-to-noise ratio. A third recording method, transverse recording, which measured the whole depth profile of the groove, was also investigated. The groove profile was then processed into an audio signal. A manual audio restoration program was written to replace visible sections of distorted data with better interpolations. Two speech enhancement methods were investigated, the first being the most commonly used speech enhancement method for digital audio restoration, Short-Time Spectral Attenuation (STSA). The second method is based on linear predictive coefficient (LPC) estimation of short-time frames. The two methods were evaluated by means of listening tests. The LPC enhancement method was preferred because it enhanced the intelligibility of the speech. / AFRIKAANSE OPSOMMING: Twee nie-destruktiewe metodes om klank van Edison-fonograafsilinders te onttrek, is ondersoek. 'n Opneemtoestel, wat die silinders met baie hoë akkuraatheid posisioneer, is ontwerp en vervaardig. 'n Mikroskopiese beeldmetode is as eerste klankonttrekkingsmetode ondersoek. Mikroskopiese beelde is met 'n webkamera van die silinderoppervlak geneem. Klank is vanuit die wydtemodulasie sigbaar in die beelde onttrek. Resultate was nie bevredigend nie weens groefintermodulasie-eggo's en 'n tekort aan resolusie. Die ware modulasie van die klank is nie in die wydte van die groefie gegraveer nie, maar in die diepte. Die tweede klankonttrekkingsmetode gebruik 'n aangepaste lasersensor van 'n CD-speler om die dieptemodulasie van die groefie te meet. Drie laseropneemmetodes is ondersoek. Die eerste is voorwaartse opname, wat die dieptemodulasie in die opneemrigting van die groefie meet. 'n Tweede opneemmetode, truwaartse opname, is identies aan voorwaartse opname, behalwe dat die meganiese stelsel in trurat beweeg. Vier opnames vanuit verskillende posisies in die groefbreedte is gekombineer om 'n klanksein te vorm. Die kombinasie van vier opnames toon 'n beduidende verbetering op die sein-tot-ruis-verhouding. Dit het aanleiding gegee tot die derde opneemmetode, dwarsskandering, wat die hele profiel van die groef meet. Die groefprofiel word dan verwerk tot 'n klanksein. 'n Handoudiorestourasieprogram is geskryf om sigbare verwringing in die klanksein met beter interpolasies te vervang. Twee spraakverheffingsmetodes is ondersoek. 
Short-Time Spectral Attenuation (STSA) is die mees gebruikte metode vir oudiorestourasie. 'n Tweede spraakverheffingsmetode wat van 'n lineêre voorspellingskoëffisiëntafskatting (LPC-afskatting) van korttydraampies gebruik maak, is ook toegepas. Die twee metodes is deur luistertoetse teen mekaar opgeweeg. Die LPC-metode is verkies aangesien dit die verstaanbaarheid van die spraak beter behoue laat bly.
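A minimal sketch of the STSA family of methods mentioned above (not the thesis' implementation): each STFT frame of the digitized cylinder audio is attenuated with a Wiener-style gain derived from an estimate of the noise power spectrum. Taking the noise estimate from the first few frames is an assumption made for illustration.

```python
# Hedged sketch of Short-Time Spectral Attenuation with a Wiener-style gain.
import numpy as np
from scipy.signal import stft, istft

def stsa_enhance(x, fs, nperseg=512, noise_frames=10, floor=0.05):
    _, _, Z = stft(x, fs, nperseg=nperseg)
    power = np.abs(Z) ** 2
    noise_psd = power[:, :noise_frames].mean(axis=1, keepdims=True)   # assumed noise-only lead-in
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), floor)
    _, y = istft(gain * Z, fs, nperseg=nperseg)
    return y
```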
44

Odstraňování šumu pomocí neuronových sítí s cyklickou konzistencí / Speech Enhancement with Cycle-Consistent Neural Networks

Karlík, Pavol January 2020 (has links)
Deep neural networks are commonly used for speech enhancement (noise removal). The training process of such a network can be extended with a second neural network whose goal is to add noise to a clean speech recording. Together, these two networks can be used to reconstruct the original clean and noisy recordings. This thesis examines the effectiveness of this technique, called cycle consistency. Cycle consistency improves the robustness of the denoising network without modifying the network in any way, because it exposes the denoising network to a more diverse range of noisy data. However, this technique requires training data consisting of pairs of input and reference recordings, and such data are not always available. To train models on unpaired data, we use cycle-consistent generative adversarial networks. In this work we carried out a large number of experiments with models trained on paired and unpaired data. Our results show that using cycle consistency significantly improves model performance.
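A minimal sketch of a cycle-consistency objective of the kind described above (architectures, the L1 loss, and the weighting factor are assumptions, not the thesis' exact formulation): a denoiser maps noisy speech to clean speech, a noise generator maps clean speech to noisy speech, and mapping there and back is penalized for not reproducing the input.

```python
# Hedged sketch of a paired-data cycle-consistency loss for speech enhancement.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(denoiser, noiser, noisy, clean, lam=10.0):
    est_clean = denoiser(noisy)              # noisy -> clean
    est_noisy = noiser(clean)                # clean -> noisy
    # Supervised terms (possible only with paired recordings).
    sup = F.l1_loss(est_clean, clean) + F.l1_loss(est_noisy, noisy)
    # Cycle terms: going there and back should reproduce the original input.
    cyc = F.l1_loss(noiser(est_clean), noisy) + F.l1_loss(denoiser(est_noisy), clean)
    return sup + lam * cyc
```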
45

Robust Audio Scene Analysis for Rescue Robots / レスキューロボットのための頑健な音環境理解

Bando, Yoshiaki 26 March 2018 (has links)
Kyoto University / 0048 / New-system doctoral program / Doctor of Informatics / Kou No. 21209 / Johaku No. 662 / 新制||情||114 (University Library) / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Tatsuya Kawahara, Professor Hisashi Kashima, Professor Toshiyuki Tanaka, Lecturer Kazuyoshi Yoshii / Fulfils Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
46

Transfer learning approaches for feature denoising and low-resource speech recognition

Bagchi, Deblin 10 September 2020 (has links)
No description available.
47

Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition

Wang, Zhong-Qiu 15 September 2020 (has links)
No description available.
48

Traitement de l'incertitude pour la reconnaissance de la parole robuste au bruit / Uncertainty learning for noise robust ASR

Tran, Dung Tien 20 November 2015 (has links)
Cette thèse se focalise sur la reconnaissance automatique de la parole (RAP) robuste au bruit. Elle comporte deux parties. Premièrement, nous nous focalisons sur une meilleure prise en compte des incertitudes pour améliorer la performance de RAP en environnement bruité. Deuxièmement, nous présentons une méthode pour accélérer l'apprentissage d'un réseau de neurones en utilisant une fonction auxiliaire. Dans la première partie, une technique de rehaussement multicanal est appliquée à la parole bruitée en entrée. La distribution a posteriori de la parole propre sous-jacente est alors estimée et représentée par sa moyenne et sa matrice de covariance, ou incertitude. Nous montrons comment propager la matrice de covariance diagonale de l'incertitude dans le domaine spectral à travers le calcul des descripteurs pour obtenir la matrice de covariance pleine de l'incertitude sur les descripteurs. Le décodage incertain exploite cette distribution a posteriori pour modifier dynamiquement les paramètres du modèle acoustique au décodage. La règle de décodage consiste simplement à ajouter la matrice de covariance de l'incertitude à la variance de chaque gaussienne. Nous proposons ensuite deux estimateurs d'incertitude basés respectivement sur la fusion et sur l'estimation non-paramétrique. Pour construire un nouvel estimateur, nous considérons la combinaison linéaire d'estimateurs existants ou de fonctions noyaux. Les poids de combinaison sont estimés de façon générative en minimisant une mesure de divergence par rapport à l'incertitude oracle. Les mesures de divergence utilisées sont des versions pondérées des divergences de Kullback-Leibler (KL), d'Itakura-Saito (IS) ou euclidienne (EU). En raison de la positivité inhérente de l'incertitude, ce problème d'estimation peut être vu comme une instance de factorisation matricielle positive (NMF) pondérée. De plus, nous proposons deux estimateurs d'incertitude discriminants basés sur une transformation linéaire ou non linéaire de l'incertitude estimée de façon générative. Cette transformation est entraînée de sorte à maximiser le critère de maximum d'information mutuelle boosté (bMMI). Nous calculons la dérivée de ce critère en utilisant la règle de dérivation en chaîne et nous l'optimisons par descente de gradient stochastique. Dans la seconde partie, nous introduisons une nouvelle méthode d'apprentissage pour les réseaux de neurones basée sur une fonction auxiliaire sans aucun réglage de paramètre. Au lieu de maximiser la fonction objectif, cette technique consiste à maximiser une fonction auxiliaire qui est introduite de façon récursive couche par couche et dont le minimum a une expression analytique. Grâce aux propriétés de cette fonction, la décroissance monotone de la fonction objectif est garantie / This thesis focuses on noise robust automatic speech recognition (ASR). It includes two parts. First, we focus on better handling of uncertainty to improve the performance of ASR in a noisy environment. Second, we present a method to accelerate the training process of a neural network using an auxiliary function technique. In the first part, multichannel speech enhancement is applied to input noisy speech. The posterior distribution of the underlying clean speech is then estimated, as represented by its mean and its covariance matrix or uncertainty. We show how to propagate the diagonal uncertainty covariance matrix in the spectral domain through the feature computation stage to obtain the full uncertainty covariance matrix in the feature domain. 
Uncertainty decoding exploits this posterior distribution to dynamically modify the acoustic model parameters in the decoding rule. The uncertainty decoding rule simply consists of adding the uncertainty covariance matrix of the enhanced features to the variance of each Gaussian component. We then propose two uncertainty estimators based on fusion and on nonparametric estimation, respectively. To build a new estimator, we consider a linear combination of existing uncertainty estimators or kernel functions. The combination weights are generatively estimated by minimizing some divergence with respect to the oracle uncertainty. The divergence measures used are weighted versions of the Kullback-Leibler (KL), Itakura-Saito (IS), and Euclidean (EU) divergences. Due to the inherent nonnegativity of uncertainty, this estimation problem can be seen as an instance of weighted nonnegative matrix factorization (NMF). In addition, we propose two discriminative uncertainty estimators based on linear or nonlinear mapping of the generatively estimated uncertainty. This mapping is trained so as to maximize the boosted maximum mutual information (bMMI) criterion. We compute the derivative of this criterion using the chain rule and optimize it using stochastic gradient descent. In the second part, we introduce a new learning rule for neural networks that is based on an auxiliary function technique without parameter tuning. Instead of minimizing the objective function, this technique consists of minimizing a quadratic auxiliary function which is recursively introduced layer by layer and which has a closed-form optimum. Based on the properties of this auxiliary function, the monotonic decrease of the objective function under the new learning rule is guaranteed.
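A minimal sketch of the uncertainty decoding rule stated in the abstract, for a diagonal-covariance GMM observation model (the parameter layout is an assumption made for illustration): the uncertainty variance of the enhanced feature vector is added to the variance of every Gaussian component before evaluating its log-likelihood.

```python
# Hedged sketch of uncertainty decoding with a diagonal-covariance GMM.
import numpy as np

def uncertain_log_likelihood(x, x_var, means, variances, log_weights):
    """
    x          : (D,) enhanced feature vector (posterior mean)
    x_var      : (D,) diagonal uncertainty (posterior variance) of x
    means      : (K, D) Gaussian means
    variances  : (K, D) diagonal Gaussian variances
    log_weights: (K,) log mixture weights
    """
    var = variances + x_var                      # inflate each component's variance
    log_comp = -0.5 * (np.log(2 * np.pi * var) + (x - means) ** 2 / var).sum(axis=1)
    return np.logaddexp.reduce(log_weights + log_comp)
```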
49

Multisensor Segmentation-based Noise Suppression for Intelligibility Improvement in MELP Coders

Demiroglu, Cenk 18 January 2006 (has links)
This thesis investigates the use of an auxiliary sensor, the GEMS device, for improving the quality of noisy speech and for designing noise preprocessors for MELP speech coders. The use of auxiliary sensors for noise-robust ASR applications is also investigated to develop speech enhancement algorithms that use acoustic-phonetic properties of the speech signal. A Bayesian risk minimization framework is developed that can incorporate the acoustic-phonetic properties of speech sounds and knowledge of human auditory perception into the speech enhancement framework. Two noise suppression systems are presented using the ideas developed in the mathematical framework. In the first system, an aharmonic comb filter is proposed for voiced speech in which low-energy frequencies are severely suppressed while high-energy frequencies are suppressed only mildly. The proposed system outperformed an MMSE estimator in subjective listening tests and the DRT intelligibility test for MELP-coded noisy speech. The effect of aharmonic comb filtering on the linear predictive coding (LPC) parameters is analyzed using a missing data approach. Suppressing the low-energy frequencies without any modification of the high-energy frequencies is shown to improve the LPC spectrum using the Itakura-Saito distance measure. The second system combines the aharmonic comb filter with the acoustic-phonetic properties of speech to improve the intelligibility of MELP-coded noisy speech. The noisy speech signal is segmented into broad-level sound classes using a multi-sensor automatic segmentation/classification tool, and each sound class is enhanced differently based on its acoustic-phonetic properties. The proposed system is shown to outperform both the MELPe noise preprocessor and the aharmonic comb filter in intelligibility tests when used in concatenation with the MELP coder. Since the second noise suppression system uses an automatic segmentation/classification algorithm, exploiting the GEMS signal in an automatic segmentation/classification task is also addressed using an ASR approach. Current ASR engines can segment and classify speech utterances in a single pass; however, they are sensitive to ambient noise. Features extracted from the GEMS signal can be fused with the noisy MFCC features to improve the noise-robustness of the ASR system. In the first phase, a voicing feature is extracted from the clean speech signal and fused with the MFCC features. The actual GEMS signal could not be used in this phase because of insufficient sensor data to train the ASR system. Tests are done using the Aurora2 noisy digits database. The speech-based voicing feature is found to be effective at around 10 dB but, below 10 dB, the effectiveness rapidly drops with decreasing SNR because of the severe distortions in the speech-based features at these SNRs. Hence, a novel system is proposed that treats the MFCC features in a speech frame as missing data if the global SNR is below 10 dB and the speech frame is unvoiced. If the global SNR is above 10 dB or the speech frame is voiced, both the MFCC features and the voicing feature are used. The proposed system is shown to outperform some of the popular noise-robust techniques at all SNRs. In the second phase, a new isolated monosyllable database is prepared that contains both speech and GEMS data. ASR experiments conducted on clean speech showed that the GEMS-based feature, when fused with the MFCC features, decreases the performance. 
The reason for this unexpected result is found to be partly related to some of the GEMS data being severely noisy. Non-acoustic sensor noise exists in all GEMS data, but severe noise occurs only rarely. A missing data technique is proposed to alleviate the effects of severely noisy sensor data. The GEMS-based feature is treated as missing data when it is detected to be severely noisy. The combined features are shown to outperform the MFCC features for clean speech when the missing data technique is applied.
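A minimal sketch of the aharmonic comb filtering idea described above (the threshold rule and the two gain values are illustrative assumptions, not the thesis' parameters): within a voiced frame, bins whose magnitude falls below a level relative to the frame peak, assumed to lie between harmonics, are attenuated heavily, while high-energy harmonic bins are attenuated only mildly.

```python
# Hedged sketch of an "aharmonic" comb gain for one voiced spectral frame.
import numpy as np

def aharmonic_comb_gain(frame_spectrum, strong_gain_db=-1.0, weak_gain_db=-20.0,
                        rel_threshold_db=-15.0):
    mag_db = 20 * np.log10(np.abs(frame_spectrum) + 1e-12)
    threshold = mag_db.max() + rel_threshold_db          # relative to the frame peak
    gain_db = np.where(mag_db >= threshold, strong_gain_db, weak_gain_db)
    return 10 ** (gain_db / 20.0)                        # multiply with the spectrum
```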
50

Implementation and Evaluation of Spectral Subtraction with Minimum Statistics using WOLA and FFT Modulated Filter Banks

Rao, Peddi Srinivas, Sreelatha, Vallabhaneni January 2014 (has links)
In a communication system, the speech signal is corrupted by additive acoustic noise, and this distortion degrades the effectiveness of communication in terms of both the quality and the intelligibility of speech. Current research therefore addresses how acoustic noise can be eliminated effectively without affecting the original speech quality, and this is the challenge taken up in this thesis. This work proposes a multi-tiered detection method based on time-frequency analysis (i.e., filter banks) of noisy speech signals, using a standard speech enhancement method based on the proven spectral subtraction approach, for single-channel speech data and for a wide range of noise types at various noise levels. Various variants of the standard spectral subtraction method proposed by S. F. Boll have been introduced. In this thesis we designed and implemented a novel approach, Spectral Subtraction based on Minimum Statistics (MinSSS). Here, the power spectrum of the non-stationary noise signal is estimated by finding the minimum values of a smoothed power spectrum of the noisy speech signal, which circumvents the speech activity detection problem; the approach is also capable of dealing with non-stationary noise. To analyze the system in the time-frequency domain, we implemented two different filter bank approaches: Weighted Overlap-Add (WOLA) and Fast Fourier Transform Modulated (FFTMod) filter banks. The proposed systems were implemented and evaluated offline in MATLAB, and their performance was validated using objective quality measures, namely Signal-to-Noise Ratio Improvement (SNRI) and the Perceptual Evaluation of Speech Quality (PESQ) measure. The systems were tested with clean speech from a male and a female speaker sampled at 8 kHz, corrupted with various kinds of noise at different noise power levels. The MinSSS algorithm implemented with the FFTMod filter bank approach outperforms the WOLA filter bank approach.
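A minimal sketch of spectral subtraction with minimum-statistics noise estimation as summarized above (not the thesis' WOLA/FFTMod implementation): the noisy power spectrum is smoothed over time, the noise floor in each frequency bin is tracked as the minimum of the smoothed spectrum over a sliding window, and the estimated noise is subtracted with a spectral floor. The smoothing constant, search window, and oversubtraction factor are illustrative assumptions.

```python
# Hedged sketch of Spectral Subtraction based on Minimum Statistics (MinSSS-style).
import numpy as np
from scipy.signal import stft, istft

def min_stats_spectral_subtraction(x, fs, nperseg=256, alpha=0.85,
                                   search_frames=50, oversub=1.5, floor=0.02):
    _, _, Z = stft(x, fs, nperseg=nperseg)
    power = np.abs(Z) ** 2
    smoothed = np.empty_like(power)
    noise = np.empty_like(power)
    prev = power[:, 0]
    for t in range(power.shape[1]):
        prev = alpha * prev + (1 - alpha) * power[:, t]    # recursive smoothing
        smoothed[:, t] = prev
        lo = max(0, t - search_frames + 1)
        noise[:, t] = smoothed[:, lo:t + 1].min(axis=1)    # minimum statistics noise floor
    gain = np.maximum(1.0 - oversub * noise / np.maximum(power, 1e-12), floor)
    _, y = istft(np.sqrt(gain) * Z, fs, nperseg=nperseg)   # power-domain subtraction
    return y
```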
