441 |
Real-time adaptive noise cancellation for automatic speech recognition in a car environment : a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at Massey University, School of Engineering and Advanced Technology, Auckland, New Zealand. Qi, Ziming. January 2008 (has links)
This research concerns a robust method for improving real-time speech enhancement and noise cancellation for Automatic Speech Recognition (ASR). The thesis, titled “Real-time adaptive beamformer for automatic speech recognition in a car environment”, presents an application of a beamforming method combined with ASR. It offers a novel answer to the question: how can the driver’s voice control the car through ASR? The proposed solution is an ASR front-end built as a hybrid system combining an acoustic beamformer, a Voice Activity Detector (VAD) and an adaptive Wiener filter. The beamforming approach is based on normalized least-mean-squares (NLMS) adaptation to improve the Signal-to-Noise Ratio (SNR). The microphone array implements a VAD that uses time-delay estimation together with the magnitude-squared coherence (MSC). Experiments clearly show the ability of the composite system to reduce noise originating outside a defined active zone. In a real car environment, a speech recognition system must receive the driver’s voice only, while suppressing background noise such as the radio. The hybrid real-time adaptive filter therefore operates within a geometrical zone defined around the head of the desired speaker; any sound from outside this zone is treated as noise and suppressed. Because the zone is small, only the driver’s speech is assumed to originate from it. The technique uses three microphones to define a geometry-based VAD that cancels unwanted speech coming from outside the zone. When unwanted speech alone arrives from outside the desired zone, it is muted at the output of the hybrid noise canceller.
When unwanted and desired speech arrive at the same time, the proposed VAD cannot separate them. In that situation an adaptive Wiener filter is switched on for noise reduction, improving the SNR by as much as 28 dB. To assess the quality of the Wiener-filtered signal, a template-matching speech recognition system incorporating the Wiener filter was designed for testing. A commercial speech recognition system is also used to evaluate the proposed beamforming-based noise cancellation and the adaptive Wiener filter.
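The NLMS adaptation at the core of such a noise canceller can be sketched as a two-channel filter: one microphone carries speech plus noise, a second carries a noise-only reference. This is an illustrative sketch, not code from the thesis; the function name and parameter values are assumptions.

```python
import numpy as np

def nlms_cancel(reference, primary, order=8, mu=0.2, eps=1e-8):
    """Two-channel adaptive noise canceller (illustrative sketch).

    Predicts the noise component of `primary` from the noise-only
    `reference` channel with an FIR filter adapted by NLMS, and returns
    the prediction error, i.e. the cleaned signal.
    """
    w = np.zeros(order)
    out = np.zeros(len(primary))
    for n in range(order - 1, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]   # newest sample first
        y = w @ x                                  # noise estimate
        e = primary[n] - y                         # cleaned sample
        w += (mu / (eps + x @ x)) * e * x          # normalised LMS update
        out[n] = e
    return out
```

With a white-noise reference and a sinusoidal stand-in for speech, the prediction error settles close to the clean signal once the weights converge.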
|
442 |
Automatic phoneme recognition of South African English. Engelbrecht, Herman Arnold. 03 1900 (has links)
Thesis (MEng)--University of Stellenbosch, 2004. / ENGLISH ABSTRACT: Automatic speech recognition applications have been developed for many languages in
other countries but not much research has been conducted on developing Human Language
Technology (HLT) for S.A. languages. Research has been performed on informally
gathered speech data but until now a speech corpus that could be used to develop HLT
for S.A. languages did not exist. With the development of the African Speech Technology
Speech Corpora, it has now become possible to develop commercial applications of HLT.
The two main objectives of this work are the accurate modelling of phonemes, suitable
for the purposes of large-vocabulary continuous speech recognition (LVCSR), and the evaluation of the untried S.A. English speech corpus.
Three different aspects of phoneme modelling were investigated by performing isolated
phoneme recognition on the NTIMIT speech corpus. The three aspects were signal
processing, statistical modelling of HMM state distributions and context-dependent
phoneme modelling. Research has shown that the use of phonetic context when modelling
phonemes forms an integral part of most modern LVCSR systems. To facilitate
the context-dependent phoneme modelling, a method of constructing robust and accurate
models using decision tree-based state clustering techniques is described. The strength
of this method is the ability to construct accurate models of contexts that did not occur
in the training data. The method incorporates linguistic knowledge about the phonetic
context, in conjunction with the training data, to decide which phoneme contexts are
similar and should share model parameters.
As LVCSR typically consists of continuous recognition of spoken words, the context-dependent
and context-independent phoneme models that were created for the isolated
recognition experiments are evaluated by performing continuous phoneme recognition.
The phoneme recognition experiments are performed, without the aid of a grammar or
language model, on the S.A. English corpus. As the S.A. English corpus is newly created,
no previous research exists to which the continuous recognition results can be compared.
Therefore, it was necessary to create comparable baseline results, by performing continuous
phoneme recognition on the NTIMIT corpus. It was found that acceptable recognition
accuracy was obtained on both the NTIMIT and S.A. English corpora. Furthermore, the
results on S.A. English were 2-6% better than the results on NTIMIT, indicating that the
S.A. English corpus is of a high enough quality that it can be used for the development
of HLT.
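The decision-tree state clustering the abstract describes can be illustrated with a toy one-dimensional version: split a pool of context-dependent states on the phonetic question that yields the largest Gaussian log-likelihood gain. Real systems split on large question sets using sufficient statistics over full feature vectors; the question names and data below are hypothetical.

```python
import numpy as np

def gauss_ll(x):
    """Log-likelihood of 1-D data under its own ML Gaussian (floored variance)."""
    var = max(float(np.var(x)), 1e-4)
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1.0)

def best_split(contexts, data, questions):
    """Pick the phonetic question whose yes/no split of the pooled state
    data gives the largest log-likelihood gain (toy 1-D version)."""
    pooled = np.concatenate([data[c] for c in contexts])
    base = gauss_ll(pooled)
    best_name, best_gain = None, 0.0
    for name, members in questions.items():
        yes = [c for c in contexts if c in members]
        no = [c for c in contexts if c not in members]
        if not yes or not no:
            continue
        gain = (gauss_ll(np.concatenate([data[c] for c in yes]))
                + gauss_ll(np.concatenate([data[c] for c in no])) - base)
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain
```

Contexts that answer a question the same way end up sharing model parameters, which is how the method also covers contexts unseen in training: any new triphone can still answer the phonetic questions.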
|
443 |
Phoneme-based topic spotting on the Switchboard corpus. Theunissen, M. W. (Marthinus Wilhelmus). 04 1900 (has links)
Thesis (MScEng)--Stellenbosch University, 2002. / ENGLISH ABSTRACT: The field of topic spotting in conversational speech deals with the problem of identifying
"interesting" conversations or speech extracts contained within large volumes of speech
data. Typical applications where the technology can be found include the surveillance
and screening of messages before referring to human operators. Closely related methods
can also be used for data-mining of multimedia databases, literature searches, language
identification, call routing and message prioritisation.
The first topic spotting systems used words as the most basic units. However, because of the
poor performance of speech recognisers, a large amount of topic-specific hand-transcribed
training data is needed. It is for this reason that researchers started concentrating on methods
using phonemes instead, because the errors then occur on smaller, and therefore less
important, units. Phoneme-based methods consequently make it feasible to use computer
generated transcriptions as training data.
Building on word-based methods, a number of phoneme-based systems have emerged.
The two most promising ones are the Euclidean Nearest Wrong Neighbours (ENWN) algorithm
and the newly developed Stochastic Method for the Automatic Recognition of
Topics (SMART). Previous experiments on the Oregon Graduate Institute of Science and
Technology's Multi-Language Telephone Speech Corpus suggested that SMART yields a
large improvement over ENWN which outperformed competing phoneme-based systems
in evaluations. However, the small amount of data available for these experiments meant
that more rigorous testing was required.
In this research, the algorithms were therefore re-implemented to run on the much larger
Switchboard Corpus. Subsequently, a substantial improvement of SMART over ENWN
was observed, confirming the result that was previously obtained. In addition to this,
an investigation was conducted into the improvement of SMART. This resulted in a new
counting strategy with a corresponding improvement in performance.
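ENWN and SMART are not specified in enough detail in this abstract to reproduce, but the idea they share, scoring a phoneme transcript against per-topic n-gram statistics so that recogniser errors on individual phonemes matter less than word errors would, can be sketched as follows. This is a hedged sketch: the function names and the probability floor are assumptions, not either published algorithm.

```python
import math
from collections import Counter

def bigram_profile(phoneme_strings):
    """Relative phoneme-bigram frequencies of a topic's training transcripts."""
    counts = Counter()
    for s in phoneme_strings:
        counts.update(zip(s, s[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def topic_score(utterance, profile, floor=1e-6):
    """Log-likelihood of an utterance's phoneme bigrams under a topic profile;
    unseen bigrams are floored rather than given zero probability."""
    return sum(math.log(profile.get(bg, floor))
               for bg in zip(utterance, utterance[1:]))
```

An utterance is assigned to (or spotted for) the topic whose profile gives it the highest score.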
|
444 |
Unsupervised clustering of audio data for acoustic modelling in automatic speech recognition systems. Goussard, George Willem. 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2011. / ENGLISH ABSTRACT: This thesis presents a system that is designed to replace the manual process of
generating a pronunciation dictionary for use in automatic speech recognition.
The proposed system has several stages.
The first stage segments the audio into subword units, using a frequency-domain method. In the second stage, dynamic
time warping is used to determine the similarity between the segments of each
possible pair of these acoustic segments. These similarities are used to cluster
similar acoustic segments into acoustic clusters. The final stage derives a
pronunciation dictionary from the orthography of the training data and corresponding
sequence of acoustic clusters. This process begins with an initial
mapping between words and their sequence of clusters, established by Viterbi
alignment with the orthographic transcription. The dictionary is refined iteratively
by pruning redundant mappings, hidden Markov model estimation and
Viterbi re-alignment in each iteration.
This approach is evaluated experimentally by applying it to two subsets of
the TIMIT corpus. It is found that, when test words are repeated often in the
training material, the approach leads to a system whose accuracy is almost as
good as one trained using the phonetic transcriptions. When test words are
not repeated often in the training set, the proposed approach leads to better
results than those achieved using the phonetic transcriptions, although the
recognition is poor overall in this case.
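The grouping of acoustic segments by pairwise similarity can be approximated with single-linkage grouping over a distance matrix. This is a minimal stand-in, not the thesis's algorithm: the threshold rule and union-find bookkeeping are assumptions, and in the system described above the distances would come from the dynamic time warping stage.

```python
def cluster_segments(dist, threshold):
    """Group segments whose pairwise distance is below `threshold`
    (single-linkage via union-find); returns a cluster label per segment."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)
    labels = [find(i) for i in range(n)]
    remap = {root: k for k, root in enumerate(dict.fromkeys(labels))}
    return [remap[l] for l in labels]
```

Each resulting cluster plays the role of an acoustic "subword unit" identifier in the derived pronunciation dictionary.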
|
445 |
Recurrent neural network language models for automatic speech recognition. Gangireddy, Siva Reddy. January 2017 (has links)
The goal of this thesis is to advance the use of recurrent neural network language models (RNNLMs) for large vocabulary continuous speech recognition (LVCSR). RNNLMs are currently state-of-the-art and have been shown to consistently reduce the word error rates (WERs) of LVCSR tasks compared to other language models. In this thesis we propose various advances to RNNLMs: improved learning procedures, enhanced context, and adaptation. We learned better parameters through a novel pre-training approach and enhanced the context using prosodic and syntactic features. We present a pre-training method for RNNLMs in which the output weights of a feed-forward neural network language model (NNLM) are shared with the RNNLM. This is accomplished by first fine-tuning the weights of the NNLM, which are then used to initialise the output weights of an RNNLM with the same number of hidden units. To investigate the effectiveness of the proposed pre-training method, we carried out text-based experiments on the Penn Treebank Wall Street Journal data, and ASR experiments on the TED lectures data. Across the experiments, we observe small but significant improvements in perplexity (PPL) and ASR WER. Next, we present unsupervised adaptation of RNNLMs. We adapted the RNNLMs to a target domain (topic, genre or television programme) at test time using ASR transcripts from first-pass recognition. We investigated two approaches: in the first, the forward-propagated hidden activations are scaled (learning hidden unit contributions, LHUC); in the second, all parameters of the RNNLM are adapted. We evaluated the adapted RNNLMs by reporting WERs on multi-genre broadcast speech data, observing small (on average 0.1% absolute) but significant improvements in WER compared to a strong unadapted RNNLM. Finally, we present the context enhancement of RNNLMs using prosody and syntactic features.
The prosody features were computed from the acoustics of the context words and the syntactic features were from the surface form of the words in the context. We trained the RNNLMs with word duration, pause duration, final phone duration, syllable duration, syllable F0, part-of-speech tag and Combinatory Categorial Grammar (CCG) supertag features. The proposed context-enhanced RNNLMs were evaluated by reporting PPL and WER on two speech recognition tasks, Switchboard and TED lectures. We observed substantial improvements in PPL (5% to 15% relative) and small but significant improvements in WER (0.1% to 0.5% absolute).
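The pre-training step, initialising an RNNLM's output layer from a feed-forward NNLM with the same number of hidden units, can be sketched with a toy NumPy forward pass. All sizes and weight values below are illustrative, not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 5, 8   # toy vocabulary and hidden-layer sizes

# Hypothetical output weights of an already fine-tuned feed-forward NNLM:
W_out_nnlm = rng.normal(0, 0.1, (V, H))

# The RNNLM's recurrent weights start at random, but its output layer is
# initialised from the NNLM's output weights (the pre-training idea above).
W_in = rng.normal(0, 0.1, (H, V))
W_rec = rng.normal(0, 0.1, (H, H))
W_out = W_out_nnlm.copy()

def rnnlm_probs(token_ids):
    """Forward pass: per-step next-word distributions of a simple RNNLM."""
    h = np.zeros(H)
    probs = []
    for t in token_ids:
        x = np.eye(V)[t]                       # one-hot input word
        h = np.tanh(W_in @ x + W_rec @ h)      # recurrent hidden state
        z = W_out @ h                          # shared/pre-trained output layer
        p = np.exp(z - z.max())
        probs.append(p / p.sum())              # softmax over the vocabulary
    return np.array(probs)
```

Training code is omitted; the point is only that `W_out` starts from the NNLM's fine-tuned output matrix rather than from random values.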
|
446 |
Dynamic Time Warping baseado na transformada wavelet / Dynamic Time Warping based on the wavelet transform. Sylvio Barbon Júnior. 31 August 2007 (has links)
Dynamic Time Warping (DTW) is a pattern-matching technique for speech recognition, based on the temporal alignment of the input signal with template models. One drawback of this technique is its high computational cost. This work presents a modified version of DTW, based on the Discrete Wavelet Transform (DWT), that reduces the complexity of the original algorithm. The performance obtained with the proposed algorithm is very promising, improving recognition speed and memory consumption while precision is not affected. Tests were performed with phonemes extracted from the TIMIT corpus provided by the Linguistic Data Consortium (LDC).
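The complexity reduction can be sketched by running DTW on wavelet approximation coefficients instead of raw samples: each DWT level halves the sequence lengths, so the DTW grid shrinks by roughly a factor of four per level. The choice of the Haar wavelet and the decomposition depth here is an assumption for illustration, not necessarily what the thesis uses.

```python
import numpy as np

def haar_approx(x, levels=1):
    """Keep only the Haar approximation coefficients; each level halves
    the sequence length (odd trailing samples are dropped)."""
    for _ in range(levels):
        if len(x) % 2:
            x = x[:-1]
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    return x

def dtw_cost(a, b):
    """Classic O(n*m) DTW alignment cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Comparing `dtw_cost(haar_approx(a), haar_approx(b))` preserves the ranking of similar versus dissimilar signals at a quarter of the alignment cost per level.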
|
447 |
Sistema de inferência genético-nebuloso para reconhecimento de voz: uma abordagem em modelos preditivos de baixa ordem utilizando a transformada cosseno discreta / Genetic-fuzzy inference system for speech recognition: an approach using low-order predictive models and the discrete cosine transform. Silva, Washington Luis Santos. 20 March 2015 (has links)
This thesis proposes a methodology that uses an intelligent system for voice recognition. It adopts the definition of an intelligent system as one that can adapt its behaviour to achieve its goals in a variety of environments, and the definition of Computational Intelligence as the simulation of intelligent behaviour in terms of computational processes. In addition to pre-processing the speech signal with mel-cepstral coefficients, the discrete cosine transform (DCT) is used to generate a two-dimensional array that models each pattern to be recognized. A Mamdani fuzzy inference system for speech recognition is optimized by a genetic algorithm to maximize the number of correctly classified patterns with a reduced number of parameters. The experimental speech recognition results achieved with the proposed methodology were compared with Hidden Markov Models (HMM) and with the Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) classifiers. The recognition system used in this thesis is called Intelligent Methodology for Speech Recognition (IMSR).
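A hedged reading of the DCT modelling step: each pattern is represented by the low-order block of the two-dimensional DCT of its mel-cepstral frame matrix, exploiting the DCT's energy compaction. The `keep` block size is an assumed parameter, not a value from the thesis.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    M = np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    M[0, :] *= np.sqrt(1.0 / n)
    M[1:, :] *= np.sqrt(2.0 / n)
    return M

def dct2_pattern(cepstra, keep=4):
    """Model a pattern by the low-order block of the 2-D DCT of its
    mel-cepstral matrix (frames x coefficients)."""
    T, C = cepstra.shape
    A = dct_matrix(T) @ cepstra @ dct_matrix(C).T
    return A[:keep, :keep]
```

Because smooth cepstral trajectories concentrate their energy in the low-order DCT coefficients, a small `keep x keep` block summarises the whole pattern with few parameters, which is what makes a compact fuzzy rule base feasible.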
|
448 |
A comparative acoustic analysis of the long vowels and diphthongs of Afrikaans and South African English. Prinsloo, Claude Pierre. 03 March 2006 (has links)
Please read the abstract in the section 00front of this document / Dissertation (MEng (Computer Engineering))--University of Pretoria, 2006. / Electrical, Electronic and Computer Engineering / unrestricted
|
449 |
Modèles de langage ad hoc pour la reconnaissance automatique de la parole / Ad hoc language models for automatic speech recognition. Oger, Stanislas. 30 November 2011 (has links)
The three pillars of an automatic speech recognition system are the lexicon, the language model and the acoustic model. The lexicon provides all the words that can be transcribed, associated with their pronunciation. The acoustic model indicates how the phone units are pronounced, and the language model brings the knowledge of how words are linked. In modern automatic speech recognition systems, the acoustic and language models are statistical; their estimation requires large volumes of selected, standardized and annotated data. At present, the Web is by far the largest textual corpus available for the English and French languages. The data it holds can potentially be used to build the vocabulary and to estimate and adapt the language model. The work presented here proposes new approaches for taking advantage of this resource in the context of language modelling. The document is organized in two parts. The first deals with the use of Web data to dynamically update the lexicon of the automatic speech recognition system. The proposed approach increases the lexicon dynamically and locally, only when unknown words appear in the speech. New words are extracted from the Web through the formulation of queries submitted to Web search engines, and their phonetization is obtained with an automatic grapheme-to-phoneme transcriber. The second part presents a new way of handling the information contained on the Web by relying on possibility theory concepts. A Web-based possibilistic language model is proposed: it estimates the possibility of a word sequence from knowledge of the existence of its sub-sequences on the Web. A probabilistic Web-based language model is also proposed, relying on Web document counts to estimate n-gram probabilities. Several approaches for combining these models with classical corpus-based models are proposed. The results show that combining probabilistic and possibilistic models gives better results than classical probabilistic models alone. In addition, the models estimated from Web data perform better than those estimated on a corpus.
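A toy reading of the possibilistic idea: a word sequence is fully possible only if all of its bigrams have been observed (for example, on the Web), and its possibility degrades when a bigram is unseen, less severely when both words at least exist separately. The specific possibility values (1.0, 0.5, 0.0) are illustrative inventions, not the thesis's formula.

```python
def possibility(words, seen):
    """Toy possibilistic score over a set `seen` of observed n-gram tuples:
    a sequence is only as possible as its least possible bigram."""
    score = 1.0
    for bg in zip(words, words[1:]):
        if bg in seen:
            continue                      # observed bigram: fully possible
        if (bg[0],) in seen and (bg[1],) in seen:
            score = min(score, 0.5)       # words exist, pairing unseen
        else:
            score = 0.0                   # an entirely unknown word
    return score
```

Unlike a probabilistic n-gram model, the score does not shrink with sentence length; it only drops when evidence of impossibility (an unseen sub-sequence) appears.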
|
450 |
L’analyse factorielle pour la modélisation acoustique des systèmes de reconnaissance de la parole / Factor analysis for acoustic modeling of speech recognition systems. Bouallegue, Mohamed. 16 December 2013 (has links)
In this thesis, we propose to use techniques based on factor analysis to build acoustic models for automatic speech processing, especially Automatic Speech Recognition (ASR). Firstly, we were interested in reducing the memory footprint of acoustic models. Our factor analysis-based method demonstrated that it is possible to pool the parameters of acoustic models while maintaining performance similar to that of the baseline models. The proposed modelling decomposes the acoustic model parameters into independent parameter subsets, which allows great flexibility for particular adaptations (speakers, genre, new tasks, etc.). With current modelling techniques, the state of a Hidden Markov Model (HMM) is represented by a mixture of Gaussians (GMM: Gaussian Mixture Model). We propose as an alternative a vector representation of states: the state factors. These state factors enable us to accurately measure the similarity between HMM states by means of, for example, a Euclidean distance. Using this vector representation, we propose a simple and effective method for building acoustic models with shared states, a procedure that is even more effective when applied to under-resourced languages. Finally, we concentrated our efforts on the robustness of speech recognition systems to acoustic variabilities, particularly those generated by the environment. In our various experiments, we examined speaker variability, channel variability and additive noise. Through our factor analysis-based approach, we demonstrated that these different types of harmful acoustic variability can be modelled as an additive component in the cepstral domain. By subtracting this component from the cepstral vectors, we cancel out its penalizing effect on speech recognition.
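The compensation step, subtracting an additive variability component in the cepstral domain, can be sketched with the simplest possible estimator of that component: the per-coefficient mean (plain cepstral mean subtraction). The thesis estimates the component with factor analysis instead; this sketch only shows where the subtraction happens.

```python
import numpy as np

def estimate_channel_component(cepstra):
    """Simplest estimator of an additive variability component: the
    per-coefficient mean over all frames (cepstral mean subtraction).
    The thesis replaces this with a factor-analysis estimate."""
    return cepstra.mean(axis=0)

def remove_additive_component(cepstra, component):
    """Compensate cepstral frames (frames x coefficients) by subtracting
    the estimated additive speaker/channel/noise component."""
    return cepstra - component
```

A constant channel offset added to every frame is removed exactly by this subtraction, leaving only the (mean-normalised) speech content.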
|