671 |
Automatic speech recognition for resource-scarce environments / N.T. Kleynhans. Kleynhans, Neil Taylor January 2013 (has links)
Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource-intensive activity and requires language resources in the form of text-annotated audio recordings and pronunciation dictionaries. Unfortunately, many languages found in the developing world fall into the resource-scarce category, and this resource scarcity severely inhibits the deployment of ASR systems in the developing world. In this thesis we present research into developing techniques and tools to (1) harvest audio data, (2) rapidly adapt ASR systems and (3) select “useful” training samples in order to assist with resource-scarce ASR system development.
We demonstrate an automatic audio harvesting approach which efficiently creates a speech recognition corpus by harvesting an easily available audio resource. We show that by starting with bootstrapped acoustic models, trained with language data obtained from a dialect, and then running through a few iterations of an alignment-filter-retrain phase, it is possible to create an accurate speech recognition corpus. As a demonstration we create a South African English speech recognition corpus by applying our approach to an internet website which provides audio and approximate transcriptions. The acoustic models developed from the harvested data are evaluated on independent corpora and show that the proposed harvesting approach provides a robust means to create ASR resources.
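As a rough illustration of the alignment-filter-retrain idea described above, the sketch below outlines one possible harvesting loop; the helper callables (forced alignment, acoustic-model training) and the score threshold are placeholders, not the actual tools used in the thesis.

```python
# Illustrative sketch of an alignment-filter-retrain harvesting loop.
# force_align and train_acoustic_models are caller-supplied placeholders,
# and the threshold is an assumption, not the thesis's actual setup.

def harvest_corpus(candidates, bootstrap_models, force_align, train_acoustic_models,
                   n_iterations=3, threshold=-2.0):
    """candidates: list of (audio, approximate_transcription) pairs."""
    models = bootstrap_models
    accepted = []
    for _ in range(n_iterations):
        accepted = []
        for audio, text in candidates:
            # Force-align the approximate transcription to the audio and keep the
            # pair only if the average alignment score exceeds the threshold.
            score = force_align(models, audio, text)
            if score > threshold:
                accepted.append((audio, text))
        # Retrain acoustic models on the filtered subset and repeat.
        models = train_acoustic_models(accepted)
    return accepted, models
```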
As there are many acoustic model adaptation techniques available to an ASR system developer, selecting the best one becomes a costly endeavour. We investigate how various adaptation techniques depend on the amount of adaptation data by systematically varying the adaptation data amount and comparing their performance. We establish a guideline which can be used by an ASR developer to choose the best adaptation technique given a size constraint on the adaptation data, for the scenario where adaptation between narrow- and wide-band corpora must be performed. In addition, we investigate the effectiveness of a novel channel normalisation technique and compare its performance with standard normalisation and adaptation techniques.
Lastly, we propose a new data selection framework which can be used to design a speech recognition corpus. We show that for limited data sets, independent of language and bandwidth, the most effective strategy for data selection is frequency-matched selection, and that the widely used maximum entropy methods generally produce the least promising results. In our model, the frequency-matched selection method corresponds to a logarithmic relationship between accuracy and corpus size; we also investigated other model relationships and found that a hyperbolic relationship (as suggested by simple asymptotic arguments from learning theory) may lead to somewhat better performance under certain conditions. / Thesis (PhD (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2013.
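To make the accuracy-versus-corpus-size models concrete, the following sketch fits a logarithmic and a hyperbolic curve to a handful of invented accuracy measurements; the data points and parameter forms are purely illustrative.

```python
# Sketch: comparing a logarithmic and a hyperbolic model of accuracy vs corpus size.
# The data points below are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

sizes = np.array([1, 2, 4, 8, 16, 32], dtype=float)        # hours of data (illustrative)
accuracy = np.array([0.55, 0.63, 0.70, 0.75, 0.79, 0.82])   # accuracy (illustrative)

def log_model(n, a, b):
    return a + b * np.log(n)

def hyperbolic_model(n, a, b):
    # Asymptotic form suggested by simple learning-theory arguments: accuracy -> a as n grows.
    return a - b / n

for name, model in [("logarithmic", log_model), ("hyperbolic", hyperbolic_model)]:
    params, _ = curve_fit(model, sizes, accuracy)
    residual = np.sum((accuracy - model(sizes, *params)) ** 2)
    print(f"{name}: params={params}, sum of squared residuals={residual:.4f}")
```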
|
672 |
Language Modeling For Turkish Continuous Speech Recognition Sahin, Serkan 01 December 2003 (has links) (PDF)
This study aims to build a new language model for Turkish continuous speech recognition. Turkish is a very productive language in terms of word forms because of its agglutinative nature. For languages like Turkish, the vocabulary size quickly becomes unmanageable: from a single stem, thousands of new words can be generated using inflectional and derivational suffixes. In this work, words are parsed into their stems and endings. First, we treat endings as words and obtain bigram probabilities over stems and endings; then, bigram probabilities are obtained using only the stems. Single-pass recognition was performed using these bigram probabilities. Next, two-pass recognition was performed: the bigram probabilities were used to create word lattices, trigram probabilities were obtained from a larger text, and one-best results were obtained by rescoring the word lattices with the trigram probabilities. All work was done in the Hidden Markov Model Toolkit (HTK) environment, except parsing and network transformation.
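As a simple illustration of estimating bigram probabilities over stem and ending units, treating endings as separate tokens (add-one smoothing stands in for whatever smoothing the thesis actually used in HTK):

```python
# Sketch: maximum-likelihood bigram estimates over stem/ending units, with add-one
# smoothing for illustration; the example units are invented Turkish-like tokens.
from collections import Counter

def bigram_probs(sentences):
    """sentences: lists of units, e.g. [['kitap', '+lar', '+da'], ...]."""
    unigrams, bigrams = Counter(), Counter()
    for units in sentences:
        padded = ['<s>'] + units + ['</s>']
        unigrams.update(padded)
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab = len(unigrams)
    def prob(u, v):
        return (bigrams[(u, v)] + 1) / (unigrams[u] + vocab)
    return prob

p = bigram_probs([['kitap', '+lar', '+da'], ['kitap', '+lar']])
print(p('kitap', '+lar'))
```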
|
673 |
A multi-objective programming perspective to statistical learning problems Yaman, Sibel 17 November 2008
It has been increasingly recognized that realistic problems often involve a tradeoff among many conflicting objectives. Traditional methods aim at satisfying multiple objectives by combining them into a global cost function, which in most cases overlooks the underlying tradeoffs between the conflicting objectives. This raises the issue about how different objectives should be combined to yield a final solution. Moreover, such approaches promise that the chosen overall objective function is optimized over the training samples. However, there is no guarantee on the performance in terms of the individual objectives since they are not considered on an individual basis.
Motivated by these shortcomings of traditional methods, the objective in this dissertation is to investigate theory, algorithms, and applications for problems with competing objectives and to understand the behavior of the proposed algorithms in light of some applications. We develop a multi-objective programming (MOP) framework for finding compromise solutions that are satisfactory for each of multiple competing performance criteria. The fundamental idea for our formulation, which we refer to as iterative constrained optimization (ICO), revolves around improving one objective while allowing the rest to degrade. This is achieved by the optimization of individual objectives with proper constraints on the remaining competing objectives. The constraint bounds are adjusted based on the objective functions obtained in the most recent iteration. An aggregated utility function is used to evaluate the acceptability of local changes in competing criteria, i.e., changes from one iteration to the next.
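One plausible reading of the ICO loop, sketched for two toy objectives with scipy's constrained optimizer; the 5% relaxation rule and the additive utility are simplified placeholders rather than the dissertation's exact formulation.

```python
# Sketch of an iterative-constrained-optimization (ICO) style loop for two competing
# objectives: improve one objective while constraining how much the other may degrade,
# and accept the step only if an aggregated utility does not worsen. Objectives,
# relaxation factor and utility are illustrative simplifications.
import numpy as np
from scipy.optimize import minimize

def f1(x):  # first objective (illustrative)
    return (x[0] - 1.0) ** 2 + x[1] ** 2

def f2(x):  # second, conflicting objective (illustrative)
    return x[0] ** 2 + (x[1] - 1.0) ** 2

def utility(v1, v2):
    return -(v1 + v2)  # simple aggregated utility: smaller objective values are better

objectives = [f1, f2]
x = np.array([0.0, 0.0])
for it in range(10):
    k = it % 2                       # objective to improve this iteration
    other = objectives[1 - k]
    bound = other(x) * 1.05          # let the other objective degrade by at most 5%
    res = minimize(objectives[k], x, method='SLSQP',
                   constraints=[{'type': 'ineq', 'fun': lambda z: bound - other(z)}])
    if utility(f1(res.x), f2(res.x)) >= utility(f1(x), f2(x)):
        x = res.x                    # accept only if the aggregated utility does not worsen
print(x, f1(x), f2(x))
```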
Conflicting objectives arise in different contexts in many problems of speech and language technologies. In this dissertation, we consider two applications. The first application is language model (LM) adaptation, where a general LM is adapted to a specific application domain so that the adapted LM is as close as possible to both the general model and the application domain data. Language modeling and adaptation is used in many speech and language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing, and information retrieval.
The second application is automatic language identification (LID), where the standard detection performance evaluation measures, false-rejection (or miss) and false-acceptance (or false alarm) rates for a number of languages, are to be simultaneously minimized. LID systems might be used as a pre-processing stage for understanding systems and for human listeners, and find applications in, for example, a hotel lobby or an international airport where one might speak to a multi-lingual voice-controlled travel information retrieval system.
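For concreteness, the miss and false-alarm rates that LID seeks to minimize can be computed per language from detection scores as in the sketch below; the scores, labels and threshold are invented.

```python
# Sketch: false-rejection (miss) and false-acceptance (false-alarm) rates from
# detection scores at a fixed threshold. Scores and labels are invented.
import numpy as np

def miss_fa_rates(scores, is_target, threshold):
    scores, is_target = np.asarray(scores), np.asarray(is_target, dtype=bool)
    accept = scores >= threshold
    miss = np.mean(~accept[is_target])         # target-language trials rejected
    false_alarm = np.mean(accept[~is_target])  # non-target trials accepted
    return miss, false_alarm

scores = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]
is_target = [1, 1, 0, 0, 1, 0]
print(miss_fa_rates(scores, is_target, threshold=0.5))
```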
This dissertation is expected to provide new insights and techniques for accomplishing significant performance improvements over existing approaches in terms of the individual competing objectives. Meanwhile, the designer has better control over what is achieved in terms of the individual objectives. Although many MOP approaches developed so far are formal and extensible to a large number of competing objectives, their capabilities are examined only with two or three objectives. This is mainly because practical problems become significantly harder to manage as the number of objectives grows. We, however, illustrate the proposed framework with a larger number of objectives.
|
674 |
Soft margin estimation for automatic speech recognition Li, Jinyu 27 August 2008
In this study, a new discriminative learning framework, called soft margin estimation (SME), is proposed for estimating the parameters of continuous density hidden Markov models (HMMs). The proposed method makes direct use of the successful ideas of margin in support vector machines to improve generalization capability, and of decision feedback learning in discriminative training to enhance model separation in classifier design. SME directly maximizes the separation between competing models so that test samples reach a correct decision as long as their deviation from the training samples stays within a safe margin. Frame and utterance selection are integrated into a unified framework to select the training utterances and frames critical for discriminating competing models. SME offers a flexible and rigorous framework to facilitate the incorporation of new margin-based optimization criteria into HMM training. The choice of various loss functions is illustrated and different kinds of separation measures are defined under a unified SME framework. SME is also shown to be able to jointly optimize feature extraction and HMMs. Both the generalized probabilistic descent algorithm and the Extended Baum-Welch algorithm are applied to solve SME.
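The sketch below illustrates a generic margin-based separation loss in the spirit of SME, using the log-likelihood gap between the correct model and the best competing model together with a simple utterance-selection rule; it is not the thesis's exact objective or optimizer.

```python
# Sketch of a margin-based separation loss in the spirit of SME: the separation d(X)
# is the log-likelihood gap between the correct model and the best competitor, and
# only utterances whose separation falls short of the margin contribute to the loss.
# This is a generic illustration, not the thesis's exact objective.
import numpy as np

def separation(loglik_correct, loglik_competitors):
    return loglik_correct - np.max(loglik_competitors)

def soft_margin_loss(separations, margin):
    d = np.asarray(separations, dtype=float)
    selected = d < margin            # utterance selection: only samples inside the margin
    return margin - np.mean(d[selected]) if selected.any() else 0.0

d = [separation(-100.0, [-104.0, -110.0]),   # well separated utterance
     separation(-100.0, [-101.0, -103.0])]   # utterance inside the margin
print(soft_margin_loss(d, margin=5.0))
```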
SME has demonstrated its great advantage over other discriminative training methods in several speech recognition tasks. Tested on the TIDIGITS digit recognition task, the proposed SME approach achieves a string accuracy of 99.61%, the best result ever reported in the literature. On the 5k-word Wall Street Journal task, SME reduced the word error rate (WER) from 5.06% for MLE models to 3.81%, a relative WER reduction of 25%. This is the first attempt to show the effectiveness of margin-based acoustic modeling for large vocabulary continuous speech recognition in an HMM framework. The generalization of SME was also well demonstrated on the Aurora 2 robust speech recognition task, with around 30% relative WER reduction from the clean-trained baseline.
|
675 |
Μελέτη γλωσσολογικών μοντέλων για αναγνώριση συναισθημάτων ομιλητή / Study of linguistic models for speaker emotion recognition Αποστολόπουλος, Γεώργιος 07 June 2010 (has links)
Με τη συνεχώς αυξανόμενη παρουσία αυτόματων συστημάτων στην καθημερινότητά μας, εισέρχεται και το βάρος της αλληλεπίδρασης με αυτά τα συστήματα εξαιτίας της έλλειψης συναισθηματικής νοημοσύνης από την πλευρά των μηχανών [1]. Η συναισθηματική πληροφορία που μεταδίδεται μέσω της ανθρώπινης ομιλίας αποτελεί σημαντικό παράγοντα στις ανθρώπινες επικοινωνίες και αλληλεπιδράσεις. Όταν οι άνθρωποι αλληλεπιδρούν με μηχανές ή υπολογιστικά συστήματα υπάρχει ένα κενό μεταξύ της πληροφορίας που μεταδίδεται και αυτής που γίνεται αντιληπτή. Η εργασία αυτή επικεντρώνεται στον τρόπο με τον οποίο ένα υπολογιστικό σύστημα μπορεί να αντιληφθεί την συναισθηματική πληροφορία που υποβόσκει στην ανθρώπινη ομιλία χρησιμοποιώντας την πληροφορία που βρίσκεται στα διάφορα γλωσσολογικά μοντέλα. Γίνεται μελέτη ενός συστήματος αναγνώρισης της συναισθηματικής κατάστασης του ομιλητή, και πιο συγκεκριμένα επικεντρωνόμαστε στην επεξεργασία ομιλίας και την εξαγωγή των κατάλληλων παραμέτρων, οι οποίες θα μπορέσουν να χαρακτηρίσουν μονοσήμαντα κάθε συναισθηματική κατάσταση. Κάνουμε επεξεργασία οπτικοακουστικού υλικού χρησιμοποιώντας διάφορα εργαλεία λογισμικού με σκοπό να αντλήσουμε αξιόπιστη γλωσσολογική πληροφορία, η οποία να είναι αντιπροσωπευτική των διαφόρων συναισθημάτων που εξετάζουμε. Συνδυάζοντας τη γλωσσολογική με την ακουστική πληροφορία καταλήγουμε σε ένα ολοκληρωμένο μοντέλο αναγνώρισης συναισθημάτων. Τα αποτελέσματά μας υποδεικνύουν το ποσοστό κατά το οποίο τα εξαγόμενα γλωσσολογικά μοντέλα μπορούν να μας προσφέρουν αξιόπιστη αναγνώριση συναισθημάτων ενός ομιλητή. / Along with the constantly increasing presence of automatic systems in our everyday lives comes the burden of interacting with these systems, owing to the lack of emotional intelligence on the machines' side [1]. The emotional information conveyed through human speech is an important factor in human communication and interaction. When people interact with machines or computer systems, however, there is a gap between the information transmitted and the information perceived. This diploma thesis focuses on the way a computer system can perceive the emotional information underlying human speech by using the information found in linguistic models. We study a system for recognizing the emotional state of the speaker; more specifically, we focus on speech processing and the extraction of appropriate parameters that can uniquely characterize each emotional state. We process audiovisual material using various software tools in order to extract reliable linguistic information that is representative of the emotions under examination. By combining the linguistic information with the acoustic information, we arrive at a complete emotion recognition model. Our results indicate the degree to which the extracted linguistic models can provide reliable recognition of a speaker's emotions.
|
676 |
Automatic classification of spoken South African English variants using a transcription-less speech recognition approach Du Toit, A. (Andre) 03 1900 (has links)
Thesis (MEng)--University of Stellenbosch, 2004. / ENGLISH ABSTRACT: We present the development of a pattern recognition system which is capable of classifying different Spoken Variants (SVs) of South African English (SAE) using a transcriptionless speech recognition approach. Spoken Variants (SVs) allow us to unify the linguistic concepts of accent and dialect from a pattern recognition viewpoint. The need for the SAE SV classification system arose from the multi-linguality requirement for South African speech recognition applications and the costs involved in developing such applications. / AFRIKAANSE OPSOMMING: Ons beskryf die ontwikkeling van 'n patroon herkenning stelsel wat in staat is om verskillende Gesproke Variante (GVe) van Suid Afrikaanse Engels (SAE) te klassifiseer met behulp van 'n transkripsielose spraak herkenning metode. Gesproke Variante (GVe) stel ons in staat om die taalkundige begrippe van aksent en dialek te verenig vanuit 'n patroon herkenning oogpunt. Die behoefte aan 'n SAE GV klassifikasie stelsel het ontstaan uit die meertaligheid vereiste vir Suid Afrikaanse spraak herkenning stelsels en die koste verbonde aan die ontwikkeling van sodanige stelsels.
|
677 |
Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia / Utilisation de ressources dans une langue proche pour la reconnaissance automatique de la parole pour les langues peu dotées de Malaisie Samson Juan, Sarah Flora 09 July 2015 (has links)
Les langues en Malaisie meurent à un rythme alarmant. A l'heure actuelle, 15 langues sont en danger alors que deux langues se sont éteintes récemment. Une des méthodes pour sauvegarder les langues est de les documenter, mais c'est une tâche fastidieuse lorsque celle-ci est effectuée manuellement.Un système de reconnaissance automatique de la parole (RAP) serait utile pour accélérer le processus de documentation de ressources orales. Cependant, la construction des systèmes de RAP pour une langue cible nécessite une grande quantité de données d'apprentissage comme le suggèrent les techniques actuelles de l'état de l'art, fondées sur des approches empiriques. Par conséquent, il existe de nombreux défis à relever pour construire des systèmes de transcription pour les langues qui possèdent des quantités de données limitées.L'objectif principal de cette thèse est d'étudier les effets de l'utilisation de données de langues étroitement liées, pour construire un système de RAP pour les langues à faibles ressources en Malaisie. Des études antérieures ont montré que les méthodes inter-lingues et multilingues pourraient améliorer les performances des systèmes de RAP à faibles ressources. Dans cette thèse, nous essayons de répondre à plusieurs questions concernant ces approches: comment savons-nous si une langue est utile ou non dans un processus d'apprentissage trans-lingue ? Comment la relation entre la langue source et la langue cible influence les performances de la reconnaissance de la parole ? La simple mise en commun (pooling) des données d'une langue est-elle une approche optimale ?Notre cas d'étude est l'iban, une langue peu dotée de l'île de Bornéo. Nous étudions les effets de l'utilisation des données du malais, une langue locale dominante qui est proche de l'iban, pour développer un système de RAP pour l'iban, sous différentes contraintes de ressources. Nous proposons plusieurs approches pour adapter les données du malais afin obtenir des modèles de prononciation et des modèles acoustiques pour l'iban.Comme la contruction d'un dictionnaire de prononciation à partir de zéro nécessite des ressources humaines importantes, nous avons développé une approche semi-supervisée pour construire rapidement un dictionnaire de prononciation pour l'iban. Celui-ci est fondé sur des techniques d'amorçage, pour améliorer la correspondance entre les données du malais et de l'iban.Pour augmenter la performance des modèles acoustiques à faibles ressources, nous avons exploré deux techniques de modélisation : les modèles de mélanges gaussiens à sous-espaces (SGMM) et les réseaux de neurones profonds (DNN). Nous avons proposé, dans ce cadre, des méthodes de transfert translingue pour la modélisation acoustique permettant de tirer profit d'une grande quantité de langues “proches” de la langue cible d'intérêt. Les résultats montrent que l'utilisation de données du malais est bénéfique pour augmenter les performances des systèmes de RAP de l'iban. Par ailleurs, nous avons également adapté les modèles SGMM et DNN au cas spécifique de la transcription automatique de la parole non native (très présente en Malaisie). Nous avons proposé une approche fine de fusion pour obtenir un SGMM multi-accent optimal. En outre, nous avons développé un modèle DNN spécifique pour la parole accentuée. Les deux approches permettent des améliorations significatives de la précision du système de RAP. 
De notre étude, nous observons que les modèles SGMM et, de façon plus surprenante, les modèles DNN sont très performants sur des jeux de données d'apprentissage en quantité limités. / Languages in Malaysia are dying in an alarming rate. As of today, 15 languages are in danger while two languages are extinct. One of the methods to save languages is by documenting languages, but it is a tedious task when performed manually.Automatic Speech Recognition (ASR) system could be a tool to help speed up the process of documenting speeches from the native speakers. However, building ASR systems for a target language requires a large amount of training data as current state-of-the-art techniques are based on empirical approach. Hence, there are many challenges in building ASR for languages that have limited data available.The main aim of this thesis is to investigate the effects of using data from closely-related languages to build ASR for low-resource languages in Malaysia. Past studies have shown that cross-lingual and multilingual methods could improve performance of low-resource ASR. In this thesis, we try to answer several questions concerning these approaches: How do we know which language is beneficial for our low-resource language? How does the relationship between source and target languages influence speech recognition performance? Is pooling language data an optimal approach for multilingual strategy?Our case study is Iban, an under-resourced language spoken in Borneo island. We study the effects of using data from Malay, a local dominant language which is close to Iban, for developing Iban ASR under different resource constraints. We have proposed several approaches to adapt Malay data to obtain pronunciation and acoustic models for Iban speech.Building a pronunciation dictionary from scratch is time consuming, as one needs to properly define the sound units of each word in a vocabulary. We developed a semi-supervised approach to quickly build a pronunciation dictionary for Iban. It was based on bootstrapping techniques for improving Malay data to match Iban pronunciations.To increase the performance of low-resource acoustic models we explored two acoustic modelling techniques, the Subspace Gaussian Mixture Models (SGMM) and Deep Neural Networks (DNN). We performed cross-lingual strategies using both frameworks for adapting out-of-language data to Iban speech. Results show that using Malay data is beneficial for increasing the performance of Iban ASR. We also tested SGMM and DNN to improve low-resource non-native ASR. We proposed a fine merging strategy for obtaining an optimal multi-accent SGMM. In addition, we developed an accent-specific DNN using native speech data. After applying both methods, we obtained significant improvements in ASR accuracy. From our study, we observe that using SGMM and DNN for cross-lingual strategy is effective when training data is very limited.
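A highly simplified sketch of the semi-supervised bootstrapping idea for the Iban pronunciation dictionary described above; the grapheme-to-phoneme training and verification functions are hypothetical placeholders passed in by the caller, not the thesis's actual tools.

```python
# Simplified sketch of semi-supervised pronunciation-dictionary bootstrapping:
# start from a G2P model trained on Malay, generate candidate Iban pronunciations,
# correct a subset (manually or by rules), retrain, and iterate. All callables
# (malay_g2p, verify_batch, train_g2p) are hypothetical placeholders.

def bootstrap_dictionary(iban_words, malay_g2p, verify_batch, train_g2p, n_rounds=3):
    g2p = malay_g2p                                    # bootstrap model from Malay data
    lexicon = {}
    for _ in range(n_rounds):
        candidates = {w: g2p(w) for w in iban_words if w not in lexicon}
        corrected = verify_batch(candidates)           # verified/corrected subset
        lexicon.update(corrected)
        g2p = train_g2p(lexicon)                       # retrain on verified entries
    # Remaining words keep their automatically generated pronunciations.
    lexicon.update({w: g2p(w) for w in iban_words if w not in lexicon})
    return lexicon
```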
|
678 |
La représentation des documents par réseaux de neurones pour la compréhension de documents parlés / Neural network representations for spoken documents understanding Janod, Killian 27 November 2017 (has links)
Les méthodes de compréhension de la parole visent à extraire des éléments de sens pertinents du signal parlé. On distingue principalement deux catégories dans la compréhension du signal parlé : la compréhension de dialogues homme/machine et la compréhension de dialogues homme/homme. En fonction du type de conversation, la structure des dialogues et les objectifs de compréhension varient. Cependant, dans les deux cas, les systèmes automatiques reposent le plus souvent sur une étape de reconnaissance automatique de la parole pour réaliser une transcription textuelle du signal parlé. Les systèmes de reconnaissance automatique de la parole, même les plus avancés, produisent dans des contextes acoustiques complexes des transcriptions erronées ou partiellement erronées. Ces erreurs s'expliquent par la présence d'informations de natures et de fonction variées, telles que celles liées aux spécificités du locuteur ou encore l'environnement sonore. Celles-ci peuvent avoir un impact négatif important pour la compréhension. Dans un premier temps, les travaux de cette thèse montrent que l'utilisation d'autoencodeur profond permet de produire une représentation latente des transcriptions d'un plus haut niveau d'abstraction. Cette représentation permet au système de compréhension de la parole d'être plus robuste aux erreurs de transcriptions automatiques. Dans un second temps, nous proposons deux approches pour générer des représentations robustes en combinant plusieurs vues d'un même dialogue dans le but d'améliorer les performances du système de compréhension. La première approche montre que plusieurs espaces thématiques différents peuvent être combinés simplement à l'aide d'autoencodeur ou dans un espace thématique latent pour produire une représentation qui augmente l'efficacité et la robustesse du système de compréhension de la parole. La seconde approche propose d'introduire une forme d'information de supervision dans les processus de débruitages par autoencodeur. Ces travaux montrent que l'introduction de supervision de transcription dans un autoencodeur débruitant dégrade les représentations latentes, alors que les architectures proposées permettent de rendre comparables les performances d'un système de compréhension reposant sur une transcription automatique et un système de compréhension reposant sur des transcriptions manuelles. / Applications of spoken language understanding aim to extract relevant items of meaning from the spoken signal. There are two distinct types of spoken language understanding: understanding of human/human dialogue and understanding of human/machine dialogue. Given a type of conversation, the structure of dialogues and the goal of the understanding process vary. However, in both cases, automatic systems most of the time include a speech recognition step to generate the textual transcript of the spoken signal. Speech recognition systems in adverse conditions, even the most advanced ones, produce erroneous or partly erroneous transcripts of speech. Those errors can be explained by the presence of information of various natures and functions, such as speaker and ambience specificities, and they can have an important adverse impact on the performance of the understanding process. The first part of the contribution in this thesis shows that using deep autoencoders produces a more abstract latent representation of the transcript. This latent representation allows the spoken language understanding system to be more robust to automatic transcription mistakes. In the second part, we propose two different approaches to generate more robust representations by combining multiple views of a given dialogue in order to improve the results of the spoken language understanding system. The first approach combines multiple thematic spaces to produce a better representation. The second one introduces new autoencoder architectures that use supervision in the denoising autoencoders. These contributions show that these architectures reduce the difference in performance between a spoken language understanding system using automatic transcripts and one using manual transcripts.
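A minimal PyTorch sketch of a denoising autoencoder over bag-of-words transcript vectors, of the kind used above to obtain a more abstract latent representation; the dimensions, corruption level and training details are illustrative, not those of the thesis.

```python
# Minimal sketch (PyTorch) of a denoising autoencoder over transcript vectors: the
# input is corrupted and the network is trained to reconstruct the clean vector,
# yielding an abstract latent representation. All hyperparameters are illustrative.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim=2000, hidden_dim=200):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x, noise_std=0.2):
        corrupted = x + noise_std * torch.randn_like(x)  # additive corruption
        latent = self.encoder(corrupted)
        return self.decoder(latent), latent

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(32, 2000)                 # fake batch of transcript vectors (illustrative)
for _ in range(10):
    reconstruction, latent = model(x)
    loss = loss_fn(reconstruction, x)    # reconstruct the *clean* input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# `latent` is the robust representation fed to the understanding system.
```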
|
679 |
Reconhecimento de voz através de unidades menores do que a palavra, utilizando Wavelet Packet e SVM, em uma nova estrutura hierárquica de decisão Bresolin, Adriano de Andrade 02 December 2008 (has links)
Previous issue date: 2008-12-02 / Automatic speech recognition by machine has been the target of researchers for the past five decades. In this period there have been numerous advances, for example in the field of isolated word (command) recognition, which currently achieves very high recognition rates. However, we are still far from developing a system that performs comparably to a human being, i.e., automatic continuous speech recognition. One of the great challenges of continuous speech recognition research is the large number of patterns: modern languages such as English, French, Spanish and Portuguese have approximately 500,000 words, or patterns, to be identified. The purpose of this study is to use units smaller than the word, such as phonemes, diphones and syllables, as the base units for speech recognition, aiming to recognize arbitrary words without necessarily using them as recognition units. The main goal is to reduce the restriction imposed by the excessive number of patterns. In order to validate this proposal, the system was tested on isolated word recognition in the speaker-dependent case. The phonetic characteristics of Brazilian Portuguese were used to develop the hierarchical decision system; these decisions are made using SVM (Support Vector Machine) neural networks grouped as committee machines. The main speech features were obtained from the Wavelet Packet Transform; MFCC (Mel-Frequency Cepstral Coefficient) descriptors are also used in this work. It was concluded that the proposed method showed good results in the recognition of vowels, consonants (syllables) and words when compared with other existing methods in the literature / O reconhecimento automático da voz por máquinas inteligentes tem sido a meta de muitos pesquisadores nas últimas cinco décadas. Neste período, inúmeros avanços foram alcançados, como por exemplo no campo de reconhecimento de palavras isoladas (comandos), o qual atualmente apresenta taxas de reconhecimento muito altas. No entanto, ainda se está longe de desenvolver um sistema que possa ter um desempenho parecido com o ser humano, ou seja, reconhecimento automático de voz em modo contínuo. Um dos grandes desafios das pesquisas de reconhecimento de voz contínuo é a grande quantidade de padrões existentes, pois as linguagens modernas tais como: Inglês, Francês, Espanhol e Português possuem aproximadamente 500.000 palavras ou padrões a serem identificados. A proposta deste trabalho é utilizar unidades menores do que a palavra tais como: fonemas, difones e sílabas como unidades base para o reconhecimento da voz, visando o reconhecimento de quaisquer palavras sem necessariamente utilizá-las. O objetivo principal deste trabalho é reduzir a restrição imposta pela quantidade excessiva de padrões existentes, ou seja, a quantidade excessiva de palavras. Com o objetivo de validar esta proposta, o sistema foi desenvolvido e testado para o reconhecimento de palavras isoladas no modo dependente do locutor. O sistema apresentado neste trabalho foi desenvolvido com uma lógica de reconhecimento hierárquica baseada nas características de produção dos fonemas da língua Portuguesa do Brasil. Estas decisões são feitas através da utilização de redes neurais do tipo Máquinas de Vetor de Suporte agrupadas na forma de Máquinas de Comitê. Os principais descritores do sinal de voz utilizados foram obtidos através da Transformada Wavelet Packet. Os descritores MFCC (Mel-Frequency Cepstral Coefficient) também são utilizados neste trabalho. Pode-se concluir que o método proposto apresentou bons resultados nas etapas de reconhecimento de vogais, consoantes (sílabas) e palavras se comparado com outros métodos existentes na literatura
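A rough sketch of the wavelet-packet-plus-SVM idea from the abstract above, using per-node wavelet-packet energies from PyWavelets and a single scikit-learn SVM standing in for the committee of SVMs in the hierarchical decision structure; the signal, labels and parameters are invented.

```python
# Sketch: wavelet-packet energy features for a speech frame plus an SVM classifier.
# PyWavelets and scikit-learn are stand-ins for the thesis's actual feature extraction
# and SVM committees; data and parameters below are purely illustrative.
import numpy as np
import pywt
from sklearn.svm import SVC

def wavelet_packet_energies(frame, wavelet='db4', level=4):
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, mode='symmetric', maxlevel=level)
    nodes = wp.get_level(level, order='freq')          # terminal nodes, ordered by frequency
    return np.array([np.sum(np.square(node.data)) for node in nodes])

rng = np.random.default_rng(0)
frames = rng.standard_normal((40, 512))                # fake speech frames
labels = rng.integers(0, 2, size=40)                   # e.g. vowel vs consonant (illustrative)

features = np.array([wavelet_packet_energies(f) for f in frames])
classifier = SVC(kernel='rbf').fit(features, labels)   # one node of a hierarchical decision tree
print(classifier.predict(features[:5]))
```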
|
680 |
Uso de parâmetros multifractais no reconhecimento de locutor / Use of multifractal parameters for speaker recognition González González, Diana Cristina, 1984- 19 August 2018 (has links)
Orientadores: Lee Luan Ling, Fábio Violaro / Dissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação
Previous issue date: 2011 / Resumo: Esta dissertação apresenta a implementação de um sistema de Reconhecimento Automático de Locutor (ASR). Este sistema emprega um novo parâmetro de características de locutor baseado no modelo multifractal "VVGM" (Variable Variance Gaussian Multiplier). A metodologia adotada para o desenvolvimento deste sistema foi formulada em duas etapas. Inicialmente foi implementado um sistema ASR tradicional, usando como vetor de características os MFCCs (Mel-Frequency Cepstral Coefficients) e modelo de mistura gaussiana (GMM) como classificador, uma vez que é uma configuração clássica, adotada como referência na literatura. Este procedimento permite ter um conhecimento amplo sobre a produção de sinais de voz, além de um sistema de referência para comparar o desempenho do novo parâmetro VVGM. A segunda etapa foi dedicada ao estudo de processos multifractais em sinais de fala, já que eles enfatizam-se na análise das informações contidas nas partes não estacionárias do sinal avaliado. Aproveitando essa característica, sinais de fala são modelados usando o modelo VVGM. Este modelo é baseado no processo de cascata multiplicativa binomial, e usa as variâncias dos multiplicadores de cada estágio como um novo vetor de característica. As informações obtidas pelos dois métodos são diferentes e complementares. Portanto, é interessante combinar os parâmetros clássicos com os parâmetros multifractais, a fim de melhorar o desempenho dos sistemas de reconhecimento de locutor. Os sistemas propostos foram avaliados por meio de três bases de dados de fala com diferentes configurações, tais como taxas de amostragem, número de falantes e frases e duração do treinamento e teste. Estas diferentes configurações permitem determinar as características do sinal de fala requeridas pelo sistema. Do resultado dos experimentos foi observado que o sistema de identificação de locutor usando os parâmetros VVGM alcançou taxas de acerto significativas, o que mostra que este modelo multifractal contém informações relevantes sobre a identidade de cada locutor. Por exemplo, a segunda base de dados é composta de sinais de fala de 71 locutores (50 homens e 21 mulheres) digitalizados a 22,05 kHz com 16 bits/amostra. O treinamento foi feito com 20 frases para cada locutor, com uma duração total de cerca de 70 s. Avaliando o sistema ASR baseado em VVGM, com locuções de teste de 3 s de comprimento, foi obtida uma taxa de reconhecimento de 91,30%. Usando estas mesmas condições, o sistema ASR baseado em MFCCs atingiu uma taxa de reconhecimento de 98,76%. No entanto, quando os dois parâmetros foram combinados, a taxa de reconhecimento aumentou para 99,43%, mostrando que a nova característica acrescenta informações importantes para o sistema de reconhecimento de locutor / Abstract: This dissertation presents an Automatic Speaker Recognition (ASR) system, which employs a new parameter based on the ¿VVGM? (Variable Variance Gaussian Multiplier) multifractal model. The methodology adopted for the development of this system is formulated in two stages. Initially, a traditional ASR system was implemented, based on the use of Mel-Frequency Cepstral Coefficients (MFCCs) and the Gaussian mixture models (GMMs) as the classifier, since it is the method with the best results in the literature. This procedure allows having a broad knowledge about the production of speech signals and a reference system to compare the performance of the new VVGM parameter. 
The second stage was dedicated to the study of multifractal processes for speech signals, since they make it possible to analyze information contained in the non-stationary parts of the evaluated signal. Taking advantage of this characteristic, speech signals are modeled using the VVGM model, which is based on the binomial multiplicative cascade process and uses the variances of the multipliers at each stage as a new speech feature. The information obtained by the two methods is different and complementary. Therefore, it is interesting to combine the classic parameters with the multifractal parameters in order to improve the performance of speaker recognition systems. The proposed systems were evaluated using three databases with different settings, such as sampling rates, number of speakers and phrases, and duration of training and testing. These different configurations allow the determination of the characteristics of the speech signal required by the system. In the experiments, the speaker identification system based on the VVGM parameters achieved significant success rates, which shows that this multifractal model contains relevant information about the identity of each speaker. For example, the second database is composed of speech signals of 71 speakers (50 men and 21 women) digitized at 22.05 kHz with 16 bits/sample. The training was done with 20 phrases for each speaker, with a total duration of approximately 70 s. Evaluating the VVGM-based ASR system on this database with 3 s test utterances, a recognition rate of 91.3% was obtained. Under the same conditions, the ASR system based on MFCCs reached a recognition rate of 98.76%. However, when the two parameters are combined, the recognition rate increases to 99.43%, showing that the new feature adds substantial information to the speaker recognition system / Mestrado / Telecomunicações e Telemática / Mestre em Engenharia Elétrica
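A simplified sketch of extracting per-stage multiplier variances from a dyadic (binomial) cascade view of a frame's energy, in the spirit of the VVGM feature described above; this is an illustration under simplifying assumptions, not the dissertation's exact VVGM estimation procedure.

```python
# Simplified sketch of VVGM-style features: at each cascade stage the energy of a
# segment is split into two halves, the left-half multiplier is energy_left / energy_parent,
# and the variance of these multipliers at each stage forms the feature vector.
# The signal and number of stages are illustrative.
import numpy as np

def vvgm_features(signal, n_stages=6):
    energy = np.square(np.asarray(signal, dtype=float))
    variances = []
    segments = [energy]
    for _ in range(n_stages):
        multipliers, children = [], []
        for seg in segments:
            half = len(seg) // 2
            left, right = seg[:half], seg[half:]
            total = seg.sum()
            if total > 0:
                multipliers.append(left.sum() / total)
            children.extend([left, right])
        variances.append(np.var(multipliers))
        segments = children
    return np.array(variances)   # one multiplier variance per cascade stage

rng = np.random.default_rng(1)
print(vvgm_features(rng.standard_normal(1024)))
```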
|