Global ETD Search

1	Wavelet-based techniques for speech recognition Farooq, Omar January 2002 In this thesis, new wavelet-based techniques have been developed for the extraction of features from speech signals for the purpose of automatic speech recognition (ASR). One of the advantages of the wavelet transform over the short-time Fourier transform (STFT) is its capability to process non-stationary signals. Since speech signals are not strictly stationary, the wavelet transform is a better choice for time-frequency transformation of these signals. In addition, it has compactly supported basis functions, thereby reducing the amount of computation compared with the STFT, where an overlapping window is needed. 621 Phoneme recognition
2	Phoneme Recognition by hidden Markov modeling Brighton, Andrew P. January 1989 No description available. Phoneme Recognition Hidden Markov Modeling Speech Recognition
3	Fusion of phoneme recognisers for South African English Strydom, George Wessel 03 1900 Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / ENGLISH ABSTRACT: Phoneme recognition systems typically suffer from low classification accuracy. Recognition for South African English is especially difficult, due to the variety of vastly different accent groups. This thesis investigates whether a fusion of classifiers, each trained on a specific accent group, can outperform a single general classifier trained on all accent groups. We implemented basic voting and score fusion techniques, from which a small increase in classifier accuracy could be seen. To ensure that similarly-valued output scores from different classifiers imply the same opinion, these classifiers need to be calibrated before fusion. The main focus of this thesis is calibration with the Pool Adjacent Violators algorithm. We achieved impressive gains in accuracy with this method, and an in-depth investigation was made into the role of the prior and the connection with the proportion of target to non-target scores. Calibration and fusion using the information metric Cllr was shown to perform impressively with synthetic data, but only minor increases in accuracy were found for our phoneme recognition system. The best results for this technique were achieved by calibrating each classifier individually, fusing these calibrated classifiers and then finally calibrating the fused system. Boosting and Bagging classifiers were also briefly investigated as possible phoneme recognisers. Our attempt did not achieve the target accuracy of the classifier trained on all the accent groups. The inherent difficulties typical of phoneme recognition were highlighted. Low per-class accuracies, a large number of classes and an unbalanced speech corpus all had a negative influence on the effectiveness of the tested calibration and fusion techniques.
/ AFRIKAANSE OPSOMMING: Foneemherkenningstelsels het tipies lae klassifikasie akkuraatheid. As gevolg van die verskeidenheid verskillende aksent groepe is herkenning vir Suid-Afrikaanse Engels veral moeilik. Hierdie tesis ondersoek of ’n fusie van klassifiseerders, elk afgerig op ’n spesifieke aksent groep, beter kan doen as ’n enkele klassifiseerder wat op alle groepe afgerig is. Ons het basiese stem- en tellingfusie tegnieke geïmplementeer, wat tot ’n klein verbetering in klassifiseerder akkuraatheid gelei het. Om te verseker dat soortgelyke uittreetellings van verskillende klassifiseerders dieselfde opinie impliseer, moet hierdie klassifiseerders gekalibreer word voor fusie. Die hoof fokuspunt van hierdie tesis is kalibrasie met die Pool Adjacent Violators algoritme. Indrukwekkende toenames in akkuraatheid is behaal met hierdie metode en ’n in-diepte ondersoek is ingestel oor die rol van die aanneemlikheidswaarskynlikhede en die verwantskap met die verhouding van teiken tot nie-teiken tellings. Kalibrasie en fusie met behulp van die informasie maatstaf Cllr lewer indrukwekkende resultate met sintetiese data, maar slegs klein verbeterings in akkuraatheid is gevind vir ons foneemherkenningstelsel. Die beste resultate vir hierdie tegniek is verkry deur elke klassifiseerder afsonderlik te kalibreer, hierdie gekalibreerde klassifiseerders dan te kombineer en dan die finale gekombineerde stelsel weer te kalibreer. Boosting en Bagging klassifiseerders is ook kortliks ondersoek as moontlike foneem herkenners. Ons poging het nie die akkuraatheid van ons basislyn klassifiseerder (wat op alle data afgerig is) bereik nie. Die inherente probleme wat tipies is tot foneemherkenning is uitgewys. Lae per-klas akkuraatheid, ’n groot hoeveelheid klasse en ’n ongebalanseerde spraak korpus het almal ’n negatiewe invloed op die effektiwiteit van die getoetsde kalibrasie en fusie tegnieke gehad.
Phoneme recognition Dissertations -- Electronic engineering Theses -- Electronic engineering Automatic speech recognition Electrical and Electronic Engineering
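The calibration step named in the abstract above can be illustrated with a minimal sketch of the Pool Adjacent Violators idea: sort the classifier scores, then repeatedly pool neighbouring blocks whose mean target rate decreases, so that the calibrated value never falls as the score rises. The function name `pav_calibrate` and the toy scores are assumptions made for illustration; this is not the thesis implementation, and tied scores are not given any special treatment here.

```python
def pav_calibrate(scores, labels):
    """Fit a non-decreasing map from scores to target proportions (PAV sketch).

    scores: classifier output scores (hypothetical toy values in the usage below)
    labels: 1 for a target trial, 0 for a non-target trial
    Returns a list of (score, calibrated_value) pairs sorted by score.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]; its mean is the calibrated value
    for _, label in pairs:
        blocks.append([float(label), 1])
        # Pool adjacent blocks while the earlier block has a larger mean (a violation).
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
    calibrated = []
    for label_sum, count in blocks:
        calibrated.extend([label_sum / count] * count)
    return [(score, value) for (score, _), value in zip(pairs, calibrated)]


if __name__ == "__main__":
    toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    toy_labels = [0, 1, 0, 0, 1, 1]
    for score, value in pav_calibrate(toy_scores, toy_labels):
        print(f"{score:.1f} -> {value:.3f}")
```

In the toy data the target at score 0.2 is followed by two non-targets, so those three trials are pooled into one block with a shared calibrated value.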
4	Ανάπτυξη συστήματος αναγνώρισης πολυ-γλωσσικών φωνημάτων για τις ανάγκες της αυτόματης αναγνώρισης γλώσσας Γιούρα, Ευδοκία 04 February 2008 Στα πλαίσια της ανάλυσης ομιλίας, η παρούσα διατριβή παρουσιάζει έναν εύρωστο Αναγνωριστή Φωνημάτων ανεξαρτήτου Γλώσσας για τις ανάγκες της Αυτόματης Αναγνώρισης Γλώσσας. Η υλοποίηση του Συστήματος βασίζεται στους MFCC συντελεστές οι οποίοι αποτελούν τους χαρακτηριστικούς περιγραφείς ομιλίας, στη διαδικασία διανυσματικής κβαντοποίησης κατά την οποία δημιουργούνται τα κωδικά βιβλία εκπαίδευσης του Συστήματος και στα πιθανοτικά νευρωνικά δίκτυα (Probabilistic Neural Networks) για την εκπαίδευση του Συστήματος και την αναγνώριση των άγνωστων φωνημάτων. / In this thesis we present a language-independent phoneme recognizer for the needs of Automatic Language Identification. The system is based on: 1. MFCCs for the acoustic-spectral representation of phonemes, 2. vector quantization for creating the training codebooks of the system and 3. Probabilistic Neural Networks (PNNs) for the training of the system and the classification of unknown phonemes. Αναγνώριση φωνήματος Ανεξαρτήτου γλώσσας 621.382 23 Phoneme recognition Language independent Polyphonemes Monophonemes
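As a rough illustration of the classification stage described in the abstract above, a probabilistic neural network can be viewed as a Parzen-window classifier: every stored training vector contributes a Gaussian kernel, the kernels are averaged per class, and the class with the largest average wins. The sketch below assumes that view. The function name `pnn_classify`, the smoothing width `sigma`, and the two-dimensional toy codebooks are hypothetical stand-ins; a real system would use MFCC-derived codebook vectors, and the abstract does not specify the kernel or its width.

```python
import math


def pnn_classify(x, codebooks, sigma=1.0):
    """Return the class whose stored vectors give the highest average kernel response.

    x: feature vector to classify
    codebooks: dict mapping a class label to a list of stored vectors
               (toy values here; assumed to come from vector quantization)
    sigma: Gaussian kernel width (an illustrative assumption)
    """
    best_label, best_score = None, -1.0
    for label, vectors in codebooks.items():
        total = 0.0
        for v in vectors:
            # Pattern layer: one Gaussian kernel per stored vector.
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Summation layer: average the kernel responses of the class.
        score = total / len(vectors)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    toy_codebooks = {
        "a": [[0.0, 0.0], [0.0, 1.0]],
        "i": [[5.0, 5.0], [6.0, 5.0]],
    }
    print(pnn_classify([0.2, 0.3], toy_codebooks))
    print(pnn_classify([5.5, 5.1], toy_codebooks))
```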
5	Phoneme Recognition Using Neural Network and Sequence Learning Model Huang, Yiming 27 April 2009 No description available. Artificial Intelligence Design Electrical Engineering Engineering Phoneme Recognition Neural Network Long-Term Memory Sequence Learning
6	Fast accurate diphone-based phoneme recognition Du Preez, Marianne 03 1900 Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / Statistical speech recognition systems typically utilise a set of statistical models of subword units based on the set of phonemes in a target language. However, in continuous speech it is important to consider co-articulation effects and the interactions between neighbouring sounds, as over-generalisation of the phonetic models can negatively affect system accuracy. Traditionally co-articulation in continuous speech is handled by incorporating contextual information into the subword model by means of context-dependent models, which exponentially increase the number of subword models. In contrast, transitional models aim to handle co-articulation by modelling the interphone dynamics found in the transitions between phonemes. This research aimed to perform an objective analysis of diphones as subword units for use in hidden Markov model-based continuous-speech recognition systems, with special emphasis on a direct comparison to a context-dependent biphone-based system in terms of complexity, accuracy and computational efficiency in similar parametric conditions. To simulate practical conditions, the experiments were designed to evaluate these systems in a low resource environment (limited supply of training data, computing power and system memory) while still attempting fast, accurate phoneme recognition. Adaptation techniques designed to exploit characteristics inherent in diphones, as well as techniques used for effective parameter estimation and state-level tying were used to reduce resource requirements while simultaneously increasing parameter reliability. These techniques include diphthong splitting, utilisation of a basic diphone grammar, diphone set completion, maximum a posteriori estimation and decision-tree based state clustering algorithms.
The experiments were designed to evaluate the contribution of each adaptation technique individually and subsequently compare the optimised diphone-based recognition system to a biphone-based recognition system that received similar treatment. Results showed that diphone-based recognition systems perform better than both traditional phoneme-based systems and context-dependent biphone-based systems when evaluated in similar parametric conditions. Therefore, diphones are effective subword units, which carry suprasegmental knowledge of speech signals and provide an excellent compromise between detailed co-articulation modelling and acceptable system performance. Phoneme recognition Diphones Acoustic speech modelling Automatic speech recognition Electrical and Electronic Engineering
7	Perception of prosody by cochlear implant recipients Van Zyl, Marianne January 2014 Recipients of present-day cochlear implants (CIs) display remarkable success with speech recognition in quiet, but not with speech recognition in noise. Normal-hearing (NH) listeners, in contrast, perform relatively well with speech recognition in noise. Understanding which speech features support successful perception in noise in NH listeners could provide insight into the difficulty that CI listeners experience in background noise. One set of speech features that has not been thoroughly investigated with regard to its noise immunity is prosody. Existing reports show that CI users have difficulty with prosody perception. The present study endeavoured to determine if prosody is particularly noise-immune in NH listeners and whether the difficulty that CI users experience in noise can be partly explained by poor prosody perception. This was done through the use of three listening experiments. The first listening experiment examined the noise immunity of prosody in NH listeners by comparing perception of a prosodic pattern to word recognition in speech-weighted noise (SWN). Prosody perception was tested in a two-alternative forced-choice (2AFC) test paradigm using sentences conveying either conditional or unconditional permission, agreement or approval. Word recognition was measured in an open set test paradigm using meaningful sentences. Results indicated that the deterioration slope of prosody recognition (corrected for guessing) was significantly shallower than that of word recognition. At the lowest signal-to-noise ratio (SNR) tested, prosody recognition was significantly better than word recognition. The second experiment compared recognition of prosody and phonemes in SWN by testing perception of both in a 2AFC test paradigm. NH and CI listeners were tested using single words as stimuli.
Two prosody recognition tasks were used; the first task required discrimination between questions and statements, while the second task required discrimination between a certain and a hesitant attitude. Phoneme recognition was measured with three vowel pairs selected according to specific acoustic cues. Contrary to the first experiment, the results of this experiment indicated that vowel recognition was significantly better than prosody recognition in noise in both listener groups. The difference between the results of the first and second experiments was thought to have been due to either the test paradigm difference in the first experiment (closed set versus open set), or a difference in stimuli between the experiments (single words versus sentences). The third experiment tested emotional prosody and phoneme perception of NH and CI listeners in SWN using sentence stimuli and a 4AFC test paradigm for both tasks. In NH listeners, deterioration slopes of prosody and phonemes (vowels and consonants) did not differ significantly, and at the lowest SNR tested there was no significant difference in recognition of the different types of speech material. In the CI group, prosody and vowel perception deteriorated with a similar slope, while consonant recognition showed a steeper slope than prosody recognition. It is concluded that while prosody might support speech recognition in noise in NH listeners, explicit recognition of prosodic patterns is not particularly noise-immune and does not account for the difficulty that CI users experience in noise. ## Ontvangers van hedendaagse kogleêre inplantings (KI’s) behaal merkwaardige sukses met spraakherkenning in stilte, maar nie met spraakherkenning in geraas nie. Normaalhorende (NH) luisteraars, aan die ander kant, vaar relatief goed met spraakherkenning in geraas. 
Begrip van die spraakeienskappe wat suksesvolle persepsie in geraas ondersteun in NH luisteraars, kan lei tot insig in die probleme wat KI-gebruikers in agtergrondgeraas ervaar. Een stel spraakeienskappe wat nog nie deeglik ondersoek is met betrekking tot ruisimmuniteit nie, is prosodie. Bestaande navorsing wys dat KI-gebruikers sukkel met persepsie van prosodie. Die huidige studie is onderneem om te bepaal of prosodie besonder ruisimmuun is in NH luisteraars en of die probleme wat KI-gebruikers in geraas ondervind, deels verklaar kan word deur swak prosodie-persepsie. Dit is gedoen deur middel van drie luistereksperimente. Die eerste luistereksperiment het die ruisimmuniteit van prosodie in NH luisteraars ondersoek deur die persepsie van ’n prosodiese patroon te vergelyk met woordherkenning in spraakgeweegde ruis (SGR). Prosodie-persepsie is getoets in ’n twee-alternatiewe-gedwonge-keuse- (2AGK) toetsparadigma met sinne wat voorwaardelike of onvoorwaardelike toestemming, instemming of goedkeuring oordra. Woordherkenning is gemeet in ’n oopstel-toetsparadigma met betekenisvolle sinne. Resultate het aangedui dat die helling van agteruitgang van prosodieherkenning (gekorrigeer vir raai) betekenisvol platter was as dié van woordherkenning, en dat by die laagste sein-tot-ruiswaarde (STR) wat getoets is, prosodieherkenning betekenisvol beter was as woordherkenning. Die tweede eksperiment het prosodie- en foneemherkenning in SGR vergelyk deur die persepsie van beide te toets in ’n 2AGK-toetsparadigma. NH en KI-luisteraars is getoets met enkelwoorde as stimuli. Twee prosodieherkenningstake is gebruik; die eerste taak het diskriminasie tussen vrae en stellings vereis, terwyl die tweede taak diskriminasie tussen ’n seker en onseker houding vereis het. Foneemherkenning is gemeet met drie vokaalpare wat geselekteer is na aanleiding van spesifieke akoestiese eienskappe. 
In teenstelling met die eerste eksperiment, het resultate van hierdie eksperiment aangedui dat vokaalherkenning betekenisvol beter was as prosodieherkenning in geraas in beide luisteraarsgroepe. Die verskil tussen die resultate van die eerste en tweede eksperimente kon moontlik die gevolg wees van óf die verskil in toetsparadigma in die eerste eksperiment (geslote- teenoor oop-stel), óf ’n verskil in stimuli tussen die eksperimente (enkelwoorde teenoor sinne). Die derde eksperiment het emosionele-prosodie- en foneempersepsie van NH en KI-luisteraars getoets in SGR met sinstimuli en ’n 4AGK-toetsparadigma vir beide take. In NH luisteraars het die helling van agteruitgang van die persepsie van prosodie en foneme (vokale en konsonante) nie betekenisvol verskil nie, en by die laagste STR wat getoets is, was daar nie ’n betekenisvolle verskil in die herkenning van die twee tipes spraakmateriaal nie. In die KI-groep het prosodie- en vokaalpersepsie met soortgelyke hellings agteruitgegaan, terwyl konsonantherkenning ’n steiler helling as prosodieherkenning vertoon het. Die gevolgtrekking was dat alhoewel prosodie spraakherkenning in geraas in NH luisteraars mag ondersteun, die eksplisiete herkenning van prosodiese patrone nie besonder ruisimmuun is nie en dus nie ’n verklaring bied vir die probleme wat KI-gebruikers in geraas ervaar nie. / Thesis (PhD)--University of Pretoria, 2014. / lk2014 / Electrical, Electronic and Computer Engineering / PhD / unrestricted Cochlear implants Prosody Suprasegmental cues, Speech recognition in noise Phoneme recognition UCTD Vocal emotion Speech-weighted noise Intonation
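The abstract above compares closed-set scores "corrected for guessing" with open-set word recognition, but does not state which correction was used. One widely used form, shown here only as an assumed illustration, rescales the observed proportion correct so that chance performance maps to 0 and perfect performance maps to 1. The function name `correct_for_guessing` and the clipping at zero are choices made for this sketch.

```python
def correct_for_guessing(p_observed, n_alternatives):
    """Rescale a forced-choice score so that chance maps to 0 and perfect to 1.

    p_observed: observed proportion correct, between 0 and 1
    n_alternatives: number of response alternatives (2 for 2AFC, 4 for 4AFC)
    Scores below chance are clipped to 0 (a choice made for this sketch).
    """
    chance = 1.0 / n_alternatives
    return max(0.0, (p_observed - chance) / (1.0 - chance))


if __name__ == "__main__":
    # 75% correct in a two-alternative task is halfway between chance and perfect.
    print(correct_for_guessing(0.75, 2))
    # 25% correct in a four-alternative task is exactly chance.
    print(correct_for_guessing(0.25, 4))
```

Without such a rescaling, a 2AFC score of 50% and an open-set score of 0% would look very different even though both reflect no usable information.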
8	RAMBLE: robust acoustic modeling for Brazilian learners of English / RAMBLE: modelagem acústica robusta para estudantes brasileiros de Inglês Shulby, Christopher Dane 08 August 2018 The gains made by current deep-learning techniques have often come with the price tag of big data, and where that data is not available, a new solution must be found. Such is the case for accented and noisy speech, where large databases do not exist and data augmentation techniques, which are less than perfect, present an even larger obstacle. Another problem is that state-of-the-art results are rarely reproducible because they use proprietary datasets, pretrained networks and/or weight initializations from other larger networks. An example of a low resource scenario exists even in the fifth-largest country in the world by area, home to most of the speakers of the seventh most spoken language on earth. Brazil is the leader in the Latin-American economy and as a BRIC country aspires to become an ever-stronger player in the global marketplace. Still, English proficiency is low, even for professionals in businesses and universities. Low intelligibility and strong accents can damage professional credibility. It has been established in the literature for foreign language teaching that it is important that adult learners are made aware of their errors, as outlined by the Noticing Theory, which explains that a learner is more successful when he is able to learn from his own mistakes. An essential objective of this dissertation is to classify phonemes in the acoustic model, which is needed to properly identify phonemic errors automatically. A common belief in the community is that deep learning requires large datasets to be effective. This happens because brute force methods create a highly complex hypothesis space which requires large and complex networks, which in turn demand a large number of data samples in order to generate useful networks.
Besides that, the loss functions used in neural learning do not provide statistical learning guarantees and only guarantee that the network can memorize the training space well. In the case of accented or noisy speech, where a new sample can carry a great deal of variation from the training samples, the generalization of such models suffers. The main objective of this dissertation is to investigate how more robust acoustic generalizations can be made, even with little data and noisy accented-speech data. The approach here is to take advantage of raw feature extraction provided by deep learning techniques and instead focus on how learning guarantees can be provided for small datasets to produce robust results for acoustic modeling without depending on big data. This has been done by careful and intelligent parameter and architecture selection within the framework of statistical learning theory. Here, an intelligently defined CNN architecture, together with context windows and a knowledge-driven hierarchical tree of SVM classifiers, achieves nearly state-of-the-art frame-wise phoneme recognition results with absolutely no pretraining or external weight initialization. A goal of this thesis is to produce transparent and reproducible architectures with high frame-level accuracy, comparable to the state of the art. Additionally, a convergence analysis based on the learning guarantees of statistical learning theory is performed in order to demonstrate the generalization capacity of the model. The model achieves 39.7% error in framewise classification and a 43.5% phone error rate using deep feature extraction and SVM classification even with little data (less than 7 hours). These results are comparable to studies which use well over ten times that amount of data.
Beyond the intrinsic evaluation, the model also achieves an accuracy of 88% in the identification of epenthesis, the error which is most difficult for Brazilian speakers of English. This is a 69% relative gain over the previous values in the literature. The results are significant because they show how deep feature extraction can be applied to low-data scenarios, contrary to popular belief. The extrinsic, task-based results also show how this approach could be useful in tasks like automatic error diagnosis. Another contribution is the publication of a number of freely available resources which previously did not exist, meant to aid future research in dataset creation. / Os ganhos obtidos pelas atuais técnicas de aprendizado profundo frequentemente vêm com o preço do big data e nas pesquisas em que esses grandes volumes de dados não estão disponíveis, uma nova solução deve ser encontrada. Esse é o caso do discurso marcado e com forte pronúncia, para o qual não existem grandes bases de dados; o uso de técnicas de aumento de dados (data augmentation), que não são perfeitas, apresenta um obstáculo ainda maior. Outro problema encontrado é que os resultados do estado da arte raramente são reprodutíveis porque os métodos usam conjuntos de dados proprietários, redes pré-treinadas e/ou inicializações de peso de outras redes maiores. Um exemplo de um cenário de poucos recursos existe mesmo no quinto maior país do mundo em território; lar da maioria dos falantes da sétima língua mais falada do planeta. O Brasil é o líder na economia latino-americana e, como um país do BRIC, deseja se tornar um participante cada vez mais forte no mercado global. Ainda assim, a proficiência em inglês é baixa, mesmo para profissionais em empresas e universidades. Baixa inteligibilidade e forte pronúncia podem prejudicar a credibilidade profissional.
É aceito na literatura para ensino de línguas estrangeiras que é importante que os alunos adultos sejam informados de seus erros, conforme descrito pela Noticing Theory, que explica que um aluno é mais bem sucedido quando ele é capaz de aprender com seus próprios erros. Um objetivo essencial desta tese é classificar os fonemas do modelo acústico, que é necessário para identificar automaticamente e adequadamente os erros de fonemas. Uma crença comum na comunidade é que o aprendizado profundo requer grandes conjuntos de dados para ser efetivo. Isso acontece porque os métodos de força bruta criam um espaço de hipóteses altamente complexo que requer redes grandes e complexas que, por sua vez, exigem uma grande quantidade de amostras de dados para gerar boas redes. Além disso, as funções de perda usadas no aprendizado neural não fornecem garantias estatísticas de aprendizado e apenas garantem que a rede possa memorizar bem o espaço de treinamento. No caso de fala marcada ou com forte pronúncia, em que uma nova amostra pode ter uma grande variação comparada com as amostras de treinamento, a generalização em tais modelos é prejudicada. O principal objetivo desta tese é investigar como generalizações acústicas mais robustas podem ser obtidas, mesmo com poucos dados e/ou dados ruidosos de fala marcada ou com forte pronúncia. A abordagem utilizada nesta tese visa tirar vantagem da raw feature extraction fornecida por técnicas de aprendizado profundo e obter garantias de aprendizado para conjuntos de dados pequenos para produzir resultados robustos para a modelagem acústica, sem a necessidade de big data. Isso foi feito por meio de seleção cuidadosa e inteligente de parâmetros e arquitetura no âmbito da Teoria do Aprendizado Estatístico. 
Nesta tese, uma arquitetura baseada em Redes Neurais Convolucionais (RNC) definida de forma inteligente, junto com janelas de contexto e uma árvore hierárquica orientada por conhecimento de classificadores que usam Máquinas de Vetores Suporte (Support Vector Machines - SVMs) obtém resultados de reconhecimento de fonemas baseados em frames quase no estado da arte sem absolutamente nenhum pré-treinamento ou inicialização de pesos de redes externas. Um objetivo desta tese é produzir arquiteturas transparentes e reprodutíveis com alta precisão em nível de frames, comparável ao estado da arte. Adicionalmente, uma análise de convergência baseada nas garantias de aprendizado da teoria de aprendizagem estatística é realizada para evidenciar a capacidade de generalização do modelo. O modelo possui um erro de 39,7% na classificação baseada em frames e uma taxa de erro de fonemas de 43,5% usando raw feature extraction e classificação com SVMs mesmo com poucos dados (menos de 7 horas). Esses resultados são comparáveis aos estudos que usam bem mais de dez vezes essa quantidade de dados. Além da avaliação intrínseca, o modelo também alcança uma precisão de 88% na identificação de epêntese, o erro que é mais difícil para brasileiros falantes de inglês. Este é um ganho relativo de 69% em relação aos valores anteriores da literatura. Os resultados são significativos porque mostram como raw feature extraction pode ser aplicada a cenários de poucos dados, ao contrário da crença popular. Os resultados extrínsecos também mostram como essa abordagem pode ser útil em tarefas como o diagnóstico automático de erros. Outra contribuição é a publicação de uma série de recursos livremente disponíveis que anteriormente não existiam, destinados a auxiliar futuras pesquisas na criação de conjuntos de dados. 
Acoustic modeling Aprendizado profundo Computer vision Convolutional neural networks Deep learning Máquinas de vetores de suporte Modelagem acústica Non-native phoneme recognition Processamento de fala Reconhecimento de fonemas não nativos Redes neurais convolucionais Speech processing Statistical learning theory Support vector machines Teoria do aprendizado estatístico Visão computacional
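The abstract above mentions context windows for frame-wise phoneme classification. A minimal sketch of that general idea follows, assuming each frame is a plain list of feature values and that edge frames are padded by repeating the first or last frame. The function name `add_context`, the window sizes, and the padding rule are illustrative assumptions; the thesis may configure its windows differently.

```python
def add_context(frames, left=2, right=2):
    """Concatenate each frame with its neighbours to form a context window.

    frames: list of per-frame feature vectors (lists of numbers)
    left, right: number of neighbouring frames taken on each side
                 (hypothetical defaults chosen for this sketch)
    Edge frames are padded by repeating the first or last frame.
    """
    n = len(frames)
    windows = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            idx = min(max(k, 0), n - 1)  # clamp the index at the utterance edges
            window.extend(frames[idx])
        windows.append(window)
    return windows


if __name__ == "__main__":
    toy_frames = [[1], [2], [3]]
    print(add_context(toy_frames, left=1, right=1))
```

Each output vector is `left + right + 1` times as long as a single frame, so a frame-level classifier sees the acoustic surroundings of the frame it labels.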
9	RAMBLE: robust acoustic modeling for Brazilian learners of English / RAMBLE: modelagem acústica robusta para estudantes brasileiros de Inglês Christopher Dane Shulby 08 August 2018 (has links) The gains made by current deep-learning techniques have often come with the price tag of big data and where that data is not available, a new solution must be found. Such is the case for accented and noisy speech where large databases do not exist and data augmentation techniques, which are less than perfect, present an even larger obstacle. Another problem is that state-of-the-art results are rarely reproducible because they use proprietary datasets, pretrained networks and/or weight initializations from other larger networks. An example of a low resource scenario exists even in the fifth largest land in the world; home to most of the speakers of the seventh most spoken language on earth. Brazil is the leader in the Latin-American economy and as a BRIC country aspires to become an ever-stronger player in the global marketplace. Still, English proficiency is low, even for professionals in businesses and universities. Low intelligibility and strong accents can damage professional credibility. It has been established in the literature for foreign language teaching that it is important that adult learners are made aware of their errors as outlined by the Noticing Theory, explaining that a learner is more successful when he is able to learn from his own mistakes. An essential objective of this dissertation is to classify phonemes in the acoustic model which is needed to properly identify phonemic errors automatically. A common belief in the community is that deep learning requires large datasets to be effective. This happens because brute force methods create a highly complex hypothesis space which requires large and complex networks which in turn demand a great amount of data samples in order to generate useful networks. 
Besides that, the loss functions used in neural learning does not provide statistical learning guarantees and only guarantees the network can memorize the training space well. In the case of accented or noisy speech where a new sample can carry a great deal of variation from the training samples, the generalization of such models suffers. The main objective of this dissertation is to investigate how more robust acoustic generalizations can be made, even with little data and noisy accented-speech data. The approach here is to take advantage of raw feature extraction provided by deep learning techniques and instead focus on how learning guarantees can be provided for small datasets to produce robust results for acoustic modeling without the dependency of big data. This has been done by careful and intelligent parameter and architecture selection within the framework of the statistical learning theory. Here, an intelligently defined CNN architecture, together with context windows and a knowledge-driven hierarchical tree of SVM classifiers achieves nearly state-of-the-art frame-wise phoneme recognition results with absolutely no pretraining or external weight initialization. A goal of this thesis is to produce transparent and reproducible architectures with high frame-level accuracy, comparable to the state of the art. Additionally, a convergence analysis based on the learning guarantees of the statistical learning theory is performed in order to evidence the generalization capacity of the model. The model achieves 39.7% error in framewise classification and a 43.5% phone error rate using deep feature extraction and SVM classification even with little data (less than 7 hours). These results are comparable to studies which use well over ten times that amount of data. 
Beyond the intrinsic evaluation, the model also achieves an accuracy of 88% in the identification of epenthesis, the error which is most difficult for Brazilian speakers of English This is a 69% relative percentage gain over the previous values in the literature. The results are significant because it shows how deep feature extraction can be applied to little data scenarios, contrary to popular belief. The extrinsic, task-based results also show how this approach could be useful in tasks like automatic error diagnosis. Another contribution is the publication of a number of freely available resources which previously did not exist, meant to aid future researches in dataset creation. / Os ganhos obtidos pelas atuais técnicas de aprendizado profundo frequentemente vêm com o preço do big data e nas pesquisas em que esses grandes volumes de dados não estão disponíveis, uma nova solução deve ser encontrada. Esse é o caso do discurso marcado e com forte pronúncia, para o qual não existem grandes bases de dados; o uso de técnicas de aumento de dados (data augmentation), que não são perfeitas, apresentam um obstáculo ainda maior. Outro problema encontrado é que os resultados do estado da arte raramente são reprodutíveis porque os métodos usam conjuntos de dados proprietários, redes prétreinadas e/ou inicializações de peso de outras redes maiores. Um exemplo de um cenário de poucos recursos existe mesmo no quinto maior país do mundo em território; lar da maioria dos falantes da sétima língua mais falada do planeta. O Brasil é o líder na economia latino-americana e, como um país do BRIC, deseja se tornar um participante cada vez mais forte no mercado global. Ainda assim, a proficiência em inglês é baixa, mesmo para profissionais em empresas e universidades. Baixa inteligibilidade e forte pronúncia podem prejudicar a credibilidade profissional. 
É aceito na literatura para ensino de línguas estrangeiras que é importante que os alunos adultos sejam informados de seus erros, conforme descrito pela Noticing Theory, que explica que um aluno é mais bem sucedido quando ele é capaz de aprender com seus próprios erros. Um objetivo essencial desta tese é classificar os fonemas do modelo acústico, que é necessário para identificar automaticamente e adequadamente os erros de fonemas. Uma crença comum na comunidade é que o aprendizado profundo requer grandes conjuntos de dados para ser efetivo. Isso acontece porque os métodos de força bruta criam um espaço de hipóteses altamente complexo que requer redes grandes e complexas que, por sua vez, exigem uma grande quantidade de amostras de dados para gerar boas redes. Além disso, as funções de perda usadas no aprendizado neural não fornecem garantias estatísticas de aprendizado e apenas garantem que a rede possa memorizar bem o espaço de treinamento. No caso de fala marcada ou com forte pronúncia, em que uma nova amostra pode ter uma grande variação comparada com as amostras de treinamento, a generalização em tais modelos é prejudicada. O principal objetivo desta tese é investigar como generalizações acústicas mais robustas podem ser obtidas, mesmo com poucos dados e/ou dados ruidosos de fala marcada ou com forte pronúncia. A abordagem utilizada nesta tese visa tirar vantagem da raw feature extraction fornecida por técnicas de aprendizado profundo e obter garantias de aprendizado para conjuntos de dados pequenos para produzir resultados robustos para a modelagem acústica, sem a necessidade de big data. Isso foi feito por meio de seleção cuidadosa e inteligente de parâmetros e arquitetura no âmbito da Teoria do Aprendizado Estatístico. 
In this thesis, an intelligently defined architecture based on Convolutional Neural Networks (CNNs), together with context windows and a knowledge-driven hierarchical tree of classifiers that use Support Vector Machines (SVMs), achieves near state-of-the-art frame-based phoneme recognition results with absolutely no pretraining or weight initialization from external networks. One objective of this thesis is to produce transparent, reproducible architectures with high frame-level accuracy, comparable to the state of the art. Additionally, a convergence analysis based on the learning guarantees of statistical learning theory is carried out to demonstrate the model's generalization capacity. The model has an error of 39.7% in frame-based classification and a phoneme error rate of 43.5% using raw feature extraction and classification with SVMs, even with little data (less than 7 hours). These results are comparable to studies that use well over ten times that amount of data. Beyond the intrinsic evaluation, the model also achieves an accuracy of 88% in the identification of epenthesis, the error that is most difficult for Brazilian speakers of English. This is a relative gain of 69% over previous values in the literature. The results are significant because they show how raw feature extraction can be applied to low-data scenarios, contrary to popular belief. The extrinsic results also show how this approach can be useful in tasks such as automatic error diagnosis. Another contribution is the publication of a set of freely available resources that previously did not exist, intended to support future research on dataset creation. 
Acoustic modeling Computer vision Convolutional neural networks Deep learning Non-native phoneme recognition Speech processing Statistical learning theory Support vector machines
10	Spoken language identification in resource-scarce environments Peche, Marius 24 August 2010 (has links) South Africa has eleven official languages, ten of which are considered “resource-scarce”. For these languages, even basic linguistic resources required for the development of speech technology systems can be difficult or impossible to obtain. In this thesis, the process of developing Spoken Language Identification (S-LID) systems in resource-scarce environments is investigated. A Parallel Phoneme Recognition followed by Language Modeling (PPR-LM) architecture is utilized and three specific scenarios are investigated: (1) incomplete resources, including the lack of audio transcriptions and/or pronunciation dictionaries; (2) inconsistent resources, including the use of speech corpora that are unmatched with regard to domain or channel characteristics; and (3) poor quality resources, such as wrongly labeled or poorly transcribed data. Each situation is analysed, techniques are defined to mitigate the effect of limited or poor quality resources, and the effectiveness of these techniques is evaluated experimentally. Techniques evaluated include the development of orthographic tokenizers, bootstrapping of transcriptions, filtering of low quality audio, diarization and channel normalization techniques, and the human verification of misclassified utterances. The knowledge gained from this research is used to develop the first S-LID system able to distinguish between all South African languages. The system performs well, differentiating among the eleven languages with an accuracy of above 67%, and among the six primary South African language families with an accuracy of higher than 80%, on segments of speech of between 2s and 10s in length. AFRIKAANS (translated): South Africa has eleven official languages, ten of which are considered resource-scarce. For these ten languages, even the basic resources needed to develop speech technology systems can be difficult to obtain. 
The process of developing a Spoken Language Identification system for resource-scarce environments is investigated in this thesis. A Parallel Phoneme Recognition followed by Language Modeling architecture is employed, and three specific scenarios are investigated: (1) incomplete resources, for example missing transcriptions and pronunciation dictionaries; (2) inconsistent resources, for example the use of speech corpora that are mismatched in terms of channel characteristics; and (3) poor quality resources, for example wrongly classified data and audio recordings that are poorly transcribed. Each situation is analysed, techniques to reduce the negative effects of scarce or poor resources are developed, and the usefulness of these techniques is determined experimentally. Techniques developed include orthographic tokenizers, the automatic development of new transcriptions, the filtering of poor quality audio data, diarization and channel normalization techniques, and human verification of misclassified utterances. The knowledge gained from this research is used to develop the first Spoken Language Identification system that can distinguish between all the languages of South Africa. This system performs relatively well and can identify the eleven languages with an accuracy of more than 67%. When the focus is on the six language families, the accuracy improves to more than 80% for segments between 2 and 10 seconds long. Copyright / Dissertation (MEng)--University of Pretoria, 2010. 
/ Electrical, Electronic and Computer Engineering / unrestricted Suboptimal resources Mismatched resources Incomplete resources Language modeling Parallel phoneme recognition Automatic speech recognition Human language technologies Spoken language identification UCTD
