401 |
Tree-based Gaussian mixture models for speaker verification
Cilliers, Francois Dirk 12 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2005. / The Gaussian mixture model (GMM) performs very effectively in applications
such as speech and speaker recognition. However, evaluation speed is greatly
reduced when the GMM has a large number of mixture components. Various
techniques improve the evaluation speed by reducing the number of required
Gaussian evaluations.
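As an illustrative aside (not code from the thesis), a diagonal-covariance GMM log-likelihood makes the cost structure plain: every one of the M mixture components is evaluated for every frame, which is the bottleneck that tree-based pruning schemes target:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one frame x under a diagonal-covariance GMM.

    All M components are evaluated, so the cost is O(M * D) for
    M components of dimension D; tree-based methods reduce the
    number of component evaluations needed per frame.
    """
    d = x.shape[0]
    # Per-component Gaussian log-densities (diagonal covariance).
    log_dets = np.sum(np.log(variances), axis=1)
    mahalanobis = np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = -0.5 * (d * np.log(2 * np.pi) + log_dets + mahalanobis)
    # Stable weighted log-sum-exp over components.
    a = np.log(weights) + log_comp
    m = np.max(a)
    return float(m + np.log(np.sum(np.exp(a - m))))
```

For example, a single zero-mean, unit-variance component evaluated at the origin in two dimensions gives exactly -log(2*pi).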
|
402 |
The design of a high-performance, floating-point embedded system for speech recognition and audio research purposes
Duckitt, William 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--Stellenbosch University, 2008. / This thesis describes the design of a high-performance, floating-point, standalone embedded
system that is appropriate for speech and audio processing purposes.
The system successfully employs the Analog Devices TigerSHARC TS201 600 MHz floating-point
digital signal processor as a CPU, and includes 512 MB of RAM, a Compact Flash storage card
interface as non-volatile memory, a multi-channel audio input and output system with two
programmable microphone preamplifiers offering up to 65 dB of gain, a USB interface, an LCD display
and a push-button user interface.
An Altera Cyclone II FPGA is used to interface the CPU with the various peripheral
components. The FIFO buffers within the FPGA allow bulk DMA transfers of audio data for minimal
processor delays. Similar approaches are taken for communication with the USB interface, the
Compact Flash storage card and the LCD display.
A logic analyzer interface allows system debugging via the FPGA. This interface can also in
future be used to interface to additional components. The power distribution network required 11
different supply voltages, with a total power consumption of 16.8 W. A 6-layer PCB incorporating 4
signal layers, a power plane and a ground plane was designed for the final prototype.
All system components were verified to be operating correctly by means of appropriate
testing software, and the computational performance was measured by repeated calculation of a
multi-dimensional Gaussian log-probability and found to be comparable with that of an Intel 1.8 GHz
Core 2 Duo processor.
The design can therefore be considered a success, and the prototype is ready for
development of suitable speech or audio processing software.
|
403 |
Speech recognition of South African English accents
Kamper, Herman 03 1900 (has links)
Thesis (MScEng)--Stellenbosch University, 2012. / ENGLISH ABSTRACT: Several accents of English are spoken in South Africa. Automatic speech recognition (ASR) systems
should therefore be able to process the different accents of South African English (SAE).
In South Africa, however, system development is hampered by the limited availability of speech
resources. In this thesis we consider different acoustic modelling approaches and system configurations
in order to determine which strategies take best advantage of a limited corpus of the five
accents of SAE for the purpose of ASR. Three acoustic modelling approaches are considered:
(i) accent-specific modelling, in which accents are modelled separately; (ii) accent-independent
modelling, in which acoustic training data is pooled across accents; and (iii) multi-accent modelling,
which allows selective data sharing between accents. For the latter approach, selective
sharing is enabled by extending the decision-tree state clustering process normally used to construct
tied-state hidden Markov models (HMMs) by allowing accent-based questions.
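To make the selective-sharing idea concrete, here is a hypothetical sketch (not the thesis's code; the function names and the single-Gaussian likelihood approximation are assumptions) of how an accent-based question could be scored during decision-tree state clustering: the question splits a tied state's pooled frames by accent, and the split is worthwhile only if it yields a likelihood gain.

```python
import numpy as np

def set_log_likelihood(frames):
    # ML single-Gaussian (diagonal-covariance) log-likelihood of a
    # pooled set of frames: the score commonly used to rate tree splits.
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8  # floor avoids log(0)
    return -0.5 * n * float(np.sum(np.log(2 * np.pi * var) + 1.0))

def accent_split_gain(frames, accents, question):
    # Log-likelihood gain if a tied state is split by an accent-based
    # question (the set of accent labels answering "yes").
    yes = np.array([a in question for a in accents])
    if yes.all() or not yes.any():
        return 0.0  # degenerate split: everything lands on one side
    return (set_log_likelihood(frames[yes])
            + set_log_likelihood(frames[~yes])
            - set_log_likelihood(frames))
```

When no accent question gives sufficient gain, the accents keep sharing the state; when one does, accent-specific states are created, which is how data sharing becomes data-driven.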
In a first set of experiments, we investigate phone and word recognition performance achieved
by the three modelling approaches in a configuration where the accent of each test utterance is
assumed to be known. Each utterance is therefore presented only to the matching model set.
We show that, in terms of best recognition performance, the decision of whether to separate
or to pool training data depends on the particular accents in question. Multi-accent acoustic
modelling, however, allows this decision to be made automatically in a data-driven manner.
When modelling the five accents of SAE, multi-accent models yield a statistically significant
improvement of 1.25% absolute in word recognition accuracy over accent-specific and accent-independent
models.
In a second set of experiments, we consider the practical scenario where the accent of each test
utterance is assumed to be unknown. Each utterance is presented simultaneously to a bank
of recognisers, one for each accent, running in parallel. In this setup, accent identification is
performed implicitly during the speech recognition process. A system employing multi-accent
acoustic models in this parallel configuration is shown to achieve slightly improved performance
relative to the configuration in which the accents are known. This demonstrates that accent
identification errors made during the parallel recognition process do not affect recognition performance.
Furthermore, the parallel approach is also shown to outperform an accent-independent
system obtained by pooling acoustic and language model training data.
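The parallel configuration can be sketched as follows (illustrative code only; the recogniser functions here are hypothetical stand-ins for full per-accent HMM systems, each returning a hypothesis with a log score):

```python
def recognise_parallel(utterance, recognisers):
    """Present the utterance to one recogniser per accent and keep the
    highest-scoring hypothesis; the winning accent is thereby
    identified implicitly, as a by-product of recognition."""
    results = {accent: rec(utterance) for accent, rec in recognisers.items()}
    best_accent = max(results, key=lambda a: results[a][1])
    hypothesis, log_score = results[best_accent]
    return hypothesis, best_accent, log_score
```

In practice the per-accent recognisers run as parallel decoders over the same utterance; here they are plain functions mapping features to a (hypothesis, log-score) pair.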
In a final set of experiments, we consider the unsupervised reclassification of training set accent
labels. Accent labels are assigned by human annotators based on a speaker's mother-tongue or
ethnicity. These might not be optimal for modelling purposes. By classifying the accent of each
utterance in the training set by using first-pass acoustic models and then retraining the models,
reclassified acoustic models are obtained. We show that the proposed relabelling procedure does
not lead to any improvements and that training on the originally labelled data remains the best
approach. / AFRIKAANSE OPSOMMING: Verskeie aksente van Engels word in Suid-Afrika gepraat. Outomatiese spraakherkenningstelsels
moet dus in staat wees om verskillende aksente van Suid-Afrikaanse Engels (SAE) te kan
hanteer. In Suid-Afrika word die ontwikkeling van spraakherkenningstegnologie egter deur die
beperkte beskikbaarheid van geannoteerde spraakdata belemmer. In hierdie tesis ondersoek ons
verskillende akoestiese modelleringstegnieke en stelselkonfigurasies ten einde te bepaal watter
strategieë die beste gebruik maak van 'n databasis van die vyf aksente van SAE. Drie akoestiese
modelleringstegnieke word ondersoek: (i) aksent-spesifieke modellering, waarin elke aksent
apart gemodelleer word; (ii) aksent-onafhanklike modellering, waarin die akoestiese afrigdata
van verskillende aksente saamgegooi word; en (iii) multi-aksent modellering, waarin data selektief
tussen aksente gedeel word. Vir laasgenoemde word selektiewe deling moontlik gemaak
deur die besluitnemingsboom-toestandbondeling-algoritme, wat gebruik word in die afrig van
gebinde-toestand verskuilde Markov-modelle, uit te brei deur aksent-gebaseerde vrae toe te laat.
In 'n eerste stel eksperimente word die foon- en woordherkenningsakkuraathede van die drie modelleringstegnieke
vergelyk in 'n konfigurasie waarin daar aanvaar word dat die aksent van elke
toetsspraakdeel bekend is. In hierdie konfigurasie word elke spraakdeel slegs gebied aan die
modelstel wat ooreenstem met die aksent van die spraakdeel. In terme van herkenningsakkuraathede,
wys ons dat die keuse tussen aksent-spesifieke en aksent-onafhanklike modellering
afhanklik is van die spesifieke aksente wat ondersoek word. Multi-aksent akoestiese modellering
stel ons egter in staat om hierdie besluit outomaties op 'n data-gedrewe wyse te neem. Vir
die modellering van die vyf aksente van SAE lewer multi-aksent modelle 'n statisties beduidende
verbetering van 1.25% absoluut in woordherkenningsakkuraatheid op in vergelyking met
aksent-spesifieke en aksent-onafhanklike modelle.
In 'n tweede stel eksperimente word die praktiese scenario ondersoek waar daar aanvaar word
dat die aksent van elke toetsspraakdeel onbekend is. Elke spraakdeel word gelyktydig gebied aan
'n stel herkenners, een vir elke aksent, wat in parallel hardloop. In hierdie opstelling word aksentidentifikasie
implisiet uitgevoer. Ons vind dat 'n stelsel wat multi-aksent akoestiese modelle
in parallel inspan, effense verbeterde werkverrigting toon in vergelyking met die opstelling waar
die aksent bekend is. Dit dui daarop dat aksentidentifiseringsfoute wat gemaak word gedurende
herkenning, nie werkverrigting beïnvloed nie. Verder wys ons dat die parallelle benadering ook
beter werkverrigting toon as 'n aksent-onafhanklike stelsel wat verkry word deur akoestiese en
taalmodelleringsafrigdata saam te gooi.
In 'n finale stel eksperimente ondersoek ons die ongekontroleerde herklassifikasie van aksenttoekennings
van die spraakdele in ons afrigstel. Aksente word gemerk deur menslike transkribeerders
op grond van 'n spreker se moedertaal en ras. Hierdie toekennings is nie noodwendig
optimaal vir modelleringsdoeleindes nie. Deur die aksent van elke spraakdeel in die afrigstel te
klassifiseer deur van aanvanklike akoestiese modelle gebruik te maak en dan weer modelle af te
rig, word hergeklassifiseerde akoestiese modelle verkry. Ons wys dat die voorgestelde herklassifiseringsalgoritme
nie tot enige verbeterings lei nie en dat dit die beste is om modelle op die
oorspronklike data af te rig.
|
404 |
Automatic alignment and error detection for phonetic transcriptions in the African speech technology project databases
De Villiers, Edward 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2006. / The African Speech Technology (AST) project ran from 2000 to 2004 and involved collecting speech data for five South African languages, transcribing the data and building automatic speech recognition systems in these languages. The work described here forms part of this project and involved implementing methods for automatic boundary placement in manually labelled files and for determining errors made by transcribers during the labelling process.
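As a hedged illustration of the error-detection task (assumed detail; the thesis's actual procedure may differ), a dynamic-programming alignment between a transcriber's phone string and a reference phone string exposes the substitutions, insertions and deletions that can flag likely transcription errors:

```python
def align_phones(ref, hyp):
    """Levenshtein alignment of two phone strings; returns the edit
    operations ('match', 'sub', 'ins', 'del') for error inspection."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    # Backtrace to recover the edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            ops.append('match' if ref[i - 1] == hyp[j - 1] else 'sub')
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append('del')
            i -= 1
        else:
            ops.append('ins')
            j -= 1
    return ops[::-1]
```

A high count of non-match operations for a given file is then a cue that the transcriber's labels (or the reference) deserve manual review.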
|
405 |
Evaluation of modern large-vocabulary speech recognition techniques and their implementation
Swart, Ranier Adriaan 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / In this thesis we studied large-vocabulary continuous speech recognition.
We considered the components necessary to realise a large-vocabulary speech
recogniser and how systems such as Sphinx and HTK solved the problems
facing such a system.
Hidden Markov Models (HMMs) have been a common approach to
acoustic modelling in speech recognition in the past. HMMs are well suited
to modelling speech, since they are able to model both its stationary nature
and temporal effects. We studied HMMs and the algorithms associated with
them. Since incorporating all knowledge sources as efficiently as possible is
of the utmost importance, the N-Best paradigm was explored along with
some more advanced HMM algorithms.
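As an illustrative aside, the Viterbi algorithm at the heart of HMM decoding can be sketched in a few lines (assuming per-frame emission log-likelihoods have already been computed; this is a sketch, not the system's decoder):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most-likely HMM state path.

    log_pi: (S,) initial state log-probabilities
    log_A:  (S, S) transition log-probabilities
    log_B:  (T, S) per-frame emission log-likelihoods
    Returns (state_path, path_log_score).
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (prev_state, next_state)
        back[t] = np.argmax(scores, axis=0)      # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta))
```

N-best decoding generalises this by keeping several partial hypotheses per state instead of only the single best one.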
The way in which sounds and words are constructed has been studied
extensively in the past. Context dependency on the acoustic level and on
the linguistic level can be exploited to improve the performance of a speech recogniser. We considered some of the techniques previously used to solve
the associated problems.
We implemented and combined some chosen algorithms to form our
system and reported the recognition results. Our final system performed
reasonably well and will form an ideal framework for future studies on
large-vocabulary speech recognition at the University of Stellenbosch. Many
avenues of research for future versions of the system were considered.
|
406 |
Fusion of phoneme recognisers for South African English
Strydom, George Wessel 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / ENGLISH ABSTRACT: Phoneme recognition systems typically suffer from low classification accuracy. Recognition
for South African English is especially difficult, due to the variety of vastly different accent
groups. This thesis investigates whether a fusion of classifiers, each trained on a specific
accent group, can outperform a single general classifier trained on all.
We implemented basic voting and score fusion techniques from which a small increase in
classifier accuracy could be seen. To ensure that similarly-valued output scores from different
classifiers imply the same opinion, these classifiers need to be calibrated before fusion. The
main focus point of this thesis is calibration with the Pool Adjacent Violators algorithm.
We achieved impressive gains in accuracy with this method and an in-depth investigation
was made into the role of the prior and the connection with the proportion of target to
non-target scores.
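A compact sketch of the Pool Adjacent Violators algorithm (illustrative code, not the implementation used in the thesis): scores are sorted, and adjacent blocks whose empirical target proportions violate monotonicity are pooled, yielding a non-decreasing map from score to calibrated probability.

```python
def pav_calibrate(scores, labels):
    """Fit a monotone map from scores to empirical target probabilities.

    scores: raw classifier scores; labels: 1 for target, 0 for non-target.
    Returns (sorted_scores, fitted_probs), with fitted_probs
    non-decreasing in score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ys = [float(labels[i]) for i in order]
    blocks = []  # each block is [label_sum, count]
    for y in ys:
        blocks.append([y, 1])
        # Pool while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [scores[i] for i in order], fitted
```

The fitted step function is what gets applied to new scores before fusion, so that equal calibrated values really do express equal opinions across classifiers.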
Calibration and fusion using the information metric Cllr were shown to perform impressively
with synthetic data, but only minor increases in accuracy were found for our phoneme
recognition system. The best results for this technique were achieved by calibrating each
classifier individually, fusing these calibrated classifiers and then finally calibrating the fused
system.
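The metric itself is compact: for target and non-target log-likelihood-ratio (LLR) scores, Cllr averages the cost, in bits, of each kind of score (a sketch of the standard definition, not the thesis's code):

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits: 0 for perfectly calibrated,
    perfectly discriminating LLRs; 1 for a useless system that always
    outputs LLR = 0."""
    c_tar = sum(math.log2(1.0 + math.exp(-llr)) for llr in target_llrs) / len(target_llrs)
    c_non = sum(math.log2(1.0 + math.exp(llr)) for llr in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_tar + c_non)
```

Minimising Cllr over an affine transform of the scores is one common way to calibrate a classifier before fusion.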
Boosting and Bagging classifiers were also briefly investigated as possible phoneme recognisers.
Our attempt did not achieve the target accuracy of the classifier trained on all the
accent groups.
The inherent difficulties typical of phoneme recognition were highlighted. Low per-class
accuracies, a large number of classes and an unbalanced speech corpus all had a negative
influence on the effectiveness of the tested calibration and fusion techniques. / AFRIKAANSE OPSOMMING: Foneemherkenningstelsels het tipies lae klassifikasie-akkuraatheid. As gevolg van die verskeidenheid
verskillende aksent groepe is herkenning vir Suid-Afrikaanse Engels veral moeilik.
Hierdie tesis ondersoek of ’n fusie van klassifiseerders, elk afgerig op ’n spesifieke aksent
groep, beter kan doen as ’n enkele klassifiseerder wat op alle groepe afgerig is.
Ons het basiese stem- en tellingfusie tegnieke geïmplementeer, wat tot ’n klein verbetering
in klassifiseerder akkuraatheid gelei het. Om te verseker dat soortgelyke uittreetellings van
verskillende klassifiseerders dieselfde opinie impliseer, moet hierdie klassifiseerders gekalibreer
word voor fusie. Die hoof fokuspunt van hierdie tesis is kalibrasie met die Pool Adjacent
Violators algoritme. Indrukwekkende toenames in akkuraatheid is behaal met hierdie
metode en ’n in-diepte ondersoek is ingestel oor die rol van die aanneemlikheidswaarskynlikhede
en die verwantskap met die verhouding van teiken tot nie-teiken tellings.
Kalibrasie en fusie met behulp van die informasie maatstaf Cllr lewer indrukwekkende
resultate met sintetiese data, maar slegs klein verbeterings in akkuraatheid is gevind vir
ons foneemherkenningstelsel. Die beste resultate vir hierdie tegniek is verkry deur elke
klassifiseerder afsonderlik te kalibreer, hierdie gekalibreerde klassifiseerders dan te kombineer
en dan die finale gekombineerde stelsel weer te kalibreer.
Boosting en Bagging klassifiseerders is ook kortliks ondersoek as moontlike foneem herkenners.
Ons poging het nie die akkuraatheid van ons basislyn klassifiseerder (wat op alle data
afgerig is) bereik nie.
Die inherente probleme wat tipies is tot foneemherkenning is uitgewys. Lae per-klas
akkuraatheid, ’n groot hoeveelheid klasse en ’n ongebalanseerde spraak korpus het almal ’n
negatiewe invloed op die effektiwiteit van die getoetsde kalibrasie en fusie tegnieke gehad.
|
407 |
Automatic Transcript Generator for Podcast Files
Holst, Andy January 2010 (has links)
In the modern world, the Internet has become a popular medium, but people with hearing disabilities, as well as search engines, cannot access the speech content of podcast files. To partially solve this problem, Sphinx decoders such as Sphinx-3 and Sphinx-4 can be used to implement an automatic transcript generator application, either by coupling an existing large acoustic model, language model and dictionary, or by training your own large acoustic model and language model and creating your own dictionary, to support a continuous, speaker-independent speech recognition system.
|
408 |
Useful Transcriptions of Webcast Lectures
Munteanu, Cosmin 25 September 2009 (has links)
Webcasts are an emerging technology enabled by the expanding availability and capacity of the World Wide Web. This has led to an increase in the number of lectures and academic presentations being broadcast over the Internet. Ideally, repositories of such webcasts would be used in the same manner as libraries: users could search for, retrieve, or browse through textual information. However, one major obstacle prevents webcast archives from becoming the digital equivalent of traditional libraries: information is mainly transmitted and stored in spoken form. Despite voice being currently present in all webcasts, users do not benefit from it beyond simple playback. My goal has been to exploit this information-rich resource and improve webcast users' experience in browsing and searching for specific information. I achieve this by combining research in Human-Computer Interaction and Automatic Speech Recognition that would ultimately see text transcripts of lectures being integrated into webcast archives.
In this dissertation, I show that the usefulness of automatically-generated transcripts of webcast lectures can be improved by speech recognition techniques specifically addressed at increasing the accuracy of webcast transcriptions, and the development of an interactive collaborative interface that facilitates users' contributions to machine-generated transcripts. I first investigate the user needs for transcription accuracy in webcast archives and show that users' performance and transcript quality perception is affected by the Word Error Rate (WER). A WER equal to or less than 25% is acceptable for use in webcast archives. As current Automatic Speech Recognition (ASR) systems can only deliver, in realistic lecture conditions, WERs of around 45-50%, I propose and evaluate a webcast system extension that engages users to collaborate in a wiki manner on editing imperfect ASR transcripts.
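As an illustrative aside, WER is a word-level edit distance (a minimal sketch, not the dissertation's evaluation code):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(ref)
```

Under this definition, a transcript meeting the 25% usability threshold has at most one word error for every four reference words.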
My research on ASR focuses on reducing the WER for lectures by making use of available external knowledge sources, such as documents on the World Wide Web and lecture slides, to better model the conversational and the topic-specific styles of lectures. I show that this approach results in relative WER reductions of 11%. Further ASR improvements are proposed that combine the research on language modelling with aspects of collaborative transcript editing. Extracting information about the most frequent ASR errors from user-edited partial transcripts, and attempting to correct such errors when they occur in the remaining transcripts, can lead to an additional 10 to 18% relative reduction in lecture WER.
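The final error-correction step can be sketched as follows (hypothetical code; it assumes word-aligned pairs of ASR output and user corrections have already been extracted from the edited portions of the transcript):

```python
from collections import Counter

def learn_corrections(pairs):
    """From (asr_word, corrected_word) pairs harvested from user-edited
    transcript segments, keep the most frequent correction for each
    ASR word that users actually changed."""
    counts = {}
    for asr, fixed in pairs:
        counts.setdefault(asr, Counter())[fixed] += 1
    table = {}
    for asr, c in counts.items():
        best, _ = c.most_common(1)[0]
        if best != asr:
            table[asr] = best
    return table

def apply_corrections(words, table):
    # Substitute known frequent errors in the remaining ASR output.
    return [table.get(w, w) for w in words]
```

A real system would condition corrections on context and confidence rather than on the word identity alone; this sketch only shows the frequency-based core.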
|
409 |
Lecture transcription systems in resource-scarce environments / Pieter Theunis de Villiers
De Villiers, Pieter Theunis January 2014 (has links)
Classroom note taking is a fundamental task performed by learners on a daily basis.
These notes provide learners with valuable offline study material, especially in the case of more difficult subjects. The use of class notes has been found not only to provide students with a better learning experience, but also to lead to an overall higher academic performance. In a previous study, an increase of 10.5% in student grades was observed after these students had been provided with multimedia class notes. This is not surprising, as other studies have found that the rate of successful transfer of information to humans increases when both visual and audio information is provided.
Note taking might seem like an easy task; however, students with hearing impairments, visual impairments, physical impairments, learning disabilities or even non-native listeners find this task very difficult or even impossible. It has also been reported that even non-disabled students find note taking time-consuming and that it requires a great deal of mental effort while also trying to pay full attention to the lecturer. This is illustrated by a study which found that college students were only able to record ~40% of the data presented by the lecturer. It is thus reasonable to expect an automatic way of generating class notes to be beneficial to all learners.
Lecture transcription (LT) systems are used in educational environments to assist learners by providing them with real-time in-class transcriptions, or recordings and transcriptions for offline use. Such systems have already been successfully implemented in the developed world, where all required resources were easily obtained. These systems are typically trained on hundreds to thousands of hours of speech, while their language models are trained on millions or even hundreds of millions of words. Such amounts of data are generally not available in the developing world. In this dissertation, a number of approaches toward the development of LT systems in resource-scarce environments are investigated.
We focus on different approaches to obtaining sufficient amounts of well-transcribed
data for building acoustic models, using corpora with few transcriptions and of variable quality. One approach investigates the use of a dynamic programming phone string alignment procedure to harvest as much usable data as possible from approximately transcribed speech data. We find that target-language acoustic models are optimal for this purpose, but encouraging results are also found when using models from another language for alignment.
Another approach entails using unsupervised training methods, where an initial low-accuracy recognizer is used to transcribe a set of untranscribed data. From this poorly transcribed data, correctly recognized portions are extracted based on a word confidence threshold. The initial system is then retrained along with the newly recognized data in order to increase its overall accuracy. The initial acoustic models are trained using as little as 11 minutes of transcribed speech. After several iterations of unsupervised training, a noticeable increase in accuracy was observed (47.79% WER to 33.44% WER). Similar results (35.97% WER) were, however, found after using a large speaker-independent corpus to train the initial system. Usable LMs were also created using as few as 17955 words from transcribed lectures; however, this resulted in large out-of-vocabulary rates. This problem was solved by means of LM interpolation, which was found to be very beneficial in cases where subject-specific data (such as lecture slides and books) was available.
We also introduce our NWU LT system, which was developed for use in learning environments and was designed using a client/server-based architecture. Based on the results found in this study, we are confident that usable models for use in LT systems can be developed in resource-scarce environments. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
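The LM interpolation mentioned above amounts to a linear mixture of word-probability estimates (an illustrative sketch; real systems interpolate n-gram models with backoff, with weights tuned on held-out data):

```python
def interpolate(lms, weights):
    """Linear interpolation of language models:
    P(w | h) = sum_k weight_k * P_k(w | h).
    Mixing a small in-domain lecture LM with a large background LM
    alleviates sparsity and out-of-vocabulary problems.
    Each lm is a callable (word, history) -> probability."""
    assert abs(sum(weights) - 1.0) < 1e-9
    def prob(word, history=()):
        return sum(w * lm(word, history) for w, lm in zip(weights, lms))
    return prob
```

With a held-out lecture transcript, the weights would typically be chosen to minimise perplexity; here they are simply fixed constants.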
|
410 |
Improving Grapheme-based speech recognition through P2G transliteration / W.D. Basson
Basson, Willem Diederick January 2014 (has links)
Grapheme-based speech recognition systems are faster to develop, but typically do not
reach the same level of performance as phoneme-based systems. Using Afrikaans speech
recognition as a case study, we first analyse the reasons for the discrepancy in performance, before introducing a technique for improving the performance of standard grapheme-based systems. It is found that by handling a relatively small number of irregular words through phoneme-to-grapheme (P2G) transliteration – transforming the original orthography of irregular words to an ‘idealised’ orthography – grapheme-based accuracy can be improved. An analysis of speech recognition accuracy based on word categories shows that P2G transliteration succeeds in improving certain word categories in which grapheme-based systems typically perform poorly, and that the problematic categories can be identified prior to system development. An evaluation is offered of when category-based P2G transliteration is beneficial and methods to implement the technique in practice are discussed. Comparative results are obtained for a second language (Vietnamese) in order to determine whether the technique can be generalised. / MSc (Computer Science) North-West University, Vaal Triangle Campus, 2014
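The core of the technique can be illustrated as follows (hypothetical code with a made-up respelling table; the thesis derives idealised spellings from pronunciations via P2G modelling, which is not shown here):

```python
# Hypothetical respelling table: irregular words mapped to an
# 'idealised' orthography that a grapheme-based system can treat
# regularly. The entries are invented for illustration only.
RESPELLINGS = {
    "bureau": "buro",     # made-up idealised spelling
    "quiche": "kiesj",    # made-up idealised spelling
}

def transliterate_transcript(transcript, respellings=RESPELLINGS):
    """Rewrite known irregular words to their idealised orthography
    before grapheme-based lexicon and acoustic-model training;
    regular words pass through unchanged."""
    return " ".join(respellings.get(w, w) for w in transcript.lower().split())
```

Because only a relatively small set of irregular words needs entries, the table stays cheap to build compared with a full phonemic pronunciation dictionary.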
|