401 |
Tree-based Gaussian mixture models for speaker verification
Cilliers, Francois Dirk 12 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2005. / The Gaussian mixture model (GMM) performs very effectively in applications
such as speech and speaker recognition. However, evaluation speed is greatly
reduced when the GMM has a large number of mixture components. Various
techniques improve the evaluation speed by reducing the number of required
Gaussian evaluations.
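As an illustrative aside (not code from the thesis), a diagonal-covariance GMM log-likelihood makes the cost structure plain: every one of the M mixture components is evaluated for every frame, which is the bottleneck that tree-based pruning schemes target:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one frame x under a diagonal-covariance GMM.

    All M components are evaluated, so the cost is O(M * D) for
    M components of dimension D; tree-based methods reduce the
    number of component evaluations needed per frame.
    """
    d = x.shape[0]
    # Per-component Gaussian log-densities (diagonal covariance).
    log_dets = np.sum(np.log(variances), axis=1)
    mahalanobis = np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = -0.5 * (d * np.log(2 * np.pi) + log_dets + mahalanobis)
    # Stable weighted log-sum-exp over components.
    a = np.log(weights) + log_comp
    m = np.max(a)
    return float(m + np.log(np.sum(np.exp(a - m))))
```

For example, a single zero-mean, unit-variance component evaluated at the origin in two dimensions gives exactly -log(2*pi).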
|
402 |
The design of a high-performance, floating-point embedded system for speech recognition and audio research purposes
Duckitt, William 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--Stellenbosch University, 2008. / This thesis describes the design of a high-performance, floating-point, standalone embedded
system that is appropriate for speech and audio processing purposes.
The system successfully employs the Analog Devices TigerSHARC TS201 600 MHz floating-point
digital signal processor as a CPU, and includes 512 MB of RAM, a Compact Flash storage card
interface as non-volatile memory, a multi-channel audio input and output system with two
programmable microphone preamplifiers offering up to 65 dB of gain, a USB interface, an LCD display
and a push-button user interface.
An Altera Cyclone II FPGA is used to interface the CPU with the various peripheral
components. The FIFO buffers within the FPGA allow bulk DMA transfers of audio data for minimal
processor delays. Similar approaches are taken for communication with the USB interface, the
Compact Flash storage card and the LCD display.
A logic analyzer interface allows system debugging via the FPGA. This interface can also in
future be used to interface to additional components. The power distribution network required 11
different supply voltages, with a total power consumption of 16.8 W. A 6-layer PCB incorporating 4
signal layers, a power plane and a ground plane was designed for the final prototype.
All system components were verified to be operating correctly by means of appropriate
testing software, and the computational performance was measured by repeated calculation of a
multi-dimensional Gaussian log-probability and found to be comparable with that of an Intel 1.8 GHz
Core 2 Duo processor.
The design can therefore be considered a success, and the prototype is ready for
development of suitable speech or audio processing software.
|
403 |
Speech recognition of South African English accents
Kamper, Herman 03 1900 (has links)
Thesis (MScEng)--Stellenbosch University, 2012. / ENGLISH ABSTRACT: Several accents of English are spoken in South Africa. Automatic speech recognition (ASR) systems
should therefore be able to process the different accents of South African English (SAE).
In South Africa, however, system development is hampered by the limited availability of speech
resources. In this thesis we consider different acoustic modelling approaches and system configurations
in order to determine which strategies take best advantage of a limited corpus of the five
accents of SAE for the purpose of ASR. Three acoustic modelling approaches are considered:
(i) accent-specific modelling, in which accents are modelled separately; (ii) accent-independent
modelling, in which acoustic training data is pooled across accents; and (iii) multi-accent modelling,
which allows selective data sharing between accents. For the latter approach, selective
sharing is enabled by extending the decision-tree state clustering process normally used to construct
tied-state hidden Markov models (HMMs) by allowing accent-based questions.
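To make the selective-sharing idea concrete, here is a hypothetical sketch (not the thesis's code; the function names and the single-Gaussian likelihood approximation are assumptions) of how an accent-based question could be scored during decision-tree state clustering: the question splits a tied state's pooled frames by accent, and the split is worthwhile only if it yields a likelihood gain.

```python
import numpy as np

def set_log_likelihood(frames):
    # ML single-Gaussian (diagonal-covariance) log-likelihood of a
    # pooled set of frames: the score commonly used to rate tree splits.
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8  # floor avoids log(0)
    return -0.5 * n * float(np.sum(np.log(2 * np.pi * var) + 1.0))

def accent_split_gain(frames, accents, question):
    # Log-likelihood gain if a tied state is split by an accent-based
    # question (the set of accent labels answering "yes").
    yes = np.array([a in question for a in accents])
    if yes.all() or not yes.any():
        return 0.0  # degenerate split: everything lands on one side
    return (set_log_likelihood(frames[yes])
            + set_log_likelihood(frames[~yes])
            - set_log_likelihood(frames))
```

When no accent question gives sufficient gain, the accents keep sharing the state; when one does, accent-specific states are created, which is how data sharing becomes data-driven.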
In a first set of experiments, we investigate phone and word recognition performance achieved
by the three modelling approaches in a configuration where the accent of each test utterance is
assumed to be known. Each utterance is therefore presented only to the matching model set.
We show that, in terms of best recognition performance, the decision of whether to separate
or to pool training data depends on the particular accents in question. Multi-accent acoustic
modelling, however, allows this decision to be made automatically in a data-driven manner.
When modelling the five accents of SAE, multi-accent models yield a statistically significant
improvement of 1.25% absolute in word recognition accuracy over accent-specific and accent-independent
models.
In a second set of experiments, we consider the practical scenario where the accent of each test
utterance is assumed to be unknown. Each utterance is presented simultaneously to a bank
of recognisers, one for each accent, running in parallel. In this setup, accent identification is
performed implicitly during the speech recognition process. A system employing multi-accent
acoustic models in this parallel configuration is shown to achieve slightly improved performance
relative to the configuration in which the accents are known. This demonstrates that accent
identification errors made during the parallel recognition process do not affect recognition performance.
Furthermore, the parallel approach is also shown to outperform an accent-independent
system obtained by pooling acoustic and language model training data.
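The parallel configuration can be sketched as follows (illustrative code only; the recogniser functions here are hypothetical stand-ins for full per-accent HMM systems, each returning a hypothesis with a log score):

```python
def recognise_parallel(utterance, recognisers):
    """Present the utterance to one recogniser per accent and keep the
    highest-scoring hypothesis; the winning accent is thereby
    identified implicitly, as a by-product of recognition."""
    results = {accent: rec(utterance) for accent, rec in recognisers.items()}
    best_accent = max(results, key=lambda a: results[a][1])
    hypothesis, log_score = results[best_accent]
    return hypothesis, best_accent, log_score
```

In practice the per-accent recognisers run as parallel decoders over the same utterance; here they are plain functions mapping features to a (hypothesis, log-score) pair.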
In a final set of experiments, we consider the unsupervised reclassification of training set accent
labels. Accent labels are assigned by human annotators based on a speaker's mother-tongue or
ethnicity. These might not be optimal for modelling purposes. By classifying the accent of each
utterance in the training set by using first-pass acoustic models and then retraining the models,
reclassified acoustic models are obtained. We show that the proposed relabelling procedure does
not lead to any improvements and that training on the originally labelled data remains the best
approach. / AFRIKAANSE OPSOMMING: Verskeie aksente van Engels word in Suid-Afrika gepraat. Outomatiese spraakherkenningstelsels
moet dus in staat wees om verskillende aksente van Suid-Afrikaanse Engels (SAE) te kan
hanteer. In Suid-Afrika word die ontwikkeling van spraakherkenningstegnologie egter deur die
beperkte beskikbaarheid van geannoteerde spraakdata belemmer. In hierdie tesis ondersoek ons
verskillende akoestiese modelleringstegnieke en stelselkonfigurasies ten einde te bepaal watter
strategieë die beste gebruik maak van 'n databasis van die vyf aksente van SAE. Drie akoestiese
modelleringstegnieke word ondersoek: (i) aksent-spesifieke modellering, waarin elke aksent
apart gemodelleer word; (ii) aksent-onafhanklike modellering, waarin die akoestiese afrigdata
van verskillende aksente saamgegooi word; en (iii) multi-aksent modellering, waarin data selektief
tussen aksente gedeel word. Vir laasgenoemde word selektiewe deling moontlik gemaak
deur die besluitnemingsboom-toestandbondeling-algoritme, wat gebruik word in die afrig van
gebinde-toestand verskuilde Markov-modelle, uit te brei deur aksent-gebaseerde vrae toe te laat.
In 'n eerste stel eksperimente word die foon- en woordherkenningsakkuraathede van die drie modelleringstegnieke
vergelyk in 'n konfigurasie waarin daar aanvaar word dat die aksent van elke
toetsspraakdeel bekend is. In hierdie konfigurasie word elke spraakdeel slegs gebied aan die
modelstel wat ooreenstem met die aksent van die spraakdeel. In terme van herkenningsakkuraathede,
wys ons dat die keuse tussen aksent-spesifieke en aksent-onafhanklike modellering
afhanklik is van die spesifieke aksente wat ondersoek word. Multi-aksent akoestiese modellering
stel ons egter in staat om hierdie besluit outomaties op 'n data-gedrewe wyse te neem. Vir
die modellering van die vyf aksente van SAE lewer multi-aksent modelle 'n statisties beduidende
verbetering van 1.25% absoluut in woordherkenningsakkuraatheid op in vergelyking met
aksent-spesifieke en aksent-onafhanklike modelle.
In 'n tweede stel eksperimente word die praktiese scenario ondersoek waar daar aanvaar word
dat die aksent van elke toetsspraakdeel onbekend is. Elke spraakdeel word gelyktydig gebied aan
'n stel herkenners, een vir elke aksent, wat in parallel hardloop. In hierdie opstelling word aksentidentifikasie
implisiet uitgevoer. Ons vind dat 'n stelsel wat multi-aksent akoestiese modelle
in parallel inspan, effense verbeterde werkverrigting toon in vergelyking met die opstelling waar
die aksent bekend is. Dit dui daarop dat aksentidentifiseringsfoute wat gemaak word gedurende
herkenning, nie werkverrigting beïnvloed nie. Verder wys ons dat die parallelle benadering ook
beter werkverrigting toon as 'n aksent-onafhanklike stelsel wat verkry word deur akoestiese en
taalmodelleringsafrigdata saam te gooi.
In 'n finale stel eksperimente ondersoek ons die ongekontroleerde herklassifikasie van aksenttoekennings
van die spraakdele in ons afrigstel. Aksente word gemerk deur menslike transkribeerders
op grond van 'n spreker se moedertaal en ras. Hierdie toekennings is nie noodwendig
optimaal vir modelleringsdoeleindes nie. Deur die aksent van elke spraakdeel in die afrigstel te
klassifiseer deur van aanvanklike akoestiese modelle gebruik te maak en dan weer modelle af te
rig, word hergeklassifiseerde akoestiese modelle verkry. Ons wys dat die voorgestelde herklassifiseringsalgoritme
nie tot enige verbeterings lei nie en dat dit die beste is om modelle op die
oorspronklike data af te rig.
|
404 |
Automatic alignment and error detection for phonetic transcriptions in the African speech technology project databases
De Villiers, Edward 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2006. / The African Speech Technology (AST) project ran from 2000 to 2004 and involved collecting speech data for five South African languages, transcribing the data and building automatic speech recognition systems in these languages. The work described here forms part of this project and involved implementing methods for automatic boundary placement in manually labelled files and for determining errors made by transcribers during the labelling process.
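As a hedged illustration of the error-detection task (assumed detail; the thesis's actual procedure may differ), a dynamic-programming alignment between a transcriber's phone string and a reference phone string exposes the substitutions, insertions and deletions that can flag likely transcription errors:

```python
def align_phones(ref, hyp):
    """Levenshtein alignment of two phone strings; returns the edit
    operations ('match', 'sub', 'ins', 'del') for error inspection."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    # Backtrace to recover the edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            ops.append('match' if ref[i - 1] == hyp[j - 1] else 'sub')
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append('del')
            i -= 1
        else:
            ops.append('ins')
            j -= 1
    return ops[::-1]
```

A high count of non-match operations for a given file is then a cue that the transcriber's labels (or the reference) deserve manual review.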
|
405 |
Evaluation of modern large-vocabulary speech recognition techniques and their implementation
Swart, Ranier Adriaan 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / In this thesis we studied large-vocabulary continuous speech recognition.
We considered the components necessary to realise a large-vocabulary speech
recogniser and how systems such as Sphinx and HTK solved the problems
facing such a system.
Hidden Markov Models (HMMs) have been a common approach to
acoustic modelling in speech recognition in the past. HMMs are well suited
to modelling speech, since they are able to model both its stationary nature
and temporal effects. We studied HMMs and the algorithms associated with
them. Since incorporating all knowledge sources as efficiently as possible is
of the utmost importance, the N-Best paradigm was explored along with
some more advanced HMM algorithms.
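As an illustrative aside, the Viterbi algorithm at the heart of HMM decoding can be sketched in a few lines (assuming per-frame emission log-likelihoods have already been computed; this is a sketch, not the system's decoder):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most-likely HMM state path.

    log_pi: (S,) initial state log-probabilities
    log_A:  (S, S) transition log-probabilities
    log_B:  (T, S) per-frame emission log-likelihoods
    Returns (state_path, path_log_score).
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (prev_state, next_state)
        back[t] = np.argmax(scores, axis=0)      # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta))
```

N-best decoding generalises this by keeping several partial hypotheses per state instead of only the single best one.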
The way in which sounds and words are constructed has been studied
extensively in the past. Context dependency on the acoustic level and on
the linguistic level can be exploited to improve the performance of a speech recogniser. We considered some of the techniques previously used to solve
the associated problems.
We implemented and combined some chosen algorithms to form our
system and reported the recognition results. Our final system performed
reasonably well and will form an ideal framework for future studies on
large-vocabulary speech recognition at the University of Stellenbosch. Many
avenues of research for future versions of the system were considered.
|
406 |
Fusion of phoneme recognisers for South African English
Strydom, George Wessel 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / ENGLISH ABSTRACT: Phoneme recognition systems typically suffer from low classification accuracy. Recognition
for South African English is especially difficult, due to the variety of vastly different accent
groups. This thesis investigates whether a fusion of classifiers, each trained on a specific
accent group, can outperform a single general classifier trained on all.
We implemented basic voting and score fusion techniques from which a small increase in
classifier accuracy could be seen. To ensure that similarly-valued output scores from different
classifiers imply the same opinion, these classifiers need to be calibrated before fusion. The
main focus point of this thesis is calibration with the Pool Adjacent Violators algorithm.
We achieved impressive gains in accuracy with this method and an in-depth investigation
was made into the role of the prior and the connection with the proportion of target to
non-target scores.
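A compact sketch of the Pool Adjacent Violators algorithm (illustrative code, not the implementation used in the thesis): scores are sorted, and adjacent blocks whose empirical target proportions violate monotonicity are pooled, yielding a non-decreasing map from score to calibrated probability.

```python
def pav_calibrate(scores, labels):
    """Fit a monotone map from scores to empirical target probabilities.

    scores: raw classifier scores; labels: 1 for target, 0 for non-target.
    Returns (sorted_scores, fitted_probs), with fitted_probs
    non-decreasing in score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ys = [float(labels[i]) for i in order]
    blocks = []  # each block is [label_sum, count]
    for y in ys:
        blocks.append([y, 1])
        # Pool while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [scores[i] for i in order], fitted
```

The fitted step function is what gets applied to new scores before fusion, so that equal calibrated values really do express equal opinions across classifiers.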
Calibration and fusion using the information metric Cllr were shown to perform impressively
with synthetic data, but only minor increases in accuracy were found for our phoneme
recognition system. The best results for this technique were achieved by calibrating each
classifier individually, fusing these calibrated classifiers and then finally calibrating the fused
system.
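The metric itself is compact: for target and non-target log-likelihood-ratio (LLR) scores, Cllr averages the cost, in bits, of each kind of score (a sketch of the standard definition, not the thesis's code):

```python
import math

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits: 0 for perfectly calibrated,
    perfectly discriminating LLRs; 1 for a useless system that always
    outputs LLR = 0."""
    c_tar = sum(math.log2(1.0 + math.exp(-llr)) for llr in target_llrs) / len(target_llrs)
    c_non = sum(math.log2(1.0 + math.exp(llr)) for llr in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_tar + c_non)
```

Minimising Cllr over an affine transform of the scores is one common way to calibrate a classifier before fusion.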
Boosting and Bagging classifiers were also briefly investigated as possible phoneme recognisers.
Our attempt did not achieve the target accuracy of the classifier trained on all the
accent groups.
The inherent difficulties typical of phoneme recognition were highlighted. Low per-class
accuracies, a large number of classes and an unbalanced speech corpus all had a negative
influence on the effectiveness of the tested calibration and fusion techniques. / AFRIKAANSE OPSOMMING: Foneemherkenningstelsels het tipies lae klassifikasie-akkuraatheid. As gevolg van die verskeidenheid
verskillende aksent groepe is herkenning vir Suid-Afrikaanse Engels veral moeilik.
Hierdie tesis ondersoek of ’n fusie van klassifiseerders, elk afgerig op ’n spesifieke aksent
groep, beter kan doen as ’n enkele klassifiseerder wat op alle groepe afgerig is.
Ons het basiese stem- en tellingfusie tegnieke geïmplementeer, wat tot ’n klein verbetering
in klassifiseerder akkuraatheid gelei het. Om te verseker dat soortgelyke uittreetellings van
verskillende klassifiseerders dieselfde opinie impliseer, moet hierdie klassifiseerders gekalibreer
word voor fusie. Die hoof fokuspunt van hierdie tesis is kalibrasie met die Pool Adjacent
Violators algoritme. Indrukwekkende toenames in akkuraatheid is behaal met hierdie
metode en ’n in-diepte ondersoek is ingestel oor die rol van die aanneemlikheidswaarskynlikhede
en die verwantskap met die verhouding van teiken tot nie-teiken tellings.
Kalibrasie en fusie met behulp van die informasie maatstaf Cllr lewer indrukwekkende
resultate met sintetiese data, maar slegs klein verbeterings in akkuraatheid is gevind vir
ons foneemherkenningstelsel. Die beste resultate vir hierdie tegniek is verkry deur elke
klassifiseerder afsonderlik te kalibreer, hierdie gekalibreerde klassifiseerders dan te kombineer
en dan die finale gekombineerde stelsel weer te kalibreer.
Boosting en Bagging klassifiseerders is ook kortliks ondersoek as moontlike foneem herkenners.
Ons poging het nie die akkuraatheid van ons basislyn klassifiseerder (wat op alle data
afgerig is) bereik nie.
Die inherente probleme wat tipies is tot foneemherkenning is uitgewys. Lae per-klas
akkuraatheid, ’n groot hoeveelheid klasse en ’n ongebalanseerde spraak korpus het almal ’n
negatiewe invloed op die effektiwiteit van die getoetsde kalibrasie en fusie tegnieke gehad.
|
407 |
Automatic Transcript Generator for Podcast Files
Holst, Andy January 2010 (has links)
In the modern world, the Internet has become a popular medium, but people with hearing disabilities, as well as search engines, cannot access the speech content of podcast files. To partially solve this problem, Sphinx decoders such as Sphinx-3 and Sphinx-4 can be used to implement an automatic transcript generator application, either by coupling an existing large acoustic model, language model and dictionary, or by training your own large acoustic model and language model and creating your own dictionary, to support a continuous, speaker-independent speech recognition system.
|
408 |
Useful Transcriptions of Webcast Lectures
Munteanu, Cosmin 25 September 2009 (has links)
Webcasts are an emerging technology enabled by the expanding availability and capacity of the World Wide Web. This has led to an increase in the number of lectures and academic presentations being broadcast over the Internet. Ideally, repositories of such webcasts would be used in the same manner as libraries: users could search for, retrieve, or browse through textual information. However, one major obstacle prevents webcast archives from becoming the digital equivalent of traditional libraries: information is mainly transmitted and stored in spoken form. Despite voice being currently present in all webcasts, users do not benefit from it beyond simple playback. My goal has been to exploit this information-rich resource and improve webcast users' experience in browsing and searching for specific information. I achieve this by combining research in Human-Computer Interaction and Automatic Speech Recognition that would ultimately see text transcripts of lectures being integrated into webcast archives.
In this dissertation, I show that the usefulness of automatically-generated transcripts of webcast lectures can be improved by speech recognition techniques specifically addressed at increasing the accuracy of webcast transcriptions, and the development of an interactive collaborative interface that facilitates users' contributions to machine-generated transcripts. I first investigate the user needs for transcription accuracy in webcast archives and show that users' performance and transcript quality perception is affected by the Word Error Rate (WER). A WER equal to or less than 25% is acceptable for use in webcast archives. As current Automatic Speech Recognition (ASR) systems can only deliver, in realistic lecture conditions, WERs of around 45-50%, I propose and evaluate a webcast system extension that engages users to collaborate in a wiki manner on editing imperfect ASR transcripts.
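As an illustrative aside, WER is a word-level edit distance (a minimal sketch, not the dissertation's evaluation code):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(ref)
```

Under this definition, a transcript meeting the 25% usability threshold has at most one word error for every four reference words.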
My research on ASR focuses on reducing the WER for lectures by making use of available external knowledge sources, such as documents on the World Wide Web and lecture slides, to better model the conversational and the topic-specific styles of lectures. I show that this approach results in relative WER reductions of 11%. Further ASR improvements are proposed that combine the research on language modelling with aspects of collaborative transcript editing. Extracting information about the most frequent ASR errors from user-edited partial transcripts, and attempting to correct such errors when they occur in the remaining transcripts, can lead to an additional 10 to 18% relative reduction in lecture WER.
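The final error-correction step can be sketched as follows (hypothetical code; it assumes word-aligned pairs of ASR output and user corrections have already been extracted from the edited portions of the transcript):

```python
from collections import Counter

def learn_corrections(pairs):
    """From (asr_word, corrected_word) pairs harvested from user-edited
    transcript segments, keep the most frequent correction for each
    ASR word that users actually changed."""
    counts = {}
    for asr, fixed in pairs:
        counts.setdefault(asr, Counter())[fixed] += 1
    table = {}
    for asr, c in counts.items():
        best, _ = c.most_common(1)[0]
        if best != asr:
            table[asr] = best
    return table

def apply_corrections(words, table):
    # Substitute known frequent errors in the remaining ASR output.
    return [table.get(w, w) for w in words]
```

A real system would condition corrections on context and confidence rather than on the word identity alone; this sketch only shows the frequency-based core.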
|
409 |
Lecture transcription systems in resource-scarce environments / Pieter Theunis de Villiers
De Villiers, Pieter Theunis January 2014 (has links)
Classroom note taking is a fundamental task performed by learners on a daily basis.
These notes provide learners with valuable offline study material, especially in the case of more difficult subjects. The use of class notes has been found not only to provide students with a better learning experience, but also to lead to an overall higher academic performance. In a previous study, an increase of 10.5% in student grades was observed after these students had been provided with multimedia class notes. This is not surprising, as other studies have found that the rate of successful transfer of information to humans increases when both visual and audio information is provided.
Note taking might seem like an easy task; however, students with hearing impairments, visual impairments, physical impairments, learning disabilities or even non-native listeners find this task very difficult or even impossible. It has also been reported that even non-disabled students find note taking time-consuming and that it requires a great deal of mental effort while also trying to pay full attention to the lecturer. This is illustrated by a study which found that college students were only able to record ~40% of the data presented by the lecturer. It is thus reasonable to expect an automatic way of generating class notes to be beneficial to all learners.
Lecture transcription (LT) systems are used in educational environments to assist learners by providing them with real-time in-class transcriptions, or recordings and transcriptions for offline use. Such systems have already been successfully implemented in the developed world, where all required resources were easily obtained. These systems are typically trained on hundreds to thousands of hours of speech, while their language models are trained on millions or even hundreds of millions of words. Such amounts of data are generally not available in the developing world. In this dissertation, a number of approaches toward the development of LT systems in resource-scarce environments are investigated.
We focus on different approaches to obtaining sufficient amounts of well-transcribed
data for building acoustic models, using corpora with few transcriptions and of variable quality. One approach investigates the use of a dynamic programming phone string alignment procedure to harvest as much usable data as possible from approximately transcribed speech data. We find that target-language acoustic models are optimal for this purpose, but encouraging results are also found when using models from another language for alignment.
Another approach entails using unsupervised training methods, where an initial low-accuracy recognizer is used to transcribe a set of untranscribed data. From this poorly transcribed data, correctly recognized portions are extracted based on a word confidence threshold. The initial system is then retrained along with the newly recognized data in order to increase its overall accuracy. The initial acoustic models are trained using as little as 11 minutes of transcribed speech. After several iterations of unsupervised training, a noticeable increase in accuracy was observed (47.79% WER to 33.44% WER). Similar results (35.97% WER) were, however, found after using a large speaker-independent corpus to train the initial system. Usable LMs were also created using as few as 17955 words from transcribed lectures; however, this resulted in large out-of-vocabulary rates. This problem was solved by means of LM interpolation, which was found to be very beneficial in cases where subject-specific data (such as lecture slides and books) was available.
We also introduce our NWU LT system, which was developed for use in learning environments and was designed using a client/server-based architecture. Based on the results found in this study, we are confident that usable models for use in LT systems can be developed in resource-scarce environments. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
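The LM interpolation mentioned above amounts to a linear mixture of word-probability estimates (an illustrative sketch; real systems interpolate n-gram models with backoff, with weights tuned on held-out data):

```python
def interpolate(lms, weights):
    """Linear interpolation of language models:
    P(w | h) = sum_k weight_k * P_k(w | h).
    Mixing a small in-domain lecture LM with a large background LM
    alleviates sparsity and out-of-vocabulary problems.
    Each lm is a callable (word, history) -> probability."""
    assert abs(sum(weights) - 1.0) < 1e-9
    def prob(word, history=()):
        return sum(w * lm(word, history) for w, lm in zip(weights, lms))
    return prob
```

With a held-out lecture transcript, the weights would typically be chosen to minimise perplexity; here they are simply fixed constants.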
|
410 |
Improving Grapheme-based speech recognition through P2G transliteration / W.D. Basson
Basson, Willem Diederick January 2014 (has links)
Grapheme-based speech recognition systems are faster to develop, but typically do not
reach the same level of performance as phoneme-based systems. Using Afrikaans speech
recognition as a case study, we first analyse the reasons for the discrepancy in performance, before introducing a technique for improving the performance of standard grapheme-based systems. It is found that by handling a relatively small number of irregular words through phoneme-to-grapheme (P2G) transliteration – transforming the original orthography of irregular words to an ‘idealised’ orthography – grapheme-based accuracy can be improved. An analysis of speech recognition accuracy based on word categories shows that P2G transliteration succeeds in improving certain word categories in which grapheme-based systems typically perform poorly, and that the problematic categories can be identified prior to system development. An evaluation is offered of when category-based P2G transliteration is beneficial and methods to implement the technique in practice are discussed. Comparative results are obtained for a second language (Vietnamese) in order to determine whether the technique can be generalised. / MSc (Computer Science) North-West University, Vaal Triangle Campus, 2014
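The core of the technique can be illustrated as follows (hypothetical code with a made-up respelling table; the thesis derives idealised spellings from pronunciations via P2G modelling, which is not shown here):

```python
# Hypothetical respelling table: irregular words mapped to an
# 'idealised' orthography that a grapheme-based system can treat
# regularly. The entries are invented for illustration only.
RESPELLINGS = {
    "bureau": "buro",     # made-up idealised spelling
    "quiche": "kiesj",    # made-up idealised spelling
}

def transliterate_transcript(transcript, respellings=RESPELLINGS):
    """Rewrite known irregular words to their idealised orthography
    before grapheme-based lexicon and acoustic-model training;
    regular words pass through unchanged."""
    return " ".join(respellings.get(w, w) for w in transcript.lower().split())
```

Because only a relatively small set of irregular words needs entries, the table stays cheap to build compared with a full phonemic pronunciation dictionary.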
|