261
Automatic alignment and error detection for phonetic transcriptions in the African Speech Technology project databases / De Villiers, Edward, 03 1900
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2006. / The African Speech Technology (AST) project ran from 2000 to 2004 and involved collecting speech data for five South African languages, transcribing the data and building automatic speech recognition systems in these languages. The work described here forms part of this project and involved implementing methods for automatic boundary placement in manually labelled files and for determining errors made by transcribers during the labelling process.
262
Evaluation of modern large-vocabulary speech recognition techniques and their implementation / Swart, Ranier Adriaan, 03 1900
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / In this thesis we studied large-vocabulary continuous speech recognition.
We considered the components necessary to realise a large-vocabulary speech
recogniser and how systems such as Sphinx and HTK solved the problems
facing such a system.
Hidden Markov Models (HMMs) have been a common approach to
acoustic modelling in speech recognition in the past. HMMs are well suited
to modelling speech, since they are able to model both its stationary nature
and temporal effects. We studied HMMs and the algorithms associated with
them. Since incorporating all knowledge sources as efficiently as possible is
of the utmost importance, the N-Best paradigm was explored along with
some more advanced HMM algorithms.
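As a concrete illustration of the kind of HMM algorithm referred to above, the following is a minimal sketch of Viterbi decoding for a discrete-observation HMM. The variable names and toy dimensions are illustrative and are not taken from the thesis.

    import numpy as np

    def viterbi(log_A, log_B, log_pi, obs):
        """Most likely state sequence for a discrete-observation HMM.

        log_A:  (S, S) log transition probabilities
        log_B:  (S, V) log emission probabilities
        log_pi: (S,)   log initial-state probabilities
        obs:    sequence of observation indices
        """
        S = log_A.shape[0]
        T = len(obs)
        delta = np.full((T, S), -np.inf)    # best log score ending in each state
        psi = np.zeros((T, S), dtype=int)   # back-pointers
        delta[0] = log_pi + log_B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A         # rows: from-state, cols: to-state
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
        # Trace the best path backwards from the best final state.
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta[-1].max())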
The way in which sounds and words are constructed has been studied
extensively in the past. Context dependency on the acoustic level and on
the linguistic level can be exploited to improve the performance of a speech recogniser. We considered some of the techniques used in the past to solve
the associated problems.
We implemented and combined some chosen algorithms to form our
system and reported the recognition results. Our final system performed
reasonably well and will form an ideal framework for future studies on
large-vocabulary speech recognition at the University of Stellenbosch. Many
avenues of research for future versions of the system were considered.
263
Fusion of phoneme recognisers for South African English / Strydom, George Wessel, 03 1900
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2009. / ENGLISH ABSTRACT: Phoneme recognition systems typically suffer from low classification accuracy. Recognition
for South African English is especially difficult, due to the variety of vastly different accent
groups. This thesis investigates whether a fusion of classifiers, each trained on a specific
accent group, can outperform a single general classifier trained on all.
We implemented basic voting and score fusion techniques, which yielded a small increase in
classifier accuracy. To ensure that similarly-valued output scores from different
classifiers imply the same opinion, these classifiers need to be calibrated before fusion. The
main focus point of this thesis is calibration with the Pool Adjacent Violators algorithm.
We achieved impressive gains in accuracy with this method and an in-depth investigation
was made into the role of the prior and the connection with the proportion of target to
non-target scores.
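To make the calibration step concrete, the sketch below fits a monotone map from raw scores to posterior-like values with isotonic regression, which is computed by the Pool Adjacent Violators algorithm (here via scikit-learn). The synthetic calibration data and the averaging fusion are illustrative assumptions, not the actual setup of the thesis.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def pav_calibrate(scores, labels):
        """Fit a monotone map from raw classifier scores to calibrated values.

        scores: raw output scores of one classifier on a calibration set
        labels: 1 for target (correct-class) trials, 0 for non-target trials
        IsotonicRegression is solved with the Pool Adjacent Violators algorithm.
        """
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(scores, labels)
        return iso

    # Illustrative use: calibrate one accent-specific classifier on synthetic data,
    # then apply the map to new scores before fusing (e.g. by averaging) with others.
    rng = np.random.default_rng(0)
    cal_scores = rng.normal(size=1000)
    cal_labels = (cal_scores + rng.normal(scale=1.0, size=1000) > 0).astype(int)
    calibrator = pav_calibrate(cal_scores, cal_labels)
    print(calibrator.predict(rng.normal(size=5)))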
Calibration and fusion using the information metric Cllr were shown to perform impressively
with synthetic data, but only minor increases in accuracy were found for our phoneme
recognition system. The best results for this technique were achieved by calibrating each
classifier individually, fusing these calibrated classifiers and then finally calibrating the fused
system.
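For reference, the Cllr metric mentioned above can be computed from target and non-target log-likelihood-ratio scores using its standard definition from the calibration literature. The sketch below is a generic implementation of that definition, not code from the thesis.

    import numpy as np

    def cllr(target_llrs, nontarget_llrs):
        """Log-likelihood-ratio cost in bits (lower is better; 0 is a perfect system).

        target_llrs:    calibrated LLR scores for target trials
        nontarget_llrs: calibrated LLR scores for non-target trials
        """
        target_llrs = np.asarray(target_llrs, dtype=float)
        nontarget_llrs = np.asarray(nontarget_llrs, dtype=float)
        # log2(1 + e^x) computed stably via logaddexp
        c_tar = np.mean(np.logaddexp(0.0, -target_llrs)) / np.log(2.0)
        c_non = np.mean(np.logaddexp(0.0, nontarget_llrs)) / np.log(2.0)
        return 0.5 * (c_tar + c_non)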
Boosting and Bagging classifiers were also briefly investigated as possible phoneme recognisers.
Our attempt did not reach the accuracy of the baseline classifier trained on all the
accent groups.
The inherent difficulties typical of phoneme recognition were highlighted. Low per-class
accuracies, a large number of classes and an unbalanced speech corpus all had a negative
influence on the effectiveness of the tested calibration and fusion techniques. / AFRIKAANSE OPSOMMING: Phoneme recognition systems typically have low classification accuracy. As a result of the variety of
different accent groups, recognition for South African English is especially difficult.
This thesis investigates whether a fusion of classifiers, each trained on a specific accent
group, can do better than a single classifier trained on all groups.
We implemented basic voting and score fusion techniques, which led to a small improvement
in classifier accuracy. To ensure that similar output scores from
different classifiers imply the same opinion, these classifiers must be calibrated
before fusion. The main focus of this thesis is calibration with the Pool Adjacent
Violators algorithm. Impressive increases in accuracy were achieved with this
method, and an in-depth investigation was conducted into the role of the prior probabilities
and the relationship with the ratio of target to non-target scores.
Calibration and fusion using the information metric Cllr deliver impressive
results on synthetic data, but only small improvements in accuracy were found for
our phoneme recognition system. The best results for this technique were obtained by calibrating each
classifier separately, then combining these calibrated classifiers,
and then calibrating the final combined system again.
Boosting and Bagging classifiers were also briefly investigated as possible phoneme recognisers.
Our attempt did not reach the accuracy of our baseline classifier (which was trained on all
the data).
The inherent problems typical of phoneme recognition were pointed out. Low per-class
accuracy, a large number of classes and an unbalanced speech corpus all had a
negative influence on the effectiveness of the tested calibration and fusion techniques.
264
Useful Transcriptions of Webcast Lectures / Munteanu, Cosmin, 25 September 2009
Webcasts are an emerging technology enabled by the expanding availability and capacity of the World Wide Web. This has led to an increase in the number of lectures and academic presentations being broadcast over the Internet. Ideally, repositories of such webcasts would be used in the same manner as libraries: users could search for, retrieve, or browse through textual information. However, one major obstacle prevents webcast archives from becoming the digital equivalent of traditional libraries: information is mainly transmitted and stored in spoken form. Despite voice being currently present in all webcasts, users do not benefit from it beyond simple playback. My goal has been to exploit this information-rich resource and improve webcast users' experience in browsing and searching for specific information. I achieve this by combining research in Human-Computer Interaction and Automatic Speech Recognition that would ultimately see text transcripts of lectures being integrated into webcast archives.
In this dissertation, I show that the usefulness of automatically-generated transcripts of webcast lectures can be improved by speech recognition techniques specifically aimed at increasing the accuracy of webcast transcriptions, and by the development of an interactive collaborative interface that facilitates users' contributions to machine-generated transcripts. I first investigate user needs for transcription accuracy in webcast archives and show that users' performance and perception of transcript quality are affected by the Word Error Rate (WER). A WER equal to or less than 25% is acceptable for use in webcast archives. As current Automatic Speech Recognition (ASR) systems can only deliver, in realistic lecture conditions, WERs of around 45-50%, I propose and evaluate a webcast system extension that engages users to collaborate in a wiki manner on editing imperfect ASR transcripts.
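For context, WER is the word-level edit (Levenshtein) distance between the recogniser output and a reference transcript, normalised by the reference length. The following is an illustrative computation, not code from the dissertation.

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + insertions + deletions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j]: edit distance between the first i reference and first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(word_error_rate("the lecture starts at nine", "a lecture starts nine"))  # 0.4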
My research on ASR focuses on reducing the WER for lectures by making use of available external knowledge sources, such as documents on the World Wide Web and lecture slides, to better model the conversational and the topic-specific styles of lectures. I show that this approach results in relative WER reductions of 11%. Further ASR improvements are proposed that combine the research on language modelling with aspects of collaborative transcript editing. Extracting information about the most frequent ASR errors from user-edited partial transcripts, and attempting to correct such errors when they occur in the remaining transcripts, can lead to an additional 10 to 18% relative reduction in lecture WER.
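One plausible way to realise the idea of learning frequent ASR errors from user-edited partial transcripts is sketched below: align each edited segment against the original ASR output, count word substitutions, and apply the frequent ones to the remaining transcripts. The function names and the word-level, substitution-only treatment are simplifying assumptions, not the dissertation's actual method.

    from collections import Counter
    from difflib import SequenceMatcher

    def frequent_corrections(asr_segments, edited_segments, min_count=3):
        """Count word substitutions users made when fixing ASR output and keep
        the frequent ones as candidate automatic corrections."""
        pairs = Counter()
        for asr, edited in zip(asr_segments, edited_segments):
            a, b = asr.split(), edited.split()
            for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
                if op == "replace" and (i2 - i1) == (j2 - j1):
                    pairs.update(zip(a[i1:i2], b[j1:j2]))
        return {wrong: right for (wrong, right), n in pairs.items() if n >= min_count}

    def apply_corrections(asr_text, corrections):
        """Substitute known frequent errors in a not-yet-edited transcript."""
        return " ".join(corrections.get(w, w) for w in asr_text.split())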
265
Lecture transcription systems in resource-scarce environments / De Villiers, Pieter Theunis, January 2014
Classroom note taking is a fundamental task performed by learners on a daily basis.
These notes provide learners with valuable offline study material, especially in the case of more difficult subjects. The use of class notes has been found not only to provide students with a better learning experience, but also to lead to higher overall academic performance. In a previous study, an increase of 10.5% in student grades was observed after these students had been provided with multimedia class notes. This is not surprising, as other studies have found that the rate of successful transfer of information to humans increases when both visual and audio information are provided. Note taking might seem like an easy task; however, students with hearing impairments, visual impairments, physical impairments, learning disabilities or even non-native listeners find this task very difficult or even impossible. It has also been reported that even non-disabled students find note taking time consuming and that it requires a great deal of mental effort while they also try to pay full attention to the lecturer. This is illustrated by a study which found that college students were only able to record ~40% of the data presented by the lecturer. It is thus reasonable to expect an automatic way of generating class notes to be beneficial to all learners. Lecture transcription (LT) systems are used in educational environments to assist learners by providing them with real-time in-class transcriptions or recordings and transcriptions for offline use. Such systems have already been successfully implemented in the developed world where all required resources were easily obtained. These systems are typically trained on hundreds to thousands of hours of speech while their language models are trained on millions or even hundreds of millions of words. These amounts of data are generally not available in the developing world. In this dissertation, a number of approaches toward the development of LT systems in resource-scarce environments are investigated.
We focus on different approaches to obtaining sufficient amounts of well-transcribed
data for building acoustic models, using corpora with few transcriptions and of variable quality. One approach uses a dynamic programming phone-string alignment procedure to harvest as much usable data as possible from approximately transcribed speech data. We find that target-language acoustic models are optimal for this purpose, but encouraging results are also found when using models from another language for alignment. Another approach entails using unsupervised training methods, in which an initial low-accuracy recognizer is used to transcribe a set of untranscribed data. Using this poorly transcribed data, correctly recognized portions are extracted based on a word confidence threshold. The initial system is then retrained with the newly recognized data added, in order to increase its overall accuracy. The initial acoustic models are trained using as little as 11 minutes of transcribed speech. After several iterations of unsupervised training, a noticeable increase in accuracy was observed (47.79% WER to 33.44% WER). Similar results were, however, found (35.97% WER) after using a large speaker-independent corpus to train the initial system. Usable LMs were also created using as few as 17,955 words from transcribed lectures; however, this resulted in large out-of-vocabulary rates. This problem was solved by means of LM interpolation. LM interpolation was found to be very beneficial in cases where subject-specific data (such as lecture slides and books) was available. We also introduce our NWU LT system, which was developed for use in learning environments and was designed using a client/server-based architecture. Based on the results found in this study, we are confident that usable models for use in LT systems can be developed in resource-scarce environments. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
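The sketch below illustrates two of the ingredients described above: keeping only confidently recognised words during a round of unsupervised training, and linearly interpolating a small in-domain LM with a larger background LM. The `recognizer.decode` interface, the threshold value and the interpolation weight are hypothetical placeholders, not the dissertation's implementation.

    def select_confident_words(recognizer, untranscribed_wavs, confidence_threshold=0.9):
        """One round of unsupervised data selection: decode untranscribed audio and
        keep only the words whose confidence exceeds a threshold, for retraining.
        `recognizer.decode` is a hypothetical interface returning a list of
        (word, confidence, start_time, end_time) tuples per utterance."""
        selected = []
        for wav in untranscribed_wavs:
            hypothesis = recognizer.decode(wav)
            confident = [(w, s, e) for (w, c, s, e) in hypothesis if c >= confidence_threshold]
            if confident:
                selected.append((wav, confident))
        return selected

    def interpolate_lm_prob(p_lecture, p_background, lam=0.7):
        """Linear LM interpolation:
        P(w | h) = lam * P_lecture(w | h) + (1 - lam) * P_background(w | h)."""
        return lam * p_lecture + (1.0 - lam) * p_background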
266
Improving Grapheme-based speech recognition through P2G transliteration / Basson, Willem Diederick, January 2014
Grapheme-based speech recognition systems are faster to develop, but typically do not
reach the same level of performance as phoneme-based systems. Using Afrikaans speech
recognition as a case study, we first analyse the reasons for the discrepancy in performance, before introducing a technique for improving the performance of standard grapheme-based systems. It is found that by handling a relatively small number of irregular words through phoneme-to-grapheme (P2G) transliteration – transforming the original orthography of irregular words to an ‘idealised’ orthography – grapheme-based accuracy can be improved. An analysis of speech recognition accuracy based on word categories shows that P2G transliteration succeeds in improving certain word categories in which grapheme-based systems typically perform poorly, and that the problematic categories can be identified prior to system development. An evaluation is offered of when category-based P2G transliteration is beneficial and methods to implement the technique in practice are discussed. Comparative results are obtained for a second language (Vietnamese) in order to determine whether the technique can be generalised. / MSc (Computer Science) North-West University, Vaal Triangle Campus, 2014
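A minimal sketch of the idea behind P2G transliteration is given below: irregular words are rewritten to an 'idealised' spelling before the grapheme lexicon and training transcripts are built. The transliteration table entries are invented English placeholders, not the Afrikaans or Vietnamese cases studied in the dissertation.

    # Hypothetical transliteration table: irregular surface forms are mapped to an
    # 'idealised' spelling whose graphemes follow the pronunciation more closely.
    P2G_TABLE = {
        "enough": "enuf",
        "yacht": "jot",
    }

    def idealise(transcript, table=P2G_TABLE):
        """Rewrite irregular words so grapheme-based acoustic models are trained on a
        more phoneme-like orthography; regular words pass through unchanged."""
        return " ".join(table.get(word, word) for word in transcript.lower().split())

    def grapheme_lexicon(words):
        """In a grapheme-based system the 'pronunciation' of a word is simply its letters."""
        return {w: list(w) for w in words}

    print(idealise("Enough wind for the yacht"))   # -> "enuf wind for the jot"
    print(grapheme_lexicon(["enuf", "wind"]))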
267
Speaker normalisation for large vocabulary multiparty conversational speech recognition / Garau, Giulia, January 2009
One of the main problems faced by automatic speech recognition is the variability of the testing conditions. This is due both to the acoustic conditions (different transmission channels, recording devices, noise, etc.) and to the variability of speech across different speakers (i.e. due to different accents, coarticulation of phonemes and different vocal tract characteristics). Vocal tract length normalisation (VTLN) aims at normalising the acoustic signal, making it independent of the vocal tract length. This is done by a speaker-specific warping of the frequency axis parameterised through a warping factor. In this thesis the application of VTLN to multiparty conversational speech was investigated, focusing on the meeting domain. This is a challenging task that shows great variability in the speech acoustics, both across different speakers and across time for a given speaker. VTL, the distance between the lips and the glottis, varies over time. We observed that the warping factors estimated using Maximum Likelihood seem to be context dependent, appearing to be influenced by the current conversational partner and being correlated with the behaviour of formant positions and the pitch. This is because VTL also influences the frequency of vibration of the vocal cords and thus the pitch. In this thesis we also investigated pitch-adaptive acoustic features with the goal of further improving the speaker normalisation provided by VTLN. We explored the use of acoustic features obtained using a pitch-adaptive analysis in combination with conventional features such as Mel frequency cepstral coefficients. These spectral representations were combined both at the acoustic feature level using heteroscedastic linear discriminant analysis (HLDA), and at the system level using ROVER. We evaluated this approach on a challenging large vocabulary speech recognition task: multiparty meeting transcription. We found that VTLN benefits the most from pitch-adaptive features. Our experiments also suggested that combining conventional and pitch-adaptive acoustic features using HLDA results in a consistent, significant decrease in the word error rate across all the tasks. Combining at the system level using ROVER resulted in a further significant improvement. Further experiments compared the use of a pitch-adaptive spectral representation with the adoption of a smoothed spectrogram for the extraction of cepstral coefficients. It was found that pitch-adaptive spectral analysis, providing a representation which is less affected by pitch artefacts (especially for high-pitched speakers), delivers features with improved speaker independence. Furthermore, this has also been shown to be advantageous when HLDA is applied. The combination of a pitch-adaptive spectral representation and VTLN-based speaker normalisation in the context of LVCSR for multiparty conversational speech led to more speaker-independent acoustic models, improving overall recognition performance.
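The warping-factor idea can be illustrated with one common piecewise-linear variant of VTLN frequency warping. The cut-off ratio and sampling range below are illustrative assumptions, not the parameters used in the thesis.

    import numpy as np

    def vtln_warp(freqs, alpha, f_cut_ratio=0.85, f_max=8000.0):
        """Piecewise-linear VTLN warp of the frequency axis (one common variant).

        alpha > 1 compresses the spectrum (longer vocal tract), alpha < 1 stretches it.
        Frequencies below the cut-off are scaled by alpha; above it, a second linear
        piece maps f_max onto itself so the warped axis stays within [0, f_max].
        """
        freqs = np.asarray(freqs, dtype=float)
        f_cut = f_cut_ratio * f_max
        return np.where(
            freqs <= f_cut,
            alpha * freqs,
            alpha * f_cut + (f_max - alpha * f_cut) * (freqs - f_cut) / (f_max - f_cut),
        )

    # Warp the centre frequencies of a filterbank for a speaker with alpha = 1.1.
    centres = np.linspace(0, 8000, 24)
    print(vtln_warp(centres, alpha=1.1)[:5])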
268
Linear dynamic models for automatic speech recognition / Frankel, Joe, January 2004
The majority of automatic speech recognition (ASR) systems rely on hidden Markov models (HMM), in which the output distribution associated with each state is modelled by a mixture of diagonal covariance Gaussians. Dynamic information is typically included by appending time-derivatives to feature vectors. This approach, whilst successful, makes the false assumption of framewise independence of the augmented feature vectors and ignores the spatial correlations in the parametrised speech signal. This dissertation seeks to address these shortcomings by exploring acoustic modelling for ASR with an application of a form of state-space model, the linear dynamic model (LDM). Rather than modelling individual frames of data, LDMs characterize entire segments of speech. An auto-regressive state evolution through a continuous space gives a Markovian model of the underlying dynamics, and spatial correlations between feature dimensions are absorbed into the structure of the observation process. LDMs have been applied to speech recognition before; however, a smoothed Gauss-Markov form was used, which ignored the potential for subspace modelling. The continuous dynamical state means that information is passed along the length of each segment. Furthermore, if the state is allowed to be continuous across segment boundaries, long range dependencies are built into the system and the assumption of independence of successive segments is loosened. The state provides an explicit model of temporal correlation which sets this approach apart from frame-based and some segment-based models where the ordering of the data is unimportant. The benefits of such a model are examined both within and between segments. LDMs are well suited to modelling smoothly varying, continuous, yet noisy trajectories such as found in measured articulatory data. Using speaker-dependent data from the MOCHA corpus, the performance of systems which model acoustic, articulatory, and combined acoustic-articulatory features is compared. As well as measured articulatory parameters, experiments use the output of neural networks trained to perform an articulatory inversion mapping. The speaker-independent TIMIT corpus provides the basis for larger scale acoustic-only experiments. Classification tasks provide an ideal means to compare modelling choices without the confounding influence of recognition search errors, and are used to explore issues such as choice of state dimension, front-end acoustic parametrization and parameter initialization. Recognition for segment models is typically more computationally expensive than for frame-based models. Unlike frame-level models, it is not always possible to share likelihood calculations for observation sequences which occur within hypothesized segments that have different start and end times. Furthermore, the Viterbi criterion is not necessarily applicable at the frame level. This work introduces a novel approach to decoding for segment models in the form of a stack decoder with A* search. Such a scheme allows flexibility in the choice of acoustic and language models since the Viterbi criterion is not integral to the search, and hypothesis generation is independent of the particular language model. Furthermore, the time-asynchronous ordering of the search means that only likely paths are extended, and so a minimum number of models are evaluated. The decoder is used to give full recognition results for feature-sets derived from the MOCHA and TIMIT corpora.
Conventional train/test divisions and choice of language model are used so that results can be directly compared to those in other studies. The decoder is also used to implement Viterbi training, in which model parameters are alternately updated and then used to re-align the training data.
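For readers unfamiliar with state-space models, the sketch below generates observations from a linear dynamic model of the form described above: an auto-regressive hidden state with a linear-Gaussian observation process. The matrices and dimensions are toy values, not those of the systems in the thesis.

    import numpy as np

    def simulate_ldm(A, C, state_cov, obs_cov, x0, T, rng=None):
        """Generate observations from a linear dynamic model:
            x_t = A x_{t-1} + w_t,   w_t ~ N(0, state_cov)
            y_t = C x_t     + v_t,   v_t ~ N(0, obs_cov)
        The hidden state x evolves smoothly over the segment, while the observation
        matrix C absorbs correlations between feature dimensions."""
        rng = rng or np.random.default_rng(0)
        x, states, obs = np.asarray(x0, dtype=float), [], []
        for _ in range(T):
            x = A @ x + rng.multivariate_normal(np.zeros(len(x)), state_cov)
            y = C @ x + rng.multivariate_normal(np.zeros(C.shape[0]), obs_cov)
            states.append(x)
            obs.append(y)
        return np.array(states), np.array(obs)

    # Toy 2-D state driving 3-D "acoustic" observations for a 50-frame segment.
    A = np.array([[0.95, 0.1], [0.0, 0.9]])
    C = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
    states, obs = simulate_ldm(A, C, 0.01 * np.eye(2), 0.05 * np.eye(3), [1.0, -1.0], T=50)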
269
End-to-End Speech Recognition Models / Chan, William, 01 December 2016
For the past few decades, the bane of Automatic Speech Recognition (ASR) systems has been phonemes and Hidden Markov Models (HMMs). HMMs assume conditional independence between observations, and the reliance on explicit phonetic representations requires expensive handcrafted pronunciation dictionaries. Learning is often via detached proxy problems, and a disconnect often exists between acoustic model performance and actual speech recognition performance. Connectionist Temporal Classification (CTC) character models were recently proposed as attempts to solve some of these issues, namely jointly learning the pronunciation model and acoustic model. However, HMM and CTC models still suffer from conditional independence assumptions and must rely heavily on language models during decoding. In this thesis, we question the traditional paradigm of ASR and highlight the limitations of HMM and CTC models. We propose a novel approach to ASR with neural attention models and we directly optimize speech transcriptions. Our proposed method is not only an end-to-end trained system but also an end-to-end model. The end-to-end model jointly learns all the traditional components of a speech recognition system: the pronunciation model, acoustic model and language model. Our model can directly emit English/Chinese characters or even word pieces given the audio signal. There is no need for explicit phonetic representations, intermediate heuristic loss functions or conditional independence assumptions. We demonstrate our end-to-end speech recognition model on various ASR tasks. We show competitive results compared to a state-of-the-art HMM-based system on the Google voice search task. We demonstrate an online end-to-end Chinese Mandarin model and show how to jointly optimize the Pinyin transcriptions during training. Finally, we also show state-of-the-art results on the Wall Street Journal ASR task compared to other end-to-end models.
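A minimal sketch of the content-based attention step at the heart of such end-to-end models is shown below. It uses a simple dot-product score over the encoder states, whereas the models in the thesis use learned scoring functions, so this is an illustration of the mechanism rather than the thesis's architecture.

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        """Content-based attention: score each encoder frame against the current
        decoder state, normalise with a softmax over time, and return the weighted
        context vector the decoder conditions on when emitting the next character."""
        scores = encoder_states @ decoder_state        # (T,) dot-product scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                        # softmax over encoder frames
        context = weights @ encoder_states              # (D,) context vector
        return context, weights

    # Toy example: 20 encoder frames with 8-dimensional hidden states.
    rng = np.random.default_rng(0)
    enc = rng.normal(size=(20, 8))
    dec = rng.normal(size=8)
    ctx, w = attention_context(dec, enc)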
270
The Effectiveness of Speech Recognition as a User Interface for Computer-Based Training / Creech, Wayne E. (Wayne Everette), 08 1900
Some researchers say that natural language is probably one of the most promising interfaces for long-term use because it is simple to learn. If this is true, then it follows that speech recognition would be ideal as the interface for computer-based training (CBT). While many speech recognition applications are being used as a computer interface, these are usually confined to controlling the computer or causing the computer to control other devices. The user interface has been the focus of strong efforts to improve the quality of communication between man and machine and is proposed to be a dominant factor in determining user productivity, performance, and satisfaction. However, other researchers note that full natural interfaces with computers are still a long way from the state of the art. The focus of this study was to determine if the technology of speech recognition is an effective interface for an academic lesson presented via CBT. How does one determine if learning has been affected and how is this measured? Previous research has attempted to quantify a learning effect when using a variety of interfaces. This dissertation summarizes previous studies using other interfaces and those using speech recognition. It attempted to apply a framework used to measure learning effectiveness in some of these studies to quantify the measurement of learning when speech recognition is used as the sole interface. The focus of the study was on cognitive processing, which affects short-term memory and, in turn, original learning (OL). The methods and procedures applied in an experimental study were presented.