251 |
The effects of recognition accuracy and vocabulary size of a speech recognition system on task performance and user acceptance
Casali, Sherry P. 22 June 2010 (has links)
Automatic speech recognition systems have at last advanced to the state that they are now a feasible alternative for human-machine communication in selected applications. As such, research efforts are now beginning to focus on the characteristics of the human, the recognition device, and the interface that optimize system performance, rather than following the previous trend of determining the factors affecting recognizer performance alone. This study investigated two characteristics of the recognition device, namely the accuracy level at which it recognizes speech and the size of its vocabulary as a percentage of the task vocabulary, to determine their effects on system performance. In addition, the study considered one characteristic of the user, age. Briefly, subjects performed a data entry task under each of the treatment conditions. Task completion time and the number of errors remaining at the end of each session were recorded. After each session, subjects rated the acceptability of the recognition device for the task.
The accuracy level at which the recognizer was performing significantly influenced the task completion time as well as the user's acceptability ratings, but had only a small effect on the number of errors left uncorrected. The available vocabulary size also significantly affected the task completion time; however, its effect on the final error rate and on the acceptability ratings was negligible. The age of the subject was also found to influence both objective and subjective measures. Older subjects in general required longer times to complete the tasks; however, they consistently rated the speech input systems more favorably than the younger subjects. / Master of Science
|
252 |
Development of robust language models for speech recognition of under-resourced language
Sindana, Daniel January 2020 (has links)
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2020 / Language modelling (LM) work for under-resourced languages that does not consider most of the linguistic information inherent in a language produces language models that inadequately represent the language, thereby leading to the under-development of natural language processing tools and systems such as speech recognition systems. This study investigated the influence that the orthography (i.e., writing system) of a language has on the quality and/or robustness of the language models created for the text of that language. The unique conjunctive and disjunctive writing systems of isiNdebele (Ndebele) and Sepedi (Pedi) were studied.
The text data from the LWAZI and NCHLT speech corpora were used to develop language models. The LM techniques that were implemented included word-based n-gram LMs, LM smoothing, LM linear interpolation, and higher-order n-gram LMs. The toolkits used for development were the HTK LM, SRILM, and CMU-Cam SLM toolkits.
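As a rough illustration of the smoothing and interpolation techniques listed above (not code from the thesis, which used the HTK, SRILM and CMU-Cam toolkits), the following Python sketch estimates add-one-smoothed unigram, bigram and trigram probabilities from a toy corpus and combines them by linear interpolation. The example sentences, the interpolation weights and the choice of add-one smoothing are illustrative assumptions rather than the study's settings.

from collections import Counter

def train_counts(sentences):
    # Collect unigram, bigram and trigram counts with sentence-boundary markers.
    uni, bi, tri = Counter(), Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(toks)):
            uni[toks[i]] += 1
            bi[(toks[i-1], toks[i])] += 1
            tri[(toks[i-2], toks[i-1], toks[i])] += 1
    return uni, bi, tri

def interpolated_prob(w, u, v, uni, bi, tri, lambdas=(0.5, 0.3, 0.2)):
    # Linear interpolation of add-one-smoothed trigram, bigram and unigram estimates.
    V = len(uni) + 1                      # vocabulary size (+1 for unseen words)
    N = sum(uni.values())
    p_uni = (uni[w] + 1) / (N + V)
    p_bi = (bi[(v, w)] + 1) / (uni[v] + V)
    p_tri = (tri[(u, v, w)] + 1) / (bi[(u, v)] + V)
    l3, l2, l1 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Toy usage with two illustrative Sepedi-like sentences.
uni, bi, tri = train_counts(["ke a leboga", "ke a go rata"])
print(interpolated_prob("leboga", "ke", "a", uni, bi, tri))

In practice the interpolation weights would be tuned on held-out text, and a discounting scheme such as Kneser-Ney would normally replace add-one smoothing.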
From the findings of the study – on text preparation, data pooling and sizing, higher-order n-gram models, and interpolation of models – it is concluded that the orthography of the selected languages does have an effect on the quality of the language models created for their text. The following recommendations are made for LM development for the languages concerned. 1) Specially prepare and normalise the text data before LM development, paying attention to within-sentence text markers and annotation tags that may incorrectly form part of sentences, word sequences, and n-gram contexts. 2) Enable interpolation during training. 3) Develop 5-gram and 6-gram language models for Pedi texts, and trigram and quadrigram models for Ndebele texts. 4) Investigate efficient smoothing methods for the different languages, especially for different text sizes and different text domains. / National Research Foundation (NRF)
Telkom
University of Limpopo
|
253 |
A motion based approach for audio-visual automatic speech recognition
Ahmad, Nasir January 2011 (has links)
The research work presented in this thesis introduces novel approaches to both visual region-of-interest extraction and visual feature extraction for use in audio-visual automatic speech recognition. In particular, the speaker's movement that occurs during speech is used to isolate the mouth region in video sequences, and motion-based features obtained from this region are used to provide new visual features for audio-visual automatic speech recognition. The mouth region extraction approach proposed in this work is shown to give superior performance compared with existing colour-based lip segmentation methods. The new features are obtained from three separate representations of motion in the region of interest, namely the difference in luminance between successive images, block-matching-based motion vectors, and optical flow. The new visual features are found to improve visual-only and audio-visual speech recognition performance when compared with the commonly used appearance-based feature methods. In addition, a novel approach is proposed for visual feature extraction from either the discrete cosine transform or discrete wavelet transform representations of the mouth region of the speaker. In this work, the image transform is explored from a new viewpoint of data discrimination, in contrast to the more conventional data-preservation viewpoint. The main findings of this work are that audio-visual automatic speech recognition systems using the new features, extracted from the frequency bands selected according to their discriminatory abilities, generally outperform those using features designed for data preservation. To establish the noise robustness of the new features proposed in this work, their performance has been studied in the presence of a range of different types of noise and at various signal-to-noise ratios. In these experiments, the audio-visual automatic speech recognition systems based on the new approaches were found to give superior performance both to audio-visual systems using appearance-based features and to audio-only speech recognition systems.
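As a hedged sketch of how two of the three motion representations mentioned above can be computed (this is not the thesis implementation; it assumes the OpenCV and NumPy Python libraries and that the mouth region of interest has already been located as a bounding box), the following code derives simple frame-difference and dense optical-flow statistics from successive video frames.

import cv2
import numpy as np

def motion_features(prev_frame, curr_frame, roi):
    # roi = (x, y, w, h): an already-detected mouth bounding box (assumed given).
    x, y, w, h = roi
    prev = cv2.cvtColor(prev_frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(curr_frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)

    # 1) Difference in luminance between successive images.
    diff = cv2.absdiff(curr, prev).astype(np.float32)

    # 2) Dense optical flow (Farneback method).
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Reduce each representation to a small feature vector (illustrative choice).
    mag = np.linalg.norm(flow, axis=2)
    return np.array([diff.mean(), diff.std(), mag.mean(), mag.std()])

A block-matching motion-vector representation, the third option described in the abstract, would replace the Farneback call with a search over candidate block displacements between the two frames.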
|
254 |
Improved MFCC Front End Using Spectral Maxima For Noisy Speech Recognition
Sujatha, J 11 1900 (has links) (PDF)
No description available.
|
255 |
Efficient development of human language technology resources for resource-scarce languages / Martin Johannes Puttkammer
Puttkammer, Martin Johannes January 2014 (has links)
The development of linguistic data, especially annotated corpora, is imperative for the human language technology enablement of any language. The annotation process is, however, often time-consuming and expensive. As such, various projects make use of several strategies to expedite the development of human language technology resources. For resource-scarce languages – those with limited resources, finances and expertise – the efficiency of these strategies has not been conclusively established. This study investigates the efficiency of some of these strategies in the development of resources for resource-scarce languages, in order to provide recommendations for future projects facing decisions regarding which strategies they should implement.
For all experiments, Afrikaans is used as an example of a resource-scarce language. Two tasks, viz. lemmatisation of text data and orthographic transcription of audio data, are evaluated in terms of quality and in terms of the time required to perform the task. The main focus of the study is on the skill level of the annotators, software environments which aim to improve the quality and time needed to perform annotations, and whether it is beneficial to annotate more data, or to increase the quality of the data. We outline and conduct systematic experiments on each of the three focus areas in order to determine the efficiency of each.
First, we investigated the influence of a respondent’s skill level on data annotation by using untrained, sourced respondents for annotation of linguistic data for Afrikaans. We compared data annotated by experts, novices and laymen. From the results it was evident that the experts outperformed the non-experts on both tasks, and that the differences in performance were statistically significant.
Next, we investigated the effect of software environments on data annotation to determine the benefits of using tailor-made software as opposed to general-purpose or domain-specific software. The comparison showed that, for these two specific projects, it was beneficial in terms of time and quality to use tailor-made software rather than domain-specific or general-purpose software. However, in the context of linguistic annotation of data for resource-scarce languages, the additional time needed to develop tailor-made software is not justified by the savings in annotation time.
Finally, we compared systems trained with data of varying levels of quality and quantity, to determine the impact of quality versus quantity on the performance of systems. When comparing systems trained with gold standard data to systems trained with more data containing a low level of errors, the systems
trained with the erroneous data were statistically significantly better. Thus, we conclude that it is more beneficial to focus on the quantity rather than on the quality of training data.
Based on the results and analyses of the experiments, we offer some recommendations regarding which of the methods should be implemented in practice. For a project aiming to develop gold standard data, the highest quality annotations can be obtained by using experts to double-blind annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). For a project that aims to develop a core technology, experts or trained novices should be used to single-annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). / PhD (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2014
|
257 |
Audio-visual automatic speech recognition using Dynamic Bayesian Networks
Reikeras, Helge 03 1900 (links)
Thesis (MSc (Applied mathematics))--University of Stellenbosch, 2011. / Includes bibliography. / Please refer to full text to view abstract.
|
258 |
Non-acoustic speaker recognition
Du Toit, Ilze 12 1900 (has links)
Thesis (MScIng)--University of Stellenbosch, 2004. / ENGLISH ABSTRACT: In this study the phoneme labels derived from a phoneme recogniser are used for phonetic
speaker recognition. The time-dependencies among phonemes are modelled by using
hidden Markov models (HMMs) for the speaker models. Experiments are done using first-order and second-order HMMs, and various smoothing techniques are examined to address
the problem of data scarcity. The use of word labels for lexical speaker recognition is also
investigated. Single-word frequencies are counted and various word selections are evaluated as feature sets. During April 2004, the University of Stellenbosch, in collaboration
with Spescom DataVoice, participated in an international speaker verification
competition presented by the National Institute of Standards and Technology (NIST). The
University of Stellenbosch submitted phonetic and lexical (non-acoustic) speaker recognition
systems and a fused system (the primary system) that fuses the acoustic system of
Spescom DataVoice with the non-acoustic systems of the University of Stellenbosch. The
results were evaluated by means of a cost model. Based on the cost model, the primary
system obtained second and third position in the two categories that were submitted. / AFRIKAANSE OPSOMMING: This project uses phoneme labels classified by a phoneme recogniser, which are then used for phonetic speaker recognition. The time dependencies between phonemes are modelled by using hidden Markov models (HMMs) as speaker models. Experiments are conducted with first-order and second-order HMMs, and various smoothing techniques are investigated to address data scarcity. The use of word labels for speaker recognition is also investigated. Single-word frequencies are counted and various word selections are evaluated as features for speaker recognition. During April 2004, the University of Stellenbosch, in collaboration with Spescom DataVoice, participated in an international speaker verification competition presented by the National Institute of Standards and Technology (NIST). The University of Stellenbosch entered a phonetic and a word-based (non-acoustic) speaker recognition system, as well as a fused system that serves as the primary system. The fused system is a combination of Spescom DataVoice's acoustic system and the two non-acoustic systems of the University of Stellenbosch. The results were evaluated by means of a cost model. Based on the cost model, the primary system achieved second and third place in the two categories entered.
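A much simplified sketch of the phonetic speaker-modelling idea described above is given below. It replaces the first- and second-order hidden Markov models of the thesis with a plain first-order Markov chain over phoneme labels, uses add-one smoothing as one possible answer to data scarcity, and scores a test label sequence by log-likelihood; the class name, phone inventory and toy sequences are illustrative assumptions.

import math
from collections import Counter

class PhoneticSpeakerModel:
    # First-order Markov chain over phoneme labels (a simplification of the
    # first- and second-order HMMs used in the thesis), with add-one smoothing.
    def __init__(self, phone_inventory):
        self.phones = list(phone_inventory)
        self.trans = Counter()
        self.context = Counter()

    def train(self, label_sequences):
        for seq in label_sequences:
            for prev, curr in zip(seq, seq[1:]):
                self.trans[(prev, curr)] += 1
                self.context[prev] += 1

    def log_likelihood(self, seq):
        V = len(self.phones)
        score = 0.0
        for prev, curr in zip(seq, seq[1:]):
            p = (self.trans[(prev, curr)] + 1) / (self.context[prev] + V)
            score += math.log(p)
        return score

# Toy usage: one model per enrolled speaker, highest log-likelihood wins.
model = PhoneticSpeakerModel(["a", "e", "n", "s", "t"])
model.train([["t", "e", "s", "t"], ["s", "e", "n", "t"]])
print(model.log_likelihood(["t", "e", "n", "t"]))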
|
259 |
Tree-based Gaussian mixture models for speaker verification
Cilliers, Francois Dirk 12 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2005. / The Gaussian mixture model (GMM) performs very effectively in applications
such as speech and speaker recognition. However, evaluation speed is greatly
reduced when the GMM has a large number of mixture components. Various
techniques improve the evaluation speed by reducing the number of required
Gaussian evaluations.
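To make the evaluation-cost problem concrete (an illustrative sketch only, not the tree-based method developed in the thesis), the code below evaluates a diagonal-covariance GMM log-likelihood in full and then again over a shortlist of the components whose means lie nearest to the observation, which reduces the number of Gaussian evaluations at the cost of an approximation. NumPy, the mixture sizes and the nearest-mean shortlist are assumptions.

import numpy as np

def gmm_log_likelihood(x, weights, means, variances, shortlist=None):
    # Diagonal-covariance GMM log-likelihood for one observation x.
    # If `shortlist` is given, only those mixture components are evaluated,
    # approximating the full sum to reduce the number of Gaussian evaluations.
    idx = np.arange(len(weights)) if shortlist is None else np.asarray(shortlist)
    d = x - means[idx]
    log_gauss = -0.5 * (np.sum(d * d / variances[idx], axis=1)
                        + np.sum(np.log(2 * np.pi * variances[idx]), axis=1))
    return np.logaddexp.reduce(np.log(weights[idx]) + log_gauss)

def nearest_components(x, means, k=8):
    # A crude stand-in for a tree-based shortlist: pick the k nearest means.
    dists = np.linalg.norm(means - x, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
M, D = 512, 39                           # components and feature dimension (illustrative)
weights = np.full(M, 1.0 / M)
means = rng.normal(size=(M, D))
variances = np.ones((M, D))
x = rng.normal(size=D)
full = gmm_log_likelihood(x, weights, means, variances)
fast = gmm_log_likelihood(x, weights, means, variances, nearest_components(x, means))
print(full, fast)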
|
260 |
The design of a high-performance, floating-point embedded system for speech recognition and audio research purposes
Duckitt, William 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--Stellenbosch University, 2008. / This thesis describes the design of a high-performance, floating-point, standalone embedded system that is appropriate for speech and audio processing purposes.
The system successfully employs the Analog Devices TigerSHARC TS201 600 MHz floating-point digital signal processor as its CPU, and includes 512 MB of RAM, a CompactFlash storage card interface as non-volatile memory, a multi-channel audio input and output system with two programmable microphone preamplifiers offering up to 65 dB of gain, a USB interface, an LCD display and a push-button user interface.
An Altera Cyclone II FPGA is used to interface the CPU with the various peripheral
components. The FIFO buffers within the FPGA allow bulk DMA transfers of audio data for minimal
processor delays. Similar approaches are taken for communication with the USB interface, the
CompactFlash storage card and the LCD display.
A logic analyzer interface allows system debugging via the FPGA. This interface can also in
future be used to interface to additional components. The power distribution required a total of 11 different supply voltages, with a total consumption of 16.8 W. A six-layer PCB incorporating four signal layers, a power plane, and a ground plane was designed for the final prototype.
All system components were verified to be operating correctly by means of appropriate
testing software, and the computational performance was measured by repeated calculation of a
multi-dimensional Gaussian log-probability and found to be comparable with that of a 1.8 GHz Intel Core 2 Duo processor.
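The benchmark described above can be approximated on a general-purpose machine with a sketch along the following lines (this is not the thesis benchmark code; the 39-dimensional feature vector, the iteration count and the use of NumPy are arbitrary illustrative choices), which times repeated evaluation of a diagonal-covariance multivariate Gaussian log-probability.

import time
import numpy as np

def gaussian_log_prob(x, mean, var):
    # Log-probability of x under a diagonal-covariance multivariate Gaussian.
    d = x - mean
    return -0.5 * (np.sum(d * d / var) + np.sum(np.log(2 * np.pi * var)))

rng = np.random.default_rng(1)
dim, iterations = 39, 100_000            # illustrative values only
x, mean, var = rng.normal(size=dim), rng.normal(size=dim), np.ones(dim)

start = time.perf_counter()
for _ in range(iterations):
    gaussian_log_prob(x, mean, var)
elapsed = time.perf_counter() - start
print(f"{iterations / elapsed:.0f} evaluations per second")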
The design can therefore be considered a success, and the prototype is ready for
development of suitable speech or audio processing software.
|