151.
Using duration information in HMM-based automatic speech recognition. January 2005 (has links)
Zhu Yu. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 100-104). / Abstracts in English and Chinese. / Chapter CHAPTER 1 --- INTRODUCTION --- p.1 / Chapter 1.1. --- Speech and its temporal structure --- p.1 / Chapter 1.2. --- Previous work on the modeling of temporal structure --- p.1 / Chapter 1.3. --- Integrating explicit duration modeling in HMM-based ASR system --- p.3 / Chapter 1.4. --- Thesis outline --- p.3 / Chapter CHAPTER 2 --- BACKGROUND --- p.5 / Chapter 2.1. --- Automatic speech recognition process --- p.5 / Chapter 2.2. --- HMM for ASR --- p.6 / Chapter 2.2.1. --- HMM for ASR --- p.6 / Chapter 2.2.2. --- HMM-based ASR system --- p.7 / Chapter 2.3. --- General approaches to explicit duration modeling --- p.12 / Chapter 2.3.1. --- Explicit duration modeling --- p.13 / Chapter 2.3.2. --- Training of duration model --- p.16 / Chapter 2.3.3. --- Incorporation of duration model in decoding --- p.18 / Chapter CHAPTER 3 --- CANTONESE CONNECTED-DIGIT RECOGNITION --- p.21 / Chapter 3.1. --- Cantonese connected digit recognition --- p.21 / Chapter 3.1.1. --- Phonetics of Cantonese and Cantonese digits --- p.21 / Chapter 3.2. --- The baseline system --- p.24 / Chapter 3.2.1. --- Speech corpus --- p.24 / Chapter 3.2.2. --- Feature extraction --- p.25 / Chapter 3.2.3. --- HMM models --- p.26 / Chapter 3.2.4. --- HMM decoding --- p.27 / Chapter 3.3. --- Baseline performance and error analysis --- p.27 / Chapter 3.3.1. --- Recognition performance --- p.27 / Chapter 3.3.2. --- Performance for different speaking rates --- p.28 / Chapter 3.3.3. --- Confusion matrix --- p.30 / Chapter CHAPTER 4 --- DURATION MODELING FOR CANTONESE DIGITS --- p.41 / Chapter 4.1. --- Duration features --- p.41 / Chapter 4.1.1. --- Absolute duration feature --- p.41 / Chapter 4.1.2. --- Relative duration feature --- p.44 / Chapter 4.2. --- Parametric distribution for duration modeling --- p.47 / Chapter 4.3.
--- Estimation of the model parameters --- p.51 / Chapter 4.4. --- Speaking-rate-dependent duration model --- p.52 / Chapter CHAPTER 5 --- USING DURATION MODELING FOR CANTONESE DIGIT RECOGNITION --- p.57 / Chapter 5.1. --- Baseline decoder --- p.57 / Chapter 5.2. --- Incorporation of state-level duration model --- p.59 / Chapter 5.3. --- Incorporation of word-level duration model --- p.62 / Chapter 5.4. --- Weighted use of duration model --- p.65 / Chapter CHAPTER 6 --- EXPERIMENTAL RESULTS AND ANALYSIS --- p.66 / Chapter 6.1. --- Experiments with speaking-rate-independent duration models --- p.66 / Chapter 6.1.1. --- Discussion --- p.68 / Chapter 6.1.2. --- Analysis of the error patterns --- p.71 / Chapter 6.1.3. --- "Reduction of deletion, substitution and insertion" --- p.72 / Chapter 6.1.4. --- Recognition performance at different speaking rates --- p.75 / Chapter 6.2. --- Experiments with speaking-rate-dependent duration models --- p.77 / Chapter 6.2.1. --- Using true speaking rate --- p.77 / Chapter 6.2.2. --- Using estimated speaking rate --- p.79 / Chapter 6.3. --- Evaluation on another speech database --- p.80 / Chapter 6.3.1. --- Experimental setup --- p.80 / Chapter 6.3.2. --- Experiment results and analysis --- p.82 / Chapter CHAPTER 7 --- CONCLUSIONS AND FUTURE WORK --- p.87 / Chapter 7.1. --- Conclusion and understanding of current work --- p.87 / Chapter 7.2. --- Future work --- p.89 / Chapter A --- APPENDIX --- p.90 / BIBLIOGRAPHY --- p.100
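The core mechanism behind explicit duration modeling in HMM decoding can be sketched in a few lines (an illustrative toy, not this thesis's actual decoder: the Gaussian duration statistics, the `weight` parameter and the function name are assumptions made for the example):

```python
import math

def duration_log_score(durations, models, weight=1.0):
    """Explicit duration score for a decoded hypothesis.

    durations: observed per-state durations, in frames
    models:    per-state (mean, std) duration statistics
    Returns a total log-Gaussian duration score that would be added
    to the acoustic/language score of the hypothesis during rescoring.
    """
    total = 0.0
    for d, (mean, std) in zip(durations, models):
        # log N(d; mean, std^2): penalizes durations far from the model mean
        total += -0.5 * math.log(2 * math.pi * std ** 2) \
                 - (d - mean) ** 2 / (2 * std ** 2)
    return weight * total

# Durations close to the model means score higher than outliers,
# which is how the duration model can veto implausible alignments.
typical = duration_log_score([10, 12], [(10, 3), (12, 4)])
outlier = duration_log_score([2, 30], [(10, 3), (12, 4)])
```

In a full decoder this score would be combined with the acoustic and language model scores, optionally with a tuned weight, as in the weighted use of the duration model described in Chapter 5.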
152.
Model-based classification of speech audio. Unknown Date (has links)
This work explores the process of model-based classification of speech audio signals using low-level feature vectors. The process of extracting low-level features from audio signals is described along with a discussion of established techniques for training and testing mixture model-based classifiers and using these models in conjunction with feature selection algorithms to select optimal feature subsets. The results of a number of classification experiments using a publicly available speech database, the Berlin Database of Emotional Speech, are presented. This includes experiments in optimizing feature extraction parameters and comparing different feature selection results from over 700 candidate feature vectors for the tasks of classifying speaker gender, identity, and emotion. In the experiments, final classification accuracies of 99.5%, 98.0% and 79% were achieved for the gender, identity and emotion tasks respectively. / by Chris Thoman. / Thesis (M.S.C.S.)--Florida Atlantic University, 2009. / Includes bibliography. / Electronic reproduction. Boca Raton, Fla., 2009. Mode of access: World Wide Web.
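The model-based classification scheme described above can be sketched as follows (a minimal illustration with synthetic data; the feature dimensions, mixture counts and class labels are invented for the example, and the author's actual feature extraction and feature selection pipeline is not reproduced):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic "classes" of low-level feature vectors (MFCC-like),
# deliberately well separated for the sake of the illustration.
class_a = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
class_b = rng.normal(loc=5.0, scale=1.0, size=(200, 4))

# One mixture model is trained per class.
gmm_a = GaussianMixture(n_components=2, random_state=0).fit(class_a)
gmm_b = GaussianMixture(n_components=2, random_state=0).fit(class_b)

def classify(x):
    """Assign x to the class whose model gives the higher log-likelihood."""
    score_a = gmm_a.score_samples(x[None, :])[0]
    score_b = gmm_b.score_samples(x[None, :])[0]
    return "A" if score_a > score_b else "B"
```

The same train-one-model-per-class, pick-the-highest-likelihood pattern extends directly to gender, identity and emotion labels.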
153.
An error detection and correction framework to improve large vocabulary continuous speech recognition. / CUHK electronic theses & dissertations collection. January 2009 (has links)
In addition to the ED-EC framework, this thesis proposes a discriminative lattice rescoring (DLR) algorithm to facilitate the investigation of the extensibility of the framework. The DLR method recasts a discriminative n-gram model as a pseudo-conventional n-gram model and then uses this recast model to perform lattice rescoring. DLR improves the efficiency of discriminative n-gram modeling and facilitates its combination with other post-processing techniques such as the ED-EC framework. / This thesis proposes an error detection and correction (ED-EC) framework for incorporating advanced linguistic knowledge sources into large vocabulary continuous speech recognition. Previous efforts to apply sophisticated language models (LMs) in speech recognition normally face a serious efficiency problem due to the intense computation these models require. The ED-EC framework aims to obtain the full benefit of complex linguistic sources while maximizing efficiency: it applies computationally expensive LMs only where they are needed in the input speech. First, the framework detects recognition errors in the output of an efficient state-of-the-art decoding procedure. Then, it corrects the detected errors with the aid of sophisticated LMs by (1) creating alternatives for each detected error and (2) applying advanced models to distinguish among the alternatives. In this thesis, we implement a prototype of the ED-EC framework on the task of Mandarin dictation. This prototype detects recognition errors based on generalized word posterior probabilities, selects alternatives for errors from recognition lattices generated during decoding, and adopts an advanced LM that combines mutual information, word trigrams and POS trigrams. The experimental results indicate the practical feasibility of the ED-EC framework, whose optimal gain from the focused LM is theoretically achievable at low computational cost.
On a general-domain test set, a 6.0% relative reduction in character error rate (CER) over the performance of a state-of-the-art baseline recognizer is obtained. In terms of efficiency, while both the detection of errors and the creation of alternatives are efficient, the application of the computationally expensive LM is concentrated on less than 50% of the utterances. We further demonstrate that the potential benefit of using the ED-EC framework in improving the recognition performance is tremendous. If error detection is perfect and alternatives for an error are guaranteed to include the correct one, the relative CER reduction over the baseline performance will increase to 36.0%. We also illustrate that the ED-EC framework is robust on unseen data and can be conveniently extended to other recognition systems. / Zhou, Zhengyu. / Adviser: Helen Mei-Ling Meng. / Source: Dissertation Abstracts International, Volume: 72-11, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2009. / Includes bibliographical references (leaves 142-155). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
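The detect-then-correct control flow of the ED-EC framework can be sketched schematically (the posterior threshold, the toy language model and all names below are stand-ins for the real GWPP-based detector, lattice-derived alternatives and combined LM):

```python
def ed_ec_decode(hypothesis, posteriors, alternatives, lm_score, threshold=0.7):
    """Error-detection-and-correction post-processing sketch.

    hypothesis:   list of recognized words from the first decoding pass
    posteriors:   per-word confidence (stand-in for generalized
                  word posterior probabilities)
    alternatives: dict word-index -> candidate replacement words
                  (stand-in for lattice-derived alternatives)
    lm_score:     expensive LM, applied only to detected errors
    """
    corrected = list(hypothesis)
    for i, (word, post) in enumerate(zip(hypothesis, posteriors)):
        if post >= threshold:      # confident word: left untouched, so the
            continue               # expensive LM is never invoked for it
        candidates = [word] + alternatives.get(i, [])
        # The sophisticated LM is concentrated on the detected errors only.
        corrected[i] = max(candidates, key=lambda w: lm_score(corrected, i, w))
    return corrected

# Toy stand-in LM that happens to prefer the word "speech" in context.
def toy_lm(context, i, w):
    return 1.0 if w == "speech" else 0.0

out = ed_ec_decode(["recognize", "peach"], [0.9, 0.3],
                   {1: ["speech", "beach"]}, toy_lm)
```

The efficiency argument of the thesis corresponds to the `continue` branch: high-confidence regions bypass the expensive model entirely.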
154.
Auditory display for internet-based E-healthcare robotic system. / CUHK electronic theses & dissertations collection. January 2006 (has links)
A psychological experiment based on a MIDI (Musical Instrument Digital Interface) sequence auditory interface was conducted initially to examine the rationale of using acoustic information in teleoperation. The experiment was designed to separately test subjects' perceptions of obstacle location and obstacle proximity. The results revealed the potential of audio stimuli in teleoperation tasks as well as several drawbacks of this interface. Because the interface translates all information into a single audio stream, it fails to exploit the spatial ability of the ear. Therefore, the information acquired from the robot's communication sensors---a pair of microphones and one camera---was instead represented by means of spatial audio in an ecological way. Firstly, a monitoring method based on the two microphones was developed to supplement the narrow view of the camera, so that a better understanding of the environment can be formed. The developed bio-mimetic algorithm, based on a new model of Aibo's head, is able to locate a sound event with 10° resolution. Afterwards, a new strategy for vision-to-audio sensory substitution was proposed, focused on spatial motion perception for mobile robot operation. After tracking a moving target in a monocular image sequence with an active contour model, the spatial positions of the target were determined using a pinhole camera model and camera calibration. Accordingly, the corresponding relations of the two modalities, e.g., spatial direction and scaled depth, were built for translation. / A scientific way of using auditory feedback as a substitute for visual feedback is proposed in the thesis to guarantee that the E-healthcare robotic system still functions under conditions of image loss, visual failure and low-bandwidth communication links. This study is an experimental exploration of a relatively new topic in real-time robotic control.
/ Conclusions and recommendations for further research on the successful and extended use of auditory display in teleoperation are also included. / Finally, an experimental e-healthcare robotic system has been developed with which high-frequency interactive contact between patients and physicians and/or family members can be realized. Specifically, a new network protocol, the Trinomial Protocol, has been implemented to facilitate data communication between client and server. Using the two protocols, TCP and the Trinomial Protocol, we have conducted experiments over a local network and the trans-Pacific Internet. The experimental results on round-trip time (RTT) and sending rate showed large spikes corresponding to severe delay jitter when TCP was used, and much less variance in RTTs when the Trinomial Protocol was used. In sum, the Trinomial Protocol achieves better performance than TCP. With this system, we also carried out psychological experiments to compare teleoperation performance under different sensory feedback conditions. The time it took to finish the task, and the distance from the target when the robot was commanded to stop, were recorded for all the experiments. In addition, subjective workload assessments based on the NASA Task Load Index were collected. For task completion time, the difference between the modalities was not large: even for vision-only feedback, the average completion time was only slightly larger than for auditory feedback, and a paired t-test found no significant difference. Results of distance perception showed that the target was perceived more accurately with bimodal audiovisual integration than under the vision-only condition, but less precisely than under the auditory-only condition. As to the workload assessments, the average workload was 9.5973 for the auditory condition and 8.6147 for the visual one, with no significant difference between them.
The experimental results demonstrate the effectiveness of our proposed auditory display approaches in navigating a robot remotely. / Liu Rong. / "September 2006." / Adviser: Max O. H. Meng. / Source: Dissertation Abstracts International, Volume: 68-03, Section: B, page: 1765. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2006. / Includes bibliographical references (p. 128-140). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
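The two-microphone sound localization idea can be illustrated with a basic cross-correlation delay estimate (a generic sketch, not the thesis's bio-mimetic Aibo head model; the noise signal and the five-sample delay are invented for the example):

```python
import numpy as np

def estimate_delay(a, b):
    """Estimate how many samples channel b lags behind channel a
    (negative means b leads) by locating the cross-correlation peak."""
    corr = np.correlate(b, a, mode="full")
    return int(np.argmax(corr)) - (len(a) - 1)

rng = np.random.default_rng(1)
signal = rng.standard_normal(400)       # broadband source signal
delay = 5                               # samples: source nearer the left mic
left = np.concatenate([signal, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), signal])

d = estimate_delay(left, right)
```

The estimated inter-microphone delay can then be mapped to a direction of arrival, e.g. via arcsin(c·d / (fs·m)) for microphone spacing m, sampling rate fs and sound speed c, which is the principle behind localizing a sound event to an angular sector.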
155.
A perceptual study on linearly approximated F0 contours in Cantonese speech. / CUHK electronic theses & dissertations collection / Digital dissertation consortium. January 2011 (has links)
Approximation of F0 contours in Cantonese speech is investigated. Multiple approximations are examined and evaluated. The modified speech utterances that carry the approximated contours at syllable, word and sentence levels are perceptually examined with reference to natural speech. It is found that linear approximation can adequately describe all perception-sensitive F0 variations in Cantonese speech. Each tone contour can be represented by one or two linear movements, and the transition between co-articulated tones can be represented by one linear movement. / F0 contours measured from human speech (observed contours) generally vary to a considerable extent. This research investigates the perception-critical variations in these highly varying contours, in particular F0 contours in Cantonese speech. Cantonese is a major Chinese dialect known for being rich in tones. Psychoacoustic findings suggest that human perception has limitations in perceiving pitch movements, which means that not all of the variations in the observed contours are perceivable. A major problem addressed in this study is to find the simplest acoustic representation of an observed F0 contour that is adequate to attain perception comparable to that of the natural speech. / F0 variation in speech is known to carry abundant information, both linguistic and paralinguistic, and its impact on speech communication has thus drawn wide attention. As a major supra-segmental acoustic feature, F0 variation has received considerable study, particularly from the perspectives of production-acoustics and perception-acoustics. However, perception-acoustic knowledge of F0 variation in association with speech naturalness is quite limited. This is especially the case in studies of tonal languages, in which most effort is devoted to acoustic cues related to tone identification.
/ The feasibility of linear approximation greatly simplifies the way to understand and interpret F0 variations in speech processing, by learning the properties of linear movements. Three steps of analysis are carried out on the generated linear approximations. The first examines the movement slopes in the approximated F0 contours of isolated syllables, in comparison with the perceptual thresholds found in psychoacoustic studies. The second analysis is performed over a set of linearly approximated F0 contours of polysyllabic Cantonese words; the determining attributes of these linear movements, i.e., movement slopes, movement heights and time locations of turning points, are analyzed statistically. The last analysis concerns the evaluation of modified F0 contours, in which objective evaluations are compared with perceptual evaluation. These analyses provide knowledge that can improve our understanding of how F0 variations are processed along the speech path. / To explore the potential of linear approximation in speech prosody research, a perception-oriented framework for automatic approximation is developed to replace the manual process used in the feasibility study. The framework aims to make the process of deriving approximations standardized, consistent and efficient. It is formulated based on the experience gained from manual approximations and also incorporates other perceptual findings. An initial test on polysyllabic words gives promising results. / Li, Yujia. / Adviser: Tan Lee. / Source: Dissertation Abstracts International, Volume: 73-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 161-170). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction.
[Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese; includes Chinese characters.
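The idea of representing a tone contour by one or two linear movements can be illustrated with a small least-squares sketch (an illustration of the concept only, not the thesis's perception-oriented approximation framework; the toy contour values are invented):

```python
import numpy as np

def best_two_piece(t, f0):
    """Fit two straight lines to an F0 contour, choosing the breakpoint
    that minimizes the total squared error."""
    def line_err(ts, ys):
        # least-squares line fit; residual is zero for a perfectly linear segment
        A = np.vstack([ts, np.ones_like(ts)]).T
        coef, res, *_ = np.linalg.lstsq(A, ys, rcond=None)
        return (float(res[0]) if res.size else 0.0), coef
    best = None
    for k in range(2, len(t) - 1):          # candidate breakpoints
        e1, c1 = line_err(t[:k], f0[:k])
        e2, c2 = line_err(t[k:], f0[k:])
        if best is None or e1 + e2 < best[0]:
            best = (e1 + e2, k, c1, c2)
    return best  # (total error, breakpoint index, [slope, intercept] x 2)

# A toy rise-fall tone contour: +5 Hz/frame then -5 Hz/frame.
t = np.arange(10.0)
f0 = np.array([100, 105, 110, 115, 120, 115, 110, 105, 100, 95], dtype=float)
err, k, c1, c2 = best_two_piece(t, f0)
```

For this contour the two recovered movements have slopes of +5 and -5 Hz per frame with essentially zero residual, mirroring the finding that one or two linear movements can represent a tone contour.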
156.
A speech recognition IC with an efficient MFCC extraction algorithm and multi-mixture models. / CUHK electronic theses & dissertations collection. January 2006 (has links)
Automatic speech recognition (ASR) by machine has received a great deal of attention in the past decades. Speech recognition algorithms based on Mel-frequency cepstrum coefficients (MFCC) and hidden Markov models (HMM) achieve better recognition performance than other speech recognition algorithms and are widely used in many applications. In this thesis a speech recognition system with an efficient MFCC extraction algorithm and multi-mixture models is presented. It is composed of two parts: an MFCC feature extractor and an HMM-based speech decoder. / For the HMM-based decoder of the speech recognition system, it is advantageous to use models with multiple mixtures, but with more mixtures the calculation becomes more complicated. Using a table look-up method proposed in this thesis, the new design can handle up to 16 states and 8 mixtures, and can easily be extended to handle models with more states and mixtures. We have implemented the new algorithm on an Altera FPGA chip using fixed-point calculation and tested the FPGA chip with speech data from the AURORA 2 database, a well-known database designed to evaluate the performance of speech recognition algorithms in noisy conditions [27]. The recognition accuracy of the new system is 91.01%, while a conventional software recognition system running on a PC using 32-bit floating-point calculation has a recognition accuracy of 94.65%. / In the conventional MFCC feature extraction algorithm, speech is separated into short overlapped frames. The existing extraction algorithm requires a lot of computation and is not suitable for hardware implementation. We have developed a hardware-efficient MFCC feature extraction algorithm in our work. The new algorithm reduces the computational power by 54% compared to the conventional algorithm, with only a 1.7% reduction in recognition accuracy. / Han Wei. / "September 2006." / Adviser: Cheong Fat Chan.
/ Source: Dissertation Abstracts International, Volume: 68-03, Section: B, page: 1823. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2006. / Includes bibliographical references (p. 108-111). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
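The conventional MFCC extraction pipeline that the thesis's hardware-efficient algorithm is measured against can be sketched as follows (a standard textbook formulation: windowed power spectrum, mel-scale triangular filter bank, log, DCT-II; the filter and coefficient counts are typical defaults, not the chip's parameters):

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters evenly spaced on the mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """Per-frame MFCC: power spectrum -> mel filter bank -> log -> DCT-II,
    keeping the first n_ceps coefficients."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    fbank = np.maximum(mel_filterbank(n_filters, n_fft, fs) @ spec, 1e-10)
    logfb = np.log(fbank)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ logfb

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)
c = mfcc_frame(frame, 16000)
```

The floating-point FFT, log and DCT stages here are exactly the operations a fixed-point hardware implementation must approximate, which is where the thesis's 54% computation saving comes from.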
157.
Identificação de locutor usando modelos de misturas de gaussianas. / Speaker identification using Gaussian mixture models. Denis Pirttiaho Cardoso, 03 April 2009 (has links)
A identificação de locutor está relacionada com a seleção de um locutor dentro de um conjunto de membros pré-definidos e neste trabalho os experimentos foram realizados utilizando um sistema de identificação de locutor independente de texto baseado em modelos de mistura de gaussianas. Para realizar os testes, foi empregado o banco de voz TIMIT e sua correspondente versão corrompida por ruído de canal telefônico, isto é, NTIMIT. O aparelho fonador pode ser representado por coeficientes mel-cepstrais obtidos por meio de banco de filtros ou, alternativamente, por coeficientes de predição linear. Adicionalmente, a técnica de subtração da média cepstral é aplicada quando o banco de voz NTIMIT é utilizado com o intuito de minimizar a distorção de canal intrínseca a ele. A componente da locução para a qual os coeficientes mel-cepstrais são calculados é obtida através de um detector de atividade de voz (DAV). No entanto, os DAVs são em geral sensíveis à relação de sinal-ruído da locução, sendo necessário adaptá-los para as condições de operação do sistema. É sugerida a integração no DAV de um estimador da relação de sinal-ruído baseado no método Minima Controlled Recursive Average (MCRA), que é necessário para permitir o tratamento de sinais tanto limpos quanto ruidosos. É observado que em locuções de elevada relação de sinal-ruído, como aquelas provenientes do banco de voz TIMIT, o método mais apropriado de extração dos coeficientes mel-cepstrais foi o padrão, isto é, baseado em banco de filtros, enquanto que para sinais de voz ruidosos a técnica de subtração da média cepstral aliada à extração dos coeficientes mel-cepstrais a partir de coeficientes de predição linear revelou os melhores resultados. / Speaker identification is concerned with the selection of one speaker within a set of enrolled members, and in this work the experiments were performed using a text-independent cohort Gaussian mixture model (GMM) speaker identification system.
In order to perform the tests, the TIMIT speech database is used along with its corresponding version corrupted by a noisy telephone channel, i.e., NTIMIT. The vocal tract is represented by Mel-frequency cepstral coefficients obtained from filter banks or, alternatively, by linear prediction cepstral coefficients. Additionally, the cepstral mean subtraction technique is applied when the NTIMIT database is used, to minimize the channel distortion intrinsic to it. The utterance segment from which the Mel-frequency cepstral coefficients are computed is obtained using a voice activity detector (VAD). However, VADs are generally sensitive to the signal-to-noise ratio of the utterance, making it necessary to adapt them to the system operating conditions. A signal-to-noise ratio estimator based on the Minima Controlled Recursive Average (MCRA) method is included in the proposed VAD in order to handle both clean and noisy speech. It is observed that for high signal-to-noise ratio utterances, such as those from the TIMIT database, the more appropriate extraction method for the Mel-frequency cepstral coefficients was the baseline one based on filter banks, while for noisy speech the technique of cepstral mean subtraction coupled with the extraction of Mel-frequency cepstral coefficients from linear prediction cepstral coefficients provided the best results.
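The effect of cepstral mean subtraction can be demonstrated in a few lines (a toy with synthetic MFCC-like features and an artificial constant channel offset; a real telephone channel is only approximately stationary):

```python
import numpy as np

def cepstral_mean_subtraction(ceps):
    """Remove the per-utterance mean of each cepstral dimension.
    A stationary convolutional channel adds a constant offset in the
    cepstral domain, so subtracting the mean cancels much of it."""
    return ceps - ceps.mean(axis=0, keepdims=True)

rng = np.random.default_rng(2)
clean = rng.standard_normal((100, 13))   # MFCC-like feature frames
channel = np.full(13, 3.0)               # constant "telephone channel" offset
noisy = clean + channel                  # channel-corrupted features

a = cepstral_mean_subtraction(clean)
b = cepstral_mean_subtraction(noisy)
```

After CMS the clean and channel-corrupted feature streams coincide exactly in this idealized case, which is why the technique helps when moving from TIMIT to NTIMIT.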
159.
A text-to-speech synthesis system for Xitsonga using hidden Markov models. Baloyi, Ntsako, January 2012 (has links)
Thesis (M.Sc. (Computer Science)) --University of Limpopo, 2013 / This research study focuses on building a general-purpose working Xitsonga speech synthesis system that is as intelligible, natural-sounding, and flexible as reasonably possible. The system has to be able to model some of the desirable speaker characteristics and speaking styles. This research project forms part of the broader national speech technology project that aims at developing spoken language systems for human-machine interaction using the eleven official languages of South Africa (SA). Speech synthesis is the reverse of automatic speech recognition (which receives speech as input and converts it to text) in that it receives text as input and produces synthesized speech as output. It is generally accepted that most people find listening to spoken utterances better than reading the equivalent of such utterances.
The Xitsonga speech synthesis system has been developed using a hidden Markov model (HMM) speech synthesis method. The HMM-based speech synthesis (HTS) system synthesizes speech that is intelligible and natural sounding, and can do so from a footprint of only a few megabytes of training speech data. The HTS toolkit is applied as a patch to the HTK toolkit, a hidden Markov model toolkit primarily designed for building and manipulating hidden Markov models in speech recognition.
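The essence of HMM-based parameter generation can be caricatured in a few lines (a gross simplification: real HTS solves for a maximum-likelihood parameter trajectory using static and dynamic features rather than repeating state means, and all values below are invented):

```python
import numpy as np

def generate_trajectory(state_means, state_durations):
    """Toy HMM-synthesis parameter generation: emit each state's mean
    spectral vector for its predicted duration. (HTS instead solves for
    a smooth ML trajectory constrained by delta features.)"""
    frames = [np.tile(mean, (dur, 1))
              for mean, dur in zip(state_means, state_durations)]
    return np.vstack(frames)

# Three hypothetical HMM states with 2-dimensional spectral means.
means = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
durs = [3, 4, 2]
traj = generate_trajectory(means, durs)   # 9 frames of parameters
```

A vocoder would then turn such a frame-by-frame parameter trajectory into a waveform; the small model footprint of HTS comes from storing only these state-level statistics rather than recorded speech units.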
160.
Incorporation of syntax and semantics to improve the performance of an automatic speech recognizer. Rapholo, Moyahabo Isaiah, January 2012 (has links)
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2012 / Automatic Speech Recognition (ASR) is a technology that allows a computer to identify spoken words and translate them into text. Speech recognition systems have started to be used in many application areas such as healthcare, automotive, e-commerce and military applications. The use of these speech recognition systems is usually limited by their poor performance.
In this research we aim to improve the performance of a baseline ASR system by incorporating syntactic structures in grammar into an existing Northern Sotho ASR system based on hidden Markov models (HMMs). The syntactic structures are applied to the vocabulary used within the healthcare application domain. The Backus-Naur Form (BNF) and the Extended Backus-Naur Form (EBNF) were used to specify the grammar. The experimental results show an overall improvement over the baseline ASR system and hence give a basis for following this approach.
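A BNF-style grammar restricting a recognizer's output can be illustrated with a toy matcher (the healthcare-domain commands and the grammar itself are hypothetical examples, not the thesis's actual Northern Sotho grammar):

```python
# Toy BNF-style grammar: uppercase symbols are non-terminals, each mapped
# to a list of alternative productions (hypothetical healthcare commands).
GRAMMAR = {
    "COMMAND": [["ACTION", "OBJECT"]],
    "ACTION":  [["check"], ["record"]],
    "OBJECT":  [["blood", "pressure"], ["temperature"]],
}

def matches(symbol, words):
    """Return True if `words` can be fully derived from `symbol`."""
    if symbol not in GRAMMAR:                 # terminal: must match one word
        return len(words) == 1 and words[0] == symbol
    def fit(syms, ws):
        # try every way of splitting `ws` among the production's symbols
        if not syms:
            return not ws
        return any(matches(syms[0], ws[:cut]) and fit(syms[1:], ws[cut:])
                   for cut in range(len(ws) + 1))
    return any(fit(production, words) for production in GRAMMAR[symbol])
```

In an ASR decoder, such a grammar constrains the search space to well-formed word sequences, which is one way syntactic knowledge can lift recognition accuracy in a narrow domain.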