451

Investigation into automatic speech recognition of different dialects of Northern Sotho

Mapeka, Madimetja Asaph January 2005 (has links)
Thesis (MSc. (Computer Science)) -- University of Limpopo, 2005 / Refer to the document / Telkom (SA), HP (SA) and National Research Fund
452

Using Blind Source Separation and a Compact Microphone Array to Improve the Error Rate of Speech Recognition

Hoffman, Jeffrey Dean 01 December 2016 (has links)
Automatic speech recognition has become a standard feature on many consumer electronics and automotive products, and the accuracy of the decoded speech has improved dramatically over time. Often, designers of these products achieve accuracy by employing microphone arrays and beamforming algorithms to reduce interference. However, beamforming microphone arrays are too large for small form factor products such as smart watches. Yet these small form factor products, which have precious little space for tactile user input (e.g. knobs, buttons and touch screens), would benefit immensely from a user interface based on reliably accurate automatic speech recognition. This thesis proposes a solution for interference mitigation that employs blind source separation with a compact array of commercially available unidirectional microphone elements. Such an array provides adequate spatial diversity to enable blind source separation and would easily fit in a smart watch or similar small form factor product. The solution is characterized using publicly available speech audio clips recorded for the purpose of testing automatic speech recognition algorithms. The proposal is modelled in different interference environments and the efficacy of the solution is evaluated. Factors affecting the performance of the solution are identified and their influence quantified. An expectation is presented for the quality of separation, as well as the resulting improvement in word error rate achieved by decoding the separated speech estimate versus the mixture obtained from a single unidirectional microphone element. Finally, directions for future work are proposed which have the potential to improve the performance of the solution, making it commercially viable.
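As an illustrative aside (not taken from the thesis): the evaluation loop described above — blindly separate the microphone mixture, decode each estimate, and compare word error rates against the single-microphone baseline — can be sketched in a few lines. FastICA is used here as a stand-in BSS algorithm and `decode` is a placeholder ASR function; both are assumptions, not the thesis's implementation.

```python
import numpy as np
from sklearn.decomposition import FastICA  # stand-in BSS algorithm (assumption)
from jiwer import wer                      # word error rate metric

def separated_wer(mic_signals, decode, reference):
    """mic_signals: (n_samples, n_mics) array of time-aligned recordings.
    decode: placeholder ASR function mapping a 1-D signal to a transcript."""
    ica = FastICA(n_components=mic_signals.shape[1])
    estimates = ica.fit_transform(mic_signals)  # (n_samples, n_sources)
    # Decode every estimated source and keep the best-scoring transcript.
    return min(wer(reference, decode(estimates[:, k]))
               for k in range(estimates.shape[1]))

# Baseline for comparison: decode a single unidirectional microphone.
# baseline = wer(reference, decode(mic_signals[:, 0]))
```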
453

Relationship Between Intelligibility and Response Accuracy of the Amazon Echo in Individuals with Amyotrophic Lateral Sclerosis Exhibiting Mild-Moderate Dysarthria

Layden, Caroline A. 27 June 2018 (has links)
There is an ever-growing number of technology options that use speech recognition software. The market currently includes smartphones, computers, and smart home personal assistants that allow hands-free access to this technology. Research studies have explored the utility of these assistive devices for completing activities of daily living; however, there is limited research on the accuracy of voice recognition software within smart home personal assistants for populations with disordered speech. In persons with amyotrophic lateral sclerosis (ALS), symptoms include changes to motor functions, speech in particular, and it is unknown how these devices respond to disordered speech. The present study examined the accuracy of the Amazon Echo in responding appropriately to commands given by dysarthric patients with ALS. Participants were asked to read a variety of commands to an Amazon Echo. The sentences and the Echo's responses were audio-recorded for transcription and intelligibility ratings, which were then analyzed for relationships between intelligibility, auditory-perceptual features of speech, and sentence type. Results revealed no significant relationship between command intelligibility and accuracy of response by the Amazon Echo, nor between any of the auditory-perceptual ratings and accuracy of response. There was, however, a significant positive association between conversational intelligibility and accuracy of responses. This study provides support for the use of hands-free assistive technology by patients with ALS to aid in maintaining quality of life and activities of daily living.
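As a hedged illustration (the study's exact statistical procedure is not reproduced here), an association between a continuous intelligibility rating and a binary response-accuracy outcome can be tested with a point-biserial correlation; the numbers below are invented.

```python
from scipy.stats import pointbiserialr

# Invented example data: percent-intelligibility rating per command and
# whether the Echo's response was judged appropriate (1) or not (0).
intelligibility = [92.0, 85.5, 78.0, 96.5, 88.0, 70.5]
echo_correct = [1, 1, 0, 1, 1, 0]

r, p = pointbiserialr(echo_correct, intelligibility)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
```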
454

Continuous speech recognition with MFCC, SSCH and PNCC features, wavelet denoising and neural networks

Siqueira, Jan Krueger 09 February 2012 (has links)
One of the biggest challenges in the field of continuous speech recognition is developing systems that are robust to additive noise. To that end, this work analyses and tests three techniques. The first extracts features from the voice signal using the MFCC, SSCH and PNCC methods. The second removes noise from the voice signal through wavelet denoising. The third is an original proposal, called feature denoising, which seeks to improve the extracted features using a set of neural networks. Although some of these techniques are already known in the literature, combining them yields several interesting new results. Notably, the best performance comes from the combination of PNCC and feature denoising.
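Two of the three techniques named in the abstract — wavelet denoising of the waveform and MFCC extraction — can be sketched as follows. The wavelet family, decomposition level, threshold rule, and input file are assumptions for illustration, not the thesis's settings.

```python
import librosa
import numpy as np
import pywt

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Wavelet denoising: soft-threshold the detail coefficients.
coeffs = pywt.wavedec(y, "db4", level=4)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745   # noise estimate (MAD rule)
thresh = sigma * np.sqrt(2 * np.log(len(y)))     # universal threshold
coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
y_denoised = pywt.waverec(coeffs, "db4")

# MFCC features from the denoised signal.
mfcc = librosa.feature.mfcc(y=y_denoised, sr=sr, n_mfcc=13)
```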
455

Forensic speaker analysis and identification by computer: a Bayesian approach anchored in the cepstral domain

Khodai-Joopari, Mehrdad, Information Technology & Electrical Engineering, Australian Defence Force Academy, UNSW January 2007 (has links)
This thesis advances understanding of the forensic value of automatic speech parameters by addressing the following question: what is the potential of the speech cepstrum as a forensic-acoustic parameter? Despite many advances in automatic speech and speaker recognition, robust and unconstrained progress in technical forensic speaker identification has been partly impeded by our incomplete understanding of the interaction between forensic phonetics and the techniques employed in state-of-the-art automatic speech and speaker recognition. The posed question underlies the recurrent and longstanding issue of acoustic parameterisation in forensic phonetics, where 1) speaker identification must often be carried out under less than optimal conditions, and 2) views differ on the usefulness and trustworthiness of formant frequency measurements. To this end, a new formulation for the forensic evaluation of speech data was derived which is effectively a spectral likelihood ratio with enhanced sensitivity to the local peaks of the formant structure of the speech spectrum of vowel sounds, while retaining the characteristics of the Bayesian framework. This new hybrid formula was used together with a novel approach, founded on a statistically-based matched-pairs technique, to account for the various levels of variation inherent in speech recordings, thereby providing a spectrally meaningful measure of the variation between two speech spectra and hence of the true worth of speech samples as forensic evidence. The experimental results are based on a forensically realistic database of a relatively large population of 297 native speakers of Japanese. In sum, the research conducted in this thesis is a major step forward in advancing the forensic-phonetic field, broadening the objective basis of forensic speaker identification. Beyond advancing knowledge in the field, the semi data-independent nature of the new formula has great implications for technical forensic speaker identification. It also provides a valuable biometric tool with both academic and commercial potential for crime investigation, in a field already suffering from a lack of adequate data.
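For orientation, the Bayesian likelihood-ratio framework the thesis builds on (its own spectral LR formula is not reproduced here) weighs the likelihood of a questioned cepstral vector under a suspect model against its likelihood under a background-population model. A minimal Gaussian sketch, with all models assumed for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_ratio(x, suspect_mean, suspect_cov, bg_mean, bg_cov):
    """x: cepstral feature vector from the questioned recording."""
    num = multivariate_normal.pdf(x, mean=suspect_mean, cov=suspect_cov)
    den = multivariate_normal.pdf(x, mean=bg_mean, cov=bg_cov)
    return num / den  # LR > 1 supports the same-speaker hypothesis
```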
456

Short-Time Phase Spectrum in Human and Automatic Speech Recognition

Alsteris, Leigh, n/a January 2006 (has links)
Incorporating information from the short-time phase spectrum into a feature set for automatic speech recognition (ASR) may serve to improve recognition accuracy. Currently, however, it is common practice to discard this information in favour of features derived purely from the short-time magnitude spectrum. There are two reasons for this: 1) the results of some well-known human listening experiments have indicated that the short-time phase spectrum conveys a negligible amount of intelligibility at the small window durations of 20-40 ms used for ASR spectral analysis, and 2) using the short-time phase spectrum directly for ASR has proven difficult from a signal processing viewpoint, due to phase-wrapping and other problems. In this thesis, we explore the possibility of using short-time phase spectrum information for ASR by considering the two points mentioned above. To address the first point, we conduct our own set of human listening experiments. Contrary to previous studies, our results indicate that the short-time phase spectrum can indeed contribute significantly to speech intelligibility over small window durations of 20-40 ms. Also, the results of these listening experiments, in addition to some ASR experiments, indicate that at least part of this intelligibility may be supplementary to that provided by the short-time magnitude spectrum. To address the second point (i.e., the signal processing difficulties), it may be necessary to transform the short-time phase spectrum into a more physically meaningful representation from which useful features could be extracted. Specifically, we investigate the frequency-derivative (or group delay function, GDF) and the time-derivative (or instantaneous frequency distribution, IFD) as potential candidates for this intermediate representation. We have performed various experiments which show that the GDF and IFD may be useful for ASR. We conduct several ASR experiments to test a feature set derived from the GDF. We find that, in most cases, these features perform worse than the standard MFCC features. Therefore, we suggest that a short-time phase spectrum feature set may ultimately be derived from a concatenation of information from both the GDF and IFD representations. For best performance, the feature set may also need to be concatenated with short-time magnitude spectrum information. Further to addressing the two aforementioned points, we also discuss a number of other speech applications in which the short-time phase spectrum has proven very useful. We believe that an appreciation of how the short-time phase spectrum has been used for other tasks, in addition to the results of our research, will encourage fellow researchers to investigate its potential for use in ASR.
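The frequency-derivative representation investigated here, the group delay function, has a standard computation: with Y the DFT of n·x[n], GDF(ω) = Re{X(ω)·conj(Y(ω))} / |X(ω)|². A minimal sketch (frame length and FFT size are arbitrary choices):

```python
import numpy as np

def group_delay(frame, n_fft=512):
    """Group delay function of one windowed speech frame."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)       # spectrum of x[n]
    Y = np.fft.rfft(n * frame, n_fft)   # spectrum of n * x[n]
    # Small constant in the denominator avoids division by zero.
    return np.real(X * np.conj(Y)) / (np.abs(X) ** 2 + 1e-12)
```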
457

Combining acoustic analysis and phonotactic analysis to improve automatic speech recognition

Nulsen, Susan, n/a January 1998 (has links)
This thesis addresses the problem of automatic speech recognition: specifically, how to transform an acoustic waveform into a string of words or phonemes. A preliminary chapter gives linguistic information potentially useful in automatic speech recognition. This is followed by a description of the Wave Analysis Laboratory (WAL), a rule-based system which detects features in speech and was designed as the acoustic front end of a speech recognition system. Temporal reasoning as used in WAL rules is examined. The use of WAL in recognizing one particular class of speech sounds, the nasal consonants, is described in detail. The remainder of the thesis looks at the statistical analysis of samples of spontaneous speech. An orthographic transcription of a large sample of spontaneous speech is automatically translated into phonemes. Tables of the frequencies of word-initial and word-final phoneme clusters are constructed to illustrate some of the phonotactic constraints of the language. Statistical data are used to assign phonemes to phonotactic classes. These classes are unlike the acoustic classes, although there is a general distinction between the vowels, the consonants and the word boundary. A way of measuring the phonetic balance of a sample of speech is described; this can be used to rank potential test samples by how well they represent the language. A phoneme n-gram model is used to measure the entropy of the language. The broad acoustic encoding output from WAL is used with this language model to reconstruct a small test sample. "Branching", a simpler alternative to perplexity, is introduced and found to give similar results to perplexity. Finally, the drop in branching is calculated as knowledge of various sets of acoustic classes is considered. In the work described in this thesis, the main contributions to automatic speech recognition and the study of speech are the development of the Wave Analysis Laboratory and the analysis of speech from a phonotactic point of view. The phoneme cluster frequencies provide new information on spoken language, as do the phonotactic classes. The measures of phonetic balance and branching provide additional tools for the development of speech recognition systems.
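The phonotactic statistics described above — phoneme-pair frequencies and an n-gram estimate of entropy — reduce to a few lines. The corpus below is a toy letter-level stand-in for a phoneme transcription, with '#' marking word boundaries.

```python
import math
from collections import Counter

def bigram_entropy(phoneme_seq):
    """Entropy (bits per phoneme) of a bigram model over phoneme_seq."""
    bigrams = Counter(zip(phoneme_seq, phoneme_seq[1:]))
    unigrams = Counter(phoneme_seq[:-1])
    total = sum(bigrams.values())
    # H = -sum over pairs of p(a,b) * log2 p(b|a)
    return -sum((c / total) * math.log2(c / unigrams[a])
                for (a, b), c in bigrams.items())

print(bigram_entropy(list("#kat##sat##mat#")))  # toy letter-level example
```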
458

Cochlear implant sound coding with across-frequency delays

Taft, Daniel Adam January 2009 (has links)
The experiments described in this thesis investigate the temporal relationship between frequency bands in a cochlear implant sound processor. Initial studies were of cochlea-based traveling wave delays for cochlear implant sound processing strategies. These were later broadened into studies of an ensemble of across-frequency delays.

Before incorporating cochlear delays into a cochlear implant processor, a set of suitable delays was determined with a psychoacoustic calibration to pitch perception, since normal cochlear delays are a function of frequency. The first experiment assessed the perception of pitch evoked by electrical stimuli from cochlear implant electrodes. Six cochlear implant users with acoustic hearing in their non-implanted ears were recruited for this, since they were able to compare electric stimuli to acoustic tones. Traveling wave delays were then computed for each subject using the frequencies matched to their electrodes. These were similar across subjects, ranging over 0-6 milliseconds along the electrode array.

The next experiment applied the calibrated delays to the ACE strategy filter outputs before maxima selection. The effects upon speech perception in noise were assessed with cochlear implant users, and a small but significant improvement was observed. A subsequent sensitivity analysis indicated that accurate calibration of the delays might not be necessary after all; instead, a range of across-frequency delays might be similarly beneficial.

A computational investigation was performed next, where a corpus of recorded speech was passed through the ACE cochlear implant sound processing strategy in order to determine how across-frequency delays altered the patterns of stimulation. A range of delay vectors were used in combination with a number of processing parameter sets and noise levels. The results showed that additional stimuli from broadband sounds (such as the glottal pulses of vowels) are selected when frequency bands are desynchronized with across-frequency delays. Background noise contains fewer dominant impulses than a single talker and so is not enhanced in this way.

In the following experiment, speech perception with an ensemble of across-frequency delays was assessed with eight cochlear implant users. Reverse cochlear delays (high-frequency delays) were equivalent to conventional cochlear delays. Benefit was diminished for larger delays. Speech recognition scores were at baseline with random delay assignments. An information transmission analysis of speech in quiet indicated that the discrimination of voiced cues was most improved with across-frequency delays. For some subjects, this was seen as improved vowel discrimination based on formant locations and improved transmission of the place of articulation of consonants.

A final study indicated that the benefits to speech perception with across-frequency delays are diminished when the number of maxima selected per frame is increased above 8-out-of-22 frequency bands.
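A simplified sketch of how across-frequency delays interact with ACE-style n-of-m maxima selection follows; channel count, frame geometry, and delay values here are illustrative assumptions, not the implant processor's actual parameters.

```python
import numpy as np

def select_maxima(envelopes, delays_frames, n_maxima=8):
    """envelopes: (n_channels, n_frames) envelope matrix (e.g., 22 bands).
    delays_frames: per-channel delay in whole frames (e.g., 0-6 ms mapped
    to frames). Each channel is shifted by its own delay before the n
    largest channels are chosen in every frame."""
    n_ch, n_fr = envelopes.shape
    delayed = np.zeros_like(envelopes)
    for ch in range(n_ch):
        d = delays_frames[ch]
        delayed[ch, d:] = envelopes[ch, :n_fr - d]  # desynchronize channel ch
    # Keep only the n largest channels per frame (zero out the rest).
    selected = np.zeros_like(delayed)
    for t in range(n_fr):
        top = np.argsort(delayed[:, t])[-n_maxima:]
        selected[top, t] = delayed[top, t]
    return selected
```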
459

Learning pronunciation variation: A data-driven approach to rule-based lexicon adaptation for automatic speech recognition

Amdal, Ingunn January 2002 (has links)
To achieve a robust system, the variation seen across different speaking styles must be handled. An investigation of standard automatic speech recognition techniques for different speaking styles showed that lexical modelling using general-purpose variants gave small improvements, but the errors differed compared with using only one canonical pronunciation per word. Modelling the variation in the acoustic models (using context dependency and/or speaker-dependent adaptation) gave a significant improvement, but the resulting performance for non-native and spontaneous speech was still far from that of read speech.

In this dissertation a complete data-driven approach to rule-based lexicon adaptation is presented, where the effect of the acoustic models is incorporated in the rule pruning metric. Reference and alternative transcriptions were aligned by dynamic programming, with a data-driven method to derive the phone-to-phone substitution costs. The costs were based on the statistical co-occurrence of phones, i.e. association strength. Rules for pronunciation variation were derived from this alignment and pruned using a new metric based on acoustic log likelihood. Well-trained acoustic models are capable of modelling much of the variation seen, and using the acoustic log likelihood to assess the pronunciation rules prevents the lexical modelling from adding variation already accounted for, as shown for direct pronunciation variation modelling.

For the non-native task, data-driven pronunciation modelling by learning pronunciation rules gave a significant performance gain. Acoustic log likelihood rule pruning performed better than rule probability pruning.

For spontaneous dictation, the pronunciation variation experiments did not improve performance. The answer to how to better model the variation in spontaneous speech seems to lie neither in the acoustic nor the lexical modelling. The main differences between read and spontaneous speech are the grammar used and disfluencies such as restarts and long pauses. The language model may thus be the best starting point for further research to achieve better performance for this speaking style.
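The alignment step described above — dynamic-programming alignment of reference and alternative phone transcriptions under phone-to-phone substitution costs — can be sketched as a weighted edit distance. The cost function below is a placeholder for the association-strength costs derived in the dissertation.

```python
def align(ref, alt, sub_cost, ins_cost=1.0, del_cost=1.0):
    """ref, alt: lists of phone symbols; sub_cost(a, b): substitution cost."""
    m, n = len(ref), len(alt)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub_cost(ref[i - 1], alt[j - 1]),
                          D[i - 1][j] + del_cost,
                          D[i][j - 1] + ins_cost)
    return D[m][n]

# Toy usage: zero cost for identical phones, unit cost otherwise.
print(align(list("kat"), list("kad"), lambda a, b: 0.0 if a == b else 1.0))
```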
460

Speech recognition availability

Eriksson, Mattias January 2004 (has links)
This project investigates the importance of availability in the scope of dictation programs. Speech recognition technology for dictation has not reached the general public, and that may well be a result of poor availability in today's technical solutions.

I have constructed a persona, Johanna, who personifies the target user. I have also developed a solution that streams audio to a speech recognition server and sends back the interpreted text. Johanna affirmed that the solution was successful in theory.

I then recruited test users who tried out the solution in practice. Half of them claim that their usage has increased, and will continue to increase, thanks to the new level of availability.
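A hedged sketch of the architecture described: client-side code streams audio chunks to a recognition server and reads back the interpreted text. The endpoint URL is hypothetical, and the thesis's actual transport is not specified here; a generator body makes `requests` send a chunked-transfer upload.

```python
import requests

def stream_to_recognizer(wav_path, url="http://example.com/recognize"):
    """Stream a WAV file to a (hypothetical) recognition endpoint."""
    def chunks(path, size=4096):
        with open(path, "rb") as f:
            while block := f.read(size):
                yield block  # each block is sent as one transfer chunk
    resp = requests.post(url, data=chunks(wav_path),
                         headers={"Content-Type": "audio/wav"})
    return resp.text  # interpreted text returned by the server

# print(stream_to_recognizer("dictation.wav"))
```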
