ABSTRACT OF THE DISSERTATION OF Yonglian Wang, for the Doctor of Philosophy degree in Electrical and Computer Engineering, presented on May 19, 2009, at Southern Illinois University Carbondale.
TITLE: SPEECH RECOGNITION UNDER STRESS
MAJOR PROFESSOR: Dr. Nazeih M. Botros
In this dissertation, three techniques, Dynamic Time Warping (DTW), Hidden Markov Models (HMM), and the Hidden Control Neural Network (HCNN), are used to realize talker-independent isolated word recognition. DTW measures the distance between two input patterns or vectors; HMM models speech signals as a five-state stochastic process so that the similarity between signals can be compared; and HCNN calculates the error between the actual output and the target output and is built mainly for stress-compensated speech recognition. When stress (Angry, Question, and Soft) is induced in normal speech, speech recognition performance degrades greatly. Therefore, a hypothesis-driven stress compensation technique is introduced to cancel the distortion caused by stress. The database for this research is SUSAS (Speech under Simulated and Actual Stress), which includes five domains encompassing a wide variety of stress, with 16,000 isolated-word speech samples from 44 speakers. Another database, TIMIT (10 speakers and 6300 sentences in total), is used as a secondary database for the DTW algorithm. The words used for speech recognition are speaker-independent. The characteristic feature analysis has been carried out in three domains: pitch, intensity, and glottal spectrum. The results showed that speech spoken under Angry and Question stress exhibits extremely wide fluctuations, with higher average pitch, higher RMS intensity, and more energy than neutral speech. In contrast, the Soft talking style has lower pitch, lower RMS intensity, and less energy than neutral speech.
Linear Predictive Coding (LPC) cepstral feature analysis is used to obtain the observation vector and the input vector for DTW, HMM, and stress compensation. Both HMM and HCNN consist of a training stage and a recognition stage. The training stage forms the reference models, while the recognition stage compares an unknown word against all the reference models. The unknown word is recognized as the word whose model yields the highest similarity. Our results showed that the HMM technique achieves a 91% recognition rate for Normal speech; however, the recognition rate dropped to 60% for the Angry stress condition, 65% for the Question stress condition, and 76% for the Soft stress condition. After compensation was applied for the cepstral tilts, the recognition rate increased by 10% for the Angry stress condition, 8% for the Question stress condition, and 4% for the Soft stress condition. Finally, the HCNN technique increased the recognition rate to 90% for the Angry stress condition, and it also differentiated Angry stress from the other stress groups.
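As an illustration of the DTW-based comparison described in the abstract, the following is a minimal sketch, not the dissertation's implementation. It assumes each word is a sequence of feature vectors (for example, LPC cepstral coefficients per frame), uses Euclidean distance as the local frame cost, and applies the basic symmetric step pattern; the function names `dtw_distance` and `recognize` and the toy one-dimensional templates are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors.

    Assumption: the local cost is the Euclidean distance between frames,
    and the allowed steps are (i-1, j), (i, j-1), and (i-1, j-1).
    """
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against b[j-1]
                cost[i][j - 1],      # frame of b repeated against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def recognize(unknown, templates):
    """Return the label of the reference template closest to `unknown`."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))


# Hypothetical one-dimensional "feature vectors" for two reference words.
templates = {
    "go": [(0.0,), (1.0,), (2.0,)],
    "stop": [(5.0,), (5.0,), (4.0,)],
}
# A time-stretched version of "go": the first frame is repeated.
unknown = [(0.0,), (0.0,), (1.0,), (2.0,)]
best = recognize(unknown, templates)
```

Here the smallest distance plays the role of the highest similarity: the unknown word is assigned to the reference it aligns with at the lowest accumulated cost, regardless of differences in speaking rate.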
Identifier | oai:union.ndltd.org:siu.edu/oai:opensiuc.lib.siu.edu:dissertations-1076 |
Date | 01 December 2009 |
Creators | Wang, Yonglian |
Publisher | OpenSIUC |
Source Sets | Southern Illinois University Carbondale |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Dissertations |