
Consonant recognition by humans and machines

Thesis (Ph.D.)--Harvard University--Massachusetts Institute of Technology Division of Health Sciences and Technology, 1998. / Includes bibliographical references (p. 113-117). / The goal of this research is to determine how aspects of human speech processing can be utilized to improve the performance of Automatic Speech Recognition (ASR) systems. Three traditional ASR parameterizations matched with Hidden Markov Models (HMMs) are compared to humans on a consonant recognition task using Consonant-Vowel-Consonant (CVC) nonsense syllables degraded by highpass filtering, lowpass filtering, or additive noise. Confusion matrices were determined by recognizing the syllables using different ASR front ends, including Mel-Filter Bank (MFB) energies, Mel-Frequency Cepstral Coefficients (MFCCs), and the Ensemble Interval Histogram (EIH). For syllables degraded by lowpass and highpass filtering, automated systems trained on the degraded condition recognized the consonants roughly as well as humans. Moreover, all the ASR systems produced similar patterns of recognition errors for a given filtering condition. These patterns differed significantly from those characteristic of humans under the same filtering conditions. For syllables degraded by additive speech-shaped noise, none of the automated systems recognized consonants as well as humans. As with the filtered conditions, confusion matrices revealed similar error patterns for all the ASR systems. While the error patterns of humans and machines were more similar for the noise conditions than for the filtered conditions, the similarities were not as great as those among the ASR systems. The greatest difference between human and machine performance was in determining the correct voiced/unvoiced classification of consonants. Given these results, work was focused on recognition of the correct voicing classification in additive noise (0 dB SNR). The approach taken attempted to automatically extract attributes of the speech signal, termed subphonetic features, which are useful in determining the distinctive feature voicing. Two subphonetic features, intervocal period (the length of time between the onset of the vowel and any preceding vocalization) and delta fundamental (the average first difference of fundamental frequency over the first 90 msec of the vowel), proved particularly useful. When these two features were appended to traditional ASR parameters, the deficit exhibited by automated systems was reduced substantially, though not eliminated. / by Jason Sroka. / Ph.D.
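The two subphonetic features named in the abstract are concrete enough to sketch in code. The following is a minimal Python illustration, assuming a frame-based F0 track at a 10 msec step with zeros marking unvoiced frames; the function names, frame step, and unvoiced-frame convention are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

FRAME_STEP_MS = 10.0  # assumed F0 analysis frame step; not specified in the abstract


def intervocal_period(voicing, vowel_onset_frame):
    """Time (ms) between the end of any preceding vocalization and vowel onset.

    `voicing` is a boolean array with one entry per analysis frame, True
    where the frame is voiced.  If no voicing precedes the vowel, the gap
    runs back to the start of the utterance.
    """
    earlier = np.flatnonzero(voicing[:vowel_onset_frame])
    last_voiced = earlier[-1] if earlier.size else -1
    return (vowel_onset_frame - 1 - last_voiced) * FRAME_STEP_MS


def delta_fundamental(f0_track, vowel_onset_frame, window_ms=90.0):
    """Average first difference of F0 over the first 90 msec of the vowel.

    `f0_track` holds one F0 estimate (Hz) per frame, with 0 marking
    unvoiced frames -- a convention assumed here, not taken from the thesis.
    """
    n_frames = int(window_ms / FRAME_STEP_MS)
    window = f0_track[vowel_onset_frame:vowel_onset_frame + n_frames]
    voiced = window[window > 0]
    if voiced.size < 2:
        return 0.0
    return float(np.mean(np.diff(voiced)))


if __name__ == "__main__":
    # Toy F0 track: a voiced segment, an unvoiced consonant gap, then the vowel.
    f0 = np.array([110.0, 0, 0, 0, 0, 0, 0, 0,
                   130, 131, 133, 132, 134, 135, 136, 137, 138])
    onset = 8  # first frame of the vowel
    print(intervocal_period(f0 > 0, onset))  # 70.0 ms of unvoiced gap
    print(delta_fundamental(f0, onset))      # +1.0 Hz per frame on average
```

Per the abstract, these two scalars would simply be appended to the traditional per-utterance ASR parameter vectors before HMM training and recognition.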

Identifieroai:union.ndltd.org:MIT/oai:dspace.mit.edu:1721.1/9312
Date January 1998
CreatorsSroka, Jason (Jason Jonathan), 1970-
ContributorsLouis D. Braida., Harvard University--MIT Division of Health Sciences and Technology.
PublisherMassachusetts Institute of Technology
Source SetsM.I.T. Theses and Dissertations
LanguageEnglish
Detected LanguageEnglish
TypeThesis
Format117 p., 8796504 bytes, 8796264 bytes, application/pdf
RightsM.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission., http://dspace.mit.edu/handle/1721.1/7582
