Return to search

DESIGN OF A KEYWORD SPOTTING SYSTEM USING MODIFIED CROSS-CORRELATION IN THE TIME AND THE MFCC DOMAIN

Abstract A Keyword Spotting System (KWS) is a system that recognizes predefined keywords in spoken utterances or written documents. The objective is to obtain the highest possible keyword detection rate without increasing the number of false detections in a system. The common approach to keyword spotting is the use of a Hidden Markov Model (HMM). These are usually complex systems which require training speech data. The Typical HMM approach uses garbage templates or HMM models to match non-keyword speech and non-speech sounds. The purpose of this research is to design a simple Keyword Spotting System. The system will be designed to spot English words and should be easily adaptable to other languages There are many challenges in designing a keyword spotting system such as variations in speech like pitch, loudness, timbre that make recognition difficult. There can be wide variations in utterances even from the same speaker. In this research, the use of cross-correlation, as an alternative means for detecting keywords in an utterance, was investigated. This research also involves the modeling of a global keyword using a quantized dynamic time warping algorithm, which can function effectively with multi-speakers. The global keyword is an aggregation of the features from several occurrences of the same keyword. This research also investigates the effect of pitch normalization on keyword detection. The use of cross-correlation as a method for keyword spotting was investigated in both the time and MFCC domain. In the time domain the global keyword was cross-correlated with a pitch-normalized utterance. A zero lag ratio (the ratio of the power around the zero lag obtained from a cross correlation to the power in the rest of the signal is computed) was computed for each speech frame, a threshold was then used to determine if the keyword is present. For the MFCC domain the MFCC features of each keyword were computed, normalized and cross-correlated with the normalized MFCC features of portions of the utterance of the same size as the keyword. Cross-correlation of MFCC features of the keyword with that of each portion of the utterance yields a single value between 0-1. The portion with the highest value is usually the location of the keyword. Results in the time domain varied from keyword to keyword, some words showed a 60% hit rate while the average obtained from various keywords from the Call Home database had an average of 41%. Cross-correlation of the keywords and utterance in the MFCC domain yielded a 66% hit rate in test conducted on all different keywords in the Call Home and Switchboard corpus. The system accuracy is keyword dependent with some keywords having an 85% hit rate / Electrical and Computer Engineering

Identiferoai:union.ndltd.org:TEMPLE/oai:scholarshare.temple.edu:20.500.12613/694
Date January 2012
CreatorsAnifowose, Olakunle
ContributorsYantorno, Robert E., Picone, Joseph, Silage, Dennis
PublisherTemple University. Libraries
Source SetsTemple University
LanguageEnglish
Detected LanguageEnglish
TypeThesis/Dissertation, Text
Format70 pages
RightsIN COPYRIGHT- This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available., http://rightsstatements.org/vocab/InC/1.0/
Relationhttp://dx.doi.org/10.34944/dspace/676, Theses and Dissertations

Page generated in 0.002 seconds