Unsupervised clustering of audio data for acoustic modelling in automatic speech recognition systems

Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2011.

ENGLISH ABSTRACT: This thesis presents a system designed to replace the manual process of
generating a pronunciation dictionary for use in automatic speech recognition.
The proposed system has several stages.
The first stage segments the audio into what will be known as subword
units, using a frequency-domain method. In the second stage, dynamic
time warping is used to determine the similarity between the two segments in each
possible pair of acoustic segments. These similarities are used to cluster
similar acoustic segments into acoustic clusters. The final stage derives a
pronunciation dictionary from the orthography of the training data and the corresponding
sequence of acoustic clusters. This process begins with an initial
mapping between words and their sequence of clusters, established by Viterbi
alignment with the orthographic transcription. The dictionary is refined iteratively
by pruning redundant mappings, hidden Markov model estimation and
Viterbi re-alignment in each iteration.
This approach is evaluated experimentally by applying it to two subsets of
the TIMIT corpus. It is found that, when test words are repeated often in the
training material, the approach leads to a system whose accuracy is almost as
good as one trained using the phonetic transcriptions. When test words are
not repeated often in the training set, the proposed approach leads to better
results than those achieved using the phonetic transcriptions, although the
recognition is poor overall in this case.

AFRIKAANS ABSTRACT (English translation): The aim of this thesis is to describe a system designed to replace the
manual process of compiling a dictionary for use in automatic speech
recognition systems. The proposed system consists of a number of steps.
The first step segments the audio into so-called subword units using a
frequency-domain technique. In the second step, the dynamic time warping
algorithm is used to determine the similarity between the two segments in
each possible pair of acoustic segments. The similarities are then used to
group the acoustic segments into acoustic clusters. The last step compiles
the dictionary from the orthographic transcription of the training data
and the corresponding sequence of acoustic clusters. This final step begins
with an initial mapping from words to their sequence of cluster identifiers,
established by Viterbi alignment and the orthographic transcription. The
dictionary is refined iteratively by pruning redundant mappings, training
hidden Markov models and applying Viterbi alignment in each iteration.
The approach was tested by evaluating it experimentally on two subsets of
data from the TIMIT corpus. It was found that, when words are repeated in
the training data, the accuracy of the approach matches that of a system
trained with the phonetic transcription. If the words are not repeated in
the training data, the accuracy of the approach is better than when the
system is trained with the phonetic transcription, although the accuracy
is poor in general.
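
The second stage described in the abstract (pairwise dynamic time warping followed by clustering) can be illustrated with a minimal sketch. The abstract does not state the feature type, the local distance, the normalisation or the clustering algorithm, so everything below is an assumption made for illustration only: segments are lists of feature-vector tuples, the local distance is Euclidean, the accumulated cost is divided by the combined segment length, and clustering is a simple single-linkage grouping under a hypothetical `threshold`. The function names `dtw_distance`, `pairwise_dtw` and `cluster_segments` are likewise hypothetical and are not taken from the thesis.

```python
import math


def dtw_distance(a, b):
    """Length-normalised DTW cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance (assumed)
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a aligned to an already used frame of b
                cost[i][j - 1],      # frame of b aligned to an already used frame of a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m] / (n + m)  # normalisation by total length (assumed)


def pairwise_dtw(segments):
    """Symmetric matrix of DTW distances for every pair of segments."""
    k = len(segments)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(segments[i], segments[j])
    return dist


def cluster_segments(dist, threshold):
    """Single-linkage grouping (assumed): link segments closer than threshold."""
    parent = list(range(len(dist)))

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(dist)):
        for j in range(i + 1, len(dist)):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)

    # Relabel cluster roots as consecutive integers in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(dist))]


# Toy one-dimensional "segments"; real input would be acoustic feature frames.
segments = [
    [(0.0,), (1.0,), (2.0,)],
    [(0.0,), (1.0,), (1.0,), (2.0,)],  # time-stretched copy of the first segment
    [(5.0,), (6.0,)],
]
labels = cluster_segments(pairwise_dtw(segments), threshold=0.5)
```

In this toy input the first two segments differ only in duration, so the warping path aligns them at zero cost and they share a cluster label, while the third segment is placed in a cluster of its own. Computing every pair is quadratic in the number of segments, so a practical system would likely need pruning or a cheaper approximation.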

Identifier: oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:sun/oai:scholar.sun.ac.za:10019.1/6686
Date: 03 1900
Creators: Goussard, George Willem
Contributors: Niesler, T. R., University of Stellenbosch. Faculty of Engineering. Dept. of Electrical and Electronic Engineering.
Publisher: Stellenbosch : University of Stellenbosch
Source Sets: South African National ETD Portal
Language: en_ZA
Detected Language: Unknown
Type: Thesis
Format: 71 p. : ill.
Rights: University of Stellenbosch
