In the world of human language technology, resource–scarce languages (RSLs) suffer from the problem
of little available electronic data and linguistic expertise. The Lwazi project in South Africa
is a large–scale endeavour to collect and apply such resources for all eleven of the official South
African languages. One of the deliverables of the project is more natural text–to–speech (TTS)
voices. Naturalness is primarily determined by prosody and it is shown that many aspects of
prosodic modelling are, in turn, dependent on part–of–speech (POS) information. Solving the POS
problem is, therefore, a prudent first step towards meeting the goal of natural TTS voices.
In a resource–scarce environment, obtaining and applying the POS information are not trivial.
Firstly, an automatic tagger is required to tag the text to be synthesised with POS categories, but
state–of–the–art POS taggers are data–driven and thus require large amounts of labelled training
data. Secondly, the subsequent processes in TTS that are used to apply the POS information
towards prosodic modelling are resource–intensive themselves: some require non–trivial linguistic
knowledge; others require labelled data as well.
The first problem raises the question of which available POS tagging algorithm is the most
accurate on little training data. This research sets out to answer the question by reviewing the
most popular supervised data–driven algorithms. Since the literature to date consists mostly of isolated
papers discussing one algorithm, the aim of the review is to consolidate the research into a single
point of reference. A subsequent experimental investigation compares the tagging algorithms on
small training data sets of English and Afrikaans, and it is shown that the hidden Markov model
(HMM) tagger outperforms the rest when using both a comprehensive and a reduced POS tagset.
Regarding the second problem, the question arises whether it is perhaps possible to circumvent
the traditional approaches to prosodic modelling by learning the latter directly from the speech
data using POS information. In other words, does the addition of POS features to the HTS context
labels improve the naturalness of a TTS voice? Towards answering this question, HTS voices are
trained from English and Afrikaans prosodically rich speech. The voices are compared with and
without POS features incorporated into the HTS context labels, analytically and perceptually. For
the analytical experiments, measures of prosody to quantify the comparisons are explored. It is
then also noted whether the results of the perceptual experiments correlate with their analytical
counterparts. It is found that, when a minimal feature set is used for the HTS context labels, the
addition of POS tags does improve the naturalness of the voice. However, the same effect can be
accomplished by including segmental counting and positional information instead of the POS tags.

Thesis (M.Sc. Engineering Sciences (Electrical and Electronic Engineering))--North-West University, Potchefstroom Campus, 2011.
Identifier | oai:union.ndltd.org:NWUBOLOKA1/oai:dspace.nwu.ac.za:10394/4944 |
Date | January 2010 |
Creators | Schlünz, Georg Isaac |
Publisher | North-West University |
Source Sets | North-West University |
Detected Language | English |
Type | Thesis |