Building transcribed speech corpora for under-resourced languages plays a pivotal role in developing
automatic speech recognition (ASR) technologies for such languages. A key step in developing
these technologies is therefore the effective collection of ASR data, consisting of transcribed audio and associated
metadata.
The problem is that no suitable tool currently exists for effectively collecting ASR data for such
languages. The specific context and requirements for effectively collecting ASR data for under-resourced
languages render all currently known solutions unsuitable for such a task. Such requirements
include portability, Internet independence and an open-source code base.
This work documents the development of such a tool, called Woefzela, from the determination
of the requirements necessary for effective data collection in this context, to the verification and
validation of its functionality. The study demonstrates the effectiveness of using smartphones without
any Internet connectivity for ASR data collection for under-resourced languages. It introduces a semi-real-time
quality control philosophy which increases the amount of usable ASR data collected from
speakers.
Woefzela was developed for the Android operating system and is freely available for use on
Android smartphones; its source code has also been made available. More than 790 hours
of ASR data for the eleven official languages of South Africa have been successfully collected with
Woefzela.
As part of this study a benchmark for the performance of a new National Centre for Human
Language Technology (NCHLT) English corpus was established.
Thesis (M.Ing. (Electrical Engineering))--North-West University, Potchefstroom Campus, 2012.
Identifier | oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:nwu/oai:dspace.nwu.ac.za:10394/7354 |
Date | January 2011 |
Creators | De Vries, Nicolaas Johannes |
Publisher | North-West University |
Source Sets | South African National ETD Portal |
Detected Language | English |
Type | Thesis |