Global ETD Search

Return to search

Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald

A lemmatiser is an important component of various human language technology applicalions
for any language. At present, a rule-based le~nmatiserf or Afrikaans already exists, but this
lermrlatiser produces disappoinringly low accuracy figures. The performimce of the current
lemmatiser serves as motivation for developing another lemmatiser based on an alternative
approach than language-specific rules. The alternalive method of lemmatiser corlstruction
investigated in this study is memory-based learning.
Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Liu
"Le~?rnru-idc~)~rifisv~ir'e Arfdr(i~ku~u-n s" 'hmmatiser for Afrikaans'. In order to construct Liu,
thc following research objectives are sel: i) to define the classes for Afrikaans lemmatisation,
ii) to determine the influence of data size and various feature options on the performance of
I h , iii) to uutomalically determine the algorithm and parameters settings that deliver the best
performancc in Lcrms of linguistic accuracy, execution time and memory usage.
In order to achieve the first objective, we investigate the processes of inflecrion and
derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between
inflection and derivation. We proceed to define the inflectional calegories for Afrikaans,
which represent a number of affixes that should be removed from word-forms during
lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these
affixes. It is subsequently shown that accuracy as well as memory usagc and execution lime
increase as the amount of training dala is increased and that Ihe various feature options bave a
significant effect on the performance of Lia. The algorithmic parameters and data
representation that deliver the best results are determincd by the use of I'Senrck, a
programme that implements Wrapped Progre~sive Sampling in order determine a set of
possibly optimal algorithmic parameters for each of the TiMBL classification algorithms.
Aulornaric Lcmlnalisa~ionf or Afrikaans
- -
Evaluation indicates that an accuracy figure of 92,896 is obtained when training Lia with the
best performing parameters for the IB1 algorithm on feature-aligned data with 20 features.
This result indicates that memory-based learning is indeed more suitable than rule-based
methods for Afrikaans lenlmatiser construction. / Thesis (M.Ing. (Computer and Electronical Engineering))--North-West University, Potchefstroom Campus, 2007.

http://hdl.handle.net/10394/131

Lemmatisation

Machine learning

Memory-based learning

Human language technology

Natural language processing

Identifer	oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:nwu/oai:dspace.nwu.ac.za:10394/131
Date	January 2006
Creators	Groenewald, Hendrik Johannes
Publisher	North-West University
Source Sets	South African National ETD Portal
Detected Language	English
Type	Thesis

Page generated in 0.0058 seconds

Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald

Description

Links & Downloads

Tags

Additional Fields