Global ETD Search

Outomatiese Setswana lemma-identifisering [Automatic Setswana lemma identification] / Jeanetta Hendrina Brits

Within the context of natural language processing, a lemmatiser is one of the
most important core technology modules that have to be developed for a particular
language. A lemmatiser reduces the words in a corpus to their corresponding
lemmas in the lexicon.
A lemma is defined as the meaningful base form from which other more complex
forms (i.e. variants) are derived. Before a lemmatiser can be developed for a
specific language, the concept "lemma" as it applies to that specific language
should first be defined clearly. This study concludes that, in Setswana, only
stems (and not roots) can act independently as words; therefore, only stems
should be accepted as lemmas in the context of automatic lemmatisation for
Setswana.
Five of the seven parts of speech in Setswana could be viewed as closed
classes, which means that these classes are not extended by means of regular
morphological processes. The two other parts of speech (nouns and verbs) require
the implementation of alternation rules to determine the lemma. Such alternation
rules were formalised in this study for the purpose of developing a
Setswana lemmatiser. The existing Setswana grammars were used as the basis for
these rules. This made it possible to determine how precisely a formalisation
of these existing grammars lemmatises Setswana words.
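
As a rough illustration of what formalised alternation rules can look like, the sketch below handles closed-class words by lookup and applies a few ordered suffix rewrites to verb forms. The word list, the two rules and the function name `lemmatise` are illustrative assumptions, not the rules formalised in the study; the forms of reka ("buy") are used only to show the mechanism.

```python
import re

# Hypothetical closed-class lexicon: these words are listed, not derived by rule.
CLOSED_CLASS = {"le": "le", "mme": "mme", "kwa": "kwa"}

# Hypothetical, heavily simplified verb alternation rules (pattern, replacement).
# Order matters: the first matching rule wins.
VERB_RULES = [
    (re.compile(r"ile$"), "a"),  # perfect suffix: rekile -> reka
    (re.compile(r"wa$"), "a"),   # passive suffix: rekwa  -> reka
]


def lemmatise(word):
    """Return a lemma for `word` using lookup first, then rewrite rules."""
    word = word.lower()
    if word in CLOSED_CLASS:
        return CLOSED_CLASS[word]
    for pattern, replacement in VERB_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word  # no rule applies: treat the word as its own lemma


for form in ["rekile", "rekwa", "reka", "mme"]:
    print(form, "->", lemmatise(form))
```

A rule as blunt as `wa$` will also fire on words that merely end in -wa without being passives, which is the kind of over-application that a precision figure makes visible.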
The software developed by Van Noord (2002), FSA 6, is one of the best-known
applications available for the development of finite state automata and transducers.
Regular expressions based on the formalised morphological rules were
used in FSA 6 to create finite state transducers. The code subsequently generated
by FSA 6 was implemented in the lemmatiser.
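
To make the idea of a finite state transducer concrete, here is a minimal hand-written simulation in Python. It does not use FSA 6 and does not reproduce the transducers built in the study; the states, the transition table and the single rule it encodes (rewrite a final -ile as -a) are assumptions chosen for illustration.

```python
ANY = None  # wildcard label: matches any input symbol and copies it to the output

# Transitions: state -> list of (input symbol, output string, next state).
# State 0 copies symbols; the path 0 -> 1 -> 2 -> 3 rewrites "ile" as "a".
TRANSITIONS = {
    0: [(ANY, None, 0), ("i", "a", 1)],
    1: [("l", "", 2)],
    2: [("e", "", 3)],
}
FINAL_STATES = {3}


def transduce(word):
    """Return every output the transducer can produce for `word`."""
    # Each configuration is (current state, output produced so far).
    configs = {(0, "")}
    for symbol in word:
        next_configs = set()
        for state, output in configs:
            for label, out, target in TRANSITIONS.get(state, []):
                if label is ANY:
                    next_configs.add((target, output + symbol))
                elif label == symbol:
                    next_configs.add((target, output + out))
        configs = next_configs
    # Only paths that end in a final state count as accepted.
    return {output for state, output in configs if state in FINAL_STATES}


print(transduce("rekile"))  # {'reka'}
print(transduce("reka"))    # set()
```

The transducer is nondeterministic: at every "i" it both copies the symbol and tries the rewrite path, and only the path that consumes the whole word in a final state survives. A word without the suffix yields no output, so a full lemmatiser would need a fallback for that case.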
The metric used to evaluate the lemmatiser is precision. On a test
corpus of 1 000 words, the lemmatiser obtained a precision of 70,92%. In another
evaluation on 500 complex nouns and 500 complex verbs separately, the lemmatiser
obtained 70,96% and 70,52% respectively. The precision on
500 complex and simplex nouns was 78,45%, and on complex and simplex verbs
79,59%. These quantitative results give only an indication of the relative
precision of the grammars. Nevertheless, they provided analysed data with which
the grammars were evaluated qualitatively. The study concludes with an overview
of how these results might be improved in the future.

Thesis (M.A. (African Languages))--North-West University, Potchefstroom Campus, 2006.
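
The precision figures reported in the abstract can be read against a simple computation like the one below. It assumes that precision means the share of lemmatiser outputs that match a manually assigned reference lemma; the study may define or count it differently. The function name `precision` and the four-word sample are made up for illustration and are not data from the thesis.

```python
def precision(predicted, gold):
    """Fraction of predicted lemmas that equal the reference lemmas."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up sample: the third output is wrong, so 3 of 4 lemmas are correct.
predicted = ["reka", "reka", "rekil", "motho"]
gold = ["reka", "reka", "reka", "motho"]

print(f"{precision(predicted, gold):.2%}")  # 75.00%
```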

http://hdl.handle.net/10394/1160

Computational linguistics

Natural language processing

Regular expression

Finite state automata

Finite state transducer

FSA 6

Identifier	oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:nwu/oai:dspace.nwu.ac.za:10394/1160
Date	January 2006
Creators	Brits, Jeanetta Hendrina
Publisher	North-West University
Source Sets	South African National ETD Portal
Detected Language	English
Type	Thesis
