1 |
Holistic Scoring of ESL Essays Using Linguistic Maturity Attributes / Millett, Ronald, 21 July 2006 (has links) (PDF)
Automated scoring of essays has been a research topic for some time in computational linguistics studies. Only recently have the particular challenges of automatic holistic scoring of ESL essays, with their high rates of grammatical, spelling and other errors, become a topic of research. This thesis evaluates the effectiveness of using statistical measures of linguistic maturity to predict holistic scores for ESL essays using several techniques. Selected linguistic attributes include parts of speech, part-of-speech patterns, vocabulary density, and sentence and essay lengths. Using customized algorithms based on multivariable regression analysis as well as memory-based machine learning, holistic scores for test essays were predicted within ±1.0 of the human judges' scores an average of 90% of the time. This level of prediction is an improvement over the 66% prediction level attained in a previous study using customized algorithms.
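The multivariable regression approach described in this abstract can be sketched roughly as follows. The feature names and the data are purely illustrative assumptions, not the thesis's actual attributes or corpus:

```python
import numpy as np

# Hypothetical sketch: predict holistic essay scores from linguistic
# maturity features (here: vocabulary density, mean sentence length,
# noun ratio) via multivariable linear regression.
# rows: essays; columns: [vocab_density, mean_sent_len, noun_ratio]
X = np.array([
    [0.42, 11.0, 0.21],
    [0.55, 14.5, 0.25],
    [0.61, 17.2, 0.27],
    [0.48, 12.8, 0.22],
    [0.66, 19.0, 0.30],
])
y = np.array([2.0, 3.0, 4.0, 2.5, 4.5])  # human holistic scores

# add an intercept column and solve ordinary least squares
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(features):
    """Predict a holistic score from a feature vector."""
    return float(coef[0] + np.dot(coef[1:], features))
```

In this setup a prediction would count as "successful" in the thesis's sense if it falls within ±1.0 of the human judge's score.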
|
2 |
Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald / Groenewald, Hendrik Johannes, January 2006 (has links)
A lemmatiser is an important component of various human language technology applications
for any language. At present, a rule-based lemmatiser for Afrikaans already exists, but this
lemmatiser produces disappointingly low accuracy figures. The performance of the current
lemmatiser serves as motivation for developing another lemmatiser based on an approach
other than language-specific rules. The alternative method of lemmatiser construction
investigated in this study is memory-based learning.
Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia,
"Lemma-identifiseerder vir Afrikaans" 'Lemmatiser for Afrikaans'. In order to construct Lia,
the following research objectives are set: i) to define the classes for Afrikaans lemmatisation,
ii) to determine the influence of data size and various feature options on the performance of
Lia, and iii) to automatically determine the algorithm and parameter settings that deliver the best
performance in terms of linguistic accuracy, execution time and memory usage.
In order to achieve the first objective, we investigate the processes of inflection and
derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between
inflection and derivation. We proceed to define the inflectional categories for Afrikaans,
which represent a number of affixes that should be removed from word-forms during
lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these
affixes. It is subsequently shown that accuracy as well as memory usage and execution time
increase as the amount of training data is increased, and that the various feature options have a
significant effect on the performance of Lia. The algorithmic parameters and data
representation that deliver the best results are determined by the use of Paramsearch, a
programme that implements Wrapped Progressive Sampling in order to determine a set of
possibly optimal algorithmic parameters for each of the TiMBL classification algorithms.
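The memory-based idea behind a classifier of this kind can be sketched in miniature. This is a hypothetical nearest-neighbour toy, not Lia itself: each word is encoded by its final characters (TiMBL-style right-aligned features), and the class names the suffix to strip. The words, feature scheme and classes are illustrative assumptions:

```python
# A minimal memory-based lemmatisation sketch (assumed, not Lia's design).

def features(word, n=4):
    # right-align the last n characters, padding short words with '_'
    return tuple(word[-n:].rjust(n, "_"))

# training memory: (feature tuple, class) pairs; class "-e" means
# "remove final 'e'", "0" means "the word is already the lemma"
MEMORY = [
    (features("boeke"), "-e"),   # boeke -> boek ('books' -> 'book')
    (features("tafels"), "-s"),  # tafels -> tafel ('tables' -> 'table')
    (features("loop"), "0"),     # loop is its own lemma ('walk')
]

def overlap(a, b):
    # simple feature-overlap similarity, as in IB1 with equal weights
    return sum(x == y for x, y in zip(a, b))

def lemmatise(word):
    # classify by the single nearest stored instance, then apply its class
    feats = features(word)
    _, cls = max(MEMORY, key=lambda m: overlap(m[0], feats))
    return word if cls == "0" else word[: -(len(cls) - 1)]
```

Because classification is by analogy to stored instances, an unseen word such as "appels" is lemmatised correctly via its similarity to "tafels"; this generalisation from memory is what the rule-based baseline lacks.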
Evaluation indicates that an accuracy figure of 92.8% is obtained when training Lia with the
best-performing parameters for the IB1 algorithm on feature-aligned data with 20 features.
This result indicates that memory-based learning is indeed more suitable than rule-based
methods for Afrikaans lemmatiser construction. / Thesis (M.Ing. (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2007.
|
3 |
Outomatiese Afrikaanse tekseenheididentifisering [Automatic Afrikaans tokenisation] / by Martin J. Puttkammer / Puttkammer, Martin Johannes, January 2006 (has links)
An important core technology in the development of human language technology
applications is an automatic morphological analyser. Such a morphological analyser
consists of various modules, one of which is a tokeniser. At present no tokeniser
exists for Afrikaans and it has therefore been impossible to develop a morphological
analyser for Afrikaans. Thus, in this research project such a tokeniser is being developed,
and the project therefore has two objectives: i) to postulate a tag set for integrated
tokenisation, and ii) to develop an algorithm for integrated tokenisation.
In order to achieve the first objective, a tag set for the tagging of sentences, named entities,
words, abbreviations and punctuation is proposed specifically for the annotation
of Afrikaans texts. It consists of 51 tags, which can be expanded in future in order to
establish a larger, more specific tag set. The postulated tag set can also be simplified
according to the level of specificity required by the user.
It is subsequently shown that an effective tokeniser cannot be developed using only
linguistic, or only statistical methods. This is due to the complexity of the task: rule-based
modules should be used for certain processes (for example sentence recognition),
while other processes (for example named-entity recognition) can only be executed
successfully by means of a machine-learning module. It is argued that a hybrid
system (a system where rule-based and statistical components are integrated) would
achieve the best results on Afrikaans tokenisation.
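The hybrid architecture argued for above can be sketched in miniature: a rule-based module handles sentence recognition, while (in place of the thesis's machine-learning module) a small gazetteer stands in for named-entity recognition. All rules, abbreviations and gazetteer entries below are illustrative assumptions, not the actual tokeniser:

```python
import re

# Hypothetical hybrid-tokeniser sketch: rules for sentences,
# a gazetteer lookup standing in for the statistical NER module.
ABBREVIATIONS = {"mnr.", "dr.", "bv."}      # do not split after these
GAZETTEER = {"Potchefstroom", "Afrikaans"}  # known named entities

def split_sentences(text):
    # rule-based: a token-final period ends a sentence unless the
    # token is a known abbreviation
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tag_tokens(sentence):
    # tag each token as NAMED-ENTITY, PUNCT or WORD
    tags = []
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        if token in GAZETTEER:
            tags.append((token, "NAMED-ENTITY"))
        elif re.fullmatch(r"[^\w\s]", token):
            tags.append((token, "PUNCT"))
        else:
            tags.append((token, "WORD"))
    return tags
```

The division of labour mirrors the argument in the abstract: sentence boundaries follow crisp rules, while named-entity decisions need knowledge (here a gazetteer, in the thesis a trained classifier) that rules alone cannot supply.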
Various rule-based and statistical techniques, including a TiMBL-based classifier, are
then employed to develop such a hybrid tokeniser for Afrikaans. The final tokeniser
achieves an f-score of 97.25% when the complete set of tags is used. For sentence
recognition an f-score of 100% is achieved. The tokeniser also recognises 81.39% of
named entities. When a simplified tag set (consisting of only 12 tags) is used to annotate
named entities, the f-score rises to 94.74%.
The conclusion of the study is that a hybrid approach is indeed suitable for Afrikaans
sentencisation, named-entity recognition and tokenisation. The tokeniser will improve
if it is trained with more data, while the expansion of gazetteers as well as the
tag set will also lead to a more accurate system. / Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2006.
|