Global ETD Search

1	Improving the Effectiveness of Machine-Assisted Annotation
Felt, Paul L. 10 May 2012
Annotated textual corpora are an essential language resource, facilitating manual search and discovery as well as supporting supervised Natural Language Processing (NLP) techniques designed to accomplish a variety of useful tasks. However, manual annotation of large textual corpora can be cost-prohibitive, especially for rare and under-resourced languages. For this reason, developers of annotated corpora often attempt to reduce annotation cost by offering annotators various forms of machine assistance intended to increase annotator speed and accuracy. This thesis contributes to the field of annotated corpus development by providing tools and methodologies for empirically evaluating the effectiveness of machine assistance techniques. This allows developers of annotated corpora to improve annotator efficiency by choosing to employ only machine assistance techniques that make a measurable, positive difference. We validate our tools and methodologies using a concrete example. First we present CCASH, a platform for machine-assisted online linguistic annotation capable of recording detailed annotator performance statistics. We employ CCASH to collect data detailing the performance of annotators engaged in Syriac morphological analysis in the presence of two machine assistance techniques: pre-annotation and correction propagation. We conduct a preliminary analysis of the data using the traditional approach of comparing mean data values. We then demonstrate a Bayesian analysis that yields deeper insights into the data. Pre-annotation is shown to increase annotator accuracy when pre-annotations are at least 60% accurate, and annotator speed when pre-annotations are at least 80% accurate. Correction propagation's effect on accuracy is minor. The Bayesian analysis indicates that correction propagation has a positive effect on annotator speed after accounting for the effects of the particular visual mechanism we employed to implement it.
Keywords: Syriac; Bayesian methods; Annotated Corpora; Machine-Assisted Annotation; Machine Assistance; Computer Sciences
2	Interactive Machine Assistance: A Case Study in Linking Corpora and Dictionaries
Black, Kevin P 01 November 2015
Machine learning can provide assistance to humans in making decisions, including linguistic decisions such as determining the part of speech of a word. Supervised machine learning methods derive patterns indicative of possible labels (decisions) from annotated example data. For many problems, including most language analysis problems, acquiring annotated data requires human annotators who are trained to understand the problem and to disambiguate among multiple possible labels. Hence, the availability of experts can limit the scope and quantity of annotated data. Machine-learned pre-annotation assistance, which suggests probable labels for unannotated items, can enable expert annotators to work more quickly and thus to produce broader and larger annotated resources more cost-efficiently. Yet, because annotated data is required to build the pre-annotation model, bootstrapping is an obstacle to utilizing pre-annotation assistance, especially for low-resource problems where little or no annotated data exists. Interactive pre-annotation assistance can mitigate bootstrapping costs, even for low-resource problems, by continually refining the pre-annotation model with new annotated examples as the annotators work. In practice, continual refinement has seldom been done except for the simplest of models, which can be trained quickly. As a case study in developing sophisticated, interactive, machine-assisted annotation, this work employs the task of corpus-dictionary linkage (CDL), which is to link each word token in a corpus to its correct dictionary entry. CDL resources, such as machine-readable dictionaries and concordances, are essential aids in many tasks including language learning and corpus studies. We employ a pipeline model to provide CDL pre-annotations, with one model per CDL sub-task. We evaluate different models for lemmatization, the most significant CDL sub-task since dictionary entry headwords are usually lemmas. The best performing lemmatization model is a hybrid which uses a maximum entropy Markov model (MEMM) to handle unknown (novel) word tokens and other component models to handle known word tokens. We extend the hybrid model design to the other CDL sub-tasks in the pipeline. We develop an incremental training algorithm for the MEMM which avoids wasting previous computation, as would happen when simply retraining from scratch. The incremental training algorithm facilitates the addition of new dictionary entries (i.e., new labels) over time and also facilitates learning from partially annotated sentences, which allows annotators to annotate words in any order. By simulating CDL annotation on Quranic Arabic and classical Syriac data, we validate that the hybrid model attains high accuracy and can be trained sufficiently quickly to provide interactive pre-annotation assistance.
Keywords: interactive machine assistance; machine-assisted annotation; corpus-dictionary linkage; annotation; supervised machine learning; string transduction; hybrid probabilistic models; low-resource languages; low-resource language settings; Arabic; Quran; Syriac; New Testament; Computer Sciences
