1 |
Improving the Effectiveness of Machine-Assisted Annotation
Felt, Paul L. 10 May 2012 (has links) (PDF)
Annotated textual corpora are an essential language resource, facilitating manual search and discovery as well as supporting supervised Natural Language Processing (NLP) techniques designed to accomplish a variety of useful tasks. However, manual annotation of large textual corpora can be cost-prohibitive, especially for rare and under-resourced languages. For this reason, developers of annotated corpora often attempt to reduce annotation cost by offering annotators various forms of machine assistance intended to increase annotator speed and accuracy. This thesis contributes to the field of annotated corpus development by providing tools and methodologies for empirically evaluating the effectiveness of machine assistance techniques. This allows developers of annotated corpora to improve annotator efficiency by choosing to employ only machine assistance techniques that make a measurable, positive difference. We validate our tools and methodologies using a concrete example. First we present CCASH, a platform for machine-assisted online linguistic annotation capable of recording detailed annotator performance statistics. We employ CCASH to collect data detailing the performance of annotators engaged in Syriac morphological analysis in the presence of two machine assistance techniques: pre-annotation and correction propagation. We conduct a preliminary analysis of the data using the traditional approach of comparing mean data values. We then demonstrate a Bayesian analysis of the data that yields deeper insights. Pre-annotation is shown to increase annotator accuracy when pre-annotations are at least 60% accurate, and annotator speed when pre-annotations are at least 80% accurate. Correction propagation's effect on accuracy is minor. The Bayesian analysis indicates that correction propagation has a positive effect on annotator speed after accounting for the effects of the particular visual mechanism we employed to implement it.
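The difference between comparing mean values and a Bayesian treatment of annotator accuracy can be sketched with a conjugate Beta-Binomial model. This is an illustrative sketch only: the counts, the Beta(1, 1) prior, and the two-condition comparison are invented for the example and are not the thesis's actual model or data.

```python
import random

def posterior(correct, total, a=1, b=1):
    # Beta(a, b) prior updated with binomial annotation counts
    return a + correct, b + total - correct

# Hypothetical counts: tokens annotated correctly with and without pre-annotation
with_pre = posterior(correct=183, total=200)
without_pre = posterior(correct=168, total=200)

# Posterior mean accuracy under each condition
mean_with = with_pre[0] / (with_pre[0] + with_pre[1])
mean_without = without_pre[0] / (without_pre[0] + without_pre[1])

# Monte Carlo estimate of P(accuracy with pre-annotation > accuracy without),
# a statement a comparison of raw means alone cannot make
random.seed(0)
wins = sum(
    random.betavariate(*with_pre) > random.betavariate(*without_pre)
    for _ in range(20_000)
)
print(round(mean_with, 3), round(mean_without, 3), round(wins / 20_000, 2))
```

The posterior comparison quantifies how certain we are that one condition beats the other, rather than only reporting which sample mean is larger.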
|
2 |
Interactive Transcription of Old Text Documents
Serrano Martínez-Santos, Nicolás 09 June 2014 (has links)
Nowadays, there are huge collections of handwritten text documents in libraries
all over the world. The high demand for these resources has led to the creation
of digital libraries in order to facilitate preservation and provide electronic
access to these documents. However, text transcriptions of these document
images are not always available to allow users to quickly search for information,
or computers to process the information, search for patterns, or draw out
statistics. The problem is that manual transcription of these documents is an
expensive task from both economical and time viewpoints. This thesis presents
a novel approach for efficient Computer Assisted Transcription (CAT) of
handwritten text documents using state-of-the-art Handwritten Text Recognition
(HTR) systems.
The objective of CAT approaches is to efficiently complete a transcription
task through human-machine collaboration, as the effort required to generate a
manual transcription is high, and automatically generated transcriptions from
state-of-the-art systems still do not reach the accuracy required. This thesis
is centred on a special application of CAT, namely the transcription of old
text documents when the amount of user effort available is limited, and thus
the entire document cannot be revised. In this setting, the objective is to
generate the best possible transcription with the user effort available.
This thesis provides a comprehensive view of the CAT process, from feature
extraction to user interaction.
First, a statistical approach to generalise interactive transcription is
proposed. As its direct application is unfeasible, some assumptions are made
to apply it to two different tasks: first, the interactive transcription of
handwritten text documents, and then the interactive detection of the document
layout.
Next, the digitisation and annotation process of two real old text documents
is described. This process was carried out because of the scarcity of similar
resources and the need for annotated data to thoroughly test all the tools and
techniques developed in this thesis. These two documents were carefully selected
to represent the general difficulties encountered when dealing with HTR.
Baseline results on these two documents are presented to establish a benchmark
with a standard HTR system. Finally, these annotated documents were made freely
available to the community. It must be noted that all the techniques and methods
developed in this thesis have been assessed on these two real old text
documents.
Then, a CAT approach for HTR when user effort is limited is studied and
extensively tested. The ultimate goal of applying CAT is achieved by putting
together three processes. Given a recognised transcription from an HTR system,
the first process consists in locating (possibly) incorrect words and employing
the available user effort to supervise them (if necessary). As most words are
not expected to be supervised due to the limited user effort available, only a
few are selected to be revised. The system presents to the user a small subset
of these words according to an estimation of their correctness or, to be more
precise, according to their confidence level. The second process starts once
these low-confidence words have been supervised: it updates the recognition of
the document taking the user corrections into consideration, which improves the
quality of the words that were not revised by the user. Finally, the last
process adapts the system using the partially revised (and possibly imperfect)
transcription obtained so far. In this adaptation, the system intelligently
selects the correct words of the transcription. As a result, the adapted system
will better recognise future transcriptions. Transcription experiments show
that this CAT approach is most effective when user effort is low.
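The first of the three processes, selecting which recognised words to present for supervision, can be sketched as ranking words by confidence and spending the limited user-effort budget on the least confident ones. The tokens, confidence values, and field names below are hypothetical, and the thesis's actual confidence estimation is far richer than this sketch.

```python
def select_for_supervision(words, budget):
    """Pick the lowest-confidence recognised words, up to the
    user-effort budget (number of words the user can revise)."""
    ranked = sorted(words, key=lambda w: w["conf"])
    return [w["token"] for w in ranked[:budget]]

# Hypothetical HTR output: each recognised word with a confidence score
recognised = [
    {"token": "quod", "conf": 0.97},
    {"token": "erat", "conf": 0.41},
    {"token": "demonstrandum", "conf": 0.88},
    {"token": "scriptum", "conf": 0.35},
]
print(select_for_supervision(recognised, budget=2))  # lowest confidence first
```

With a budget of two revisions, only the two least confident words are shown to the user; the remaining words are left to the later recognition-update and adaptation processes.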
The last contribution of this thesis is a method for balancing the final
transcription quality against the supervision effort applied, using the
previously described CAT approach. In other words, this method allows the user
to control the amount of error in the transcriptions obtained from a CAT
approach. The motivation is to let users decide on the final quality of the
desired documents, as partially erroneous transcriptions can be sufficient to
convey the meaning, and the user effort required to produce them can be
significantly lower than that of a totally manual transcription. The system
therefore estimates the minimum user effort required to reach the amount of
error defined by the user. Error estimation is performed by computing
separately the error produced by each recognised word and asking the user to
revise only those in which most errors occur.
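The idea of estimating the minimum effort needed to reach a user-defined error level can be sketched greedily: supervise words in decreasing order of estimated error until the expected residual error rate falls to the target. The per-word error estimates below are invented, and this greedy loop is a simplified stand-in for the thesis's estimation method.

```python
def minimum_supervision(word_errors, max_error):
    """Greedily supervise the most error-prone words until the expected
    residual error rate drops to the user-defined target."""
    order = sorted(range(len(word_errors)), key=lambda i: -word_errors[i])
    residual = sum(word_errors)
    supervised = []
    for i in order:
        if residual / len(word_errors) <= max_error:
            break
        residual -= word_errors[i]  # supervision corrects this word
        supervised.append(i)
    return supervised

# Hypothetical per-word expected-error estimates from the recogniser
errors = [0.02, 0.60, 0.10, 0.45, 0.05]
print(minimum_supervision(errors, max_error=0.05))
```

Revising only the two most error-prone words already brings the expected error rate under the 5% target in this toy example, which mirrors the abstract's point that partial supervision can suffice.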
Additionally, an interactive prototype is presented which integrates most of
the interactive techniques presented in this thesis. This prototype has been
developed to be used by palaeography experts who do not have any background in
HTR technologies. After slight fine-tuning by an HTR expert, the prototype lets
transcribers manually annotate the document or employ the CAT approach
presented. All automatic operations, such as recognition, are performed in the
background, detaching the transcriber from the details of the system. The
prototype was assessed by an expert transcriber and shown to be adequate and
efficient for its purpose. The prototype is freely available under the GNU
General Public License (GPL). / Serrano Martínez-Santos, N. (2014). Interactive Transcription of Old Text Documents [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/37979
|
3 |
Interactive Machine Assistance: A Case Study in Linking Corpora and Dictionaries
Black, Kevin P 01 November 2015 (has links) (PDF)
Machine learning can provide assistance to humans in making decisions, including linguistic decisions such as determining the part of speech of a word. Supervised machine learning methods derive patterns indicative of possible labels (decisions) from annotated example data. For many problems, including most language analysis problems, acquiring annotated data requires human annotators who are trained to understand the problem and to disambiguate among multiple possible labels. Hence, the availability of experts can limit the scope and quantity of annotated data. Machine-learned pre-annotation assistance, which suggests probable labels for unannotated items, can enable expert annotators to work more quickly and thus to produce broader and larger annotated resources more cost-efficiently. Yet, because annotated data is required to build the pre-annotation model, bootstrapping is an obstacle to utilizing pre-annotation assistance, especially for low-resource problems where little or no annotated data exists. Interactive pre-annotation assistance can mitigate bootstrapping costs, even for low-resource problems, by continually refining the pre-annotation model with new annotated examples as the annotators work. In practice, continually refining models has seldom been done except for the simplest of models which can be trained quickly. As a case study in developing sophisticated, interactive, machine-assisted annotation, this work employs the task of corpus-dictionary linkage (CDL), which is to link each word token in a corpus to its correct dictionary entry. CDL resources, such as machine-readable dictionaries and concordances, are essential aids in many tasks including language learning and corpus studies. We employ a pipeline model to provide CDL pre-annotations, with one model per CDL sub-task. We evaluate different models for lemmatization, the most significant CDL sub-task since dictionary entry headwords are usually lemmas.
The best performing lemmatization model is a hybrid which uses a maximum entropy Markov model (MEMM) to handle unknown (novel) word tokens and other component models to handle known word tokens. We extend the hybrid model design to the other CDL sub-tasks in the pipeline. We develop an incremental training algorithm for the MEMM which avoids wasting previous computation as would be done by simply retraining from scratch. The incremental training algorithm facilitates the addition of new dictionary entries over time (i.e., new labels) and also facilitates learning from partially annotated sentences which allows annotators to annotate words in any order. We validate that the hybrid model attains high accuracy and can be trained sufficiently quickly to provide interactive pre-annotation assistance by simulating CDL annotation on Quranic Arabic and classical Syriac data.
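The hybrid design, routing known tokens to one component and novel tokens to a fallback, can be sketched as below. The lookup table, the crude suffix-stripping fallback, and all tokens are hypothetical stand-ins; the thesis's actual fallback for unknown tokens is an MEMM, not a suffix rule.

```python
def make_hybrid_lemmatizer(known, fallback):
    """Known tokens use the annotated lookup; novel tokens fall back to a
    statistical guesser (here a toy stand-in for the MEMM component)."""
    def lemmatize(token):
        if token in known:
            return known[token]
        return fallback(token)
    return lemmatize

# Hypothetical annotated lookup built from previously linked tokens
lookup = {"running": "run", "mice": "mouse"}

def suffix_model(token):
    # Crude unknown-word guesser: strip a common suffix if present
    for suf in ("ing", "ed", "s"):
        if token.endswith(suf):
            return token[: -len(suf)]
    return token

lemmatize = make_hybrid_lemmatizer(lookup, suffix_model)
print(lemmatize("mice"), lemmatize("jumping"))  # known vs novel token
```

The routing is what matters here: as annotators link more tokens, entries migrate from the fallback's responsibility into the lookup, which is also where the incremental training of the unknown-word model pays off.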
|