
Interactive Transcription of Old Text Documents

Nowadays, there are huge collections of handwritten text documents in libraries
all over the world. The high demand for these resources has led to the creation
of digital libraries to facilitate their preservation and provide electronic
access to these documents. However, text transcriptions of these document
images are not always available to let users quickly search for information, or
to let computers process the information, search for patterns or draw out
statistics. The problem is that manual transcription of these documents is an
expensive task in terms of both cost and time. This thesis presents a novel
approach for efficient Computer Assisted Transcription (CAT) of handwritten text
documents using state-of-the-art Handwriting Text Recognition (HTR) systems.
The objective of CAT approaches is to efficiently complete a transcription
task through human-machine collaboration, as the effort required to generate a
manual transcription is high, and automatically generated transcriptions from
state-of-the-art systems still do not reach the accuracy required. This thesis
is centred on a special application of CAT, that is, the transcription of old
text documents when the quantity of user effort available is limited and, thus,
the entire document cannot be revised. In this approach, the objective is to
generate the best possible transcription with the user effort available.
This thesis provides a comprehensive view of the CAT process from feature
extraction to user interaction.
First, a statistical approach to generalise interactive transcription is
proposed. As its direct application is infeasible, some assumptions are made to
apply it to two different tasks: first, the interactive transcription of
handwritten text documents, and second, the interactive detection of the
document layout.
Next, the digitisation and annotation process of two real old text documents
is described. This process was carried out because of the scarcity of similar
resources and the need for annotated data to thoroughly test all the tools and
techniques developed in this thesis. These two documents were carefully selected
to represent the general difficulties that are encountered when dealing with
HTR. Baseline results are presented on these two documents to establish a
benchmark with a standard HTR system. Finally, these annotated documents
were made freely available to the community. It must be noted that all the
techniques and methods developed in this thesis have been assessed on these
two real old text documents.
Then, a CAT approach for HTR when user effort is limited is studied and
extensively tested. The ultimate goal of applying CAT is achieved by putting
together three processes. Given a recognised transcription from an HTR system,
the first process consists in locating (possibly) incorrect words and employing
the user effort available to supervise them (if necessary). As most words are
not expected to be supervised due to the limited user effort available, only a
few are selected to be revised. The system presents to the user a small subset
of these words according to an estimation of their correctness or, to be more
precise, according to their confidence level. Next, the second process starts
once these low-confidence words have been supervised. This process updates the
recognition of the document taking user corrections into consideration, which
improves the quality of those words that were not revised by the user. Finally,
the last process adapts the system from the partially revised (and possibly not
perfect) transcription obtained so far. In this adaptation, the system
intelligently selects the correct words of the transcription. As a result, the
adapted system will better recognise future transcriptions. Transcription
experiments show that this CAT approach is most effective when user effort is
low.
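
The first of the three processes above, choosing which words to show to the user under a limited effort budget, can be sketched as follows. This is only an illustrative sketch and not the thesis implementation: the `Word` class, the `select_for_supervision` and `supervise` functions, the representation of confidence as a score in [0, 1], and the `ask_user` callback are all assumptions introduced here. The re-recognition and adaptation processes are not shown.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical confidence score in [0, 1]


def select_for_supervision(words, budget):
    """Return the indices of the `budget` lowest-confidence words, in reading order."""
    ranked = sorted(range(len(words)), key=lambda i: words[i].confidence)
    return sorted(ranked[:budget])


def supervise(words, budget, ask_user):
    """Replace each selected word with the user's answer; leave the rest untouched.

    `ask_user(index, recognised_text)` is a hypothetical callback that returns
    the correct text for the word at `index`.
    """
    chosen = select_for_supervision(words, budget)
    revised = list(words)
    for i in chosen:
        # A supervised word is treated as fully trusted from now on.
        revised[i] = Word(ask_user(i, words[i].text), 1.0)
    return revised, chosen
```

With a budget of two words, only the two least confident words of the line are passed to `ask_user`; the supervised positions returned in `chosen` are what a later re-recognition step would take as fixed.
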
The last contribution of this thesis is a method for balancing the final
transcription quality and the supervision effort applied using our previously
described CAT approach. In other words, this method allows the user to control
the number of errors in the transcriptions obtained from a CAT approach. The
motivation of this method is to let users decide on the final quality of the
desired documents, as partially erroneous transcriptions can be sufficient to
convey the meaning, and the user effort required to transcribe them might be
significantly lower than that of obtaining a totally manual transcription.
Consequently, the system estimates the minimum user effort required to reach
the amount of error defined by the user. Error estimation is performed by
computing separately the error produced by each recognised word, and thus
asking the user to revise only the ones in which most errors occur.
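
One way to picture the effort estimation described above is the greedy sketch below. It is an assumption-laden illustration rather than the estimator used in the thesis: the expected error of a word is taken to be one minus a hypothetical confidence score, a supervised word is assumed to become error-free, and the function name `words_to_revise` is invented here.

```python
def words_to_revise(confidences, target_error_rate):
    """Pick the fewest words to supervise so that the estimated error rate of
    the remaining transcription does not exceed `target_error_rate`.

    Assumption: the expected error of a word is 1 - confidence, and a
    supervised word contributes no error afterwards.
    """
    n = len(confidences)
    if n == 0:
        return []
    expected_errors = [1.0 - c for c in confidences]
    remaining = sum(expected_errors)
    # Visit words from the highest expected error to the lowest.
    order = sorted(range(n), key=lambda i: expected_errors[i], reverse=True)
    chosen = []
    for i in order:
        if remaining / n <= target_error_rate:
            break
        chosen.append(i)
        remaining -= expected_errors[i]
    return sorted(chosen)
```

The length of the returned list is the estimated supervision effort in words; a looser error target yields a shorter list, which is the trade-off between quality and effort that the method exposes to the user.
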
Additionally, an interactive prototype is presented, which integrates most
of the interactive techniques presented in this thesis. This prototype has been
developed to be used by palaeographic experts, who do not have any background
in HTR technologies. After slight fine-tuning by an HTR expert, the prototype
lets transcribers manually annotate the document or employ the CAT approach
presented. All automatic operations, such as recognition, are performed
in the background, detaching the transcriber from the details of the system. The
prototype was assessed by an expert transcriber and was shown to be adequate and
efficient for its purpose. The prototype is freely available under the GNU
General Public License (GPL).

Serrano Martínez-Santos, N. (2014). Interactive Transcription of Old Text Documents [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/37979

Identifier: oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/37979
Date: 09 June 2014
Creators: Serrano Martínez-Santos, Nicolás
Contributors: Civera Saiz, Jorge; Juan Císcar, Alfonso; Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació
Publisher: Universitat Politècnica de València
Source Sets: Universitat Politècnica de València
Language: English
Detected Language: English
Type: info:eu-repo/semantics/doctoralThesis, info:eu-repo/semantics/acceptedVersion
Source: Riunet
Rights: http://rightsstatements.org/vocab/InC/1.0/, info:eu-repo/semantics/openAccess
