Lost in Transcription: Evaluating Clustering and Few-Shot learning for transcription of historical ciphers

While Optical Character Recognition (OCR) techniques for printed documents have developed steadily, Handwritten Text Recognition (HTR) tools for transcribing handwritten manuscripts still lag behind. Focusing on historical ciphers (i.e. encrypted documents from the past with various types of symbol sets), this thesis examines the performance of two machine learning architectures developed within the DECRYPT project framework: a clustering-based unsupervised algorithm and a semi-supervised few-shot deep-learning model. Both models are tested on seen and unseen scribes to evaluate the difference in performance and the shortcomings of the two architectures, with the secondary goal of determining how the datasets influence performance. An in-depth analysis of the transcription results is performed, with particular focus on the Alchemic and Zodiac symbol sets and on model performance relative to character shape and size. The results show the promising performance of the few-shot architecture compared to the clustering algorithm: the few-shot model reaches an average SER of 0.336 (0.15 and 0.104 on seen data, 0.754 on unseen data), while the clustering algorithm reaches 0.596 (0.638 and 0.350 on seen data, 0.8 on unseen data).
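The reported averages can be checked with a short sketch. It assumes that SER stands for symbol error rate and that it is computed as the edit distance between predicted and reference symbol sequences divided by the reference length; the abstract does not state the exact definition, so `symbol_error_rate` is a hypothetical illustration rather than the thesis's implementation. It also assumes that each overall average is the unweighted mean of the three per-dataset scores, which matches the figures quoted above.

```python
from statistics import mean


def symbol_error_rate(reference, hypothesis):
    """Edit distance between two symbol sequences, divided by reference length.

    Assumed definition of SER; the abstract does not spell it out.
    """
    if len(reference) == 0:
        raise ValueError("reference must contain at least one symbol")
    # Dynamic-programming Levenshtein distance, keeping one row at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_sym in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_sym in enumerate(hypothesis, start=1):
            cost = 0 if ref_sym == hyp_sym else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Per-dataset SER values as quoted in the abstract.
reported = {
    "few_shot": {"seen": [0.15, 0.104], "unseen": [0.754]},
    "clustering": {"seen": [0.638, 0.350], "unseen": [0.8]},
}


def overall_average(scores):
    """Unweighted mean over all seen and unseen scores (assumed aggregation)."""
    return round(mean(scores["seen"] + scores["unseen"]), 3)


averages = {name: overall_average(s) for name, s in reported.items()}
print(averages)
```

Under these assumptions the unweighted means reproduce the 0.336 and 0.596 averages given in the abstract, and they show that the gap between the two models is much wider on seen scribes than on the unseen one.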

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-460248
Date: January 2021
Creators: Magnifico, Giacomo
Publisher: Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
