Return to search

Towards a Language Model for Stenography : A Proof of Concept

The availability of the stenographic manuscripts of Astrid Lindgren have sparked an interest in the creation of a language model for stenography. By its very nature stenography is low-resource and the unavailability of data requires a tool for using normal data. The tool presented in this thesis is to create stenographic data from manipulating orthographic data. Stenographic data is distinct from orthographic data through three different types manipulations that can be carried out. Firstly stenography is based on a phonetic version of language, secondly it used its own alphabet that is distinct from normal orthographic data, and thirdly it used several techniques to compress the data. The first type of manipulation is done by using a grapheme-to-phoneme converter. The second type is done by using an orthographic representation of a stenographic alphabet. The third type of manipulation is done by manipulating based on subword level, word level and phrase level. With these manipulations different datasets are created with different combinations of these manipulations. Results are measured for both perplexity on a GPT-2 language model and for compression rate on the different datasets. These results show a general decrease of perplexity scores and a slight compression rate across the board. We see that the lower perplexity scores are possibly due to the growth of ambiguity.

Identiferoai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-489199
Date January 2022
CreatorsLangstraat, Naomi Johanna
PublisherUppsala universitet, Institutionen för lingvistik och filologi
Source SetsDiVA Archive at Upsalla University
LanguageEnglish
Detected LanguageEnglish
TypeStudent thesis, info:eu-repo/semantics/bachelorThesis, text
Formatapplication/pdf
Rightsinfo:eu-repo/semantics/openAccess

Page generated in 0.0016 seconds