The availability of Astrid Lindgren's stenographic manuscripts has sparked interest in creating a language model for stenography. Stenography is by its very nature low-resource, and the lack of stenographic data calls for a tool that can make use of ordinary orthographic data. The tool presented in this thesis creates stenographic data by manipulating orthographic data. Stenographic data differs from orthographic data in three ways, each corresponding to a type of manipulation. Firstly, stenography is based on a phonetic version of the language; secondly, it uses its own alphabet, distinct from the normal orthographic one; and thirdly, it uses several techniques to compress the data. The first type of manipulation is carried out with a grapheme-to-phoneme converter. The second uses an orthographic representation of a stenographic alphabet. The third applies compression at the subword, word, and phrase levels. Different combinations of these manipulations yield different datasets. Results are measured on each dataset as perplexity on a GPT-2 language model and as compression rate. They show a general decrease in perplexity scores and slight compression across the board. The lower perplexity scores are possibly due to the growth of ambiguity.
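The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abbreviation table `WORD_ABBREVIATIONS`, the helper names, and the definition of compression rate as the fraction of characters saved are all assumptions made for illustration. Perplexity is computed in the standard way, as the exponential of the mean negative log-likelihood per token.

```python
import math

# Hypothetical word-level abbreviation table; NOT taken from the thesis.
WORD_ABBREVIATIONS = {"and": "&", "the": "t", "with": "w"}


def abbreviate_words(text):
    """Replace whole words by their (assumed) short forms."""
    return " ".join(WORD_ABBREVIATIONS.get(word, word) for word in text.split())


def compression_rate(original, compressed):
    """Fraction of characters saved.

    Assumed definition: 1 - len(compressed) / len(original).
    """
    if not original:
        raise ValueError("original text must be non-empty")
    return 1.0 - len(compressed) / len(original)


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    source = "the cat and the dog"
    target = abbreviate_words(source)
    print(target)
    print(round(compression_rate(source, target), 3))
    # A model that assigns probability 0.5 to every token has perplexity 2.
    print(perplexity([math.log(0.5)] * 4))
```

In practice the token log-probabilities would come from the language model being evaluated, for example GPT-2; the uniform values above only stand in for that output.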
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-489199 |
Date | January 2022 |
Creators | Langstraat, Naomi Johanna |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |