Author profiling has historically been used in forensic linguistics. Only in recent decades, however, has the method made its way into computer science and machine learning. Even so, determining author profiling characteristics with machine learning is nothing new. This thesis investigates whether previous results can be improved with modern frameworks, using data sets that have seen limited use. The purpose of this master's thesis was to apply pre-trained transformers or embeddings together with transfer learning, and to examine whether general author profiling characteristics of anonymous users on internet forums or in social media conversations could be determined. The data sets used were PAN15 and PANDORA, which contain text data from authors paired with ground truth labels such as gender, age, and Big Five/OCEAN personality traits. Transfer learning with BERT and GloVe was used as a starting point to reduce the training time for a new task. PAN15, a Twitter data set, did not contain enough data to train a model and was therefore augmented with PANDORA, a Reddit-based data set. Ultimately, BERT obtained the best performance using a stacked approach, achieving 86–91% accuracy for each label on unseen data.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:ltu-89737 |
Date | January 2022 |
Creators | From, Viktor |
Publisher | Luleå tekniska universitet, Institutionen för system- och rymdteknik |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |