Authorship identification is the task of classifying a given text by its author, based on the assumption that authors exhibit unique writing styles. This thesis investigates the semantic shortcomings of the vector space model by constructing a semantic kernel from WordNet and evaluating it on the problem of authorship attribution. A multiclass SVM classifier is built using the one-versus-all strategy and evaluated in terms of precision, recall, accuracy and F1 score. The results show that using the semantic scores from WordNet degrades performance compared to a linear kernel. Experiments are also run to identify the best feature engineering configurations; they show that removing stopwords has a positive effect on the Reuters financial dataset, while the Kaggle dataset, which consists of short extracts from horror stories, benefits from keeping the stopwords.
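The difference between a linear kernel and a semantic kernel can be illustrated with a minimal sketch. It assumes one common form of semantic kernel, k(x, y) = xᵀ S y, where S is a term-by-term similarity matrix; the record does not state the exact construction used in the thesis. The vocabulary, the similarity values and the function names below are hypothetical and are not taken from WordNet or from the thesis.

```python
from typing import List

# Hypothetical toy vocabulary (not the thesis's feature set).
VOCAB: List[str] = ["ghost", "spirit", "bank", "loan"]

# Hypothetical hand-picked term similarities standing in for WordNet scores.
# The matrix is symmetric with ones on the diagonal and is positive
# semidefinite, so the resulting function is a valid kernel.
SIMILARITY: List[List[float]] = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]


def bag_of_words(text: str) -> List[float]:
    """Count how often each vocabulary term occurs in a whitespace-split text."""
    tokens = text.lower().split()
    return [float(tokens.count(term)) for term in VOCAB]


def linear_kernel(x: List[float], y: List[float]) -> float:
    """Plain dot product: only exactly matching terms contribute."""
    return sum(a * b for a, b in zip(x, y))


def semantic_kernel(x: List[float], y: List[float],
                    sim: List[List[float]]) -> float:
    """Compute x^T S y: related but different terms also contribute."""
    return sum(
        x[i] * sim[i][j] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


doc_a = bag_of_words("the ghost appeared")
doc_b = bag_of_words("a spirit appeared")

linear_score = linear_kernel(doc_a, doc_b)
semantic_score = semantic_kernel(doc_a, doc_b, SIMILARITY)
print(linear_score, semantic_score)
```

In this toy case the two documents share no vocabulary term, so the linear kernel sees them as unrelated, while the semantic kernel gives them a nonzero similarity through the assumed relation between "ghost" and "spirit". Whether that extra similarity helps or hurts a classifier is an empirical question; the abstract reports that it degraded performance in this study.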
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-412897 |
Date | January 2020 |
Creators | Westin, Emil |
Publisher | Uppsala universitet, Statistiska institutionen |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |