This study explores the possibility of classifying language as governing or not. The basic premise is to examine how governing conditions in thousands of financial grants in appropriation directions can be detected and quantified automatically, and to create a data set for applying machine learning to this text classification task. Automatic classification is performed alongside an annotation process in which data is extracted and labelled. Automatic classification can be performed using a variety of data, methods and tasks. Here, the classification task mainly aims to divide conditions into those that govern the conduct of the specific agency and those that do not. The data consists of text from the chapter of the appropriation directions that concerns financial grants. The text is split into sentences, and only sentences longer than 15 words are kept. An iterative annotation process is then performed to obtain labelled conditions, involving three expert annotators for the final data set and lay annotators for initial experiments. Using the data from the annotation process, SVM, BiLSTM and KB-BERT classifiers are trained and evaluated. By default, all models are evaluated without context information, except for bullet points, where a preceding, generally descriptive sentence is included. Apart from this default input representation, two alternative representations are evaluated: the preceding sentence together with the target sentence, and the specific agency added to the target sentence. The final inter-annotator agreement was not optimal, with Cohen’s kappa scores that can be interpreted as moderate agreement. Using majority vote for the test set somewhat mitigated the non-optimal agreement for this specific set.
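The preprocessing and agreement steps described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the function names, the whitespace-based word count, and the label values are assumptions made for illustration only.

```python
from collections import Counter


def keep_long_sentences(sentences, min_words=15):
    """Keep sentences with more than `min_words` whitespace-separated tokens.

    Assumption: words are counted by splitting on whitespace; the study's
    actual tokenisation is not specified in the abstract.
    """
    return [s for s in sentences if len(s.split()) > min_words]


def majority_vote(labels):
    """Return the label chosen by most annotators.

    With three annotators and two classes there is always a strict majority.
    """
    return Counter(labels).most_common(1)[0][0]


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (observed - expected) / (1 - expected)
```

For example, `majority_vote(["governing", "not_governing", "governing"])` returns `"governing"`, and `cohen_kappa` can be applied to each pair of annotators to obtain pairwise agreement scores.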
The best-performing model across all input representation types was KB-BERT using no context information, with an F1-score of 0.81 and an accuracy of 0.89 on the test set. All models performed better on sentences classed as governing, which might be partly due to the final annotated data sets being skewed. Possible future studies include further iterative annotation and working towards a clear and as objective as possible definition of a governing condition, as well as exploring data augmentation to counteract the uneven class distribution in the final data sets.
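To make the relation between accuracy, F1-score and class skew concrete, the sketch below computes per-class and macro-averaged F1 from scratch. The abstract does not state how the reported F1-score was averaged, so macro averaging is an assumption here, and the function names and example labels are purely illustrative, not data from the study.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """F1-score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)


# Illustrative, skewed toy data: 1 = governing, 0 = not governing.
toy_true = [1] * 8 + [0] * 2
toy_pred = [1] * 8 + [1, 0]
```

On this toy data a single error on the minority class leaves accuracy high while pulling that class's F1 well below the majority class's, which is the kind of effect a skewed data set can produce.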
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-479525 |
Date | January 2022 |
Creators | Wallerö, Emma |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |