
TEXT ANNOTATION IN PARLIAMENTARY RECORDS USING BERT MODELS

This thesis investigates whether a transformer-based language model can be improved by training the model on context sequences, that is, input sequences that include a larger window of surrounding text; by combining a transformer model with a neural network for non-text features; or by domain-adaptive pre-training. Two types of context input sequences are tested: left context and full context. The three modifications are explored by applying BERT models to the Swedish Parliamentary Corpus to classify whether a text sequence is a heading. A standard BERT model is trained for sequence classification alongside a position model, which adds an additional feedforward neural network to the model. Each model is trained with and without context sequences, as well as with and without domain-adaptive pre-training.

A standard implementation of the BERT model with domain adaptation achieves an F1 score of 0.9358 and an accuracy of 0.9940 on the test set. The best-performing standard BERT model with a context input sequence achieves an F1 score of 0.9636 and an accuracy of 0.9966, while the best-performing position model achieves an F1 score of 0.9550 and an accuracy of 0.9957. The best-performing model, which combines context input sequences with the position model, achieves an F1 score of 0.9908 and an accuracy of 0.9991 on the test set.

Analysis of misclassified sequences suggests that the models with context input sequences and positional features are less likely to misclassify sequences that can appear both as a heading and as a non-heading in the corpus. However, a McNemar's exact test indicates that only the position model with left-context input sequences differs significantly from its standard BERT counterpart in the number of differing misclassifications at the 5% significance level. Furthermore, there is no experimental evidence that domain-adaptive pre-training improves classification performance on this sequence classification task.
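
As a concrete illustration, the sketch below shows one plausible way to realize the position model described in the abstract: a BERT encoder whose [CLS] representation is concatenated with the output of a small feedforward network over non-text (positional) features before a classification layer. This is a minimal sketch, not the thesis's actual code; the encoder name, feature count, and layer sizes are illustrative assumptions.

# Minimal sketch (assumed architecture) of a "position model": a BERT
# encoder combined with a feedforward network over non-text features.
import torch
import torch.nn as nn
from transformers import AutoModel

class PositionModel(nn.Module):
    def __init__(self,
                 encoder_name="KB/bert-base-swedish-cased",  # assumed encoder
                 num_position_features=4,                    # assumed feature count
                 ff_hidden_size=64,                          # assumed layer size
                 num_labels=2):                              # heading vs. non-heading
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # Feedforward branch for the non-text (positional) features.
        self.position_ff = nn.Sequential(
            nn.Linear(num_position_features, ff_hidden_size),
            nn.ReLU(),
        )
        # Classifier over the concatenated [CLS] embedding and position features.
        self.classifier = nn.Linear(
            self.encoder.config.hidden_size + ff_hidden_size, num_labels
        )

    def forward(self, input_ids, attention_mask, position_features):
        # [CLS] token embedding summarizes the (possibly context-extended) input sequence.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        pos = self.position_ff(position_features)
        return self.classifier(torch.cat([cls, pos], dim=-1))

Under this reading, context input sequences require no architectural change: the tokenized input simply includes the left (or full) surrounding text, while the position branch carries the non-text features alongside it.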

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-532972
Date: January 2024
Creators: Eriksson, Fabian
Publisher: Uppsala universitet, Statistiska institutionen
Source Sets: DiVA Archive at Uppsala University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess