
Readability Assessment with Pre-Trained Transformer Models: An Investigation with Neural Linguistic Features

Readability assessment (RA) is the task of assigning a score or grade to a given document that measures how difficult the document is to read. RA originated in language education studies, where it was used to classify reading materials for language learners. It was later applied in many other areas, such as aiding automatic text simplification. This thesis aims to improve the way Transformers are used for RA. The motivation is the “pipeline” effect (Tenney et al., 2019) of pretrained Transformers: lexical, syntactic, and semantic features are best encoded at different layers of a Transformer model. After a preliminary test of a basic RA model resembling those in previous work, we proposed several methods to enhance performance: using a Transformer layer other than the last, concatenating or mixing the outputs of all layers, and using syntax-augmented Transformer layers. We examined these enhanced methods on three datasets: WeeBit, OneStopEnglish, and CommonLit. We observed that the improvements correlated clearly with the characteristics of each dataset. On the OneStopEnglish and CommonLit datasets, we achieved absolute improvements of 1.2% in F1 score and 0.6% in Pearson’s correlation coefficient, respectively. We also show that an 𝑛-gram frequency-based baseline, which is simple but was not reported in previous work, has superior performance on the classification datasets (WeeBit and OneStopEnglish), prompting further research on vocabulary-based lexical features for RA.
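The three layer-combination strategies named in the abstract (picking a layer other than the last, concatenating all layers, and mixing all layers) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names are hypothetical, each layer output is reduced to a plain list of floats standing in for a pooled document vector, and "mixing" is assumed here to mean a softmax-weighted sum over layers.

```python
import math


def softmax(scores):
    """Turn arbitrary per-layer scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_layer(layer_outputs, index):
    """Use the output of a single layer, not necessarily the last one."""
    return list(layer_outputs[index])


def concat_layers(layer_outputs):
    """Concatenate the outputs of all layers into one long vector."""
    return [value for layer in layer_outputs for value in layer]


def mix_layers(layer_outputs, scores):
    """Weighted sum of all layers (assumed interpretation of 'mixing')."""
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layer_outputs))
        for i in range(dim)
    ]


# Toy stand-in for three layers of 2-dimensional pooled representations.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

middle = select_layer(layers, 1)
joined = concat_layers(layers)
mixed = mix_layers(layers, [0.0, 0.0, 0.0])  # equal scores give a plain average
```

In a trained model the mixing scores would be learned parameters rather than fixed values, and the resulting vector would be passed to a classification or regression head.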

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-484356
Date: January 2022
Creators: Ma, Chuchu
Publisher: Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
