In this work, we explore the usefulness of contextualized embeddings from language models for lexical semantic change (LSC) detection. Using diachronic corpora spanning two time periods, we construct word embeddings for a selected set of target words, aiming to detect potential LSC of each target word across time. We explore different embedding systems along three dimensions: contextualized vs. static word embeddings, token- vs. type-based embeddings, and multilingual vs. monolingual language models. We use a multilingual dataset covering three languages (English, German, Swedish) and evaluate each embedding system on two subtasks, a binary classification task and a ranking task. We compare the performance of the different embedding systems and seek to answer our research questions through discussion and analysis of the experimental results. We show that contextualized word embeddings are on par with static word embeddings in the classification task. Our results also show that, in most cases, it is more beneficial to use contextualized embeddings from a multilingual model than from a language-specific model. We find that the token-based setting is strong for static embeddings, and the type-based setting for contextualized embeddings, especially in the ranking task. We offer explanations for the results we obtain and propose improvements to our experiments for future work.
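As a concrete illustration of the pipeline the abstract describes, the sketch below shows one way to score a target word for semantic change from contextualized embeddings. It is a minimal sketch, not the thesis's implementation: it assumes the HuggingFace transformers library and the multilingual model bert-base-multilingual-cased, and the function names (token_embeddings, change_score, has_changed) and the 0.2 threshold are illustrative placeholders, not taken from the thesis.

```python
# Minimal sketch: type-based scoring from contextualized (token-based)
# embeddings, under the assumptions stated above.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: a multilingual model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def token_embeddings(target: str, sentences: list[str]) -> np.ndarray:
    """Collect one contextualized (token-based) embedding per occurrence
    of `target` across its usage sentences in one corpus."""
    vecs = []
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        # Locate the target's subword span and average its subword vectors.
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(target_ids) + 1):
            if ids[i:i + len(target_ids)] == target_ids:
                vecs.append(hidden[i:i + len(target_ids)].mean(0).numpy())
                break
    return np.stack(vecs)

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def change_score(target: str, sents_t1: list[str], sents_t2: list[str]) -> float:
    """Type-based setting: average all token embeddings of the target in
    each time period into one type vector, then compare the two vectors.
    The resulting distance is used directly for the ranking subtask."""
    e1 = token_embeddings(target, sents_t1).mean(0)
    e2 = token_embeddings(target, sents_t2).mean(0)
    return cosine_distance(e1, e2)

def has_changed(target: str, sents_t1: list[str], sents_t2: list[str],
                threshold: float = 0.2) -> bool:
    """Binary classification subtask: threshold the ranking score.
    The 0.2 threshold is an arbitrary placeholder, not from the thesis."""
    return change_score(target, sents_t1, sents_t2) > threshold
```

Averaging per-occurrence token vectors into one vector per time period is what turns a token-based contextualized model into a type-based representation, and cosine distance between the two period-level vectors is a common scoring choice for this kind of two-period LSC setup.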
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-444871
Date | January 2021
Creators | You, Huiling |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |