Exploring Language Descriptions through Vector Space Models

The abundance of natural languages and the complexities involved in describing their structures pose significant challenges for modern linguists, not only in documentation but also in the systematic organization of knowledge. Computational linguistics tools hold promise in comprehending the “big picture”, provided existing grammars are digitized and made available for analysis using state-of-the-art language models. Extensive efforts have been made by an international team of linguists to compile such a knowledge base, resulting in the DReaM corpus – a comprehensive dataset comprising tens of thousands of digital books containing multilingual language descriptions. However, there remains a lack of tools that facilitate understanding of concise language structures and uncovering overlooked topics and dialects. This thesis represents a small step towards elucidating the broader picture by utilizing a subset of the DReaM corpus as a vector space capable of capturing genetic ties among described languages. To achieve this, we explore several encoding algorithms in conjunction with various segmentation strategies and vector summarization approaches for generating both monolingual and cross-lingual feature representations of selected grammars in English and Russian. Our newly proposed sentence-facets TF-IDF model shows promise in unsupervised generation of monolingual representations, conveying sufficient signal to differentiate historical linguistic relations among 484 languages from 26 language families based on their descriptions. However, the construction of a cross-lingual vector space necessitates further exploration of advanced technologies.

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-531135
Date: January 2024
Creators: Aleksandrova, Anastasiia
Publisher: Uppsala universitet, Institutionen för lingvistik och filologi, U
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
