
Enhancing Document Retrieval in the FinTech Domain: Applications of Advanced Language Models

In this thesis, methods of creating an information retrieval (IR) model within the FinTech domain are explored. Given the domain-specific and data-scarce environment, methods of artificially generating data to train and evaluate IR models are implemented and their limitations are discussed. The generative model GPT-J 6B is used to generate pseudo-queries for a document corpus, resulting in a training set of 148 and a test set of 166 query-document pairs. Transformer-based models, both fine-tuned and original versions, are tested against the baseline model BM25, which has historically been regarded as an effective document retrieval model. The models are evaluated using mean reciprocal rank at k (MRR@k) and the time cost of retrieving relevant documents. The main finding is that the classical BM25 model performs well compared to the transformer alternatives, reaching the highest score at MRR@2 = 0.612. The results show that for MRR@5 and MRR@10, a combination model of BM25 and a cross-encoder slightly outperforms the baseline, reaching scores of MRR@5 = 0.655 and MRR@10 = 0.672. However, the increase in performance is slim and may not be enough to motivate an implementation. Finally, further research using real-world data is required to argue that transformer-based models are more robust in a real-world setting.
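The evaluation metric reported above, MRR@k, can be sketched as follows. This is a minimal illustration, not the thesis's code: it assumes binary relevance with exactly one relevant document per query, as in a pseudo-query setup where each generated query is paired with its source document. The function names and toy document IDs are hypothetical.

```python
def mrr_at_k(ranked_lists, relevant_ids, k):
    """Mean reciprocal rank at cutoff k.

    ranked_lists: per-query lists of document IDs, best first.
    relevant_ids: the single relevant document ID for each query.
    A query whose relevant document is not in the top k contributes 0.
    """
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Toy example with two queries: the first finds its relevant
# document at rank 2, the second misses within the cutoff.
score = mrr_at_k([["d3", "d1", "d7"], ["d2", "d9", "d1"]], ["d1", "d1"], k=2)
print(score)  # (1/2 + 0) / 2 = 0.25
```

With this convention, a higher cutoff k can only raise the score, which is consistent with the reported pattern MRR@2 ≤ MRR@5 ≤ MRR@10 for the combination model.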

Identifier oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-226528
Date January 2024
Creators Hansen, Jesper
Publisher Umeå universitet, Institutionen för matematik och matematisk statistik
Source Sets DiVA Archive at Upsalla University
Language English
Detected Language English
Type Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format application/pdf
Rights info:eu-repo/semantics/openAccess
