This thesis explores methods of creating an information retrieval (IR) model within the FinTech domain. Given the domain-specific and data-scarce environment, methods of artificially generating data to train and evaluate IR models are implemented and their limitations discussed. The generative model GPT-J 6B is used to generate pseudo-queries for a document corpus, resulting in training and test sets of 148 and 166 query-document pairs, respectively. Transformer-based models, both fine-tuned and original versions, are tested against the baseline model BM25, which has historically been regarded as an effective document retrieval model. The models are evaluated using mean reciprocal rank at k (MRR@k) and the time cost of retrieving relevant documents. The main finding is that BM25 performs well compared to the transformer alternatives, reaching the highest score at MRR@2 = 0.612. For MRR@5 and MRR@10, a combination of BM25 and a cross-encoder slightly outperforms the baseline, reaching MRR@5 = 0.655 and MRR@10 = 0.672. However, the performance gain is slim and may not be enough to justify an implementation. Finally, further research using real-world data is required to establish whether transformer-based models are more robust in a real-world setting.
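The evaluation metric named in the abstract, MRR@k, can be sketched as follows. This is a minimal illustration of the standard definition (the reciprocal of the rank of the first relevant document within the top k results, averaged over queries), not the thesis's actual evaluation code; the function name and input layout are assumptions for the example.

```python
def mrr_at_k(ranked_lists, relevant_ids, k):
    """Mean reciprocal rank at cutoff k.

    ranked_lists: one ranked list of document ids per query.
    relevant_ids: the relevant document id for each query
                  (one relevant document per query, as in a
                  pseudo-query/document pair setup).
    """
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == rel:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
        # if the relevant document is outside the top k, it contributes 0
    return total / len(ranked_lists)
```

For example, if two queries place their relevant documents at ranks 1 and 2, `mrr_at_k` with k = 2 returns (1/1 + 1/2) / 2 = 0.75; increasing k can only raise the score, which is consistent with the reported MRR@2 < MRR@5 < MRR@10.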
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-226528 |
Date | January 2024 |
Creators | Hansen, Jesper |
Publisher | Umeå universitet, Institutionen för matematik och matematisk statistik |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |