1

Trade-offs between Quality and Efficiency in Multilingual Dense Retrieval / Avvägningar mellan kvalitet och effektivitet i flerspråkig tät informationssökning

Schüldt, Emma January 2022 (has links)
As the amount of content online grows, information retrieval becomes increasingly crucial. Traditional information retrieval does not take word order into account and depends on exact text matching between the query and the document. A query consisting of synonyms of words in a document will therefore not retrieve that document, even if it would have been relevant to the user. An alternative approach is dense retrieval, which solves these issues by representing the semantic meaning of the query or document as a vector: semantically similar queries and documents are represented by vectors close to each other in a vector space, and vector similarity search can be used to find the most relevant documents for a query. Since the semantic meanings of the words are used, synonyms and paraphrases are handled implicitly. There are several ways to design these representation vectors: using one or several vectors to represent each query or document, changing the dimensionality of the vectors, or changing the span of values in the vectors. Each option brings its own trade-offs in terms of quality of search results, query latency, and index memory footprint. This study experimented with each of these alternatives. Since most previous research in the area has been done in a monolingual, mainly English, context, this study used four different languages to investigate whether the trade-offs differed. In this study, quality, latency, and memory footprint moved in the same direction for all languages: when quality increased, latency and memory footprint increased as well. For the version that used one vector each for the document and the query, decreasing the dimensionality to 128 or 64 gave significant latency improvements without affecting quality. For the larger version, which used 32 vectors for the query and 64 for the document, converting the vector values to binary had no significant effect on quality but greatly reduced the storage size.
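As a rough illustration of the design space this abstract describes, the sketch below contrasts single-vector cosine search with two of the efficiency levers mentioned: lower-dimensional vectors and binarized vectors compared by Hamming distance. The random vectors, the random projection, and the sign-thresholding are illustrative stand-ins, not the trained encoders or reduction methods actually evaluated in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder output: 1,000 documents and one query, each
# represented by a single 768-dimensional float vector (hypothetical
# sizes, chosen for illustration only).
docs = rng.standard_normal((1000, 768)).astype(np.float32)
query = rng.standard_normal(768).astype(np.float32)

def top_k_cosine(q, d, k=10):
    """Single-vector dense retrieval: rank documents by cosine similarity."""
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]

# Lever 1: lower dimensionality. A random projection stands in for an
# encoder trained to emit 128-dimensional vectors; the index shrinks
# 6x and each dot product gets proportionally cheaper.
proj = rng.standard_normal((768, 128)).astype(np.float32) / np.sqrt(128)
docs_128, query_128 = docs @ proj, query @ proj

# Lever 2: binarize the values. Keeping only the sign of each
# dimension cuts storage from 32 bits to 1 bit per dimension and
# replaces the dot product with a Hamming distance.
docs_bin = np.packbits(docs > 0, axis=1)   # 1000 x 96 bytes
query_bin = np.packbits(query > 0)

def top_k_hamming(q_bits, d_bits, k=10):
    """Binary retrieval: rank documents by Hamming distance."""
    dist = np.unpackbits(d_bits ^ q_bits, axis=1).sum(axis=1)
    return np.argsort(dist)[:k]

print(top_k_cosine(query, docs))
print(top_k_cosine(query_128, docs_128))
print(top_k_hamming(query_bin, docs_bin))
```

In the multi-vector versions the thesis studies, each query and document contributes several such vectors, so answering a query involves many more vector comparisons; that is the quality-versus-latency trade-off the abstract refers to.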
2

Zero-shot, One Kill: BERT for Neural Information Retrieval

Efes, Stergios January 2021 (has links)
[Background]: The advent of bidirectional encoder representations from transformers (BERT) language models (Devlin et al., 2018) and of MS Marco, a large-scale human-annotated dataset for machine reading comprehension (Bajaj et al., 2016) that was made publicly available, led the field of information retrieval (IR) to experience a revolution (Lin et al., 2020). The BERT-based retrieval model of Nogueira and Cho (2019) became, at the time of publication, the top entry in the MS Marco passage-reranking leaderboard, surpassing the previous state of the art by 27% in MRR@10. However, training such neural IR models for domains other than MS Marco is still hard, because neural approaches often require a vast amount of training data to perform effectively, which is not always available. To address the shortage of labelled data, a new line of research emerged: training neural models with weak supervision. In weak supervision, labels for an unlabelled dataset are generated automatically by an existing model, and a machine learning model is then trained on the artificial “weak” data. In the case of weak supervision for IR, the training dataset comes in the form of (query, passage) tuples. Dehghani et al. (2017) used the AOL query logs (Pass et al., 2006), a set of millions of real web queries, together with BM25 to retrieve a relevant passage for each user query. A drawback of this approach is that it is hard to obtain query logs for every domain.

[Objective]: This thesis proposes an intuitive approach to the shortage of data in domains with limited or no labelled data: transfer learning in the context of IR. We leverage Wikipedia’s structure to create a generic Wikipedia-based IR training dataset for zero-shot neural models.

[Method]: We create “pseudo-queries” by concatenating the title of a Wikipedia article with each of its section titles, and we take the corresponding section’s passage as the relevant passage for that pseudo-query. All experiments are evaluated on a standard collection: MS Marco, a large-scale web collection. For the zero-shot experiments, our proposed model, called “Wiki”, is a BERT model trained on the artificial Wikipedia-based dataset, and the baseline is a default BERT model without any additional training. In a second line of experiments, we explore the benefits of pre-fine-tuning on the Wikipedia-based IR dataset before fine-tuning on in-domain data. Here our proposed model, “Wiki+Ma”, is a BERT model pre-fine-tuned on the Wikipedia-based dataset and then fine-tuned on MS Marco, while the baseline is a BERT model fine-tuned only on MS Marco.

[Results]: In the first experiments, the “Wiki” model trained on the Wikipedia-based IR dataset achieves 0.197 in MRR@10, about 10 points higher than a BERT model with default weights; in addition, results on the development set indicate that the “Wiki” model outperforms a BERT model trained on in-domain data when that data comprises between 10k and 50k instances. The second line of experiments shows that pre-fine-tuning on the Wikipedia-based IR dataset makes later fine-tuning on in-domain data more stable.
[Conclusion]: Our findings suggest that transfer learning for IR tasks by leveraging the generic knowledge incorporated in Wikipedia is possible, though more experimentation is needed to understand its limitations in comparison with traditional approaches such as BM25.
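A minimal sketch of the pseudo-query construction the [Method] paragraph describes: each article title is concatenated with one of its section titles, and that section’s passage becomes the relevant passage for the pair. The dictionary layout and field names below are assumptions for illustration, not the thesis’s actual data format.

```python
def make_pseudo_queries(article):
    """Yield (pseudo_query, relevant_passage) training pairs from one article."""
    for section in article["sections"]:
        # Pseudo-query = article title + section title.
        pseudo_query = f"{article['title']} {section['title']}"
        # The section's own passage is treated as the relevant passage.
        yield pseudo_query, section["passage"]

# Toy article showing the assumed structure.
article = {
    "title": "Information retrieval",
    "sections": [
        {"title": "History", "passage": "The idea of using computers to search ..."},
        {"title": "Model types", "passage": "Retrieval models assign each document ..."},
    ],
}

for query, passage in make_pseudo_queries(article):
    print(query, "->", passage[:40])
```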
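Since the results above are reported in MRR@10, a short reference implementation of the metric may help. It assumes that, for each query, we already know the 1-based rank at which the first relevant passage appeared, or None if it was outside the top 10.

```python
def mrr_at_10(first_relevant_ranks):
    """Mean reciprocal rank with a cutoff at depth 10.

    `first_relevant_ranks` holds, per query, the 1-based rank of the
    first relevant passage, or None if it was not in the top 10.
    """
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= 10:
            total += 1.0 / rank  # queries with no hit contribute 0
    return total / len(first_relevant_ranks)

# Three queries: hits at rank 1 and rank 4, one miss.
print(mrr_at_10([1, 4, None]))  # (1 + 0.25 + 0) / 3 ≈ 0.417
```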
