1 |
Passage Retrieval : en litteraturstudie av ett forskningsområde inom information retrieval / Passage Retrieval : a study of a research topic in information retrieval. Åkesson, Mattias. January 2000.
The aim of this thesis is to describe passage retrieval (PR), on the basis of results from various empirical experiments, and to critically investigate different approaches to PR. The main questions to be answered are: (1) What characterizes PR? (2) What approaches have been proposed? (3) How well do the approaches work in experimental information retrieval (IR)? PR is a research topic in IR which, instead of retrieving the full text of documents (which can lead to information overload for the user), tries to retrieve the most relevant passages in the documents. The technique was investigated by studying a number of central articles in the research field. PR approaches can be divided into three types based on how the documents are segmented. First, the text can be divided according to its semantics, at the points where the topics change. Second, the text can be divided based on the explicit structure of the documents, with the help of e.g. a markup language like SGML. Third, the text can be divided into parts containing a fixed number of words; this method is called unmotivated segmentation. The study showed that unmotivated segmentation resulted in the best retrieval effectiveness, even though the results are difficult to compare because of differing evaluation methods and test collections. A combination of full-text retrieval and PR also showed improved results. / Thesis level: D
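To make the third approach concrete, here is a minimal sketch of unmotivated (fixed-window) segmentation in Python; the window size, the overlap, and the whitespace tokenization are illustrative assumptions, not parameters taken from the thesis:

```python
def fixed_window_passages(text, window=150, overlap=75):
    """Split a document into fixed-size word windows ("unmotivated" segmentation).

    Overlapping windows reduce the chance of splitting a relevant
    passage across a window boundary. Window and overlap sizes here
    are assumed values for illustration.
    """
    words = text.split()
    step = window - overlap
    passages = []
    for start in range(0, max(len(words) - overlap, 1), step):
        passages.append(" ".join(words[start:start + window]))
    return passages
```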
2 |
Εξατομικευμένη αναζήτηση πληροφορίας με χρήση σημασιολογικών δικτύων / Personalized web search through the use of semantic networks. Ζώτος, Νικόλαος. 15 November 2007.
When searching the web, ambiguous queries often return far too many results. Text snippets, extracted from the retrieved pages, are an indicator of a page's usefulness with respect to the query intention and can be used to focus the scope of the search results. In this work, we propose a novel method for automatically extracting web page snippets that are highly relevant to the query intention and expressive of the pages' entire content. We show that using semantics as the basis for focused retrieval produces high-quality snippet suggestions. The snippets delivered by our method are significantly better, in terms of retrieval performance, than those derived from the pages' statistical content. Furthermore, our study suggests that semantically driven snippet generation can also be used to augment traditional passage retrieval algorithms based on word overlap or statistical weights, since these typically differ in coverage and produce different results. User clicks on query-relevant snippets can then be used to refine the query results and promote the most comprehensive among the relevant documents.
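For contrast with the semantic method proposed above, a minimal sketch of the word-overlap baseline that snippet extraction is typically compared against; the sentence splitting and the scoring scheme are illustrative assumptions, not the thesis's method:

```python
import re

def best_snippet(page_text, query, max_sentences=2):
    """Pick the sentence window with the highest query-term overlap.

    This is the word-overlap baseline that semantic snippet
    extraction is compared against, not the proposed method itself.
    """
    sentences = re.split(r"(?<=[.!?])\s+", page_text)
    q_terms = set(query.lower().split())
    best, best_score = "", -1
    for i in range(len(sentences)):
        window = " ".join(sentences[i:i + max_sentences])
        score = len(q_terms & set(window.lower().split()))
        if score > best_score:
            best, best_score = window, score
    return best
```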
3 |
Zero-shot, One Kill: BERT for Neural Information Retrieval. Efes, Stergios. January 2021.
[Background]: The advent of Bidirectional Encoder Representations from Transformers (BERT) language models (Devlin et al., 2018) and of MS Marco, a large-scale human-annotated dataset for machine reading comprehension (Bajaj et al., 2016) that was made publicly available, led the field of information retrieval (IR) to experience a revolution (Lin et al., 2020). The BERT-based retrieval model of Nogueira and Cho (2019) became, at the time their paper was published, the top entry in the MS Marco passage-reranking leaderboard, surpassing the previous state of the art by 27% in MRR@10. However, training such neural IR models for domains other than MS Marco is still hard, because neural approaches often require a vast amount of training data to perform effectively, which is not always available. To address the shortage of labelled data, a new line of research emerged: training neural models with weak supervision. In weak supervision, labels for an unlabelled dataset are generated automatically using an existing model, and a machine learning model is then trained on the artificial "weak" data. In the case of weak supervision for IR, the training dataset comes in the form of (query, passage) tuples. Dehghani et al. (2017) used the AOL query logs (Pass et al., 2006), a set of millions of real web queries, and BM25 to retrieve the relevant passages for each user query. A drawback of this approach is that query logs are hard to obtain for every different domain.

[Objective]: This thesis proposes an intuitive approach to addressing the shortage of data in domains with limited or no data at all, through transfer learning in the context of IR. We leverage Wikipedia's structure to create a generic Wikipedia-based IR training dataset for zero-shot neural models.

[Method]: We create "pseudo-queries" by concatenating the title of a Wikipedia article with each of its section titles, and we take the associated section's passage as the relevant passage for the pseudo-query. All of our experiments are evaluated on a standard collection: MS Marco, a large-scale web collection. For our zero-shot experiments, our proposed model, called "Wiki", is a BERT model trained on the artificial Wikipedia-based dataset, and the baseline is a default BERT model without any additional training. In our second line of experiments, we explore the benefits of pre-fine-tuning on the Wikipedia-based IR dataset followed by further fine-tuning on in-domain data. Our proposed model, "Wiki+Ma", is a BERT model pre-fine-tuned on the Wikipedia-based dataset and further fine-tuned on MS Marco, while the baseline is a BERT model fine-tuned only on MS Marco.

[Results]: Our first experiments show that "Wiki", the BERT model trained on the Wikipedia-based IR dataset, achieves 0.197 in MRR@10, about 10 points higher than a BERT model with default weights; in addition, results on the development set indicate that the "Wiki" model performs better than a BERT model trained on in-domain data when that data comprises between 10k and 50k instances. Our second line of experiments shows that pre-fine-tuning on the Wikipedia-based IR dataset benefits later fine-tuning on in-domain data in terms of stability.
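For reference, MRR@10, the metric quoted above, is the mean reciprocal rank of the first relevant passage, truncated at rank 10 (this standard definition is supplied here; it is not spelled out in the abstract):

```latex
\mathrm{MRR@10} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q},
\qquad \frac{1}{\mathrm{rank}_q} := 0 \ \text{if no relevant passage appears in the top } 10
```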
[Conclusion]: Our findings suggest that transfer learning for IR tasks, by leveraging the generic knowledge incorporated in Wikipedia, is possible, though more experimentation is needed to understand its limitations in comparison with traditional approaches such as BM25.
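A minimal sketch of the pseudo-query construction described in [Method], assuming articles arrive as parsed dictionaries with a title and a list of sections; the data format is an assumption for illustration, not a real API:

```python
def wiki_pseudo_queries(article):
    """Build (pseudo-query, relevant-passage) pairs from one Wikipedia article.

    Following the method described above: the pseudo-query is the
    article title concatenated with a section title, and the section's
    passage is treated as the relevant passage. The `article` dict
    structure is an assumed parsed-dump format.
    """
    pairs = []
    for section in article["sections"]:
        pseudo_query = f'{article["title"]} {section["title"]}'
        pairs.append((pseudo_query, section["text"]))
    return pairs
```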
4 |
Vers la conception de documents composites : extraction et organisation de l'information pertinente / Towards the design of composite documents : extraction and organization of relevant information. Lamprier, Sylvain. 05 December 2008.
In recent years, the field of information retrieval has expanded to applications that no longer aim solely to help users locate relevant documents, but instead seek to build a synthetic answer that satisfies their information needs. In this context, this thesis focuses on producing an entity, called a composite document, that gives an overview of the different types of information related to the query that the user can find in the queried corpus. After examining how to extract and select the text fragments to include in this composite document, the study led us to design a multi-objective algorithm that searches for the subset of thematic segments jointly maximizing a criterion of proximity to the query and a criterion of representativeness of the topics covered by the documents considered. Beyond the design of the composite document, which is the central objective of this thesis, the contributions concern document segmentation and its evaluation, text relevance and similarity measures, the impact that individualizing topics can have on information retrieval, the evaluation of systems that cluster their results, and finally, taking the query into account in the clustering process.
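A minimal sketch of one way the joint maximization described above could be approximated, using a weighted greedy heuristic; the scoring functions, the weight alpha, and the selection budget are illustrative assumptions, not the thesis's actual multi-objective algorithm:

```python
def select_segments(segments, query_sim, topic_coverage, k=5, alpha=0.5):
    """Greedily pick k segments trading off query proximity and topic coverage.

    query_sim(seg)           -> similarity of a segment to the query
    topic_coverage(seg, sel) -> how much new topical material seg adds
                                given the segments already selected
    alpha                    -> assumed weight between the two criteria
    """
    selected = []
    candidates = list(segments)
    while candidates and len(selected) < k:
        best = max(candidates,
                   key=lambda s: alpha * query_sim(s)
                                 + (1 - alpha) * topic_coverage(s, selected))
        selected.append(best)
        candidates.remove(best)
    return selected
```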
5 |
Distilling Multilingual Transformer Models for Efficient Document Retrieval : Distilling multi-Transformer models with distillation losses involving multi-Transformer interactions / Destillering av flerspråkiga transformatormodeller för effektiv dokumentsökning : Destillering av modeller med flera transformatorer med destilleringsförluster som involverar interaktioner mellan flera transformatorer. Liu, Xuecong. January 2022.
Open Domain Question Answering (OpenQA) is the task of automatically finding answers to a query in a given set of documents. Language-agnostic OpenQA, where the answers can be in a different language from the question, is an increasingly important research area in a globalised world. An OpenQA system generally consists of a document retriever, which retrieves relevant passages, and a reader, which extracts answers from those passages. Large Transformers, such as Dense Passage Retrieval (DPR) models, have achieved state-of-the-art performance in document retrieval, but they are computationally expensive in production. Knowledge Distillation (KD) is an effective way to reduce the size and increase the speed of Transformers while retaining their performance. However, most existing research focuses on distilling single Transformer models rather than multi-Transformer models such as DPR. This thesis uses the MiniLM and DistilBERT distillation methods, two of the most successful methods for distilling the BERT model, to individually distil the passage and query models of a fine-tuned DPR model comprised of two pretrained MPNet models. In addition, it proposes and tests Embedding Similarity Loss (ESL), a distillation loss designed for the interaction between the passage and query models in the DPR architecture. The results show that using ESL yields better students than using the MiniLM or DistilBERT loss alone, and that combining ESL with either of the other two losses increases the student models' performance in most cases, especially when training on Information-Seeking Question Answering in Typologically Diverse Languages (TyDi QA) instead of the Stanford Question Answering Dataset 1.1 (SQuAD 1.1). The best resulting 6-layer student DPR model retained more than 90% of the recall and Mean Average Precision (MAP) on Cross-Lingual Transfer (XLT) tasks while reducing the inference time to 63.2%. On Generalised Cross-Lingual Transfer (G-XLT) tasks, it retained only around 42% of the recall and MAP, using 53.8% of the inference time.
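A minimal sketch of what a loss over the interaction between the two DPR encoders could look like, assuming teacher and student query/passage embeddings are available in PyTorch; this is an interpretation of the ESL idea described above, not the thesis's exact formulation:

```python
import torch
import torch.nn.functional as F

def embedding_similarity_loss(t_query, t_passage, s_query, s_passage):
    """Match the student's query-passage similarity matrix to the teacher's.

    t_query, t_passage: teacher query/passage embeddings, shape (batch, dim)
    s_query, s_passage: student query/passage embeddings, shape (batch, dim)
    Using MSE between the two dot-product matrices is an assumed choice.
    """
    teacher_sim = t_query @ t_passage.T   # (batch, batch) similarity matrix
    student_sim = s_query @ s_passage.T
    return F.mse_loss(student_sim, teacher_sim)
```

Because the loss is computed on the query-passage similarity matrix rather than on each encoder separately, it couples the two student encoders during distillation, which is the multi-Transformer interaction the thesis title refers to.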
6 |
An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. Tsatsaronis, George. 10 October 2017.
This article provides an overview of the first BioASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BioASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies.