31
Automated Vocabulary Building for Characterizing and Forecasting Elections using Social Media Analytics. Mahendiran, Aravindan, 12 February 2014 (has links)
Twitter has become a popular data source over the past decade and has garnered a significant amount of attention as a surrogate data source for many important forecasting problems. Strong correlations have been observed between Twitter indicators and real-world trends spanning elections, stock markets, book sales, and flu outbreaks. A key ingredient of all methods that use Twitter for forecasting is agreeing on a domain-specific vocabulary to track the pertinent tweets, which is typically provided by subject matter experts (SMEs). The language used on Twitter differs drastically from other forms of online discourse, such as news articles and blogs, and it constantly evolves over time as users adopt popular hashtags to express their opinions. Thus, the vocabulary used by forecasting algorithms needs to be dynamic in nature and should capture emerging trends over time. This thesis proposes a novel unsupervised learning algorithm that builds a dynamic vocabulary using Probabilistic Soft Logic (PSL), a framework for probabilistic reasoning over relational domains. Using eight presidential elections from Latin America, we show how our query expansion methodology improves the performance of traditional election forecasting algorithms. Through this approach we achieve close to a two-fold increase in the number of tweets retrieved for predictions and a 36.90% reduction in prediction error. / Master of Science
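As a concrete but deliberately simplified illustration of the dynamic-vocabulary idea described in this abstract, the sketch below grows a seed vocabulary from hashtags that co-occur with it in matching tweets. It is not the PSL model from the thesis; the tweet data, the function name expand_vocabulary, and the thresholds are assumptions made for the example.

```python
# Minimal sketch of iterative, co-occurrence-based vocabulary expansion.
# This is NOT the PSL model from the thesis; it only illustrates the idea of
# growing a seed query vocabulary from hashtags co-occurring with it in tweets.
from collections import Counter

def expand_vocabulary(tweets, seed_terms, rounds=2, top_k=5):
    """tweets: iterable of token lists; seed_terms: initial query vocabulary."""
    vocab = set(t.lower() for t in seed_terms)
    for _ in range(rounds):
        cooc = Counter()
        for tokens in tweets:
            tokens = [t.lower() for t in tokens]
            if vocab & set(tokens):                      # tweet matches current vocabulary
                for tok in tokens:
                    if tok.startswith("#") and tok not in vocab:
                        cooc[tok] += 1                   # count candidate hashtags
        vocab.update(tag for tag, _ in cooc.most_common(top_k))
    return vocab

tweets = [["#eleccion2014", "vota", "#candidatoX"],
          ["#candidatoX", "gana", "el", "debate"],
          ["resultados", "#eleccion2014"]]
print(expand_vocabulary(tweets, ["#eleccion2014"]))
```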
32
Contribution à l'analyse et l'évaluation des requêtes expertes : cas du domaine médical / Contribution to the analysis and evaluation of clinical queries: medical domain. Znaidi, Eya, 30 June 2016 (has links)
La recherche d'information nécessite la mise en place de stratégies qui consistent à (1) cerner le besoin d'information ; (2) formuler le besoin d'information ; (3) repérer les sources pertinentes ; (4) identifier les outils à exploiter en fonction de ces sources ; (5) interroger les outils ; et (6) évaluer la qualité des résultats. Ce domaine n'a cessé d'évoluer pour présenter des techniques et des approches permettant de sélectionner à partir d'un corpus de documents l'information pertinente capable de satisfaire le besoin exprimé par l'utilisateur. De plus, dans le contexte applicatif du domaine de la RI biomédicale, les sources d'information hétérogènes sont en constante évolution, aussi bien du point de vue de la structure que du contenu. De même, les besoins en information peuvent être exprimés par des utilisateurs qui se caractérisent par différents profils, à savoir : les experts médicaux comme les praticiens, les cliniciens et les professionnels de santé, les utilisateurs néophytes (sans aucune expertise ou connaissance du domaine) comme les patients et leurs familles, etc. Plusieurs défis sont liés à la tâche de la RI biomédicale, à savoir : (1) la variation et la diversité du besoin en information, (2) différents types de connaissances médicales, (3) différences de compétences linguistiques entre experts et néophytes, (4) la quantité importante de la littérature médicale ; et (5) la nature de la tâche de RI médicale. Cela implique une difficulté d'accéder à l'information pertinente spécifique au contexte de la recherche, spécialement pour les experts du domaine qui les aideraient dans leur prise de décision médicale. Nos travaux de thèse s'inscrivent dans le domaine de la RI biomédicale et traitent les défis de la formulation du besoin en information experte et l'identification des sources pertinentes pour mieux répondre aux besoins cliniques. Concernant le volet de la formulation et l'analyse de requêtes expertes, nous proposons des analyses exploratoires sur des attributs de requêtes, que nous avons définis, formalisés et calculés, à savoir : (1) deux attributs de longueur en nombre de termes et en nombre de concepts, (2) deux facettes de spécificité terme-document et hiérarchique, (3) clarté de la requête basée sur la pertinence et celle basée sur le sujet de la requête. Nous avons proposé des études et analyses statistiques sur des collections issues de différentes campagnes d'évaluation médicales CLEF et TREC, afin de prendre en compte les différentes tâches de RI. Après les analyses descriptives, nous avons étudié d'une part, les corrélations par paires d'attributs de requêtes et les analyses de corrélation multidimensionnelle. Nous avons étudié l'impact de ces corrélations sur les performances de recherche d'autre part. Nous avons pu ainsi comparer et caractériser les différentes requêtes selon la tâche médicale d'une manière plus généralisable. Concernant le volet lié à l'accès à l'information, nous proposons des techniques d'appariement et d'expansion sémantiques de requêtes dans le cadre de la RI basée sur les preuves cliniques. / The research topic of this document deals with a particular setting of medical information retrieval (IR), referred to as expert-based information retrieval. We were interested in information needs expressed by medical domain experts such as practitioners, physicians, etc.
It is well known in the information retrieval (IR) area that expressing queries that accurately reflect the information needs is a difficult task, in both general and specialized domains, even for expert users. Thus, the identification of the users' intention hidden behind the queries that they submit to a search engine is a challenging issue. Moreover, the increasing amount of health information available from various sources such as government agencies, non-profit and for-profit organizations, internet portals etc. presents opportunities and issues for improving health care information delivery for medical professionals, patients and the general public. One critical issue is the understanding of users' search strategies and tactics for bridging the gap between their intention and the delivered information. In this thesis, we focus, more particularly, on two main aspects of medical information needs dealing with expertise, which consist of two parts, namely: - Understanding the users' intents behind queries is critically important to gain better insight into how to select relevant results. While many studies investigated how users in general carry out exploratory health searches in digital environments, few focused on how queries are formulated, specifically by domain expert users. We address more specifically domain expert health search through the analysis of query attributes, namely length, specificity and clarity, using appropriate proposed measures built according to different sources of evidence. In this respect, we undertake an in-depth statistical analysis of queries issued from IR evaluation campaigns, namely the Text REtrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF), devoted to different medical tasks within controlled evaluation settings. - We address the issue of answering PICO (Population, Intervention, Comparison and Outcome) clinical queries formulated within the Evidence Based Medicine framework. The contributions of this part include (1) a new algorithm for query elicitation based on the semantic mapping of each facet of the query to a reference terminology, and (2) a new document ranking model based on a prioritized aggregation operator. We tackle the issue related to the retrieval of the best evidence that fits a PICO question, which is an underexplored research area. We propose a new document ranking algorithm that relies on semantic-based query expansion leveraged by each question facet. The expansion is moreover bounded by the local search context to better discard irrelevant documents. The experimental evaluation carried out on the CLIREC dataset shows the benefit of our approaches.
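The query attributes mentioned above (length, specificity, clarity) can be illustrated with a small sketch. The snippet below computes query length in terms and a simplified clarity score as the KL divergence between a language model of pseudo-relevant documents and the collection model, in the spirit of the classical clarity measure; the thesis's exact relevance-based and topic-based clarity variants are not reproduced, and the function names and toy medical data are assumptions.

```python
# Minimal sketch of two query attributes: length in terms and a simplified
# "clarity" score computed as the KL divergence between a language model of
# pseudo-relevant documents and the collection model (illustrative only).
import math
from collections import Counter

def language_model(token_lists):
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def clarity(feedback_docs, collection_docs, eps=1e-9):
    p_q = language_model(feedback_docs)          # model of the query's feedback documents
    p_c = language_model(collection_docs)        # background collection model
    return sum(p * math.log2(p / p_c.get(w, eps)) for w, p in p_q.items())

query = "metformin therapy type 2 diabetes".split()
collection = ["patients with type 2 diabetes received metformin".split(),
              "asthma inhaler dosage in children".split(),
              "metformin lowers blood glucose in diabetes".split()]
feedback = [collection[0], collection[2]]        # pretend these were retrieved for the query
print("length:", len(query), "clarity:", round(clarity(feedback, collection), 3))
```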
33
Systematisierung und Evaluierung von Clustering-Verfahren im Information Retrieval. Kürsten, Jens, 04 December 2006 (has links) (PDF)
Im Rahmen der vorliegenden Diplomarbeit werden Verfahren zur Clusteranalyse sowie deren Anwendungsmöglichkeiten zur Optimierung der Rechercheergebnisse von Information Retrievalsystemen untersucht. Die Grundlage der vergleichenden Evaluation erfolgversprechender Ansätze zur Clusteranalyse anhand der Domain Specific Monolingual Tasks des Cross-Language Evaluation Forums 2006 bildet die systematische Analyse der in der Forschung etablierten Verfahren zur Clusteranalyse.
Die Implementierung ausgewählter Clusterverfahren wird innerhalb eines bestehenden, Lucene-basierten Retrievalsystems durchgeführt. Zusätzlich wird dieses System im Rahmen dieser Arbeit mit Komponenten zur Query Expansion und zur Datenfusion ausgestattet. Diese beiden Ansätze haben sich in der Forschung zur automatischen Optimierung von Retrievalergebnissen durchgesetzt und bilden daher die Bewertungsgrundlage für die implementierten Konzepte zur Optimierung von Rechercheergebnissen auf Basis der Clusteranalyse.
Im Ergebnis erweist sich das lokale Dokument Clustering auf Basis des k-means Clustering-Algorithmus in Kombination mit dem Pseudo-Relevanz-Feedback Ansatz zur Selektion der Dokumente für die Query Expansion als besonders erfolgversprechend. Darüber hinaus wird gezeigt, dass mit Hilfe der Datenfusion auf Basis des Z-Score Operators die Ergebnisse verschiedener Indizierungsverfahren so kombiniert werden können, dass sehr gute und insbesondere sehr robuste Rechercheergebnisse erreicht werden. / Within the present diploma thesis, widely used Cluster Analysis approaches are studied with respect to their application for optimizing the results of Information Retrieval systems. A systematic analysis of established Cluster Analysis methods forms the basis of the comparative evaluation of promising approaches that use Cluster Analysis to optimize retrieval results. The evaluation is carried out through participation in the Domain Specific Monolingual Tasks of the Cross-Language Evaluation Forum 2006.
The selected Clustering approaches are implemented within the framework of an existing Lucene-based retrieval system. Within the scope of this work, the system is additionally equipped with components for Query Expansion and Data Fusion. Both approaches are well established in research on the automatic optimization of retrieval results and therefore serve as the baseline for assessing the implemented, Cluster-Analysis-based methods for improving retrieval results.
The results show that selecting documents for Query Expansion via local Document Clustering, based on the k-means algorithm combined with the Blind Feedback approach, is particularly promising. Furthermore, Data Fusion based on the Z-Score operator proves very useful for combining the retrieval results of different indexing methods, achieving very good and, in particular, very robust retrieval results.
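As an illustration of the Z-Score fusion step described above, the following sketch standardises the scores of each retrieval run and sums them per document. It is a minimal stand-in for, not a reproduction of, the system built in the thesis; the run data and the function name zscore_fusion are assumptions.

```python
# Minimal sketch of Z-score data fusion: scores from different indexing runs
# are standardised per run and then summed per document (illustrative only).
from statistics import mean, pstdev
from collections import defaultdict

def zscore_fusion(runs):
    """runs: list of {doc_id: score} dicts, one per retrieval run."""
    fused = defaultdict(float)
    for run in runs:
        mu, sigma = mean(run.values()), pstdev(run.values()) or 1.0
        for doc, score in run.items():
            fused[doc] += (score - mu) / sigma        # standardise, then accumulate
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

run_stemmed = {"d1": 12.4, "d2": 9.1, "d3": 3.2}
run_ngram   = {"d1": 0.71, "d3": 0.64, "d4": 0.20}
print(zscore_fusion([run_stemmed, run_ngram]))
```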
34
Short text contextualization in information retrieval: application to tweet contextualization and automatic query expansion / Contextualisation de textes courts pour la recherche d'information : application à la contextualisation de tweets et à l'expansion automatique de requêtes. Ermakova, Liana, 31 March 2016 (has links)
La communication efficace a tendance à suivre la loi du moindre effort. Selon ce principe, en utilisant une langue donnée les interlocuteurs ne veulent pas travailler plus que nécessaire pour être compris. Ce fait mène à la compression extrême de textes surtout dans la communication électronique, comme dans les microblogues, SMS, ou les requêtes dans les moteurs de recherche. Cependant souvent ces textes ne sont pas auto-suffisants car pour les comprendre, il est nécessaire d'avoir des connaissances sur la terminologie, les entités nommées ou les faits liés. Ainsi, la tâche principale de la recherche présentée dans ce mémoire de thèse de doctorat est de fournir le contexte d'un texte court à l'utilisateur ou au système comme à un moteur de recherche par exemple. Le premier objectif de notre travail est d'aider l'utilisateur à mieux comprendre un message court par l'extraction du contexte d'une source externe comme le Web ou la Wikipédia au moyen de résumés construits automatiquement. Pour cela nous proposons une approche pour le résumé automatique de documents multiples et nous l'appliquons à la contextualisation de messages, notamment à la contextualisation de tweets. La méthode que nous proposons est basée sur la reconnaissance des entités nommées, la pondération des parties du discours et la mesure de la qualité des phrases. Contrairement aux travaux précédents, nous introduisons un algorithme de lissage en fonction du contexte local. Notre approche s'appuie sur la structure thème-rhème des textes. De plus, nous avons développé un algorithme basé sur les graphes pour le ré-ordonnancement des phrases. La méthode a été évaluée à la tâche INEX/CLEF Tweet Contextualization sur une période de 4 ans. La méthode a été également adaptée pour la génération de snippets. Les résultats des évaluations attestent une bonne performance de notre approche. / Efficient communication tends to follow the principle of least effort. According to this principle, when using a given language, interlocutors do not want to work any harder than necessary to reach understanding. This fact leads to the extreme compression of texts, especially in electronic communication, e.g. microblogs, SMS, and search queries. However, sometimes these texts are not self-contained and need to be explained since understanding them requires knowledge of terminology, named entities or related facts. The main goal of this research is to provide a context to a user or a system from a textual resource. The first aim of this work is to help a user to better understand a short message by extracting a context from an external source like a text collection, the Web or Wikipedia by means of text summarization. To this end we developed an approach for automatic multi-document summarization and applied it to short message contextualization, in particular to tweet contextualization. The proposed method is based on named entity recognition, part-of-speech weighting and sentence quality measuring. In contrast to previous research, we introduced an algorithm for smoothing from the local context. Our approach exploits the topic-comment structure of a text. Moreover, we developed a graph-based algorithm for sentence reordering. The method has been evaluated at the INEX/CLEF Tweet Contextualization track. We provide the evaluation results over the 4 years of the track. The method was also adapted to snippet retrieval. The evaluation results indicate the good performance of the approach.
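The sentence-scoring idea behind the contextualization approach can be sketched as follows: candidate sentences from an external source are ranked by weighted overlap with the tweet, with a crude boost for capitalised tokens standing in for the named-entity recognition and part-of-speech weighting of the actual method. The scoring formula, example tweet, and candidate sentences are illustrative assumptions, not the thesis's implementation.

```python
# Minimal sketch of sentence scoring for tweet contextualization:
# weighted term overlap with the tweet plus a crude "named entity" boost.
def score_sentence(tweet_tokens, sentence):
    tweet = {t.lstrip("#").lower() for t in tweet_tokens}
    score = 0.0
    for tok in sentence.split():
        w = tok.strip(".,;:!?")
        if w.lower() in tweet:
            score += 2.0 if w[:1].isupper() else 1.0   # crude NE/POS weighting
    return score / (len(sentence.split()) ** 0.5)      # mild length normalisation

tweet = ["#Rosetta", "lands", "on", "comet"]
candidates = [
    "Rosetta is a space probe built by the European Space Agency.",
    "The comet was discovered in 1969.",
    "Rosetta deployed its lander Philae to the surface of the comet.",
]
for s in sorted(candidates, key=lambda s: score_sentence(tweet, s), reverse=True):
    print(round(score_sentence(tweet, s), 2), s)
```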
35
Relevance Analysis for Document Retrieval. Labouve, Eric, 01 March 2019 (has links) (PDF)
Document retrieval systems recover documents from a dataset and order them according to their perceived relevance to a user’s search query. This is a difficult task for machines to accomplish because there exists a semantic gap between the meaning of the terms in a user’s literal query and a user’s true intentions. Even with this ambiguity that arises with a lack of context, users still expect that the set of documents returned by a search engine is both highly relevant to their query and properly ordered. The focus of this thesis is on document retrieval systems that explore methods of ordering documents from unstructured, textual corpora using text queries. The main goal of this study is to enhance the Okapi BM25 document retrieval model. In doing so, this research hypothesizes that the structure of text inside documents and queries holds valuable semantic information that can be incorporated into the Okapi BM25 model to increase its performance. Modifications that account for a term’s part of speech, the proximity between a pair of related terms, the proximity of a term with respect to its location in a document, and query expansion are used to augment Okapi BM25 to increase the model’s performance. The study resulted in 87 modifications which were all validated using open source corpora. The top scoring modification from the validation phase was then tested on the Lisa corpus, where the model performed 10.25% better than Okapi BM25 when evaluated with mean average precision. When compared against two industry standard search engines, Lucene and Solr, the top scoring modification outperforms these systems by up to 21.78% and 23.01%, respectively.
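For reference, the sketch below shows plain Okapi BM25 scoring plus one hypothetical proximity-style modification of the kind the thesis explores. It does not reproduce any of the 87 modifications evaluated in the study; the boost term, parameters, and toy corpus are assumptions.

```python
# Minimal sketch of Okapi BM25 with an illustrative proximity boost:
# documents where query terms occur close together score slightly higher.
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75, proximity_bonus=0.5):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    # hypothetical proximity boost based on the smallest gap between query terms
    positions = [i for i, w in enumerate(doc) if w in query]
    if len(positions) >= 2:
        min_gap = min(p2 - p1 for p1, p2 in zip(positions, positions[1:]))
        score += proximity_bonus / (1 + min_gap)
    return score

docs = ["okapi bm25 ranking function for document retrieval".split(),
        "term proximity can improve ranking of retrieved documents".split(),
        "an unrelated document about cooking pasta".split()]
query = "bm25 ranking".split()
for d in docs:
    print(round(bm25_score(query, d, docs), 3), " ".join(d[:5]), "...")
```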
36
Extracting and exploiting word relationships for information retrieval. Cao, Guihong, January 2008 (has links)
Thesis digitized by the Division de la gestion de documents et des archives of the Université de Montréal.
37
APPLYING ENTERPRISE MODELS AS INTERFACE FOR INFORMATION SEARCHING. MATONGO, Tanguy; DEGBELO, Auriol, January 2009 (has links)
Nowadays, more and more companies use Enterprise Models to integrate and coordinate their business processes with the aim of remaining competitive on the market. Consequently, Enterprise Models play a critical role in this integration, making it possible to improve the objectives of the enterprise and the ways to reach them in a given period of time. Through Enterprise Models, companies are able to improve the management of their operations, actors and processes, and also to improve communication within the organisation.

This thesis describes another use of Enterprise Models. In this work, we intend to apply Enterprise Models as an interface for information searching. The underlying need for this project lies in the fact that we would like to show that Enterprise Models can be more than just models: they can be used in a more dynamic way, namely through a software program for information searching. The software program aims, first, at extracting the information contained in the Enterprise Models (which are stored in an XML file on the system). Once the information is extracted, it is used to express a query which is sent to a search engine to retrieve documents relevant to the query and return them to the user.

The thesis was carried out over an entire academic semester. The results of this work are a report which summarizes all the knowledge gained in the field of study, and a software program built as a proof of concept to test the theories.
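A minimal sketch of the described pipeline, extracting terms from an Enterprise Model stored as XML and turning them into a search query, is given below. The XML structure, tag names, and the function model_to_query are hypothetical and not taken from the actual software described in the thesis.

```python
# Minimal sketch: pull labels out of an Enterprise Model stored as XML and
# turn them into a search-engine query string (hypothetical XML structure).
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

MODEL_XML = """
<enterpriseModel>
  <process name="Order handling">
    <activity label="invoice processing"/>
    <activity label="customer credit check"/>
  </process>
</enterpriseModel>
"""

def model_to_query(xml_text):
    root = ET.fromstring(xml_text)
    terms = []
    for elem in root.iter():
        for attr in ("name", "label"):
            if attr in elem.attrib:
                terms.append(elem.attrib[attr])
    return " ".join(terms)

query = model_to_query(MODEL_XML)
print(query)                                   # the extracted query text
print("?" + urlencode({"q": query}))           # e.g. appended to a search-engine URL
```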
38
Αυτόματη επιλογή σημασιολογικά συγγενών όρων για την επαναδιατύπωση των ερωτημάτων σε μηχανές αναζήτησης πληροφορίας / Automatic selection of semantically related terms for reformulating a query in a search engine. Κοζανίδης, Ελευθέριος, 14 September 2007 (links)
Η βελτίωση ερωτημάτων (Query refinement) είναι η διαδικασία πρότασης εναλλακτικών όρων στους χρήστες των μηχανών αναζήτησης του Διαδικτύου για την διατύπωση της πληροφοριακής τους ανάγκης. Παρόλο που εναλλακτικοί σχηματισμοί ερωτημάτων μπορούν να συνεισφέρουν στην βελτίωση των ανακτηθέντων αποτελεσμάτων, η χρησιμοποίησή τους από χρήστες του Διαδικτύου είναι ιδιαίτερα περιορισμένη καθώς οι όροι των βελτιωμένων ερωτημάτων δεν περιέχουν σχεδόν καθόλου πληροφορία αναφορικά με τον βαθμό ομοιότητάς τους με τους όρους του αρχικού ερωτήματος, ενώ συγχρόνως δεν καταδεικνύουν το βαθμό συσχέτισής τους με τα πληροφοριακά ενδιαφέροντα των χρηστών. Παραδοσιακά, οι εναλλακτικοί σχηματισμοί ερωτημάτων καθορίζονται κατ’ αποκλειστικότητα από τη σημασιολογική σχέση που επιδεικνύουν οι συμπληρωματικοί όροι με τους αρχικούς όρους του ερωτήματος, χωρίς να λαμβάνουν υπόψη τον επιδιωκόμενο στόχο της αναζήτησης που υπολανθάνει πίσω από ένα ερώτημα του χρήστη. Στην παρούσα εργασία θα παρουσιάσουμε μια πρότυπη τεχνική βελτίωσης ερωτημάτων η οποία χρησιμοποιεί μια λεξική οντολογία προκειμένου να εντοπίσει εναλλακτικούς σχηματισμούς ερωτημάτων οι οποίοι αφενός, θα περιγράφουν το αντικείμενο της αναζήτησης του χρήστη και αφετέρου θα σχετίζονται με τα ερωτήματα που υπέβαλε ο χρήστης. Το πιο πρωτοποριακό χαρακτηριστικό της τεχνικής μας είναι η οπτική αναπαράσταση του εναλλακτικού ερωτήματος με την μορφή ενός ιεραρχικά δομημένου γράφου. Η αναπαράσταση αυτή παρέχει σαφείς πληροφορίες για την σημασιολογική σχέση μεταξύ των όρων του βελτιωμένου ερωτήματος και των όρων που χρησιμοποίησε ο χρήστης για να εκφράσει την πληροφοριακή του ανάγκη ενώ παράλληλα παρέχει την δυνατότητα στον χρήστη να επιλέξει ποιοι από τους υποψήφιους όρους θα συμμετέχουν τελικά στην διαδικασία βελτιστοποίησης δημιουργώντας διαδραστικά το νέο ερώτημα. Τα αποτελέσματα των πειραμάτων που διενεργήσαμε για να αξιολογήσουμε την απόδοση της τεχνικής μας, είναι ιδιαίτερα ικανοποιητικά και μας οδηγούν στο συμπέρασμα ότι η μέθοδός μας μπορεί να βοηθήσει σημαντικά στη διευκόλυνση του χρήστη κατά τη διαδικασία επιλογής ερωτημάτων για την ανάκτηση πληροφορίας από τα δεδομένα του Παγκόσμιου Ιστού. / Query refinement is the process of providing Web information seekers with alternative wordings for expressing their information needs. Although alternative query formulations may contribute to the improvement of retrieval results, their realization by Web users is intrinsically limited in that alternative query wordings do not convey explicit information about either their degree or their type of correlation to the user-issued queries. Moreover, alternative query formulations are determined based on the semantics of the issued query alone and do not consider anything about the search intentions of the user issuing that query. In this work, we introduce a novel query refinement technique which uses a lexical ontology for identifying alternative query formulations that are both informative of the user's interests and related to the user-selected queries. The most innovative feature of our technique is the visualization of the alternative query wordings in a graphical representation form, which conveys explicit information about the refined queries' correlation to the user-issued requests and which allows the user to select which terms will participate in the refinement process. Experimental results demonstrate that our method has significant potential in improving the user search experience.
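A rough sketch of ontology-based refinement in the spirit of this approach is shown below, using WordNet (via NLTK) as the lexical ontology. The thesis's own ontology, hierarchical graph visualisation, and interactive term selection are not reproduced; the relation set and the function name refinement_candidates are assumptions.

```python
# Minimal sketch of lexical-ontology-based query refinement using WordNet.
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def refinement_candidates(term, max_per_relation=3):
    """Return candidate expansion terms labelled with their semantic relation."""
    candidates = []
    for syn in wn.synsets(term)[:2]:                       # top senses only
        for rel, related in (("synonym", [syn]),
                             ("hypernym", syn.hypernyms()),
                             ("hyponym", syn.hyponyms())):
            for s in related[:max_per_relation]:
                for lemma in s.lemma_names()[:max_per_relation]:
                    if lemma.lower() != term.lower():
                        candidates.append((lemma.replace("_", " "), rel))
    return candidates

for word, relation in refinement_candidates("query"):
    print(f"{word:20s} {relation}")
```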
39
Automatic Concept-Based Query Expansion Using Term Relational Pathways Built from a Collection-Specific Association Thesaurus. Lyall-Wilson, Jennifer Rae, January 2013 (has links)
The dissertation research explores an approach to automatic concept-based query expansion to improve search engine performance. It uses a network-based approach for identifying the concept represented by the user's query and is founded on the idea that a collection-specific association thesaurus can be used to create a reasonable representation of all the concepts within the document collection as well as the relationships these concepts have to one another. Because the representation is generated using data from the association thesaurus, a mapping will exist between the representation of the concepts and the terms used to describe these concepts. The research applies to search engines designed for use on an individual website with content focused on a specific conceptual domain. Therefore, both the document collection and the subject content must be well-bounded, which affords the ability to make use of techniques not currently feasible for general-purpose search engines used on the entire web.
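The sketch below illustrates the general idea of an association thesaurus and term relational pathways: terms that co-occur in documents are linked, and a short path between two terms serves as a relational pathway. The construction details, scoring, and concept representation of the dissertation itself are not reproduced; the toy documents and function names are assumptions.

```python
# Minimal sketch: build a collection-specific association thesaurus from term
# co-occurrence and find a short "relational pathway" between two terms.
from collections import defaultdict, deque

def build_thesaurus(docs):
    graph = defaultdict(set)
    for tokens in docs:
        uniq = set(tokens)
        for a in uniq:
            for b in uniq:
                if a != b:
                    graph[a].add(b)        # terms co-occurring in a document are associated
    return graph

def pathway(graph, start, goal, max_hops=3):
    """Breadth-first search for a short relational pathway between two terms."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) > max_hops:
            continue
        for nxt in graph[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

docs = ["solar panel installation cost".split(),
        "photovoltaic panel efficiency".split(),
        "photovoltaic cell manufacturing".split()]
thesaurus = build_thesaurus(docs)
print(pathway(thesaurus, "solar", "cell"))     # e.g. ['solar', 'panel', 'photovoltaic', 'cell']
```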
40
Extracting and exploiting word relationships for information retrieval. Cao, Guihong, January 2008 (has links)
Thesis digitized by the Division de la gestion de documents et des archives of the Université de Montréal.