1

Ontology Learning from Query Logs of Search Engines

陳茂富, Unknown Date
Ontologies can be used to organize, manage, and share knowledge. Ontology engineering is the process of constructing an ontology; much of that process is time-consuming, labor-intensive manual work, so machine support for ontology engineering has become an important research topic. Applying knowledge discovery methods to support ontology construction is called ontology learning. The ontology learning method proposed in this thesis analyzes the behavior of users issuing keyword queries to search engines, together with information from web pages related to those query terms, to assist in ontology construction. The ontology is composed of the users' query terms, and what we learn are the relations among those terms, including hypernymy, hyponymy, and synonymy; automatically discovering these relations to support ontology construction is the goal of this thesis. In addition, we implemented a complete ontology learning system: from the initial collection of user query logs, through keyword extraction and analysis and the determination of relations between keywords, to the final generation of the ontology, every step is performed automatically by the system.
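The abstract stops short of the extraction procedure itself, but the core idea, inferring relations between query terms from user click behavior, can be illustrated. The Python sketch below uses a simple click-overlap heuristic: terms whose clicked-URL sets largely coincide are synonym candidates, and a term whose clicks contain most of another's is a hypernym candidate. The data, thresholds, and function names are invented for illustration; this is one plausible heuristic, not the thesis's actual algorithm.

```python
from collections import defaultdict

# Toy query log: (query_term, clicked_url) pairs. In the thesis the log is
# collected automatically; the pairs and thresholds here are illustrative.
log = [
    ("dog", "u1"), ("dog", "u2"), ("dog", "u3"), ("dog", "u4"),
    ("poodle", "u2"), ("poodle", "u3"),
    ("canine", "u1"), ("canine", "u2"), ("canine", "u3"), ("canine", "u4"),
]

clicks = defaultdict(set)
for term, url in log:
    clicks[term].add(url)

def relate(a, b, syn_thresh=0.8, hyper_thresh=0.8):
    """Guess the relation between terms a and b from click-set overlap."""
    A, B = clicks[a], clicks[b]
    inter = len(A & B)
    jaccard = inter / len(A | B)
    if jaccard >= syn_thresh:
        return "synonym"
    # If most of b's clicks fall inside a's larger click set,
    # treat a as the broader (hypernym) term.
    if inter / len(B) >= hyper_thresh and len(A) > len(B):
        return f"{a} hypernym-of {b}"
    if inter / len(A) >= hyper_thresh and len(B) > len(A):
        return f"{b} hypernym-of {a}"
    return "unrelated"

for pair in [("dog", "canine"), ("dog", "poodle"), ("poodle", "canine")]:
    print(pair, "->", relate(*pair))
```

On real logs the click sets would be far noisier, so frequency cutoffs and smoothing would be needed before thresholds like these become meaningful.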
2

Usage-driven unified model for user profile and data source profile extraction

Limam, Lyes, 24 June 2014
This thesis addresses a problem of usage analysis in information retrieval systems. We exploit the history of search queries as the material from which a profile model is extracted. The objective is to characterize the users and the data sources that interact in a system, so as to allow different types of comparison (user-to-user, source-to-source, and user-to-source). From a study of existing work on profile models, we concluded that the large majority of contributions are strongly tied to the applications for which they were proposed. As a result, the proposed profile models are not reusable and suffer from several weaknesses: they do not take the data source into account, they lack semantic processing mechanisms, and they do not address scalability (in terms of complexity). We therefore propose in this thesis a model of user and data source profiles based on usage analysis. The model has the following characteristics. First, it is generic, able to represent both a user and a data source. Second, it builds profiles implicitly from histories of search queries. Third, it defines a profile as a set of topics of interest, each topic corresponding to a semantic cluster of keywords identified by a dedicated clustering algorithm. Finally, profiles are represented in a vector space. The components of the model are organized as a framework, and the complexity of each component is assessed. The framework provides:
- a method for disambiguating keyword queries;
- a method for semantically representing search query logs in the form of a taxonomy;
- a clustering algorithm that quickly and efficiently identifies topics of interest as semantic clusters of keywords;
- a method for computing the user profile and the data source profile from the generic model.
The framework makes it possible to carry out various tasks related to structuring a distributed environment from a usage point of view. As example applications, it is used to discover user communities and to categorize data sources. To validate the framework, a series of experiments was conducted on real logs from the AOL search engine; the results demonstrate the effectiveness of the disambiguation method on short queries and reveal the relation between quality-based clustering and structure-based clustering.
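As a concrete illustration of the vector-space profile representation, the sketch below builds keyword-frequency vectors from toy query histories and compares them with cosine similarity. The framework described above first groups keywords into semantic clusters (topics of interest); this sketch skips that step and works on raw keywords, so it is a simplification with invented data, not the framework itself.

```python
import math
from collections import Counter

# Toy query histories for two users and one data source. The names and
# queries are invented; a real profile would be built from large logs.
histories = {
    "user_a":   ["jaguar car", "car price", "used car dealer"],
    "user_b":   ["jaguar habitat", "big cat species", "wildlife park"],
    "source_x": ["car review", "car price", "dealer rating"],
}

def profile(queries):
    """Profile = L2-normalized keyword-frequency vector over a query history."""
    counts = Counter(word for q in queries for word in q.split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse unit vectors."""
    return sum(v * q.get(word, 0.0) for word, v in p.items())

profiles = {name: profile(h) for name, h in histories.items()}
# User-to-user and user-to-source comparisons, as the generic model allows.
print("a vs b:", round(cosine(profiles["user_a"], profiles["user_b"]), 3))
print("a vs x:", round(cosine(profiles["user_a"], profiles["source_x"]), 3))
```

Because the same vector form represents both users and sources, the one similarity function supports all three comparison types the model targets.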
3

Mining Clickthrough Data To Improve Search Engine Results

Veilumuthu, Ashok, 05 1900
In this thesis, we aim to improve search result quality by exploiting the search intelligence (the history of past searches) available in the form of click-through data. We address two key issues, namely 1) relevance feedback extraction and fusion, and 2) deciphering search query intentions.

Relevance Feedback Extraction and Fusion: Existing search engines depend heavily on the web's linkage structure, in the form of hyperlinks, to determine the relevance and importance of documents. But these are collective judgments made by page authors and are hence prone to collaborative spamming. To overcome spamming attempts and language semantics issues, it is also important to incorporate user feedback on document relevance. Since users can hardly be motivated to give explicit feedback on search quality, it becomes necessary to consider implicit feedback that can be collected from search engine logs. Although a number of implicit feedback measures have been proposed in the literature, we have not been able to identify studies that aggregate those feedbacks in a meaningful way into a final ranking of documents. In this thesis, we first evaluate two implicit feedback measures, namely 1) click sequence and 2) time spent on the document, for the uniqueness of their information content. We develop a mathematical programming model to collate the feedback collected from different sessions into a single ranking of documents, and we use Kendall's τ rank correlation to determine the uniqueness of the information content in the individual feedbacks. An experimental evaluation on 30 selected top queries from an actual search log confirms that the two measures are not in perfect agreement and hence that incremental information can potentially be derived from them. Next, we study the feedback fusion problem, in which user feedback from various sessions must be combined meaningfully. Preference aggregation is a classical problem in economics, and we study a variation of it in which the rankers, i.e., the feedbacks, possess different expertise. We extend the generalized Mallows model to model the feedback rankings given in user sessions, and we propose single-stage and two-stage aggregation frameworks that combine the different feedbacks into one final ranking while taking their respective expertise into account. We show that the complexity of the parameter estimation problem is exponential in the number of documents and queries, and we develop two scalable heuristics, namely 1) a greedy algorithm and 2) a weight-based heuristic, that closely approximate the solution. We also establish the goodness of fit of the model by testing it on actual log data through a log-likelihood ratio test. Since independent evaluations of the documents are not available, we conduct experiments on appropriately devised synthetic datasets to examine the merits of the heuristics. The experimental results confirm the possibility of expertise-oriented aggregation of feedbacks, producing orderings better than both the best individual ranker and an equal-weight aggregator. Motivated by this result, we extend the aggregation framework to handle infinite rankings for meta-search applications; the aggregation results on synthetic datasets show the extension to be fruitful and scalable.
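The agreement test between the two implicit feedback signals can be sketched directly: Kendall's τ over the two rankings a session induces, one by click sequence and one by dwell time, measures how far they agree, and a value well below 1 suggests each signal carries information the other lacks, which is the premise for fusing them. The rankings below are invented session data, not figures from the thesis.

```python
from scipy.stats import kendalltau

# Two implicit feedback rankings of the same five documents in one session:
# the rank of each document by click order vs. by time spent on it
# (illustrative numbers only).
docs = ["d1", "d2", "d3", "d4", "d5"]
click_rank = [1, 2, 3, 4, 5]  # rank induced by click sequence
dwell_rank = [2, 1, 3, 5, 4]  # rank induced by dwell time

tau, p_value = kendalltau(click_rank, dwell_rank)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.2f})")
# tau = 0.60 here: substantial but imperfect agreement, i.e. the two
# feedback measures are partly redundant and partly complementary.
```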
Deciphering Search Query Intentions: A search engine often retrieves a huge list of documents ranked by their relevance scores for a given query. Such a presentation strategy may work when the submitted query is specific, homogeneous, and unambiguous. Often, however, the queries posed to a search engine are too short to be specific, and hence too ambiguous for the exact information need to be identified (e.g., "jaguar"). Such ambiguous, heterogeneous queries draw results from diverse topics, and users may have to sift through the entire list to find the information they need, which can be a difficult task. The task can be simplified by organizing the search results under meaningful subtopics, letting users move directly to their topic of interest and ignore the rest. We develop a method to determine the various possible intentions behind a short, generic, ambiguous query using information from the click-through data. We propose a two-stage clustering framework to co-cluster the queries and documents into intentions that can readily be presented whenever demanded. For this problem, we adapt spectral bipartite partitioning, extending it to automatically determine the number of clusters hidden in the log data. The algorithm has been tested on selected ambiguous queries, and the results demonstrate its ability to distinguish among user intentions.
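The intention-discovery step rests on spectral bipartite partitioning of the query-document click graph. The sketch below applies scikit-learn's SpectralCoclustering to an invented click matrix for the ambiguous query "jaguar"; the thesis extends the method to choose the number of clusters automatically, whereas this sketch fixes it at two, so it shows the flavor of the co-clustering rather than the thesis's algorithm.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy query-document click matrix for the ambiguous query "jaguar":
# rows are refined queries, columns are clicked documents, entries are
# click counts (all invented for illustration).
queries = ["jaguar car", "jaguar price", "jaguar cat", "jaguar habitat"]
docs = ["autotrader", "cardealer", "wikipedia_animal", "zoo_site"]
clicks = np.array([
    [9, 7, 0, 1],
    [8, 6, 0, 0],
    [0, 1, 9, 5],
    [1, 0, 7, 8],
])

# Co-cluster rows (queries) and columns (documents) jointly; each
# co-cluster corresponds to one user intention behind "jaguar".
model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(clicks)

for k in range(2):
    q_idx = np.where(model.row_labels_ == k)[0]
    d_idx = np.where(model.column_labels_ == k)[0]
    print(f"intention {k}:",
          [queries[i] for i in q_idx], "->", [docs[j] for j in d_idx])
```

On this toy matrix the two recovered co-clusters separate the automotive and the animal senses of the query, which is exactly the kind of subtopic grouping the abstract describes.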
