Global ETD Search

511	Prestandajämförelse mellan NOSQL och SQL vid sökningsoperationer på bokhandels hemsidor : En jämförelse mellan MongoDB & MySQL / Performance comparison between NOSQL and SQL at search operations on bookstore websites : A comparison between MongoDB & MySQL Wetterlind, Hampus January 2024 This study examines which of two databases, MySQL or MongoDB, has the lowest response time for search operations on a bookstore website. The databases were tested against different levels of query complexity with growing data volumes. Clustered indexing was used to see whether indexing affected the databases' response times positively or negatively in each measurement. The experiment was designed to retrieve data typically seen on bookstore websites, such as author names and book titles. The two databases are compared to determine which gives the lowest response time, since this is vital for e-commerce sites from a customer perspective. The result was that MySQL had lower response times on average than MongoDB. Future work could scale to even larger data volumes and include more databases. Query complexity MySQL MongoDB e-commerce Information Systems, Social aspects
512	Neue Indexingverfahren für die Ähnlichkeitssuche in metrischen Räumen über großen Datenmengen / New indexing techniques for similarity search in metric spaces Guhlemann, Steffen 06 July 2016 A topic of growing importance in computer science is the handling of similarity in multiple heterogeneous domains. Currently there is no common infrastructure to support similarity search in general metric spaces. The goal of this work is to lay the foundation for such an infrastructure, which could be integrated into classical database management systems. After an analysis of the state of the art, the M-Tree is identified as the most suitable base structure. It is then extended to the EM-Tree while retaining structural compatibility with the M-Tree. The query algorithms are optimized to reduce the number of necessary distance calculations. On the basis of a mathematical analysis of the relation between tree structure and query cost, degrees of freedom in the tree modification algorithms are used to build trees that answer similarity queries with a minimal number of distance calculations. Metric Metric space Indexing Curse of Dimensionality EM-Tree M-Tree Similarity search Range query k-Nearest-Neighbor-Query ddc:004 rvk:ST 270
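To illustrate why metric indexes of the kind described in entry 512 can save distance calculations, the following is a minimal sketch of a range query that prunes candidates with the triangle inequality. It is not the M-Tree or EM-Tree itself: it assumes a single hypothetical pivot with precomputed object-to-pivot distances, and the names `range_query` and `abs_dist` are invented for this example.

```python
def range_query(points, pivot_dists, pivot, query, radius, dist):
    """Return (hits, computed): objects within `radius` of `query`, and the
    number of distance computations actually performed.

    pivot_dists[i] must hold dist(points[i], pivot), computed at build time.
    """
    d_qp = dist(query, pivot)
    computed = 1  # one distance spent on the pivot
    hits = []
    for obj, d_op in zip(points, pivot_dists):
        # Triangle inequality: |d(q,p) - d(o,p)| <= d(q,o).
        # If this lower bound already exceeds the radius, skip the object.
        if abs(d_qp - d_op) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed


def abs_dist(a, b):
    """Toy metric on numbers (assumption for the example)."""
    return abs(a - b)


points = [1, 4, 9, 15, 20]
pivot = 0
pivot_dists = [abs_dist(p, pivot) for p in points]
hits, computed = range_query(points, pivot_dists, pivot, 10, 2, abs_dist)
print(hits, computed)
```

A tree-structured index applies the same bound at every inner node, so whole subtrees can be skipped rather than single objects.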
513	View-Based techniques for the efficient management of web data / Techniques fondées sur des vues matérialisées pour la gestion efficace des données du web Karanasos, Konstantinos 29 June 2012 Data is being published in digital formats at very high rates nowadays. A large share of this data has complex structure, typically organized as trees (Web documents such as HTML and XML being the most representative) or graphs (in particular, graph-structured Semantic Web databases, expressed in RDF). There is great interest in exploiting such complex data, whether in an Open Data access model or within companies owning it, and efficiently doing so for large data volumes remains challenging. 
Materialized views have long been used to obtain significant performance improvements when processing queries. The principle is that a view stores pre-computed results that can be used to evaluate (possibly part of) a query. Adapting materialized view techniques to the Web data setting we consider is particularly challenging due to the structural and semantic complexity of the data. This thesis tackles two problems in the broad context of materialized view-based management of Web data. First, we focus on the problem of view selection for RDF query workloads. We present a novel algorithm which, based on a query workload, proposes the most appropriate views to be materialized in the database, in order to minimize the combined cost of query evaluation, view maintenance and view storage. Although RDF query workloads typically feature many joins, hampering the view selection process, our algorithm scales to hundreds of queries, a number unattained by existing approaches. Furthermore, we propose new techniques to account for the implicit data that can be derived from RDF Schemas and that further complicates the view selection process. The second contribution of our work concerns query rewriting based on materialized XML views. We start by identifying an expressive dialect of XQuery, corresponding to tree patterns with value joins, and study some important properties of these queries, such as containment and minimization. Based on these notions, we consider the problem of finding minimal equivalent rewritings of a query expressed in this dialect, using materialized views expressed in the same dialect, and provide a sound and complete algorithm for that purpose. Our work extends the state of the art by allowing each pattern node to return a set of attributes, supporting value joins in the patterns, and considering rewritings which combine many views. 
Finally, we show how our view-based query rewriting algorithm can be applied in a distributed setting, in order to efficiently disseminate corpora of XML documents carrying RDF annotations. XML RDF RDFS Web data Materialized views Query optimization View-based query rewriting View selection
514	Resource Centered Store Heese, Ralf 04 January 2016 The Resource Description Framework (RDF) is the conceptual foundation for representing properties of real-world or virtual resources and describing the relationships between them. Standards based on RDF allow machines to access and process information automatically and locate additional data about resources. RDF also supports the discovery of relationships between concepts. The smallest information units in RDF are triples, which form a directed labeled multigraph. The query language SPARQL is also based on a graph model, which makes it difficult for relational DBMS to store and query RDF data efficiently. The most performant DBMS for managing and querying RDF data implement an RDF-specific storage model based on a set of B+ tree indexes. The key disadvantages of these systems are the increased usage of secondary storage caused by redundantly stored triples and the need for expensive join operations to compute the solutions of a SPARQL query. In this work we develop and describe the Resource Centered Store (RCS), which exploits RDF-inherent characteristics to avoid storing triples redundantly while improving the query performance of larger queries. In the RCS storage model, triples are grouped by their first component (the subject), and these star-shaped subgraphs are stored on database pages, similar to relational DBMS. As a result, the RCS can benefit from principles and algorithms that have been developed in the context of relational databases. Additionally, we defined transformation rules and heuristics to optimize SPARQL queries and generate an efficient query execution plan. In this context we also defined graph-pattern-based indexes and investigated their benefits for computing the solutions of queries. 
We implemented the RCS storage model prototypically and compared it to the native RDF DBMS Jena TDB. Our experiments showed that our storage model is especially suited to speeding up queries with large star-shaped graph patterns. SPARQL Native RDF database management system Query processing Query optimization 004 Informatik 28 Informatik, Datenverarbeitung ST 250 ST 250 X70 ST 270 ddc:004
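The idea in entry 514 of grouping triples by subject into star-shaped subgraphs can be sketched as follows. This is only an in-memory illustration under assumed names (`group_by_subject`, `match_star`, and the `ex:` identifiers are hypothetical); the RCS itself stores such groups on database pages, which this sketch does not model.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Group (subject, predicate, object) triples into one 'star' per subject."""
    stars = defaultdict(list)
    for s, p, o in triples:
        stars[s].append((p, o))
    return dict(stars)


def match_star(stars, pattern):
    """Return subjects whose star satisfies every (predicate, object) pair
    in `pattern`; an object of None acts as a wildcard (like a variable)."""
    result = []
    for subject, edges in stars.items():
        if all(
            any(p == qp and (qo is None or o == qo) for p, o in edges)
            for qp, qo in pattern
        ):
            result.append(subject)
    return result


triples = [
    ("ex:book1", "ex:author", "ex:alice"),
    ("ex:book1", "ex:title", "RDF Basics"),
    ("ex:book2", "ex:author", "ex:bob"),
    ("ex:book2", "ex:title", "SPARQL"),
    ("ex:book3", "ex:author", "ex:alice"),
]
stars = group_by_subject(triples)
matches = match_star(stars, [("ex:author", "ex:alice"), ("ex:title", None)])
print(matches)
```

Because all edges of a subject sit together, a star-shaped pattern is checked in one pass over a group instead of one join per triple pattern.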
515	Processamento distribuído da consulta espaço textual top-k Novaes, Tiago Fernandes de Athayde 17 July 2017 With the popularization of databases containing objects with spatial and textual information (spatio-textual objects), interest in new queries and techniques for retrieving these objects has increased. In this scenario, the main query is the top-k spatio-textual query. This query retrieves the k best spatio-textual objects considering the distance of the object to the query location and the textual similarity between the query keywords and the textual information of the objects. However, most studies of the top-k spatio-textual query are performed in centralized environments and do not address real-world problems such as scalability. In this dissertation, we study different strategies for partitioning the data and processing the top-k spatio-textual query in a distributed environment. We evaluate each strategy in a real distributed environment, employing real datasets. Data partitioning Distributed query processing Spatio-textual query Information systems Information retrieval
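As a rough sketch of the top-k spatio-textual query described in entry 515, the code below ranks objects by a weighted mix of spatial proximity and textual similarity, then merges per-partition results. The scoring function is an assumption made for illustration (linear combination with weight `alpha`, Euclidean distance capped at `max_dist`, Jaccard similarity on keyword sets); the abstract does not state which scoring or partitioning strategies the dissertation uses.

```python
import heapq
import math


def score(obj, q_loc, q_terms, alpha, max_dist):
    """Higher is better: alpha * proximity + (1 - alpha) * Jaccard similarity."""
    proximity = 1.0 - min(math.dist(obj["loc"], q_loc) / max_dist, 1.0)
    terms = set(obj["terms"])
    union = terms | q_terms
    text_sim = len(terms & q_terms) / len(union) if union else 0.0
    return alpha * proximity + (1 - alpha) * text_sim


def top_k(objects, q_loc, q_terms, k, alpha=0.5, max_dist=10.0):
    """Centralized top-k over one collection of objects."""
    q_terms = set(q_terms)
    return heapq.nlargest(
        k, objects, key=lambda o: score(o, q_loc, q_terms, alpha, max_dist)
    )


def top_k_distributed(partitions, q_loc, q_terms, k, **kw):
    """Each partition answers locally; the coordinator merges the candidates."""
    candidates = []
    for part in partitions:
        candidates.extend(top_k(part, q_loc, q_terms, k, **kw))
    return top_k(candidates, q_loc, q_terms, k, **kw)


a = {"name": "A", "loc": (0, 1), "terms": ["coffee", "wifi"]}
b = {"name": "B", "loc": (5, 5), "terms": ["coffee"]}
c = {"name": "C", "loc": (0, 0), "terms": ["books"]}
central = [o["name"] for o in top_k([a, b, c], (0, 0), ["coffee"], 2)]
merged = [o["name"] for o in top_k_distributed([[a], [b, c]], (0, 0), ["coffee"], 2)]
print(central, merged)
```

Taking the local top-k from every partition is sufficient for a correct global top-k, since any globally best object is also among the best k of its own partition.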
516	Semantically-enabled stream processing and complex event processing over RDF graph streams / Traitement de flux sémantiquement activé et traitement d'évènements complexes sur des flux de graphe RDF Gillani, Syed 04 November 2016 French abstract not provided by the author. / There is a paradigm shift in the nature and processing means of today's data: data used to be mostly static and stored in large databases to be queried. Today, with the advent of new applications and means of collecting data, most applications on the Web and in enterprises produce data continuously in the form of streams. Thus, the users of these applications expect to process a large volume of data with fresh, low-latency results. This has resulted in the introduction of Data Stream Management Systems (DSMSs) and a Complex Event Processing (CEP) paradigm, each with distinctive aims: DSMSs are mostly employed to process traditional query operators (mostly stateless), while CEP systems focus on temporal pattern matching (stateful operators) to detect changes in the data that can be thought of as events. In the past decade or so, a number of scalable and performance-intensive DSMSs and CEP systems have been proposed. Most of them, however, are based on relational data models, which raises the question of support for heterogeneous data sources, i.e., the variety of the data. Work on RDF stream processing (RSP) systems partly addresses the challenge of variety by promoting the RDF data model. Nonetheless, challenges like volume and velocity are overlooked by existing approaches. These challenges require customised optimisations which consider RDF as a first-class citizen and scale the process of continuous graph pattern matching. To gain insights into these problems, this thesis focuses on developing scalable RDF graph stream processing and semantically-enabled CEP systems (i.e., Semantic Complex Event Processing, SCEP). 
In addition to our optimised algorithmic and data structure methodologies, we also contribute to the design of a new query language for SCEP. Our contributions in these two fields are as follows: • RDF Graph Stream Processing. We first propose an RDF graph stream model, where each data item/event within a stream comprises an RDF graph (a set of RDF triples). Second, we implement customised indexing techniques and data structures to continuously process RDF graph streams in an incremental manner. • Semantic Complex Event Processing. We extend the idea of RDF graph stream processing to enable SCEP over such RDF graph streams, i.e., temporal pattern matching. Our first contribution in this context is to provide a new query language that encompasses the RDF graph stream model and employs a set of expressive temporal operators such as sequencing, Kleene-+, negation, optional, conjunction, disjunction and event selection strategies. Based on this, we implement a scalable system that employs a non-deterministic finite automata model to evaluate these operators in an optimised manner. We leverage techniques from diverse fields, such as relational query optimisations, incremental query processing, sensor and social networks in order to solve real-world problems. We have applied our proposed techniques to a wide range of real-world and synthetic datasets to extract the knowledge from RDF structured data in motion. Our experimental evaluations confirm our theoretical insights and demonstrate the viability of our proposed methods. Stream processing Complex event processing RDF graphs Query optimisations Query design Semantic web Top-k queries Graph databases
517	Suporte a consultas por similaridade unárias em SQL / Extending SQL to support unary similarity queries Ferreira, Mônica Ribeiro Porto 15 February 2008 Conventional operators for data comparison based on exact matching and total order relations are not appropriate for managing complex data, such as multimedia data (e.g. images, audio and large texts), time series and genetic sequences. 
In fact, the most important aspect when comparing complex data is usually the degree of similarity between instances, leading to the use of similarity operators to perform search and retrieval operations. Similarity operators can be classified as unary or binary, used respectively to implement selection operations and joins. However, the Relational Algebra, employed in Relational Database Management Systems (RDBMS), does not provide resources to express similarity search criteria. To fill this gap, an extension to the Relational Algebra is under development at GBdI-ICMC-USP (Grupo de Bases de Dados e Imagens), aiming to represent similarity queries in algebraic expressions. This work contributes to that effort by dealing with unary similarity operators in Relational Algebra and by developing a similarity query optimizer for SIREN (Similarity Retrieval Engine), thereby allowing similarity queries to be answered by relational DBMS. Metric access method Query optimization Query processing Similarity algebra Similarity queries Similarity selection
518	Ontologias de domínio na interpretação de consultas a bancos de dados relacionais / Domain ontologies in query interpretation to relational databases Marins, Walquíria Fernandes 08 October 2015 There is a huge amount of data and information stored digitally. Part of this data is available for querying through the Web; however, a significant portion is hidden because of the way it is stored and cannot be retrieved by traditional search engines. As a result, users face a common and growing challenge in searching for and finding specific information. This challenge is amplified by users' lack of preparation in formulating searches and by the inherent limitations of search technologies, which, because of syntactic differences, fail to find data that is probably relevant. Aiming to expand the possibilities of semantic interpretation of a query, this work proposes an approach to interpreting natural-language queries to relational databases through the use of domain ontologies as a tool for their interpretation and semantic enrichment. 
With the implementation of the approach and its evaluation against a scenario of queries without semantic enrichment, it is observed that the approach contributes satisfactorily to identifying the user's intention. Web databases Relational databases Keyword-based query Domain ontology Semantic query interpretation
519	TOP-K AND SKYLINE QUERY PROCESSING OVER RELATIONAL DATABASE Samara, Rafat January 2012 Top-k and skyline queries have long been studied in the database and information retrieval communities, and they are two popular operations for preference retrieval. A top-k query returns a subset of the most relevant answers instead of all answers. Efficient top-k processing retrieves the k objects that have the highest overall score. In this paper, some algorithms used for efficient top-k processing in different scenarios are presented. A framework based on existing algorithms and cost-based optimization that works for these scenarios is presented. This framework is intended for cases where the user can specify the ranking function. A real-life scenario is applied to this framework step by step. A skyline query returns the set of points that are not dominated by other points in the given dataset (a record x dominates another record y if x is as good as y in all attributes and strictly better in at least one attribute). In this paper, some algorithms used for evaluating the skyline query are introduced. One of the problems of the skyline query, the curse of dimensionality, is presented. A new strategy, based on existing skyline algorithms, skyline frequency and a binary tree strategy, is presented as a good solution for this problem. This new strategy is intended for cases where the user cannot specify the ranking function. A real-life scenario is presented that applies this strategy step by step. Finally, the advantages of the top-k query are applied to the skyline query in order to retrieve results quickly and efficiently. Top-k query Skyline query Fagin's algorithm Threshold Algorithm No random access algorithm Minimal Probing algorithm Block-Nested-Loop algorithm Nearest Neighbor algorithm Branch and Bound Skyline Algorithm Divide and Conquer algorithm
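The dominance definition in entry 519 translates directly into code. The sketch below is a simplified, in-memory variant in the spirit of the Block-Nested-Loop algorithm named in the keywords; it assumes that smaller attribute values are better (for example price and distance), and the hotel data is made up for illustration.

```python
def dominates(x, y):
    """x dominates y if x is at least as good in every attribute and
    strictly better in at least one (here: smaller is better)."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def skyline(points):
    """Keep a window of points not dominated by anything seen so far."""
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated, so it cannot be in the skyline
        # p survives: evict window entries that p dominates, then add p
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


# Hypothetical hotels as (price, distance_to_beach_km)
hotels = [(50, 8), (80, 2), (60, 9), (100, 1), (90, 3)]
result = skyline(hotels)
print(result)
```

No ranking function is needed: every point in the result is the best choice for some trade-off between the attributes, which is why the skyline suits users who cannot state their preferences as a score.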
520	Mining Clickthrough Data To Improve Search Engine Results Veilumuthu, Ashok 05 1900 In this thesis, we aim at improving the search result quality by utilizing the search intelligence (history of searches) available in the form of click-through data. We address two key issues, namely 1) relevance feedback extraction and fusion, and 2) deciphering search query intentions. Relevance Feedback Extraction and Fusion: Existing search engines depend heavily on the web linkage structure in the form of hyperlinks to determine the relevance and importance of documents. But these are collective judgments given by the page authors and hence prone to collaborative spamming. To overcome spamming attempts and language semantic issues, it is also important to incorporate user feedback on the documents' relevance. Since users can hardly be motivated to give explicit/direct feedback on search quality, it becomes necessary to consider implicit feedback that can be collected from search engine logs. Though a number of implicit feedback measures have been proposed in the literature, we have not been able to identify studies that aggregate those feedbacks in a meaningful way to get a final ranking of documents. In this thesis, we first evaluate two implicit feedback measures, namely 1) click sequence and 2) time spent on the document, for their content uniqueness. We develop a mathematical programming model to collate the feedbacks collected from different sessions into a single ranking of documents. We use Kendall's τ rank correlation to determine the uniqueness of the information content present in the individual feedbacks. The experimental evaluation on the top 30 selected queries from actual search log data confirms that these two measures are not in perfect agreement and hence incremental information can potentially be derived from them. 
Next, we study the feedback fusion problem in which the user feedbacks from various sessions need to be combined meaningfully. Preference aggregation is a classical problem in economics and we study a variation of it where the rankers, i.e., the feedbacks, possess different expertise. We extend the generalized Mallows' model to model the feedback rankings given in user sessions. We propose single-stage and two-stage aggregation frameworks to combine different feedbacks into one final ranking by taking their respective expertise into consideration. We show that the complexity of the parameter estimation problem is exponential in the number of documents and queries. We develop two scalable heuristics, namely 1) a greedy algorithm and 2) a weight-based heuristic, that can closely approximate the solution. We also establish the goodness of fit of the model by testing it on actual log data through a log-likelihood ratio test. As an independent evaluation of documents is not available, we conduct experiments on synthetic datasets devised appropriately to examine the various merits of the heuristics. The experimental results confirm the possibility of expertise-oriented aggregation of feedbacks by producing orderings better than both the best ranker and the equi-weight aggregator. Motivated by this result, we extend the aggregation framework to hold infinite rankings for meta-search applications. The aggregation results on synthetic datasets indicate that the extension is fruitful and scalable. Deciphering Search Query Intentions: The search engine often retrieves a huge list of documents based on their relevance scores for a given query. Such a presentation strategy may work if the submitted query is very specific, homogeneous and unambiguous. But it often happens that the queries posed to the search engine are too short to be specific and hence too ambiguous to identify the exact information need (e.g. "jaguar"). 
These ambiguous and heterogeneous queries invite results from diverse topics. In such cases, the users may have to sift through the entire list to find the information they need, which can be a difficult task. Such a task can be simplified by organizing the search results under meaningful subtopics, which would help the users to move directly to their topic of interest and ignore the rest. We develop a method to determine the various possible intentions of a given short, generic and ambiguous query using information from the click-through data. We propose a two-stage clustering framework to co-cluster the queries and documents into intentions that can readily be presented whenever demanded. For this problem, we adapt spectral bipartite partitioning by extending it to automatically determine the number of clusters hidden in the log data. The algorithm has been tested on selected ambiguous queries and the results demonstrate the ability of the algorithm to distinguish among the user intentions. Data Mining (Special Computer Methods) Search Engine (Computer Science) Information Search and Retrieval Clickthrough Data Mining Implicit Relevance Feedback Rank Aggregation Query Clustering Intent Clustering Search Engine Log Files Search Engine Query Log Search Engine Log Queries Ranking Models Computer Science
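Entry 520 uses Kendall's τ to check how far two implicit feedback measures agree. A minimal sketch of that statistic for two tie-free rankings of the same documents follows; the document identifiers and the two example rankings (by click sequence and by time spent) are invented for illustration and do not come from the thesis.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings of the same items, assuming no ties.

    Returns a value in [-1, 1]: 1 for identical order, -1 for reversed order.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order x and y the same way
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


by_click_sequence = ["d1", "d2", "d3", "d4"]
by_time_spent = ["d2", "d1", "d3", "d4"]
tau = kendall_tau(by_click_sequence, by_time_spent)
print(round(tau, 3))
```

A value below 1 means the two measures order some document pairs differently, which is the sense in which one measure can carry information the other lacks.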
