Spelling suggestions: "subject:"top1k"" "subject:"top2k""
21 |
Traitement de requêtes top-k multicritères et application à la recherche par le contenu dans les bases de données multimédia / Multicriteria top-k query processing and application to content-based search in multimedia databasesBadr, Mehdi 07 October 2013 (has links)
Le développement des techniques de traitement des requêtes de classement est un axe de recherche très actif dans le domaine de la recherche d'information. Plusieurs applications nécessitent le traitement des requêtes de classement multicritères, telles que les méta-moteurs de recherche sur le web, la recherche dans les réseaux sociaux, la recherche dans les bases de documents multimédia, etc. Contrairement aux requêtes booléennes traditionnelles, dans lesquelles le filtrage est basé sur des prédicats qui retournent vrai ou faux, les requêtes de classement utilisent des prédicats de similarité retournant un score de pertinence. Ces requêtes spécifient une fonction d'agrégation qui combine les scores individuels produits par les prédicats de similarité permettant de calculer un score global pour chaque objet. Les k objets avec les meilleurs scores globaux sont retournés dans le résultat final. Dans cette thèse, nous étudions dans un premier temps les techniques et algorithmes proposés dans la littérature conçus pour le traitement des requêtes top-k multicritères dans des contextes spécifiques de type et de coût d'accès aux scores, et nous proposons un cadre générique capable d'exprimer tous ces algorithmes. Ensuite, nous proposons une nouvelle stratégie en largeur «breadth-first», qui maintient l'ensemble courant des k meilleurs objets comme un tout, à la différence des stratégies en profondeur habituelles qui se focalisent sur le meilleur candidat. Nous présentons un nouvel algorithme «Breadth-Refine» (BR), basé sur cette stratégie et adaptable à n'importe quelle configuration de type et de coût d'accès aux scores. Nous montrons expérimentalement la supériorité de l'algorithme BR sur les algorithmes existants. Dans un deuxième temps, nous proposons une adaptation des algorithmes top-k à la recherche approximative, dont l'objectif est de trouver un compromis entre le temps de recherche et la qualité du résultat retourné. Nous explorons l'approximation par arrêt prématuré de l'exécution et proposons une première étude expérimentale du potentiel d'approximation des algorithmes top-k. Dans la dernière partie de la thèse, nous nous intéressons à l'application des techniques top-k multicritères à la recherche par le contenu dans les grandes bases de données multimédia. Dans ce contexte, un objet multimédia (une image par exemple) est représenté par un ou plusieurs descripteurs, en général sous forme de vecteurs numériques qui peuvent être vus comme des points dans un espace multidimensionnel. Nous explorons la recherche des k plus proches voisins (k-ppv) dans ces espaces et proposons une nouvelle technique de recherche k-ppv approximative «Multi-criteria Search Algorithm » (MSA) basée sur les principes des algorithmes top-k. Nous comparons MSA à des méthodes de l'état de l'art dans le contexte des grandes bases multimédia où les données ainsi que les structures d'index sont stockées sur disque, et montrons qu'il produit rapidement un très bon résultat approximatif. / Efficient processing of ranking queries is an important issue in today information retrieval applications such as meta-search engines on the web, information retrieval in social networks, similarity search in multimedia databases, etc. We address the problem of top-k multi-criteria query processing, where queries are composed of a set of ranking predicates, each one expressing a measure of similarity between data objects on some specific criteria. Unlike traditional Boolean predicates returning true or false, similarity predicates return a relevance score in a given interval. The query also specifies an aggregation function that combines the scores produced by the similarity predicates. Query results are ranked following the global score and only the best k ones are returned.In this thesis, we first study the state of the art techniques and algorithms designed for top-k multi-criteria query processing in specific conditions for the type of access to the scores and cost settings, and propose a generic framework able to express any top-k algorithm. Then we propose a new breadth-first strategy that maintains the current best k objects as a whole instead of focusing only on the best one such as in all the state of the art techniques. We present Breadth-Refine (BR), a new top-k algorithm based on this strategy and able to adapt to any combination of source access types and to any cost settings. Experiments clearly indicate that BR successfully adapts to various settings, with better results than state of the art algorithms.Secondly, we propose an adaptation of top-k algorithms to approximate search aiming to a compromise between execution time and result quality. We explore approximation by early stopping of the execution and propose a first experimental study of the approximation potential of top-k algorithms. Finally, we focus on the application of multi-criteria top-k techniques to Large Scale Content-Based Image Retrieval. In this context an image is represented by one or several descriptors, usually numeric vectors that can be seen as points in a multidimensional space. We explore the k-Nearest Neighbors search on such space and propose “Multi-criteria Search Algorithm” (MSA) a new technique for approximate k-NN based on multi-criteria top-k techniques. We compare MSA with state of the art methods in the context of large multimedia databases, where the database and the index structure are stored on disk, and show that MSA quickly produces very good approximate results.
|
22 |
針對複合式競賽挑選最佳球員組合的方法 / Selecting the best group of players for a composite competition鄧雅文, Teng, Ya Wen Unknown Date (has links)
在資料庫的處理中,top-k查詢幫助使用者從龐大的資料中萃取出具有價值的物件,它將資料庫中的物件依照給分公式給分後,選擇出分數最高的前k個回傳給使用者。然而在多數的情況下,一個物件也許不只有一個分數,要如何在多個分數中仍然選擇出整體最高分的前k個物件,便成為一個新的問題。在本研究中,我們將這樣的物件用不確定資料來表示,而每個物件的不確定性則是其帶有機率的分數以表示此分數出現的可能性,並提出一個新的問題:Best-kGROUP查詢。在此我們將情況模擬為一個複合式競賽,其中有多個子項目,每個項目的參賽人數各異,且最多需要k個人參賽;我們希望能針對此複合式競賽挑選出最佳的k個球員組合。當我們定義一個較佳的組合為其在較多項目居首位的機率比另一組合高,而最佳的組合則是沒有比它更佳的組合。為了加快挑選的速度,我們利用動態規劃的方式與篩選的演算法,將不可能的組合先剔除;所剩的組合則是具有天際線特質的組合,在這些天際線組合中,我們可以輕易的找出最佳的組合。此外,在實驗中,對於在所有球員中挑選最佳的組合,Best-kGROUP查詢也有非常優異的表現。 / In a large database, top-k query is an important mechanism to retrieve the most valuable information for the users. It ranks data objects with a ranking function and reports the k objects with the highest scores. However, when an object has multiple scores, how to rank objects without information loss becomes challenging. In this paper, we model the object with multiple scores as an uncertain data object and the uncertainty of the object as a distribution of the scores, and consider a novel problem named Best-kGROUP query. Imagine the following scenario. Assume there is a composite competition consisting of several games each of which requires a distinct number of players. Suppose the largest number is k, and we want to select the best group of k players from all the players for the competition. A group x is considered better than another group y if x has higher aggregated probability to be the top ones in more games than y. In order to speed up the selection process, the groups worse than another group definitely should first be discarded. We identify these groups using a dynamic programming based approach and a filtering algorithm. The remaining groups with the property that none of them have higher aggregated probability to be the top ones for all games against the other groups are called skyline groups. From these skyline groups, we can easily compare them to select the best group for the composite competition. The experiments show that our approach outperforms the other approaches in selecting the best group to defeat the other groups in the composite competitions.
|
23 |
Preferenčné vyhľadávanie založené na viacrozmernom B-strome / Preference Top-k Search Based on Multidimensional B-treeOndreička, Matúš January 2013 (has links)
Title: Preference Top-k Search Based on Multidimensional B-Tree Author: RNDr. Matúš Ondreička Department: Department of Software Engineering Faculty of Mathematics and Physics Charles University in Prague Supervisor: Prof. RNDr. Jaroslav Pokorný, CSc. Author's e-mail address: ondreicka@ksi.mff.cuni.cz Supervisor's e-mail address: pokorny@ksi.mff.cuni.cz Abstract: In this thesis, we focus on the top-k search according to user pref- erences by using B+ -trees and the multidimensional B-tree (MDB-tree). We use model of user preferences based on fuzzy functions, which enable us to search according to a non-monotone ranking function. We propose model of sorted list based on the B+ -tree, which enables Fagin's algorithms to search for the top-k objects according to a non-monotone ranking function. We apply this model in the Internet environment with data on different remote servers. Furthermore, we designed novel dynamic tree-based data structures, namely, MDB-tree composed of B+ -trees, MDB-tree with lists, MDB-tree with groups of B+ -trees and multiple-ordered MDB-tree. Concurrently, we have developed novel top-k algorithms, namely, the MD algorithm, the MXT algorithm and their variants which are able search for the top-k objects ac- cording to a non-monotone ranking function. These top-k algorithms are efficient...
|
24 |
Projeto e avaliação de algoritmos paralelos para sistemas Multicore e Manycore aplicados no processamento de documentos / Design and evaluation of parallel algorithms for Multicore and Manycore systems applied on document processingFreitas, Mateus Ferreira e 30 August 2017 (has links)
Submitted by Luciana Ferreira (lucgeral@gmail.com) on 2017-10-02T15:28:01Z
No. of bitstreams: 2
Dissertação - Mateus Ferreira e Freitas - 2017.pdf: 4269845 bytes, checksum: e84e69d8747a21125170793812384a98 (MD5)
license_rdf: 0 bytes, checksum: d41d8cd98f00b204e9800998ecf8427e (MD5) / Approved for entry into archive by Luciana Ferreira (lucgeral@gmail.com) on 2017-10-02T15:30:07Z (GMT) No. of bitstreams: 2
Dissertação - Mateus Ferreira e Freitas - 2017.pdf: 4269845 bytes, checksum: e84e69d8747a21125170793812384a98 (MD5)
license_rdf: 0 bytes, checksum: d41d8cd98f00b204e9800998ecf8427e (MD5) / Made available in DSpace on 2017-10-02T15:30:07Z (GMT). No. of bitstreams: 2
Dissertação - Mateus Ferreira e Freitas - 2017.pdf: 4269845 bytes, checksum: e84e69d8747a21125170793812384a98 (MD5)
license_rdf: 0 bytes, checksum: d41d8cd98f00b204e9800998ecf8427e (MD5)
Previous issue date: 2017-08-30 / Several applications process documents in different ways, aiming to filter, organize or learn with them. Nowadays, a great computational power is necessary in order to do that efficiently, due to the large and increasing number of documents. Usually, documents are independent of each other, which facilitates the use of parallelism to speed up this processing. This work explores three problems: active learning, learning to rank (L2R) and top-k search. Using the parallelism on multicore CPUs and manycore GPUs (Graphics Processing Unit), parallel algorithms were proposed and evaluated for each problem, and implemented with the OpenMP and CUDA APIs.
For the active learning problem a multicore algorithm was proposed, which obtained 10.8x of speedup in the best case with 12 threads. The proposed manycore version obtained 128x of speedup over the serial version, and a solution with 4 GPUs achieved 3.5x of speedup over 1 GPU.
For the L2R problem a manycore algorithm was proposed, which follows a thread-block approach using the concept of Combinadic, and uses a cache with fingerprint to speed up the processing. The best case speedups were 508x over the serial, 9x over a GPU baseline, and 4x over our solution when using 4 GPUs. When comparing with a version without combinadic, the speedup over it was 4.4x with both versions using 1 GPU and 3.9x with 4. These solutions used bitmap structures to speed up the association rules creation.
In the top-k search a serial and multicore solutions were implemented from a state of the art manycore algorithm for exact searches. These implementations served as baselines for our extension of this algorithm, which includes the use of multi-GPU, group searches and an intra-block load balancing. The speedups were 2.7x over the original algorithm, 17x over the serial, 4x over the multicore, and 4x over our version when using 4 GPUs. / Diversas aplicações processam documentos de diferentes maneiras, visando filtrá-los, organizá-los ou aprender com eles. Atualmente, é necessário um grande poder computacional para que isso seja feito eficientemente, devido ao número grande e crescente de documentos. Geralmente os documentos são independentes entre si, o que facilita o uso de paralelismo para acelerar esse processamento. Este trabalho explora três problemas: aprendizado ativo, learning to rank (L2R) e busca top-k. Usando o paralelismo em CPUs multicore e GPUs (Graphics Processing Unit) manycore, algoritmos paralelos foram propostos e avaliados para cada problema, e implementados com as APIs OpenMP e CUDA.
Para problema de aprendizado ativo foi proposto um algoritmo multicore, que obteve speedup de 10,8x no melhor caso com 12 threads. A versão manycore proposta obteve speedup de 128x em relação ao serial, e uma solução com 4 GPUs atingiu 3,5x de speedup sobre 1 GPU.
Para o problema de L2R foi proposto um algoritmo manycore, que segue uma abordagem por bloco de threads} usando o conceito de Combinadic, e usa uma cache} com fingerprint para acelerar o processamento. Os speedups nos melhores casos foram de 508x sobre o serial, 9x sobre uma baseline em GPU, e 4x sobre nossa solução com 1 GPU ao usar 4 GPUs. Ao comparar com uma versão sem o combinadic, o speedup sobre ela foi de 4,4x com ambas versões usando 1 GPU e 3,9x usando 4. Estas soluções usaram estruturas de mapa de bits para acelerar a criação de regras de associação.
Na busca top-k foram implementadas uma solução serial e uma multicore de um algoritmo manycore estado da arte para buscas exatas. Estas implementações serviram de baseline para nossa extensão desse algoritmo, que inclui o uso de multi-GPU, buscas em grupos e um balanceamento de carga intra-bloco. Os speedups obtidos foram de 2,7x sobre o algoritmo original, 17x sobre o serial, 4x sobre o multicore, e 4x sobre nossa versão ao usar 4 GPUs.
|
25 |
An Advanced Skyline Approach for Imperfect Data Exploitation and Analysis / Modèle Skyline pour l'analyse et l'exploitation des données incertainesElmi, Saïda 15 September 2017 (has links)
Ce travail de thèse porte sur un modèle de requête de préférence, appelée l'opérateur Skyline, pour l'exploitation de données imparfaites. L'imperfection de données peut être modélisée au moyen de la théorie de l'évidence. Ce type de données peut être géré dans des bases de données imparfaites appelées bases de données évidentielles. D'autre part, l'opérateur skyline est un outil puissant pour extraire les objets les plus intéressants dans une base de données.Dans le cadre de cette thèse, nous définissons une nouvelle sémantique de l'opérateur Skyline appropriée aux données imparfaites modélisées par la théorie de l'évidence. Nous introduisons par la suite la notion de points marginaux pour optimiser le calcul distribué du Skyline ainsi que la maintenance des objets Skyline en cas d'insertion ou de suppression d'objets dans la base de données.Nous modélisons aussi une fonction de score pour mesurer le degré de dominance de chaque objet skyline et définir le top-k Skyline. Une dernière contribution porte sur le raffinement de la requête Skyline pour obtenir les meilleurs objets skyline appelés objets Etoile ou Skyline stars. / The main purpose of this thesis is to study an advanced database tool named the skyline operator in the context of imperfect data modeled by the evidence theory. In this thesis, we first address, on the one hand, the fundamental question of how to extend the dominance relationship to evidential data, and on the other hand, it provides some optimization techniques for improving the efficiency of the evidential skyline. We then introduce efficient approach for querying and processing the evidential skyline over multiple and distributed servers. ln addition, we propose efficient methods to maintain the skyline results in the evidential database context wben a set of objects is inserted or deleted. The idea is to incrementally compute the new skyline, without reconducting an initial operation from the scratch. In the second step, we introduce the top-k skyline query over imperfect data and we develop efficient algorithms its computation. Further more, since the evidential skyline size is often too large to be analyzed, we define the set SKY² to refine the evidential skyline and retrieve the best evidential skyline objects (or the stars). In addition, we develop suitable algorithms based on scalable techniques to efficiently compute the evidential SKY². Extensive experiments were conducted to show the efficiency and the effectiveness of our approaches.
|
26 |
Query-Time Data IntegrationEberius, Julian 16 December 2015 (has links) (PDF)
Today, data is collected in ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources precludes up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established.
This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need.
To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state-of-the-art by producing a set of individually consistent, but mutually diverse, set of alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we studied the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort.
Collectively, these three contributions are the foundation of a Query-time Data Integration architecture, that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
|
27 |
Query-Time Data IntegrationEberius, Julian 10 December 2015 (has links)
Today, data is collected in ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources precludes up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established.
This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need.
To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state-of-the-art by producing a set of individually consistent, but mutually diverse, set of alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we studied the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort.
Collectively, these three contributions are the foundation of a Query-time Data Integration architecture, that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
|
28 |
Traitement personnalisé de requête top-k: des systèmes centralisés aux systèmes décentralisésBai, Xiao 08 December 2010 (has links) (PDF)
La révolution Web 2.0 a transformé l'Internet, une infrastructure auparavant en lecture seule, en une plate-forme collaborative en lecture-écriture. La forte augmentation des donnés générées par les utilisateurs des systèmes collaboratifs constitue désormais une source considérable d'informations. Pourtant, effectuer efficacement des recherches dans un tel environnement est devenu plus difficile, en particulier lorsque ces recherches engendrent des ambiguïtés. Personnaliser les recherches permet d'éviter ces écueils en limitant les recherches au sein d'un réseau très réduit de participants ayant des intérêts similaires. Toutefois, les solutions centralisées pour mettre en œuvre cette personnalisation s'avèrent difficile compte tenu du volume important d'informations qui doit être maintenu pour chaque utilisateur. La nature dynamique de ces systèmes, dans lesquels les utilisateurs changent potentiellement souvent d'intérêt, complique la tâche. Cette thèse propose de nouveaux algorithmes permettant d'effectuer des recherches personnalisées de manière efficace dans des systèmes dynamiques, centralisés ou décentralisés, selon deux axes majeurs : (i) la personnalisation hors ligne qui s'appuie sur le comportement passé des utilisateurs et (ii) la personnalisation en ligne qui s'appuie sur le comportement passé et la requête en cours. Nous présentons d'abord l'algorithme P3K, qui décentralise une approche existante et réalise le traitement personnalisé des requêtes top-k hors ligne dans les systèmes pair-à-pair. Ensuite, nous présentons P4Q, une extension de P3K qui améliore les performances du système en termes de stockage, bande passante et la robustesse en distribuant le traitement des requêtes. Les deux algorithmes, P3K et P4Q, reposent sur des protocoles épidémiques pour capturer la similarité implicite entre les utilisateurs et associer ainsi à chaque utilisateur un "réseau personnel" dans lequel traiter la requête. Nos évaluations analytiques et expérimentales démontrent leur efficacité pour le traitement des requêtes top-k, y compris dans les systèmes dynamiques, en particulier que la capacité inhérente de P4Q à faire face aux mises à jours des profils des utilisateurs. Dans le but d'améliorer encore la qualité des résultats pour les requêtes représentant les intérêts émergents des utilisateurs, et donc non représentés dans son profil, nous proposons un modèle hybride d'intérêt, prenant en compte à la fois le profil des utilisateurs mais également la requête elle-même. Nous avons proposé une solution à la fois en centralisé, l'algorithme DT², qui effectue une recherche de type top-k à deux reprises: le premier top-k consiste à sélectionner dynamiquement un sous-réseau (le réseau personnel) le plus adapté à la requête et à l'utilisateur la générant. Le second top-k consiste à effectuer la recherche dans ce sous réseau. L'algorithme DT²P², exécute efficacement la personnalisation en ligne de manière entièrement décentralisée. Les résultats expérimentaux sur des traces réelles de systèmes collaboratifs, montrent que la personnalisation en ligne est prometteuse pour répondre aux préférences diverses des utilisateurs.
|
29 |
Efficient and Reliable In-Network Query Processing in Wireless Sensor NetworksMalhotra, Baljeet Singh 11 1900 (has links)
The Wireless Sensor Networks (WSNs) have emerged as a new paradigm for collecting and processing data from physical environments, such as wild life sanctuaries, large warehouses, and battlefields. Users can access sensor data by issuing queries over the network, e.g., to find what are the 10 highest temperature values in the network. Typically, a WSN operates by constructing a logical topology, such as a spanning tree, built on top of the physical topology of the network. The constructed logical topology is then used to disseminate queries in the network, and also to process and return the results of such queries back to the user. A major challenge in this context is prolonging the network's lifetime that mainly depends on the energy cost of data communication via wireless radios, which is known to be very expensive as compared to the cost of data processing within the network.
In this research, we investigate some of the core problems that deal with the different aspects of in-network query processing in WSNs. In that context, we propose an efficient filtering based algorithm for the top-k query processing in WSNs. Through a systematic study of the top-k query processing in WSNs we propose several solutions in this thesis, which are applicable not only to the top-k queries, but also to in-network query processing problems in general. Specifically, we consider broadcasting and convergecasting, which are two basic operations that are required by many in-network query processing solutions.
Scheduling broadcasting and convergecasting is another problem that is important for energy efficiency in WSNs. Failure of communication links, which are common in WSNs, is yet another important issue that needs to be addressed.
In this research, we take a holistic approach to deal with the above problems while processing the top-k queries in WSNs. To this end, the thesis makes several contributions. In particular, our proposed solutions include new logical topologies, scheduling algorithms, and an overall sophisticated communication framework, which allows to process the top-k queries efficiently and with increased reliability. Extensive simulation studies reveal that
our solutions are not only energy efficient, saving up to 50% of the energy cost as compared to the current state-of-the-art solutions, but they are also robust to link failures.
|
30 |
Ranked Retrieval in Uncertain and Probabilistic DatabasesSoliman, Mohamed January 2011 (has links)
Ranking queries are widely used in data exploration, data analysis and decision
making scenarios. While most of the currently proposed ranking techniques focus
on deterministic data, several emerging applications involve data that are imprecise
or uncertain. Ranking uncertain data raises new challenges in query semantics and
processing, making conventional methods inapplicable. Furthermore, the interplay
between ranking and uncertainty models introduces new dimensions for ordering query
results that do not exist in the traditional settings.
This dissertation introduces new formulations and processing techniques for ranking queries on uncertain data. The formulations are based on marriage of traditional ranking semantics with possible worlds semantics under widely-adopted uncertainty models. In particular, we focus on studying the impact of tuple-level and attribute-level uncertainty on the semantics and processing techniques of ranking queries.
Under the tuple-level uncertainty model, we introduce a processing framework leveraging the capabilities of relational database systems to recognize and handle data
uncertainty in score-based ranking. The framework encapsulates a state space model,
and efficient search algorithms that compute query answers by lazily materializing the
necessary parts of the space. Under the attribute-level uncertainty model, we give a new probabilistic ranking model, based on partial orders, to encapsulate the space of possible rankings originating from uncertainty in attribute values. We present a set of efficient query evaluation algorithms, including sampling-based techniques based on the theory of Markov chains and Monte-Carlo method, to compute query answers.
We build on our techniques for ranking under attribute-level uncertainty to support
rank join queries on uncertain data. We show how to extend current rank join methods
to handle uncertainty in scoring attributes. We provide a pipelined query operator
implementation of uncertainty-aware rank join algorithm integrated with sampling
techniques to compute query answers.
|
Page generated in 0.0567 seconds