51

Heterogeneity-Aware Placement Strategies for Query Optimization

Karnagel, Tomas, 23 May 2017
Computing hardware is changing from systems with homogeneous CPUs to systems with heterogeneous computing units like GPUs, Many Integrated Cores, or FPGAs. This trend is driven by the scaling problems of homogeneous systems, where heat dissipation and energy consumption limit further growth in compute performance. Heterogeneous systems provide differently optimized computing hardware, allowing each operation to be computed on the most appropriate computing unit, resulting in faster execution and lower energy consumption. For database systems, this is a new opportunity to accelerate query processing, allowing faster and more interactive querying of large amounts of data. However, the current hardware trend is also a challenge, as most database systems do not support heterogeneous computing resources and it is not clear how best to support them. In the past, mainly single operators were ported to different computing units with great results, but a system-wide application was missing. To support heterogeneous systems efficiently, a system-wide approach to query processing and query optimization is needed.

In this thesis, we tackle the optimization challenge in detail. As a starting point, we evaluate three different approaches on isolated use cases to assess their advantages and limitations. First, we evaluate a fork-join approach of intra-operator parallelism, where the same operator is executed on multiple computing units at the same time, each execution working on a different data partition. Second, we evaluate statically assigning one computing unit to accelerate one operator, which offers high code-optimization potential because the usage of hardware and software is static and known in advance. Third, we evaluate dynamically placing operators onto computing units, depending on the operator, the available computing hardware, and the given data sizes. We argue that the first and second approaches suffer from multiple overheads or high implementation costs. The third approach, dynamic placement, shows good performance while being highly extensible to different computing units and different operator implementations.

To automate this dynamic approach, we first propose general placement optimization for query processing. This general approach includes runtime estimation of operators on different computing units as well as two approaches for deciding the actual operator placement according to the estimated runtimes. The two placement approaches are local optimization, which decides the placement locally at run time, and global optimization, where the placement is decided at compile time, allowing a global view for enhanced data sharing. The main limitation of the latter is its strong dependence on cardinality estimation of intermediate results, as estimation errors for the cardinalities propagate to the operator runtime estimation and placement optimization. Therefore, we propose adaptive placement optimization, which makes the placement optimization fully independent of cardinality estimation, effectively eliminating the main source of inaccuracy for runtime estimation and placement optimization. Finally, we define an adaptive placement sequence incorporating all our proposed placement-optimization techniques. We implement this sequence as a virtualization layer between the database system and the heterogeneous hardware. Our implementation builds on preexisting interfaces to the database system and the hardware, allowing non-intrusive integration into existing database systems. We evaluate our techniques using two different database systems and two different OLAP benchmarks, accelerating query processing through heterogeneous execution.
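To make the local-optimization idea concrete, here is a minimal sketch of placing an operator on the computing unit with the lowest estimated runtime. All names and cost constants (ComputingUnit, the startup/transfer/throughput model) are illustrative assumptions, not the thesis's actual interfaces or cost model.

```python
# Local placement optimization: at run time, place each operator on the
# computing unit whose estimated runtime for the given input is lowest.
from dataclasses import dataclass

@dataclass
class ComputingUnit:
    name: str
    startup: float        # fixed launch/transfer latency in ms (assumed)
    transfer_cost: float  # ms per MB moved onto the unit (assumed)
    throughput: float     # tuples processed per ms (assumed)

def estimate_runtime(unit: ComputingUnit, tuples: int, mb: float) -> float:
    """Toy linear runtime model: startup + data transfer + processing."""
    return unit.startup + unit.transfer_cost * mb + tuples / unit.throughput

def place_operator(units, tuples, mb):
    """Pick the unit minimizing the estimated runtime for this operator."""
    return min(units, key=lambda u: estimate_runtime(u, tuples, mb))

units = [
    ComputingUnit("CPU", startup=0.0, transfer_cost=0.0, throughput=500.0),
    ComputingUnit("GPU", startup=5.0, transfer_cost=2.0, throughput=5000.0),
]

# A small input favors the CPU (no transfer); a large one amortizes the
# GPU's startup and transfer costs.
print(place_operator(units, tuples=1_000, mb=0.1).name)          # CPU
print(place_operator(units, tuples=50_000_000, mb=4000.0).name)  # GPU
```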
52

Efficient exploitation of similar subexpressions for query processing

Zhou, Jingren; Larson, Per-Ake; Freytag, Johann Christoph; Lehner, Wolfgang, 13 December 2022
Complex queries often contain common or similar subexpressions, either within a single query or among multiple queries submitted as a batch. If so, query execution time can be improved by evaluating a common subexpression once and reusing the result in multiple places. However, current query optimizers do not recognize and exploit similar subexpressions, even within the same query. We present an efficient, scalable, and principled solution to this long-standing optimization problem. We introduce a lightweight and effective mechanism to detect potential sharing opportunities among expressions. Candidate covering subexpressions are constructed, and optimization is resumed to determine which, if any, such subexpressions to include in the final query plan. The chosen subexpression(s) are computed only once and the results are reused to answer other parts of the query or batch. Our solution automatically applies to the optimization of query batches, nested queries, and maintenance of multiple materialized views. It is the first comprehensive solution covering all aspects of the problem: detection, construction, and cost-based optimization. Experiments on Microsoft SQL Server show significant performance improvements with minimal overhead.
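A minimal sketch of the detection idea: assign every subexpression a canonical signature and group subexpressions by signature, so that groups of size greater than one become candidates for a shared covering subexpression. The signature choice (the set of base tables touched) and the tuple-based plan representation are illustrative assumptions, not the paper's actual mechanism.

```python
# Detect sharing candidates among operator trees by grouping subexpressions
# with identical signatures (here: the set of base tables they touch).
from collections import defaultdict

def signature(expr):
    """Canonical signature of an operator tree: frozenset of its base tables."""
    if expr[0] == "scan":                 # ("scan", table)
        return frozenset([expr[1]])
    tables = frozenset()
    for child in expr[2:]:                # ("join"/"select", predicate, *children)
        tables |= signature(child)
    return tables

def sharing_candidates(roots):
    groups = defaultdict(list)
    def walk(expr):
        if expr[0] != "scan":
            groups[signature(expr)].append(expr)
            for child in expr[2:]:
                walk(child)
    for root in roots:
        walk(root)
    return {sig: exprs for sig, exprs in groups.items() if len(exprs) > 1}

# Two queries with similar (not identical) subexpressions over A join B.
q1 = ("join", "a.k=b.k", ("scan", "A"), ("select", "b.x>5", ("scan", "B")))
q2 = ("join", "a.k=b.k", ("scan", "A"), ("select", "b.x>9", ("scan", "B")))
for sig, exprs in sharing_candidates([q1, q2]).items():
    print(sorted(sig), "->", len(exprs), "similar subexpressions")
```

A covering subexpression for the group over {A, B} would relax the differing predicates (e.g., b.x > 5), be computed once, and be filtered in each consumer.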
53

Scalable view-based techniques for web data: algorithms and systems

Katsifodimos, Asterios, 03 July 2013
XML was recommended by the W3C in 1998 as a markup language for device- and system-independent representation of information, and it is nowadays used as a data model for storing and querying large volumes of data in database systems. In spite of significant research and systems development, processing very large amounts of XML data still raises performance problems, due to the complexity and heterogeneity of the data and the complexity of current XML query languages. Materialized views have long been used in databases to speed up queries: they can be seen as precomputed query results that are reused to avoid (fully or partially) recomputing a new query, and they have been a topic of intensive research, in particular in the context of relational data warehousing. This thesis investigates the applicability of materialized-view techniques to optimize the performance of Web data management tools, in particular in distributed settings, considering XML data and queries. We make three contributions.

First, we consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view-selection problem for a rich subset of XQuery, enriched with the possibility of selecting multiple nodes at multiple levels of granularity. The challenges stem from the expressive power and features of both the query and view languages and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.

Second, we consider the management of large XML corpora in peer-to-peer networks based on distributed hash tables (DHTs). We consider the ViP2P platform, in which distributed materialized XML views, defined by arbitrary XML queries, are filled in with data published anywhere in the network and exploited to efficiently answer queries issued by any network peer. This thesis contributed important scalability-oriented optimizations and characterized the system's performance through a comprehensive set of experiments deployed in a country-wide WAN. These experiments exceed similar competitor systems by orders of magnitude in terms of data volumes and data-dissemination throughput; to date, this is the most complete study of a fully deployed DHT-based XML content-management platform tested at real scale.

Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub) in the presence of constraints on the available CPU and network resources of data publishers, implemented within our Delta platform. We achieve scalability by off-loading subscriptions from the publisher and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm that organizes the views in a multi-level dissemination network; this network is computed using linear-programming tools so as to scale to large numbers of views, respect the system's capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments, including a real deployment in a WAN.
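To illustrate the view-selection flavor of the first contribution, here is a toy greedy heuristic: repeatedly materialize the candidate view with the best benefit-per-unit-of-storage until the space budget is exhausted. The candidate views, their sizes, and their workload savings are invented for illustration; the thesis's actual algorithm operates over a rich XQuery subset and a much larger search space.

```python
# Greedy view selection under a space budget: pick views by descending
# (workload saving / storage size) ratio while they still fit.
def select_views(candidates, budget):
    """candidates: dict view -> (size, workload_cost_saving)."""
    chosen, used = [], 0
    remaining = dict(candidates)
    while remaining:
        # Best saving per unit of storage among the remaining candidates.
        view, (size, saving) = max(remaining.items(),
                                   key=lambda kv: kv[1][1] / kv[1][0])
        del remaining[view]
        if used + size <= budget:
            chosen.append(view)
            used += size
    return chosen, used

candidates = {  # (size, saving) pairs are assumed numbers
    "v1: //order[date>'2012']": (40, 900),
    "v2: //customer/name":      (10, 300),
    "v3: //item//price":        (70, 1000),
}
views, used = select_views(candidates, budget=100)
print(views, "using", used, "space units")  # picks v2 then v1; v3 no longer fits
```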
54

Algorithms for XML stream processing: massive data, external memory and scalable performance

Alrammal, Muath, 16 May 2011
Many modern applications require processing massive streams of XML data, which creates difficult technical challenges. Among these are the design and implementation of tools to optimize the processing of XPath queries and to provide an accurate cost estimate for these queries when processed on a massive stream of XML data. In this thesis, we propose a novel performance-prediction model which a priori estimates the cost (in terms of space used and time spent) of any structural query belonging to Forward XPath. In doing so, we perform an experimental study that confirms the linear relationship between stream processing and data-access resources, and we accordingly introduce a mathematical model (linear regression functions) to predict the cost of a given XPath query. Moreover, we introduce a new selectivity-estimation technique. It consists of two elements: the path-tree synopsis, a concise, accurate, and convenient summary of the structure of an XML document; and the selectivity-estimation algorithm, an efficient stream-querying algorithm that traverses the path-tree synopsis to estimate the values of the cost parameters. Those parameters are used by the mathematical model to determine the cost of a given XPath query. We compare the performance of our model with existing approaches. Furthermore, we present a use case for an online stream-querying system: the system uses our performance-prediction model to estimate the cost of a given XPath query in terms of time and memory, and provides an accurate answer to the query's sender. This use case illustrates the practical advantages of performance management with our techniques.
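A minimal sketch of the prediction idea: fit a linear model that maps cost parameters (here two invented features, nodes scanned and matches buffered) to observed query cost. The features and training data are assumptions for illustration; the thesis derives its cost parameters from the path-tree synopsis for Forward XPath structural queries.

```python
# Linear regression cost model: learn y ≈ a*scanned + b*buffered + c from
# observed executions, then predict the cost of an unseen query.
import numpy as np

# Observed (nodes_scanned, matches_buffered) -> elapsed time in ms (toy data).
X = np.array([[1e4, 100.0], [5e4, 900.0], [2e5, 2500.0], [1e6, 8000.0]])
y = np.array([12.0, 55.0, 210.0, 1050.0])

# Least-squares fit with an intercept column.
A = np.column_stack([X, np.ones(len(X))])
(a, b, c), *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_cost(nodes_scanned, matches_buffered):
    """A-priori cost estimate for a query with the given parameter values."""
    return a * nodes_scanned + b * matches_buffered + c

print(f"predicted cost: {predict_cost(4e5, 3000):.1f} ms")
```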
55

ADI: A NoSQL system for bi-temporal databases

Ait Ouassarah, Azhar, 23 May 2016
Nowadays, every company operates in a very dynamic and complex environment, which requires its managers to have a deep understanding of the business in order to take rapid and relevant decisions, and thus maintain or improve the company's activities. They can rely on analyzing the deluge of data generated by those activities. A new class of decision-support systems, called Operational Intelligence (OI), has emerged to meet this challenge: its objective is to enable operational managers to understand what happened in the past as well as what is currently happening in their business. In this context, the notions of time and traceability play a crucial role. In this thesis, we present Axway Decision Insight (ADI), an Operational Intelligence solution developed by Axway. ADI's key component is a proprietary bi-temporal, column-oriented DBMS that has been specially designed to meet OI requirements. Its bi-temporal capabilities natively capture both the evolution of data in the modeled reality (valid time) and its evolution in the database (transaction time). We first introduce ADI, focusing on two topics: (1) the GUI that makes the platform code-free, and (2) the adopted bi-temporal modeling approaches. We then propose a performance benchmark that meets ADI's requirements.
Next, we present two bi-temporal query optimizations for ADI. The first redefines a complex bi-temporal query into (1) a set of continuous queries that compute aggregation operations as data is collected, and (2) a bi-temporal query that accesses the continuous queries' results and feeds the GUI, reducing the time needed to refresh ADI's GUI. The second is a cost-based optimization that orders the join operators of a query plan using statistics on bi-temporal data to determine an "optimal" plan. For both optimizations, we conducted experiments using our benchmark that demonstrate their benefits.
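A minimal sketch of the bi-temporal lookup the abstract describes: every fact carries a valid-time interval (when it was true in the modeled reality) and a transaction-time interval (when the database believed it), and an "as of" query fixes both dimensions. The record layout is an illustrative assumption, not ADI's internal column-oriented storage format.

```python
# Bi-temporal versions and an "as of" lookup over both time dimensions.
from dataclasses import dataclass

INF = float("inf")

@dataclass
class Version:
    value: str
    valid_from: float
    valid_to: float   # exclusive; INF = still valid in the modeled reality
    tx_from: float
    tx_to: float      # exclusive; INF = current database belief

def as_of(versions, valid_at, tx_at):
    """Value true at `valid_at`, as the database knew it at `tx_at`."""
    for v in versions:
        if v.valid_from <= valid_at < v.valid_to and v.tx_from <= tx_at < v.tx_to:
            return v.value
    return None

history = [
    Version("pending",   valid_from=1, valid_to=5,   tx_from=1, tx_to=INF),
    Version("shipped",   valid_from=5, valid_to=INF, tx_from=6, tx_to=9),   # later corrected
    Version("delivered", valid_from=5, valid_to=INF, tx_from=9, tx_to=INF),
]

print(as_of(history, valid_at=7, tx_at=8))   # shipped   (what we believed then)
print(as_of(history, valid_at=7, tx_at=10))  # delivered (the corrected belief)
```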
56

Optimizing similarity queries in metric spaces meeting user's expectation

Ferreira, Mônica Ribeiro Porto, 22 October 2012
The complexity of data stored in large databases has increased at a very fast pace. Hence, operations more elaborate than traditional queries are essential to extract all the required information from the database, and the interest of the database community in similarity search has grown significantly. Two well-known types of similarity search are the range query (R_q) and the k-nearest-neighbor query (kNN_q), which, like traditional queries, can be sped up by the indexing structures of the Database Management System (DBMS). Another way of speeding up queries is query optimization, a process in which metrics about the data are collected and employed to adjust the parameters of the search algorithms in each query execution. However, although the integration of similarity search into DBMSs has begun to be studied in depth only recently, query optimization has so far been developed and employed only to answer traditional queries. The execution of similarity queries, even using efficient indexing structures, tends to incur higher computational cost than the execution of traditional ones. Two strategies can be applied to speed up the execution of any query, and they are therefore also worth employing for similarity queries. The first is query rewriting based on algebraic properties and cost functions. The second applies factors external to the query, such as the semantics expected by the user, to prune the answer space. This thesis aims at contributing to the development of novel techniques to improve similarity-based query optimization, exploiting both algebraic properties and semantic restrictions as query refinements.
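For concreteness, here is a minimal sketch of the two query types over a metric space: a range query R_q(center, radius) and a k-nearest-neighbor query kNN_q(center, k), implemented as plain linear scans. Real systems answer these through metric index structures; the dataset and the Euclidean metric below are illustrative assumptions.

```python
# Range and kNN similarity queries over a metric space, by linear scan.
import heapq
import math

def euclidean(p, q):
    return math.dist(p, q)

def range_query(data, center, radius, d=euclidean):
    """R_q: all objects within `radius` of `center`."""
    return [p for p in data if d(p, center) <= radius]

def knn_query(data, center, k, d=euclidean):
    """kNN_q: the k objects closest to `center`."""
    return heapq.nsmallest(k, data, key=lambda p: d(p, center))

data = [(0, 0), (1, 1), (3, 4), (6, 8), (2, 2)]
print(range_query(data, center=(0, 0), radius=3))  # [(0, 0), (1, 1), (2, 2)]
print(knn_query(data, center=(0, 0), k=2))         # [(0, 0), (1, 1)]
```

An algebraic rewriting of the kind the thesis exploits would, for example, reorder a composition of these predicates so the cheaper, more selective one prunes the space first.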
57

AQuES

Stillger, Michael, 21 January 2000
Parallel query evaluation for relational database management systems (RDBMS) remains a highly complex problem because of the different kinds of execution parallelism and the properties of the underlying parallel architecture. In addition, changes to the system while a query is running may require dynamic behavior of the executing components in order to guarantee near-optimal response times. This thesis presents a new, flexible approach to the optimization and evaluation of complex relational queries in a distributed and dynamic environment, with particular attention to dynamic optimization. In particular, this work consists of: (1) the architecture of a new, distributed, cooperating component system inspired by the concepts of software agents; (2) the design and implementation of a communication infrastructure for the identified system components; (3) the design and implementation of a flexible query optimizer with a new, randomized algorithm; and (4) the design and implementation of a parallel query-evaluation engine that enables run-time optimization of queries. Beyond the specific requirements of RDBMS, the design emphasizes the configurability and extensibility of the distributed system and its components.
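A minimal sketch of a randomized optimizer in the spirit the abstract names: start from a random join order, repeatedly try random swaps, and keep a swap whenever it lowers the plan cost (iterative improvement). The cost function, selectivity, and relation statistics are toy assumptions; AQuES's actual algorithm and cost model are richer.

```python
# Randomized (iterative-improvement) search over left-deep join orders.
import random

CARD = {"A": 1_000, "B": 50_000, "C": 200, "D": 10_000}  # assumed cardinalities
SEL = 0.001  # flat join selectivity, purely illustrative

def plan_cost(order):
    """Sum of intermediate-result sizes for a left-deep join order."""
    size, cost = CARD[order[0]], 0
    for rel in order[1:]:
        size = size * CARD[rel] * SEL
        cost += size
    return cost

def iterative_improvement(relations, steps=1000, rng=random.Random(42)):
    order = relations[:]
    rng.shuffle(order)
    best = plan_cost(order)
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]   # try a random swap
        cost = plan_cost(order)
        if cost < best:
            best = cost                            # keep the improvement
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best

print(iterative_improvement(["A", "B", "C", "D"]))
```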
58

A Declarative Approach to Modeling and Solving the View Selection Problem

Mami, Imene, 15 November 2012
View selection is important in many data-intensive systems, e.g., commercial database and data-warehousing systems, to improve query performance. It can be defined as the process of selecting a set of views to be materialized in order to optimize query evaluation. Several related issues must be considered: whenever a data source changes, the materialized views built on it have to be maintained in order to compute up-to-date query results, and each materialized view also requires additional storage space, which must be taken into account when deciding which and how many views to materialize. The problem of choosing which views to materialize so as to speed up incoming queries, under constraints on storage overhead and/or maintenance cost, is known as the view-selection problem. It is one of the most challenging problems in data warehousing and is known to be NP-complete. In a distributed environment the problem becomes even more challenging, since it additionally involves deciding on which computer nodes the selected views should be materialized, under constraints on the storage capacity per node, the maximum global maintenance cost, and the communication cost between the nodes of the network.

In this work, we deal with the view-selection problem in a centralized context as well as in a distributed setting, and our goal is to provide a novel and efficient approach in both. For this purpose, we design a solution using constraint programming, which is known to be efficient for the resolution of NP-complete problems and is a powerful method for modeling and solving combinatorial optimization problems. The originality of our approach is that it provides a clear separation between the formulation and the resolution of the problem: the view-selection problem is modeled as a constraint satisfaction problem in an easy and declarative way, and its resolution is then performed automatically by the constraint solver. Furthermore, our approach is flexible and extensible, in that it can easily model and handle new constraints and new heuristic search strategies for optimization purposes.

The main contributions of this thesis are as follows. First, we define a framework that enables a better understanding of the problems we address, and we analyze the state of the art in materialized-view selection, reviewing existing methods and identifying their respective potentials and limits. We then design a constraint-programming solution to the view-selection problem in a centralized context. Our experimental results show that our approach provides the best balance between the computing time required for selecting the views to materialize and the gain realized in query processing by materializing them; it is also guaranteed to pick the optimal set of materialized views when no time limit is imposed.
Finally, we extend our approach to solve the view-selection problem when it is studied under multiple resource constraints in a distributed context. Through an extensive performance evaluation, we show that our approach outperforms the genetic algorithm that had been designed for a distributed setting.
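A minimal sketch of the declarative formulation: one 0/1 decision variable per candidate view, a storage-capacity constraint, and an objective combining query and maintenance costs. A real constraint solver replaces the exhaustive search used here, and all numbers are illustrative assumptions; the point is the separation between stating the model and solving it.

```python
# Declarative 0/1 model for view selection, solved by brute-force enumeration
# standing in for a constraint solver.
from itertools import product

views = ["v1", "v2", "v3"]
size        = {"v1": 30, "v2": 50, "v3": 40}   # assumed storage footprints
maintenance = {"v1": 5,  "v2": 9,  "v3": 7}    # assumed maintenance costs

def query_cost(selected):
    """Workload cost given the materialized views (toy savings table)."""
    base = 100
    savings = {"v1": 35, "v2": 55, "v3": 30}
    return base - sum(savings[v] for v in selected)

STORAGE_BUDGET = 80

best = None
for bits in product([0, 1], repeat=len(views)):          # all 0/1 assignments
    selected = [v for v, b in zip(views, bits) if b]
    if sum(size[v] for v in selected) > STORAGE_BUDGET:
        continue                                          # storage constraint violated
    objective = query_cost(selected) + sum(maintenance[v] for v in selected)
    if best is None or objective < best[0]:
        best = (objective, selected)

print(best)  # lowest combined query + maintenance cost within the budget
```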
60

Cost-based optimization of graph queries in relational database management systems

Trissl, Silke, 14 June 2012
Graphs occur in many areas of life. We are interested in graphs in biology, where nodes are chemical compounds, enzymes, reactions, or interactions that are connected by edges. Efficiently querying these graphs is a challenging task. In this thesis we present GRIcano, a system that efficiently executes graph queries. For GRIcano we assume that graphs are stored and queried using relational database management systems (RDBMS), and we propose an extended version of the Pathway Query Language (PQL) to express graph queries. The core of GRIcano is a cost-based query optimizer, and this thesis contributes to all three components such an optimizer requires: the relational algebra, implementations, and the cost model. Relational algebra operators alone are not sufficient to express graph queries, so we first present new operators for rewriting PQL queries to algebra expressions. We propose the reachability, distance, path-length, and path operators, together with rewrite rules for combining them with standard relational algebra operators. Second, we present implementations for each proposed operator. The main contribution is GRIPP, an index structure that allows us to answer reachability queries on very large graphs and that has advantages over the existing index structures we review in this work. In addition, we show how GRIPP and the recursive query strategy can be employed to provide implementations for all four proposed operators.
The third component of GRIcano is the cost model, which requires cardinality estimates for the operators and cost functions for the implementations. Based on an extensive experimental evaluation of our proposed algorithms, we present functions to estimate the cardinality of operators and the cost of executing a query. The novelty of our approach is that these functions use only key figures of the graph. We finally demonstrate the effectiveness of GRIcano using exemplary graph queries on real biological networks.
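To make the central operator concrete, here is a minimal sketch of the reachability question that an index like GRIPP accelerates: reach(G, u, v) asks whether node v can be reached from node u. It is answered below by plain BFS over an adjacency list; the tiny metabolic graph is an illustrative assumption, and GRIPP answers the same question via a pre-built index rather than by traversal at query time.

```python
# Reachability query over a directed graph, answered by BFS.
from collections import deque

def reachable(adj, src, dst):
    """True if `dst` can be reached from `src` in the graph `adj`."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Toy biological network: compounds connected through reaction steps.
adj = {"glucose": ["g6p"], "g6p": ["f6p"], "f6p": ["pyruvate"], "atp": []}
print(reachable(adj, "glucose", "pyruvate"))  # True
print(reachable(adj, "atp", "glucose"))       # False
```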
