Global ETD Search

11	A C++ Distributed Database Select-project-join Queryprocessor On A Hpc Cluster Ceran, Erhan 01 May 2012 (has links) (PDF) High performance computer clusters have become popular as they are more scalable, affordable and reliable than their centralized counterparts. Database management systems are particularly suitable for distributed architectures / however distributed DBMS are still not used widely because of the design difficulties. In this study, we aim to help overcome these difficulties by implementing a simulation testbed for a distributed query plan processor. This testbed works on our departmental HPC cluster machine and is able to perform select, project and join operations. A data generation module has also been implemented which preserves the foreign key and primary key constraints in the database schema. The testbed has capability to measure, simulate and estimate the response time of a given query execution plan using specified communication network parameters. Extensive experimental work is performed to show the correctness of the produced results. The estimated execution time costs are also compared with the actual run-times obtained from the testbed to verify the proposed estimation functions. Thus, we make sure that these estimation iv functions can be used in distributed database query optimization and distributed database design tools. Read more QA Computer Software 76.75-76.765
12	Optimization for big joins and recursive query evaluation using intersection and difference filters in MapReduce Phan, Thuong-Cang 07 July 2014 (has links) (PDF) The information technology community has created unprecedented amount of data through large-scale applications. As a result, the Big Data is considered as gold mines of information that just wait for the processing power to be available, reliable, and apt at evaluating complex analytic algorithms. MapReduce is one of the most popular programming models designed to support such processing. It has become a standard for processing, analyzing and generating large data in a massively parallel manner. However, the MapReduce programming model suffers from severe limitations of operations beyond simple scan/grouping, particularly operations with multiple inputs. In the present dissertation we efficiently investigate and optimize the evaluation, in a MapReduce environment, of one of the most salient and representative such operations: Join. It focuses not only on two-way joins, but also complex joins such as multi-way joins and recursive joins. To achieve these objectives, we first devise a new type of filter called intersection filter using a probabilistic model to represent an approximation of the set intersection. The intersection filter is then applied to two-way join operations to eliminate most non-joining elements in input datasets before sending data to actual join processing. In addition, we make an extension of the intersection filter to improve the performance of three-way joins and chain joins including both cyclic chain joins with many shared join keys. We use the Lagrangian multiplier method to indicate a good choice between our optimized solutions for the multi-way joins. Another important proposal is a difference filter, which is a probabilistic data structure designed to represent a set and examine disjoint elements of the set. It can be applied to a wide range of popular problems such as reconciliation, deduplication, error-correction, especially a recursive join operation. A recursive join using the difference filter is implemented as an iteration of one join job instead of two jobs including a join job and a difference job. This improvement will significantly reduce the number of executed jobs by half, and the related overheads such as data rescanning, intermediate data, and communication for the deduplication and difference operations. Besides, this research also improves the general semi-naive algorithm, as well as the evaluation of recursive queries in MapReduce. We then provide general cost models for two-way joins, multi-way joins, and recursive joins. Thanks to these cost models, we can make comparisons of the join algorithms more persuasive. As a result, with using the proposed filters, the join operations can minimize disk I/O and communication costs. Moreover, the intersection filter-based join operations are demonstrated to be more efficient than existing solutions through experimental evaluations. Experimental comparisons of different algorithms for joins are examined with respect to intermediate data amount, the total output amount, the total execution time, and especially task timelines. Finally, our improvements on the join operations contribute to the global scene of optimizing data management for MapReduce applications on large-scale distributed infrastructures. Read more [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Big data MapReduce Bloom filter Join Recursive query evaluation Optimization
13	Grid-aware evaluation of regular path queries on large Spatial networks Miao, Zhuo 20 August 2007 (has links) Regular path queries (RPQs), expressed as regular expressions over the alphabet of database edge-labels, are commonly used for guided navigation of graph databases. RPQs are the basic building block of almost all the query languages for graph databases, providing the user with a nice and simple way to express recursion. While convenient to use, RPQs are notorious for their high computational demand. Except for few theoretical works, there has been little work evaluating RPQs on databases of great practical interest, such as large spatial networks. In this thesis, we present a grid-aware, fault tolerant distributed algorithm for answering RPQs on spatial networks. We engineer each part of the algorithm to account for the assumed computational-grid setting. We experimentally evaluate our algorithm, and show that for typical user queries, our algorithm satisfies the desiderata for distributed computing in general, and computational-grids in particular. Distributed Query Evaluation Regular Path Queries Grid Spatial databases Road networks
14	Optimisation des requêtes distribuées par apprentissage / Learning-based distributed query optimization Martinez Medina, Lourdes 07 January 2014 (has links) Les systèmes de gestion de données distribuées deviennent de plus en plus complexes. Ils interagissent avec des réseaux de dispositifs fixes et/ou mobiles, tels que des smartphones ou des tablettes, dispositifs hétérogènes, autonomes et possédant des limitations physiques. Ces dispositifs exécutent des applications permettant l'interaction des usagers (i.e. jeux virtuels, réseaux sociaux). Ces applications produisent et consomment des données à tout moment voire même en continu. Les caractéristiques de ces systèmes ajoutent des dimensions au problème de l'optimisation de requêtes, telles que la variabilité des objectifs d'optimisation, l'absence d'information sur les données (métadonnées) ou le manque d'une vision globale du système. Les techniques traditionnelles d'optimisation des requêtes n'abordent pas (ou très peu) les systèmes autonomes. Elles se basent sur les métadonnées et font des hypothèses très fortes sur le comportement du système. En plus, la majorité de ces techniques d'optimisation ciblent uniquement l'optimisation du temps d'exécution. La difficulté d'évaluation des requêtes dans les applications modernes incite à revisiter les techniques traditionnelles d'optimisation. Cette thèse fait face aux défis décris précédemment par l'adaptation du paradigme du Raisonnement à partir de cas (CBR pour Case-Based Reasoning) au problème de l'optimisation des requêtes. Cette adaptation, associée à une exploration pseudo-aléatoire de l'espace de solutions fournit un moyen pour optimiser des requêtes dans les contextes possédant très peu voire aucune information sur les données. Cette approche se concentre sur l'optimisation de requêtes en utilisant les cas générés précédemment dans l'évaluation de requêtes similaires. Un cas de requête et composé par : (i) la requête (le problème), (ii) le plan d'exécution (la solution) et (iii) les mesures de ressources utilisés par l'exécution du plan (l'évaluation de la solution). Cette thèse aborde également la façon que le processus CBR interagit avec le processus de génération de plan d'exécution de la requête qui doit permettre d'explorer l'espace des solutions. Ce processus utilise les heuristiques classiques et prennent des décisions de façon aléatoire lorsque les métadonnées viennent à manquer (e.g. pour l'ordre des jointures, la sélection des algorithmes, voire même le choix des protocoles d'acheminement de messages). Ce processus exploite également le CBR pour générer des plans pour des sous-requêtes, accélérant ainsi l'apprentissage de nouveaux cas. Les propositions de cette thèse ont été validées à l'aide du prototype CoBRA développé dans le contexte du projet UBIQUEST. / Distributed data systems are becoming increasingly complex. They interconnect devices (e.g. smartphones, tablets, etc.) that are heterogeneous, autonomous, either static or mobile, and with physical limitations. Such devices run applications (e.g. virtual games, social networks, etc.) for the online interaction of users producing / consuming data on demand or continuously. The characteristics of these systems add new dimensions to the query optimization problem, such as multi-optimization criteria, scarce information on data, lack of global system view, among others. Traditional query optimization techniques focus on semi (or not at all) autonomous systems. They rely on information about data and make strong assumptions about the system behavior. Moreover, most of these techniques are centered on the optimization of execution time only. The difficulty for evaluating queries efficiently on nowadays applications motivates this work to revisit traditional query optimization techniques. This thesis faces these challenges by adapting the Case Based Reasoning (CBR) paradigm to query processing, providing a way to optimize queries when there is no prior knowledge of data. It focuses on optimizing queries using cases generated from the evaluation of similar past queries. A query case comprises: (i) the query, (ii) the query plan and (iii) the measures (computational resources consumed) of the query plan. The thesis also concerns the way the CBR process interacts with the query plan generation process. This process uses classical heuristics and makes decisions randomly (e.g. when there are no statistics for join ordering and selection of algorithms, routing protocols). It also (re)uses cases (existing query plans) for similar queries parts, improving the query optimization, and therefore evaluation efficiency. The propositions of this thesis have been validated within the CoBRa optimizer developed in the context of the UBIQUEST project . Read more Bases de données Evaluation et optimisation de requêtes Réseaux multi-hop Apprentissage Data bases Query evaluation and optimization Multi-hop networks Machine learning 004
15	Declarative parallel query processing on large scale astronomical databases / Traitement parallèle et déclaratif de requêtes sur des masses de données issues d'observations astronomiques Mesmoudi, Amin 03 December 2015 (has links) Les travaux de cette thèse s'inscrivent dans le cadre du projet Petasky. Notre objectif est de proposer des outils permettant de gérer des dizaines de Peta-octets de données issues d'observations astronomiques. Nos travaux se focalisent essentiellement sur la conception des nouveaux systèmes permettant de garantir le passage à l'échelle. Dans cette thèse, nos contributions concernent trois aspects : Benchmarking des systèmes existants, conception d'un nouveau système et optimisation du système. Nous avons commencé par analyser la capacité des systèmes fondés sur le modèle MapReduce et supportant SQL à gérer les données LSST et leurs capacités d'optimisation de certains types de requêtes. Nous avons pu constater qu'il n'y a pas de technique « magique » pour partitionner, stocker et indexer les données mais l'efficacité des techniques dédiées dépend essentiellement du type de requête et de la typologie des données considérées. Suite à notre travail de Benchmarking, nous avons retenu quelques techniques qui doivent être intégrées dans un système de gestion de données à large échelle. Nous avons conçu un nouveau système de façon à garantir la capacité dudit système à supporter plusieurs mécanismes de partitionnement et plusieurs opérateurs d'évaluation. Nous avons utilisé BSP (Bulk Synchronous Parallel) comme modèle de calcul. Les données sont représentées logiquement par des graphes. L'évaluation des requêtes est donc faite en explorant le graphe de données en utilisant les arcs entrants et les arcs sortants. Les premières expérimentations ont montré que notre approche permet une amélioration significative des performances par rapport aux systèmes Map/Reduce / This work is carried out in framework of the PetaSky project. The objective of this project is to provide a set of tools allowing to manage Peta-bytes of data from astronomical observations. Our work is concerned with the design of a scalable approach. We first started by analyzing the ability of MapReduce based systems and supporting SQL to manage the LSST data and ensure optimization capabilities for certain types of queries. We analyzed the impact of data partitioning, indexing and compression on query performance. From our experiments, it follows that there is no “magic” technique to partition, store and index data but the efficiency of dedicated techniques depends mainly on the type of queries and the typology of data that are considered. Based on our work on benchmarking, we identified some techniques to be integrated to large-scale data management systems. We designed a new system allowing to support multiple partitioning mechanisms and several evaluation operators. We used the BSP (Bulk Synchronous Parallel) model as a parallel computation paradigm. Unlike MapeReduce model, we send intermediate results to workers that can continue their processing. Data is logically represented as a graph. The evaluation of queries is performed by exploring the data graph using forward and backward edges. We also offer a semi-automatic partitioning approach, i.e., we provide the system administrator with a set of tools allowing her/him to choose the manner of partitioning data using the schema of the database and domain knowledge. The first experiments show that our approach provides a significant performance improvement with respect to Map/Reduce systems Read more Benchmarking Passage à échelle Traitement parallèle Optimisation Graphes Astronomie Benchmark Scalability Parallel query evaluation Query optimization Graphs Astronomy 004.21
16	SCINTRA: A Model for Quantifying Inconsistencies in Grid-Organized Sensor Database Systems Schlesinger, Lutz, Lehner, Wolfgang 12 January 2023 (has links) Sensor data sets are usually collected in a centralized sensor database system or replicated cached in a distributed system to speed up query evaluation. However, a high data refresh rate disallows the usage of traditional replicated approaches with its strong consistency property. Instead we propose a combination of grid computing technology with sensor database systems. Each node holds cached data of other grid members. Since cached information may become stale fast, the access to outdated data may sometimes be acceptable if the user has knowledge about the degree of inconsistency if unsynchronized data are combined. The contribution of this paper is the presentation and discussion of a model for describing inconsistencies in grid organized sensor database systems. info:eu-repo/classification/ddc/004 ddc:004
17	Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates Idris, Muhammad 10 April 2019 (has links) Responsive analytics are rapidly taking over the traditional data analytics dominated by the post-fact approaches in traditional data warehousing. Recent advancements in analytics demand placing analytical engines at the forefront of the system to react to updates occurring at high speed and detect patterns, trends and anomalies. These kinds of solutions find applications in Financial Systems, Industrial Control Systems, Business Intelligence and on-line Machine Learning among others. These applications are usually associated with Big Data and require the ability to react to constantly changing data in order to obtain timely insights and take proactive measures. Generally, these systems specify the analytical results or their basic elements in a query language, where the main task then is to maintain these results under frequent updates efficiently. The task of reacting to updates and analyzing changing data has been addressed in two ways in the literature: traditional business intelligence (BI) solutions focus on historical data analysis where the data is refreshed periodically and in batches, and stream processing solutions process streams of data from transient sources as flow (or set of flows) of data items. Both kinds of systems share the niche of reacting to updates (known as dynamic evaluation); however, they differ in architecture, query languages, and processing mechanisms. In this thesis, we investigate the possibility of a reactive and unified framework to model queries that appear in both kinds of systems. In traditional BI solutions, evaluating queries under updates has been studied under the umbrella of incremental evaluation of updates that is based on relational incremental view maintenance model and mostly focus on queries that feature equi-joins. Streaming systems, in contrast, generally follow the automaton based models to evaluate queries under updates, and they generally process queries that mostly feature comparisons of temporal attributes (e.g., timestamp attributes) along-with comparisons of non-temporal attributes over streams of bounded sizes. Temporal comparisons constitute inequality constraints, while non-temporal comparisons can either be equality or inequality constraints, hence these systems mostly process inequality joins. As starting point, we postulate the thesis that queries in streaming systems can also be evaluated efficiently based on the paradigm of incremental evaluation just like in BI systems in a main-memory model. The efficiency of such a model is measured in terms of runtime memory footprint and the update processing cost. To this end, the existing approaches of dynamic evaluation in both kind of systems present a trade-off between memory footprint and the update processing cost. More specifically, systems that avoid materialization of query (sub) results incur high update latency and systems that materialize (sub) results incur high memory footprint. We are interested in investigating the possibility to build a model that can address this trade-off. In particular, we overcome this trade-off by investigating the possibility of practical dynamic evaluation algorithm for queries that appear in both kinds of systems, and present a main-memory data representation that allows to enumerate query (sub) results without materialization and can be maintained efficiently under updates. We call this representation the Dynamic Constant Delay Linear Representation (DCLR). We devise DCLRs with the following properties: 1) they allow, without materialization, enumeration of query results with bounded-delay (and with constant delay for a sub-class of queries); 2) they allow tuple lookup in query results with logarithmic delay (and with constant delay for conjunctive queries with equi-joins only); 3) they take space linear in the size of the database; 4) they can be maintained efficiently under updates. We first study the DCLRs with the above-described properties for the class of acyclic conjunctive queries featuring equi-joins with projections and present the dynamic evaluation algorithm. Then, we present the generalization of thiw algorithm to the class of acyclic queries featuring multi-way theta-joins with projections. We devise DCLRs with the above properties for acyclic conjunctive queries, and the working of dynamic algorithms over DCLRs is based on a particular variant of join trees, called the Generalized Join Trees (GJTs) that guarantee the above-described properties of DCLRs. We define GJTs and present the algorithms to test a conjunctive query featuring theta-joins for acyclicity and to generate GJTs for such queries. To do this, we extend the classical GYO algorithm from testing a conjunctive query with equalities for acyclicity to test a conjunctive query featuring multi-way theta-joins with projections for acyclicity. We further extend the GYO algorithm to generate GJTs for queries that are acyclic. We implemented our algorithms in a query compiler that takes as input the SQL queries and generates Scala executable code – a trigger program to process queries and maintain under updates. We tested our approach against state of the art main-memory BI and CEP systems. Our evaluation results have shown that our DCLRs based approach is over an order of magnitude efficient than existing systems for both memory footprint and update processing cost. We have also shown that the enumeration of query results without materialization in DCLRs is comparable (and in some cases efficient) as compared to enumerating from materialized query results. Read more info:eu-repo/classification/ddc/004 ddc:004
18	Traitement de requêtes SPARQL sur des données liées / SPARQL distributed query processing over linked data Macina, Abdoul 17 December 2018 (has links) De plus en plus de sources de données liées sont publiées à travers le Web en s'appuyant sur les technologies du Web sémantique, formant ainsi un large réseau de données distribuées. Cependant il est difficile pour les consommateurs de données de profiter de la richesse de ces données, compte tenu de leur distribution, de l'augmentation de leur volume et de l'autonomie des sources de données. Les moteurs fédérateurs de données permettent d'interroger ces sources de données en utilisant des techniques de traitement de requêtes distribuées. Cependant, une mise en œuvre naïve de ces techniques peut générer un nombre considérable de requêtes distantes et de nombreux résultats intermédiaires entraînant ainsi un long temps de traitement des requêtes et des communications réseau coûteuse. Par ailleurs, la sémantique des requêtes distribuées est souvent ignorée. L'expressivité des requêtes, le partitionnement des données et leur réplication sont d'autres défis auxquels doivent faire face les moteurs de requêtes. Pour répondre à ces défis, nous avons d'abord proposé une sémantique des requêtes distribuées compatible avec les standards SPARQL et RDF qui préserve l’expressivité de SPARQL. Nous avons ensuite présenté plusieurs stratégies d'optimisation pour un moteur de requêtes fédérées qui interroge de manière transparente des sources de données distribuées. La performance de ces optimisations est évaluée sur une implémentation d’un moteur de requêtes distribuées SPARQL / Driven by the Semantic Web standards, an increasing number of RDF data sources are published and connected over the Web by data providers, leading to a large distributed linked data network. However, exploiting the wealth of these data sources is very challenging for data consumers considering the data distribution, their volume growth and data sources autonomy. In the Linked Data context, federation engines allow querying these distributed data sources by relying on Distributed Query Processing (DQP) techniques. Nevertheless, a naive implementation of the DQP approach may generate a tremendous number of remote requests towards data sources and numerous intermediate results, thus leading to costly network communications. Furthermore, the distributed query semantics is often overlooked. Query expressiveness, data partitioning, and data replication are other challenges to be taken into account. To address these challenges, we first proposed in this thesis a SPARQL and RDF compliant Distributed Query Processing semantics which preserves the SPARQL language expressiveness. Afterwards, we presented several strategies for a federated query engine that transparently addresses distributed data sources, while managing data partitioning, query results completeness, data replication, and query processing performance. We implemented and evaluated our approach and optimization strategies in a federated query engine to prove their effectiveness. Read more Web sémantique Web de données Données liées Données ouvertes liées Traitement de requêtes distribuées Évaluation des requêtes fédérées SPARQL Semantic Web Web of data Linked data Linked open data Distributed query processing Federated query evaluation SPARQL

Search results