561
Query optimization with multiple criteria in databases / Multiobjective query optimization under parametric aggregation constraints. Ρήγα, Γεωργία. 24 September 2007 (has links)
The problem of query optimization with multiple criteria in databases is a difficult and interesting research problem, because it is characterized by conflicting requirements. Each step in answering a query can be executed in more than one way. Several algorithms have been proposed for evaluating such queries, the most recent being Mariposa, M', and Generate Partitions. Mariposa and M' are applied in the Mariposa database system, which lets the user specify the desired delay/cost tradeoff for each query posed. The Mariposa algorithm follows a greedy approach, trying at each step to maximize the "benefit", whereas M' uses sets of Pareto-optimal solutions, in place of the greedy criterion, to select the next step. Finally, the Generate Partitions algorithm partitions the answer space using R-tree structures, achieving very good performance. / The optimization of queries in distributed database systems is known to be subject to delicate trade-offs. For example, the Mariposa database system allows users to specify a desired delay-cost tradeoff (that is, to supply a decreasing function u(d) specifying how much the user is willing to pay in order to receive the query results within time d). Mariposa divides a query graph into horizontal strides, analyzes each stride, and uses a greedy heuristic to find the best plan for all strides.
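As a hedged illustration of the Pareto-based selection that M' substitutes for the greedy criterion, the sketch below computes the non-dominated (cost, delay) plans for one query step; the candidate plans are hypothetical.

```python
def pareto_optimal(plans):
    """Return the plans not dominated on both cost and delay.

    A plan p is dominated if some other plan is at least as good on
    both criteria (and, the tuples being distinct, strictly better
    on at least one).
    """
    return [
        p for p in plans
        if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in plans)
    ]

# Hypothetical candidate plans for one query step, as (cost, delay):
plans = [(10, 5.0), (7, 9.0), (12, 4.0), (7, 6.0), (15, 3.5)]
print(pareto_optimal(plans))  # [(10, 5.0), (12, 4.0), (7, 6.0), (15, 3.5)]
```

A greedy strategy in the style of Mariposa would pick a single "best" plan per step; keeping the whole frontier is what lets M' defer the cost/delay trade-off to the user's utility function.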
562
Answering Object Queries over Knowledge Bases with Expressive Underlying Description Logics. Wu, Jiewen. January 2013 (has links)
Many information sources can be viewed as collections of objects and descriptions about objects. The relationships between objects are often characterized by a set of constraints that semantically encode background knowledge of some domain. The most straightforward and fundamental way to access information in these repositories is to search for objects that satisfy certain selection criteria. This work considers a description logic (DL) based representation of such information sources and object queries, which allows for automated reasoning over the constraints accompanying objects. Formally, a knowledge base K=(T, A) captures constraints in the terminology (a TBox) T, and objects with their descriptions in the assertions (an ABox) A, using some DL dialect L. In such a setting, object descriptions are L-concepts and object identifiers correspond to individual names occurring in K. Correspondingly, object queries amount to the well-known problem of instance retrieval in the underlying DL knowledge base K, which returns the identifiers of qualifying objects.
This work generalizes instance retrieval over knowledge bases to provide users with answers in which both identifiers and descriptions of qualifying objects are given. The proposed query paradigm, called assertion retrieval, is favoured over instance retrieval since it provides more informative answers to users. A more compelling reason is related to performance: assertion retrieval enables a transfer of basic relational database techniques, such as caching and query rewriting, in the context of an assertion retrieval algebra.
The main contributions of this work are two-fold: one concerns optimizing the fundamental reasoning task that underlies assertion retrieval, namely instance checking, and the other establishes a query compilation framework based on the assertion retrieval algebra. The former is necessary because an assertion retrieval query can entail a large volume of instance checking requests of the form K |= a:C, where "a" is an individual name and "C" is an L-concept. This work thus proposes a novel absorption technique, ABox absorption, to improve instance checking. ABox absorption handles knowledge bases whose underlying dialect L is expressive, for instance, one that requires disjunctive knowledge. It works particularly well when knowledge bases contain a large number of concrete domain concepts for object descriptions.
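As a rough illustration of why instance checking dominates the cost of assertion retrieval, the sketch below reduces naive instance retrieval to one check per individual; the `entails` oracle is a hypothetical stand-in for a DL reasoner's instance-checking service, not CARE's actual interface.

```python
def instance_retrieval(individuals, concept, entails):
    """Naive instance retrieval via one instance check K |= a:C per
    individual name. `entails(a, concept)` is a hypothetical reasoner
    oracle; with thousands of individuals, every check that can be
    avoided or cheapened matters, which is what ABox absorption targets.
    """
    return [a for a in individuals if entails(a, concept)]
```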
This work further presents a query compilation framework based on the assertion retrieval algebra to make assertion retrieval more practical. In the framework, a suite of rewriting rules is provided to generate a variety of query plans, with a focus on plans that avoid reasoning w.r.t. the background knowledge bases when sufficient cached results of earlier requests exist. ABox absorption and the query compilation framework have been implemented in a prototypical system, dubbed CARE Assertion Retrieval Engine (CARE). CARE also defines a simple yet effective cost model to search for the best plan generated by query rewriting. Empirical studies of CARE have shown that the proposed techniques in this work make assertion retrieval a practical application over a variety of domains.
563
Optimization for big joins and recursive query evaluation using intersection and difference filters in MapReduce. Phan, Thuong-Cang. 07 July 2014 (has links) (PDF)
The information technology community has created an unprecedented amount of data through large-scale applications. As a result, Big Data is considered a gold mine of information that only awaits processing power that is available, reliable, and apt at evaluating complex analytic algorithms. MapReduce is one of the most popular programming models designed to support such processing. It has become a standard for processing, analyzing, and generating large data in a massively parallel manner. However, the MapReduce programming model suffers from severe limitations for operations beyond simple scan/grouping, particularly operations with multiple inputs. In the present dissertation, we investigate and optimize the evaluation, in a MapReduce environment, of one of the most salient and representative of such operations: the join. We focus not only on two-way joins, but also on complex joins such as multi-way joins and recursive joins. To achieve these objectives, we first devise a new type of filter, called an intersection filter, which uses a probabilistic model to represent an approximation of the set intersection. The intersection filter is then applied to two-way join operations to eliminate most non-joining elements in the input datasets before sending data to the actual join processing. In addition, we extend the intersection filter to improve the performance of three-way joins and chain joins, including cyclic chain joins with many shared join keys. We use the Lagrangian multiplier method to identify a good choice among our optimized solutions for multi-way joins. Another important proposal is the difference filter, a probabilistic data structure designed to represent a set and detect elements disjoint from it. It can be applied to a wide range of popular problems such as reconciliation, deduplication, and error correction, and especially to recursive join operations. A recursive join using the difference filter is implemented as an iteration of one join job instead of two jobs (a join job and a difference job). This improvement halves the number of executed jobs and significantly reduces the related overheads, such as data rescanning, intermediate data, and communication for the deduplication and difference operations. This research also improves the general semi-naive algorithm, as well as the evaluation of recursive queries in MapReduce. We then provide general cost models for two-way joins, multi-way joins, and recursive joins; thanks to these cost models, comparisons between the join algorithms become more convincing. As a result, using the proposed filters, the join operations can minimize disk I/O and communication costs. Moreover, the intersection filter-based join operations are demonstrated through experimental evaluation to be more efficient than existing solutions. Different join algorithms are compared experimentally with respect to the amount of intermediate data, the total output size, the total execution time, and especially the task timelines. Finally, our improvements to the join operations contribute to the broader effort of optimizing data management for MapReduce applications on large-scale distributed infrastructures.
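A minimal sketch of the intersection-filter idea using a plain Bloom filter; the filter parameters and the pre-join filtering step below are illustrative assumptions, not the dissertation's exact construction.

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter over string keys (illustrative parameters)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def prefilter(r_tuples, s_tuples):
    """Approximate intersection filter for R(a, b) JOIN S(b, c) on key b:
    build one filter per input and keep only tuples whose join key may
    appear in the *other* input, so most non-joining tuples never reach
    the join's shuffle phase."""
    bf_r, bf_s = BloomFilter(), BloomFilter()
    for _, b in r_tuples:
        bf_r.add(b)
    for b, _ in s_tuples:
        bf_s.add(b)
    r_kept = [t for t in r_tuples if bf_s.might_contain(t[1])]
    s_kept = [t for t in s_tuples if bf_r.might_contain(t[0])]
    return r_kept, s_kept

r = [("r1", "k1"), ("r2", "k9")]
s = [("k1", "s1"), ("k7", "s3")]
r_kept, s_kept = prefilter(r, s)  # ("r2", "k9") and ("k7", "s3") are very likely dropped
```

In a MapReduce setting, the filters would be built in one pass, broadcast to the mappers of the join job, and applied before any data is shuffled; false positives only cost some wasted shuffle traffic, never wrong join results.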
564
Scalable view-based techniques for web data: algorithms and systems. Katsifodimos, Asterios. 03 July 2013 (has links) (PDF)
XML was recommended by the W3C in 1998 as a markup language for device- and system-independent representation of information. XML is nowadays used as a data model for storing and querying large volumes of data in database systems. In spite of significant research and systems development, many performance problems are raised by processing very large amounts of XML data. Materialized views have long been used in databases to speed up queries. Materialized views can be seen as precomputed query results that can be re-used to evaluate (part of) another query, and have been a topic of intensive research, in particular in the context of relational data warehousing. This thesis investigates the applicability of materialized view techniques to optimize the performance of Web data management tools, in particular in distributed settings, considering XML data and queries. We make three contributions.

We first consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view selection problem for a rich subset of XQuery. The challenges we face stem from the expressive power and features of both the query and view languages, and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.

Second, we consider the management of large XML corpora in peer-to-peer networks based on distributed hash tables (DHTs, in short). We consider a platform leveraging distributed materialized XML views, defined by arbitrary XML queries, filled in with data published anywhere in the network, and exploited to efficiently answer queries issued by any network peer. This thesis has contributed important scalability-oriented optimizations, as well as a comprehensive set of experiments deployed in a country-wide WAN. These experiments outgrow similar competitor systems by orders of magnitude in terms of data volumes and data dissemination throughput. Thus, they are the most advanced in understanding the performance behavior of DHT-based XML content management in real settings.

Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub, in short) in the presence of constraints on the available computational resources of data publishers. We achieve scalability by off-loading subscriptions from the publisher and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm for organizing subscriptions in a multi-level dissemination network in order to serve large numbers of subscriptions, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN.
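As a toy illustration of the view-reuse principle behind these techniques, the sketch below rewrites a linear path query against a materialized view when the view's path is a prefix of the query's. Real XQuery view selection and rewriting are far richer; the list-of-steps path representation is an assumption of the sketch.

```python
def rewrite_with_view(query_path, view_path):
    """If the view's path is a prefix of the query's, the query can be
    answered by applying the remaining steps to the view's (smaller)
    materialized results instead of the base documents. Paths are step
    lists such as ["site", "people", "person"].
    """
    if query_path[:len(view_path)] == view_path:
        return {"scan": "view", "remaining_steps": query_path[len(view_path):]}
    return {"scan": "base", "remaining_steps": query_path}

print(rewrite_with_view(["site", "people", "person", "name"],
                        ["site", "people"]))
# {'scan': 'view', 'remaining_steps': ['person', 'name']}
```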
565
A portable relational algebra library for high performance data-intensive query processing. Saeed, Ifrah. 09 April 2014 (has links)
A growing number of industries are turning to data warehousing applications such as forecasting and risk assessment to process large volumes of data. These data warehousing applications, which utilize queries comprising a mix of arithmetic and relational algebra (RA) operators, currently run on systems built around commodity multi-core CPUs. Given the data-intensive nature of these applications, general-purpose graphics processing units (GPUs), with their high throughput and memory bandwidth, are natural candidates to host them. However, since such relational queries exhibit irregular parallelism and data accesses, their efficient implementation on GPUs remains challenging. Thus, although tailored solutions for individual processors using their native programming environments have evolved, these solutions are not accessible to other processors. This thesis addresses this problem by providing, in the form of a library, a portable implementation of the RA, mathematical, and related primitives required to implement and accelerate relational queries over large data sets. These primitives can run on any modern multi- and many-core architecture that supports OpenCL, thereby enhancing the performance potential of such architectures for warehousing applications. In essence, this thesis describes the implementation of the primitives and the results of their performance evaluation on a range of platforms, and concludes with insights, the identification of opportunities, and lessons learned. One of the major insights from our analysis is that for complex relational queries, the time taken to transfer data between host CPUs and discrete GPUs can render the performance of discrete and integrated GPUs comparable, in spite of the higher computing power and memory bandwidth of discrete GPUs. Therefore, data movement optimization is the key to effectively harnessing the high performance of discrete GPUs; otherwise, cost effectiveness would encourage the use of integrated GPUs. Furthermore, portability also enables the complete utilization of all GPUs and CPUs in the system at run time, by opportunistically using any type of available processor when a kernel is ready for execution.
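As a reference-semantics sketch of the kind of primitives such a library exposes, the following plain-Python stand-ins mirror the RA operators that the OpenCL kernels would implement in parallel; the tables and column names are hypothetical.

```python
def select(table, predicate):
    """RA selection: keep rows satisfying the predicate."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """RA projection: keep only the named columns."""
    return [{c: row[c] for c in columns} for row in table]

def hash_join(left, right, key):
    """RA equi-join via hashing, the usual GPU-friendly formulation:
    build a hash table on one input, probe it with the other."""
    build = {}
    for row in left:
        build.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in build.get(r[key], [])]

orders = [{"id": 1, "cust": "a", "amt": 90}, {"id": 2, "cust": "b", "amt": 15}]
custs = [{"cust": "a", "region": "EU"}, {"cust": "b", "region": "US"}]
big = select(orders, lambda r: r["amt"] > 50)
print(project(hash_join(big, custs, "cust"), ["id", "region"]))
# [{'id': 1, 'region': 'EU'}]
```

On a GPU, each of these operators becomes one or more data-parallel kernels, and the thesis's observation about data movement applies: chaining operators on the device avoids repeated host-device transfers.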
566
Algorithmic Problems in Access ControlMousavi, Nima 29 July 2014 (has links)
Access control is used to provide regulated access
to resources by principals. It is an important and foundational
aspect of information security. Role-Based Access Control (RBAC) is
a popular and widely used access control model that, as prior work
argues, is ideally suited for enterprise settings. In this dissertation,
we address two problems in the context of RBAC.
One is the User Authorization Query (UAQ) problem, which relates
to sessions that a user creates to exercise permissions.
UAQ's objective is to identify a set of roles that a user needs to
activate so that the session is authorized for all permissions the
user wants to exercise in that session. The activated roles must respect
a set of Separation of Duty constraints. Such constraints restrict the
roles that can be activated together in a session.
UAQ is known to be intractable (NP-hard).
In this dissertation, we give a precise formulation of UAQ as a
joint-optimization problem, and analyze it.
We examine the manner in which each input parameter contributes to its
intractability.
We then propose an approach to mitigate its intractability based on
our observation that a corresponding decision version of the problem
is in NP. We efficiently
reduce UAQ to Boolean satisfiability in conjunctive normal form
(CNF-SAT), a well-known
NP-complete problem for which solvers exist that are efficient for large
classes of instances. We also present results for UAQ posed
as an approximation problem; our results
suggest that efficient approximation is not promising for UAQ.
We discuss an open-source implementation of our approach and a
corresponding empirical assessment that we have conducted.
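To convey the flavor of such a reduction, the sketch below encodes a simplified UAQ decision instance as CNF clauses in DIMACS-style signed-integer form, consumable by any off-the-shelf SAT solver. The variable-per-role encoding and the example policy are illustrative assumptions, not the dissertation's exact encoding.

```python
def uaq_to_cnf(roles, perms_of, wanted, sod_pairs):
    """Encode a simplified UAQ decision version as CNF clauses.

    roles: list of role names; variable i+1 means "role i is activated".
    perms_of: dict role -> set of permissions granted by that role.
    wanted: set of permissions the session must cover.
    sod_pairs: pairs of roles that must not be activated together.
    Returns clauses as lists of signed ints (DIMACS convention).
    """
    var = {r: i + 1 for i, r in enumerate(roles)}
    clauses = []
    for p in wanted:
        # at least one activated role must grant permission p
        clauses.append([var[r] for r in roles if p in perms_of[r]])
    for r1, r2 in sod_pairs:
        # separation of duty: not both roles in the same session
        clauses.append([-var[r1], -var[r2]])
    return clauses

clauses = uaq_to_cnf(
    roles=["clerk", "auditor", "manager"],
    perms_of={"clerk": {"read"}, "auditor": {"read", "audit"},
              "manager": {"approve"}},
    wanted={"read", "approve"},
    sod_pairs=[("auditor", "manager")],
)
print(clauses)  # e.g. [[1, 2], [3], [-2, -3]]
```

A satisfying assignment of the resulting formula is exactly a safe role activation; the optimization objectives discussed above would be layered on top of this decision core.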
The other problem we consider in this dissertation regards
an efficient data structure for distributed
access enforcement. Access enforcement is the process of validating an access
request to a resource.
Distributed access enforcement has become important
with the proliferation of data, which requires access control systems
to scale to tens of thousands of resources and permissions.
Prior work has shown the effectiveness of a data structure called
the Cascade Bloom Filter (CBF) for this problem.
In this dissertation, we study the construction of instances
of the CBF.
We formulate the problem of finding an optimal instance of a
CBF, where optimality refers to the number of false positives
incurred and the number
of hash functions used. We prove that this problem
is NP-hard, and that a meaningful decision version is in NP.
We then propose an approach to mitigate the intractability of
the problem by reducing it to CNF-SAT, which allows us to use a
SAT solver for instances that arise in practice.
We discuss an open-source implementation of our approach
and an empirical assessment based on it.
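A rough sketch of the cascade idea behind the CBF follows; the deliberately tiny one-hash filter (so that false positives actually occur) and the fixed depth are illustrative assumptions, not the dissertation's tuned construction, and queries are assumed to come from the known universe of access requests.

```python
class TinyBloom:
    """One-hash Bloom filter over a small bitmask (illustrative only)."""

    def __init__(self, items, num_bits=16):
        self.num_bits, self.mask = num_bits, 0
        for x in items:
            self.mask |= 1 << (hash(x) % num_bits)

    def might_contain(self, x):
        return bool(self.mask & (1 << (hash(x) % self.num_bits)))

def build_cascade(authorized, universe, depth=3):
    """Level 0 filters the authorized set; each subsequent level stores
    the elements misclassified by the previous one, and the final level
    is stored exactly, so in-universe lookups are always answered
    correctly."""
    levels, positives = [], set(authorized)
    negatives = set(universe) - positives
    for _ in range(depth - 1):
        bf = TinyBloom(positives)
        levels.append(bf)
        # false positives of this level become the next level's positives
        positives, negatives = (
            {x for x in negatives if bf.might_contain(x)}, positives)
        if not positives:
            break
    levels.append(set(positives))  # exact last level
    return levels

def is_authorized(levels, x):
    for k, level in enumerate(levels):
        hit = x in level if isinstance(level, set) else level.might_contain(x)
        if not hit:
            return k % 2 == 1  # rejected at level k: authorized iff k is odd
    return len(levels) % 2 == 1  # accepted by every level

auth = {"alice:read", "bob:write"}
universe = auth | {"carol:read", "dave:exec", "erin:write"}
cbf = build_cascade(auth, universe)
assert all(is_authorized(cbf, x) == (x in auth) for x in universe)
```

The optimization problem studied above then amounts to choosing the filter sizes, hash counts, and depth that minimize space and lookup cost for a given request distribution.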
567
Concept-Oriented Model and Nested Partially Ordered Sets. Savinov, Alexandr. 24 April 2014 (has links) (PDF)
The concept-oriented model of data (COM) has recently been defined syntactically by means of the concept-oriented query language (COQL). In this paper we propose a formal embodiment of this model, called nested partially ordered sets (nested posets), and demonstrate how it is connected with its syntactic counterpart. A nested poset is a novel formal construct that can be viewed either as a nested set with a partial order relation established on its elements, or as a conventional poset whose elements can themselves be posets. An element of a nested poset is defined as a couple consisting of one identity tuple and one entity tuple. We formally define the main operations on nested posets and demonstrate their usefulness in solving typical data management and analysis tasks such as logic navigation, constraint propagation, inference, and multidimensional analysis.
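As an illustrative (not authoritative) rendering of the central construct, the sketch below models an element as a couple of identity and entity tuples, with the partial order given by parent links; all names and the example data are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Element:
    """A nested-poset element: one identity tuple and one entity tuple.
    `parents` lists the directly greater elements, encoding the partial
    order; since parents are themselves Elements, the structure nests."""
    identity: tuple
    entity: tuple
    parents: tuple = ()

def leq(a, b):
    """a <= b in the reflexive-transitive closure of the parent relation."""
    if a == b:
        return True
    return any(leq(p, b) for p in a.parents)

germany = Element(identity=("DE",), entity=("Germany",))
berlin = Element(identity=("DE", "Berlin"), entity=("Berlin", 3_700_000),
                 parents=(germany,))
print(leq(berlin, germany))  # True: logic navigation walks such links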
568
Grid-aware evaluation of regular path queries on large spatial networks. Miao, Zhuo. 20 August 2007 (has links)
Regular path queries (RPQs), expressed as regular expressions over the alphabet of database edge labels, are commonly used for guided navigation of graph databases. RPQs are the basic building block of almost all query languages for graph databases, providing the user with a simple way to express recursion. While convenient to use, RPQs are notorious for their high computational demand. Except for a few theoretical works, there has been little work on evaluating RPQs over databases of great practical interest, such as large spatial networks.

In this thesis, we present a grid-aware, fault-tolerant distributed algorithm for answering RPQs on spatial networks. We engineer each part of the algorithm to account for the assumed computational-grid setting. We experimentally evaluate our algorithm and show that, for typical user queries, it satisfies the desiderata for distributed computing in general, and for computational grids in particular.
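For reference, the standard centralized evaluation of an RPQ is a breadth-first search over the product of the database graph and the query automaton; the thesis's contribution lies in distributing and hardening this computation for a computational grid. A minimal sketch:

```python
from collections import deque

def eval_rpq(graph, nfa, start_nodes, nfa_start, nfa_final):
    """Evaluate an RPQ by BFS over the product of the database graph
    and the query automaton.

    graph: dict node -> list of (label, node) edges.
    nfa:   dict state -> list of (label, state) transitions.
    Returns the nodes reachable from a start node along a path whose
    label sequence is in the RPQ's language.
    """
    frontier = deque((v, nfa_start) for v in start_nodes)
    seen = set(frontier)
    answers = set()
    while frontier:
        v, q = frontier.popleft()
        if q in nfa_final:
            answers.add(v)
        for label, q2 in nfa.get(q, []):
            for lbl, v2 in graph.get(v, []):
                if lbl == label and (v2, q2) not in seen:
                    seen.add((v2, q2))
                    frontier.append((v2, q2))
    return answers

# Query a.b* from node 1 over a toy labeled graph:
graph = {1: [("a", 2)], 2: [("b", 3)], 3: [("b", 3)]}
nfa = {0: [("a", 1)], 1: [("b", 1)]}
print(eval_rpq(graph, nfa, {1}, 0, {1}))  # {2, 3}
```

In a distributed setting, the (node, state) frontier is partitioned across grid sites holding different fragments of the spatial network, which is where fault tolerance and grid awareness become essential.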
569
What's in a query: analyzing, predicting, and managing linked data access. Lorey, Johannes. January 2014 (has links)
The term Linked Data refers to connected information sources comprising structured data about a wide range of topics and for a multitude of applications. In recent years, the conceptual and technical foundations of Linked Data have been formalized and refined. To this end, well-known technologies have been established, such as the Resource Description Framework (RDF) as a Linked Data model, and the SPARQL Protocol and RDF Query Language (SPARQL) for retrieving this information.
Whereas most research has been conducted in the area of generating and publishing Linked Data, this thesis presents novel approaches for improved management. In particular, we illustrate new methods for analyzing and processing SPARQL queries. Here, we present two algorithms suitable for identifying structural relationships between these queries. Both algorithms are applied to a large number of real-world requests to evaluate the performance of the approaches and the quality of their results. Based on this, we introduce different strategies enabling optimized access to Linked Data sources. We demonstrate how the presented approach facilitates effective utilization of SPARQL endpoints by prefetching results relevant for multiple subsequent requests.
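To give the flavor of structural comparison between SPARQL queries, the toy measure below computes a Jaccard overlap of triple patterns with variable names normalized away; the thesis's two algorithms are more sophisticated, and the queries shown are hypothetical.

```python
def normalize(triple_pattern):
    """Replace variable names by a placeholder so that structurally
    identical patterns compare equal regardless of naming."""
    return tuple("?" if t.startswith("?") else t for t in triple_pattern)

def structural_similarity(q1, q2):
    """Jaccard similarity between the normalized triple-pattern sets
    of two SPARQL queries."""
    s1 = {normalize(tp) for tp in q1}
    s2 = {normalize(tp) for tp in q2}
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 1.0

q1 = [("?film", "rdf:type", "dbo:Film"), ("?film", "dbo:director", "?d")]
q2 = [("?m", "rdf:type", "dbo:Film"), ("?m", "dbo:starring", "?actor")]
print(structural_similarity(q1, q2))  # 0.333...
```

Clusters of structurally similar past requests are exactly what makes prefetching worthwhile: results fetched for one query are likely to serve its near neighbors.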
Furthermore, we contribute a set of metrics for determining technical characteristics of such knowledge bases. To this end, we devise practical heuristics and validate them through thorough analysis of real-world data sources. We discuss the findings and evaluate their impact on utilizing the endpoints. Moreover, we detail the adoption of a scalable infrastructure for improving Linked Data discovery and consumption. As we outline in an exemplary use case, this platform is suitable both for processing and for provisioning the corresponding information.
Während bisher hauptsächlich die Erzeugung und Bereitstellung von Linked Data Forschungsgegenstand war, präsentiert die vorliegende Arbeit neuartige Verfahren zur besseren Nutzbarmachung. Insbesondere werden dafür Methoden zur Analyse und Verarbeitung von SPARQL-Anfragen entwickelt. Zunächst werden daher zwei Algorithmen vorgestellt, die die strukturelle Ähnlichkeit solcher Anfragen bestimmen. Beide Algorithmen werden auf eine große Anzahl von authentischen Anfragen angewandt, um sowohl die Güte der Ansätze als auch die ihrer Resultate zu untersuchen. Darauf aufbauend werden verschiedene Strategien erläutert, mittels derer optimiert auf Quellen von Linked Data zugegriffen werden kann. Es wird gezeigt, wie die dabei entwickelte Methode zur effektiven Verwendung von SPARQL-Endpunkten beiträgt, indem relevante Ergebnisse für mehrere nachfolgende Anfragen vorgeladen werden.
Weiterhin werden in dieser Arbeit eine Reihe von Metriken eingeführt, die eine Einschätzung der technischen Eigenschaften solcher Endpunkte erlauben. Hierfür werden praxisrelevante Heuristiken entwickelt, die anschließend ausführlich mit Hilfe von konkreten Datenquellen analysiert werden. Die dabei gewonnenen Erkenntnisse werden erörtert und in Hinblick auf die Verwendung der Endpunkte interpretiert. Des Weiteren wird der Einsatz einer skalierbaren Plattform vorgestellt, die die Entdeckung und Nutzung von Beständen an Linked Data erleichtert. Diese Plattform dient dabei sowohl zur Verarbeitung als auch zur Verfügbarstellung der zugehörigen Information, wie in einem exemplarischen Anwendungsfall erläutert wird.