631 |
Multimodal Representation Learning for Textual Reasoning over Knowledge Graphs. Choudhary, Nurendra, 18 May 2023.
Knowledge graphs (KGs) store relational information in a flexible triplet schema and have become ubiquitous for information storage in domains such as web search, e-commerce, social networks, and biology. Retrieval of information from KGs is generally achieved through logical reasoning, but this process can be computationally expensive and has limited performance due to the large size and complexity of relationships within the KGs. Furthermore, to extend the usage of KGs to non-expert users, retrieval over them cannot solely rely on logical reasoning but also needs to consider text-based search. This creates a need for multi-modal representations that capture both the semantic and structural features from the KGs.
The primary objective of the proposed work is to extend the accessibility of KGs to non-expert users/institutions by enabling them to use non-technical textual queries to search the vast amount of information stored in KGs. To achieve this objective, the research addresses four limitations: (i) it develops a framework for logical reasoning over KGs that learns representations capturing hierarchical dependencies between entities; (ii) it designs an architecture that effectively learns the logic flow of queries from natural language text; (iii) it creates a multi-modal architecture that captures the inherent semantic and structural features of entities and KGs, respectively; and (iv) it introduces a novel hyperbolic learning framework that uses meta-learning to make hyperbolic neural networks scale to large graphs.
The proposed work is distinct from current research because it models the logical flow of textual queries in hyperbolic space and uses it to perform complex reasoning over large KGs. The models developed in this work are evaluated both in the standard research setting of logical reasoning and in real-world scenarios of query matching and search, specifically in the e-commerce domain.
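To illustrate why hyperbolic space suits hierarchical KG data: distances in the Poincaré ball grow roughly exponentially toward the boundary, mirroring the exponential fan-out of trees. The sketch below computes the standard Poincaré-ball distance in plain Python; it is an illustrative aid only, not the thesis's architecture, and the example points are invented.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    Points near the boundary are exponentially far apart, which is why
    trees and other hierarchies embed with low distortion.
    """
    def sq(x):
        return sum(xi * xi for xi in x)
    diff = [ui - vi for ui, vi in zip(u, v)]
    num = 2.0 * sq(diff)
    den = (1.0 - sq(u)) * (1.0 - sq(v))
    return math.acosh(1.0 + num / den)

# A parent near the origin stays close to all its children, while two
# children near the boundary end up far from each other (invented points).
parent, child_a, child_b = [0.0, 0.1], [0.0, 0.9], [0.85, 0.0]
print(poincare_distance(parent, child_a))   # comparatively small (~2.7)
print(poincare_distance(child_a, child_b))  # much larger (~4.8)
```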
In summary, the proposed work aims to extend the accessibility of KGs to non-expert users by enabling them to use non-technical textual queries to search vast amounts of information stored in KGs. To achieve this objective, the work proposes the use of multi-modal representations that capture both semantic and structural features from the KGs, and a novel hyperbolic learning framework to enable scalability of hyperbolic neural networks over large graphs. The work also models the logical flow of textual queries in hyperbolic space to perform complex reasoning over large KGs. The models developed in this work are evaluated both in the standard research setting of logical reasoning and in real-world scenarios in the e-commerce domain. / Doctor of Philosophy / Knowledge graphs (KGs) are databases that store information in a way that allows computers to easily identify relationships between different pieces of data. They are widely used in domains such as web search, e-commerce, social networks, and biology. However, retrieving information from KGs can be computationally expensive, and relying solely on logical reasoning limits their accessibility to non-expert users. This is where the proposed work comes in. The primary objective is to make KGs more accessible to non-experts by enabling them to use natural language queries to search the vast amounts of information stored in KGs. To achieve this objective, the research addresses four limitations. First, a framework for logical reasoning over KGs is developed that learns representations capturing hierarchical dependencies between entities. Second, an architecture is designed that effectively learns the logic flow of queries from natural language text. Third, a multi-modal architecture is created that captures the inherent semantic and structural features of entities and KGs, respectively. Finally, a novel hyperbolic learning framework is introduced to enable the scalability of hyperbolic neural networks over large graphs using meta-learning. The proposed work is unique because it models the logical flow of textual queries in hyperbolic space and uses it to perform complex reasoning over large KGs. The models developed in this work are evaluated both in the standard research setting of logical reasoning and in real-world scenarios of query matching and search, specifically in the e-commerce domain.
In summary, the proposed work aims to make KGs more accessible to non-experts by enabling them to use natural language queries to search vast amounts of information stored in KGs. To achieve this objective, the work proposes the use of multi-modal representations that capture both semantic and structural features from the KGs, and a novel hyperbolic learning framework to enable scalability of hyperbolic neural networks over large graphs. The work also models the logical flow of textual queries in hyperbolic space to perform complex reasoning over large KGs. The results of this work have significant implications for the field of information retrieval, as they provide a more efficient and accessible way to retrieve information from KGs. Additionally, the multi-modal approach taken in this work has potential applications in other areas of machine learning, such as image recognition and natural language processing. The work also contributes to the development of hyperbolic geometry as a tool for modeling complex networks, which has implications for fields such as network science and social network analysis. Overall, this work represents an important step towards making the vast amounts of information stored in KGs more accessible and useful to a wider audience.
|
632 |
Ontology-Mediated Query Answering over Log-Linear Probabilistic Data: Extended Version. Borgwardt, Stefan; Ceylan, Ismail Ilkan; Lukasiewicz, Thomas, 28 December 2023.
Large-scale knowledge bases are at the heart of modern information systems. Their knowledge is inherently uncertain, and hence they are often materialized as probabilistic databases. However, probabilistic database management systems typically lack the capability to incorporate implicit background knowledge and, consequently, fail to capture some intuitive query answers. Ontology-mediated query answering is a popular paradigm for encoding commonsense knowledge, which can provide more complete answers to user queries. We propose a new data model that integrates the paradigm of ontology-mediated query answering with probabilistic databases, employing a log-linear probability model. We compare our approach to existing proposals, and provide supporting computational results.
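As a concrete illustration of a log-linear probability model over uncertain data: each possible world (a subset of the uncertain facts) receives an unnormalized weight exp(sum of the feature weights it satisfies), and a query's probability is the normalized mass of the worlds where the query holds. The Python sketch below uses invented facts and weights and enumerates worlds by brute force; it is a hedged toy illustrating the model, not the paper's query-answering approach.

```python
import math
from itertools import chain, combinations

# Hypothetical uncertain facts with log-linear weights for their inclusion.
facts = ["teaches(alice, db)", "teaches(alice, ai)"]
weights = {
    "teaches(alice, db)": 1.5,
    "teaches(alice, ai)": 0.5,
}

def worlds(facts):
    """Every subset of the facts is a possible world."""
    return chain.from_iterable(
        combinations(facts, r) for r in range(len(facts) + 1))

def score(world):
    """Unnormalized log-linear score: exp(sum of satisfied feature weights)."""
    return math.exp(sum(weights[f] for f in world))

z = sum(score(w) for w in worlds(facts))  # partition function

# P(Q) for the query "alice teaches something": mass of non-empty worlds.
p_query = sum(score(w) for w in worlds(facts) if len(w) > 0) / z
print(round(p_query, 4))
```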
|
633 |
Publication of the Bibliographies on the World Wide Web. Moral, Ibrahim Utku, 28 January 1997.
Scientific research begins with a literature review, which includes an extensive bibliographic search. Such searches are known to be difficult and time-consuming because of the vast amount of topical material in today's ever-changing technology base. Keeping up to date with related literature and staying aware of the most recent publications require extensive time and effort.
The need for a WWW-based software tool for collecting and providing access to this scientific body of knowledge is undeniable. The study explained herein addresses this problem by developing an efficient, advanced, easy-to-use tool, WebBiblio, that provides a globally accessible WWW environment enabling the collection and dissemination of searchable bibliographies comprising abstracts and keywords. This thesis describes the design, structure and features of WebBiblio, and explains the ideas and approaches used in its development. The developed system is not a prototype, but a production system that exploits the capabilities of the WWW. Currently, it is used to publish three VV&T bibliographies at the WWW site http://manta.cs.vt.edu/biblio. With its rich set of features and ergonomically engineered interface, WebBiblio offers a comprehensive solution to the problem of globally collecting and providing access to a diverse set of bibliographies. / Master of Science
|
634 |
Scalable Management and Analysis of Temporal Property Graphs. Rost, Christopher, 17 May 2024.
Graphs, as simple yet powerful data structures, play a pivotal role in modeling and analyzing relationships among real-world entities, and they have established themselves as a fundamental paradigm for representing complex relationships in a wide variety of domains. Their domain independence, expressiveness, and the breadth of graph-theoretic analyses they support have attracted significant attention in both research and industry.
In recent years, companies have increasingly leveraged graph technology to represent, store, query, and analyze graph-shaped data. This has been notably impactful in uncovering hidden patterns and predicting relationships within diverse domains such as social networks, Internet of Things (IoT), biological systems, and medical networks. However, the dynamic nature of most real-world graphs is often neglected in existing approaches, which might lead to inaccurate analytical results or an incomplete understanding of evolving patterns within the graph over time.
Temporal graphs, in contrast, are graphs whose structure and properties change over time. They have gained significant attention in various domains, from financial and micromobility networks to supply chains and biological networks. Most of these real-world networks are not static but highly dynamic, yet this dynamism is rarely considered in data models, query languages, and analyses, even though analytical questions often require evaluating the network's evolution.
This doctoral thesis addresses this critical gap by presenting a comprehensive study on analyzing and exploring temporal property graphs. It focuses on scalability and proposes novel methodologies to enhance accuracy and comprehensiveness in analyzing evolving graph patterns over time. It also offers insights into real-time querying, addressing various challenges that emerge when the time dimension is treated as an integral part of the graph.
This thesis introduces the Temporal Property Graph Model (TPGM), a sophisticated data model designed for bitemporal modeling of vertices and edges, as well as logical abstractions of subgraphs and graph collections. The reference implementation of this model, namely Gradoop, is a graph dataflow system explicitly designed for scalable and distributed analysis of static and temporal property graphs. Gradoop empowers analysts to construct comprehensive and flexible temporal graph processing workflows through a declarative analytical language. The system supports various analytical temporal graph operators, such as snapshot retrieval, temporal graph pattern matching, time-dependent grouping, and temporal metrics such as degree evolution.
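As a hedged sketch of the snapshot and difference operators' semantics (not Gradoop's actual API, and simplified to a single valid-time dimension rather than TPGM's bitemporal model of valid and transaction time):

```python
from dataclasses import dataclass

@dataclass
class TemporalEdge:
    src: str
    dst: str
    valid_from: int  # inclusive start of valid time
    valid_to: int    # exclusive end of valid time

def snapshot(edges, t):
    """Snapshot operator: the graph as it was valid at time t."""
    return [e for e in edges if e.valid_from <= t < e.valid_to]

def diff(edges, t1, t2):
    """Edge sets added and removed between the snapshots at t1 and t2."""
    s1 = {(e.src, e.dst) for e in snapshot(edges, t1)}
    s2 = {(e.src, e.dst) for e in snapshot(edges, t2)}
    return s2 - s1, s1 - s2  # (added, removed)

# Invented call-center-style edges with valid intervals.
calls = [TemporalEdge("a", "b", 0, 10), TemporalEdge("b", "c", 5, 8)]
print(snapshot(calls, 6))   # both edges are valid at t=6
print(diff(calls, 3, 6))    # ({('b', 'c')}, set())
```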
The thesis provides an in-depth analysis of the data model, system architecture, and implementation details of Gradoop and its operators. Comprehensive performance evaluations have been conducted on large real-world and synthetic temporal graphs, providing valuable insights into the system's capabilities and efficiency.
Furthermore, this thesis demonstrates the flexibility of the temporal graph model and its operators through a practical use case based on a call center network. In this scenario, a TPGM operator pipeline is developed to answer a complex, time-dependent analytical question. We also showcase the Temporal Graph Explorer (TGE), a web-based user interface designed to explore temporal graphs, leveraging Gradoop as a backend. The TGE empowers users to delve into temporal graph dynamics by enabling the retrieval of snapshots of the graph's past states, computing differences between these snapshots, and providing temporal summaries of graphs. This functionality allows for a comprehensive understanding of graph evolution through diverse visualizations. Real-world temporal graph data from bicycle rentals highlights the system's flexibility and the configurability of the selected temporal operators.
The impact of graph changes on its characteristics can also be explored by examining centrality measures over time. Centrality measures, encompassing both node and graph metrics, quantify the characteristics of individual nodes or the entire graph. In the dynamic context of temporal graphs, where the graph changes over time, node and graph metrics also undergo implicit changes. This thesis tackles the challenge of adapting static node and graph metrics to temporal graphs. It proposes temporal extensions for selected degree-dependent metrics and aggregations, emphasizing the importance of including the time dimension in the metrics.
This thesis demonstrates that a metric conventionally representing a scalar value for static graphs results in a time series when applied to temporal graphs. It introduces a baseline algorithm for calculating the degree evolution of vertices within a temporal graph, and its practical implementation in Gradoop is presented. The scalability of this algorithm is evaluated using both real-world and synthetic datasets, providing valuable insights into its performance across diverse scenarios.
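A minimal Python sketch of such a baseline sweep, assuming half-open valid intervals on edges (illustrative only; Gradoop's distributed implementation is not reproduced here):

```python
from collections import defaultdict

def degree_evolution(edges, vertex):
    """Baseline sweep: turn the degree of `vertex` into a time series.

    `edges` are (src, dst, t_from, t_to) tuples with half-open valid
    intervals. At each interval endpoint the degree changes by +1 or -1;
    sorting the endpoints and accumulating yields a step function.
    """
    deltas = defaultdict(int)
    for src, dst, t_from, t_to in edges:
        if vertex in (src, dst):
            deltas[t_from] += 1
            deltas[t_to] -= 1
    series, degree = [], 0
    for t in sorted(deltas):
        degree += deltas[t]
        series.append((t, degree))
    return series  # the static scalar metric became a time series

# Invented temporal edges.
edges = [("a", "b", 0, 10), ("a", "c", 5, 8), ("d", "a", 7, 12)]
print(degree_evolution(edges, "a"))
# [(0, 1), (5, 2), (7, 3), (8, 2), (10, 1), (12, 0)]
```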
Such time series data can also be captured from the application scenario as properties of nodes and edges, such as sensor readings in the IoT domain.
In light of this, we showcase significant advancements, including an extended version of the TPGM that supports time series data in temporal graphs. Additionally, we introduce a temporal graph query language based on Oracle's language PGQL to facilitate seamless querying of time-oriented graph structures. Furthermore, we present a novel continuous graph-based event detection approach to support scenarios involving more time-sensitive use cases.
The increasing frequency of graph changes, and the need to react quickly to emerging patterns, call for a unified declarative graph query language that can evaluate queries on graph streams. To address the growing importance of real-time data analysis and management, the thesis introduces the syntax and semantics of Seraph, a Cypher-based language that adds native streaming features to property graph query languages. Seraph's semantics combine stream processing with property graphs and time-varying relations, treating time as a first-class citizen. Real-world industrial use cases demonstrate the use of Seraph for graph-based continuous queries.
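Seraph's actual syntax is not reproduced here; the following Python sketch only illustrates the continuous-query semantics such a language targets: a standing two-hop pattern registered against an edge stream, with window-based eviction, where every arriving edge may complete new matches. All names and the window size are hypothetical.

```python
from collections import defaultdict

WINDOW = 60  # hypothetical window of relevance for stream elements

class TwoHopMatcher:
    """Standing query: report (x, y, z) whenever edges x->y and y->z
    co-exist within the window. A toy stand-in for a continuous query."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # src -> [(dst, ts)]
        self.in_edges = defaultdict(list)   # dst -> [(src, ts)]

    def _evict(self, now):
        """Drop stream elements that have fallen out of the window."""
        for index in (self.out_edges, self.in_edges):
            for key in list(index):
                index[key] = [(v, ts) for v, ts in index[key]
                              if now - ts <= WINDOW]

    def on_edge(self, src, dst, ts):
        """Called once per arriving edge; returns newly completed matches."""
        self._evict(ts)
        matches = [(x, src, dst) for x, _ in self.in_edges[src]]    # x->src->dst
        matches += [(src, dst, z) for z, _ in self.out_edges[dst]]  # src->dst->z
        self.out_edges[src].append((dst, ts))
        self.in_edges[dst].append((src, ts))
        return matches

m = TwoHopMatcher()
for edge in [("a", "b", 1), ("b", "c", 2), ("c", "d", 90)]:
    print(m.on_edge(*edge))  # [], [('a', 'b', 'c')], [] (older edges evicted)
```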
After evaluating lessons learned from the development of and research on Gradoop, a final section gives a dissertation summary and an outlook on future work. This doctoral thesis comprehensively examines scalable analysis and exploration techniques for temporal property graphs, focusing on Gradoop and its system architecture, data model, operators, and performance evaluations. It also explores the evolution of node and graph metrics and the theoretical foundation of a real-time query language, contributing to the advancement of temporal graph analysis in various domains.
1 Introduction
2 Background and Related Work
3 The TPGM and Gradoop
4 Gradoop Application Examples
5 Evolution of Degree Metrics
6 The Fusion of Graph and Time-Series Data
7 Seraph: Continuous Queries on Property Graph Streams
8 Lessons Learned from Gradoop
9 Conclusion and Outlook
Bibliography
|
635 |
Anfragebearbeitung auf Mehrkern-Rechnerarchitekturen (Query Processing on Multi-Core Computer Architectures). Huber, Frank, 24 May 2012.
The upcoming generation of many-core architectures poses several new challenges for software development: software design and implementation have to change from sequential execution to highly parallel execution, so that they take full advantage of the steadily growing number of cores on a single processor. In this thesis, we investigate such highly parallel program execution in the context of relational database management systems (RDBMSs). We consider the complete process of query processing and identify four problem areas that are crucial for efficient parallel query processing on many-core architectures: hardware, the physical data model, query execution, and query optimization. Furthermore, we present a framework that covers all four parts, one after another. First, we give a detailed survey of computer hardware with a special focus on memory and processors. Based on this survey we propose a hardware model; this abstraction allows software development that is independent of the particular hardware architecture, without loss of functionality or performance. We then investigate physical data models and evaluate how the physical data model may support optimal query execution by providing efficient and parallelizable data structures. Additionally, we design a new index structure that exploits data-parallel execution through SIMD operations. The next layer within our framework is query execution, for which we present a new task-based query execution model that allows for a high degree of lightweight parallelism. Finally, we cover query optimization by presenting approaches for optimizing resource utilization from both a query-local and a query-global point of view.
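As a hedged illustration of the task concept described above (not the dissertation's engine): a selection operator is split into many small, independent tasks over data partitions, which a shared pool schedules across cores. In CPython the thread pool mainly demonstrates the structure; a native engine would run such tasks truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_and_filter(partition, predicate):
    """One lightweight task: scan a single partition of the table."""
    return [row for row in partition if predicate(row)]

def parallel_select(table, predicate, partitions=4, workers=4):
    """Split a selection into independent tasks and merge their results."""
    size = max(1, len(table) // partitions)
    chunks = [table[i:i + size] for i in range(0, len(table), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan_and_filter, c, predicate) for c in chunks]
        return [row for f in futures for row in f.result()]

# Invented relation: (order_id, status).
orders = [(i, i % 7) for i in range(1_000)]
open_orders = parallel_select(orders, lambda r: r[1] == 0)
print(len(open_orders))  # 143
```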
|
636 |
Efficient techniques for large-scale Web data management. Camacho Rodriguez, Jesus, 25 September 2014.
The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. Cloud providers offer a distributed, shared-nothing infrastructure that may be used for data storage and processing.
In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the by-now well-known MapReduce model and continuing with other novel and more expressive frameworks. As these models are increasingly used to express analytical-style data processing tasks, the need arises for higher-level languages that ease the burden of writing complex queries for these systems.
This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages tailored towards querying Web data declaratively.
First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users incur monetary costs directly connected to their consumption of resources, our focus is not only on query performance from an execution-time perspective, but also on the monetary costs associated with this processing. In particular, we study the applicability of several content indexing strategies, and show that they lead not only to reduced query evaluation time but also, importantly, to reduced monetary costs associated with the exploitation of the cloud-based warehouse.
Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data.
Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions to share their results. We bring several extensions to the algorithm to improve its performance. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.
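A minimal sketch of the subexpression-identification step, assuming algebraic plans encoded as nested tuples (the operator names and scripts are invented; PigReuse's cost-based selection and merging are not shown):

```python
from collections import defaultdict

# An algebra expression as a nested tuple: (operator, *children_or_args).
script1 = ("FILTER", ("LOAD", "clicks.log"), "hour > 6")
script2 = ("JOIN", ("FILTER", ("LOAD", "clicks.log"), "hour > 6"),
                   ("LOAD", "users.csv"))

def index_subexpressions(expr, index):
    """Recursively register every subtree under its canonical form."""
    if isinstance(expr, tuple):
        index[expr].append(expr)
        for child in expr[1:]:
            index_subexpressions(child, index)

index = defaultdict(list)
for script in (script1, script2):
    index_subexpressions(script, index)

# Subtrees occurring more than once are merge opportunities.
shared = [e for e, occurrences in index.items() if len(occurrences) > 1]
for expr in sorted(shared, key=str):
    print("evaluate once, share result:", expr)
# ('FILTER', ('LOAD', 'clicks.log'), 'hour > 6') and ('LOAD', 'clicks.log')
```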
|
637 |
Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates. Idris, Muhammad, 5 March 2019.
Responsive analytics are rapidly taking over the traditional, post-fact approaches that dominate data warehousing. Recent advancements in analytics demand placing analytical engines at the forefront of the system, to react to updates occurring at high speed and to detect patterns, trends, and anomalies. Such solutions find applications in financial systems, industrial control systems, business intelligence, and online machine learning, among others. These applications are usually associated with Big Data and require the ability to react to constantly changing data in order to obtain timely insights and take proactive measures. Generally, these systems specify the analytical results, or their basic elements, in a query language, and the main task is then to maintain query results efficiently under frequent updates. The task of reacting to updates and analyzing changing data has been addressed in two ways in the literature: traditional business intelligence (BI) solutions focus on historical data analysis, where the data is refreshed periodically and in batches, while stream processing solutions process streams of data from transient sources as flows of data items. Both kinds of systems share the niche of reacting to updates (known as dynamic evaluation); however, they differ in architecture, query languages, and processing mechanisms. In this thesis, we investigate the possibility of a reactive and unified framework to model queries that appear in both kinds of systems.
In traditional BI solutions, evaluating queries under updates has been studied under the umbrella of incremental evaluation, based on the relational incremental view maintenance model and mostly focused on queries featuring equi-joins. Streaming systems, in contrast, generally follow automaton-based models to evaluate queries under updates, and they typically process queries featuring comparisons of temporal attributes (e.g., timestamp attributes) along with comparisons of non-temporal attributes over streams of bounded size. Temporal comparisons constitute inequality constraints, while non-temporal comparisons can be either equality or inequality constraints; hence these systems mostly process inequality joins. As a starting point for our research, we postulate the thesis that queries in streaming systems can also be evaluated efficiently in a main-memory model based on the paradigm of incremental evaluation, just as in BI systems. The efficiency of such a model is measured in terms of runtime memory footprint and update processing cost. Existing approaches to dynamic evaluation in both kinds of systems present a trade-off between the two: systems that avoid materialization of query (sub)results incur high update latency, while systems that materialize (sub)results incur a high memory footprint. We overcome this trade-off by developing a practical dynamic evaluation algorithm for queries that appear in both kinds of systems, together with a main-memory data representation that allows enumerating query (sub)results without materialization and can be maintained efficiently under updates.
We call this representation the Dynamic Constant-Delay Linear Representation (DCLR). We devise DCLRs with the following properties: (1) they allow, without materialization, enumeration of query results with bounded delay (and with constant delay for a subclass of queries); (2) they allow tuple lookup in query results with logarithmic delay (and with constant delay for conjunctive queries with equi-joins only); (3) they take space linear in the size of the database; and (4) they can be maintained efficiently under updates. We first study DCLRs with these properties for the class of acyclic conjunctive queries featuring equi-joins with projections, and present a dynamic evaluation algorithm called the Dynamic Yannakakis (DYN) algorithm. We then generalize DYN to the class of acyclic queries featuring multi-way theta-joins with projections, and call the result Generalized DYN (GDYN). The workings of DYN and GDYN over DCLRs are based on a particular variant of join trees, called Generalized Join Trees (GJTs), that guarantee the properties of DCLRs described above. We define GJTs and present algorithms to test a conjunctive query featuring theta-joins for acyclicity and to generate GJTs for such queries, extending the classical GYO algorithm from testing conjunctive queries with equalities for acyclicity to testing, and generating GJTs for, conjunctive queries featuring multi-way theta-joins with projections.
GDYN is hence a unified framework based on DCLRs that enables processing of queries appearing in both streaming and BI systems in a unified main-memory model, and it addresses the space-time trade-off. We instantiate GDYN for the particular case where all theta-joins involve only equalities and inequalities, and call this instantiation IEDYN. We implement DYN and IEDYN as query compilers that generate executable programs in the Scala programming language, providing all the necessary data structures along with their maintenance and enumeration methods in a continuous stream processing model. We evaluate DYN and IEDYN against state-of-the-art BI and streaming systems on both industrial and synthetically generated benchmarks, and show that they outperform existing systems by over an order of magnitude in both memory footprint and update processing time. / Doctorate in Engineering Sciences and Technology
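A hedged miniature of the underlying idea for a single equi-join R(a,b) with S(b,c), not the DCLR data structure itself: per-key indexes are maintained in constant time per update, the join result is never materialized, and results are enumerated on demand with constant delay between tuples.

```python
from collections import defaultdict

class JoinView:
    """Maintains R(a,b) JOIN S(b,c) under updates without materializing it."""

    def __init__(self):
        self.r = defaultdict(set)  # b -> {a}
        self.s = defaultdict(set)  # b -> {c}
        self.live = set()          # join keys b with matches on both sides

    def _refresh(self, b):
        if self.r[b] and self.s[b]:
            self.live.add(b)
        else:
            self.live.discard(b)

    def insert_r(self, a, b):          # O(1) update
        self.r[b].add(a)
        self._refresh(b)

    def delete_r(self, a, b):          # O(1) update
        self.r[b].discard(a)
        self._refresh(b)

    def insert_s(self, b, c):          # O(1) update
        self.s[b].add(c)
        self._refresh(b)

    def enumerate(self):
        """Yields each (a, b, c) with constant delay between results."""
        for b in self.live:
            for a in self.r[b]:
                for c in self.s[b]:
                    yield (a, b, c)

v = JoinView()
v.insert_r("a1", "k"); v.insert_s("k", "c1"); v.insert_s("k", "c2")
print(list(v.enumerate()))  # [('a1', 'k', 'c1'), ('a1', 'k', 'c2')]
v.delete_r("a1", "k")
print(list(v.enumerate()))  # []
```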
|
638 |
Scalable algorithms for cloud-based Semantic Web data management. Zampetakis, Stamatis, 21 September 2015.
In order to build smart systems, where machines are able to reason much as humans do, data with semantics is a major requirement. This need led to the advent of the Semantic Web, which proposes standard ways of representing and querying data with semantics. RDF is the prevalent data model used to describe web resources, and SPARQL is the query language that allows expressing queries over RDF data. The ability to store and query data with semantics triggered the development of many RDF data management systems. The rapid evolution of the Semantic Web provoked the shift from centralized data management systems to distributed ones: the first systems to appear relied on P2P and client-server architectures, while recently the focus has moved to cloud computing.
Cloud computing environments have strongly impacted research and development in distributed software platforms. Cloud providers offer distributed, shared-nothing infrastructures that may be used for data storage and processing. The main features of cloud computing involve scalability, fault tolerance, and elastic allocation of computing and storage resources following the needs of the users.
This thesis investigates the design and implementation of scalable algorithms and systems for cloud-based Semantic Web data management. In particular, we study the performance and cost of exploiting commercial cloud infrastructures to build Semantic Web data repositories, and the optimization of SPARQL queries for massively parallel frameworks.
First, we introduce the basic concepts of the Semantic Web and the main components and frameworks interacting in massively parallel cloud-based systems. In addition, we provide an extended overview of existing RDF data management systems in centralized and distributed settings, emphasizing the critical concepts of storage, indexing, query optimization, and infrastructure.
Second, we present AMADA, an architecture for RDF data management using public cloud infrastructures. We follow the Software as a Service (SaaS) model, where the complete platform runs in the cloud and appropriate APIs are provided to end-users for storing and retrieving RDF data. We explore various storage and querying strategies, revealing their pros and cons with respect to performance and to monetary cost, an important new dimension to consider in public cloud services.
Finally, we present CliqueSquare, a distributed RDF data management system built on top of Hadoop, incorporating a novel optimization algorithm that is able to produce massively parallel plans for SPARQL queries. We present a family of optimization algorithms, relying on n-ary (star) equality joins to build flat plans, and compare their ability to find the flattest plans possible. Inspired by existing partitioning and indexing techniques, we present a generic storage strategy suitable for storing RDF data in HDFS (Hadoop's Distributed File System). Our experimental results validate the efficiency and effectiveness of the optimization algorithm, and demonstrate the overall performance of the system.
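As a hedged illustration of the flat-plan idea (not CliqueSquare's optimization algorithm): grouping all triple patterns that share a variable into one n-ary equality join shortens the plan compared with a chain of binary joins, which matters when each join level costs a MapReduce-style stage. The query below is invented.

```python
from collections import defaultdict

# Triple patterns as (subject, predicate, object); "?x" marks a variable.
bgp = [
    ("?p", "rdf:type", "Person"),
    ("?p", "worksAt", "?org"),
    ("?p", "name", "?n"),
    ("?org", "locatedIn", "Paris"),
]

def star_groups(patterns):
    """First level of a flat plan: one n-ary equality join per shared variable.

    Joining all patterns on a common variable in a single n-ary (star)
    operator flattens the plan relative to a cascade of binary joins.
    """
    groups = defaultdict(list)
    for pattern in patterns:
        for term in pattern:
            if term.startswith("?"):
                groups[term].append(pattern)
    # Keep only variables shared by at least two patterns.
    return {v: ps for v, ps in groups.items() if len(ps) > 1}

for var, patterns in star_groups(bgp).items():
    print(f"n-ary join on {var}: {len(patterns)} patterns")
# n-ary join on ?p: 3 patterns; n-ary join on ?org: 2 patterns
```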
|
639 |
Multilingual Information Processing on Relational Database Architectures. Kumaran, A., 12 1900.
No description available.
|
640 |
The development of an online energy auditing software application with remote SQL-database support. Van der Merwe, Johannes Schalk, 03 1900.
Thesis (MScEng)--Stellenbosch University, 2012. / ENGLISH ABSTRACT: In the last century the earth has experienced an increase in the global mean temperature, with the main contributing factor being the increase in greenhouse gases. Evidence indicates that the burning of fossil fuels, critical to the supply of energy, contributed three quarters of the increase in carbon dioxide (CO2). In 2008 South Africa reached electricity capacity constraints. A subsequent economic downturn in the country, brought about by the worldwide economic recession, relieved some of the strain on the electricity supply system. However, consumption levels are returning to those experienced during 2008, and no new base-load power stations have been added. Short-term capacity constraints can be managed by shifting the peak demand, but the electricity shortage can only be avoided by adding capacity or reducing overall electricity consumption. Supply-side solutions are both overdue and too expensive. The only solutions that can provide lasting results are on the demand side.
During the past few years the Energy Efficiency and Demand-side Management (EEDSM) programme implemented by South Africa's electricity supply utility, Eskom, has gained prominence. This programme relies heavily on calculating the savings achieved by any demand-side intervention. Energy audits enable the calculation of various consumption scenarios and can provide valuable insight into load operation and user behaviour. An energy audit is a two-part procedure consisting of a load survey and an analysis. This thesis describes the development of both procedures, combined into a single application. The application has been tested and provides an accurate and effective tool for simulating consumption and quantifying savings for various load adjustments.
The results gained from the auditing application surpassed expectations and provide the user with a sufficient baseline consumption estimate. The results do not reflect day-to-day variations, but the simulations are sufficient to quantify savings and to determine whether demand-side interventions are financially viable. The application also presents a benchmark for the type of applications required to successfully implement an EEDSM programme.
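As a hedged sketch of the two-part audit procedure the abstract describes, a load survey can be reduced to a baseline consumption estimate against which interventions are quantified; all loads and ratings below are invented.

```python
def baseline_consumption(loads):
    """Daily baseline energy (kWh) from a simple load survey.

    Each load is (name, rated_kw, hours_per_day, quantity); hypothetical
    inputs -- a real audit would use measured profiles per facility.
    """
    return sum(kw * hours * qty for _, kw, hours, qty in loads)

survey = [
    ("lighting", 0.058, 12, 200),   # 200 x 58 W fluorescent fittings
    ("hvac",     15.0,  10,   2),
    ("geysers",  3.0,    5,   6),
]

before = baseline_consumption(survey)

# Intervention: retrofit lighting to 25 W LED equivalents.
retrofit = [("lighting", 0.025, 12, 200)] + survey[1:]
after = baseline_consumption(retrofit)

print(f"baseline {before:.1f} kWh/day, saving {before - after:.1f} kWh/day")
# baseline 529.2 kWh/day, saving 79.2 kWh/day
```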
|