1. Graph-XLL: a graph library for extra large graph analytics on a single machine. Wu, Jian, 26 August 2019.
Graph libraries containing already-implemented algorithms are highly desired, since users can conveniently apply the algorithms off-the-shelf for fast analytics and prototyping rather than implementing them with lower-level APIs. Besides ease of use, users also require the ability to efficiently process extra large graphs. Popular existing graph libraries include the igraph R library and the NetworkX Python library. Although these libraries provide many off-the-shelf algorithms, their in-memory graph representation limits their scalability on large graphs. Therefore, in this work, we develop Graph-XLL: a graph library implemented using the WebGraph framework in a vertex-centric manner, with a much smaller memory requirement than igraph and NetworkX. Scalable analytics for extra large graphs (up to tens of millions of vertices and billions of edges) can be achieved on a single consumer-grade machine within a reasonable amount of time. Such computation would cause an out-of-memory error with igraph or NetworkX.
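The abstract does not show Graph-XLL's internals. As a rough illustration of the vertex-centric style over a WebGraph-compressed graph, the sketch below runs a single PageRank-style iteration: the ImmutableGraph and LazyIntIterator types come from the WebGraph framework, while the basename argument, the damping factor, and the rest of the code are illustrative assumptions, not Graph-XLL code.

```java
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;

import java.io.IOException;

/**
 * Minimal sketch: one PageRank-style iteration over a graph stored in
 * WebGraph's compressed format. Successor lists are streamed from the
 * compressed representation, so memory stays close to two double arrays.
 * Illustration of the vertex-centric style only, not Graph-XLL code.
 */
public class VertexCentricSketch {
    public static void main(String[] args) throws IOException {
        // Basename of a compressed graph (hypothetical path passed on the command line).
        ImmutableGraph graph = ImmutableGraph.load(args[0]);
        int n = graph.numNodes();
        double damping = 0.85;

        double[] rank = new double[n];
        double[] next = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);

        // Vertex-centric pass: each vertex pushes its rank along its out-edges.
        // (Rank mass of dangling vertices is ignored for brevity.)
        for (int v = 0; v < n; v++) {
            int outDeg = graph.outdegree(v);
            if (outDeg == 0) continue;
            double share = rank[v] / outDeg;
            LazyIntIterator succ = graph.successors(v);
            for (int w; (w = succ.nextInt()) != -1; ) {
                next[w] += share;
            }
        }
        for (int v = 0; v < n; v++) {
            next[v] = (1 - damping) / n + damping * next[v];
        }
        System.out.printf("rank[0] after one iteration: %.6e%n", next[0]);
    }
}
```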
2. Distributed large-scale data storage and processing. Papailiopoulos, Dimitrios, 16 March 2015.
This thesis makes progress towards a fundamental understanding of heterogeneous and dynamic information systems and the way we store and process massive datasets.

Reliable large-scale data storage: Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. However, traditional erasure codes are associated with a high repair cost that is often considered an unavoidable price to pay. In this thesis, we show how to overcome these limitations. We construct novel families of erasure codes that are optimal under various repair cost metrics while achieving the best possible reliability, and we show how these modern storage codes significantly outperform traditional erasure codes.

Low-rank approximations for large-scale data processing: A central goal in data analytics is extracting useful and interpretable information from massive datasets. A challenge arising from the distributed and large-scale nature of the data is having algorithms that are good in theory but also scale gracefully to large problem sizes. Using ideas from prior work, we develop a scalable low-rank optimization framework with provable guarantees for problems like the densest k-subgraph (DkS) and sparse PCA. Our experimental findings indicate that this low-rank framework can outperform the state-of-the-art by offering higher-quality and more interpretable solutions, and by scaling up to problem inputs with billions of entries.
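The low-rank framework itself is not described in the abstract. As a rough illustration of the primitive such relaxations build on, the sketch below runs plain power iteration to approximate the leading eigenvector of a small symmetric adjacency matrix, i.e. its best rank-1 approximation. The matrix, iteration count, and use of the result are illustrative assumptions, not the thesis method.

```java
import java.util.Arrays;

/** Power iteration on a small symmetric adjacency matrix: the rank-1
 *  building block behind many low-rank relaxations. Sketch only. */
public class PowerIteration {
    public static void main(String[] args) {
        // Adjacency matrix of a toy 4-node graph (made-up example data).
        double[][] a = {
            {0, 1, 1, 0},
            {1, 0, 1, 0},
            {1, 1, 0, 1},
            {0, 0, 1, 0}
        };
        int n = a.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / Math.sqrt(n));       // start from a unit vector

        for (int iter = 0; iter < 100; iter++) {
            double[] w = new double[n];
            for (int i = 0; i < n; i++)            // w = A v
                for (int j = 0; j < n; j++)
                    w[i] += a[i][j] * v[j];
            double norm = 0;
            for (double x : w) norm += x * x;
            norm = Math.sqrt(norm);
            for (int i = 0; i < n; i++) v[i] = w[i] / norm;
        }
        // v now approximates the top eigenvector; sorting or thresholding its
        // entries gives a candidate vertex set for DkS-style relaxations.
        System.out.println("leading eigenvector ~ " + Arrays.toString(v));
    }
}
```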
3. Scalable Discovery and Analytics on Web Linked Data. Abdelaziz, Ibrahim.
Resource Description Framework (RDF) provides a simple way for expressing facts across the web, leading to Web linked data. Several distributed and federated RDF systems have emerged to handle the massive amounts of RDF data available nowadays. Distributed systems are optimized to query massive datasets that appear as a single graph, while federated systems are designed to query hundreds of decentralized and interlinked graphs.
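For readers unfamiliar with SPARQL over RDF, the sketch below evaluates a basic graph-pattern query with Apache Jena's standard query API. It only illustrates the query model that the distributed and federated systems studied here answer at far larger scale; the data file and query are placeholders, and none of this is code from the thesis systems.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

/** Basic SPARQL evaluation over an in-memory RDF model using Apache Jena.
 *  Distributed and federated engines answer the same kind of query over
 *  graphs with billions of triples. */
public class SparqlSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("data.ttl");   // hypothetical Turtle file with FOAF data

        String queryText =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
            "SELECT ?person ?friend WHERE { ?person foaf:knows ?friend }";
        Query query = QueryFactory.create(queryText);

        try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.get("person") + " knows " + row.get("friend"));
            }
        }
    }
}
```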
This thesis starts with a comprehensive experimental study of the state-of-the-art RDF systems. It identifies a set of research problems for improving the state-of-the-art, including: supporting the emerging RDF analytics required by many modern applications, querying linked data at scale, and enabling discovery on linked data. Addressing these problems is the focus of this thesis.
First, we propose Spartex; a versatile framework for complex RDF analytics. Spartex extends SPARQL to seamlessly combine generic graph algorithms with SPARQL queries. Spartex implements a generic SPARQL operator as a vertex-centric program that interprets SPARQL queries and executes them efficiently using a built-in optimizer. We demonstrate that Spartex scales to datasets with billions of edges, and is at least as fast as the state-of-the-art specialized RDF engines. For analytical tasks, Spartex is an order of magnitude faster than existing alternatives.
To address the scalability limitation of federated RDF engines, we propose Lusail; a scalable system for querying geo-distributed RDF graphs. Lusail follows a two-tier strategy: (i) locality-aware decomposition of the query into subqueries to maximize the computations at the endpoints and minimize intermediary results, and (ii) selectivity-aware execution to reduce network latency and increase parallelism. Our experiments on billions of triples show that Lusail outperforms existing systems by orders of magnitude in scalability and response time.
Finally, enabling discovery on linked data is challenging due to the prior knowledge required to formulate SPARQL queries. To address these challenges, we develop novel techniques to (i) predict semantically equivalent SPARQL queries from a set of keywords by leveraging word embeddings, and (ii) generate fine-grained and non-blocking query plans to obtain fast and early results.
4. Scalable analytics of massive graphs. Popova, Diana, 20 December 2018.
Graphs are commonly selected as a model of scientific information: graphs can successfully represent imprecise, uncertain, and noisy data, and graph theory has a well-developed mathematical apparatus that forms a solid and sound foundation for graph research. The design and experimental confirmation of new, scalable, practical analytics for massive graphs have been actively researched for decades. Our work concentrates on developing new, accurate, and efficient algorithms that compute the most influential nodes and communities in an arbitrary graph. Our algorithms for decomposing a graph into families of most influential communities compute those communities faster and with a smaller memory footprint than existing algorithms for the problem. Our algorithms for influence maximization in large graphs use much less memory than existing state-of-the-art algorithms while providing solutions of equal accuracy. Our main contribution is designing data structures and algorithms that drastically cut the memory footprint and scale the computation of influential communities and nodes up to massive modern graphs. The algorithms and their implementations can efficiently handle networks of billions of edges using a single consumer-grade machine. These claims are supported by extensive experiments on large real-world graphs of different types.
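The thesis algorithms are not given in the abstract. For context, the sketch below shows the classical greedy baseline for influence maximization under the independent cascade model: estimate spread by Monte Carlo simulation and repeatedly add the seed with the largest marginal gain. This is the memory-hungry baseline that scalable algorithms improve on; the toy graph, propagation probability, and simulation count are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Classical greedy influence maximization under the independent cascade model.
 *  A baseline sketch, not the thesis algorithms. */
public class GreedyInfluenceMax {
    static final double P = 0.1;        // uniform propagation probability (assumption)
    static final int SIMULATIONS = 200; // Monte Carlo runs per spread estimate

    // Estimate the expected spread of a seed set by simulating cascades.
    static double estimateSpread(List<List<Integer>> adj, Set<Integer> seeds, Random rng) {
        long total = 0;
        for (int s = 0; s < SIMULATIONS; s++) {
            Set<Integer> active = new HashSet<>(seeds);
            ArrayDeque<Integer> frontier = new ArrayDeque<>(seeds);
            while (!frontier.isEmpty()) {
                int u = frontier.poll();
                for (int v : adj.get(u)) {
                    if (!active.contains(v) && rng.nextDouble() < P) {
                        active.add(v);
                        frontier.add(v);
                    }
                }
            }
            total += active.size();
        }
        return (double) total / SIMULATIONS;
    }

    public static void main(String[] args) {
        int n = 6, k = 2;
        int[][] edges = {{0,1},{0,2},{1,2},{2,3},{3,4},{4,5},{1,4}}; // toy graph
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }

        Random rng = new Random(42);
        Set<Integer> seeds = new HashSet<>();
        for (int round = 0; round < k; round++) {
            int best = -1;
            double bestSpread = -1;
            for (int v = 0; v < n; v++) {          // greedy: best marginal gain
                if (seeds.contains(v)) continue;
                seeds.add(v);
                double spread = estimateSpread(adj, seeds, rng);
                seeds.remove(v);
                if (spread > bestSpread) { bestSpread = spread; best = v; }
            }
            seeds.add(best);
            System.out.printf("picked seed %d, estimated spread %.2f%n", best, bestSpread);
        }
    }
}
```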
5. Distributed enumeration of four node graphlets at quadrillion-scale. Liu, Xiaozhou, 19 November 2021.
Graphlet enumeration is a basic task in graph analysis with many applications, so it is important to be able to perform it within a reasonable amount of time. However, this objective is challenging when the input graph is very large, with millions of nodes and edges, and known solutions are limited in terms of scalability. Distributed computing is often proposed as a solution to improve scalability. However, it has to be done carefully to reduce the overhead cost and to truly benefit from the distributed solution. We study the enumeration of four-node graphlets in undirected graphs using a distributed platform. We propose an efficient distributed solution that significantly surpasses the existing solutions. With this method we are able to process graphs larger than any processed before and enumerate quadrillions of graphlets using a modest cluster of machines. We convincingly show the scalability of our solution through experimental results.
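The distributed algorithm itself is not spelled out in the abstract. As a single-machine point of reference, the sketch below enumerates one of the four-node graphlets, the 4-clique, using adjacency sets and an ordering constraint so that each clique is reported exactly once. The toy graph is made up, and a distributed enumerator would partition this kind of work across machines.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Single-machine enumeration of 4-cliques (one of the four-node graphlets):
 *  for each edge (u, v) with u < v, look for w and x with v < w < x that are
 *  adjacent to all previously chosen vertices. Sequential baseline sketch only. */
public class FourCliqueEnumerator {
    public static void main(String[] args) {
        int n = 6;
        int[][] edges = {{0,1},{0,2},{0,3},{1,2},{1,3},{2,3},{2,4},{3,4},{4,5}};

        List<Set<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new HashSet<>());
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }

        long count = 0;
        for (int[] e : edges) {
            int u = Math.min(e[0], e[1]), v = Math.max(e[0], e[1]);
            // w and x must both be larger than v, so each clique is found once,
            // anchored at its two smallest vertices.
            for (int w = v + 1; w < n; w++) {
                if (!adj.get(u).contains(w) || !adj.get(v).contains(w)) continue;
                for (int x = w + 1; x < n; x++) {
                    if (adj.get(u).contains(x) && adj.get(v).contains(x) && adj.get(w).contains(x)) {
                        System.out.printf("4-clique: {%d, %d, %d, %d}%n", u, v, w, x);
                        count++;
                    }
                }
            }
        }
        System.out.println("total 4-cliques: " + count);
    }
}
```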
6. Algorithms and Frameworks for Graph Analytics at Scale. Jamour, Fuad Tarek, 28 February 2019.
Graph queries typically involve retrieving entities with certain properties and connectivity patterns. One popular property is betweenness centrality, a quantitative measure of importance used in many applications such as identifying influential users in social networks. Solving graph queries that retrieve important entities with user-defined connectivity patterns in large graphs requires efficient computation of betweenness centrality and efficient graph query engines. The first part of this thesis studies the betweenness centrality problem, while the second part presents a framework for building efficient graph query engines.
Computing betweenness centrality entails computing all-pairs shortest paths; thus, exact computation is costly. The performance of existing approximation algorithms is not well understood due to the lack of an established benchmark. Since graphs in many applications are inherently evolving, several incremental algorithms were proposed. However, they cannot scale to large graphs: they either require excessive memory or perform unnecessary computations rendering them prohibitively slow. Existing graph query engines rely on exhaustive indices for accelerating query evaluation. The time and memory required to build these indices can be prohibitively high for large graphs. This thesis attempts to solve the aforementioned limitations in the graph analytics literature as follows.
First, we present a benchmark for evaluating betweenness centrality approximation algorithms. Our benchmark includes ground-truth data for large graphs in addition to a systematic evaluation methodology. This benchmark is the first attempt to standardize the evaluation of betweenness centrality approximation algorithms, and it is currently being used by several research groups working on approximate betweenness in large graphs. Then, we present a linear-space parallel incremental algorithm for updating betweenness centrality in large evolving graphs. Our algorithm uses biconnected component decomposition to localize the processing of graph updates, and it performs incremental computation even within affected components. It is up to an order of magnitude faster than the state-of-the-art parallel incremental algorithm. Finally, we present a framework for building graph query engines with a low memory footprint. Our framework avoids building exhaustive indices and instead uses highly optimized matrix algebra operations. It loads datasets and evaluates data-intensive queries up to an order of magnitude faster than existing engines.
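For reference, exact betweenness centrality on unweighted graphs is usually computed with Brandes' algorithm: one breadth-first search per source followed by back-propagation of path dependencies, O(nm) time overall. The sketch below is a compact single-threaded version on a toy graph; the approximation and incremental algorithms above exist precisely because rerunning this from scratch is too expensive on large evolving graphs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Brandes' exact betweenness centrality for a small unweighted, undirected
 *  graph: a BFS per source plus a dependency accumulation pass. */
public class BrandesBetweenness {
    public static double[] betweenness(List<List<Integer>> adj) {
        int n = adj.size();
        double[] bc = new double[n];
        for (int s = 0; s < n; s++) {
            ArrayDeque<Integer> stack = new ArrayDeque<>();
            List<List<Integer>> pred = new ArrayList<>();
            for (int i = 0; i < n; i++) pred.add(new ArrayList<>());
            long[] sigma = new long[n];          // number of shortest paths from s
            int[] dist = new int[n];
            Arrays.fill(dist, -1);
            sigma[s] = 1;
            dist[s] = 0;

            ArrayDeque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {           // BFS phase
                int v = queue.poll();
                stack.push(v);
                for (int w : adj.get(v)) {
                    if (dist[w] < 0) { dist[w] = dist[v] + 1; queue.add(w); }
                    if (dist[w] == dist[v] + 1) { sigma[w] += sigma[v]; pred.get(w).add(v); }
                }
            }
            double[] delta = new double[n];
            while (!stack.isEmpty()) {           // dependency accumulation phase
                int w = stack.pop();
                for (int v : pred.get(w))
                    delta[v] += (double) sigma[v] / sigma[w] * (1 + delta[w]);
                if (w != s) bc[w] += delta[w];
            }
        }
        // Each pair is counted from both endpoints in an undirected graph.
        for (int i = 0; i < n; i++) bc[i] /= 2;
        return bc;
    }

    public static void main(String[] args) {
        int n = 5;
        int[][] edges = {{0,1},{1,2},{2,3},{3,4},{1,3}};
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }
        System.out.println(Arrays.toString(betweenness(adj)));
    }
}
```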
7. The GraphGrind framework: fast graph analytics on large shared-memory systems. Sun, Jiawen, January 2018.
As shared memory systems support terabyte-sized main memory, they provide an opportunity to perform efficient graph analytics on a single machine. Graph analytics is characterised by frequent synchronisation, which is addressed in part by shared memory systems. However, performance is limited by load imbalance and poor memory locality, which originate in the irregular structure of small-world graphs. This dissertation demonstrates how graph partitioning can be used to optimise (i) load balance, (ii) Non-Uniform Memory Access (NUMA) locality and (iii) temporal locality of graph analytics in shared memory systems. The developed techniques are implemented in GraphGrind, a new shared memory graph analytics framework.

First, this dissertation shows that heuristic edge-balanced partitioning results in an imbalance in the number of vertices per partition. Thus, load imbalance exists between partitions, either for loops iterating over vertices or for loops iterating over edges. To address this issue, this dissertation introduces a classification of algorithms to distinguish whether they algorithmically benefit from edge-balanced or vertex-balanced partitioning. This classification supports the adaptation of partitions to the characteristics of graph algorithms. Evaluation in GraphGrind shows that this approach outperforms state-of-the-art graph analytics frameworks for shared memory, including Ligra by 1.46x on average and Polymer by 1.16x on average, using a variety of graph algorithms and datasets.

Secondly, this dissertation demonstrates that increasing the number of graph partitions is effective for improving temporal locality, due to smaller working sets. However, a larger number of partitions leads to vertex replication in some graph data structures. This dissertation therefore adopts a graph layout that is immune to vertex replication, and designs an automatic graph traversal algorithm that extends the previously established traversal heuristics to a 3-way graph layout choice. This new algorithm furthermore depends upon the classification of graph algorithms introduced in the first part of the work. These techniques achieve an average speedup of 1.79x over Ligra and 1.42x over Polymer.

Finally, this dissertation presents a graph ordering algorithm that challenges the widely accepted heuristic of balancing the number of edges per partition while minimising edge or vertex cut. This algorithm balances the number of edges per partition as well as the number of unique destinations of those edges, balancing both edges and vertices for graphs with a power-law degree distribution. Moreover, this dissertation shows that the performance of graph ordering depends upon the characteristics of graph analytics frameworks, such as NUMA-awareness. The graph ordering algorithm achieves an average speedup of 1.87x over Ligra and 1.51x over Polymer.
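The baseline the first contribution starts from, edge-balanced range partitioning, is simple to state: cut the vertex range into p pieces holding roughly the same number of edges. On a skewed degree distribution the vertex counts per partition then diverge badly, which is the imbalance the classification of algorithms addresses. A minimal sketch with a made-up degree sequence:

```java
import java.util.Arrays;

/** Edge-balanced range partitioning: split the vertex range into p partitions
 *  holding roughly equal numbers of edges. With a skewed degree sequence the
 *  resulting vertex counts are unbalanced. Sketch only, not GraphGrind code. */
public class EdgeBalancedPartition {
    public static void main(String[] args) {
        // Hypothetical out-degree sequence of a small power-law-like graph.
        int[] degree = {400, 350, 120, 80, 60, 40, 30, 20, 15, 10, 8, 6, 5, 4, 3, 2, 2, 1, 1, 1, 1, 1};
        int p = 4;
        long totalEdges = Arrays.stream(degree).asLongStream().sum();
        long target = (totalEdges + p - 1) / p;      // edges per partition (ceiling)

        int part = 0;
        long edgesInPart = 0;
        int verticesInPart = 0;
        for (int v = 0; v < degree.length; v++) {
            edgesInPart += degree[v];
            verticesInPart++;
            boolean lastVertex = v == degree.length - 1;
            // Close the current partition once it holds enough edges;
            // note how few vertices the early partitions end up with.
            if ((edgesInPart >= target && part < p - 1) || lastVertex) {
                System.out.printf("partition %d: %d vertices, %d edges%n",
                        part, verticesInPart, edgesInPart);
                part++;
                edgesInPart = 0;
                verticesInPart = 0;
            }
        }
    }
}
```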
8. A Systematic Approach for Obtaining Performance on Matrix-Like Operations. Veras, Richard Michael, 01 August 2017.
Scientific computation plays a critical role in the scientific process because it allows us to ask complex queries and test predictions that would otherwise be infeasible to perform experimentally. Because of its power, scientific computing has helped drive advances in many fields, ranging from Engineering and Physics to Biology and Sociology to Economics and Drug Development, and even to Machine Learning and Artificial Intelligence. Common among these domains is the desire for timely computational results, so a considerable amount of human expert effort is spent on obtaining performance for these scientific codes. However, this is no easy task, because each of these domains presents its own unique set of challenges to software developers, such as domain-specific operations, structurally complex data, and ever-growing datasets. Compounding these problems is the myriad of constantly changing, complex, and unique hardware platforms that an expert must target. Unfortunately, an expert is typically forced to reproduce their effort across multiple problem domains and hardware platforms. In this thesis, we demonstrate the automatic generation of expert-level high-performance scientific codes for Dense Linear Algebra (DLA), Structured Mesh (Stencil), Sparse Linear Algebra, and Graph Analytics. In particular, this thesis seeks to address the issue of obtaining performance on many complex platforms for a certain class of matrix-like operations that spans many scientific, engineering, and social fields. We do this by automating a method used for obtaining high performance in DLA and extending it to structured, sparse, and scale-free domains. We argue that it is the underlying structure found in the data from these domains that enables this process. Thus, obtaining performance for most operations does not occur in isolation from the data being operated on, but instead depends significantly on the structure of the data.
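One concrete matrix-like operation whose performance hinges on the structure of the data is sparse matrix-vector multiplication: on a graph adjacency matrix, y = Ax is a single push of values along every edge, so the same kernel underlies sparse linear algebra and many graph analytics. The sketch below uses the standard compressed sparse row (CSR) layout with toy data; a high-performance version would add blocking, vectorisation, and structure-specific specialisation of the kind this thesis generates automatically.

```java
/** Sparse matrix-vector multiply in compressed sparse row (CSR) form.
 *  For a graph adjacency matrix this is one pass of pushing values along
 *  every edge. Toy data; no blocking or vectorisation. */
public class CsrSpmv {
    public static void main(String[] args) {
        // 4x4 sparse matrix (e.g. a weighted adjacency matrix) in CSR:
        //   row 0: col 1 -> 0.5    row 1: col 0 -> 1.0, col 2 -> 2.0
        //   row 2: col 3 -> 1.5    row 3: col 0 -> 3.0, col 2 -> 0.25
        int[] rowPtr = {0, 1, 3, 4, 6};
        int[] colIdx = {1, 0, 2, 3, 0, 2};
        double[] val = {0.5, 1.0, 2.0, 1.5, 3.0, 0.25};
        double[] x = {1, 2, 3, 4};
        double[] y = new double[4];

        for (int i = 0; i < 4; i++) {              // one sparse dot product per row
            double sum = 0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                sum += val[k] * x[colIdx[k]];
            }
            y[i] = sum;
        }
        System.out.println(java.util.Arrays.toString(y));  // expected: [1.0, 7.0, 6.0, 3.75]
    }
}
```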
9. Serverless Streaming Graph Analytics with Flink Stateful Functions. Chen, Sihan, January 2022.
Serverless Function-as-a-Service (FaaS) platforms are an emerging trend driven by the continuous improvement of the cloud-native ecosystem. Graph streaming analytics is a well-known research area that demands well-designed computation paradigms and complex optimization of storage and queries, and using serverless platforms for graph streaming analytics is a promising direction. For one thing, serverless platforms treat Functions as first-class citizens, so users can use or extend Functions while caring only about the application layer, obtaining results without knowledge of the underlying architecture or environment. For another, distributed large-scale graph problems typically demand a message-passing actor model, and serverless platforms can dedicate one Function instance to each vertex, with its own context, where each Function evolves its state by exchanging messages with the others. This style of processing is native to distributed stateful applications and can smoothly support graph streaming analytics. A temporal graph is a graph that evolves with time. With timestamps on edges, users can retrieve historical graph states, and even graph states within arbitrary event-time windows, for further analytics. Handling temporal graph analytics problems on serverless platforms is the focus of this thesis. Flink Stateful Functions, a newly built API under the umbrella of Apache Flink, simplifies building distributed stateful applications with a runtime for serverless architectures, with full support for stateful entity modelling with location transparency, concurrency, scaling, and resiliency. Flink Stateful Functions is thus a powerful tool for temporal graph streaming analytics on a serverless platform. In this thesis project, a temporal graph processing library is built on Flink Stateful Functions. It supports efficient storage and querying for temporal graph analytics problems.
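The abstract describes the one-Function-per-vertex pattern without code. The sketch below mimics that pattern in plain Java, deliberately without the Flink Stateful Functions SDK: each vertex holds its own state of timestamped edges and evolves it as messages arrive, and a windowed out-degree query illustrates the temporal side. All names and data are illustrative; in Stateful Functions the state, routing, and fault tolerance would be managed by the runtime.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java mock of the one-stateful-function-per-vertex pattern: each vertex
 *  keeps isolated state (timestamped edges) and updates it when a message
 *  arrives. Shows only the programming model, not the Stateful Functions SDK.
 *  Requires Java 16+ for records. */
public class VertexFunctionSketch {
    // A timestamped edge event, as in a temporal graph stream.
    record EdgeEvent(String src, String dst, long timestamp) {}

    // Per-vertex "function instance": isolated state, evolved by messages.
    static class VertexState {
        final List<EdgeEvent> outEdges = new ArrayList<>();

        void onMessage(EdgeEvent e) { outEdges.add(e); }

        // Temporal query: out-degree restricted to an event-time window.
        long outDegreeBetween(long from, long to) {
            return outEdges.stream()
                    .filter(e -> e.timestamp() >= from && e.timestamp() < to)
                    .count();
        }
    }

    public static void main(String[] args) {
        Map<String, VertexState> vertices = new HashMap<>();
        EdgeEvent[] stream = {
            new EdgeEvent("a", "b", 10),
            new EdgeEvent("a", "c", 25),
            new EdgeEvent("b", "c", 30),
            new EdgeEvent("a", "d", 55),
        };
        // "Routing": deliver each event to the function instance of its source vertex.
        for (EdgeEvent e : stream) {
            vertices.computeIfAbsent(e.src(), k -> new VertexState()).onMessage(e);
        }
        System.out.println("out-degree of a in window [0, 50): "
                + vertices.get("a").outDegreeBetween(0, 50));  // prints 2
    }
}
```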
10. Bulk Synchronous Parallel Implementation of Percolation Centrality for Large Scale Graphs. Saad, Kristen M., 30 August 2017.
No description available.