Global ETD Search

Return to search

Enumerating k-cliques in a large network using Apache Spark

Indiana University-Purdue University Indianapolis (IUPUI) / Network analysis is an important research task which explains the relationships
among various entities in a given domain. Most of the existing approaches of network
analysis compute global properties of a network, such as transitivity, diameter, and
all-pair shortest paths. They also study various non-random properties of a network,
such as graph densifi cation with shrinking diameter, small diameter, and scale-freeness.
Such approaches enable us to understand real-life networks with global properties.
However, the discovery of the local topological building blocks within a network
is an important task, and examples include clique enumeration, graphlet counting,
and motif counting. In this paper, my focus is to fi nd an efficient solution of k-clique
enumeration problem. A clique is a small, connected, and complete induced subgraph
over a large network. However, enumerating cliques using sequential technologies is
very time-consuming. Another promising direction that is being adopted is a solution
that runs on distributed clusters of machines using the Hadoop mapreduce
framework. However, the solution suffers from a general limitation of the framework,
as Hadoop's mapreduce performs substantial amounts of reading and writing to disk.
Thus, the running times of Hadoop-based approaches suffer enormously. To avoid
these problems, we propose an e cient, scalable, and distributed solution, kc-spark
, for enumerating cliques in real-life networks using the Apache Spark in-memory cluster
computing framework. Experiment results show that kc-spark can enumerate
k-cliques from very large real-life networks, whereas a single commodity machine cannot
produce the same desired result in a feasible amount of time. We also compared
kc-spark with Hadoop mapreduce solutions and found the algorithm to be 80-100
percent faster in terms of running times. On the other hand, we compared with the
triangle enumeration with Hadoop mapreduce and results shown that kc-spark is
8-10 times faster than mapreduce implementation with the same cluster setup. Furthermore,
the overall performance of kc-spark is improved by using Spark's inbuilt
caching and broadcast transformations.

Identifer	oai:union.ndltd.org:IUPUI/oai:scholarworks.iupui.edu:1805/12282
Date	January 2017
Creators	Dheekonda, Raja Sekhar Rao
Contributors	Al Hasan, MOHAMMAD
Source Sets	Indiana University-Purdue University Indianapolis
Language	en_US
Detected Language	English
Type	Thesis

Page generated in 0.0029 seconds

Enumerating k-cliques in a large network using Apache Spark

Description

Links & Downloads

Tags

Additional Fields