Global ETD Search

Return to search

Evaluating Clustering Techniques over Big Data in Distributed Infrastructures

Clustering is defined as the process of grouping a set of objects in a way that objects in the same group are similar in some sense to each other than to those in other groups. It is used in many fields including machine learning, image recognition, pattern recognition and knowledge discovery. In this era of Big Data, we could leverage the computing power of distributed environment to achieve it over large dataset. It can be achieved through various algorithms, but in general they have high time complexities. We see that for large datasets the scalability and the parameters of the environment in which it is running become issues which needs to be addressed. Therefore it's brute force implementation is not scalable over large datasets even in a distributed environment, which calls the need for an approximation technique or optimization to make it scalable. We study three clustering techniques: CURE, DBSCAN and k-means over distributed environment like Hadoop. For each of these algorithms we understand their performance trade offs and bottlenecks and then propose enhancements or optimizations or an approximation technique to make it scalable in Hadoop. Finally we evaluate it's performance and suitability to datasets of different sizes and distributions.

clustering

hadoop

Identifer	oai:union.ndltd.org:wpi.edu/oai:digitalcommons.wpi.edu:etd-theses-2225
Date	25 April 2018
Creators	Shetty, Kartik
Contributors	Mohamed Y. Eltabakh, Advisor, Dmitry Korkin, Reader,
Publisher	Digital WPI
Source Sets	Worcester Polytechnic Institute
Detected Language	English
Type	text
Format	application/pdf
Source	Masters Theses (All Theses, All Years)

Page generated in 0.0015 seconds

Evaluating Clustering Techniques over Big Data in Distributed Infrastructures

Description

Links & Downloads

Tags

Additional Fields