In recent years, mining data streams has been widely studied. A data streams is a
sequence of dynamic, continuous, unbounded and real time data items with a very
high data rate that can only be read once. In data mining, clustering is one of use-
ful techniques for discovering interesting data in the underlying data objects. The
problem of clustering can be defined formally as follows: given n data points in the d-
dimensional metric space, partition the data points into k clusters such that the data
points within a cluster are more similar to each other than data points in different
clusters. In the data streams environment, the difficulties of data streams clustering
contain storage overhead, low clustering quality and a low updating efficiency. Cur-
rent clustering algorithms can be broadly classified into four categories: partition,
hierarchical, density-based and grid-based approaches. The advantage of the grid-
based algorithm is that it can handle large databases. Based on the density-based
approach, the insertion or deletion of data affects the current clustering only in the
neighborhood of this data. Combining the advantages of the grid-based approach
and density-based approach, the CDS-Tree algorithm was proposed. Although it can
handle large databases, its clustering quality is restricted to the grid partition and the
threshold of a dense cell. Therefore, in this thesis, we present a new clustering algo-
rithm with high quality, GDense, for data streams. The GDense algorithm has high
quality due to two kinds of partition: cells and quadcells, and two kinds of threshold:
£_ and (1/4) . Moreover, in our GDense algorithm, in the data insertion part, the
7 cases takes 3 factors about the cell and the quadcell into consideration. In the
deletion part, the 10 cases take 5 factors about the cell into consideration. From our
simulation results, no matter what condition (including the number of data points,
the number of cells, the size of the sliding window, and the threshold of dense cell)
is, the clustering purity of our GDense algorithm is always higher than that of the
CDS-Tree algorithm. Moreover, we make a comparison of the purity between the our
GDense algorithm and the CDS-Tree algorithm with outliers. No matter whether the
number of outliers is large or small, the clustering purity of our GDense algorithm is
still higher than that of the CDS-Tree and we can improve about 20% the clustering
purity as compared to the CDS-Tree algorithm.
Identifer | oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0625109-171938 |
Date | 25 June 2009 |
Creators | Lin, Shu-Yi |
Contributors | Ye-In Chang, Chien-I Lee, San-Yi Huang, Gen-Huey Chen |
Publisher | NSYSU |
Source Sets | NSYSU Electronic Thesis and Dissertation Archive |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0625109-171938 |
Rights | withheld, Copyright information available at source archive |
Page generated in 0.0017 seconds