Anomaly detection algorithms face several challenges, including processing
speed and dealing with noise in data. In this thesis, a two-layer cluster-
based anomaly detection structure is presented which is fast, noise-resilient
and incremental. In this structure, each normal pattern is considered as
a cluster, and each cluster is represented using a Gaussian Mixture Model
(GMM). Then, new instances are presented to the GMM to be labeled as
normal or abnormal.
The proposed structure comprises three main steps. In the first step, the
data are clustered. The second step is to represent each cluster in a way
that enables the model to classify new instances. The Summarization based
on Gaussian Mixture Model (SGMM) proposed in this thesis represents each
cluster as a GMM.
In the third step, a two-layer structure efficiently updates clusters using
GMM representation while detecting and ignoring redundant instances. A
new approach, called Collective Probabilistic Labeling (CPL) is presented
to update clusters in a batch mode. This approach makes the updating
phase noise-resistant and fast. The collective approach also introduces a new
concept called 'rag bag' used to store new instances. The new instances
collected in the rag bag are clustered and summarized by GMMs. This
enables online systems to identify nearby clusters in the existing and new
clusters, and merge them quickly, despite the presence of noise to update the model.
An important step in the updating is the merging of new clusters with ex-
isting ones. To this end, a new distance measure is proposed, which is a mod-
i ed Kullback-Leibler distance between two GMMs. This modi ed distance
allows accurate identi cation of nearby clusters. After finding neighboring
clusters, they are merged, quickly and accurately. One of the reasons that
GMM is chosen to represent clusters is to have a clear and valid mathematical
representation for clusters, which eases further cluster analysis.
In most real-time anomaly detection applications, incoming instances are
often similar to previous ones. In these cases, there is no need to update
clusters based on duplicates, since they have already been modeled in the
cluster distribution. The two-layer structure is responsible for identifying
redundant instances. In this structure, redundant instance are ignored, and
the remaining new instances are used to update clusters. Ignoring redundant
instances, which are typically in the majority, makes the detection phase fast.
Each part of the general structure is validated in this thesis. The experiments include, detection rates, clustering goodness, time, memory usage
and the complexity of the algorithms. The accuracy of the clustering and
summarization of clusters using GMMs is evaluated, and compared to that of
other methods. Using Davies-Bouldin (DB) and Dunn indexes, the distances
for original and regenerated clusters using GMMs is almost zero with SGMM
method while this value for ABACUS is around 0:01. Moreover, the results
show that the SGMM algorithm is 3 times faster than ABACUS in running
time, using one-third of the memory used by ABACUS.
The CPL method, used to label new instances, is found to collectively
remove the effect of noise, while increasing the accuracy of labeling new
instances. In a noisy environment, the detection rate of the CPL method
is 5% higher than other algorithms such as one-class SVM. The false alarm rate is decreased by 10% on average. Memory use is 20 times lesser that that
of the one-class SVM.
The proposed method is found to lower the false alarm rate, which is
one of the basic problems for the one-class SVM. Experiments show the false
alarm rate is decreased from 5% to 15% among different datasets, while the
detection rate is increased from 5% to 10% in di erent datasets with two-
layer structure. The memory usage for the two-layer structure is 20 to 50
times less than that of one-class SVM. One-class SVM uses support vectors in
labeling new instances, while the labeling of the two-layer structure depends
on the number of GMMs. The experiments show that the two-layer structure
is 20 to 50 times faster than the one-class SVM in labeling new instances.
Moreover, the updating time of two-layer structure is 2 to 3 times less than
one-layer structure. This reduction is the direct result of ignoring redundant
instances and using two-layer structure.
Identifer | oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/34299 |
Date | January 2016 |
Creators | Bigdeli, Elnaz |
Contributors | Matwin, Stan, Raahemi, Bijan |
Publisher | Université d'Ottawa / University of Ottawa |
Source Sets | Université d’Ottawa |
Language | English |
Detected Language | English |
Type | Thesis |
Page generated in 0.002 seconds