1 |
Incremental Anomaly Detection Using Two-Layer Cluster-based StructureBigdeli, Elnaz January 2016 (has links)
Anomaly detection algorithms face several challenges, including processing
speed and dealing with noise in data. In this thesis, a two-layer cluster-
based anomaly detection structure is presented which is fast, noise-resilient
and incremental. In this structure, each normal pattern is considered as
a cluster, and each cluster is represented using a Gaussian Mixture Model
(GMM). Then, new instances are presented to the GMM to be labeled as
normal or abnormal.
The proposed structure comprises three main steps. In the first step, the
data are clustered. The second step is to represent each cluster in a way
that enables the model to classify new instances. The Summarization based
on Gaussian Mixture Model (SGMM) proposed in this thesis represents each
cluster as a GMM.
In the third step, a two-layer structure efficiently updates clusters using
GMM representation while detecting and ignoring redundant instances. A
new approach, called Collective Probabilistic Labeling (CPL) is presented
to update clusters in a batch mode. This approach makes the updating
phase noise-resistant and fast. The collective approach also introduces a new
concept called 'rag bag' used to store new instances. The new instances
collected in the rag bag are clustered and summarized by GMMs. This
enables online systems to identify nearby clusters in the existing and new
clusters, and merge them quickly, despite the presence of noise to update the model.
An important step in the updating is the merging of new clusters with ex-
isting ones. To this end, a new distance measure is proposed, which is a mod-
i ed Kullback-Leibler distance between two GMMs. This modi ed distance
allows accurate identi cation of nearby clusters. After finding neighboring
clusters, they are merged, quickly and accurately. One of the reasons that
GMM is chosen to represent clusters is to have a clear and valid mathematical
representation for clusters, which eases further cluster analysis.
In most real-time anomaly detection applications, incoming instances are
often similar to previous ones. In these cases, there is no need to update
clusters based on duplicates, since they have already been modeled in the
cluster distribution. The two-layer structure is responsible for identifying
redundant instances. In this structure, redundant instance are ignored, and
the remaining new instances are used to update clusters. Ignoring redundant
instances, which are typically in the majority, makes the detection phase fast.
Each part of the general structure is validated in this thesis. The experiments include, detection rates, clustering goodness, time, memory usage
and the complexity of the algorithms. The accuracy of the clustering and
summarization of clusters using GMMs is evaluated, and compared to that of
other methods. Using Davies-Bouldin (DB) and Dunn indexes, the distances
for original and regenerated clusters using GMMs is almost zero with SGMM
method while this value for ABACUS is around 0:01. Moreover, the results
show that the SGMM algorithm is 3 times faster than ABACUS in running
time, using one-third of the memory used by ABACUS.
The CPL method, used to label new instances, is found to collectively
remove the effect of noise, while increasing the accuracy of labeling new
instances. In a noisy environment, the detection rate of the CPL method
is 5% higher than other algorithms such as one-class SVM. The false alarm rate is decreased by 10% on average. Memory use is 20 times lesser that that
of the one-class SVM.
The proposed method is found to lower the false alarm rate, which is
one of the basic problems for the one-class SVM. Experiments show the false
alarm rate is decreased from 5% to 15% among different datasets, while the
detection rate is increased from 5% to 10% in di erent datasets with two-
layer structure. The memory usage for the two-layer structure is 20 to 50
times less than that of one-class SVM. One-class SVM uses support vectors in
labeling new instances, while the labeling of the two-layer structure depends
on the number of GMMs. The experiments show that the two-layer structure
is 20 to 50 times faster than the one-class SVM in labeling new instances.
Moreover, the updating time of two-layer structure is 2 to 3 times less than
one-layer structure. This reduction is the direct result of ignoring redundant
instances and using two-layer structure.
|
2 |
Efficient, Parameter-Free Online ClusteringCunningham, James January 2020 (has links)
No description available.
|
3 |
Online stochastic algorithms / Algorithmes stochastiques en ligneLi, Le 27 November 2018 (has links)
Cette thèse travaille principalement sur trois sujets. Le premier concentre sur le clustering en ligne dans lequel nous présentons un nouvel algorithme stochastique adaptatif pour regrouper des ensembles de données en ligne. Cet algorithme repose sur l'approche quasi-bayésienne, avec une estimation dynamique (i.e., dépendant du temps) du nombre de clusters. Nous prouvons que cet algorithme atteint une borne de regret de l'ordre et que cette borne est asymptotiquement minimax sous la contrainte sur le nombre de clusters. Nous proposons aussi une implémentation par RJMCMC. Le deuxième sujet est lié à l'apprentissage séquentiel des courbes principales qui cherche à résumer une séquence des données par une courbe continue. Pour ce faire, nous présentons une procédure basée sur une approche maximum a posteriori pour le quasi-posteriori de Gibbs. Nous montrons que la borne de regret de cet algorithme et celui de sa version adaptative est sous-linéaire en l'horizon temporel T. En outre, nous proposons une implémentation par un algorithme glouton local qui intègre des éléments de sleeping experts et de bandit à plusieurs bras. Le troisième concerne les travaux qui visent à accomplir des tâches pratiques au sein d'iAdvize, l'entreprise qui soutient cette thèse. Il inclut l'analyse des sentiments pour les messages textuels et l'implémentation de chatbot dans lesquels la première est réalisé par les méthodes classiques dans la fouille de textes et les statistiques et la seconde repose sur le traitement du langage naturel et les réseaux de neurones artificiels. / This thesis works mainly on three subjects. The first one is online clustering in which we introduce a new and adaptive stochastic algorithm to cluster online dataset. It relies on a quasi-Bayesian approach, with a dynamic (i.e., time-dependent) estimation of the (unknown and changing) number of clusters. We prove that this algorithm has a regret bound of the order of and is asymptotically minimax under the constraint on the number of clusters. A RJMCMC-flavored implementation is also proposed. The second subject is related to the sequential learning of principal curves which seeks to represent a sequence of data by a continuous polygonal curve. To this aim, we introduce a procedure based on the MAP of Gibbs-posterior that can give polygonal lines whose number of segments can be chosen automatically. We also show that our procedure is supported by regret bounds with sublinear remainder terms. In addition, a greedy local search implementation that incorporates both sleeping experts and multi-armed bandit ingredients is presented. The third one concerns about the work which aims to fulfilling practical tasks within iAdvize, the company which supports this thesis. It includes sentiment analysis for textual messages by using methods in both text mining and statistics, and implementation of chatbot based on nature language processing and neural networks.
|
4 |
Détection d'anomalies à la volée dans des signaux vibratoires / Anomaly detection in high-dimensional datastreamsBellas, Anastasios 28 January 2014 (has links)
Le thème principal de cette thèse est d’étudier la détection d’anomalies dans des flux de données de grande dimension avec une application spécifique au Health Monitoring des moteurs d’avion. Dans ce travail, on considère que le problème de la détection d’anomalies est un problème d’apprentissage non supervisée. Les données modernes, notamment celles issues de la surveillance des systèmes industriels sont souvent des flux d’observations de grande dimension, puisque plusieurs mesures sont prises à de hautes fréquences et à un horizon de temps qui peut être infini. De plus, les données peuvent contenir des anomalies (pannes) du système surveillé. La plupart des algorithmes existants ne peuvent pas traiter des données qui ont ces caractéristiques. Nous introduisons d’abord un algorithme de clustering probabiliste offline dans des sous-espaces pour des données de grande dimension qui repose sur l’algorithme d’espérance-maximisation (EM) et qui est, en plus, robuste aux anomalies grâce à la technique du trimming. Ensuite, nous nous intéressons à la question du clustering probabiliste online de flux de données de grande dimension en développant l’inférence online du modèle de mélange d’analyse en composantes principales probabiliste. Pour les deux méthodes proposées, nous montrons leur efficacité sur des données simulées et réelles, issues par exemple des moteurs d’avion. Enfin, nous développons une application intégrée pour le Health Monitoring des moteurs d’avion dans le but de détecter des anomalies de façon dynamique. Le système proposé introduit des techniques originales de détection et de visualisation d’anomalies reposant sur les cartes auto-organisatrices. Des résultats de détection sont présentés et la question de l’identification des anomalies est aussi discutée. / The subject of this Thesis is to study anomaly detection in high-dimensional data streams with a specific application to aircraft engine Health Monitoring. In this work, we consider the problem of anomaly detection as an unsupervised learning problem. Modern data, especially those is-sued from industrial systems, are often streams of high-dimensional data samples, since multiple measurements can be taken at a high frequency and at a possibly infinite time horizon. More-over, data can contain anomalies (malfunctions, failures) of the system being monitored. Most existing unsupervised learning methods cannot handle data which possess these features. We first introduce an offline subspace clustering algorithm for high-dimensional data based on the expectation-maximization (EM) algorithm, which is also robust to anomalies through the use of the trimming technique. We then address the problem of online clustering of high-dimensional data streams by developing an online inference algorithm for the popular mixture of probabilistic principal component analyzers (MPPCA) model. We show the efficiency of both methods on synthetic and real datasets, including aircraft engine data with anomalies. Finally, we develop a comprehensive application for the aircraft engine Health Monitoring domain, which aims at detecting anomalies in aircraft engine data in a dynamic manner and introduces novel anomaly detection visualization techniques based on Self-Organizing Maps. Detection results are presented and anomaly identification is also discussed.
|
Page generated in 0.0838 seconds