1 |
Tuning and Optimising Concept Drift Detection / Do, Ethan Quoc-Nam / January 2021 (has links)
Data drift occurs naturally in data streams due to seasonality, changes in data usage, and changes in the data generation process. Concepts modelled from these streams experience the same drift. Differentiating concept drift from anomalies is important for separating normal from abnormal behaviour, yet existing techniques achieve poor responsiveness and accuracy on this differentiation task.
We take two approaches to address this problem. First, we extend an existing sliding window algorithm to use multiple windows that model recently seen data stream patterns, and we define new parameters for comparing the data streams. Second, we study a set of optimisers and tune the parameters of a Bi-LSTM model to maximize accuracy. / Thesis / Master of Applied Science (MASc)
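The first contribution's multi-window idea can be sketched as follows; the class name, window sizes, and accuracy-gap threshold are illustrative assumptions, not the thesis's actual design:

```python
from collections import deque

class MultiWindowDetector:
    """Hypothetical sketch of a multi-window drift check: a short window
    tracks recent behaviour, a long window tracks the established concept.
    Window sizes and the threshold are illustrative, not the thesis's values."""
    def __init__(self, short_size=30, long_size=300, threshold=0.15):
        self.short = deque(maxlen=short_size)
        self.long = deque(maxlen=long_size)
        self.threshold = threshold

    def add(self, correct):
        """correct: 1 if the model's prediction was right, 0 otherwise.
        Returns True when recent accuracy falls well below long-term accuracy."""
        self.short.append(correct)
        self.long.append(correct)
        if len(self.long) < self.long.maxlen:
            return False  # not enough history yet
        short_mean = sum(self.short) / len(self.short)
        long_mean = sum(self.long) / len(self.long)
        return (long_mean - short_mean) > self.threshold

det = MultiWindowDetector()
for _ in range(300):
    det.add(1)                                   # stable concept: mostly correct
drifted = any(det.add(0) for _ in range(30))     # accuracy collapses -> True
```

Comparing a short against a long window in this way is the simplest instance of the pattern; the thesis's comparison parameters are richer than a single mean gap.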
|
2 |
A Reservoir of Adaptive Algorithms for Online Learning from Evolving Data Streams / Pesaranghader, Ali / 26 September 2018 (has links)
Continuous change and development are essential aspects of evolving environments and applications, including, but not limited to, smart cities, military, medicine, nuclear reactors, self-driving cars, aviation, and aerospace. The fundamental characteristics of such environments may evolve and cause dangerous consequences, e.g., putting people's lives at stake, if no reaction is adopted. Learning systems therefore need intelligent algorithms that monitor changes in their environments and update themselves effectively. Further, the performance of learning algorithms may fluctuate as the incoming data continuously evolves: the currently efficient learning approach may become obsolete after a change in the data or the environment. Hence, the question 'how do we maintain an efficient learning algorithm over time against evolving data?' has to be addressed. In this thesis, we make two contributions to settle the challenges described above.
In the machine learning literature, the phenomenon of (distributional) change in data is known as concept drift. Concept drift may shift decision boundaries and cause a decline in accuracy. Learning algorithms therefore have to detect concept drift in evolving data streams and replace their predictive models accordingly. To address this challenge, adaptive learners have been devised that may utilize drift detection methods to locate the drift points in dynamic and changing data streams. A drift detection method that discovers drift points quickly, with the lowest false positive and false negative rates, is preferred. A false positive refers to incorrectly alarming for concept drift, and a false negative refers to failing to alarm for concept drift. In this thesis, we introduce three algorithms, called the Fast Hoeffding Drift Detection Method (FHDDM), the Stacking Fast Hoeffding Drift Detection Method (FHDDMS), and the McDiarmid Drift Detection Methods (MDDMs), for detecting drift points with minimal delay, false positive, and false negative rates. FHDDM is a sliding-window-based algorithm that applies Hoeffding's inequality (Hoeffding, 1963) to detect concept drift. FHDDM slides its window over the prediction results, which are either 1 (for a correct prediction) or 0 (for a wrong prediction). Meanwhile, it compares the mean of the elements inside the window with the maximum mean observed so far; a significant difference between the two means, upper-bounded by the Hoeffding inequality, indicates the occurrence of concept drift. FHDDMS extends the FHDDM algorithm by sliding multiple windows over its entries for better drift detection in terms of detection delay and false negative rate. In contrast to FHDDM/S, the MDDM variants assign weights to their entries, i.e., higher weights are associated with the most recent entries in the sliding window, for faster detection of concept drift.
The rationale is that recent examples reflect the ongoing situation adequately, so putting higher weights on the latest entries lets us detect concept drift more quickly. An MDDM algorithm bounds the difference between the weighted mean of elements in the sliding window and the maximum weighted mean seen so far, using McDiarmid's inequality (McDiarmid, 1989), and alarms for concept drift once a significant difference is observed. We experimentally show that FHDDM/S and the MDDMs outperform the state-of-the-art, achieving promising results in terms of the adaptation and classification measures.
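The FHDDM procedure described above can be sketched directly: slide a window over 0/1 prediction results and alarm when the window mean falls a Hoeffding-bounded margin below the best mean seen so far. The window size and confidence value below are illustrative defaults, not necessarily the paper's:

```python
import math
from collections import deque

class FHDDM:
    """Sketch of the Fast Hoeffding Drift Detection Method described above.
    n (window size) and delta (confidence) are illustrative defaults."""
    def __init__(self, n=100, delta=1e-7):
        self.window = deque(maxlen=n)
        self.n = n
        # Hoeffding bound: a mean drop larger than eps is unlikely by chance
        self.eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        self.mu_max = 0.0

    def add(self, correct):
        """correct: 1 for a right prediction, 0 for a wrong one.
        Returns True when drift is signalled."""
        self.window.append(correct)
        if len(self.window) < self.n:
            return False
        mu_t = sum(self.window) / self.n          # mean inside the window
        if mu_t > self.mu_max:
            self.mu_max = mu_t                    # best mean observed so far
        return (self.mu_max - mu_t) > self.eps
```

With n=100 and delta=1e-7 the bound is roughly 0.28, so drift is signalled once windowed accuracy drops about 28 points below its peak.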
Due to the evolving nature of data streams, the performance of an adaptive learner, defined by the classification, adaptation, and resource consumption measures, may fluctuate over time. In fact, a learning algorithm, in the form of a (classifier, detector) pair, may perform well before a concept drift point but poorly after it. We frame this problem with the question 'how can we ensure that an efficient classifier-detector pair is present at any time in an evolving environment?' To answer this, we developed the Tornado framework, which runs various kinds of learning algorithms simultaneously against evolving data streams. Each algorithm incrementally and independently trains a predictive model and updates the statistics of its drift detector. Meanwhile, the framework monitors the (classifier, detector) pairs and recommends the most efficient one, with respect to classification, adaptation, and resource consumption performance, to the user. We further define the holistic CAR measure, which integrates the classification, adaptation, and resource consumption measures for evaluating the performance of adaptive learning algorithms. Our experiments confirm that the most efficient algorithm may differ over time because of the developing and evolving nature of data streams.
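As an illustration of ranking (classifier, detector) pairs on a combined score, one might aggregate the three normalized measures with weights. This is a hypothetical stand-in, not the thesis's actual CAR formula, and the pair names and numbers are made up:

```python
def car_score(classification, adaptation, resource, weights=(0.5, 0.3, 0.2)):
    """Hypothetical aggregate of the three measures, each assumed to be
    pre-normalized to [0, 1] with higher = better. NOT the thesis's actual
    CAR definition, only an illustration of ranking pairs on one score."""
    w_c, w_a, w_r = weights
    return w_c * classification + w_a * adaptation + w_r * resource

# Hypothetical (classifier, detector) pairs with made-up measurements:
pairs = {
    ("HoeffdingTree", "FHDDM"): car_score(0.91, 0.88, 0.75),
    ("NaiveBayes", "DDM"):      car_score(0.86, 0.70, 0.95),
}
best = max(pairs, key=pairs.get)  # the pair a framework would recommend
```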
|
3 |
Analyse des différences dans le Big Data : Exploration, Explication, Évolution / Difference Analysis in Big Data: Exploration, Explanation, Evolution / Kleisarchaki, Sofia / 28 November 2016 (has links)
Variability in Big Data refers to data whose meaning changes continuously. For instance, data derived from social platforms and from monitoring applications exhibits great variability. This variability is essentially the result of changes in the underlying data distributions of attributes of interest, such as user opinions/ratings, computer network measurements, etc. Difference Analysis aims to study variability in Big Data.
To achieve that goal, data scientists need: (a) measures to compare data in various dimensions, such as age for users or topic for network traffic, and (b) efficient algorithms to detect changes in massive data. In this thesis, we identify and study three novel analytical tasks to capture data variability: Difference Exploration, Difference Explanation and Difference Evolution. Difference Exploration is concerned with extracting the opinion of different user segments (e.g., on a movie rating website). We propose appropriate measures for comparing user opinions in the form of rating distributions, and efficient algorithms that, given an opinion of interest in the form of a rating histogram, discover agreeing and disagreeing populations. Difference Explanation tackles the question of providing a succinct explanation of differences between two datasets of interest (e.g., buying habits of two sets of customers). We propose scoring functions designed to rank explanations, and algorithms that guarantee explanation conciseness and informativeness. Finally, Difference Evolution tracks change in an input dataset over time and summarizes change at multiple time granularities. We propose a query-based approach that uses similarity measures to compare consecutive clusters over time. Our indexes and algorithms for Difference Evolution are designed to capture different data arrival rates (e.g., low, high) and different types of change (e.g., sudden, incremental). The utility and scalability of all our algorithms rely on hierarchies inherent in the data (e.g., time, demographic). We run extensive experiments on real and synthetic datasets to validate the usefulness of the three analytical tasks and the scalability of our algorithms. We show that Difference Exploration guides end-users and data scientists in uncovering the opinion of different user segments in a scalable way.
Difference Explanation reveals the need to parsimoniously summarize differences between two datasets and shows that parsimony can be achieved by exploiting hierarchy in data. Finally, our study on Difference Evolution provides strong evidence that a query-based approach is well-suited to tracking change in datasets with varying arrival rates and at multiple time granularities. Similarly, we show that different clustering approaches can be used to capture different types of change.
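One concrete way to compare rating distributions, as Difference Exploration requires, is the 1-D Earth Mover's Distance over ordinal star ratings; the measure and all segment data below are illustrative assumptions, not necessarily the thesis's own comparison measure:

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two normalized rating histograms over
    the same ordered bins (e.g., 1-5 stars): for 1-D ordinal data it equals
    the sum of absolute differences of the cumulative distributions."""
    assert len(p) == len(q)
    cum, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi
        total += abs(cum)
    return total

# Hypothetical user segments rating one movie (fractions of 1..5-star votes):
teens   = [0.05, 0.05, 0.10, 0.30, 0.50]
seniors = [0.40, 0.30, 0.15, 0.10, 0.05]
critics = [0.05, 0.10, 0.15, 0.30, 0.40]

segments = {"seniors": seniors, "critics": critics}
# Segments ordered from most to least agreeing with the teens' opinion:
ranked = sorted(segments, key=lambda name: emd_1d(teens, segments[name]))
```

Here the critics' distribution is close to the teens' (small EMD) while the seniors disagree strongly, which is exactly the agreeing/disagreeing-population question the task poses.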
|
4 |
[en] A METHOD FOR INTERPRETING CONCEPT DRIFTS IN A STREAMING ENVIRONMENT / [pt] UM MÉTODO PARA INTERPRETAÇÃO DE MUDANÇAS DE REGIME EM UM AMBIENTE DE STREAMING / JOAO GUILHERME MATTOS DE O SANTOS / 10 August 2021 (has links)
[en] In a dynamic environment, models tend to perform poorly once the underlying distribution shifts. This phenomenon is known as Concept Drift. In the last decade, considerable research effort has been directed towards developing methods capable of detecting such phenomena early enough for models to adapt. However, far less attention has been given to explaining a drift, even though such information can completely change how the underlying cause is handled and understood. This dissertation presents a novel approach, called Interpretable Drift Detector, that goes beyond identifying drifts in data. It harnesses the structure of decision trees to provide a thorough understanding of a drift, i.e., its principal causes, the affected regions of a tree model, and its severity. Moreover, besides the information it provides, our method also outperforms benchmark drift detection methods in terms of false-positive and true-positive rates across several datasets available in the literature.
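A toy sketch of the decision-tree idea above: compare per-leaf class statistics before and after a suspected drift to name the affected regions and grade severity. The function, leaf paths, and numbers are hypothetical illustrations of the concept, not the dissertation's actual algorithm:

```python
def affected_regions(leaves_before, leaves_after, threshold=0.2):
    """Each leaf of a decision tree is keyed by its decision path, with the
    fraction of positive examples routed to it. Leaves whose class fraction
    shifted by more than `threshold` are reported as the drift's region, and
    the mean shift serves as a crude severity score. This mimics the concept
    of interpreting a drift through tree structure, not the thesis's method."""
    report = {}
    for path, before in leaves_before.items():
        shift = abs(leaves_after.get(path, 0.0) - before)
        if shift > threshold:
            report[path] = shift
    severity = sum(report.values()) / len(report) if report else 0.0
    return report, severity

# Hypothetical leaf statistics before and after a drift point:
before = {"age<=30 & income<=50k": 0.80, "age<=30 & income>50k": 0.60, "age>30": 0.55}
after  = {"age<=30 & income<=50k": 0.30, "age<=30 & income>50k": 0.58, "age>30": 0.52}
regions, severity = affected_regions(before, after)  # only one leaf drifted
```

The decision path of each flagged leaf doubles as a human-readable cause ("young, low-income customers changed"), which is what makes the tree structure attractive for interpretation.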
|
5 |
[en] NEUROEVOLUTIVE LEARNING AND CONCEPT DRIFT DETECTION IN NON-STATIONARY ENVIRONMENTS / [pt] APRENDIZAGEM NEUROEVOLUTIVA E DETECÇÃO DE CONCEPT DRIFT EM AMBIENTES NÃO ESTACIONÁRIOS / TATIANA ESCOVEDO / 04 July 2016 (has links)
[en] Real-world concepts are often not stable: they change with time. Just as the concepts change, the data distribution may change as well. This problem of change in concepts or in the distribution of data is known as concept drift and is a challenge for a model in the task of learning from data. This work presents a new neuroevolutionary model with quantum inspiration, called NEVE (Neuro-EVolutionary Ensemble), based on an ensemble of Multi-Layer Perceptron (MLP) neural networks, for learning in non-stationary environments. It also presents a new concept drift detection mechanism, called DetectA (Detect Abrupt), with the ability to detect changes both proactively and reactively. The evolutionary algorithm with binary-real quantum inspiration, AEIQ-BR, is used in NEVE to automatically generate new classifiers for the ensemble, determining the most appropriate topology for the new network, selecting the most appropriate input variables, and determining all the weights of the MLP neural network. The AEIQ-R algorithm determines the voting weight of each ensemble member; votes can be combined by linear combination, weighted majority, or simple majority. Four different approaches of NEVE are implemented, differing from one another in how occurring drifts are detected and treated. The work also presents results of experiments conducted with the DetectA method and the NEVE model on real and artificial datasets. The results show that the detector proved robust and efficient for datasets with high dimensionality, intermediate-sized blocks, any proportion of drift, and any class balancing, and that, in general, the best results were obtained when some form of detection was used. Comparing the accuracy of NEVE with other consolidated models in the literature, NEVE achieved higher accuracy in most cases. This reinforces that the neuroevolutionary ensemble approach is a robust choice for situations in which datasets are subject to sudden changes in behavior.
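Weighted majority, one of the three voting schemes the ensemble above supports, can be sketched in a few lines; the member votes and weights are illustrative, not values produced by AEIQ-R:

```python
def weighted_majority_vote(predictions, weights):
    """Weighted-majority combination of ensemble members' class votes:
    each member's vote counts with its voting weight, and the class with
    the largest total weight wins."""
    scores = {}
    for cls, w in zip(predictions, weights):
        scores[cls] = scores.get(cls, 0.0) + w
    return max(scores, key=scores.get)

# Three hypothetical MLP members vote on a label; the single well-performing
# member (weight 0.6) outvotes the two weaker ones (0.25 each):
label = weighted_majority_vote(["A", "B", "B"], [0.6, 0.25, 0.25])  # "A"
```

Note that a simple majority over the same votes would pick "B"; the weights are what let a strong member dominate, which is the point of learning them.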
|
6 |
Avaliação criteriosa dos algoritmos de detecção de concept drifts / A Careful Evaluation of Concept Drift Detection Algorithms / SANTOS, Silas Garrido Teixeira de Carvalho / 27 February 2015 (has links)
FACEPE / Knowledge extraction from data streams is an activity with progressively increasing demand. Examples of such applications include monitoring the purchase history of customers, movement data from sensors, or water temperatures. Algorithms used for this purpose must therefore be constantly updated, adapting to new instances while taking computational constraints into account. When working in environments with a continuous flow of data, there is no guarantee that the distribution of the data will remain stationary. On the contrary, several changes may occur over time, triggering situations commonly known as concept drift. In this work we present a comparative study of some of the main drift detection methods: ADWIN, DDM, DOF, ECDD, EDDM, PL and STEPD. For the experiments, artificial datasets (simulating abrupt, fast gradual, and slow gradual changes) as well as datasets from real problems were used. The results were analyzed in terms of accuracy, runtime, memory usage, average drift detection delay, and the number of false positives and negatives. The parameters of the methods were set using an adapted version of a genetic algorithm. According to the Friedman test with the Nemenyi post-hoc test, DDM was the most accurate method on the datasets used, being statistically superior to DOF and ECDD. EDDM was the fastest method and also the most economical in memory usage, being statistically superior to DOF, ECDD, PL and STEPD in both respects. We conclude that change detection methods that are more sensitive, and consequently more prone to false alarms, obtain better results than less sensitive methods that are less susceptible to false alarms.
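Among the compared detectors, DDM's decision rule is simple enough to sketch from the literature (Gama et al., 2004): track the running error rate p and its standard deviation s, remember the minimum of p + s, and warn or signal drift when the current value exceeds that minimum by two or three s. The minimum-sample count below is an illustrative choice:

```python
import math

class DDM:
    """Sketch of the standard DDM rule evaluated in the study above: warn at
    p + s >= p_min + 2*s_min and signal drift at p + s >= p_min + 3*s_min."""
    def __init__(self, min_samples=30):
        self.i = 0
        self.p = 1.0
        self.min_samples = min_samples
        self.p_min, self.s_min = float("inf"), float("inf")

    def add(self, error):
        """error: 1 if the classifier misclassified the instance, else 0.
        Returns 'ok', 'warning', or 'drift'."""
        self.i += 1
        self.p += (error - self.p) / self.i                 # running error rate
        s = math.sqrt(self.p * (1 - self.p) / self.i)       # its std deviation
        if self.i < self.min_samples:
            return "ok"
        if self.p + s < self.p_min + self.s_min:            # new best operating point
            self.p_min, self.s_min = self.p, s
        if self.p + s >= self.p_min + 3 * self.s_min:
            return "drift"
        if self.p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "ok"
```

Under a stable ~10% error rate the rule stays quiet; once the stream turns to constant errors, p + s climbs past the three-sigma line and drift is signalled.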
|
7 |
RELIABLE SENSING WITH UNRELIABLE SENSORS: FROM PHYSICAL MODELING TO DATA ANALYSIS TO APPLICATIONS / Ajanta Saha (19827849) / 10 October 2024 (has links)
In today's age of information, we are constantly informed about our surroundings by networks of distributed sensors that help us decide the next action. One major class of distributed sensors is wearable, implantable, and environmental (WIE) electrochemical sensors, widely used for analyte concentration measurement in personalized healthcare, environmental monitoring, smart agriculture, and the food and chemical industries. Although WIE sensors offer an opportunity for prompt and prudent decisions, reliable sensing with such sensors is a major challenge. The first difficulty is the uncontrolled outside environment: rapidly varying temperature, humidity, and target concentration increase noise and decrease the reliability of the sensor data. Second, because the sensors are closely coupled to the physical world, they are subject to biofouling, radiation exposure, and water ingress, which cause physical degradation. Moreover, frequent calibration to correct the drift due to degradation is not possible once a sensor is deployed in the field. A further challenge is the energy supply needed to support autonomous WIE sensors: a wireless sensor must be powered by a battery or an energy harvester, yet batteries have a limited lifetime and energy harvesters cannot supply power on demand, limiting overall operation.

The objective of this thesis is to achieve reliable sensing with WIE sensors by overcoming the challenges of an uncontrolled environment, drift or degradation, and calibration, subject to limited power supplies. First, we developed the concept of "Nernst thermometry" for potentiometric ion-selective electrodes (ISEs), with which we self-correct concentration fluctuations due to uncontrolled temperature. Next, using "Nernst thermometry," we developed a physics-guided data analysis method for drift detection and self-calibration of WIE ISEs. For a WIE sensor, wireless data transmission is an energy-intensive operation. To reduce unreliable data transmission, we developed a statistical approach that continuously monitors the credibility of the sensor and transmits only credible sensor data. To understand and monitor the cause of ISE degradation, we propose a novel on-the-fly equivalent-circuit extraction method that requires neither an external power supply nor complex measurements. To ensure an on-demand power supply, we present the concept of "signal as a source of energy." Through circuit simulation and long-term experimental analysis, we show that an ISE can indefinitely sense and harvest energy from the analyte; we theoretically calculate the maximum achievable power of such systems and present ways to approach it in practice. Overall, the thesis presents a holistic approach to developing a self-sustainable WIE sensor with environmental variation correction, self-calibration, reliable data transmission, and lifelong self-powering capabilities, bringing smart agriculture and environmental sensing one step closer to reality.
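The starting point of "Nernst thermometry" is the Nernst equation itself, E = E0 + (RT/zF) ln(a). A minimal sketch of interpreting an ISE potential with a temperature-dependent slope rather than the fixed 25 °C slope follows; all electrode numbers are hypothetical, and the thesis's actual self-correction scheme is more involved:

```python
import math

R, F = 8.314, 96485.0  # gas constant (J/mol.K), Faraday constant (C/mol)

def ion_activity(E_mV, E0_mV, T_kelvin, z=1):
    """Invert the Nernst equation E = E0 + (RT/zF) ln(a) to recover ion
    activity from an ISE potential, using the *measured* temperature for
    the slope instead of assuming the standard 25 C value."""
    slope_V = R * T_kelvin / (z * F)                 # thermal voltage RT/zF
    return math.exp((E_mV - E0_mV) / 1000.0 / slope_V)

# The same electrode reading interpreted at 25 C vs. 35 C (hypothetical
# potentials, with E0 taken as 0 mV for simplicity):
a_25 = ion_activity(E_mV=59.2, E0_mV=0.0, T_kelvin=298.15)   # about 10x
a_35 = ion_activity(E_mV=59.2, E0_mV=0.0, T_kelvin=308.15)   # noticeably less
```

The ~59 mV-per-decade slope at 25 °C grows with temperature, so assuming the wrong temperature misreads the concentration; that sensitivity is exactly what makes temperature self-correction necessary for field-deployed ISEs.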
|