71 |
Klasifikace v proudu dat pomocí souboru klasifikátorů / Classification in Data Streams Using Ensemble Methods
Jarosch, Martin January 2013 (has links)
This master's thesis deals with knowledge discovery and focuses on data stream classification. Three ensemble classification methods are described. These methods are implemented in the practical part of the thesis and included in a classification system. Extensive measurements and experiments were used to analyse and compare the methods. The implemented methods were then integrated into a malware analysis system. The obtained results are presented in the conclusion.
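The abstract does not name the three ensemble methods, so the following is only a generic illustration of chunk-based ensemble stream classification; the class name, chunk size, and accuracy-based weighting are assumptions, not the thesis's design.

```python
# Illustrative sketch: a generic chunk-based weighted ensemble for stream
# classification (not the specific methods implemented in the thesis).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    def __init__(self, max_members=10):
        self.max_members = max_members
        self.members = []   # list of (classifier, weight) pairs

    def predict(self, X):
        if not self.members:
            return np.zeros(len(X), dtype=int)
        # Weighted vote over the predictions of all ensemble members.
        votes = {}
        for clf, w in self.members:
            for i, y in enumerate(clf.predict(X)):
                votes.setdefault(i, {}).setdefault(y, 0.0)
                votes[i][y] += w
        return np.array([max(v, key=v.get) for _, v in sorted(votes.items())])

    def partial_fit_chunk(self, X, y):
        # Re-weight existing members by their accuracy on the newest chunk,
        # then train a new member on that chunk.
        for i, (clf, _) in enumerate(self.members):
            self.members[i] = (clf, (clf.predict(X) == y).mean())
        clf = DecisionTreeClassifier(max_depth=5).fit(X, y)
        self.members.append((clf, 1.0))
        # Keep only the best-weighted members (bounded memory).
        self.members = sorted(self.members, key=lambda m: -m[1])[: self.max_members]
```

Weighting members on the newest chunk lets the ensemble track concept drift: members trained on outdated concepts lose weight and are eventually evicted.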
|
72 |
Regularised feed forward neural networks for streamed data classification problems
Ellis, Mathys January 2020 (has links)
Streamed data classification problems (SDCPs) require classifiers with the ability to learn and to adjust to the underlying relationships in data streams, in real time. This requirement poses a challenge to classifiers, because the learning task is no longer just to find the optimal decision boundaries, but also to track changes in the decision boundaries as new training data are received. The challenge is due to concept drift, i.e. the changing of decision boundaries over time; changes include disappearing, appearing, or shifting decision boundaries. This thesis proposes an online learning approach for feed forward neural networks (FFNNs) that meets the requirements of SDCPs. The approach uses regularisation to optimise the architecture via the weights, and quantum particle swarm optimisation (QPSO) to dynamically adjust the weights. The learning approach is applied to a FFNN, which uses rectified linear activation functions, to form a novel SDCP classifier. The classifier is empirically investigated on several SDCPs. Both weight decay (WD) and weight elimination (WE) are investigated as regularisers. Empirical results show that using QPSO with no regularisation causes the classifier to saturate completely. However, using QPSO with regularisation enables the classifier to dynamically adapt both its implicit architecture and its weights as decision boundaries change. Furthermore, the results favour WE over WD as a regulariser for QPSO. / Dissertation (MSc)--University of Pretoria, 2020. / National Research Foundation (NRF) / Computer Science / MSc / Unrestricted
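As a minimal sketch of the two regularisers compared here (the lambda values and w0 scale are assumed, and the fitness form is illustrative rather than the thesis's exact objective), the penalties can be written as follows and added to the classification error that the swarm minimises:

```python
import numpy as np

def weight_decay(w, lam=1e-4):
    # WD penalty: lambda * sum(w^2); shrinks all weights uniformly.
    return lam * np.sum(w ** 2)

def weight_elimination(w, lam=1e-4, w0=1.0):
    # WE penalty: lambda * sum((w/w0)^2 / (1 + (w/w0)^2)); near-zero weights
    # are pushed to zero while large weights are penalised less, which
    # effectively prunes connections (the "implicit architecture").
    r = (w / w0) ** 2
    return lam * np.sum(r / (1.0 + r))

def fitness(mse, w, regulariser=weight_elimination):
    # Illustrative PSO fitness: data error plus complexity penalty.
    return mse + regulariser(w)
```

The saturating WE penalty is what allows connections to be eliminated outright, which matches the abstract's finding that WE supports adapting the implicit architecture better than WD.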
|
73 |
Interopérabilité des systèmes distribués produisant des flux de données sémantiques au profit de l'aide à la prise de décision / Interoperability of distributed systems producing semantic data streams for decision-making
Belghaouti, Fethi 26 January 2017 (has links)
Internet is an infinite source of data coming from sources such as social networks or sensors (home automation, smart city, autonomous vehicle, etc.). These heterogeneous and increasingly large data can be managed through semantic web technologies, which propose to homogenise and link these data and to reason over them, and through data stream management systems, which mainly address the problems related to volume, volatility and continuous querying. The alliance of these two disciplines has seen the growth of semantic data stream management systems, also called RDF Stream Processing (RSP) systems. The objective of this thesis is to allow these systems, via new approaches and low-cost algorithms, to remain operational, and even more efficient, for large input data volumes and/or with limited system resources. To reach this goal, the thesis is mainly focused on the issue of processing semantic data streams in a context of computer systems with limited resources. It directly contributes to answering the following research questions: (i) How to represent a semantic data stream? And (ii) How to deal with input semantic data when their rates and/or volumes exceed the capabilities of the target system? As a first contribution, we propose an analysis of the data in semantic data streams in order to consider a succession of star graphs instead of a succession of independent triples, thus preserving the links between the triples. Using this approach, we significantly improved the quality of the responses of some well-known sampling algorithms for load shedding. Analysing the continuous query allows this solution to be optimised by selecting the irrelevant data to be load-shed first. In the second contribution, we propose an algorithm for detecting frequent RDF graph patterns in semantic data streams, called FreGraPaD (Frequent RDF Graph Patterns Detection). It is a one-pass, memory-oriented, low-cost algorithm. It uses two main data structures: a bit vector to build and identify each RDF graph pattern, providing memory space optimisation, and a hash table to store the patterns. The third contribution consists of a deterministic load-shedding solution for RSP systems, called POL (Pattern Oriented Load-shedding for RDF Stream Processing systems). It applies very low-cost Boolean operators to the binary patterns built from the data and from the continuous query, in order to determine which data are irrelevant and eject them upstream of the system. It guarantees a recall of 100%, reduces the system load and improves response time. Finally, in the fourth contribution, we propose Patorc (Pattern Oriented Compression for RSP systems), an online compression tool for RDF streams. It is based on the frequent patterns present in RDF data streams, which it factorises. It is a lossless compression solution whose output can be queried without decompression. The solutions provided by this thesis extend existing RSP systems and make them able to scale in a Big Data context. They allow RSP systems to handle one or more streams arriving at different speeds, without losing response quality and while ensuring their availability even beyond their physical limits. The experiments conducted show that extending existing systems with these solutions improves their performance: they illustrate the considerable decrease in engine response time and the increase in the input processing rate threshold, while optimising the use of system resources.
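As an illustration of the pattern-oriented idea behind POL (a hypothetical sketch; the actual bit layout and operators in the thesis may differ), each incoming star graph and the continuous query can be encoded as predicate bitmasks, so that a single Boolean test decides whether a graph is relevant:

```python
# Hypothetical sketch of pattern-oriented load shedding: encode the set of
# predicates of each incoming RDF star graph as a bitmask and keep the graph
# only if it covers all the predicates the continuous query needs.
PREDICATES = {"temp": 1 << 0, "humidity": 1 << 1, "location": 1 << 2}

def pattern_of(triples):
    # triples: iterable of (subject, predicate, object)
    mask = 0
    for _, p, _ in triples:
        mask |= PREDICATES.get(p, 0)
    return mask

def relevant(graph_mask, query_mask):
    # One AND plus one comparison per graph: keep it only if every predicate
    # required by the query appears in the graph, which guarantees 100%
    # recall (relevant graphs are never shed).
    return graph_mask & query_mask == query_mask

query_mask = PREDICATES["temp"] | PREDICATES["location"]
g = [("s1", "temp", "21.5"), ("s1", "location", "room1")]
print(relevant(pattern_of(g), query_mask))  # True -> forward to the engine
```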
|
74 |
Développement de méthodes d'analyse de données en ligne / Development of methods to analyze data streams
Bar, Romain 29 November 2013 (has links)
High-dimensional data are supposed to be independent online observations of a random vector. In the second chapter, the latter is denoted by Z and partitioned into two random vectors R and S, and the observations are supposed to be identically distributed. A recursive method for the sequential estimation of the first r factors of the projected PCA of R with respect to S is defined. Next, some particular cases are investigated: canonical correlation analysis, canonical discriminant analysis and canonical correspondence analysis; in each case, several methods specific to the analysis at hand are proposed. In the third chapter, the data are observations of a random vector Zn whose expectation En varies with time. Letting Rn = Zn - En, the vectors Rn are supposed to form an independent and identically distributed sample of a random vector R. Stochastic approximation processes are used to estimate, online, the direction vectors of the principal axes of a partial principal components analysis (PCA) of R. This is then applied to the particular case of partial generalized canonical correlation analysis (gCCA), after defining a stochastic approximation process of the Robbins-Monro type to recursively estimate the inverse of a covariance matrix. In the fourth chapter, the case when both the expectation and the covariance matrix of Zn vary with time n is considered. Finally, simulation results are given in chapter 5.
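For illustration, here is the classical Oja-type stochastic approximation of a first principal direction, which conveys the general idea of estimating PCA axes online; the thesis's processes for partial PCA are more elaborate than this sketch.

```python
import numpy as np

def streaming_first_pc(stream, dim, gamma0=1.0):
    # Oja-type stochastic approximation of the first principal direction:
    # w <- w + gamma_n * (x x^T w - (w^T x x^T w) w), then normalise.
    rng = np.random.default_rng(0)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    for n, x in enumerate(stream, start=1):
        gamma = gamma0 / n                  # Robbins-Monro step sizes
        proj = x @ w
        w += gamma * proj * (x - proj * w)  # equivalent rank-1 update
        w /= np.linalg.norm(w)
    return w

# Usage: data with a dominant direction along the first axis.
rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 3)) * np.array([3.0, 1.0, 0.5])
print(streaming_first_pc(iter(X), dim=3))  # close to (+-1, 0, 0)
```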
|
75 |
Classificação de data streams utilizando árvore de decisão estatística e a teoria dos fractais na análise evolutiva dos dados / Classification of data streams using a statistical decision tree and fractal theory in the evolutionary analysis of data
Cazzolato, Mirela Teixeira 24 March 2014
A data stream is generated quickly, continuously, in order, and in large quantities. Processing data streams must take into account, among other factors, limited memory use, the need for real-time processing, the accuracy of the results, and concept drift (which occurs when the concept of the data being analysed changes). The decision tree is a popular form of classifier representation: intuitive, fast to build, and generally highly accurate. The incremental decision tree techniques in the literature usually have high computational costs to construct and update the model, especially regarding the calculation used to split decision nodes. Existing methods are conservative when dealing with limited amounts of data, tending to improve their results as the number of examples increases. Another problem is that many real-world applications generate noisy data, and existing techniques have low tolerance to such noise. This work aims to develop decision tree methods for data streams that address these deficiencies in the current state of the art. A further objective is to develop a technique to detect concept drift using fractal theory; this functionality should indicate when the model needs to be corrected, allowing the most recent events in the data to be described adequately. To achieve these objectives, three decision tree algorithms were developed: StARMiner Tree, Automatic StARMiner Tree, and Information Gain StARMiner Tree. These algorithms use a fast statistical method as the node-splitting heuristic, one that does not depend on the number of examples read. In the experiments, the algorithms achieved high accuracy and also behaved tolerantly when classifying noisy data. Finally, a drift detection method based on fractal theory, the Fractal Drift Detection Method, was proposed to detect changes in the data distribution. It detects significant changes in the distribution, causing the model to be updated whenever it no longer describes the current data (i.e., becomes obsolete). The method achieved good results in the classification of data containing concept drift, proving to be suitable for the evolutionary analysis of data. / Financiadora de Estudos e Projetos
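As a rough illustration of the fractal idea (the radii, window comparison, and threshold below are invented for illustration and are not the thesis's exact method), one can estimate the correlation fractal dimension of a sliding window by box counting and flag drift when the dimension shifts:

```python
import numpy as np

def correlation_dimension(points, radii=(0.05, 0.1, 0.2, 0.4)):
    # Box-counting estimate of the correlation dimension D2: the slope of
    # log(sum_i p_i^2) versus log(r), where p_i is the fraction of points
    # falling in grid cell i at grid size r. `points` is an (n, d) array.
    logs_r, logs_s = [], []
    for r in radii:
        cells = np.floor(points / r).astype(int)
        _, counts = np.unique(cells, axis=0, return_counts=True)
        p = counts / len(points)
        logs_r.append(np.log(r))
        logs_s.append(np.log(np.sum(p ** 2)))
    slope, _ = np.polyfit(logs_r, logs_s, 1)
    return slope

def drift_detected(window_old, window_new, tol=0.3):
    # Hypothetical rule: signal drift when the fractal dimension of the
    # newest window deviates from the reference window by more than tol.
    return abs(correlation_dimension(window_new)
               - correlation_dimension(window_old)) > tol

# Usage: a filled plane (D2 near 2) versus points on a line (D2 near 1).
rng = np.random.default_rng(3)
plane = rng.random((2000, 2))
t = rng.random((2000, 1)); line = np.hstack([t, t])
print(drift_detected(plane, line))  # True
```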
|
76 |
Obtenção de padrões sequenciais em data streams atendendo requisitos do Big Data / Mining sequential patterns in data streams while meeting Big Data requirements
Carvalho, Danilo Codeco 06 June 2016
The growing amount of data produced daily, both by businesses and by individuals on the web, has increased the demand for analysing and extracting knowledge from these data. While over the last two decades the solution was to store the data and run data mining algorithms over them, this has now become unviable even for supercomputers. Moreover, the requirements of the Big Data age go far beyond the sheer amount of data to analyse: response time requirements and the complexity of the data carry more weight in many real-world domains. New models have been researched and developed, often proposing distributed computing or different ways of handling data stream mining. Current research shows that one alternative for data stream mining is to combine a real-time event-handling mechanism with a classic association rule or sequential pattern mining algorithm. This work presents a data stream mining approach that meets the Big Data response time requirement by linking the Esper real-time event-handling mechanism with the Incremental Miner of Stretchy Time Sequences (IncMSTS) algorithm. The results show that it is possible to bring a static data mining algorithm into a data stream environment and preserve the trends in the patterns found, even though it is not possible to continuously read all the data arriving in the stream. / Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
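A minimal sketch of the coupling pattern described here (the window length, batch hand-off, and miner stub are assumptions; Esper itself is a Java component with its own API, stubbed below in Python):

```python
import time
from collections import deque

class SlidingTimeWindow:
    # Esper-style time window: events older than `horizon_s` seconds are
    # evicted; every `batch_every` arrivals, the current window contents
    # are handed to a batch sequential-pattern miner.
    def __init__(self, horizon_s, batch_every, miner):
        self.horizon_s = horizon_s
        self.batch_every = batch_every
        self.miner = miner
        self.events = deque()   # (timestamp, event) pairs
        self.seen = 0

    def on_event(self, event, now=None):
        now = time.time() if now is None else now
        self.events.append((now, event))
        while self.events and now - self.events[0][0] > self.horizon_s:
            self.events.popleft()
        self.seen += 1
        if self.seen % self.batch_every == 0:
            self.miner([e for _, e in self.events])

# Stub standing in for a sequential-pattern miner such as IncMSTS.
window = SlidingTimeWindow(horizon_s=60, batch_every=100,
                           miner=lambda batch: print(len(batch), "events mined"))
for i in range(250):
    window.on_event({"id": i})
```

The design point is that the window bounds what the miner ever sees, which is why trends can be preserved even though the full stream is never read in its entirety.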
|
77 |
Avaliação criteriosa dos algoritmos de detecção de concept drifts / A thorough evaluation of concept drift detection algorithms
SANTOS, Silas Garrido Teixeira de Carvalho 27 February 2015
Knowledge extraction in environments with continuous data flows is an activity in progressively increasing demand. Examples of such applications include monitoring customers' purchase histories, presence detection through sensors, and water temperature monitoring. The algorithms used for this purpose must therefore be updated constantly, adapting to new instances while taking computational constraints into account. When working with continuous data flows, it is generally not advisable to assume that the distribution will remain stationary: several changes may occur over time, triggering a situation commonly known as concept drift. This work presents a comparative study of some of the main drift detection methods: ADWIN, DDM, DOF, ECDD, EDDM, PL and STEPD. For the experiments, artificial datasets (simulating abrupt, fast gradual, and slow gradual changes) as well as real-world datasets were used. The results were analysed in terms of accuracy, runtime, memory usage, average drift detection time, and the number of false positives and false negatives. The parameters of the methods were tuned using an adapted version of a genetic algorithm. According to the Friedman test with the Nemenyi post-hoc test, in terms of accuracy DDM proved to be the most efficient method on the datasets used, being statistically superior to DOF and ECDD. EDDM was the fastest method and also the most economical in memory usage, being statistically superior to DOF, ECDD, PL and STEPD in both respects. It is concluded that methods that are more sensitive to change detection, and consequently more prone to false alarms, obtain better results than less sensitive methods that are less susceptible to false alarms. / FACEPE
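For illustration, here is a compact sketch of DDM, one of the detectors compared above; the warning and drift thresholds at two and three standard deviations are the method's usual defaults, and the 30-instance burn-in is a common implementation choice.

```python
import math

class DDM:
    # Drift Detection Method: tracks the online error rate p_i and its
    # standard deviation s_i = sqrt(p_i (1 - p_i) / i). A warning is raised
    # when p_i + s_i >= p_min + 2 s_min, and drift when it reaches 3 s_min.
    def __init__(self):
        self.reset()

    def reset(self):
        self.i = 0
        self.p = 1.0
        self.s = 0.0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        # error: 1 if the classifier misclassified this instance, else 0.
        self.i += 1
        self.p += (error - self.p) / self.i
        self.s = math.sqrt(self.p * (1 - self.p) / self.i)
        if self.i > 30 and self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        level = self.p + self.s
        if level >= self.p_min + 3 * self.s_min:
            self.reset()            # drift: retrain the model from scratch
            return "drift"
        if level >= self.p_min + 2 * self.s_min:
            return "warning"        # start buffering recent instances
        return "stable"
```

The sensitivity trade-off discussed in the conclusion lives in those multipliers: lowering them makes the detector more reactive at the cost of more false alarms.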
|
78 |
Získávání frekventovaných vzorů z proudu dat / Frequent Pattern Discovery in a Data Stream
Dvořák, Michal January 2012 (has links)
Frequent-pattern mining from databases has been widely studied and applied. Unfortunately, the classic algorithms are not suitable for data stream processing. In frequent-pattern mining from data streams, it is important to manage sets of items and also their history. There are several reasons for this: it is not just the history of frequent itemsets that matters, but also the history of potentially frequent sets that can become frequent later. This requires more memory and computational power. This thesis describes two such algorithms: Lossy Counting and FP-stream. An effective implementation of these algorithms in C# is an integral part of this thesis. In addition, the two algorithms have been compared.
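As a sketch of the first of the two algorithms (shown in Python here rather than the thesis's C#, and for single items rather than itemsets), Lossy Counting maintains approximate counts in buckets of width ceil(1/epsilon) and prunes at bucket boundaries:

```python
import math

def lossy_counting(stream, epsilon=0.01, support=0.05):
    # Manku & Motwani's Lossy Counting: counts are underestimated by at most
    # epsilon * N. Each table entry maps item -> (count, delta), where delta
    # bounds the count the item may have had before it was inserted.
    w = math.ceil(1 / epsilon)        # bucket width
    table, n = {}, 0
    for item in stream:
        n += 1
        b_current = math.ceil(n / w)
        if item in table:
            f, delta = table[item]
            table[item] = (f + 1, delta)
        else:
            table[item] = (1, b_current - 1)
        if n % w == 0:                # bucket boundary: prune weak entries
            table = {k: (f, d) for k, (f, d) in table.items()
                     if f + d > b_current}
    # Output items whose true frequency may reach the support threshold.
    return {k: f for k, (f, d) in table.items()
            if f >= (support - epsilon) * n}

print(lossy_counting(iter("abracadabra" * 100), epsilon=0.01, support=0.2))
```

The pruning step is what keeps memory bounded while still tracking "potentially frequent" items, the history problem the abstract highlights.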
|
79 |
Intelligent flood adaptive context-aware system / Système sensible et adaptatif au contexte pour la gestion intelligente de crues
Sun, Jie 23 October 2017 (has links)
In the future, agriculture and the environment will rely on more and more heterogeneous data collected by wireless sensor networks (WSNs). These data are generally fed into decision support systems (DSSs). In this dissertation, we focus on adaptive context-aware systems, based on a WSN and a DSS, dedicated to the monitoring of natural phenomena, and we propose a formalisation for the design and deployment of such systems. The considered context comprises data from the studied phenomenon but also from the wireless sensors themselves (e.g., their energy levels). Through the use of ontologies and reasoning techniques, we aim to maintain the required quality of service (QoS) level of the collected data (according to the studied phenomenon) while preserving the resources of the WSN. To illustrate the proposal, a complex use case, the study of floods in a watershed, is described. This thesis also produced a simulator for such context-aware systems, which integrates a multi-agent system (JADE) with a rule engine (Jess). Keywords: ontologies, rule-based inference, formalisation, heterogeneous data, sensor data stream integration, WSN, limited resources, DSS, adaptive context-aware systems, QoS, agriculture, environment.
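A toy sketch of the kind of context rule such a system might apply (the thesis uses Jess, a Java rule engine; the rule bodies, thresholds, and sensor fields below are invented for illustration):

```python
# Hypothetical context rules trading data QoS against sensor resources:
# raise the sampling rate when the monitored phenomenon becomes critical,
# lower it when a node's battery runs low. All thresholds are invented.
RULES = [
    (lambda c: c["water_level_m"] > 2.5, {"sampling_s": 10}),   # flood risk
    (lambda c: c["battery_pct"] < 20,    {"sampling_s": 300}),  # save energy
    (lambda c: True,                     {"sampling_s": 60}),   # default
]

def decide(context):
    # First matching rule wins, mimicking priority-ordered inference.
    for condition, action in RULES:
        if condition(context):
            return action

print(decide({"water_level_m": 3.1, "battery_pct": 80}))  # {'sampling_s': 10}
```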
|
80 |
Détection d'anomalies à la volée dans des signaux vibratoires / Anomaly detection in high-dimensional data streams
Bellas, Anastasios 28 January 2014 (has links)
The subject of this thesis is anomaly detection in high-dimensional data streams, with a specific application to aircraft engine health monitoring. We consider the problem of anomaly detection as an unsupervised learning problem. Modern data, especially those issued from industrial systems, are often streams of high-dimensional samples, since multiple measurements are taken at high frequency over a possibly infinite time horizon. Moreover, the data can contain anomalies (malfunctions, failures) of the system being monitored, and most existing unsupervised learning methods cannot handle data with these features. We first introduce an offline subspace clustering algorithm for high-dimensional data based on the expectation-maximisation (EM) algorithm, made robust to anomalies through the use of the trimming technique. We then address the problem of online clustering of high-dimensional data streams by developing an online inference algorithm for the popular mixture of probabilistic principal component analysers (MPPCA) model. We show the efficiency of both methods on synthetic and real datasets, including aircraft engine data with anomalies. Finally, we develop a comprehensive application for the aircraft engine health monitoring domain, which aims at detecting anomalies in aircraft engine data in a dynamic manner and introduces novel anomaly detection visualisation techniques based on self-organising maps. Detection results are presented and anomaly identification is also discussed.
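As an illustration of the trimming idea used to robustify EM (a deliberately simplified single-Gaussian version; the thesis trims within a full subspace-clustering EM), each iteration discards the fraction alpha of points with the lowest likelihood before re-estimating the parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def trimmed_gaussian_fit(X, alpha=0.1, iters=20):
    # Simplified trimmed estimation: alternately (1) score all points under
    # the current Gaussian, (2) keep the (1 - alpha) fraction with highest
    # likelihood, (3) re-estimate mean and covariance on the kept points.
    # Anomalies land in the trimmed fraction and stop biasing the fit.
    mu, cov = X.mean(axis=0), np.cov(X.T)
    keep_n = int((1 - alpha) * len(X))
    for _ in range(iters):
        ll = multivariate_normal(mu, cov, allow_singular=True).logpdf(X)
        kept = X[np.argsort(ll)[-keep_n:]]
        mu, cov = kept.mean(axis=0), np.cov(kept.T)
    return mu, cov

# Usage: 5% gross outliers barely affect the trimmed estimate.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (950, 2)), rng.normal(8, 0.5, (50, 2))])
mu, cov = trimmed_gaussian_fit(X, alpha=0.1)
print(mu)  # close to (0, 0)
```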
|