• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 44
  • 22
  • 3
  • 2
  • 2
  • 2
  • 1
  • Tagged with
  • 80
  • 80
  • 30
  • 25
  • 17
  • 16
  • 15
  • 15
  • 14
  • 13
  • 13
  • 12
  • 11
  • 11
  • 10
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Análise espaço-temporal de data streams multidimensionais / Spatio-temporal analysis in multidimensional data streams

Santiago Augusto Nunes 06 April 2015 (has links)
Fluxos de dados são usualmente caracterizados por grandes quantidades de dados gerados continuamente em processos síncronos ou assíncronos potencialmente infinitos, em aplicações como: sistemas meteorológicos, processos industriais, tráfego de veículos, transações financeiras, redes de sensores, entre outras. Além disso, o comportamento dos dados tende a sofrer alterações significativas ao longo do tempo, definindo data streams evolutivos. Estas alterações podem significar eventos temporários (como anomalias ou eventos extremos) ou mudanças relevantes no processo de geração da stream (que resultam em alterações na distribuição dos dados). Além disso, esses conjuntos de dados podem possuir características espaciais, como a localização geográfica de sensores, que podem ser úteis no processo de análise. A detecção dessas variações de comportamento que considere os aspectos da evolução temporal, assim como as características espaciais dos dados, é relevante em alguns tipos de aplicação, como o monitoramento de eventos climáticos extremos em pesquisas na área de Agrometeorologia. Nesse contexto, esse projeto de mestrado propõe uma técnica para auxiliar a análise espaço-temporal em data streams multidimensionais que contenham informações espaciais e não espaciais. A abordagem adotada é baseada em conceitos da Teoria de Fractais, utilizados para análise de comportamento temporal, assim como técnicas para manipulação de data streams e estruturas de dados hierárquicas, visando permitir uma análise que leve em consideração os aspectos espaciais e não espaciais simultaneamente. A técnica desenvolvida foi aplicada a dados agrometeorológicos, visando identificar comportamentos distintos considerando diferentes sub-regiões definidas pelas características espaciais dos dados. Portanto, os resultados deste trabalho incluem contribuições para a área de mineração de dados e de apoio a pesquisas em Agrometeorologia. / Data streams are usually characterized by large amounts of data generated continuously in synchronous or asynchronous potentially infinite processes, in applications such as: meteorological systems, industrial processes, vehicle traffic, financial transactions, sensor networks, among others. In addition, the behavior of the data tends to change significantly over time, defining evolutionary data streams. These changes may mean temporary events (such as anomalies or extreme events) or relevant changes in the process of generating the stream (that result in changes in the distribution of the data). Furthermore, these data sets can have spatial characteristics such as geographic location of sensors, which can be useful in the analysis process. The detection of these behavioral changes considering aspects of evolution, as well as the spatial characteristics of the data, is relevant for some types of applications, such as monitoring of extreme weather events in Agrometeorology researches. In this context, this project proposes a technique to help spatio-temporal analysis in multidimensional data streams containing spatial and non-spatial information. The adopted approach is based on concepts of the Fractal Theory, used for temporal behavior analysis, as well as techniques for data streams handling also hierarchical data structures, allowing analysis tasks that take into account the spatial and non-spatial aspects simultaneously. The developed technique has been applied to agro-meteorological data to identify different behaviors considering different sub-regions defined by the spatial characteristics of the data. Therefore, results from this work include contribution to data mining area and support research in Agrometeorology.
22

An adaptive ensemble classifier for mining concept drifting data streams

Farid, D.M., Zhang, L., Hossain, A., Rahman, C.M., Strachan, R., Sexton, G., Dahal, Keshav P. January 2013 (has links)
No / It is challenging to use traditional data mining techniques to deal with real-time data stream classifications. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be closer to each other and should be far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of this proposed stream classification model against that of existing mining algorithms using real benchmark datasets from UCI (University of California, Irvine) machine learning repository. The experimental results prove that our approach shows great flexibility and robustness in novel class detection in concept drifting and outperforms traditional classification models in challenging real-life data stream applications. (C) 2013 Elsevier Ltd. All rights reserved.
23

Efficient And Scalable Evaluation Of Continuous, Spatio-temporal Queries In Mobile Computing Environments

Cazalas, Jonathan M 01 January 2012 (has links)
A variety of research exists for the processing of continuous queries in large, mobile environments. Each method tries, in its own way, to address the computational bottleneck of constantly processing so many queries. For this research, we present a two-pronged approach at addressing this problem. Firstly, we introduce an efficient and scalable system for monitoring traditional, continuous queries by leveraging the parallel processing capability of the Graphics Processing Unit. We examine a naive CPU-based solution for continuous range-monitoring queries, and we then extend this system using the GPU. Additionally, with mobile communication devices becoming commodity, location-based services will become ubiquitous. To cope with the very high intensity of location-based queries, we propose a view oriented approach of the location database, thereby reducing computation costs by exploiting computation sharing amongst queries requiring the same view. Our studies show that by exploiting the parallel processing power of the GPU, we are able to significantly scale the number of mobile objects, while maintaining an acceptable level of performance. Our second approach was to view this research problem as one belonging to the domain of data streams. Several works have convincingly argued that the two research fields of spatiotemporal data streams and the management of moving objects can naturally come together. [IlMI10, ChFr03, MoXA04] For example, the output of a GPS receiver, monitoring the position of a mobile object, is viewed as a data stream of location updates. This data stream of location updates, along with those from the plausibly many other mobile objects, is received at a centralized server, which processes the streams upon arrival, effectively updating the answers to the currently active queries in real time. iv For this second approach, we present GEDS, a scalable, Graphics Processing Unit (GPU)-based framework for the evaluation of continuous spatio-temporal queries over spatiotemporal data streams. Specifically, GEDS employs the computation sharing and parallel processing paradigms to deliver scalability in the evaluation of continuous, spatio-temporal range queries and continuous, spatio-temporal kNN queries. The GEDS framework utilizes the parallel processing capability of the GPU, a stream processor by trade, to handle the computation required in this application. Experimental evaluation shows promising performance and shows the scalability and efficacy of GEDS in spatio-temporal data streaming environments. Additional performance studies demonstrate that, even in light of the costs associated with memory transfers, the parallel processing power provided by GEDS clearly counters and outweighs any associated costs. Finally, in an effort to move beyond the analysis of specific algorithms over the GEDS framework, we take a broader approach in our analysis of GPU computing. What algorithms are appropriate for the GPU? What types of applications can benefit from the parallel and stream processing power of the GPU? And can we identify a class of algorithms that are best suited for GPU computing? To answer these questions, we develop an abstract performance model, detailing the relationship between the CPU and the GPU. From this model, we are able to extrapolate a list of attributes common to successful GPU-based applications, thereby providing insight into which algorithms and applications are best suited for the GPU and also providing an estimated theoretical speedup for said GPU-based applications
24

Classificação de fluxo de dados não estacionários com aplicação em sensores identificadores de insetos / Classification of non-stationary data stream with application in sensors for insect identification.

Souza, Vinicius Mourão Alves de 23 May 2016 (has links)
Diversas aplicações são responsáveis por gerar dados ao longo do tempo de maneira contínua, ordenada e ininterrupta em um ambiente dinâmico, denominados fluxo de dados. Entre possíveis tarefas que podem ser realizadas com estes dados, classificação é uma das mais proeminentes. Devido à natureza não estacionária do ambiente responsável por gerar os dados, as características que descrevem os conceitos das classes do problema de classificação podem se alterar ao longo do tempo. Por isso, classificadores de fluxo de dados requerem constantes atualizações em seus modelos para que a taxa de acerto se mantenha estável ao longo do tempo. Na etapa de atualização a maior parte das abordagens considera que, após a predição de cada exemplo, o seu rótulo correto é imediatamente disponibilizado sem qualquer atraso de tempo (latência nula). Devido aos altos custos do processo de rotulação, os rótulos corretos nem sempre podem ser obtidos para a maior parte dos dados ou são obtidos após um considerável atraso de tempo. No caso mais desafiador, encontram-se as aplicações em que após a etapa de classificação dos exemplos, os seus respectivos rótulos corretos nunca sã disponibilizados para o algoritmo, caso chamado de latência extrema. Neste cenário, não é possível o uso de abordagens tradicionais, sendo necessário o desenvolvimento de novos métodos que sejam capazes de manter um modelo de classificação atualizado mesmo na ausência de dados rotulados. Nesta tese, além de discutir o problema de latência na tarefa de classificação de fluxo de dados não estacionários, negligenciado por boa parte da literatura, também sã propostos dois algoritmos denominados SCARGC e MClassification para o cenário de latência extrema. Ambas as propostas se baseiam no uso de técnicas de agrupamento para a adaptação à mudanças de maneira não supervisionada. Os algoritmos propostos são intuitivos, simples e apresentam resultados superiores ou equivalentes a outros algoritmos da literatura em avaliações com dados sintéticos e reais, tanto em termos de acurácia de classificação como em tempo computacional. Aléem de buscar o avanço no estado-da-arte na área de aprendizado em fluxo de dados, este trabalho também apresenta contribuições para uma importante aplicação tecnológica com impacto social e na saúde pública. Especificamente, explorou-se um sensor óptico para a identificação automática de espécies de insetos a partir da análise de informações provenientes do batimento de asas dos insetos. Para a descrição dos dados, foi verificado que os coeficientes Mel-cepstrais apresentaram os melhores resultados entre as diferentes técnicas de processamento digital de sinais avaliadas. Este sensor é um exemplo concreto de aplicação responsável por gerar um fluxo de dados em que é necessário realizar classificações em tempo real. Durante a etapa de classificação, este sensor exige a adaptação a possíveis variações em condições ambientais, responsáveis por alterar o comportamento dos insetos ao longo do tempo. Para lidar com este problema, é proposto um Sistema com Múltiplos Classificadores que realiza a seleção dinâmica do classificador mais adequado de acordo com características de cada exemplo de teste. Em avaliações com mudanças pouco significativas nas condições ambientais, foi possível obter uma acurácia de classificação próxima de 90%, no cenário com múltiplas classes e, cerca de 95% para a identificação da espécie Aedes aegypti, considerando o treinamento com uma única classe. No cenário com mudanças significativas nos dados, foi possível obter 91% de acurácia em um problema com 5 classes e 96% para a classificação de insetos vetores de importantes doenças como dengue e zika vírus. / Many applications are able to generate data continuously over t ime in an ordered and uninterrupted way in a dynamic environment , called data streams. Among possible tasks that can be performed with these data, classification is one of the most prominent . Due to non-stationarity of the environment that generates the data, the features that describe the concepts of the classes can change over time. Thus, the classifiers that deal with data streams require constants updates in their classification models to maintain a stable accuracy over time. In the update phase, most of the approaches assume that after the classification of each example from the stream, their actual class label is available without any t ime delay (zero latency). Given the high label costs, it is more reasonable to consider that this delay could vary for the most portion of the data. In the more challenging case, there are applications with extreme latency, where in after the classification of the examples, heir actual class labels are never available to the algorithm. In this scenario, it is not possible to use traditional approaches. Thus, there is the need of new methods that are able to maintain a classification model updated in the absence of labeled data. In this thesis, besides to discuss the problem of latency to obtain actual labels in data stream classification problems, neglected by most of the works, we also propose two new algorithms to deal with extreme latency, called SCARGC and MClassification. Both algorithms are based on the use of clustering approaches to adapt to changes in an unsupervised way. The proposed algorithms are intuitive, simpleand showed superior or equivalent results in terms of accuracy and computation time compared to other approaches from literature in an evaluation on synthetic and real data. In addition to the advance in the state-of-the-art in the stream learning area, this thesis also presents contributions to an important technological application with social and public health impacts. Specifically, it was studied an optical sensor to automatically identify insect species by the means of the analysis of information coming from wing beat of insects. To describe the data, we conclude that the Mel-cepst ral coefficients guide to the best results among different evaluated digital signal processing techniques. This sensor is a concrete example of an applicat ion that generates a data st ream for which it is necessary to perform real-time classification. During the classification phase, this sensor must adapt their classification model to possible variat ions in environmental conditions, responsible for changing the behavior of insects. To address this problem, we propose a System with Multiple Classifiers that dynamically selects the most adequate classifier according to characteristics of each test example. In evaluations with minor changes in the environmental conditions, we achieved a classification accuracy close to 90% in a scenario with multiple classes and 95% when identifying Aedes aegypti species considering the training phase with only the positive class. In the scenario with considerable changes in the environmental conditions, we achieved 91% of accuracy considering 5 species and 96% to classify vector mosquitoes of important diseases as dengue and zika virus.
25

Agrupamento de séries temporais em fluxos contínuos de dados / Time series clustering for data streams

Pereira, Cássio Martini Martins 29 October 2013 (has links)
Recentemente, a área de mineração de fluxos contínuos de dados ganhou importância, a qual visa extrair informação útil a partir de conjuntos massivos e contínuos de dados que evoluem com o tempo. Uma das técnicas que mais se destaca nessa área e a de agrupamento de dados, a qual busca estruturar grandes volumes de dados em hierarquias ou partições, tais que objetos mais similares estejam em um mesmo grupo. Diversos algoritmos foram propostos nesse contexto, porém a maioria concentrou-se no agrupamento de fluxos compostos por pontos em um espaço multidimensional. Poucos trabalhos voltaram-se para o agrupamento de séries temporais, as quais se caracterizam por serem coleções de observações coletadas sequencialmente no tempo. Técnicas atuais para agrupamento de séries temporais em fluxos contínuos apresentam uma limitação na escolha da medida de similaridade, a qual na maioria dos casos e baseada em uma simples correlação, como a de Pearson. Este trabalho mostra que até para modelos clássicos de séries temporais, como os de Box e Jenkins, a correlação de Pearson não é capaz de detectar similaridade, apesar das séries serem provenientes de um mesmo modelo matemático e com mesma parametrização. Essa limitação nas técnicas atuais motivou este trabalho a considerar os modelos geradores de séries temporais, ou seja, as equações que regem sua geração, por meio de diversas medidas descritivas, tais como a Autoinformação Mútua, o Expoente de Hurst e várias outras. A hipótese considerada e que, por meio do uso de medidas descritivas, pode-se obter uma melhor caracterização do modelo gerador de séries temporais e, consequentemente, um agrupamento de maior qualidade. Nesse sentido, foi realizada uma avaliação de diversas medidas descritivas, as quais foram usadas como entrada para um novo algoritmo de agrupamento baseado em árvores, denominado TS-Stream. Experimentos com bases sintéticas compostas por diversos modelos de séries temporais foram realizados, mostrando a superioridade de TS-Stream sobre ODAC, a técnica mais popular para esta tarefa encontrada na literatura. Experimentos com séries reais provenientes de preços de ações da NYSE e NASDAQ mostraram que o uso de TS-Stream na escolha de ações, por meio da criação de uma carteira de investimentos diversificada, pode aumentar os retornos das aplicações em várias ordens de grandeza, se comparado a estratégias baseadas somente no indicador econômico Moving Average Convergence Divergence / Recently, the data streams mining area has gained importance, which aims to extract useful information from massive and continuous data sources that evolve over time. One of the most popular techniques in this area is clustering, which aims to structure large volumes of data into hierarchies or partitions, such that similar objects are placed in the same group. Several algorithms were proposed in this context, however most of them focused on the clustering of streams composed of multidimensional points. Few studies have focused on clustering streaming time series, which are characterized by being collections of observations sampled sequentially along time. Current techniques for clustering streaming time series have a limitation in the choice of the similarity measure, as most are based on a simple correlation, such as Pearson. This thesis shows that even for classic time series models, such as those from Box and Jenkins, the Pearson correlation is not capable of detecting similarity, despite dealing with series originating from the same mathematical model and the same parametrization. This limitation in current techniques motivated this work to consider time series generating models, i.e., generating equations, through the use of several descriptive measures, such as Auto Mutual Information, the Hurst Exponent and several others. The hypothesis is that through the use of several descriptive measures, a better characterization of time series generating models can be achieved, which in turn will lead to better clustering quality. In that context, several descriptive measures were evaluated and then used as input to a new tree-based clustering algorithm, entitled TS-Stream. Experiments were conducted with synthetic data sets composed of various time series models, confirming the superiority of TS-Stream when compared to ODAC, the most successful technique in the literature for this task. Experiments with real-world time series from stock market data of the NYSE and NASDAQ showed that the use of TS-Stream in the selection of stocks, by the creation of a diversified portfolio, can increase the returns of the investment in several orders of magnitude when compared to trading strategies solely based on the Moving Average Convergence Divergence financial indicator
26

Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance / Unsupervided learning of massive data streams : application to Big Data in insurance

Ghesmoune, Mohammed 25 November 2016 (has links)
Le travail de recherche exposé dans cette thèse concerne le développement d'approches à base de growing neural gas (GNG) pour le clustering de flux de données massives. Nous proposons trois extensions de l'approche GNG : séquentielle, distribuée et parallèle, et une méthode hiérarchique; ainsi qu'une nouvelle modélisation pour le passage à l'échelle en utilisant le paradigme MapReduce et l'application de ce modèle pour le clustering au fil de l'eau du jeu de données d'assurance. Nous avons d'abord proposé la méthode G-Stream. G-Stream, en tant que méthode "séquentielle" de clustering, permet de découvrir de manière incrémentale des clusters de formes arbitraires et en ne faisant qu'une seule passe sur les données. G-Stream utilise une fonction d'oubli an de réduire l'impact des anciennes données dont la pertinence diminue au fil du temps. Les liens entre les nœuds (clusters) sont également pondérés par une fonction exponentielle. Un réservoir de données est aussi utilisé an de maintenir, de façon temporaire, les observations très éloignées des prototypes courants. L'algorithme batchStream traite les données en micro-batch (fenêtre de données) pour le clustering de flux. Nous avons défini une nouvelle fonction de coût qui tient compte des sous ensembles de données qui arrivent par paquets. La minimisation de la fonction de coût utilise l'algorithme des nuées dynamiques tout en introduisant une pondération qui permet une pénalisation des données anciennes. Une nouvelle modélisation utilisant le paradigme MapReduce est proposée. Cette modélisation a pour objectif de passer à l'échelle. Elle consiste à décomposer le problème de clustering de flux en fonctions élémentaires (Map et Reduce). Ainsi de traiter chaque sous ensemble de données pour produire soit les clusters intermédiaires ou finaux. Pour l'implémentation de la modélisation proposée, nous avons utilisé la plateforme Spark. Dans le cadre du projet Square Predict, nous avons validé l'algorithme batchStream sur les données d'assurance. Un modèle prédictif combinant le résultat du clustering avec les arbres de décision est aussi présenté. L'algorithme GH-Stream est notre troisième extension de GNG pour la visualisation et le clustering de flux de données massives. L'approche présentée a la particularité d'utiliser une structure hiérarchique et topologique, qui consiste en plusieurs arbres hiérarchiques représentant des clusters, pour les tâches de clustering et de visualisation. / The research outlined in this thesis concerns the development of approaches based on growing neural gas (GNG) for clustering of data streams. We propose three algorithmic extensions of the GNG approaches: sequential, distributed and parallel, and hierarchical; as well as a model for scalability using MapReduce and its application to learn clusters from the real insurance Big Data in the form of a data stream. We firstly propose the G-Stream method. G-Stream, as a “sequential" clustering method, is a one-pass data stream clustering algorithm that allows us to discover clusters of arbitrary shapes without any assumptions on the number of clusters. G-Stream uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time. The links between the nodes are also weighted. A reservoir is used to hold temporarily the distant observations in order to reduce the movements of the nearest nodes to the observations. The batchStream algorithm is a micro-batch based method for clustering data streams which defines a new cost function taking into account that subsets of observations arrive in discrete batches. The minimization of this function, which leads to a topological clustering, is carried out using dynamic clusters in two steps: an assignment step which assigns each observation to a cluster, followed by an optimization step which computes the prototype for each node. A scalable model using MapReduce is then proposed. It consists of decomposing the data stream clustering problem into the elementary functions, Map and Reduce. The observations received in each sub-dataset (within a time interval) are processed through deterministic parallel operations (Map and Reduce) to produce the intermediate states or the final clusters. The batchStream algorithm is validated on the insurance Big Data. A predictive and analysis system is proposed by combining the clustering results of batchStream with decision trees. The architecture and these different modules from the computational core of our Big Data project, called Square Predict. GH-Stream for both visualization and clustering tasks is our third extension. The presented approach uses a hierarchical and topological structure for both of these tasks.
27

Adaptação de viés indutivo de algoritmos de agrupamento de fluxos de dados / Adapting the inductive bias of data-stream clustering algorithms

Albertini, Marcelo Keese 11 April 2012 (has links)
Diversas áreas de pesquisa são dedicadas à compreensão de fenômenos que exigem a coleta ininterrupta de sequências de amostras, denominadas fluxos de dados. Esses fenômenos frequentemente apresentam comportamento variável e são estudados por meio de indução não supervisionada baseada em agrupamento de dados. Atualmente, o processo de agrupamento tem exibido sérias limitações em sua aplicação a fluxos de dados, devido às exigências impostas pelas variações comportamentais e pelo modo de coleta de dados. Embora tem-se desenvolvido algoritmos eficientes para agrupar fluxos de dados, há a necessidade de estudos sobre a influência de variações comportamentais nos parâmetros de algoritmos (e.g., taxas de aprendizado e limiares de proximidade), as quais interferem diretamente na compreensão de fenômenos. Essa lacuna motivou esta tese, cujo objetivo foi a proposta de uma abordagem para a adaptação do viés indutivo de algoritmos de agrupamento de fluxos de dados de acordo com variações comportamentais dos fenômenos em estudo. Para cumprir esse objetivo projetou-se: i) uma abordagem baseada em uma nova arquitetura de rede neural artificial que permite avaliação de comportamento de fenômenos por meio da estimação de cadeias de Markov e entropia de Shannon; ii) uma abordagem para adaptar parâmetros de algoritmos de agrupamento tradicional de acordo com variações comportamentais em blocos sequenciais de dados; e iii) uma abordagem para adaptar parâmetros de agrupamento de acordo com a contínua avaliação da estabilidade de dados. Adicionalmente, apresenta-se nesta tese uma taxonomia de técnicas de detecção de variação comportamental de fenômenos e uma formalização para o problema de agrupamento de fluxos de dados / Several research fields have described phenomena that produce endless sequences of samples, referred to as data streams. These phenomena usually present behavior variation and are studied by means of unsupervised induction based on data clustering. In order to cope with the characteristics of data streams, researchers have designed clustering algorithms with low time and space complexity requirements. However, predefined and static parameters (thresholds, number of clusters and learning rates) found in current algorithms still limit the application of clustering to data streams. This limitation motivated this thesis, which proposes a continuous approach to evaluate behavior variations and adapt algorithm inductive bias by changing its parameters. The main contribution of this thesis is the proposal of three approaches to adapt induction bias: i) an approach based on the design of an adaptive artificial self-organizing neural network architecture that enables behavior evaluation by means of Markov chain and Shannon entropy estimations; ii) an approach to adapt traditional data clustering algorithms according to behavior variations in sequences of data chunks; and iii) an approach based on the proposed neural network architecture to continuously adapt parameters by means of the evaluation of data stability. Additionally, in order to analyze the essential characteristics of data streams, this thesis presents a formalization for the problem of data stream clustering and a taxonomy on approaches to detect behavior variations
28

Exploratory Visualization of Data Pattern Changes in Multivariate Data Streams

Xie, Zaixian 21 October 2011 (has links)
" More and more researchers are focusing on the management, querying and pattern mining of streaming data. The visualization of streaming data, however, is still a very new topic. Streaming data is very similar to time-series data since each datapoint has a time dimension. Although the latter has been well studied in the area of information visualization, a key characteristic of streaming data, unbounded and large-scale input, is rarely investigated. Moreover, most techniques for visualizing time-series data focus on univariate data and seldom convey multidimensional relationships, which is an important requirement in many application areas. Therefore, it is necessary to develop appropriate techniques for streaming data instead of directly applying time-series visualization techniques to it. As one of the main contributions of this dissertation, I introduce a user-driven approach for the visual analytics of multivariate data streams based on effective visualizations via a combination of windowing and sampling strategies. To help users identify and track how data patterns change over time, not only the current sliding window content but also abstractions of past data in which users are interested are displayed. Sampling is applied within each single time window to help reduce visual clutter as well as preserve data patterns. Sampling ratios scheduled for different windows reflect the degree of user interest in the content. A degree of interest (DOI) function is used to represent a user's interest in different windows of the data. Users can apply two types of pre-defined DOI functions, namely RC (recent change) and PP (periodic phenomena) functions. The developed tool also allows users to interactively adjust DOI functions, in a manner similar to transfer functions in volume visualization, to enable a trial-and-error exploration process. In order to visually convey the change of multidimensional correlations, four layout strategies were designed. User studies showed that three of these are effective techniques for conveying data pattern changes compared to traditional time-series data visualization techniques. Based on this evaluation, a guide for the selection of appropriate layout strategies was derived, considering the characteristics of the targeted datasets and data analysis tasks. Case studies were used to show the effectiveness of DOI functions and the various visualization techniques. A second contribution of this dissertation is a data-driven framework to merge and thus condense time windows having small or no changes and distort the time axis. Only significant changes are shown to users. Pattern vectors are introduced as a compact format for representing the discovered data model. Three views, juxtaposed views, pattern vector views, and pattern change views, were developed for conveying data pattern changes. The first shows more details of the data but needs more canvas space; the last two need much less canvas space via conveying only the pattern parameters, but lose many data details. The experiments showed that the proposed merge algorithms preserves more change information than an intuitive pattern-blind averaging. A user study was also conducted to confirm that the proposed techniques can help users find pattern changes more quickly than via a non-distorted time axis. A third contribution of this dissertation is the history views with related interaction techniques were developed to work under two modes: non-merge and merge. In the former mode, the framework can use natural hierarchical time units or one defined by domain experts to represent timelines. This can help users navigate across long time periods. Grid or virtual calendar views were designed to provide a compact overview for the history data. In addition, MDS pattern starfields, distance maps, and pattern brushes were developed to enable users to quickly investigate the degree of pattern similarity among different time periods. For the merge mode, merge algorithms were applied to selected time windows to generate a merge-based hierarchy. The contiguous time windows having similar patterns are merged first. Users can choose different levels of merging with the tradeoff between more details in the data and less visual clutter in the visualizations. The usability evaluation demonstrated that most participants could understand the concepts of the history views correctly and finished assigned tasks with a high accuracy and relatively fast response time. "
29

An incremental gaussian mixture network for data stream classification in non-stationary environments / Uma rede de mistura de gaussianas incrementais para classificação de fluxos contínuos de dados em cenários não estacionários

Diaz, Jorge Cristhian Chamby January 2018 (has links)
Classificação de fluxos contínuos de dados possui muitos desafios para a comunidade de mineração de dados quando o ambiente não é estacionário. Um dos maiores desafios para a aprendizagem em fluxos contínuos de dados está relacionado com a adaptação às mudanças de conceito, as quais ocorrem como resultado da evolução dos dados ao longo do tempo. Duas formas principais de desenvolver abordagens adaptativas são os métodos baseados em conjunto de classificadores e os algoritmos incrementais. Métodos baseados em conjunto de classificadores desempenham um papel importante devido à sua modularidade, o que proporciona uma maneira natural de se adaptar a mudanças de conceito. Os algoritmos incrementais são mais rápidos e possuem uma melhor capacidade anti-ruído do que os conjuntos de classificadores, mas têm mais restrições sobre os fluxos de dados. Assim, é um desafio combinar a flexibilidade e a adaptação de um conjunto de classificadores na presença de mudança de conceito, com a simplicidade de uso encontrada em um único classificador com aprendizado incremental. Com essa motivação, nesta dissertação, propomos um algoritmo incremental, online e probabilístico para a classificação em problemas que envolvem mudança de conceito. O algoritmo é chamado IGMN-NSE e é uma adaptação do algoritmo IGMN. As duas principais contribuições da IGMN-NSE em relação à IGMN são: melhoria de poder preditivo para tarefas de classificação e a adaptação para alcançar um bom desempenho em cenários não estacionários. Estudos extensivos em bases de dados sintéticas e do mundo real demonstram que o algoritmo proposto pode rastrear os ambientes em mudança de forma muito próxima, independentemente do tipo de mudança de conceito. / Data stream classification poses many challenges for the data mining community when the environment is non-stationary. The greatest challenge in learning classifiers from data stream relates to adaptation to the concept drifts, which occur as a result of changes in the underlying concepts. Two main ways to develop adaptive approaches are ensemble methods and incremental algorithms. Ensemble method plays an important role due to its modularity, which provides a natural way of adapting to change. Incremental algorithms are faster and have better anti-noise capacity than ensemble algorithms, but have more restrictions on concept drifting data streams. Thus, it is a challenge to combine the flexibility and adaptation of an ensemble classifier in the presence of concept drift, with the simplicity of use found in a single classifier with incremental learning. With this motivation, in this dissertation we propose an incremental, online and probabilistic algorithm for classification as an effort of tackling concept drifting. The algorithm is called IGMN-NSE and is an adaptation of the IGMN algorithm. The two main contributions of IGMN-NSE in relation to the IGMN are: predictive power improvement for classification tasks and adaptation to achieve a good performance in non-stationary environments. Extensive studies on both synthetic and real-world data demonstrate that the proposed algorithm can track the changing environments very closely, regardless of the type of concept drift.
30

Detecção de novidade com aplicação a fluxos contínuos de dados / Novelty detection with application to data streams

Spinosa, Eduardo Jaques 20 February 2008 (has links)
Neste trabalho a detecção de novidade é tratada como o problema de identificação de conceitos emergentes em dados que podem ser apresentados em um fluxo contínuo. Considerando a relação intrínseca entre tempo e novidade e os desafios impostos por fluxos de dados, uma nova abordagem é proposta. OLINDDA (OnLIne Novelty and Drift Detection Algorithm) vai além da classficação com uma classe e concentra-se no aprendizado contínuo não-supervisionado de novos conceitos. Tendo aprendido uma descrição inicial de um conceito normal, prossegue à análise de novos dados, tratando-os como um fluxo contínuo em que novos conceitos podem aparecer a qualquer momento. Com o uso de técnicas de agrupamento, OLINDDA pode empregar diversos critérios de validação para avaliar grupos em termos de sua coesão e representatividade. Grupos considerados válidos produzem conceitos que podem sofrer fusão, e cujo conhecimento é continuamente incorporado. A técnica é avaliada experimentalmente com dados artificiais e reais. O módulo de classificação com uma classe é comparado a outras técnicas de detecção de novidade, e a abordagem como um todo é analisada sob vários aspectos por meio da evolução temporal de diversas métricas. Os resultados reforçam a importância da detecção contínua de novos conceitos, assim como as dificuldades e desafios do aprendizado não-supervisionado de novos conceitos em fluxos de dados / In this work novelty detection is treated as the problem of identifying emerging concepts in data that may be presented in a continuous ow. Considering the intrinsic relationship between time and novelty and the challenges imposed by data streams, a novel approach is proposed. OLINDDA, an OnLIne Novelty and Drift Detection Algorithm, goes beyond one-class classification and focuses on the unsupervised continuous learning of novel concepts. Having learned an initial description of a normal concept, it proceeds to the analysis of new data, treating them as a continuous ow where novel concepts may appear at any time. By the use of clustering techniques, OLINDDA may employ several validation criteria to evaluate clusters in terms of their cohesiveness and representativeness. Clusters considered valid produce concepts that may be merged, and whose knowledge is continuously incorporated. The technique is experimentally evaluated with artificial and real data. The one-class classification module is compared to other novelty detection techniques, and the whole approach is analyzed from various aspects through the temporal evolution of several metrics. Results reinforce the importance of continuous detection of novel concepts, as well as the dificulties and challenges of the unsupervised learning of novel concepts in data streams

Page generated in 0.0703 seconds