71

串流資料分析在台灣股市指數期貨之應用 / An Application of Streaming Data Analysis on TAIEX Futures

林宏哲, Lin, Hong Che Unknown Date
Data stream mining is an important research field because, in many real-world settings, data is generated and collected as a stream. Financial market data is one such example: it is intrinsically dynamic and usually produced sequentially. In this thesis, we apply data stream mining techniques to predict the rising or falling of the Taiwan Stock Exchange Capitalization Weighted Stock Index Futures (TAIEX Futures). The prediction is difficult, and the difficulty is associated with concept drift, i.e., changes in the underlying data distribution that can cause a sharp drop in predictive accuracy. We therefore focus on handling concept drift. We first show, through an empirical study, that concept drift appears to occur frequently in the TAIEX Futures data. The results also indicate that a concept drift detection method can substantially improve prediction accuracy, even when it is paired with a data stream mining algorithm that otherwise performs poorly. We also survey algorithms designed for specific types of concept drift, and our experiments suggest that sudden and reoccurring concept drift exist in the TAIEX Futures data. Finally, we propose an ensemble-based algorithm for reoccurring concept drift. Its most characteristic feature is that it adaptively determines the chunk size (the number of examples per component classifier), a key parameter that other concept drift handling algorithms require the user to set.
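The thesis code is not included in this abstract. As a generic, hedged sketch of the workflow it describes — predict rise/fall, feed each prediction's correctness to a drift detector, and reset the model when drift is signalled — the following Python example uses a toy majority-class classifier and a toy accuracy-drop detector; both are illustrative stand-ins, not the ensemble algorithm proposed in the thesis.

```python
from collections import deque

# Illustrative sketch only -- toy classifier and detector, not the thesis's ensemble.

class MajorityClassifier:
    """Toy incremental classifier: predicts the majority label seen since the last reset."""
    def __init__(self):
        self.counts = {}
    def predict(self, features):
        return max(self.counts, key=self.counts.get) if self.counts else "rise"
    def learn(self, features, label):
        self.counts[label] = self.counts.get(label, 0) + 1
    def reset(self):
        self.counts = {}

class AccuracyDropDetector:
    """Toy drift detector: alarms when recent accuracy falls well below the best seen."""
    def __init__(self, window=30, tolerance=0.2):
        self.results = deque(maxlen=window)
        self.best, self.tolerance = 0.0, tolerance
    def add_result(self, was_correct):
        self.results.append(int(was_correct))
        if len(self.results) < self.results.maxlen:
            return False
        acc = sum(self.results) / len(self.results)
        self.best = max(self.best, acc)
        if self.best - acc > self.tolerance:   # significant accuracy drop => drift
            self.results.clear()
            self.best = 0.0
            return True
        return False

def prequential_run(stream, classifier, detector):
    """Test-then-train loop: predict, feed correctness to the detector, reset on drift."""
    correct, drifts = 0, 0
    for features, label in stream:
        prediction = classifier.predict(features)
        correct += int(prediction == label)
        if detector.add_result(prediction == label):
            classifier.reset()                 # discard the outdated model
            drifts += 1
        classifier.learn(features, label)      # then train on the true label
    return correct / max(len(stream), 1), drifts

# Synthetic rise/fall stream with an abrupt concept change halfway through.
stream = [({}, "rise")] * 60 + [({}, "fall")] * 60
print(prequential_run(stream, MajorityClassifier(), AccuracyDropDetector()))
```

In practice the detector would be one of the methods surveyed in the thesis, and the classifier an incremental learner suited to the actual TAIEX features.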
72

Development of GIS-Based National Hydrography Dataset, Sub-Basin Boundaries, and Water Quality/Quantity Data Analysis System for Turkey

Girgin, Serkan 01 December 2003
Computerized data visualization and analysis tools, especially Geographic Information Systems (GIS), constitute an important part of today's water resources development and management studies. In order to obtain satisfactory results from such tools, accurate and comprehensive hydrography datasets are needed that include both spatial and hydrologic information on surface water resources and watersheds. Where available, such datasets may support many applications, such as hydrologic and environmental modeling, impact assessment, and construction planning. The primary purposes of this study are the production of prototype national hydrography and watershed datasets for Turkey, and the development of GIS-based tools for the analysis of local water quality and quantity data. For these purposes, national hydrography datasets and analysis systems of several countries are reviewed, and based on the experience gained: 1) sub-watershed boundaries of 26 major national basins are derived from the digital elevation model of the country by using raster-based analysis methods, and these watersheds are named according to the coding system of the European Union; 2) a prototype hydrography dataset with built-in connectivity and water flow direction information is produced from publicly available data sources; 3) GIS-based spatial tools are developed to facilitate navigation through streams and watersheds in the hydrography dataset; and 4) a state-of-the-art GIS-based stream flow and water quality data analysis system is developed, which is based on the structure of nationally available data and includes advanced statistical and spatial analysis capabilities. All datasets and developed tools are gathered in a single graphical user interface within GIS and made available to the end-users.
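The raster routines themselves are not given in the abstract; as a minimal sketch of the kind of raster-based analysis used for watershed derivation, here is a plain D8 flow-direction computation over a small synthetic DEM. The array values and function name are invented for illustration; production work would rely on a GIS toolkit.

```python
import numpy as np

# Minimal D8 sketch (illustrative only; real workflows use GIS toolkits).
# For each DEM cell, point to the steepest downslope neighbour among the 8
# adjacent cells; code 0 marks a pit (no downslope neighbour).
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def d8_flow_direction(dem):
    rows, cols = dem.shape
    direction = np.zeros((rows, cols), dtype=np.int8)
    for r in range(rows):
        for c in range(cols):
            best_code, best_slope = 0, 0.0
            for code, (dr, dc) in enumerate(OFFSETS, start=1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dist = 1.41421356 if dr and dc else 1.0   # diagonal step is longer
                    slope = (dem[r, c] - dem[nr, nc]) / dist
                    if slope > best_slope:
                        best_code, best_slope = code, slope
            direction[r, c] = best_code
    return direction

# Tiny synthetic DEM sloping towards the south-east corner.
dem = np.array([[9.0, 8.0, 7.0],
                [8.0, 6.0, 5.0],
                [7.0, 5.0, 3.0]])
print(d8_flow_direction(dem))
```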
73

[en] AN EFFICIENT APPROACH TO COORDINATED RECONFIGURATION IN DISTRIBUTED DATA STREAM SYSTEMS / [pt] UMA ABORDAGEM EFICIENTE PARA RECONFIGURAÇÃO COORDENADA EM SISTEMAS DISTRIBUÍDOS DE PROCESSAMENTO DE DATA STREAMS

RAFAEL OLIVEIRA VASCONCELOS 24 July 2017
While many data stream systems have to provide continuous (24x7) services with no acceptable downtime, they also have to cope with changes in their execution environments and in the requirements with which they must comply (e.g., moving from an on-premises architecture to a cloud system, changing the network technology, adding new functionality, or modifying existing parts). On the one hand, dynamic software reconfiguration (i.e., the capability of evolving on the fly) is a desirable feature. On the other hand, stream systems may suffer from the disruption and overhead caused by reconfiguration. Because the system must be reconfigured (i.e., evolved) while it must not be disrupted (i.e., blocked), consistent and non-disruptive reconfiguration is still considered an open problem. This thesis presents and validates a non-quiescent approach for dynamic software reconfiguration that preserves the consistency of distributed data stream processing systems. Unlike many works that require the system to reach a safe state (e.g., quiescence) before performing a reconfiguration, the proposed approach enables the system to evolve (i.e., be reconfigured) smoothly and non-disruptively without reaching quiescence. The evaluation indicates that the proposed approach supports consistent distributed reconfiguration and has negligible impact on availability and performance. Furthermore, the implementation of the proposed approach outperformed the quiescent approach and Upstart in all experiments.
74

A Reservoir of Adaptive Algorithms for Online Learning from Evolving Data Streams

Pesaranghader, Ali 26 September 2018
Continuous change and development are essential aspects of evolving environments and applications, including, but not limited to, smart cities, military, medicine, nuclear reactors, self-driving cars, aviation, and aerospace. That is, the fundamental characteristics of such environments may evolve and, if no reaction is taken, cause dangerous consequences, e.g., putting people's lives at stake. Therefore, learning systems need to apply intelligent algorithms to monitor changes in their environments and update themselves effectively. Further, the performance of learning algorithms may fluctuate because the incoming data continuously evolves: an approach that is currently effective may become obsolete after a change in the data or the environment. Hence, the question 'how can we maintain an efficient learning algorithm over time against evolving data?' has to be addressed. In this thesis, we make two contributions to settle these challenges. In the machine learning literature, the phenomenon of (distributional) change in data is known as concept drift. Concept drift may shift decision boundaries and cause a decline in accuracy. Learning algorithms therefore have to detect concept drift in evolving data streams and replace their predictive models accordingly. To address this challenge, adaptive learners have been devised that may use drift detection methods to locate drift points in dynamic and changing data streams. A drift detection method that discovers drift points quickly, with the lowest false positive and false negative rates, is preferred; a false positive is an incorrect alarm for concept drift, and a false negative is a missed alarm. In this thesis, we introduce three algorithms, called the Fast Hoeffding Drift Detection Method (FHDDM), the Stacking Fast Hoeffding Drift Detection Method (FHDDMS), and the McDiarmid Drift Detection Methods (MDDMs), for detecting drift points with minimal delay, false positive, and false negative rates. FHDDM is a sliding-window-based algorithm that applies Hoeffding's inequality (Hoeffding, 1963) to detect concept drift. FHDDM slides its window over the prediction results, which are either 1 (for a correct prediction) or 0 (for a wrong prediction). Meanwhile, it compares the mean of the elements inside the window with the maximum mean observed so far; a significant difference between the two means, upper-bounded by the Hoeffding inequality, indicates the occurrence of concept drift. FHDDMS extends FHDDM by sliding multiple windows over its entries for better drift detection in terms of detection delay and false negative rate. In contrast to FHDDM/S, the MDDM variants assign weights to their entries, i.e., higher weights are associated with the most recent entries in the sliding window, for faster detection of concept drift. The rationale is that recent examples reflect the ongoing situation adequately, so putting higher weights on the latest entries allows concept drift to be detected more quickly. An MDDM algorithm bounds the difference between the weighted mean of the elements in the sliding window and the maximum weighted mean seen so far, using McDiarmid's inequality (McDiarmid, 1989), and alarms for concept drift once a significant difference is observed.
We experimentally show that FHDDM/S and MDDMs outperform the state-of-the-art, presenting promising results in terms of adaptation and classification measures. Due to the evolving nature of data streams, the performance of an adaptive learner, which is defined by classification, adaptation, and resource consumption measures, may fluctuate over time. In fact, a learning algorithm, in the form of a (classifier, detector) pair, may perform well before a concept drift point but not after. We frame this problem with the question 'how can we ensure that an efficient classifier-detector pair is present at any time in an evolving environment?' To answer it, we developed the Tornado framework, which runs various kinds of learning algorithms simultaneously against evolving data streams. Each algorithm incrementally and independently trains a predictive model and updates the statistics of its drift detector. Meanwhile, the framework monitors the (classifier, detector) pairs and recommends the most efficient one, with respect to classification, adaptation, and resource consumption, to the user. We further define the holistic CAR measure, which integrates the classification, adaptation, and resource consumption measures for evaluating the performance of adaptive learning algorithms. Our experiments confirm that the most efficient algorithm may differ over time because of the developing and evolving nature of data streams.
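The description of FHDDM above is concrete enough to sketch. The following Python class is an unofficial rendering of that idea — a window of 1/0 prediction outcomes, the current window mean compared against the best mean seen so far, and Hoeffding's bound as the test threshold. The parameter names window_size and delta are illustrative, not values from the thesis.

```python
import math
from collections import deque

class FHDDMSketch:
    """Sliding-window drift detector in the spirit of FHDDM (not the reference code)."""

    def __init__(self, window_size=100, delta=1e-6):
        self.window = deque(maxlen=window_size)   # stream of 1 (correct) / 0 (wrong)
        self.best_mean = 0.0                      # maximum window mean observed so far
        # Hoeffding bound for the mean of `window_size` values in [0, 1].
        self.epsilon = math.sqrt(math.log(1.0 / delta) / (2.0 * window_size))

    def add_result(self, prediction_was_correct):
        self.window.append(1 if prediction_was_correct else 0)
        if len(self.window) < self.window.maxlen:
            return False                          # not enough evidence yet
        mean = sum(self.window) / len(self.window)
        self.best_mean = max(self.best_mean, mean)
        if self.best_mean - mean > self.epsilon:  # significant accuracy drop => drift
            self.window.clear()
            self.best_mean = 0.0
            return True
        return False
```

On a drift signal the learner would typically retrain or replace its model; the MDDM variants differ mainly in weighting recent entries more heavily and bounding the weighted mean with McDiarmid's inequality instead.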
75

[en] DG2CEP: AN ON-LINE ALGORITHM FOR REAL-TIME DETECTION OF SPATIAL CLUSTERS FROM LARGE DATA STREAMS THROUGH COMPLEX EVENT PROCESSING / [pt] DG2CEP: UM ALGORITMO ON-LINE PARA DETECÇÃO EM TEMPO REAL DE AGLOMERADOS ESPACIAIS EM GRANDES FLUXOS DE DADOS ATRAVÉS DE PROCESSAMENTO DE FLUXO DE DADOS

MARCOS PAULINO RORIZ JUNIOR 08 June 2017
Spatial concentrations (or spatial clusters) of moving objects, such as vehicles and humans, are a mobility pattern relevant to many applications. Fast detection of this pattern and of its evolution, e.g., whether a cluster is shrinking or growing, is useful in numerous scenarios, such as detecting the formation of traffic jams or a rapid dispersion of people at a music concert. On-line detection of this pattern is challenging because it requires algorithms capable of continuously and efficiently processing a high volume of position updates in a timely manner. Currently, most approaches to spatial cluster detection operate in batch mode: location updates are recorded over time periods of a certain length and then batch-processed by an external routine, delaying the result of the cluster detection until the end of the period. Further, they make extensive use of spatial data structures and operators, which can be troublesome to maintain or parallelize in on-line scenarios. To address these issues, this thesis proposes DG2CEP, an algorithm that combines the well-known density-based clustering algorithm DBSCAN with the data stream processing paradigm Complex Event Processing (CEP) to achieve continuous and timely detection of spatial clusters. Our experiments with real-world data streams indicate that DG2CEP is able to detect the formation and dispersion of clusters with small latency, within a few seconds, for thousands of moving objects, while attaining higher similarity to DBSCAN than batch-based approaches.
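DG2CEP itself is expressed as CEP processing rules; as a rough, unofficial illustration of the underlying idea — density-based grouping of a window of location updates — here is a small grid-based sketch. The cell size, density threshold, and sample coordinates are assumptions, and real DBSCAN/DG2CEP behaviour differs in detail.

```python
from collections import defaultdict

# Grid-based approximation for illustration only -- not DG2CEP itself.

def dense_cells(positions, cell_size=0.01, min_points=5):
    """Bucket (lat, lon) updates from the current window into grid cells; keep dense cells."""
    counts = defaultdict(int)
    for lat, lon in positions:
        counts[(int(lat // cell_size), int(lon // cell_size))] += 1
    return {cell for cell, n in counts.items() if n >= min_points}

def cluster_cells(cells):
    """Group 8-adjacent dense cells into clusters via flood fill."""
    clusters, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            r, c = stack.pop()
            if (r, c) in seen:
                continue
            seen.add((r, c))
            cluster.add((r, c))
            stack.extend((r + dr, c + dc)
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if (r + dr, c + dc) in cells)
        clusters.append(cluster)
    return clusters

# A sliding window of location updates: five objects close together, one isolated.
window = [(10.001, 20.001), (10.002, 20.002), (10.003, 20.001),
          (10.001, 20.003), (10.002, 20.004), (30.0, 40.0)]
print(cluster_cells(dense_cells(window)))
```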
76

Propagação em grafos bipartidos para extração de tópicos em fluxo de documentos textuais / Propagation in bipartite graphs for topic extraction in stream of textual data

Thiago de Paulo Faleiros 08 June 2016
Handling large amounts of data is a requirement for modern text mining algorithms. In some applications documents are published constantly, which imposes a high long-term storage cost. Easily adaptable methods are therefore needed for an approach that treats documents as a stream and analyzes the data in a single pass without requiring costly storage. Another requirement is that such an approach can exploit heuristics in order to improve the quality of the results. Several models for the automatic extraction of latent information from a collection of documents have been proposed in the literature, among which probabilistic topic models are prominent. Probabilistic topic models achieve good practical results and have been extended into several variants that incorporate different types of information. However, properly describing these models, deriving them, and then obtaining appropriate inference algorithms are difficult tasks, requiring a rigorous mathematical treatment of the operations performed in the latent-dimension discovery process.
Thus, for the development of a simple and efficient method to tackle the problem of latent-dimension discovery, a proper representation of the data is required. The hypothesis of this thesis is that, by using a bipartite graph representation of textual data, one can address the task of discovering latent patterns in the relationships between objects, for example between documents and words, in a simple and intuitive way. To validate this hypothesis, we developed a framework based on a label propagation algorithm over the bipartite graph representation. The framework, called PBG (Propagation in Bipartite Graph), was initially applied in the unsupervised setting to a static collection of documents. Then a semi-supervised version was proposed, which needs only a small amount of labeled documents for the transductive classification task. Finally, it was applied in the dynamic setting, where a stream of textual documents is considered. Comparative analyses were performed, and the results indicated that PBG is a viable and competitive alternative for tasks in the unsupervised and semi-supervised contexts.
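PBG's update rules are not spelled out in the abstract; as a loose illustration of label propagation on a document-word bipartite graph (not the PBG algorithm itself), the sketch below alternately pushes topic weights between documents and words, optionally keeping a few labeled documents fixed for the semi-supervised case. The corpus, number of topics, and seeding are invented.

```python
import random
from collections import defaultdict

# Illustrative propagation only -- not the PBG update rules from the thesis.

def propagate(doc_words, n_topics=2, seeds=None, iterations=20, rng_seed=0):
    """Alternately push topic weights between document and word nodes of a bipartite graph."""
    rng = random.Random(rng_seed)
    docs = {d: [rng.random() for _ in range(n_topics)] for d in doc_words}
    if seeds:                                       # labeled documents (semi-supervised case)
        for d, topic in seeds.items():
            docs[d] = [1.0 if t == topic else 0.0 for t in range(n_topics)]
    word_docs = defaultdict(list)
    for d, words in doc_words.items():
        for w in words:
            word_docs[w].append(d)

    def normalize(vec):
        total = sum(vec) or 1.0
        return [v / total for v in vec]

    for _ in range(iterations):
        words = {w: normalize([sum(docs[d][t] for d in ds) for t in range(n_topics)])
                 for w, ds in word_docs.items()}
        for d, ws in doc_words.items():
            if seeds and d in seeds:
                continue                            # keep labeled documents fixed
            docs[d] = normalize([sum(words[w][t] for w in ws) for t in range(n_topics)])
    return docs

corpus = {"d1": ["stream", "drift", "window"],
          "d2": ["drift", "window", "classifier"],
          "d3": ["graph", "topic", "label"],
          "d4": ["label", "graph", "propagation"]}
print(propagate(corpus, seeds={"d1": 0, "d3": 1}))
```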
77

An efficient entropy estimation approach

Paavola, M. (Marko) 01 November 2011
Advances in miniaturisation have led to the development of new wireless measurement technologies such as wireless sensor networks (WSNs). A WSN consists of low-cost, battery-operated nodes capable of sensing the environment, transmitting and receiving, and computing. While a WSN has several advantages, including cost-effectiveness and easy installation, the nodes suffer from small memory, low computing power, small bandwidth, and a limited energy supply. In order to cope with these resource restrictions, data processing methods should be as efficient as possible; as a result, high-quality approximations are preferred over exact answers. The aim of this thesis was to propose an efficient entropy approximation method for resource-constrained environments. Specifically, the algorithm should use a small, constant amount of memory, achieve a certain accuracy, and have low computational demand. The performance of the proposed algorithm was evaluated experimentally in three case studies. The first study focused on the online monitoring of WSN communication performance in an industrial environment. The monitoring approach was based on the observation that entropy can be applied to assess the impact of interference on the time-delay variation of periodic tasks. The main purpose of the two additional cases, depth-of-anaesthesia (DOA) monitoring and benchmarking with simulated data sets, was to provide additional evidence of the general applicability of the proposed method. Moreover, in the case of DOA monitoring, an efficient entropy approximation could assist in the development of handheld devices or in processing large amounts of online data from different channels simultaneously. The initial results from the communication and DOA monitoring applications, as well as from the simulations, were encouraging. Based on the case studies, the proposed method was able to meet the stated requirements. Since entropy is a widely used quantity, the method is also expected to find a variety of applications in measurement systems with similar requirements.
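The estimator developed in the thesis is not specified here; as a generic stand-in for constant-memory entropy approximation, the sketch below keeps a fixed number of exponentially decayed histogram bins and computes Shannon entropy from them. The bin range, decay factor, and delay-jitter example are assumptions.

```python
import math

class StreamingEntropy:
    """Constant-memory entropy approximation over a bounded-range stream (generic sketch,
    not the method proposed in the thesis)."""

    def __init__(self, lo, hi, n_bins=16, decay=0.999):
        self.lo, self.hi, self.n_bins = lo, hi, n_bins
        self.decay = decay                     # forget old samples gradually
        self.counts = [0.0] * n_bins

    def update(self, x):
        idx = int((x - self.lo) / (self.hi - self.lo) * self.n_bins)
        idx = min(max(idx, 0), self.n_bins - 1)
        self.counts = [c * self.decay for c in self.counts]
        self.counts[idx] += 1.0

    def entropy(self):
        total = sum(self.counts)
        if total == 0:
            return 0.0
        probs = (c / total for c in self.counts if c > 0)
        return -sum(p * math.log2(p) for p in probs)

# Example: delay jitter samples in milliseconds.
est = StreamingEntropy(lo=0.0, hi=50.0, n_bins=16)
for delay in (5.1, 5.3, 5.2, 20.7, 5.0, 5.2, 35.4):
    est.update(delay)
print(round(est.entropy(), 3))
```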
78

Système de gestion de flux pour l'Internet des objets intelligents / Data stream management system for the future internet of things

Billet, Benjamin 19 March 2015
The Internet of Things (IoT) is currently characterized by an ever-growing number of networked Things, i.e., devices that have their own identity together with increasingly advanced computation and networking capabilities: smartphones, smart watches, smart home appliances, etc. In addition, these Things are being equipped with more and more sensors and actuators that enable them to sense and act on their environment, linking the physical world with the virtual world.
Specifically, the IoT raises many challenges related to its very large scale and high dynamicity, as well as the great heterogeneity of the data and systems involved (e.g., powerful versus resource-constrained devices, mobile versus fixed devices, continuously powered versus battery-powered devices, etc.). These challenges require new systems and techniques for developing applications that are able to (i) collect data from the numerous data sources of the IoT and (ii) interact both with the environment, using the actuators, and with the users, using dedicated GUIs. To this end, we defend the following thesis: given the huge volume of data continuously being produced by sensors (measurements and events), we must consider (i) data streams as the reference data model for the IoT and (ii) continuous processing as the reference computation model for processing these data streams. Moreover, knowing that privacy preservation and energy consumption are increasingly critical concerns, we claim that the Things should be autonomous and work together in restricted areas, as close as possible to the users, rather than systematically shifting the computation logic onto powerful servers or into the cloud. For this purpose, our main contribution is the design and development of a distributed data stream management system for the IoT. In this context, we revisit two fundamental aspects of software engineering and distributed systems: service-oriented architecture and task deployment. We address the problems of (i) accessing data streams through services and (ii) deploying continuous processing tasks automatically, according to the characteristics of both tasks and devices. This research work led to the development of a middleware layer called Dioptase, designed to run on the Things and abstract them as generic devices that can be dynamically assigned communication, storage, and computation tasks according to their available resources. In order to validate the feasibility and relevance of our work, we implemented a prototype of Dioptase and evaluated its performance. In addition, we show that Dioptase is a realistic solution that can work in cooperation with legacy sensor and actuator networks already deployed in the environment.
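Dioptase's service interfaces are not described in this abstract; as a tiny, generic illustration of the computation model the thesis argues for — continuous operators over data streams running close to the data source — here is a generator-based pipeline in Python. The operators and the threshold scenario are invented for illustration, not Dioptase APIs.

```python
# Invented operators for illustration; not Dioptase APIs.

def sensor_stream(readings):
    """Source operator: yields raw sensor readings one at a time."""
    for r in readings:
        yield r

def smooth(stream, alpha=0.3):
    """Stateful operator: exponential moving average over the stream."""
    avg = None
    for x in stream:
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
        yield avg

def threshold_events(stream, limit):
    """Filter operator: emit an event whenever the smoothed value exceeds `limit`."""
    for x in stream:
        if x > limit:
            yield {"event": "high-value", "value": round(x, 2)}

# Continuous pipeline: source -> smoothing -> event detection.
readings = [20.1, 20.4, 25.9, 31.2, 30.8, 21.0]
for event in threshold_events(smooth(sensor_stream(readings)), limit=25.0):
    print(event)
```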
79

Novel Online Data Cleaning Protocols for Data Streams in Trajectory, Wireless Sensor Networks

Pumpichet, Sitthapon 12 November 2013
The promise of Wireless Sensor Networks (WSNs) is the autonomous collaboration of a collection of sensors to accomplish specific goals that a single sensor cannot achieve. Sensor networking serves a range of applications by providing raw data as the foundation for further analyses and actions. Imprecision in the collected data can seriously mislead the decision-making process of sensor-based applications, resulting in ineffectiveness or failure to meet the application objectives. Because inherent WSN characteristics often corrupt the raw sensor readings, many research efforts attempt to improve the accuracy of the corrupted or "dirty" sensor data; the dirty data need to be cleaned or corrected. However, existing data cleaning solutions restrict themselves to static WSNs, where deployed sensors rarely move during operation. Nowadays, many emerging applications relying on WSNs need sensor mobility to enhance application efficiency and usage flexibility: the locations of deployed sensors are dynamic, and each sensor functions independently while contributing its resources. Sensors mounted on vehicles to monitor traffic conditions are one prospective example. Sensor mobility makes the network topology and the correlations among sensor streams transient. Because they rely on static relationships among sensors, the existing methods for cleaning sensor data in static WSNs are invalid in such mobile scenarios. Therefore, a data cleaning solution that considers sensor movements is actively needed. This dissertation aims to improve the quality of sensor data by considering the consequences of the various trajectory relationships of the autonomous mobile sensors in the system. First, we address the dynamic network topology due to sensor mobility. The concept of a virtual sensor is presented and used for spatio-temporal selection of neighboring sensors to help clean sensor data streams; this method is one of the first to clean data in mobile sensor environments. We also study the mobility pattern of moving sensors relative to the boundaries of sub-areas of interest, and develop a belief-based analysis to determine reliable sets of neighboring sensors to improve cleaning performance, especially when node density is relatively low. Finally, we design a novel sketch-based technique to clean data from internal sensors where spatio-temporal relationships among sensors cannot provide data correlations among sensor streams.
80

Algoritmos anytime baseados em instâncias para classificação em fluxo de dados / Instance-based anytime algorithm to data stream classification

Cristiano Inácio Lemes 09 March 2016
Data stream learning is an important research field that has received much attention from the scientific community. In many real-world applications, data is generated as a potentially infinite temporal sequence. The main characteristic of stream processing is the need to provide answers under stringent time and memory restrictions. For example, a data stream classifier must provide an answer for each event before the next one arrives; otherwise, some events in the stream may be left unclassified. Many streams generate events at a highly variable rate, i.e., the time interval between two consecutive events may vary greatly.
For a learning system to be successful, two properties must be satisfied: (i) it must be able to provide a classification for a new example in a short time, and (ii) it must be able to adapt the classification model to handle concept change, since the data may not follow a stationary distribution. Batch machine learning algorithms do not satisfy these properties because they assume that the distribution is stationary and are not prepared to operate under severe memory and processing constraints. To satisfy these requirements, such algorithms must be adapted to the data stream context. One possible adaptation is to turn the algorithm into an anytime classifier. Anytime algorithms can be interrupted and still provide an approximate answer (classification) at any time. Another adaptation is to turn the algorithm into an incremental classifier, so that its model can be updated with new examples from the data stream. In this work, we evaluate two approaches for data stream learning. The first is based on a state-of-the-art anytime k-nearest-neighbor classifier, for which a new tiebreak approach is proposed; experiments show consistently better performance on many benchmark data sets. The second proposed approach adapts the anytime algorithm to handle concept change. This approach, called the Incremental Anytime Algorithm, was designed in two versions, one based on the Space Saving algorithm and the other on a sliding window. Experiments show that each version has its advantages and disadvantages depending on the stream, but overall both versions performed better than the baseline methods.
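The proposed tiebreak rule is not given in the abstract; the sketch below only illustrates the general anytime k-NN idea it builds on: examine training instances one by one, keep a best-so-far set of neighbours, and return an answer whenever the time budget expires. The budget, toy data, and labels are illustrative.

```python
import heapq
import time
from collections import Counter

# Generic anytime k-NN sketch; the thesis's tiebreak and instance ordering are not reproduced here.

def anytime_knn(query, train, k=3, budget_s=0.001):
    """Anytime k-NN: refine the k nearest neighbours until the time budget runs out."""
    deadline = time.perf_counter() + budget_s
    heap = []                                    # max-heap of (-distance, label)
    for x, label in train:                       # ideally ordered best-first by an index
        d = sum((a - b) ** 2 for a, b in zip(query, x))
        if len(heap) < k:
            heapq.heappush(heap, (-d, label))
        elif -d > heap[0][0]:
            heapq.heapreplace(heap, (-d, label))
        if time.perf_counter() > deadline:       # interrupted: answer with best-so-far neighbours
            break
    votes = Counter(label for _, label in heap)
    return votes.most_common(1)[0][0]

train = [((0.1, 0.2), "rise"), ((0.9, 0.8), "fall"),
         ((0.2, 0.1), "rise"), ((0.8, 0.9), "fall")]
print(anytime_knn((0.15, 0.18), train, k=3))
```

Anytime k-NN variants typically order the training instances so that the most useful ones are examined first, which is where strategies such as the proposed tiebreak come into play.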
