Global ETD Search

41	Scalable Integration View Computation and Maintenance with Parallel, Adaptive and Grouping Techniques Liu, Bin 19 August 2005 (has links) " Materialized integration views constructed by integrating data from multiple distributed data sources help to achieve better access, reliable performance, and high availability for a wide range of applications. In this dissertation, we propose parallel, adaptive, and grouping techniques to address scalability challenges in high-performance integration view computation and maintenance due to increasingly large data sources and high rates of source updates. State-of-the-art parallel integration view computation makes the common assumption that the maximal pipelined parallelism leads to superior performance. We instead propose segmented bushy parallel processing that combines pipelined parallelism with alternate forms of parallelism to achieve an overall more effective strategy. Experimental studies conducted over a cluster of high-performance PCs confirm that the proposed strategy has an on average of 50\% improvement in terms of total processing time in comparison to existing solutions. Run-time adaptation becomes critical for parallel integration view computation due to its long running and memory intensive nature. We investigate two types of state level adaptations, namely, state spill and state relocation, to address the run-time memory shortage. We propose lazy-disk and active-disk approaches that integrate both adaptations to maximize run-time query throughput in a memory constrained environment. We also propose global throughput-oriented state adaptation strategies for computation plans with multiple state intensive operators. Extensive experiments confirm the effectiveness of our proposed adaptation solutions. Once results have been computed and materialized, it's typically more efficient to maintain them incrementally instead of full recomputation. However, state-of-the-art incremental view maintenance require O($n^2$) maintenance queries with n being the number of data sources that the view is defined upon. Moreover, they do not exploit view definitions and data source processing capabilities to further improve view maintenance performance. We propose novel grouping maintenance algorithms that dramatically reduce the number of maintenance queries to (O(n)). A cost-based view maintenance framework has been proposed to generate optimized maintenance plans tuned to particular environmental settings. Extensive experimental studies verify the effectiveness of our maintenance algorithms as well as the maintenance framework. " parallel multi-join computation state level adaptation materialized view maintenance grouping maintenance cyclic join views distributed data sources Virtual storage (Computer science)
42	\"Armazenamento distribuído de dados e checkpointing de aplicações paralelas em grades oportunistas\" / Distributed data storage and checkpointing of parallel applications in opportunistic grids Raphael Yokoingawa de Camargo 04 May 2007 (has links) Grades computacionais oportunistas utilizam recursos ociosos de máquinas compartilhadas para executar aplicações que necessitam de um alto poder computacional e/ou trabalham com grandes quantidades de dados. Mas a execução de aplicações paralelas computacionalmente intensivas em ambientes dinâmicos e heterogêneos, como grades computacionais oportunistas, é uma tarefa difícil. Máquinas podem falhar, ficar inacessíveis ou passar de ociosas para ocupadas inesperadamente, comprometendo a execução de aplicações. Um mecanismo de tolerância a falhas que dê suporte a arquiteturas heterogêneas é um importante requisito para estes sistemas. Neste trabalho, analisamos, implementamos e avaliamos um mecanismo de tolerância a falhas baseado em checkpointing para aplicações paralelas em grades computacionais oportunistas. Este mecanismo permite o monitoramento de execuções e a migração de aplicações entre nós heterogêneos da grade. Mas além da execução, é preciso gerenciar e armazenar os dados gerados e utilizados por estas aplicações. Desejamos uma infra-estrutura de armazenamento de dados de baixo custo e que utilize o espaço livre em disco de máquinas compartilhadas da grade. Devemos utilizar somente os ciclos ociosos destas máquinas para armazenar e recuperar dados, de modo que um sistema de armazenamento distribuído que as utilize deve ser redundante e tolerante a falhas. Para resolver o problema do armazenamento de dados em grades oportunistas, projetamos, implementamos e avaliamos o middleware OppStore. Este middleware provê armazenamento distribuído e confiável de dados, que podem ser acessados de qualquer máquina da grade. As máquinas são organizadas em aglomerados, que são conectados por uma rede peer-to-peer auto-organizável e tolerante a falhas. Dados são codificados em fragmentos redundantes antes de serem armazenados, de modo que arquivos podem ser reconstruídos utilizando apenas um subconjunto destes fragmentos. Finalmente, para lidar com a heterogeneidade dos recursos, desenvolvemos uma extensão ao protocolo de roteamento em redes peer-to-peer Pastry. Esta extensão adiciona balanceamento de carga e suporte à heterogeneidade de máquinas ao protocolo Pastry. / Opportunistic computational grids use idle resources from shared machines to execute applications that need large amounts of computational power and/or deal with large amounts of data. But executing computationally intensive parallel applications in dynamic and heterogeneous environments, such as opportunistic grids, is a daunting task. Machines may fail, become inaccessible, or change from idle to occupied unexpectedly, compromising the application execution. A fault tolerance mechanism that supports heterogeneous architectures is an important requisite for such systems. In this work, we analyze, implement and evaluate a checkpointing-based fault tolerance mechanism for parallel applications running on opportunistic grids. The mechanism monitors application execution and allows the migration of applications between heterogeneous nodes of the grid. But besides application execution, it is necessary to manage data generated and used by those applications. We want a low cost data storage infrastructure that utilizes the unused disk space of grid shared machines. The system should use the machines to store and recover data only during their idle periods, requiring the system to be redundant and fault-tolerant. To solve the data storage problem in opportunistic grids, we designed, implemented and evaluated the OppStore middleware. This middleware provides reliable distributed storage for application data, which can be accessed from any machine in the grid. The machines are organized in clusters, connected by a self-organizing and fault-tolerant peer-to-peer network. During storage, data is codified into redundant fragments, allowing the reconstruction of the original file using only a subset of those fragments. Finally, to deal with resource heterogeneity, we developed an extension to the Pastry peer-to-peer routing substrate, enabling heterogeneity-aware load-balancing message routing. armazenamento distribuído BSP checkpointing grades computacionais peer-to-peer tolerância a falhas BSP checkpointing computational grids distributed data storage fault-tolerance grid computing peer-to-peer
43	Lhrs p2p : une nouvelle structure de données distribuée et scalable pour les environnements Pair à Pair / Lhrsp2p : a new scalable and distributed data structure for Peer to Peer environnements Yakouben, Hanafi 14 May 2013 (has links) Nous proposons une nouvelle structure de données distribuée et scalable appelée LHRSP2P conçue pour les environnements pair à pair(P2P).Les données de l'application forment un fichier d’enregistrements identifiés par les clés primaires. Les enregistrements sont dans des cases mémoires sur des pairs, adressées par le hachage distribué (LH). Des éclatements créent dynamiquement de nouvelles cases pour accommoder les insertions. L'accès par clé à un enregistrement comporte un seul renvoi au maximum. Le scan du fichier s’effectue au maximum en deux rounds. Ces résultats sont parmi les meilleurs à l'heure actuelle. Tout fichier LHRSP2P est également protégé contre le Churn. Le calcul de parité protège toute indisponibilité jusqu’à k cases, où k ≥ 1 est un paramètre scalable. Un nouveau type de requêtes, qualifiées de sûres, protège également contre l’accès à toute case périmée. Nous prouvons les propriétés de notre SDDS formellement par une implémentation prototype et des expérimentations. LHRSP2P apparaît utile aux applications Big Data, sur des RamClouds tout particulièrement / We propose a new scalable and distributed data structure termed LHRSP2P designed for Peer-to-Peer environment (P2P). Application data forms a file of records identified by primary keys. Records are in buckets on peers, addressed by distributed linear hashing (LH). Splits create new buckets dynamically, to accommodate inserts. Key access to a record uses at most one hop. Scan of the file proceeds in two rounds at most. These results are among best at present. An LHRSP2P file is also protected against Churn. Parity calculation recovers from every unavailability of up to k≥1, k is a scalable parameter. A new type of queries, qualified as sure, protects also against access to any out-of-date bucket. We prove the properties of our SDDS formally, by a prototype implementation and experiments. LHRSP2P appears useful for Big Data manipulations, over RamClouds especially. Sdds Système pair à pair P2p Hachage linaire distribué Lh* Haute disponibilité Churn Scalable and distributed data structure P2P system Distributed linear hashing (LH*) High availability Churn
44	Network Coding in Distributed, Dynamic, and Wireless Environments: Algorithms and Applications Chaudhry, Mohammad 2011 December 1900 (has links) The network coding is a new paradigm that has been shown to improve throughput, fault tolerance, and other quality of service parameters in communication networks. The basic idea of the network coding techniques is to relish the "mixing" nature of the information flows, i.e., many algebraic operations (e.g., addition, subtraction etc.) can be performed over the data packets. Whereas traditionally information flows are treated as physical commodities (e.g., cars) over which algebraic operations can not be performed. In this dissertation we answer some of the important open questions related to the network coding. Our work can be divided into four major parts. Firstly, we focus on network code design for the dynamic networks, i.e., the networks with frequently changing topologies and frequently changing sets of users. Examples of such dynamic networks are content distribution networks, peer-to-peer networks, and mobile wireless networks. A change in the network might result in infeasibility of the previously assigned feasible network code, i.e., all the users might not be able to receive their demands. The central problem in the design of a feasible network code is to assign local encoding coefficients for each pair of links in a way that allows every user to decode the required packets. We analyze the problem of maintaining the feasibility of a network code, and provide bounds on the number of modifications required under dynamic settings. We also present distributed algorithms for the network code design, and propose a new path-based assignment of encoding coefficients to construct a feasible network code. Secondly, we investigate the network coding problems in wireless networks. It has been shown that network coding techniques can significantly increase the overall throughput of wireless networks by taking advantage of their broadcast nature. In wireless networks each packet transmitted by a device is broadcasted within a certain area and can be overheard by the neighboring devices. When a device needs to transmit packets, it employs the Index Coding that uses the knowledge of what the device's neighbors have heard in order to reduce the number of transmissions. With the Index Coding, each transmitted packet can be a linear combination of the original packets. The Index Coding problem has been proven to be NP-hard, and NP-hard to approximate. We propose an efficient exact, and several heuristic solutions for the Index Coding problem. Noting that the Index Coding problem is NP-hard to approximate, we look at it from a novel perspective and define the Complementary Index Coding problem, where the objective is to maximize the number of transmissions that are saved by employing coding compared to the solution that does not involve coding. We prove that the Complementary Index Coding problem can be approximated in several cases of practical importance. We investigate both the multiple unicast and multiple multicast scenarios for the Complementary Index Coding problem for computational complexity, and provide polynomial time approximation algorithms. Thirdly, we consider the problem of accessing large data files stored at multiple locations across a content distribution, peer-to-peer, or massive storage network. Parts of the data can be stored in either original form, or encoded form at multiple network locations. Clients access the parts of the data through simultaneous downloads from several servers across the network. For each link used client has to pay some cost. A client might not be able to access a subset of servers simultaneously due to network restrictions e.g., congestion etc. Furthermore, a subset of the servers might contain correlated data, and accessing such a subset might not increase amount of information at the client. We present a novel efficient polynomial-time solution for this problem that leverages the matroid theory. Fourthly, we explore applications of the network coding for congestion mitigation and over flow avoidance in the global routing stage of Very Large Scale Integration (VLSI) physical design. Smaller and smarter devices have resulted in a significant increase in the density of on-chip components, which has given rise to congestion and over flow as critical issues in on-chip networks. We present novel techniques and algorithms for reducing congestion and minimizing over flows. Network Coding Index Coding Distributed Data Retrieval Data Storage VLSI CAD Distributed Networks Global Routing Wireless Network Coding
45	Distributed knowledge sharing and production through collaborative e-Science platforms / Partage et production de connaissances distribuées dans des plateformes scientifiques collaboratives Gaignard, Alban 15 March 2013 (has links) Cette thèse s'intéresse à la production et au partage cohérent de connaissances distribuées dans le domaine des sciences de la vie. Malgré l'augmentation constante des capacités de stockage et de calcul des infrastructures informatiques, les approches centralisées pour la gestion de grandes masses de données scientifiques multi-sources deviennent inadaptées pour plusieurs raisons: (i) elles ne garantissent pas l'autonomie des fournisseurs de données qui doivent conserver un certain contrôle sur les données hébergées pour des raisons éthiques et/ou juridiques, (ii) elles ne permettent pas d'envisager le passage à l'échelle des plateformes en sciences computationnelles qui sont la source de productions massives de données scientifiques. Nous nous intéressons, dans le contexte des plateformes collaboratives en sciences de la vie NeuroLOG et VIP, d'une part, aux problématiques de distribution et d'hétérogénéité sous-jacentes au partage de ressources, potentiellement sensibles ; et d'autre part, à la production automatique de connaissances au cours de l'usage de ces plateformes, afin de faciliter l'exploitation de la masse de données produites. Nous nous appuyons sur une approche ontologique pour la modélisation des connaissances et proposons à partir des technologies du web sémantique (i) d'étendre ces plateformes avec des stratégies efficaces, statiques et dynamiques, d'interrogations sémantiques fédérées et (ii) d'étendre leur environnent de traitement de données pour automatiser l'annotation sémantique des résultats d'expérience ``in silico'', à partir de la capture d'informations de provenance à l'exécution et de règles d'inférence spécifiques au domaine. Les résultats de cette thèse, évalués sur l'infrastructure distribuée et contrôlée Grid'5000, apportent des éléments de réponse à trois enjeux majeurs des plateformes collaboratives en sciences computationnelles : (i) un modèle de collaborations sécurisées et une stratégie de contrôle d'accès distribué pour permettre la mise en place d'études multi-centriques dans un environnement compétitif, (ii) des résumés sémantiques d'expérience qui font sens pour l'utilisateur pour faciliter la navigation dans la masse de données produites lors de campagnes expérimentales, et (iii) des stratégies efficaces d'interrogation et de raisonnement fédérés, via les standards du Web Sémantique, pour partager les connaissances capitalisées dans ces plateformes et les ouvrir potentiellement sur le Web de données. Mots-clés: Flots de services et de données scientifiques, Services web sémantiques, Provenance, Web de données, Web sémantique, Fédération de bases de connaissances, Intégration de données distribuées, e-Sciences, e-Santé. / This thesis addresses the issues of coherent distributed knowledge production and sharing in the Life-science area. In spite of the continuously increasing computing and storage capabilities of computing infrastructures, the management of massive scientific data through centralized approaches became inappropriate, for several reasons: (i) they do not guarantee the autonomy property of data providers, constrained, for either ethical or legal concerns, to keep the control over the data they host, (ii) they do not scale and adapt to the massive scientific data produced through e-Science platforms. In the context of the NeuroLOG and VIP Life-science collaborative platforms, we address on one hand, distribution and heterogeneity issues underlying, possibly sensitive, resource sharing ; and on the other hand, automated knowledge production through the usage of these e-Science platforms, to ease the exploitation of the massively produced scientific data. We rely on an ontological approach for knowledge modeling and propose, based on Semantic Web technologies, to (i) extend these platforms with efficient, static and dynamic, transparent federated semantic querying strategies, and (ii) to extend their data processing environment, from both provenance information captured at run-time and domain-specific inference rules, to automate the semantic annotation of ``in silico'' experiment results. The results of this thesis have been evaluated on the Grid'5000 distributed and controlled infrastructure. They contribute to addressing three of the main challenging issues faced in the area of computational science platforms through (i) a model for secured collaborations and a distributed access control strategy allowing for the setup of multi-centric studies while still considering competitive activities, (ii) semantic experiment summaries, meaningful from the end-user perspective, aimed at easing the navigation into massive scientific data resulting from large-scale experimental campaigns, and (iii) efficient distributed querying and reasoning strategies, relying on Semantic Web standards, aimed at sharing capitalized knowledge and providing connectivity towards the Web of Linked Data. Services web sémantiques Web de données Fédération de bases de connaissances Intégration de données distribuées E-Sciences E-Santé Scientific workflows Semantic web services Web of linked data Federated knowledge bases Distributed data integration E-Science E-Health 004
46	Codes With Locality For Distributed Data Storage Moorthy, Prakash Narayana 03 1900 (has links) (PDF) This thesis deals with the problem of code design in the setting of distributed storage systems consisting of multiple storage nodes, storing many different data les. A primary goal in such systems is the efficient repair of a failed node. Regenerating codes and codes with locality are two classes of coding schemes that have recently been proposed in literature to address this goal. While regenerating codes aim to minimize the amount of data-download needed to carry out node repair, codes with locality seek to minimize the number of nodes accessed during node repair. Our focus here is on linear codes with locality, which is a concept originally introduced by Gopalan et al. in the context of recovering from a single node failure. A code-symbol of a linear code C is said to have locality r, if it can be recovered via a linear combination of r other code-symbols of C. The code C is said to have (i) information-symbol locality r, if all of its message symbols have locality r, and (ii) all-symbol locality r, if all the code-symbols have locality r. We make the following three contributions to the area of codes with locality. Firstly, we extend the notion of locality, in two directions, so as to permit local recovery even in the presence of multiple node failures. In the first direction, we consider codes with \local error correction" in which a code-symbol is protected by a local-error-correcting code having local-minimum-distance 3, and thus allowing local recovery of the code-symbol even in the presence of 2 other code-symbol erasures. In the second direction, we study codes with all-symbol locality that can recover from two erasures via a sequence of two local, parity-check computations. When restricted to the case of all-symbol locality and two erasures, the second approach allows, in general, for design of codes having larger minimum distance than what is possible via the rst approach. Under both approaches, by studying the generalized Hamming weights of the dual codes, we derive tight upper bounds on their respective minimum distances. Optimal code constructions are identified under both approaches, for a class of code parameters. A few interesting corollaries result from this part of our work. Firstly, we obtain a new upper bound on the minimum distance of concatenated codes and secondly, we show how it is always possible to construct the best-possible code (having largest minimum distance) of a given dimension when the code's parity check matrix is partially specified. In a third corollary, we obtain a new upper bound for the minimum distance of codes with all-symbol locality in the single erasure case. Secondly, we introduce the notion of codes with local regeneration that seek to combine the advantages of both codes with locality as well as regenerating codes. These are vector-alphabet analogues of codes with local error correction in which the local codes themselves are regenerating codes. An upper bound on the minimum distance is derived when the constituent local codes have a certain uniform rank accumulation (URA) property. This property is possessed by both the minimum storage regenerating (MSR) and the minimum bandwidth regenerating (MBR) codes. We provide several optimal constructions of codes with local regeneration, where the local codes are either the MSR or the MBR codes. The discussion here is also extended to the case of general vector-linear codes with locality, in which the local codes do not necessarily have the URA property. Finally, we evaluate the efficacy of two specific coding solutions, both possessing an inherent double replication of data, in a practical distributed storage setting known as Hadoop. Hadoop is an open-source platform dealing with distributed storage of data in which the primary aim is to perform distributed computation on the stored data via a paradigm known as Map Reduce. Our evaluation shows that while these codes have efficient repair properties, their vector-alphabet-nature can negatively a affect Map Reduce performance, if they are implemented under the current Hadoop architecture. Specifically, we see that under the current architecture, the choice of number processor cores per node and Map-task scheduling algorithm play a major role in determining their performance. The performance evaluation is carried out via a combination of simulations and actual experiments in Hadoop clusters. As a remedy to the problem, we also pro-pose a modified architecture in which one allows erasure coding across blocks belonging to different les. Under the modified architecture, the new coding solutions will not suffer from any Map Reduce performance-loss as seen in the original architecture, while retaining all of their desired repair properties Distributed Storage Coding Regeneration Codes Local Repair Codes Linear Codes Information Theory Error Correcting Codes Encoding Vector Codes Minimum Storage Regenerating (MSR) Codes Coding Theory Hadoop Distributed Data Storage Computer Science
47	Distributed Coding For Wireless Sensor Networks Varshneya, Virendra K 11 1900 (has links) (PDF) No description available. Wireless Communication Systems Detectors Data Compression (Telecommunication) Coding Theory Sensor Network Model Wireless Sensor Networks Distributed Lossy Source Coding Distributed Coding Sensor Networks Diversity Coding Distributed Data Compression Multiple Access Channel (MAC) Communication Engineering
48	Smart Meters Big Data : Behavioral Analytics via Incremental Data Mining and Visualization Singh, Shailendra January 2016 (has links) The big data framework applied to smart meters offers an exception platform for data-driven forecasting and decision making to achieve sustainable energy efficiency. Buying-in consumer confidence through respecting occupants' energy consumption behavior and preferences towards improved participation in various energy programs is imperative but difficult to obtain. The key elements for understanding and predicting household energy consumption are activities occupants perform, appliances and the times that appliances are used, and inter-appliance dependencies. This information can be extracted from the context rich big data from smart meters, although this is challenging because: (1) it is not trivial to mine complex interdependencies between appliances from multiple concurrent data streams; (2) it is difficult to derive accurate relationships between interval based events, where multiple appliance usage persist; (3) continuous generation of the energy consumption data can trigger changes in appliance associations with time and appliances. To overcome these challenges, we propose an unsupervised progressive incremental data mining technique using frequent pattern mining (appliance-appliance associations) and cluster analysis (appliance-time associations) coupled with a Bayesian network based prediction model. The proposed technique addresses the need to analyze temporal energy consumption patterns at the appliance level, which directly reflect consumers' behaviors and provide a basis for generalizing household energy models. Extensive experiments were performed on the model with real-world datasets and strong associations were discovered. The accuracy of the proposed model for predicting multiple appliances usage outperformed support vector machine during every stage while attaining accuracy of 81.65\%, 85.90\%, 89.58\% for 25\%, 50\% and 75\% of the training dataset size respectively. Moreover, accuracy results of 81.89\%, 75.88\%, 79.23\%, 74.74\%, and 72.81\% were obtained for short-term (hours), and long-term (day, week, month, and season) energy consumption forecasts, respectively. Smart Grid Smart Meter Behavioral Analytics Data-Driven Approach Energy Consumption Patterns Frequent Pattern Correlation Pattern Cluster Analysis Online Data Mining Incremental Data Mining Distributed Data Mining Association Rules Appliance Usage Prediction Energy Consumption Prediction Prediction Visualization
49	[en] OPPORTUNISTIC ROUTING TOWARDS MOBILE SINK NODES IN BLUETOOTH MESH NETWORKS / [pt] ROTEAMENTO OPORTUNÍSTICO EM DIREÇÃO A NÓS SINK MÓVEIS EM REDES BLUETOOTH MESH MARCELO PAULON JUCA VASCONCELOS 26 April 2021 (has links) [pt] Este trabalho avalia a coleta esporádica de dados em uma rede sem fio Bluetooth Mesh, usando o simulador OMNET (mais mais) INET. O coletor de dados é um nó sink em movimento, que poderia ser um smartphone ou outro dispositivo portátil, carregado por um pedestre, ciclista, animal, ou um drone. O nó sink poderia se conectar a uma rede mesh em áreas de difícil acesso onde não há acesso a internet, e coletar dados de sensores. Após implementar extensões ao Bluetooth Mesh, funcionalidades de nós Low Power e Friends no OMNET (mais mais), conseguimos propor e avaliar algoritmos para roteamento adaptativo, e com foco em mobilidade, de dados de sensores em direção ao nó sink. Uma variação de um dos algoritmos de roteamento propostos alcançou um aumento de 173,54 por cento na quantidade de dados únicos entregues ao nó sink em comparação ao algoritmo de roteamento padrão do Bluetooth Mesh. Neste caso, houve um aumento de apenas 4,63 por cento no consumo de energia para o mesmo cenário. Além disso, a taxa de entrega aumentou em 111.82 por cento. / [en] This work evaluates sporadic data collection on a Bluetooth Mesh network, using the OMNET (plus plus) INET simulator. The data collector is a roaming sink node, which could be a smartphone or other portable device, carried by a pedestrian, a biker, an animal, or a drone. The sink node could connect to a mesh network in hard-to-reach areas that do not have internet access and collect sensor data. After implementing Bluetooth Mesh relay extensions, Low Power, and Friend features in OMNET (plus plus), we were able to propose and evaluate algorithms for mobility-aware, adaptive, routing of sensor data towards the sink node. One variation of a proposed routing algorithm achieved a 173.54 percent increase in unique data delivered to the sink node compared to Bluetooth Mesh s default routing algorithm. In that case, there was only a 4.63 percent increase in energy consumption for the same scenario. Also, the delivery rate increased by 111.82 percent. [pt] REDES DE SENSORES SEM FIO [pt] RECONFIGURACAO MESH [pt] BLUETOOTH MESH [pt] REDES MESH [pt] COLETA DISTRIBUIDA DE DADOS [en] WIRELESS SENSOR NETWORK [en] MESH RECONFIGURATION [en] BLUETOOTH MESH [en] MESH NETWORKS [en] DISTRIBUTED DATA COLLECTION
50	A scalable evolutionary learning classifier system for knowledge discovery in stream data mining Dam, Hai Huong, Information Technology & Electrical Engineering, Australian Defence Force Academy, UNSW January 2008 (has links) Data mining (DM) is the process of finding patterns and relationships in databases. The breakthrough in computer technologies triggered a massive growth in data collected and maintained by organisations. In many applications, these data arrive continuously in large volumes as a sequence of instances known as a data stream. Mining these data is known as stream data mining. Due to the large amount of data arriving in a data stream, each record is normally expected to be processed only once. Moreover, this process can be carried out on different sites in the organisation simultaneously making the problem distributed in nature. Distributed stream data mining poses many challenges to the data mining community including scalability and coping with changes in the underlying concept over time. In this thesis, the author hypothesizes that learning classifier systems (LCSs) - a class of classification algorithms - have the potential to work efficiently in distributed stream data mining. LCSs are an incremental learner, and being evolutionary based they are inherently adaptive. However, they suffer from two main drawbacks that hinder their use as fast data mining algorithms. First, they require a large population size, which slows down the processing of arriving instances. Second, they require a large number of parameter settings, some of them are very sensitive to the nature of the learning problem. As a result, it becomes difficult to choose a right setup for totally unknown problems. The aim of this thesis is to attack these two problems in LCS, with a specific focus on UCS - a supervised evolutionary learning classifier system. UCS is chosen as it has been tested extensively on classification tasks and it is the supervised version of XCS, a state of the art LCS. In this thesis, the architectural design for a distributed stream data mining system will be first introduced. The problems that UCS should face in a distributed data stream task are confirmed through a large number of experiments with UCS and the proposed architectural design. To overcome the problem of large population sizes, the idea of using a Neural Network to represent the action in UCS is proposed. This new system - called NLCS { was validated experimentally using a small fixed population size and has shown a large reduction in the population size needed to learn the underlying concept in the data. An adaptive version of NLCS called ANCS is then introduced. The adaptive version dynamically controls the population size of NLCS. A comprehensive analysis of the behaviour of ANCS revealed interesting patterns in the behaviour of the parameters, which motivated an ensemble version of the algorithm with 9 nodes, each using a different parameter setting. In total they cover all patterns of behaviour noticed in the system. A voting gate is used for the ensemble. The resultant ensemble does not require any parameter setting, and showed better performance on all datasets tested. The thesis concludes with testing the ANCS system in the architectural design for distributed environments proposed earlier. The contributions of the thesis are: (1) reducing the UCS population size by an order of magnitude using a neural representation; (2) introducing a mechanism for adapting the population size; (3) proposing an ensemble method that does not require parameter setting; and primarily (4) showing that the proposed LCS can work efficiently for distributed stream data mining tasks. Data mining Action map Classification Data stream Neural network Noisy data Non-stationary environment Reinforcement learning Rule-based system Static environment Stream data mining Supervised learning Distributed data mining Dynamic environment Ensemble learning Evolutionary computation Genetic algorithm Knowledge discovery Learning classifier system Negative correlation learning

Search results