Global ETD Search

31	Agrupamento de dados em fluxos contínuos com estimativa automática do número de grupos / Clustering data streams with automatic estimation of the number of cluster Silva, Jonathan de Andrade 04 March 2015 Data clustering techniques usually assume that the dataset has a fixed size and fits in memory. In this context, one challenge is to apply clustering techniques to datasets of unbounded size, with data generated continuously in dynamic environments. Data generated under these conditions give rise to what is conventionally called a data stream. In data stream applications, data access is restricted to a single read or a small number of accesses, under memory and processing-time constraints. Moreover, the distribution of the data generated by these sources may be non-stationary; that is, changes may occur over time, known as concept drift. Several data stream clustering techniques have accordingly been proposed in the literature. Many of them are based on the k-Means algorithm. One limitation of k-Means is that the number of clusters must be defined in advance. When the number of clusters is assumed to be unknown a priori and must be estimated from the data, searching the large space of possible solutions (both over the number of clusters, k, and over the possible partitions for a given k) makes clustering a challenging task, all the more so under the time and storage limits imposed by data stream applications. 
In this context, the main contributions of this thesis are: (i) adapting algorithms that have been used successfully in data stream applications where k is known to scenarios in which the number of clusters must be estimated; (ii) proposing new clustering algorithms that estimate k automatically from the data stream; (iii) systematically and quantitatively evaluating the proposed algorithms according to the specific characteristics of data stream scenarios. Fourteen data stream clustering algorithms capable of estimating the number of clusters from the data were developed. These algorithms were evaluated on six artificial datasets and two real-world datasets widely used in the literature. The developed algorithms can support several areas of data stream mining. The evolutionary algorithms developed showed the best cost-benefit trade-off between computational efficiency and the quality of the resulting partitions. / Several algorithms for clustering data streams based on k-Means have been proposed in the literature. However, most of them assume that the number of clusters, k, is known a priori by the user and can be kept fixed throughout the data analysis process. Besides the difficulty in choosing k, data stream clustering poses several challenges, such as handling non-stationary, unbounded data that arrives in an online fashion. In data stream applications, the data must be accessed in order and can be read only once or a small number of times. In this context, the main contributions of this thesis are: (i) adapting algorithms that have been used successfully in data stream applications where k is known so that they can estimate the number of clusters from data; (ii) proposing new clustering algorithms that estimate k automatically from the data stream; (iii) evaluating the proposed algorithms in different scenarios. 
Fourteen data stream clustering algorithms that can estimate the number of clusters from data were developed. They were evaluated on six artificial datasets and two real-world datasets widely used in the literature. The developed algorithms are useful for several data mining tasks. The developed evolutionary algorithms have shown the best trade-off between computational efficiency and data partition quality. Agrupamento de dados Algoritmos evolutivos Clustering Data stream Evolutionary algorithms Fluxo contínuo de dados
32	Metadata-Aware Query Processing over Data Streams Ding, Luping 22 April 2008 Many modern applications need to process queries over potentially infinite data streams to provide answers in real time. This dissertation proposes novel techniques to optimize CPU and memory utilization in stream processing by exploiting metadata on streaming data or queries. It focuses on four topics: 1) exploiting stream metadata to optimize SPJ query operators via operator configuration, 2) exploiting stream metadata to optimize SPJ query plans via query rewriting, 3) exploiting workload metadata to optimize parameterized queries via indexing, and 4) exploiting event constraints to optimize event stream processing via run-time early termination. The first part of this dissertation proposes algorithms for one of the most common and expensive query operators, namely join, that identify and purge data that is no longer needed from the operator state at runtime, based on punctuations. Exploiting punctuations in combination with commonly used window constraints is also studied. Extensive experimental evaluations demonstrate both reductions in memory usage and improvements in execution time due to the proposed strategies. The second part proposes herald-driven runtime query plan optimization techniques. We identify four query optimization techniques and design a lightweight algorithm to efficiently detect the optimization opportunities at runtime upon receiving heralds. We propose a novel execution paradigm to support multiple concurrent logical plans by maintaining one physical plan. An extensive experimental study confirms that our techniques significantly reduce query execution times. The third part deals with the shared execution of parameterized queries instantiated from a query template. We design a lightweight index mechanism to provide multiple access paths to data to facilitate a wide range of parameterized queries. 
To withstand workload fluctuations, we propose an index tuning framework to tune the index configurations in a timely manner. Extensive experimental evaluations demonstrate the effectiveness of the proposed strategies. The last part proposes event query optimization techniques by exploiting event constraints such as exclusiveness or ordering relationships among events extracted from workflows. Significant performance gains are shown to be achieved by our proposed constraint-aware event processing techniques. metadata constraint data stream continuous query optimization Querying (Computer science) Metadata Data processing
33	Optimizing the Dynamic Distribution of Data-stream for High Speed Communications Zhao, Z.W., Chen, I-Ming 01 1900 The performance of high-speed network communications frequently depends on how the data stream is distributed. In this paper, a dynamic data-stream balancing architecture based on link information is first introduced and discussed. Then, algorithms for rapidly and simultaneously acquiring the nodes and links traversed by a path between any source-destination pair are proposed, along with a dynamic data-stream distribution planning method. Related topics such as data fragment disposal and fair service are further studied and discussed. In addition, the performance and efficiency of the proposed algorithms, especially with respect to fair service and convergence, are evaluated through a demonstration in terms of the bandwidth utilization rate. The discussion presented here is intended to help application developers select an effective strategy for planning the distribution of data streams. / Singapore-MIT Alliance (SMA) Data-stream balancing Adjacency matrix Service metric Link information Non-directional graph fair service
34	Quality-of-Service-Aware Data Stream Processing Schmidt, Sven 21 March 2007 Data stream processing has gained increasing importance in both industry and academia in recent years. Consider the monitoring of industrial processes as an example. There, sensors are mounted to gather large amounts of data within a short time. Storing and post-processing these data may be useless or even impossible. On the one hand, only a small part of the monitored data is relevant. To use the storage capacity efficiently, only a preselection of the data should be considered. On the other hand, the volume of incoming data may be too high to be stored in time or, in other words, the technical effort required to store the data in time would be disproportionate. Processing data streams in the context of this thesis means applying database operations to the stream on the fly (without explicitly storing the data). The challenge for this task lies in the limited amount of resources, while data streams are potentially infinite. Furthermore, data stream processing must be fast and the results have to be disseminated as soon as possible. This thesis focuses on the latter issue. The goal is to provide a so-called Quality-of-Service (QoS) for the data stream processing task. To that end, adequate QoS metrics such as maximum output delay or minimum result data rate are defined. Thereafter, a cost model for deriving the required processing resources from the specified QoS is presented. On that basis, the stream processing operations are scheduled. Depending on the required QoS and on the available resources, the weight can be shifted among the individual resources and QoS metrics, respectively. Calculating and scheduling resources requires a lot of expert knowledge regarding the characteristics of the stream operations and regarding the incoming data streams. 
Often, this knowledge is based on experience and thus, a revision of the resource calculation and reservation becomes necessary from time to time. This leads to occasional interruptions of the continuous data stream processing, of the delivery of the result, and thus, of the negotiated Quality-of-Service. The proposed robustness concept supports the user and facilitates a decrease in the number of interruptions by providing more resources. data stream processing quality-of-service robustness Datenstromverarbeitung Qualität Robustheit ddc:004 rvk:ST 274 Datenstrom Datenverarbeitung Dienstgüte
35	Real-time Distributed Computation of Formal Concepts and Analytics De Alburquerque Melo, Cassio 19 July 2013 Advances in technology for the creation, storage and dissemination of data have dramatically increased the need for tools that effectively help users identify and understand relevant information. Although distributed frameworks such as Hadoop provide great computing opportunities, they have only increased that need. Formal Concept Analysis (FCA) may play an important role in this context by bringing more intelligent means to the analysis process. FCA provides an intuitive understanding of generalization and specialization relationships among objects and their attributes in a structure known as a concept lattice. The present thesis addresses the problem of mining and visualising concepts over a data stream. The proposed approach comprises several distributed components that carry out the computation of concepts from a basic transaction, filter and transform data, store it, and provide analytic features for exploring data visually. The novelty of our work consists of: (i) a distributed processing and analysis architecture for mining concepts in real time; (ii) the combination of FCA with visual analytics visualisation and exploration techniques, including association rules analytics; (iii) new algorithms for condensing and filtering conceptual data; and (iv) a system that implements all proposed techniques, called Cubix, and its use cases in Biology, Complex System Design and Space Applications. [SPI:OTHER] Engineering Sciences/Other Formal Concept Analysis Visual Analytics Data Stream Mining
36	Processing of continuous queries over infinite data streams Vossough, Ehsan. January 2004 Thesis (Ph.D.)--University of Wollongong, 2004. / Typescript. Includes bibliographical references: leaf 151-159.
37	An extended BIRCH-based clustering algorithm for large time-series datasets Lei, Jiahuan January 2017 Temporal data analysis and mining has attracted substantial interest due to the proliferation and ubiquity of time series in many fields. Time series clustering is one of the most popular mining methods, and many time series clustering algorithms primarily focus on detecting clusters in a batch fashion, which uses a lot of memory and thus limits scalability and capability for large time series. The BIRCH algorithm, which clusters data objects incrementally using a single scan, has been proven to scale well to large datasets. However, the Euclidean distance metric employed in BIRCH has been shown to be inaccurate for time series and degrades accuracy. To overcome this drawback, this work proposes an extended BIRCH algorithm for large time series. The BIRCH clustering algorithm is extended by changing the cluster feature vector to the proposed modified cluster feature, replacing the original Euclidean distance measure with dynamic time warping (DTW), and employing the DTW barycenter averaging (DBA) method for centroid computation, which is more suitable for time-series clustering than any other averaging method. To demonstrate the effectiveness of the proposed algorithm, we conducted an extensive evaluation of our algorithm against BIRCH, k-means and their variants with combinations of competitive distance measures. Experimental results show that the extended BIRCH algorithm improves accuracy significantly compared to the BIRCH algorithm and its variants, and achieves accuracy competitive with k-means and its variant, k-DBA. However, unlike k-means and k-DBA, the extended BIRCH algorithm retains the ability to handle continuously incoming data objects incrementally, which is the key to clustering large time-series datasets. 
Finally, the extended BIRCH-based algorithm is applied to a subsequence time-series clustering task on a simulated multivariate time-series dataset with the help of a sliding window. Time series Data stream Clustering BIRCH DTW DBA. Computer Engineering Datorteknik
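To illustrate the distance measure that entry 37 substitutes for Euclidean distance, here is a minimal sketch of textbook dynamic time warping between two univariate series. It is not the thesis's implementation: the function name `dtw_distance` and the absolute-difference local cost are assumptions made for this example, and the modified cluster feature and DBA averaging are not shown.

```python
import math


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance.

    Hypothetical helper for illustration; local cost is |a_i - b_j|.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Unlike Euclidean distance, this measure tolerates local stretching in time: a series and a copy of it with one repeated sample are at distance zero.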
38	Handling Tradeoffs between Performance and Query-Result Quality in Data Stream Processing Ji, Yuanzhen 27 March 2018 Data streams in the form of potentially unbounded sequences of tuples arise naturally in a large variety of domains including financial markets, sensor networks, social media, and network traffic management. The increasing number of applications that require processing data streams with high throughput and low latency has promoted the development of data stream processing systems (DSPS). A DSPS processes data streams with continuous queries, which are issued once and return query results to users continuously as new tuples arrive. For stream-based applications, both the query-execution performance (in terms of, e.g., throughput and end-to-end latency) and the quality of produced query results (in terms of, e.g., accuracy and completeness) are important. However, a DSPS often needs to make tradeoffs between these two requirements, either because of data imperfection within the streams or because of the limited computation capacity of the DSPS itself. Performance versus result-quality tradeoffs caused by data imperfection are inevitable, because the quality of the incoming data is beyond the control of a DSPS, whereas tradeoffs caused by system limitations can be alleviated, or even eliminated, by enhancing the DSPS itself. This dissertation seeks to advance the state of the art in handling the performance versus result-quality tradeoffs in data stream processing that arise from these two causes. For tradeoffs caused by data imperfection, this dissertation focuses on the typical data-imperfection problem of stream disorder and proposes the concept of quality-driven disorder handling (QDDH). QDDH enables a DSPS to make flexible and user-configurable tradeoffs between the end-to-end latency and the query-result quality when dealing with stream disorder. 
Moreover, compared to existing disorder handling approaches, QDDH can significantly reduce the end-to-end latency, and at the same time provide users with desired query-result quality. In this dissertation, a generic buffer-based QDDH framework and three instantiations of the generic framework for distinct query types are presented. For tradeoffs caused by system limitations, this dissertation proposes a system-enhancement approach that combines the row-oriented and the column-oriented data layout and processing techniques in data stream processing to improve the throughput. To fully exploit the potential of such hybrid execution of continuous queries, a static, cost-based query optimizer is introduced. The optimizer works at the operator level and takes the unique property of execution plans of continuous queries—feasibility—into account. Datenstromverarbeitung Data Stream Processing ddc:004 rvk:ST 234 rvk:ST 277 rvk:ST 265
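The latency versus result-quality tradeoff behind buffer-based disorder handling in entry 38 can be sketched with a generic fixed-slack reordering buffer. This is a common baseline technique and not the QDDH framework itself; the class name `KSlackBuffer`, its methods, and the fixed slack `k` are assumptions introduced for this example.

```python
import heapq
import itertools


class KSlackBuffer:
    """Generic fixed-slack reordering buffer (illustrative, not QDDH)."""

    def __init__(self, k):
        self.k = k                      # slack, in timestamp units
        self.max_ts = float("-inf")     # largest timestamp seen so far
        self._heap = []                 # min-heap ordered by timestamp
        self._seq = itertools.count()   # tie-breaker for equal timestamps

    def insert(self, ts, item):
        """Buffer a tuple and return the tuples that are now safe to emit."""
        heapq.heappush(self._heap, (ts, next(self._seq), item))
        self.max_ts = max(self.max_ts, ts)
        released = []
        # Emit everything at least k time units older than the newest tuple.
        while self._heap and self._heap[0][0] <= self.max_ts - self.k:
            t, _, it = heapq.heappop(self._heap)
            released.append((t, it))
        return released

    def flush(self):
        """Emit all remaining tuples in timestamp order."""
        out = []
        while self._heap:
            t, _, it = heapq.heappop(self._heap)
            out.append((t, it))
        return out
```

A larger `k` holds tuples longer, which raises end-to-end latency, but it lets more late arrivals be placed in order; a smaller `k` does the opposite. A quality-driven approach would adjust this tradeoff according to user requirements rather than fixing `k` in advance.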
39	Extraction and Energy Efficient Processing of Streaming Data García-Martín, Eva January 2017 The interest in machine learning algorithms is increasing, in parallel with the advancements in hardware and software required to mine large-scale datasets. Machine learning algorithms account for a significant amount of the energy consumed in data centers, which impacts global energy consumption. However, machine learning algorithms are typically optimized for predictive performance and scalability. Algorithms with low energy consumption are necessary for embedded systems and other resource-constrained devices, and desirable for platforms that require many computations, such as data centers. Data stream mining investigates how to process potentially infinite streams of data without the need to store all the data. This ability is particularly useful for companies that generate data at a high rate, such as social networks. This thesis investigates algorithms in the data stream mining domain from an energy efficiency perspective. The thesis comprises two parts. The first part explores how to extract and analyze data from Twitter, with a pilot study that investigates a correlation between hashtags and followers. The second and main part investigates how energy is consumed and optimized in an online learning algorithm suitable for data stream mining tasks. The second part of the thesis focuses on analyzing, understanding, and reformulating the Very Fast Decision Tree (VFDT) algorithm, the original Hoeffding tree algorithm, into an energy-efficient version. It presents three key contributions. First, it shows how energy varies in the VFDT from a high-level view by tuning different parameters. Second, it presents a methodology to identify energy bottlenecks in machine learning algorithms, by portraying the functions of the VFDT that consume the largest amount of energy. 
Third, it introduces dynamic parameter adaptation for Hoeffding trees, a method to dynamically adapt the parameters of Hoeffding trees to reduce their energy consumption. The results show an average energy reduction of 23% on the VFDT algorithm. / Scalable resource-efficient systems for big data analytics machine learning green computing data mining data stream mining green machine learning Computer Sciences Datavetenskap (datalogi)
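For readers unfamiliar with Hoeffding trees, the split test at the core of the VFDT algorithm named in entry 39 can be sketched as follows. This shows only the standard Hoeffding bound and split rule; the energy-related parameter adaptation described in the abstract is not reproduced, and the function names and the default tie threshold are assumptions chosen for this example.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range: range R of the split criterion
    delta:       allowed probability that the bound fails
    n:           number of examples observed at the leaf
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold=0.05):
    """Split when the best attribute is reliably better, or on a near tie.

    tie_threshold is an illustrative default, not a value from the thesis.
    """
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold
```

Because the bound shrinks as 1/sqrt(n), a leaf that sees four times as many examples needs only half the gap between the two best attributes before it splits.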
40	Intelligent Adaptation of Ensemble Size in Data Streams Using Online Bagging Olorunnimbe, Muhammed January 2015 In this era of the Internet of Things and Big Data, a proliferation of connected devices continuously produces massive amounts of fast-evolving streaming data. There is a need to study the relationships in such streams for analytic applications such as network intrusion detection, fraud detection and financial forecasting, among others. In this setting, it is crucial to create data mining algorithms that are able to adapt seamlessly to temporal changes in data characteristics that occur in data streams. These changes are called concept drifts. The models produced by such algorithms should not only be highly accurate and able to adapt swiftly to changes; the data mining techniques should also be fast, scalable, and efficient in terms of resource allocation. It then becomes important to consider issues such as storage space needs and memory utilization. This is especially relevant when we aim to build personalized, near-instant models in a Big Data setting. This research work focuses on mining a data stream with concept drift, using an online bagging method, with consideration of memory utilization. Our aim is to take an adaptive approach to resource allocation during the mining process. Specifically, we consider metalearning, in which the models of multiple classifiers are combined into an ensemble and which has been very successful in building accurate models on data streams. However, little work has been done to explore the interplay between accuracy, efficiency and utility. This research focuses on this issue. We introduce an adaptive metalearning algorithm that takes advantage of the memory utilization cost of concept drift in order to vary the ensemble size during the data mining process. We aim to minimize memory usage while maintaining highly accurate models with a high utility. 
We evaluated our method on a number of benchmark datasets and compared our results against the state of the art. Return on Investment (ROI) was used to evaluate the gain in performance in terms of accuracy, in contrast to the time and memory invested. We aimed to achieve high ROI without compromising on the accuracy of the result. Our experimental results indicate that we achieved this goal. Data stream Concept drift Metalearning Cost sensitive adaptation ROI Utility Adaptive ensemble size Online bagging
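The online bagging method that entry 40 builds on can be sketched as below: each incoming example is shown to every ensemble member a Poisson(1)-distributed number of times, which approximates bootstrap resampling on a stream. This is a sketch of the standard technique only; the adaptive ensemble sizing proposed in the thesis is not shown, and `MajorityClass`, `poisson1` and `OnlineBagging` are hypothetical names, with the trivial base learner standing in for a real incremental classifier.

```python
import math
import random
from collections import Counter


class MajorityClass:
    """Trivial incremental base learner (placeholder for a real classifier)."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


def poisson1(rng):
    """Draw from Poisson(lambda=1) with Knuth's multiplication method."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


class OnlineBagging:
    """Online bagging with a fixed ensemble size (illustrative sketch)."""

    def __init__(self, n_models, seed=0):
        self.rng = random.Random(seed)
        self.models = [MajorityClass() for _ in range(n_models)]

    def learn_one(self, x, y):
        for model in self.models:
            # Each member sees the example k ~ Poisson(1) times (possibly zero).
            for _ in range(poisson1(self.rng)):
                model.learn_one(x, y)

    def predict_one(self, x):
        votes = Counter(m.predict_one(x) for m in self.models)
        votes.pop(None, None)  # ignore members that have seen no data yet
        return votes.most_common(1)[0][0] if votes else None
```

Memory use grows with the number of ensemble members, which is why varying the ensemble size during mining, as the abstract describes, trades memory against accuracy.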
