71

Optimization for big joins and recursive query evaluation using intersection and difference filters in MapReduce / Utilisation de filtres d’intersection et de différence pour l’optimisation des jointures à grande échelle et l’exécution de requêtes récursives à l’aide MapReduce

Phan, Thuong-Cang 07 July 2014 (has links)
The information technology community has created an unprecedented amount of data through large-scale applications. As a result, Big Data is considered a gold mine of information that merely waits for processing power that is available, reliable, and apt at evaluating complex analytic algorithms. MapReduce is one of the most popular programming models designed to support such processing; it has become a standard for processing, analyzing and generating large data in a massively parallel manner. However, the MapReduce programming model suffers from severe limitations for operations beyond simple scans and groupings, particularly operations with multiple inputs. In this dissertation we investigate and optimize the evaluation, in a MapReduce environment, of one of the most salient and representative of these operations: the join. The work covers not only two-way joins but also complex joins such as multi-way joins and recursive joins. To achieve these objectives, we first devise a new type of filter, called an intersection filter, which uses a probabilistic model to represent an approximation of a set intersection. The intersection filter is then applied to two-way join operations to eliminate most non-joining elements from the input datasets before the data is sent to the actual join processing. In addition, we extend the intersection filter to improve the performance of three-way joins and chain joins, including cyclic chain joins with many shared join keys. We use the Lagrangian multiplier method to make a sound choice among our optimized solutions for multi-way joins. Another important proposal is the difference filter, a probabilistic data structure designed to represent a set and test elements disjoint from it. It can be applied to a wide range of common problems such as reconciliation, deduplication and error correction, and in particular to the recursive join operation. A recursive join using the difference filter is implemented as an iteration of a single join job instead of two jobs (a join job and a difference job). This improvement halves the number of executed jobs and the related overheads such as data rescanning, intermediate data, and communication for the deduplication and difference operations. In addition, this research improves the general semi-naive algorithm and, consequently, the evaluation of recursive queries in MapReduce. We then provide general cost models for two-way, multi-way, and recursive joins. Thanks to these cost models, we can compare the most representative join algorithms. As a result, using the proposed filters, the join operations can minimize disk I/O and communication costs. Moreover, experimental evaluations demonstrate that the intersection filter-based join operations are more efficient than existing solutions; different join algorithms are compared with respect to the amount of intermediate data, the total output amount, the total execution time, and especially task timelines. Finally, our improvements to the join operations contribute to the broader goal of optimizing data management for MapReduce applications on large-scale distributed infrastructures.
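To make the intersection-filter idea concrete, here is a minimal, hedged sketch in plain Python: two Bloom filters approximate the key sets of the two inputs, and each side is pruned against the other side's filter before the join. The filter design, sizing and MapReduce integration of the thesis are not reproduced here; all names and parameters are illustrative.

```python
# Toy Bloom-filter-based "intersection filter" used to drop probably-non-joining
# tuples before a two-way join. Illustrative sketch only, not the thesis' design.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def filtered_join(r_tuples, s_tuples, r_key, s_key):
    """Two-way join that first prunes tuples whose key cannot appear on both sides."""
    bf_r, bf_s = BloomFilter(), BloomFilter()
    for t in r_tuples:
        bf_r.add(r_key(t))
    for t in s_tuples:
        bf_s.add(s_key(t))
    # Approximate intersection filter: keep only keys (probably) present in both inputs.
    r_kept = [t for t in r_tuples if bf_s.might_contain(r_key(t))]
    s_kept = [t for t in s_tuples if bf_r.might_contain(s_key(t))]
    index = {}
    for t in r_kept:
        index.setdefault(r_key(t), []).append(t)
    return [(a, b) for b in s_kept for a in index.get(s_key(b), [])]
```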
72

Processamento de consultas SOLAP drill-across e com junção espacial em data warehouses geográficos / Processing of drill-across and spatial join SOLAP queries over geographic data warehouses

Brito, Jaqueline Joice 28 November 2012 (has links)
A geographic data warehouse (GDW) is a special kind of multidimensional database. It is subject-oriented, integrated, historical, non-volatile and usually organized in levels of aggregation. Furthermore, a GDW also stores spatial data in one or more dimensions or in at least one numerical measure. Aiming at decision support, GDWs allow SOLAP (spatial online analytical processing) queries, i.e., multidimensional analytical queries (e.g., drill-down, roll-up, drill-across) extended with spatial predicates (e.g., intersects, contains, is contained) defined for range and spatial join queries. A challenging issue in the processing of these complex queries is how to retrieve spatial and conventional data stored in huge GDWs efficiently. In the literature, there are few access methods dedicated to indexing GDWs, and none of these methods focus on drill-across and spatial join SOLAP queries. In this master's thesis, we propose novel strategies for processing these complex queries. 
We introduce two strategies for processing SOLAP drill-across queries (namely, Divide and Unique), define a set of guidelines for the design of a GDW schema that enables the execution of these queries, and determine a set of classes of these queries to be issued over a GDW schema that follows the proposed guidelines. For the processing of spatial join SOLAP queries, we propose the SJB strategy, identify the characteristics of a GDW schema that enables the execution of these queries, and define the format of these queries. We validated the proposed strategies through performance tests that compared them with the star join computation and the use of materialized views. The results showed that our strategies are very efficient. For SOLAP drill-across queries, the Divide and Unique strategies showed a time reduction ranging from 82.7% to 98.6% with respect to the star join computation and the use of materialized views. For SOLAP spatial join queries, the SJB strategy guaranteed the best results for most of the analyzed queries, with a performance gain ranging from 0.3% to 99.2% over the star join computation and the use of materialized views.
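The following toy Python fragment only illustrates the shape of a spatial join SOLAP query (a fact table joined to a spatial dimension through a spatial predicate, then aggregated); it is not the SJB strategy from the thesis, and the tables, columns and query region are invented.

```python
# Minimal sketch of a spatial-join SOLAP aggregation over a tiny, hypothetical GDW.
from shapely.geometry import Point, box

fact = [  # (city_id, revenue)
    (1, 120.0), (2, 75.5), (1, 60.0), (3, 10.0),
]
spatial_dim = {  # city_id -> point geometry of the city
    1: Point(2.0, 2.0), 2: Point(8.0, 8.0), 3: Point(2.5, 1.5),
}

def spatial_join_rollup(query_region):
    """Sum revenue over fact rows whose city intersects the query region."""
    joined_ids = {cid for cid, geom in spatial_dim.items() if geom.intersects(query_region)}
    return sum(rev for cid, rev in fact if cid in joined_ids)

print(spatial_join_rollup(box(0.0, 0.0, 5.0, 5.0)))  # cities 1 and 3 qualify -> 190.0
```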
73

Data Warehouses na era do Big Data: processamento eficiente de Junções Estrela no Hadoop / Data Warehouses in the Big Data era: efficient processing of Star Joins in Hadoop

Brito, Jaqueline Joice 12 December 2017 (has links)
The era of Big Data is here: the combination of unprecedented amounts of data collected every day with the promotion of open-source solutions for massively parallel processing has shifted the industry in the direction of data-driven solutions. From recommendation systems that help you find your next significant other to the dawn of self-driving cars, Cloud Computing has enabled companies of all sizes and areas to achieve their full potential with minimal overhead. In particular, the use of these technologies for Data Warehousing applications has greatly decreased costs and provided remarkable scalability, empowering business-oriented applications such as Online Analytical Processing (OLAP). One of the most essential primitives in Data Warehouses is the Star Join, i.e., the join of a central fact table with satellite dimension tables. As the volume of the database scales, Star Joins become impractical and may seriously limit applications. In this thesis, we propose specialized solutions to optimize the processing of Star Joins. To achieve this, we used the Hadoop software family on a cluster of 21 nodes. We show that the primary bottleneck in the computation of Star Joins on Hadoop lies in excessive disk spill and in the overhead due to network communication. To mitigate these negative effects, we propose two solutions based on a combination of the Spark framework with either Bloom filters or the broadcast technique, which reduced the computation time by at least 38%. Furthermore, we show that the use of full scan may significantly hinder the performance of queries with low selectivity. Thus, we propose a distributed Bitmap Join Index that can be processed as a loosely bound secondary index and used with random access in the Hadoop Distributed File System (HDFS). We also implemented three versions (one in MapReduce and two in Spark) of our processing algorithm that uses the distributed index, which reduced the total computation time by up to 88% for Star Joins with low selectivity from the Star Schema Benchmark (SSB). Because, ideally, the system should be able to perform both random access and full scan, our solution was designed to rely on a two-layer architecture that is framework-agnostic and enables the use of a query optimizer to select which approach should be used as a function of the query. Due to the ubiquity of joins as primitive queries, our solutions are likely to fit a broad range of applications. Our contributions not only leverage the strengths of massively parallel frameworks but also exploit more efficient access methods to provide scalable and robust solutions to Star Joins with a significant drop in total computation time.
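As a rough illustration of the broadcast approach mentioned above (and not a reproduction of the thesis' code), a small PySpark sketch of a Star Join that broadcasts the dimension tables to avoid shuffling the fact table might look like this; table and column names are invented.

```python
# Hedged sketch: broadcast-style star join in PySpark over a toy star schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col, sum as sum_

spark = SparkSession.builder.appName("star-join-sketch").getOrCreate()

fact = spark.createDataFrame(
    [(1, 10, 100.0), (2, 20, 50.0), (1, 20, 30.0)],
    ["cust_key", "part_key", "revenue"],
)
dim_customer = spark.createDataFrame([(1, "BRAZIL"), (2, "FRANCE")], ["cust_key", "nation"])
dim_part = spark.createDataFrame([(10, "steel"), (20, "copper")], ["part_key", "material"])

# Broadcasting the (small) dimension tables avoids shuffling the large fact table,
# one way to reduce the network and disk-spill overhead identified as the bottleneck.
result = (
    fact.join(broadcast(dim_customer), "cust_key")
        .join(broadcast(dim_part), "part_key")
        .where(col("nation") == "BRAZIL")
        .groupBy("material")
        .agg(sum_("revenue").alias("total_revenue"))
)
result.show()
```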
74

State Spill Policies for State Intensive Continuous Query Plan Evaluation

Jbantova, Mariana G 02 May 2007 (has links)
The needs of modern applications such as network monitoring systems, telecommunications data management, web applications, and remote medical monitoring for near-real-time results over continuous data streams have spurred the development of new data management systems called Data Stream Management Systems (DSMSs). Unlike traditional database systems, which answer one-time user queries only after the finite data has been captured on disk, DSMSs provide on-the-fly answers to user queries as data arrives, at varying rates, in the form of continuous, potentially infinite streams of tuples. To meet the timeliness requirements of applications, DSMSs aim to keep all data in main memory, so queries with multiple stateful operators pose a major strain on memory. Existing adaptation techniques designed to address this issue are ineffective when faced with continuous bursts of high data rates. When system load exceeds system capacity, a DSMS has three options: 1) discard some new data; 2) crash; or 3) spill data to disk. Only the third option allows it to produce delayed, yet accurate and complete, query results. However, this option involves disk-access overhead and changes the natural order of tuples flowing through the query plan tree. As not all stream operators can correctly process out-of-order tuples, data spilling may degrade the quality of the final results. Moreover, since operators in a query plan are interconnected, changes in the order of tuple flows inevitably affect the execution stages of downstream operators, such as data purging. Data purging is necessary for processing continuous queries composed of stateful operators: the state of such operators is divided into finite non-overlapping sets of tuples called windows, and after all the tuples of a window have been processed and all results output, those tuples can be discarded to free memory for new data. To address these issues, we first redesigned the state structure of continuous operators into smaller, finite, non-overlapping sets of tuples, such as partitioned window groups, which incur less disk-access overhead. Second, we enable continuous operators to correctly process out-of-order tuples using punctuation pointers. Third, we design methods for downstream operators to synchronize their processing stages with those of upstream operators to optimize query plan throughput. Putting these techniques together, we designed a consolidated spilling adaptation strategy that considers all aspects of operator interconnections in a query plan when making adaptation decisions. The effectiveness of our integrated approach was empirically tested in a comparative evaluation against several alternative spilling adaptation strategies. We conducted our experiments on CAPE, a DSMS developed at WPI, using different types of query plans composed of multiple partitioned window join operators. Our experiments show that, despite the higher overhead of a more synchronized adaptation approach, our consolidated strategy provides better query plan performance and higher plan throughput during periods of continuous bursts of high data rates.
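A hedged, much-simplified Python sketch of the state-spilling idea described above (not the CAPE implementation): join state is organized as small per-(window, partition) groups, the largest group is pushed to disk when memory pressure grows, and spilled chunks are read back only when their window can be finalized. All class and method names are invented.

```python
# Toy partitioned-window state store with spill-to-disk under memory pressure.
import os, pickle, tempfile
from collections import defaultdict

class PartitionedWindowState:
    def __init__(self, max_in_memory=1000):
        self.groups = defaultdict(list)       # (window_id, partition) -> in-memory tuples
        self.spill_files = defaultdict(list)  # (window_id, partition) -> spilled chunk paths
        self.in_memory = 0
        self.max_in_memory = max_in_memory
        self.spill_dir = tempfile.mkdtemp(prefix="state_spill_")

    def insert(self, window_id, partition, tup):
        self.groups[(window_id, partition)].append(tup)
        self.in_memory += 1
        if self.in_memory > self.max_in_memory:
            self._spill_largest_group()

    def _spill_largest_group(self):
        # Spill the biggest group as one sequential write; smaller groups stay in memory.
        key = max(self.groups, key=lambda k: len(self.groups[k]))
        path = os.path.join(self.spill_dir, f"{key[0]}_{key[1]}_{len(self.spill_files[key])}.pkl")
        with open(path, "wb") as f:
            pickle.dump(self.groups[key], f)
        self.spill_files[key].append(path)
        self.in_memory -= len(self.groups[key])
        self.groups[key] = []

    def close_window(self, window_id, partition):
        """All tuples of the window have arrived: merge spilled chunks with memory, free state."""
        key = (window_id, partition)
        in_mem = self.groups.pop(key, [])
        self.in_memory -= len(in_mem)
        tuples = list(in_mem)
        for path in self.spill_files.pop(key, []):
            with open(path, "rb") as f:
                tuples.extend(pickle.load(f))
            os.remove(path)
        return tuples
```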
75

DistJoin: plataforma de processamento distribuído de operações de junção espacial com bases de dados dinâmicas / DistJoin: platform for distributed processing of spatial join operations with dynamic datasets

Oliveira, Sávio Salvarino Teles de 28 June 2013 (has links)
Geographic Information Systems (GIS) have received increasing attention in research institutes and industry in recent years. A Spatial Database Management System (SDBMS) is one of the main components of a GIS, and the spatial join is one of the most important operations in an SDBMS. A spatial join relates two datasets, combining their geometries according to some spatial predicate, such as intersection. Due to the increasing availability of spatial data, the growing number of GIS users, and the high cost of processing spatial operations, distributed SDBMSs have been proposed as a good option for efficiently processing spatial joins on a cluster. This distributed processing brings challenges, such as the distribution of data across the cluster and the parallel and distributed processing of the spatial join itself. This work presents a platform for parallel and distributed processing of spatial joins in a cluster using data distribution techniques for dynamic datasets. Studies in the literature have explored data distribution techniques for static datasets, where any update requires redistributing the data. This becomes unfeasible for large datasets with frequent updates. Therefore, this work proposes two new data distribution techniques for dynamic datasets: Proximity Area and Grid Proximity Area. These techniques were evaluated to determine the scenarios for which each is more appropriate; for this purpose, they were tested in a real environment using datasets with different characteristics, making it possible to assess the distributed spatial join operation in diverse, realistic scenarios with each technique.
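For illustration only, the sketch below shows a generic grid-based data distribution for a distributed spatial join: each geometry is replicated to every grid cell its bounding box overlaps, so the join for a cell can be computed locally by the node owning that cell. It is not the Proximity Area or Grid Proximity Area technique proposed in the dissertation; the cell size and datasets are made up.

```python
# Generic grid partitioning for a distributed spatial join (illustrative sketch).
from shapely.geometry import box

GRID_CELL = 10.0  # hypothetical cell width/height

def cells_for(geom):
    minx, miny, maxx, maxy = geom.bounds
    x0, y0 = int(minx // GRID_CELL), int(miny // GRID_CELL)
    x1, y1 = int(maxx // GRID_CELL), int(maxy // GRID_CELL)
    return {(cx, cy) for cx in range(x0, x1 + 1) for cy in range(y0, y1 + 1)}

def partition(dataset):
    """dataset: iterable of (id, geometry) -> dict mapping cell -> list of (id, geometry)."""
    cells = {}
    for gid, geom in dataset:
        for cell in cells_for(geom):
            cells.setdefault(cell, []).append((gid, geom))
    return cells

def grid_spatial_join(ds_a, ds_b):
    part_a, part_b = partition(ds_a), partition(ds_b)
    seen, result = set(), []
    for cell, items_a in part_a.items():          # in a cluster, each node handles its cells
        for ida, ga in items_a:
            for idb, gb in part_b.get(cell, []):
                if (ida, idb) not in seen and ga.intersects(gb):
                    seen.add((ida, idb))          # de-duplicate pairs replicated across cells
                    result.append((ida, idb))
    return result

pairs = grid_spatial_join(
    [(1, box(0, 0, 5, 5)), (2, box(12, 12, 18, 18))],
    [(10, box(4, 4, 6, 6)), (20, box(30, 30, 31, 31))],
)
print(pairs)  # [(1, 10)]
```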
76

Processing Exact Results for Queries over Data Streams

Chakraborty, Abhirup 23 February 2010 (has links)
In a growing number of information-processing applications, such as network-traffic monitoring, sensor networks, financial analysis, and data mining for e-commerce, data takes the form of continuous data streams rather than traditional stored databases of relational tuples. These applications share common features such as the need for real-time analysis, huge volumes of data, and unpredictable, bursty arrivals of stream elements. In all of these applications, it is infeasible to process queries over data streams by loading the data into a traditional database management system (DBMS) or into main memory; such an approach does not scale with high stream rates. As a consequence, systems that can manage streaming data have gained tremendous importance. The need to process a large number of continuous queries over bursty, high-volume online data streams, potentially in real time, makes it imperative to design algorithms that use limited resources. This dissertation focuses on producing exact results for join queries over high-speed data streams using limited resources, and proposes several novel techniques for processing join queries that incorporate secondary storage and non-dedicated computers. Existing approaches for stream joins either (a) deal with memory limitations by shedding load, and therefore cannot produce exact or highly accurate results for stream joins over data streams with time-varying arrivals of stream tuples, or (b) suffer from large I/O overheads due to random disk accesses. The proposed techniques exploit the high bandwidth of the disk subsystem by rendering the data access pattern largely sequential, eliminating small, random disk accesses. This dissertation proposes an I/O-efficient algorithm to process hybrid join queries, which join a fast, time-varying or bursty data stream with a persistent disk relation. Such a hybrid join is the crux of a number of common transformations in an active data warehouse. Experimental results demonstrate that the proposed scheme reduces the response time by exploiting spatio-temporal locality within the input stream, and minimizes disk overhead through disk-I/O amortization. The dissertation also proposes an algorithm to parallelize a stream join operator over a shared-nothing system. The proposed algorithm distributes the processing load across a number of independent, non-dedicated nodes based on a fixed or predefined communication pattern; dynamically maintains the degree of declustering in order to minimize communication and processing overheads; and provides mechanisms for reducing storage and communication overheads while scaling over a large number of nodes. We present experimental results showing the efficacy of the proposed algorithms.
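The core amortization idea (buffer stream tuples, then serve the whole batch with a single sequential pass over the disk relation) can be sketched in a few lines of Python; this is an illustrative simplification, not the dissertation's algorithm, and the CSV layout and names are assumptions.

```python
# Sketch of a hybrid join between a bursty stream and a persistent disk relation,
# amortizing one sequential disk scan over a whole batch of buffered stream tuples.
import csv

class HybridJoin:
    def __init__(self, disk_relation_path, batch_size=10_000):
        self.path = disk_relation_path
        self.batch_size = batch_size
        self.buffer = {}   # join_key -> list of buffered stream tuples

    def on_stream_tuple(self, key, tup):
        self.buffer.setdefault(key, []).append(tup)
        if sum(len(v) for v in self.buffer.values()) >= self.batch_size:
            yield from self._probe_disk()

    def _probe_disk(self):
        # One sequential scan of the disk relation serves the whole buffered batch.
        with open(self.path, newline="") as f:
            for row in csv.reader(f):            # assumed layout: key, attr1, attr2, ...
                for stream_tup in self.buffer.get(row[0], []):
                    yield stream_tup, row
        self.buffer.clear()

    def flush(self):
        yield from self._probe_disk()

# Usage: results must be consumed so the batch probe actually runs, e.g.
#   hj = HybridJoin("customers.csv")
#   for match in hj.on_stream_tuple("42", ("42", "click")):
#       print(match)
```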
78

The effect of Uplink and Downlink relationship within Multi-level Marketing Sales

Tsai, Chin-Chang 10 November 2000 (has links)
In this research we focus on multi-level marketing (MLM) companies in Taiwan, the interrelationships among their direct sellers, and how the sellers' own characteristics affect the business. Our goal is to understand the uplink-downlink relationship in order to improve relationships within direct sales companies, encouraging uplinks to help their downlinks reach a stage of fast development and grow the business. In an MLM system, direct sellers have no formal employment contract with the MLM company and each marketing unit acts independently; at the same time, every seller relies on other units for support, and together they make up the whole MLM system. The relationship between uplink and downlink is therefore the foundation and source of strength of the MLM system. We find that a seller's decision to join an MLM system is based on the product's demand market more than on the open or potential market, so sellers need to take the demand market into account: a product with a large demand market makes it easier to expand an MLM organization. For newly joined sellers, the first downlinks are usually classmates and friends; in the long run this market is not enough, and sellers must reach out to the market of strangers, continuously building relationships with all kinds of people in order to keep expanding. After joining the MLM system, a seller's interaction with the uplink passes through four states: initial, enhancement, stable, and independent. Each state affects how the uplink-downlink relationship is built, maintained, and expanded, and different tactics are needed in each state to achieve successful direct selling. In each state, the uplink needs to consider how to lead the downlink into the system and make them part of it. Long-term relationships between sellers and the direct sales system are based on trust and mutual benefit. Because the uplink-downlink relationship forms the base strength of the whole system, strengthening it is the key to a successful direct sales market. When leading downlinks, sellers should use admiration and inspiration so that downlinks learn, grow, become independent, and move down the road to success. Direct selling relies mainly on "duplication": uplinks pass successful practices on to downlinks, and the strength of the system is built on this interaction between uplinks and downlinks. This is the true strength behind MLM.
79

Measurement and resource allocation problems in data streaming systems

Zhao, Haiquan 26 April 2010 (has links)
In a data streaming system, each component consumes one or several streams of data on the fly and produces one or several streams of data for other components. The entire Internet can be viewed as a giant data streaming system. Other examples include real-time exploratory data mining and high-performance transaction processing. In this thesis we study several measurement and resource allocation optimization problems in data streaming systems. Measuring quantities associated with one or several data streams is often challenging because the sheer volume of data makes it impractical to store the streams in memory or ship them across the network. A data streaming algorithm processes a long stream of data in one pass using a small working memory (called a sketch); estimation queries can then be answered from one or more such sketches. An important task is to analyze the performance guarantees of such algorithms. In this thesis we describe a tail-bound problem that often occurs and present a technique for solving it using majorization and convex ordering theories. We present two algorithms that utilize our technique: the first stores a large array of counters in DRAM while achieving the update speed of SRAM, and the second detects global icebergs across distributed data streams. Resource allocation decisions are also important for the performance of a data streaming system. The processing graph of a data streaming system forms a fork-and-join network, and the underlying data processing tasks involve a rich set of semantics that include synchronous and asynchronous data fork and data join. The different types of semantics and processing requirements introduce complex interdependence between the various data streams within the network. We study the distributed resource allocation problem in such systems with the goal of maximizing the total utility of the output streams. For networks with only synchronous fork and join semantics, we present several decentralized iterative algorithms using primal- and dual-based optimization techniques. For general networks with both synchronous and asynchronous fork and join semantics, we present a novel modeling framework to formulate the resource allocation problem, together with a shadow-queue-based decentralized iterative algorithm to solve it. We show that all the algorithms guarantee optimality and demonstrate through simulation that they can adapt quickly to dynamically changing environments.
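To give a feel for what a small-memory "sketch" for stream measurement looks like, here is a generic count-min sketch in Python. It only illustrates the idea of answering estimation queries from a compact one-pass summary; the counter-array and iceberg-detection algorithms analyzed in the thesis are different and come with their own tail-bound guarantees.

```python
# Generic count-min sketch: approximate per-item counts from a small counter table.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Overestimates with bounded probability; never underestimates.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

cms = CountMinSketch()
for flow in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    cms.add(flow)
print(cms.estimate("10.0.0.1"))  # 2 (possibly more, never less)
```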
