11

Algorithmic Engineering Towards More Efficient Key-Value Systems

Fan, Bin 18 December 2013 (has links)
Distributed key-value systems have been widely used as elemental components of many Internet-scale services at sites such as Amazon, Facebook and Twitter. This thesis examines a system design approach to scale existing key-value systems, both horizontally and vertically, by carefully engineering and integrating techniques that are grounded in recent theory but also informed by underlying architectures and expected workloads in practice. As a case study, we re-design FAWN-KV—a distributed key-value cluster consisting of “wimpy” key-value nodes—to use less memory but achieve higher throughput even in the worst case. First, to improve the worst-case throughput of a FAWN-KV system, we propose a randomized load balancing scheme that can fully utilize all the nodes regardless of their query distribution. We analytically prove and empirically demonstrate that deploying a very small but extremely fast load balancer in FAWN-KV can effectively prevent uneven or dynamic workloads from creating hotspots on individual nodes. Moreover, our analysis provides service designers with a mathematically tractable approach to estimate the worst-case throughput and avoid drastic overprovisioning in similar distributed key-value systems. Second, to implement the high-speed load balancer and to improve the space efficiency of individual key-value nodes, we propose novel data structures and algorithms, including the cuckoo filter, a Bloom filter replacement that is high-speed, highly compact and supports deletion, and optimistic cuckoo hashing, a fast and space-efficient hashing scheme that scales on multiple CPUs. Both algorithms are built upon conventional cuckoo hashing but are optimized for our target architectures and workloads. Using them as building blocks, we design and implement MemC3 to serve transient data from DRAM with high throughput and low-latency retrievals, and SILT to provide cost-effective access to persistent data on flash storage with an extremely small memory footprint (e.g., 0.7 bytes per entry).
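As an illustration of the kind of structure the abstract refers to, the following is a minimal Python sketch of a cuckoo filter based on partial-key cuckoo hashing; the bucket size, fingerprint width, hash function and eviction limit are illustrative assumptions, not the parameters of the thesis or of MemC3/SILT.

```python
import random
from hashlib import blake2b

class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=8, max_kicks=500):
        # num_buckets should be a power of two so the XOR partner trick stays in range.
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.fp_mask = (1 << fp_bits) - 1
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash(self, data: bytes) -> int:
        return int.from_bytes(blake2b(data, digest_size=8).digest(), "big")

    def _fingerprint(self, item: str) -> int:
        # Reserve 0 so an empty slot can never be confused with a stored fingerprint.
        return (self._hash(item.encode()) & self.fp_mask) or 1

    def _indices(self, item: str, fp: int):
        i1 = self._hash(item.encode()) % self.num_buckets
        # Partial-key cuckoo hashing: the alternate bucket depends only on the
        # fingerprint, so it can be recomputed when a fingerprint is evicted.
        i2 = (i1 ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets
        return i1, i2

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets are full: evict fingerprints until one fits.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            victim = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i = (i ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # the filter is considered full

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        for i in self._indices(item, fp):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("user:42")
assert cf.contains("user:42")   # false positives are possible, false negatives are not
cf.delete("user:42")            # deletion is supported, unlike a plain Bloom filter
```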
12

A Statistically Rigorous Evaluation of the Cascade Bloom Filter for Distributed Access Enforcement in Role-Based Access Control (RBAC) Systems

Zitouni, Toufik January 2010 (has links)
We consider the distributed access enforcement problem for Role-Based Access Control (RBAC) systems. Such enforcement has become important with RBAC’s increasing adoption and the proliferation of data that needs to be protected. Our particular interest is in the evaluation of a new data structure that has recently been proposed for enforcement: the Cascade Bloom Filter. The Cascade Bloom Filter is an extension of the Bloom filter and provides time- and space-efficient encodings of sets. We compare the Cascade Bloom Filter to the Bloom filter and to another approach, called Authorization Recycling, that has been proposed for distributed access enforcement in RBAC. One of the challenges we address is the lack of a benchmark: we propose and justify a benchmark for the assessment. We also adopt a statistically rigorous approach to empirical assessment from recent work. We present our results for time- and space-efficiency based on our benchmark, and demonstrate that, of the three data structures we consider, the Cascade Bloom Filter scales best with the number of RBAC sessions in terms of both time- and space-efficiency.
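For readers unfamiliar with the data structure, the following is a small Python sketch of a Bloom filter cascade over a closed universe of authorized and denied requests; the filter sizes and the simple Bloom filter used as a building block are assumptions for illustration, not the construction evaluated in the thesis.

```python
from hashlib import sha256

class BloomFilter:
    def __init__(self, m=4096, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: str):
        for i in range(self.k):
            digest = sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(item))

def build_cascade(positives, negatives):
    """Each level stores the previous level's false positives, alternating between the two sets."""
    levels, include, exclude = [], set(positives), set(negatives)
    while include:
        bf = BloomFilter()
        for item in include:
            bf.add(item)
        levels.append(bf)
        # Items of the other set that this level wrongly accepts form the next level.
        include, exclude = {x for x in exclude if x in bf}, include
    return levels

def cascade_contains(levels, item) -> bool:
    # Exact for queries drawn from positives | negatives (the closed universe of requests).
    for depth, bf in enumerate(levels):
        if item not in bf:
            # Absent at an even depth (0-based) means denied, at an odd depth means authorized.
            return depth % 2 == 1
    return len(levels) % 2 == 1   # survived every level: the parity of the last level decides

authorized = {f"alice:read:file{i}" for i in range(100)}
denied = {f"bob:write:file{i}" for i in range(100)}
cascade = build_cascade(authorized, denied)
assert cascade_contains(cascade, "alice:read:file7")
assert not cascade_contains(cascade, "bob:write:file7")
```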
13

Comparaison de novo de données de séquençage issues de très grands échantillons métagénomiques : application sur le projet Tara Oceans / De novo comparison of huge metagenomic experiments from NGS technologies: application to the Tara Oceans project

Maillet, Nicolas 19 December 2013 (has links)
Metagenomics aims to study the genetic and genomic content of a sample taken from a natural environment. This recent discipline studies the genomes of different organisms living in the same environment, and raises new questions from both a biological and a computational point of view. The volumes of data generated by metagenomic studies and the complexity of the environments under study call for new dedicated data structures and algorithms. Among the existing approaches in metagenomics, comparative metagenomics consists in comparing several metagenomes in order to measure their degrees of similarity. When this comparison relies solely on the raw content of the samples, without any external knowledge, it is called de novo comparative metagenomics. The goal of this work is to develop a method for extracting the similar sequences of two metagenomic datasets, where each dataset may contain hundreds of millions of short sequences. The proposed comparison identifies the sequences of a first dataset that are similar to at least one sequence of a second dataset. To be fast and memory-efficient, the implementation of our method required the design of a new indexing structure based on the Bloom filter. The resulting software, Compareads, has a low memory footprint (on the order of a few gigabytes) and can compute the intersection of two samples of 100 million sequences each in about ten hours. Our method is a heuristic that produces a low rate of false positives. Compareads is dedicated to the analysis of large metagenomic datasets and is currently the only tool able to compare datasets of this size. It has been applied to several metagenomic projects and produces robust, biologically meaningful results that agree with fundamentally different methods. It is used intensively on samples from the Tara Oceans expedition, where it has shown that large oceanic systems influence the global distribution of marine microorganisms. / Metagenomics studies the overall genomic information of multiple organisms coming from the same biotope. The information is generally provided by next-generation sequencing technologies (NGS). Typical data are samples of short reads (i.e., reads of a few hundred base pairs). To study such metagenomic information, we developed an original method for extracting the similarities between two samples of reads. More precisely, this approach locates the set of common reads present in two samples. In order to fit within current memory capacities and to be time-efficient, we used a modified Bloom filter data structure. Finding the common reads between multiple samples and crossing this information with the location of the samples makes it possible to visualize biological processes such as ubiquitous species or the effect of water streams carrying species. Finally, the tool can also be used as a filter on metagenomic data, for example to remove a single species. Our software, Compareads, is currently used in the Tara Oceans project, where it shows that the global dynamics of the oceans seem to play a part in the dispersion of marine microorganisms.
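A minimal Python sketch of the underlying idea, indexing the k-mers of one read set in a Bloom filter and reporting reads of the other set that share enough k-mers, is given below; the value of k, the sharing threshold and the filter parameters are illustrative assumptions, not the heuristic actually implemented in Compareads.

```python
from hashlib import blake2b

class BloomFilter:
    def __init__(self, m=1 << 20, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = blake2b(f"{i}|{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

def kmers(read: str, k: int):
    return (read[i:i + k] for i in range(len(read) - k + 1))

def similar_reads(set_a, set_b, k=22, threshold=2):
    """Return reads of set_a sharing at least `threshold` k-mers with set_b."""
    index = BloomFilter()
    for read in set_b:
        for km in kmers(read, k):
            index.add(km)
    # Filter false positives can only over-count shared k-mers, so the method is a
    # heuristic with a small false-positive rate but no false negatives.
    return [r for r in set_a if sum(km in index for km in kmers(r, k)) >= threshold]

sample_b = ["ACGTACGTACGTACGTACGTACGTACGT", "TTTTGGGGCCCCAAAATTTTGGGGCCCC"]
sample_a = ["ACGTACGTACGTACGTACGTACGTAAAA", "GGGGGGGGGGGGGGGGGGGGGGGGGGGG"]
print(similar_reads(sample_a, sample_b))   # only the first read shares k-mers with sample_b
```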
14

Privacy Preserving Audit Proofs / Integritetsbevarande bevis av digitalt spårbara händelser

Lindqvist, Anton January 2017 (has links)
The increased dependence on computers for critical tasks demands sufficient and transparent methods to audit their execution. This is commonly solved using logging, where the log must not only be resilient against tampering and after-the-fact rewrites but also be able to answer queries concerning (non-)membership of events in the log while preserving privacy. Since the log cannot be assumed to be trusted, the answers must be verifiable using a proof of correctness. This thesis describes a protocol capable of producing verifiable, privacy-preserving membership proofs using Merkle trees. For non-membership, a method for authenticating Bloom filters using Merkle trees is proposed and analyzed. Since Bloom filters are probabilistic data structures, a method for handling false positives is also proposed. / The increasing offloading of critical functionality to computers places higher demands on logging and the ability to audit. The log must be resistant to manipulation and allow other parties to ask questions about a given event in the log without leaking sensitive information. Since the log is not assumed to be trustworthy, every answer must be verifiable by means of a proof. This report presents a protocol capable of producing verifiable and privacy-preserving answers to queries about events in the log using Merkle trees. When the requested event is absent, a new method is used to authenticate a Bloom filter with the help of Merkle trees. Since Bloom filters are a probabilistic construction, a method for handling false positives is also presented.
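The following is a small Python sketch of the kind of Merkle tree membership proof such a protocol relies on; the padding rule, the domain separation of leaves and internal nodes, and the proof encoding are simplifying assumptions rather than the thesis' exact construction.

```python
from hashlib import sha256

def h(data: bytes) -> bytes:
    return sha256(data).digest()

def build_tree(leaves):
    """Return a list of levels: level 0 holds the hashed leaves, the last level holds the root."""
    level = [h(b"\x00" + leaf) for leaf in leaves]   # domain-separate leaf hashes
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                           # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def audit_path(levels, index):
    """Sibling hashes from leaf to root, each tagged with the side it sits on."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        path.append((level[sibling], "left" if sibling < index else "right"))
        index //= 2
    return path

def verify(root, leaf, path) -> bool:
    node = h(b"\x00" + leaf)
    for sibling, side in path:
        node = h(b"\x01" + sibling + node) if side == "left" else h(b"\x01" + node + sibling)
    return node == root

events = [f"event-{i}".encode() for i in range(7)]
levels = build_tree(events)
root = levels[-1][0]
proof = audit_path(levels, 3)
assert verify(root, events[3], proof)          # event-3 is provably in the log
assert not verify(root, b"event-99", proof)    # a forged event fails verification
```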
15

Increasing big data front end processing efficiency via locally sensitive Bloom filter for elderly healthcare

Cheng, Yongqiang, Jiang, Ping, Peng, Yonghong January 2015 (has links)
In support of the growing elderly population, wearable sensors and portable mobile devices capable of monitoring, recording, reporting and alerting are envisaged to enable an independent lifestyle without relying on intrusive care programmes. However, the big data readings generated by the sensors are multidimensional, dynamic and non-linear, with weak correlation to observable human behaviors and health conditions, which challenges the transmission, storage and processing of the information. This paper proposes using a Locality Sensitive Bloom Filter to increase the efficiency of Instance Based Learning for front-end sensor data pre-processing, so that only relevant and meaningful information is sent on for further processing, relieving the burden of the above big data challenges. The approach is shown to optimize and enhance a popular instance-based learning method, benefiting from higher speed and lower space requirements, and is adequate for the application.
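A minimal Python sketch of a locality-sensitive Bloom filter for this kind of front-end screening is shown below; the p-stable LSH family, the number of hash groups, the bucket width and the hit threshold are illustrative assumptions, not the parameters used in the paper.

```python
import random
from hashlib import blake2b

class LSBloomFilter:
    """Bloom-filter-like structure whose hash functions are locality sensitive."""
    def __init__(self, dim, m_bits=1 << 16, groups=8, projections=4, width=5.0, seed=1):
        rng = random.Random(seed)
        self.m, self.width = m_bits, width
        self.bits = bytearray(m_bits // 8)
        # p-stable (Euclidean) LSH: each group quantizes a few random projections,
        # so readings within roughly `width` of a stored one map to the same key.
        self.groups = [[([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
                        for _ in range(projections)] for _ in range(groups)]

    def _positions(self, vector):
        for gid, group in enumerate(self.groups):
            key = tuple(int((sum(p * v for p, v in zip(plane, vector)) + b) // self.width)
                        for plane, b in group)
            digest = blake2b(f"{gid}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, vector):
        for p in self._positions(vector):
            self.bits[p // 8] |= 1 << (p % 8)

    def probably_seen(self, vector, min_hits=4) -> bool:
        # A reading is treated as already seen if enough LSH groups hit set bits.
        hits = sum(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(vector))
        return hits >= min_hits

# Only readings that look novel would be forwarded for full processing.
lsbf = LSBloomFilter(dim=3)
lsbf.add([36.6, 72.0, 0.1])                      # e.g. temperature, heart rate, motion
print(lsbf.probably_seen([36.7, 71.5, 0.1]))     # near-duplicate reading: expected True
print(lsbf.probably_seen([39.2, 120.0, 2.5]))    # very different reading: expected False
```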
16

Approximate Distributed Set Reconciliation with Defined Accuracy

Kruber, Nico 24 April 2020 (has links)
With currently available means it is difficult to objectively compare approximate set reconciliation algorithms. Each algorithm can be adapted to a given scenario through a different choice of its parameters and thus be optimized, for example, for bandwidth or CPU. Changes to these parameters, however, usually also change the accuracy with which differences between the participating sets are detected, and thus hinder objective comparisons based on the same accuracy. In this thesis a methodology is developed that allows a fair comparison of approximate set reconciliation algorithms: a fixed target accuracy is defined and all accuracy-influencing parameters are then set accordingly. The method is universal enough to be used for a broad range of algorithms; to substantiate this, it is applied to two trivial hash-based algorithms as well as to algorithms based on Bloom filters and on Merkle trees. Compared to previous work on Merkle trees, it is proposed to choose the hash sizes dynamically within the tree and thus align the bandwidth requirement with the desired target accuracy. This yields a new variant of set reconciliation with Merkle trees whose accuracy can be configured for the first time. A comprehensive evaluation of each of the four algorithms adapted under the accuracy model confirms the applicability of the developed methodology and re-evaluates these algorithms. The results allow an efficient algorithm to be selected for different practical scenarios based on a desired target accuracy. The presented methodology for determining suitable parameters so that different algorithms achieve the same accuracy can also be applied to further set reconciliation algorithms and allows an objective, general assessment of their performance under different metrics. The new approximate set reconciliation with Merkle trees developed in this thesis broadens the applicability of Merkle trees and sheds new light on their effectiveness. / The objective comparison of approximate versioned set reconciliation algorithms is challenging. Each algorithm's behaviour can be tuned for a given use case, e.g. low bandwidth or computational overhead, using different sets of parameters. Changes to these parameters, however, often also influence the algorithm's accuracy in recognising differences between the participating sets and thus hinder objective comparisons based on the same level of accuracy. We develop a method to fairly compare approximate set reconciliation algorithms by enforcing a fixed accuracy and deriving accuracy-influencing parameters accordingly. We show this method's universal applicability by adopting two trivial hash-based algorithms as well as set reconciliation with Bloom filters and Merkle trees. Compared to previous research on Merkle trees, we propose to use dynamic hash sizes to align the transfer overhead with the desired accuracy, and create a new Merkle tree reconciliation algorithm with an adjustable accuracy target. An extensive evaluation of each algorithm under this accuracy model verifies its feasibility and ranks these four algorithms.
Our results make it easy to choose an efficient algorithm for practical set reconciliation tasks based on the required level of accuracy. Our way of finding configuration parameters for different, yet equally accurate, algorithms can also be applied to other set reconciliation algorithms and allows their respective performance to be rated in an objective manner. The resulting new approximate Merkle tree reconciliation broadens the applicability of Merkle trees and sheds new light on their effectiveness.
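As a point of reference for the Bloom-filter variant discussed above, here is a minimal Python sketch of one-directional Bloom-filter set reconciliation; the sizing rule that derives the filter parameters from a target false-positive rate is the standard textbook formula and stands in here for the thesis' accuracy model.

```python
import math
from hashlib import blake2b

class BloomFilter:
    def __init__(self, n_items, fp_rate):
        # m = -n ln(p) / (ln 2)^2,  k = (m / n) ln 2
        self.m = max(8, math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / max(n_items, 1) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = blake2b(f"{i}|{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

def reconcile(local_items, remote_items, fp_rate=0.01):
    """Items the remote side holds that the local side (probably) lacks."""
    bf = BloomFilter(len(local_items), fp_rate)      # this filter is what gets transferred
    for item in local_items:
        bf.add(item)
    # False positives let some genuinely missing items slip through undetected,
    # which is exactly the accuracy/bandwidth trade-off fixed in advance above.
    return [item for item in remote_items if item not in bf]

replica_a = {f"key{i}" for i in range(0, 1000)}
replica_b = {f"key{i}" for i in range(0, 1010)}      # 10 extra items on B
missing_on_a = reconcile(replica_a, replica_b)
print(len(missing_on_a), "items to ship to A (expected close to 10)")
```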
17

Secure and Privacy-preserving V2X multicast DNS

Atif, Ayub, Arieltan, Justin January 2020 (has links)
The Domain Name System is a hierarchical naming system that provides information about network resources or services for given domain names. DNS applications in vehicular networks raise new challenges with regard to the security and privacy of vehicles. In particular, DNS operation must be preserved for vehicular communications outside the coverage of roadside infrastructure. Multicast DNS is proposed as a method to restrict queries to vehicles in a Vehicle-to-Everything environment, which may also include other connected devices. Contemporary DNS applications rely on robust security protocols provided by the DNS Security Extensions to authenticate responses and verify resource records. Vehicular DNS communications need authentication to verify the source and legitimacy of DNS resource records. This can be achieved through multihop Vehicle-to-Vehicle communications to reach a name server, supplemented by a novel approach to verifying records using a Bloom filter. In this thesis, we analyze the security and privacy risks posed by a non-authenticated baseline communication protocol. We then build a secure and privacy-preserving networked system based on a pseudonym-certificate-based public key infrastructure solution. The experimental analysis confirms the improvement in security and privacy at the cost of communication and computation overhead. / The Domain Name System is a hierarchical naming system that provides information about network resources or services for given domain names. DNS applications in vehicular networks raise new challenges regarding data security and vehicle privacy. In particular, vehicular communication outside the coverage of roadside infrastructure needs to preserve and secure the operation of DNS applications in vehicular networks. Multicast DNS is a proposed method for restricting queries to vehicles in a Vehicle-to-Everything environment, which may include other connected devices. Current applications rely on robust security protocols from the DNS Security Extensions to authenticate responses and verify resource records. Vehicular DNS communications need authentication to verify the source and legitimacy of DNS resource records. This can be achieved through multihop Vehicle-to-Vehicle communication to reach a name server, together with a new method for verifying records using a Bloom filter data structure. In this thesis we analyze the risks of a non-authenticated, privacy-leaking communication protocol. We then build a network that uses a pseudonym-certificate-based public key infrastructure solution to investigate improvements in security and privacy. The experimental analysis shows an improvement in security and privacy at the cost of time performance, which is an interesting trade-off.
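The following is a simplified Python sketch of how a Bloom filter digest could be used to verify relayed DNS resource records; an HMAC over a shared key stands in for the pseudonym-certificate signature of the actual system, and the record format, filter size and key handling are illustrative assumptions.

```python
import hmac
from hashlib import sha256, blake2b

M_BITS, K_HASHES = 1 << 16, 5

def bf_positions(record: str):
    for i in range(K_HASHES):
        digest = blake2b(f"{i}|{record}".encode(), digest_size=8).digest()
        yield int.from_bytes(digest, "big") % M_BITS

def build_signed_filter(records, key: bytes):
    bits = bytearray(M_BITS // 8)
    for record in records:
        for p in bf_positions(record):
            bits[p // 8] |= 1 << (p % 8)
    tag = hmac.new(key, bits, sha256).digest()    # stands in for a PKI signature
    return bytes(bits), tag

def record_is_legitimate(record: str, bits: bytes, tag: bytes, key: bytes) -> bool:
    # First authenticate the filter itself, then check the record against it.
    if not hmac.compare_digest(tag, hmac.new(key, bits, sha256).digest()):
        return False
    return all(bits[p // 8] >> (p % 8) & 1 for p in bf_positions(record))

server_key = b"shared-demo-key"                   # assumption: trust established out of band
records = ["service.example.v2x A 10.0.0.7", "maps.example.v2x A 10.0.0.9"]
bits, tag = build_signed_filter(records, server_key)

print(record_is_legitimate("service.example.v2x A 10.0.0.7", bits, tag, server_key))  # True
print(record_is_legitimate("service.example.v2x A 6.6.6.6", bits, tag, server_key))   # almost surely False
```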
18

Optimization for big joins and recursive query evaluation using intersection and difference filters in MapReduce / Utilisation de filtres d’intersection et de différence pour l’optimisation des jointures à grande échelle et l’exécution de requêtes récursives à l’aide MapReduce

Phan, Thuong-Cang 07 July 2014 (has links)
The information technology community has created an unprecedented amount of data through large-scale applications. This massive data is considered a gold mine of information, waiting only for reliable and suitable processing power to evaluate complex analysis algorithms. MapReduce is one of the best-known programming models for handling this type of processing, and has become a standard for processing, analyzing and generating large amounts of data in parallel. However, the MapReduce programming model suffers from severe limitations for operations beyond simple scans or groupings, in particular for processing with multiple inputs. In this dissertation we study and optimize the evaluation, in a MapReduce environment, of one of the most important and representative operations: the join. Beyond the binary join, our work addresses complex joins such as the multi-way join and the recursive join. To achieve these objectives, we first propose a new type of filter called the intersection filter, which uses a probabilistic model to represent an approximation of the set intersection. The intersection filter is then applied to the two-way join operation to eliminate most of the non-joining elements in the input datasets before the data is sent to the join process. In addition, we propose an extension of the intersection filter to improve the efficiency of the three-way join and of chain joins corresponding to a join cycle with several shared join keys. We use the method of Lagrange multipliers to make an appropriate choice between the different solutions proposed for multi-way joins. Another proposal is the difference filter, a probabilistic data structure designed to represent a set and test for disjoint elements. This filter can be applied to a large number of problems, such as reconciliation, deduplication, error correction and, in our case, the recursive join. A recursive join using a difference filter is performed as a repetition of joins instead of a join followed by a difference step. This improvement halves the number of executed jobs and the associated overheads such as data rescanning, intermediate data generation and communication. In particular, it improves the evaluation of the semi-naive algorithm and consequently the evaluation of recursive queries in MapReduce. We then provide general cost models for binary, n-way and recursive joins. With these models we can compare the most representative join algorithms and show the benefit of the proposed filters, notably through the reduction of disk and network I/O costs. Furthermore, experiments were carried out that show the efficiency of the intersection filter compared to existing solutions, comparing in particular criteria such as the amount of intermediate data, the amount of output data, the execution time and the distribution of tasks.
Our proposals for join operations contribute to the overall optimization of data management with the MapReduce paradigm on large-scale distributed infrastructures. / The information technology community has created an unprecedented amount of data through large-scale applications. As a result, Big Data is considered a gold mine of information that just waits for the processing power to be available, reliable, and apt at evaluating complex analytic algorithms. MapReduce is one of the most popular programming models designed to support such processing. It has become a standard for processing, analyzing and generating large data in a massively parallel manner. However, the MapReduce programming model suffers from severe limitations for operations beyond simple scan/grouping, particularly operations with multiple inputs. In the present dissertation we investigate and optimize the evaluation, in a MapReduce environment, of one of the most salient and representative such operations: the join. We focus not only on two-way joins, but also on complex joins such as multi-way joins and recursive joins. To achieve these objectives, we first devise a new type of filter called the intersection filter, which uses a probabilistic model to represent an approximation of the set intersection. The intersection filter is then applied to two-way join operations to eliminate most non-joining elements in the input datasets before data is sent to the actual join processing. In addition, we extend the intersection filter to improve the performance of three-way joins and chain joins, including cyclic chain joins with many shared join keys. We use the Lagrangian multiplier method to indicate a good choice between our optimized solutions for multi-way joins. Another important proposal is the difference filter, a probabilistic data structure designed to represent a set and examine disjoint elements of the set. It can be applied to a wide range of popular problems such as reconciliation, deduplication and error correction, and especially to the recursive join operation. A recursive join using the difference filter is implemented as an iteration of one join job instead of two jobs comprising a join job and a difference job. This improvement significantly reduces the number of executed jobs, roughly by half, along with the related overheads such as data rescanning, intermediate data, and communication for the deduplication and difference operations. This research also improves the general semi-naive algorithm, as well as the evaluation of recursive queries in MapReduce. We then provide general cost models for two-way joins, multi-way joins, and recursive joins. Thanks to these cost models, comparisons of the join algorithms become more persuasive. As a result, by using the proposed filters, the join operations can minimize disk I/O and communication costs. Moreover, the intersection filter-based join operations are demonstrated to be more efficient than existing solutions through experimental evaluations. Experimental comparisons of the different join algorithms examine the amount of intermediate data, the total output amount, the total execution time, and especially the task timelines. Finally, our improvements to the join operations contribute to the broader goal of optimizing data management for MapReduce applications on large-scale distributed infrastructures.
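A compact Python sketch of the intersection-filter idea for a two-way join is given below; the plain-Python map and reduce steps only simulate the MapReduce phases, and the filter size, hash count and sample data are illustrative assumptions rather than the dissertation's implementation.

```python
from hashlib import blake2b
from collections import defaultdict

M_BITS, K_HASHES = 1 << 16, 4

def positions(key: str):
    for i in range(K_HASHES):
        digest = blake2b(f"{i}|{key}".encode(), digest_size=8).digest()
        yield int.from_bytes(digest, "big") % M_BITS

def build_filter(keys):
    bits = bytearray(M_BITS // 8)
    for key in keys:
        for p in positions(key):
            bits[p // 8] |= 1 << (p % 8)
    return bits

def maybe_contains(bits, key):
    return all(bits[p // 8] >> (p % 8) & 1 for p in positions(key))

def filtered_join(left, right):
    """left, right: iterables of (join_key, value) pairs."""
    # Pre-job: build the intersection filter from one input's join keys.
    bits = build_filter(k for k, _ in left)
    # "Map" phase on the other input: only records passing the filter are emitted,
    # which is where the savings in intermediate data and shuffle traffic come from.
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        if maybe_contains(bits, k):
            groups[k][1].append(v)
    # "Reduce" phase: emit the cross product per join key.
    return [(k, lv, rv) for k, (ls, rs) in groups.items() for lv in ls for rv in rs]

users = [("u1", "Alice"), ("u2", "Bob")]
clicks = [("u1", "/home"), ("u3", "/spam"), ("u2", "/docs"), ("u9", "/noise")]
print(filtered_join(users, clicks))   # only the u1 and u2 records survive the filter
```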
19

Forensic analysis of unallocated space

Lei, Zhenxing 01 June 2011 (has links)
Computer forensics has become an important technology for providing evidence in investigations of computer misuse, attacks against computer systems and more traditional crimes, such as money laundering and fraud, where digital devices are involved. Investigators frequently perform preliminary analysis at the crime scene on suspects' devices to determine whether any inappropriate materials, such as child pornography, are present, and conduct further analysis after the seizure of computers to glean leads or valuable evidence. Hence, it is crucial to design a tool that is portable and can perform efficient instant analysis. Many tools have been developed for this purpose, such as the Computer Online Forensic Evidence Extractor (COFEE), but unfortunately they become ineffective in cases where forensic data has been removed. In this thesis, we design a portable forensic tool that can complement COFEE for preliminary screening by analyzing unallocated disk space. It adopts a space-efficient fingerprint hash table to store the massive forensic reference data from law enforcement databases on a flash drive and uses hash tree indexing for fast searching. We also apply group testing to identify the fragmentation point of a file and to locate the starting cluster of each fragment based on statistics on the gap between fragments. Furthermore, in order to retrieve evidence and clues from unallocated space by recovering deleted files, a file-structure-based carving algorithm for Windows registry hive files is presented, based on their internal structure and unique storage patterns. / UOIT
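To illustrate the screening idea, the following is a minimal Python sketch that stores short fingerprints of known sector hashes in a compact open-addressed table and scans fixed-size clusters against it; the fingerprint width, table layout and cluster size are assumptions, and the sketch omits the hash tree indexing and group testing described in the abstract.

```python
from hashlib import sha256

CLUSTER = 4096

class FingerprintTable:
    """Open-addressed table storing 32-bit fingerprints of full SHA-256 digests."""
    def __init__(self, capacity=1 << 20):
        self.capacity = capacity
        self.slots = [0] * capacity              # 0 marks an empty slot

    def _slot_and_fp(self, digest: bytes):
        slot = int.from_bytes(digest[:8], "big") % self.capacity
        fp = int.from_bytes(digest[8:12], "big") or 1
        return slot, fp

    def add(self, digest: bytes):
        slot, fp = self._slot_and_fp(digest)
        while self.slots[slot] not in (0, fp):   # linear probing on collisions
            slot = (slot + 1) % self.capacity
        self.slots[slot] = fp

    def __contains__(self, digest: bytes) -> bool:
        slot, fp = self._slot_and_fp(digest)
        while self.slots[slot] != 0:
            if self.slots[slot] == fp:
                return True                      # may be a rare false positive
            slot = (slot + 1) % self.capacity
        return False

def screen_unallocated(image: bytes, table: FingerprintTable):
    """Yield offsets of clusters whose hash matches the reference database."""
    for off in range(0, len(image) - CLUSTER + 1, CLUSTER):
        if sha256(image[off:off + CLUSTER]).digest() in table:
            yield off

known_bad_cluster = b"\xAB" * CLUSTER
table = FingerprintTable()
table.add(sha256(known_bad_cluster).digest())
disk = b"\x00" * CLUSTER + known_bad_cluster + b"\xFF" * CLUSTER
print(list(screen_unallocated(disk, table)))     # [4096]: hit at the second cluster
```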
