  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
271

Détection d'évènements complexes dans les flux d'évènements massifs / Complex event detection over large event streams

Braik, William 15 May 2017 (has links)
La détection d’évènements complexes dans les flux d’évènements est un domaine qui a récemment fait surface dans le ecommerce. Notre partenaire industriel Cdiscount, parmi les sites ecommerce les plus importants en France, vise à identifier en temps réel des scénarios de navigation afin d’analyser le comportement des clients. Les objectifs principaux sont la performance et la mise à l’échelle : les scénarios de navigation doivent être détectés en moins de quelques secondes, alors que des millions de clients visitent le site chaque jour, générant ainsi un flux d’évènements massif. Dans cette thèse, nous présentons Auros, un système permettant l’identification efficace et à grande échelle de scénarios de navigation conçu pour le eCommerce. Ce système s’appuie sur un langage dédié pour l’expression des scénarios à identifier. Les règles de détection définies sont ensuite compilées en automates déterministes, qui sont exécutés au sein d’une plateforme Big Data adaptée au traitement de flux. Notre évaluation montre qu’Auros répond aux exigences formulées par Cdiscount, en étant capable de traiter plus de 10,000 évènements par seconde, avec une latence de détection inférieure à une seconde. / Pattern detection over streams of events is gaining more and more attention, especially in the field of eCommerce. Our industrial partner Cdiscount, which is one of the largest eCommerce companies in France, aims to use pattern detection for real-time customer behavior analysis. The main challenges to consider are efficiency and scalability, as the detection of customer behaviors must be achieved within a few seconds, while millions of unique customers visit the website every day, thus producing a large event stream. In this thesis, we present Auros, a system for large-scale and efficient pattern detection for eCommerce. It relies on a domain-specific language to define behavior patterns. 
Patterns are then compiled into deterministic finite automata, which are run on a BigData streaming platform. Our evaluation shows that our approach is efficient and scalable, and fits the requirements of Cdiscount.
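The abstract above describes compiling behavior patterns into deterministic finite automata that run over the event stream. As a minimal sketch of that idea only, the Python below compiles a strict sequence pattern into a transition table and keeps one automaton state per customer. The event names, the `PATTERN` list, and the "ignore non-matching events" semantics are assumptions made for illustration; they are not Auros's language or its runtime.

```python
from collections import defaultdict

# Hypothetical navigation scenario (not taken from the thesis): a customer
# views a product, adds it to the cart, then leaves the cart page.
PATTERN = ["view_product", "add_to_cart", "leave_cart"]


def build_dfa(pattern):
    """Compile a sequence pattern into a transition table.

    State i means "the first i steps of the pattern have been seen".
    Only the expected next event advances the automaton.
    """
    return {(i, event): i + 1 for i, event in enumerate(pattern)}


def detect(events, pattern=PATTERN):
    """Run one automaton per customer over (customer, event_type) pairs.

    Returns the customers for whom the pattern completed, in detection order.
    """
    table = build_dfa(pattern)
    accepting = len(pattern)
    state = defaultdict(int)  # current automaton state for each customer
    matches = []
    for customer, event_type in events:
        current = state[customer]
        # Assumed semantics: events that do not match the expected step
        # are ignored and leave the state unchanged.
        nxt = table.get((current, event_type), current)
        if nxt == accepting:
            matches.append(customer)
            nxt = 0  # reset so the same customer can match again
        state[customer] = nxt
    return matches


stream = [
    ("c1", "view_product"),
    ("c2", "view_product"),
    ("c1", "search"),
    ("c1", "add_to_cart"),
    ("c2", "leave_cart"),
    ("c1", "leave_cart"),
]
matched = detect(stream)
```

Each event costs one table lookup, which is why a deterministic automaton suits a high-rate stream; partitioning the stream by customer would let the per-customer states be spread across workers.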
272

Nerelační databáze a jejich využití v prostředí finančních institucí / The use of NoSQL databases in the environment of financial institutions

Stejskal, Jan January 2012 (has links)
This thesis examines the use of NoSQL database systems in financial institutions. It has several objectives: to characterize the types of NoSQL database systems, to analyze the properties of selected systems and their potential use in financial institutions, to propose case studies for their use, and to select and implement one of them as a demonstration of what this type of database system can do in the specific environment of financial institutions. The theoretical part meets these objectives through description and analysis; the practical part does so through the design, selection, implementation, verification, and acceptance of one case study against acceptance criteria. The thesis first explains the basic concepts of database systems. It then explains the concept of NoSQL and related terms in more detail, including its causes and origins and the classification of NoSQL systems into categories. The next part compares the characteristics of relational database systems and NoSQL database systems. The following chapter deals with the needs of financial institutions with respect to database systems and analyzes the properties of several selected NoSQL database systems. The next chapter builds on the analytical findings of the previous chapters to identify potential uses of NoSQL database systems in financial institutions, which is the central theme of the thesis. The penultimate chapter proposes case studies, one of which is selected; the results of its implementation are described in the last chapter.
The main contribution of this work is to the theory of NoSQL systems and to the possibilities of their use by financial institutions. Taking these into account when choosing a database system, or a combination of database systems, can in practice not only increase the efficiency of their use but also optimize the acquisition and operating costs of such systems.
273

Big Data Governance / Big Data Governance

Blahová, Leontýna January 2016 (has links)
This master's thesis is about Big Data Governance and the software used for this purpose. Because Big Data is both a huge opportunity and a risk, I wanted to map products that can easily be used for Data Quality and Big Data Governance on one platform. The thesis is not purely theoretical; it also evaluates five products that I consider key. I defined requirements for each domain and then assigned weights and points. The main objective is to evaluate the software's capabilities and compare them.
274

Authorship attribution on micro-messages = Atribuição de autoria em micro-mensagens / Atribuição de autoria em micro-mensagens

Cavalcante, Thiago, 1989- 26 August 2018 (has links)
Orientadores: Ariadne Maria Brito Rizzoni Carvalho, Anderson de Rezende Rocha / Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Matemática Estatística e Computação Científica / Made available in DSpace on 2018-08-26T21:23:31Z (GMT). Previous issue date: 2014 / Resumo: Com o crescimento contínuo do uso de mídias sociais, a atribuição de autoria tem um papel importante na prevenção dos crimes cibernéticos e na análise de rastros online deixados por assediadores, bullies, ladrões de identidade entre outros. Nesta dissertação, nós propusemos um método para atribuição de autoria que é de cem a mil vezes mais rápido que o estado da arte. Nós também obtivemos uma acurácia de 65% na classificação de 50 autores. O método proposto se baseia numa representação de características escalável utilizando os padrões das mensagens dos micro-blogs, e também nos utilizamos de um classificador de padrões customizado para lidar com grandes quantidades de dados e alta dimensionalidade. Por fim, nós discutimos a redução do espaço de busca na análise de centenas de suspeitos online e milhões de micro mensagens online, o que torna essa abordagem valiosa para forense digital e aplicação das leis / Abstract: With the ever-growing use of social media, authorship attribution plays an important role in avoiding cybercrime, and helping the analysis of online trails left behind by cyber pranks, stalkers, bullies, identity thieves and the like. In this dissertation, we propose a method for authorship attribution in micro blogs that is one hundred to a thousand times faster than state-of-the-art counterparts. We also achieved an accuracy of 65% when classifying texts from 50 authors. 
The method relies on a powerful and scalable feature representation approach taking advantage of user patterns on micro-blog messages, and also on a custom-tailored pattern classifier adapted to deal with big data and high-dimensional data. Finally, we discuss search space reduction when analysing hundreds of online suspects and millions of online micro messages, which makes this approach invaluable for digital forensics and law enforcement / Mestrado / Ciência da Computação / Mestre em Ciência da Computação
275

Data Warehouses na era do Big Data: processamento eficiente de Junções Estrela no Hadoop / Data Warehouses na era do Big Data: processamento eficiente de Junções Estrela no Hadoop

Jaqueline Joice Brito 12 December 2017 (has links)
The era of Big Data is here: the combination of unprecedented amounts of data collected every day with the promotion of open source solutions for massively parallel processing has shifted the industry in the direction of data-driven solutions. From recommendation systems that help you find your next significant other to the dawn of self-driving cars, Cloud Computing has enabled companies of all sizes and areas to achieve their full potential with minimal overhead. In particular, the use of these technologies for Data Warehousing applications has decreased costs greatly and provided remarkable scalability, empowering business-oriented applications such as Online Analytical Processing (OLAP). Star Joins, i.e. joins of a central table with satellite dimensions, are among the most essential primitives in Data Warehouses. As the volume of the database scales, Star Joins become impractical and may seriously limit applications. In this thesis, we proposed specialized solutions to optimize the processing of Star Joins. To achieve this, we used the Hadoop software family on a cluster of 21 nodes. We showed that the primary bottleneck in the computation of Star Joins on Hadoop lies in the excessive disk spill and overhead due to network communication. To mitigate these negative effects, we proposed two solutions based on a combination of the Spark framework with either Bloom filters or the Broadcast technique. This reduced the computation time by at least 38%. Furthermore, we showed that the use of full scan may significantly hinder the performance of queries with low selectivity. Thus, we proposed a distributed Bitmap Join Index that can be processed as a secondary index with loose-binding and can be used with random access in the Hadoop Distributed File System (HDFS). 
We also implemented three versions (one in MapReduce and two in Spark) of our processing algorithm that uses the distributed index, which reduced the total computation time up to 88% for Star Joins with low selectivity from the Star Schema Benchmark (SSB). Because, ideally, the system should be able to perform both random access and full scan, our solution was designed to rely on a two-layer architecture that is framework-agnostic and enables the use of a query optimizer to select which approaches should be used as a function of the query. Due to the ubiquity of joins as primitive queries, our solutions are likely to fit a broad range of applications. Our contributions not only leverage the strengths of massively parallel frameworks but also exploit more efficient access methods to provide scalable and robust solutions to Star Joins with a significant drop in total computation time. / A era do Big Data chegou: a combinação entre o volume dados coletados diarimente com o surgimento de soluções de código aberto para o processamento massivo de dados mudou para sempre a indústria. De sistemas de recomendação que assistem às pessoas a encontrarem seus pares românticos à criação de carros auto-dirigidos, a Computação em Nuvem permitiu que empresas de todos os tamanhos e áreas alcançassem o seu pleno potencial com custos reduzidos. Em particular, o uso dessas tecnologias em aplicações de Data Warehousing reduziu custos e proporcionou alta escalabilidade para aplicações orientadas a negócios, como em processamento on-line analítico (Online Analytical Processing- OLAP). Junções Estrelas são das primitivas mais essenciais em Data Warehouses, ou seja, consultas que realizam a junções de tabelas de fato com tabelas de dimensões. Conforme o volume de dados aumenta, Junções Estrela tornam-se custosas e podem limitar o desempenho das aplicações. Nesta tese são propostas soluções especializadas para otimizar o processamento de Junções Estrela. 
Para isso, utilizamos a família de software Hadoop em um cluster de 21 nós. Nós mostramos que o gargalo primário na computação de Junções Estrelas no Hadoop reside no excesso de operações escrita do disco (disk spill) e na sobrecarga da rede devido a comunicação excessiva entre os nós. Para reduzir estes efeitos negativos, são propostas duas soluções em Spark baseadas nas técnicas Bloom filters ou Broadcast, reduzindo o tempo total de computação em pelo menos 38%. Além disso, mostramos que a realização de uma leitura completa das tables (full table scan) pode prejudicar significativamente o desempenho de consultas com baixa seletividade. Assim, nós propomos um Índice Bitmap de Junção distribuído que é implementado como um índice secundário que pode ser combinado com acesso aleatório no Hadoop Distributed File System (HDFS). Nós implementamos três versões (uma em MapReduce e duas em Spark) do nosso algoritmo de processamento baseado nesse índice distribuído, os quais reduziram o tempo de computação em até 77% para Junções Estrelas de baixa seletividade do Star Schema Benchmark (SSB). Como idealmente o sistema deve ser capaz de executar tanto acesso aleatório quanto full scan, nós também propusemos uma arquitetura genérica que permite a inserção de um otimizador de consultas capaz de selecionar quais abordagens devem ser usadas dependendo da consulta. Devido ao fato de consultas de junção serem frequentes, nossas soluções são pertinentes a uma ampla gama de aplicações. A contribuições desta tese não só fortalecem o uso de frameworks de processamento de código aberto, como também exploram métodos mais eficientes de acesso aos dados para promover uma melhora significativa no desempenho Junções Estrela.
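The abstract above mentions using Bloom filters to cut disk spill and network traffic when computing Star Joins. The single-process Python sketch below shows only the general filtering idea: build a Bloom filter over the dimension keys that pass a predicate, discard fact rows that cannot join, then finish with an exact join. The `BloomFilter` class, the `star_join` function, and the `date_key` column are hypothetical names chosen for illustration; this is not the thesis's Spark implementation.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def star_join(fact_rows, dim_rows, predicate, fk="date_key"):
    """Join fact rows with the dimension rows that satisfy `predicate`.

    fact_rows: list of dicts holding a foreign key under `fk`.
    dim_rows:  dict mapping primary key -> dict of dimension attributes.
    """
    selected = {k: v for k, v in dim_rows.items() if predicate(v)}
    bloom = BloomFilter()
    for key in selected:
        bloom.add(key)
    # Filter pass: in a distributed setting this would run before the
    # shuffle, so rows that cannot join never cross the network.
    candidates = [row for row in fact_rows if bloom.might_contain(row[fk])]
    # Exact join: removes any false positives let through by the filter.
    return [{**row, **selected[row[fk]]} for row in candidates if row[fk] in selected]


dims = {1: {"year": 1997}, 2: {"year": 1998}}
facts = [
    {"date_key": 1, "revenue": 10},
    {"date_key": 2, "revenue": 20},
    {"date_key": 1, "revenue": 5},
]
joined = star_join(facts, dims, lambda d: d["year"] == 1997)
```

The filter is far smaller than the dimension table itself, so it is cheap to send to every worker; the cost is a tunable false-positive rate, which the final exact join absorbs.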
276

Problématisation prospective des stratégies de la singularité / Prospective problematisation of singularity strategies

Nabholtz, Franz-Olivier 09 March 2018 (has links)
De la mondialisation à la globalisation, de la modernité à la postmodernité, de l"humain au transhumain : - la révolution numérique et technologique fait émerger des enjeux qui imprègnent notre quotidien au-delà même de ce que le sens commun peut imaginer. La massification des données, analysée comme la résultante d'une hyper-connectivité, liée à une convergence « big data-intelligence artificielle » pose la question de sa juste utilisation et répartition entre des acteurs privés très volontaristes (GAFA) et des institutions publics pour le moins dépassées, quant aux principes d'efficacité rationnelle représentant l'une des caractéristiques des datas. Une caractéristique prédictive qui correspond donc à un besoin vital des états. Une société humaine qui disposerait des connaissances précises de sa situation, pourrait faire des choix rationnels en fonction de scénarios prédictifs et n'agirait plus de la même façon et ne se normaliserait plus de la même façon. Si nous rejetons le transhumanisme dans sa dimension idéologique, nous prenons pour acquises les dimensions conceptuelles de la théorie dite de la singularité que nous problématisons dans ce travail par une analyse de l'information propre à une démarche d'intelligence économique, au-delà même de la pensée commune et d'un consensus hérité d'une école de pensée déductive qui s'est affirmée par la démonstration et imposée par une forme d'idéologie qui existe partout, si ce n'est dans les sciences sociales. La pensée inductive, dont la caractéristique première est la corrélation à vocation prédictive, verrait l'élaboration de scénarios probabilistes multidisciplinaires, audacieux et propres à la science politique, dont l'idée principale serait de détecter et d'anticiper, à l'instar de la médecine prédictive (c'est ce que nous dit la singularité), les grandes tendances sociétales et politiques futures. Cependant, la nature de ces travaux devra faire l'objet d'une indépendance totale. 
Le processus d'exploitation du big data par le biais d'algorithmes, hors processus traditionnels de validation scientifique, prendra appui sur un modèle nouveau, dans lequel la démonstration de la cause prendra sans doute une dimension quantique ou synaptique dans un futur proche, analysé ainsi, comme singulier. / From internationalization to globalization, from modernity to postmodernity, from the human to the transhuman: the digital and technological revolution raises issues that permeate our daily lives beyond even what common sense can imagine. The massification of data, analyzed as the result of hyper-connectivity and linked to a convergence of big data and artificial intelligence, raises the question of its fair use and distribution between highly proactive private actors (GAFA) and public institutions that are outdated, to say the least, with regard to the principles of rational efficiency that are one of the characteristics of data. This predictive characteristic corresponds to a vital need of states. A human society with precise knowledge of its situation could make rational choices based on predictive scenarios; it would no longer act in the same way and would no longer normalize itself in the same way. While we reject transhumanism in its ideological dimension, we take for granted the conceptual dimensions of the so-called theory of singularity, which we problematize in this work through an analysis of information specific to an economic intelligence approach, beyond even common thought and a consensus inherited from a deductive school of thought that has been affirmed by demonstration and imposed by a form of ideology that exists everywhere, if not in the social sciences. 
Inductive thinking, whose primary characteristic is predictive correlation, would see the development of probabilistic, multidisciplinary, bold scenarios specific to political science, the main idea of which would be to detect and anticipate, as predictive medicine does (this is what singularity tells us), major future societal and political trends. However, the nature of this work will have to be fully independent. The process of exploiting big data by means of algorithms, outside traditional processes of scientific validation, will be based on a new model, in which the proof of the cause will undoubtedly take on a quantum or synaptic dimension in the near future, analyzed thus as singular.
277

Automatisation de détections d'anomalies en temps réel par combinaison de traitements numériques et sémantiques / Automation of anomaly detections in real time by combining numeric and semantic processing

Belabbess, Badre 03 December 2018 (has links)
Les systèmes informatiques impliquant la détection d’anomalies émergent aussi bien dans le domaine de la recherche que dans l'industrie. Ainsi, des domaines aussi variés que la médecine (identification de tumeurs malignes), la finance (détection de transactions frauduleuses), les technologies de l’information (détection d’intrusion réseau) et l'environnement (détection de situation de pollution) sont largement impactés. L’apprentissage automatique propose un ensemble puissant d'approches qui peuvent aider à résoudre ces cas d'utilisation de manière efficace. Cependant, il représente un processus lourd avec des règles strictes qui supposent une longue liste de tâches telles que l'analyse et le nettoyage de données, la réduction des dimensions, l'échantillonnage, la sélection des algorithmes, l'optimisation des hyper-paramètres, etc. Il implique également plusieurs experts qui travailleront ensemble pour trouver les bonnes approches. De plus, les possibilités ouvertes aujourd'hui par le monde de la sémantique montrent qu'il est possible de tirer parti des technologies du web afin de raisonner intelligemment sur les données brutes pour en extraire de l'information à forte valeur ajoutée. L'absence de systèmes combinant les approches numériques d'apprentissage automatique et les techniques sémantiques du web des données constitue la motivation principale derrière les différents travaux proposés dans cette thèse. Enfin, les anomalies détectées ne signifient pas nécessairement des situations de réalité anormales. En effet, la présence d'informations externes pourrait aider à la prise de décision en contextualisant l'environnement dans sa globalité. Exploiter le domaine spatial et les réseaux sociaux permet de construire des contextes enrichis sur les données des capteurs. Ces contextes spatio-temporels deviennent ainsi une partie intégrante de la détection des anomalies et doivent être traités en utilisant une approche Big Data. 
Dans cette thèse, nous présentons trois systèmes aux architectures variées, chacun ayant porté sur un élément essentiel des écosystèmes big data, temps-réel, web sémantique et apprentissage automatique : WAVES : Plateforme Big Data d'analyse en temps réel des flux de données RDF capturées à partir de réseaux denses de capteurs IoT. Son originalité tient dans sa capacité à raisonner intelligemment sur des données brutes afin d'inférer des informations implicites à partir d'informations explicites et d'aider dans la prise de décision. Cette plateforme a été développée dans le cadre d'un projet FUI dont le principal cas d'usage est la détection d'anomalies dans un réseau d'eau potable. RAMSSES : Système hybride d'apprentissage automatique dont l'originalité est de combiner des approches numériques avancées ainsi que des techniques sémantiques éprouvées. Il a été spécifiquement conçu pour supprimer le lourd fardeau de l'apprentissage automatique qui est chronophage, complexe, source d'erreurs et impose souvent de disposer d'une équipe pluridisciplinaire. SCOUTER : Système intelligent de "scrapping web" permettant la contextualisation des singularités liées à l'Internet des Objets en exploitant aussi bien des informations spatiales que le web des données / Computer systems involving anomaly detection are emerging in both research and industry. Thus, fields as varied as medicine (identification of malignant tumors), finance (detection of fraudulent transactions), information technologies (network intrusion detection) and environment (pollution situation detection) are widely impacted. Machine learning offers a powerful set of approaches that can help solve these use cases effectively. However, it is a cumbersome process with strict rules that involve a long list of tasks such as data analysis and cleaning, dimension reduction, sampling, algorithm selection, optimization of hyper-parameters. etc. 
It also involves several experts who will work together to find the right approaches. In addition, the possibilities opened today by the world of semantics show that it is possible to take advantage of web technologies to reason intelligently on raw data to extract information with high added value. The lack of systems combining numeric approaches to machine learning and semantic techniques of the web of data is the main motivation behind the various works proposed in this thesis. Finally, the anomalies detected do not necessarily mean abnormal situations in reality. Indeed, the presence of external information could help decision-making by contextualizing the environment as a whole. Exploiting the spatial domain and social networks makes it possible to build enriched contexts around the sensor data. These spatio-temporal contexts thus become an integral part of anomaly detection and must be processed using a Big Data approach. In this thesis, we present three systems with different architectures, each focused on an essential element of big data, real-time, semantic web and machine learning ecosystems: WAVES: Big Data platform for real-time analysis of RDF data streams captured from dense networks of IoT sensors. Its originality lies in its ability to reason intelligently on raw data in order to infer implicit information from explicit information and assist in decision-making. This platform was developed as part of a FUI project whose main use case is the detection of anomalies in a drinking water network. RAMSSES: Hybrid machine learning system whose originality is to combine advanced numerical approaches with proven semantic techniques. It has been specifically designed to remove the heavy burden of machine learning, which is time-consuming, complex, error-prone, and often requires a multi-disciplinary team. 
SCOUTER: Intelligent "web scraping" system allowing the contextualization of singularities related to the Internet of Things by exploiting both spatial information and the web of data
278

Environnement big data et prise de décision intuitive : le cas du Centre d'Information et de Commandement (CIC) de la Police nationale des Bouches du Rhône (DDSP 13) / Big data environment and intuitive decision making : the case of the command and information center of the French national police

Vazquez llana, Jordan Diego 29 November 2018 (has links)
La thèse de ce travail de recherche se pose la question de la place de l’intuition dans le processus décisionnel en environnement big data. Il s’appuie sur une étude de cas exploratoire développée près des décideurs du Centre d’Information et de Commandement (CIC) de la Police Nationale (PN) des Bouches du Rhône. Ces derniers évoluent en environnement big data et doivent régulièrement gérer des situations imprévues. Le corpus des données de terrain a été construit par triangulation de 28 entretiens individuels et collectifs, d’observations non participantes ainsi que d’archives et de rapports officiels. Ces nouvelles informations sont autant d’indices qui permettent aux décideurs de mieux anticiper les imprévus, les conduisant à reconfigurer leurs attentes, leurs objectifs et leurs actions. Ces aspects positifs sont cependant à évaluer au regard du risque induit par le volume conséquent d’informations dorénavant à disposition des décideurs. Ils doivent maîtriser les nouveaux systèmes et les applications qui permettent d’exploiter l’environnement big data. Les résultats suggèrent que lorsque les décideurs ne maîtrisent pas ces systèmes, l’environnement big data peut conduire un décideur expert métier à redevenir un novice. / Godé and Vazquez have previously demonstrated that French police teams operate in extreme contexts (Godé & Vazquez, 2017), simultaneously marked by high levels of change and uncertainty and by risks that are mainly vital, material and legal (Godé, 2016), but also technological. In this context, the big data environment can affect the police decision-making process. The research question of this thesis is: "What is the status of intuition in the decision-making process in a big data environment?" 
We explain how the growth of available information volumes, the great diversity of their sources (social networks, websites, connected objects), their speed of diffusion (in real time or near real time) and their unstructured nature (Davenport & Soulard, 2014) introduce new decision-making challenges for National Police forces.
279

Das Industrial Internet – Engineering Prozesse und IT-Lösungen / The Industrial Internet: Engineering Processes and IT Solutions

Eigner, Martin January 2016 (has links)
Engineering is currently undergoing massive change, driven by smart systems and technologies, cybertronic products, big data, and cloud computing in the context of the Internet of Things and Services, as well as Industrie 4.0. The American notion of the "Industrial Internet", however, describes this (r)evolution far better than the narrow and strongly German-influenced term Industrie 4.0. The Industrial Internet takes the entire product lifecycle into account and addresses consumer goods and capital goods as well as services. This contribution examines this promising trend topic and offers well-founded insights into the networked engineering world of tomorrow, its design methods and processes, and its IT solutions.
280

Approximate Data Analytics Systems

Le Quoc, Do 22 January 2018 (has links)
Today, most modern online services make use of big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream at high speed and in huge volumes, and the cost of handling it can be significant. Providing interactive latency when processing the data is often impractical because the data is growing exponentially, even faster than Moore’s law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than exact, output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a subset of the input data instead of all of it. Unfortunately, the advancements in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees in the context of stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed and large-scale stream data to achieve low latency and efficient utilization of resources. To achieve these goals, we have designed and built the following approximate data analytics systems:
  • StreamApprox: a data stream analytics system for approximate computing. This system supports approximate computing for low-latency stream analytics in a transparent way and can adapt to rapid fluctuations of input data streams. In this system, we designed an online adaptive stratified reservoir sampling algorithm to produce approximate output with bounded error.
  • IncApprox: a data analytics system for incremental approximate computing. This system adopts approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. In this system, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
  • PrivApprox: a data stream analytics system for privacy-preserving and approximate computing. This system supports high-utility and low-latency data analytics and preserves users' privacy at the same time. The system is based on the combination of privacy-preserving data analytics and approximate computing.
  • ApproxJoin: an approximate distributed joins system. This system improves the performance of joins, which are critical but expensive operations in big data systems. In this system, we employed a sketching technique (Bloom filter) to avoid shuffling non-joinable data items through the network, and we proposed a novel sampling mechanism that executes during the join to obtain an unbiased representative sample of the join output.
Our evaluation based on micro-benchmarks and real-world case studies shows that these systems can achieve significant performance speedup compared to state-of-the-art systems by tolerating negligible accuracy loss in the analytics output. In addition, our systems allow users to systematically make a trade-off between accuracy and throughput/latency and require no or only minor modifications to existing applications.
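The abstract above names stratified reservoir sampling as the basis for approximate stream analytics. The Python sketch below illustrates the plain, non-adaptive form of that technique under simple assumptions: each stratum (for example, each sub-stream) keeps its own fixed-size reservoir using the classic reservoir-sampling replacement rule, and a total is estimated by scaling each stratum's sample mean by its item count. The function names are hypothetical, and the sketch omits the adaptivity and error bounds described in the thesis.

```python
import random


def stratified_reservoir_sample(stream, capacity, rng):
    """Keep a uniform sample of at most `capacity` values per stratum.

    stream: iterable of (stratum, value) pairs.
    Returns (reservoirs, counts), where counts[s] is the number of items
    seen so far in stratum s.
    """
    reservoirs = {}
    counts = {}
    for stratum, value in stream:
        counts[stratum] = counts.get(stratum, 0) + 1
        reservoir = reservoirs.setdefault(stratum, [])
        if len(reservoir) < capacity:
            reservoir.append(value)
        else:
            # Classic reservoir rule: the i-th item replaces a random slot
            # with probability capacity / i.
            j = rng.randrange(counts[stratum])
            if j < capacity:
                reservoir[j] = value
    return reservoirs, counts


def estimate_sum(reservoirs, counts):
    """Estimate the stream total: per stratum, sample mean times item count."""
    total = 0.0
    for stratum, sample in reservoirs.items():
        total += counts[stratum] * (sum(sample) / len(sample))
    return total


# Two strata of very different sizes; values are constant within a stratum
# so the estimate can be checked by hand.
events = [("a", 2.0)] * 100 + [("b", 5.0)] * 3
reservoirs, counts = stratified_reservoir_sample(events, capacity=10, rng=random.Random(0))
approx_total = estimate_sum(reservoirs, counts)
```

Sampling per stratum keeps small sub-streams from being drowned out by large ones, which a single reservoir over the whole stream would not guarantee.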
