21
Optimization for big joins and recursive query evaluation using intersection and difference filters in MapReduce. Phan, Thuong-Cang, 07 July 2014.
The information technology community has created an unprecedented amount of data through large-scale applications. As a result, Big Data is considered a gold mine of information that simply waits for processing power that is available, reliable, and capable of evaluating complex analytic algorithms. MapReduce is one of the most popular programming models designed to support such processing. It has become a standard for processing, analyzing and generating large data in a massively parallel manner. However, the MapReduce programming model suffers from severe limitations for operations beyond simple scan/grouping, particularly operations with multiple inputs. In this dissertation, we investigate and optimize the evaluation, in a MapReduce environment, of one of the most salient and representative of such operations: the join. We focus not only on two-way joins, but also on complex joins such as multi-way joins and recursive joins. To achieve these objectives, we first devise a new type of filter, called the intersection filter, which uses a probabilistic model to represent an approximation of the set intersection. The intersection filter is then applied to two-way join operations to eliminate most non-joining elements from the input datasets before data is sent to the actual join processing. In addition, we extend the intersection filter to improve the performance of three-way joins and chain joins, including cyclic chain joins with many shared join keys. We use the Lagrangian multiplier method to identify a good choice among our optimized solutions for multi-way joins. Another important proposal is the difference filter, a probabilistic data structure designed to represent a set and examine elements that do not belong to it. It can be applied to a wide range of popular problems such as reconciliation, deduplication and error correction, and especially to recursive join operations. A recursive join using the difference filter is implemented as an iteration of a single join job instead of two jobs (a join job and a difference job). This improvement halves the number of executed jobs and significantly reduces the related overheads, such as data rescanning, intermediate data, and the communication required for the deduplication and difference operations. Besides, this research also improves the general semi-naive algorithm and, more broadly, the evaluation of recursive queries in MapReduce. We then provide general cost models for two-way joins, multi-way joins, and recursive joins. Thanks to these cost models, comparisons between join algorithms become more persuasive. As a result, using the proposed filters, the join operations can minimize disk I/O and communication costs. Moreover, the intersection filter-based join operations are shown to be more efficient than existing solutions through experimental evaluations. Experimental comparisons of different join algorithms examine the amount of intermediate data, the total output size, the total execution time, and especially the task timelines. Finally, our improvements to the join operations contribute to the broader goal of optimizing data management for MapReduce applications on large-scale distributed infrastructures.
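The abstract does not spell out how the intersection filter is built; as a rough illustration of the idea, the sketch below (plain Python, made-up datasets) approximates a set-intersection filter by bitwise-ANDing two ordinary Bloom filters, one per join input, and uses the result to drop tuples whose join keys cannot appear on both sides before the actual join runs. This is a generic sketch under those assumptions, not the dissertation's construction.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions derived from SHA-256."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    def intersect(self, other):
        """Approximate filter for the key intersection (may add false positives)."""
        out = BloomFilter(self.m, self.k)
        out.bits = bytearray(a & b for a, b in zip(self.bits, other.bits))
        return out

# Toy datasets: (join_key, payload) pairs standing in for two MapReduce inputs.
R = [(1, "r1"), (2, "r2"), (5, "r5")]
S = [(2, "s2"), (3, "s3"), (5, "s5")]

bf_r, bf_s = BloomFilter(), BloomFilter()
for key, _ in R:
    bf_r.add(key)
for key, _ in S:
    bf_s.add(key)
inter = bf_r.intersect(bf_s)

# Map-side pre-filtering: only tuples whose key may appear on both sides are shuffled.
r_kept = [t for t in R if t[0] in inter]
s_kept = [t for t in S if t[0] in inter]
print(r_kept, s_kept)   # keys 1 and 3 are usually dropped before the shuffle/join phase
```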
22
Algorithmic Problems in Access Control. Mousavi, Nima, 29 July 2014.
Access control is used to provide regulated access to resources by principals. It is an important and foundational aspect of information security. Role-Based Access Control (RBAC) is a popular and widely used access control model that, as prior work argues, is ideally suited for enterprise settings. In this dissertation, we address two problems in the context of RBAC. One is the User Authorization Query (UAQ) problem, which relates to the sessions that a user creates to exercise permissions. UAQ's objective is to identify a set of roles that the user needs to activate so that the session is authorized for all permissions the user wants to exercise in that session. The activated roles must respect a set of Separation of Duty constraints, which restrict the roles that can be activated together in a session. UAQ is known to be intractable (NP-hard). In this dissertation, we give a precise formulation of UAQ as a joint-optimization problem and analyze it, examining how each input parameter contributes to its intractability. We then propose an approach to mitigate this intractability based on our observation that a corresponding decision version of the problem is in NP: we efficiently reduce UAQ to Boolean satisfiability in conjunctive normal form (CNF-SAT), a well-known NP-complete problem for which solvers exist that are efficient on large classes of instances. We also present results for UAQ posed as an approximation problem; these results suggest that efficient approximation is not promising for UAQ. We discuss an open-source implementation of our approach and a corresponding empirical assessment that we have conducted.
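The dissertation's CNF encoding is not reproduced in this abstract; the sketch below is a minimal, hypothetical illustration of the flavor of such a reduction for the decision version: one Boolean variable per role, one clause per requested permission saying that some role granting it is active, and one binary clause per Separation of Duty pair saying the two roles are not active together. The toy brute-force loop stands in for the SAT solver a real implementation would call; roles, permissions, and constraints are made up.

```python
from itertools import product

# Hypothetical toy instance (not from the dissertation).
roles = ["r1", "r2", "r3"]
grants = {"r1": {"p1"}, "r2": {"p2"}, "r3": {"p1", "p2"}}
wanted = {"p1", "p2"}                  # permissions the session must cover
sod = [("r1", "r3")]                   # pairs that must not be active together

var = {r: i + 1 for i, r in enumerate(roles)}   # DIMACS-style variable numbering

clauses = []
for p in wanted:                        # covering clause: some granting role is active
    clauses.append([var[r] for r in roles if p in grants[r]])
for a, b in sod:                        # SoD clause: not both roles active
    clauses.append([-var[a], -var[b]])

def satisfied(assignment, clause):
    return any((lit > 0) == assignment[abs(lit)] for lit in clause)

# Brute force over assignments; a real system would hand `clauses` to a SAT solver.
for bits in product([False, True], repeat=len(roles)):
    assignment = {i + 1: b for i, b in enumerate(bits)}
    if all(satisfied(assignment, c) for c in clauses):
        print("activate:", [r for r in roles if assignment[var[r]]])
        break
```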
The other problem we consider in this dissertation concerns an efficient data structure for distributed access enforcement. Access enforcement is the process of validating an access request to a resource. Distributed access enforcement has become important with the proliferation of data, which requires access control systems to scale to tens of thousands of resources and permissions. Prior work has shown the effectiveness of a data structure called the Cascade Bloom Filter (CBF) for this problem. In this dissertation, we study the construction of instances of the CBF. We formulate the problem of finding an optimal instance of a CBF, where optimality refers to the number of false positives incurred and the number of hash functions used. We prove that this problem is NP-hard and that a meaningful decision version is in NP. We then propose an approach to mitigate the intractability of the problem by reducing it to CNF-SAT, which allows us to use a SAT solver for instances that arise in practice. We discuss an open-source implementation of our approach and an empirical assessment based on it.
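The abstract measures optimality in terms of false positives and hash-function count; for a single standard Bloom filter these interact through the well-known estimate p ≈ (1 - e^(-kn/m))^k. The snippet below is generic Bloom-filter math rather than anything specific to the Cascade Bloom Filter construction studied in the dissertation; the parameter values are illustrative.

```python
from math import exp

def false_positive_rate(n, m, k):
    """Standard estimate for a Bloom filter holding n items in m bits with k hashes."""
    return (1.0 - exp(-k * n / m)) ** k

n, m = 10_000, 80_000          # 8 bits per stored (principal, permission) pair
for k in range(1, 11):
    print(f"k={k:2d}  fpr={false_positive_rate(n, m, k):.5f}")
# The minimum sits near k = (m/n) * ln 2, about 5.5 here; adding hashes past that
# point costs extra computation while the false-positive rate worsens again.
```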
23
GPU Network Processing. Yanggratoke, Rerngvit, January 2010.
Networking technology is connecting more and more people around the world and has become an essential part of our daily life. For this connectivity to be seamless, networks need to be fast. Nonetheless, the rapid growth in network traffic and the variety of communication protocols overwhelm the Central Processing Units (CPUs) that process packets in the networks. Existing solutions to this problem, such as ASIC, FPGA, NPU, and TOE, are neither cost-effective nor easy to manage, because they require special hardware and custom configurations. This thesis approaches the problem differently, by offloading the network processing to off-the-shelf Graphics Processing Units (GPUs). The thesis's primary goal is to find out how GPUs should be used for this offloading. The thesis follows a case-study approach; the selected case studies are layer-2 Bloom filter forwarding and flow lookup in an OpenFlow switch. Implementation alternatives and an evaluation methodology are proposed for both case studies. A prototype implementation comparing the traditional CPU-only approach with GPU offloading is then developed and evaluated. The primary findings of this work are the criteria that make a network processing function suitable for GPU offloading, and the trade-offs involved. The criteria are: no inter-packet dependency, a similar processing flow for all packets, and opportunities for parallel processing within a packet. The offloading trades higher latency and memory consumption for higher throughput.
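Layer-2 Bloom filter forwarding is only named in this abstract; one common reading of the idea, sketched below with hypothetical MAC addresses and sizes, keeps one Bloom filter per output port holding the destination addresses reachable through that port, so a forwarding lookup becomes a handful of hash probes per port instead of a large exact table. False positives mean a frame may occasionally be sent out an extra port.

```python
import hashlib

M, K = 1024, 3   # bits per port filter, hash probes per lookup

def probes(key):
    """K bit positions for `key`, derived from a single SHA-256 digest."""
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

# One Bloom filter (an M-bit integer) per switch port; addresses are made up.
port_filters = {1: 0, 2: 0}
learned = {1: ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02"],
           2: ["aa:bb:cc:00:00:03"]}
for port, macs in learned.items():
    for mac in macs:
        for p in probes(mac):
            port_filters[port] |= 1 << p

def forward(dst_mac):
    """Return the ports whose filter claims to know dst_mac (may over-approximate)."""
    return [port for port, bits in port_filters.items()
            if all(bits >> p & 1 for p in probes(dst_mac))]

print(forward("aa:bb:cc:00:00:03"))   # expected: [2], plus any false-positive ports
```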
24
Space-efficient data sketching algorithms for network applications. Hua, Nan, 06 July 2012.
Sketching techniques are widely adopted in network applications. Sketching algorithms “encode” data into succinct data structures that can later be accessed and “decoded” for various purposes, such as network measurement, accounting, anomaly detection, etc. Bloom filters and counter braids are two well-known representatives of this category. Such sketching algorithms usually need to strike a trade-off between performance (how much information can be revealed, and how fast) and cost (storage, transmission and computation). This dissertation is dedicated to the research and development of several sketching techniques, including improved forms of stateful Bloom filters, statistical counter arrays and error estimating codes.

A Bloom filter is a space-efficient randomized data structure for approximately representing a set in order to support membership queries. The Bloom filter and its variants have found widespread use in many networking applications, where it is important to minimize the cost of storing and communicating network data. In this thesis, we propose a family of Bloom filter variants augmented by a rank-indexing method. We show that such augmentation brings a significant reduction in space and in the number of memory accesses, especially when deletions of set elements from the Bloom filter need to be supported.

An exact active counter array is another important building block in many sketching algorithms, where the storage cost of the array is of paramount concern. Previous approaches reduce the storage costs while either losing accuracy or supporting only passive measurements. In this thesis, we propose an exact statistics counter array architecture that supports active measurements (real-time reads and writes). It also leverages the aforementioned rank-indexing method and exploits statistical multiplexing to minimize the storage costs of the counter array.

Error estimating coding (EEC) has recently been established as an important tool to estimate bit error rates in the transmission of packets over wireless links. In essence, the EEC problem is also a sketching problem, since the EEC code can be viewed as a sketch of the packet sent, which is decoded by the receiver to estimate the bit error rate. In this thesis, we first investigate the asymptotic bound of error estimating coding by viewing the problem from a two-party computation perspective, and then investigate its coding/decoding efficiency using Fisher information analysis. Further, we develop several sketching techniques, including the Enhanced Tug-of-War (EToW) sketch and the generalized EEC (gEEC) sketch family, which achieve around a 70% reduction in sketch size with similar estimation accuracy. For all the solutions proposed above, we use theoretical tools such as information theory and communication complexity to investigate how far our proposed solutions are from the theoretical optimum. We show that the proposed techniques are asymptotically or empirically very close to the theoretical bounds.
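The rank-indexed variants proposed in the dissertation are not described here in enough detail to reproduce; as a baseline for the deletion problem they improve on, the sketch below shows a plain counting Bloom filter, which supports deletions by keeping a small counter per slot at the cost of several bits per counter, the very overhead that rank-indexing aims to reduce. Sizes and keys are illustrative.

```python
import hashlib

class CountingBloomFilter:
    """Baseline counting Bloom filter: one small counter per slot enables deletion."""
    def __init__(self, m=4096, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m     # 4-bit counters in hardware; plain ints here

    def _slots(self, key):
        d = hashlib.sha256(str(key).encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.m for i in range(self.k)]

    def add(self, key):
        for s in self._slots(key):
            self.counters[s] += 1

    def remove(self, key):
        # Caller must only remove keys previously added, or counters can underflow.
        for s in self._slots(key):
            self.counters[s] -= 1

    def __contains__(self, key):
        return all(self.counters[s] > 0 for s in self._slots(key))

cbf = CountingBloomFilter()
cbf.add("10.0.0.1:443")
print("10.0.0.1:443" in cbf)   # True
cbf.remove("10.0.0.1:443")
print("10.0.0.1:443" in cbf)   # False (barring collisions with other stored keys)
```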
25
Improving The Communication Performance Of I/O Intensive And Communication Intensive Application In Cluster Computer Systems. Kumar, V Santhosh, 10 1900.
Cluster computer systems assembled from commodity off-the-shelf components have emerged as a viable and cost-effective alternative to high-end custom parallel computer systems. In this thesis, we investigate how scalable performance can be achieved for database systems on clusters. In this context, we specifically consider database query processing, evaluate its bottlenecks, and suggest optimization techniques for obtaining scalable application performance.
First, we systematically demonstrate that in a large cluster with high disk bandwidth, the processing capability and the I/O bus bandwidth are the two major performance bottlenecks in database systems. To identify and assess bottlenecks, we developed a Petri net model of parallel query execution on a cluster. Once identified and assessed, we address these two performance bottlenecks by offloading certain application-related tasks to the processor on the network interface card. Offloading application tasks to the processor on the network interface card shifts the bottleneck from the cluster processor to the I/O bus. Further, we propose a hardware scheme, network-attached disks, and a software scheme to achieve balanced utilization of resources such as the host processor, the I/O bus, and the processor on the network interface card. The proposed schemes result in a speedup of up to 1.47 compared to the base scheme, and ensure scalable performance up to 64 processors.
Encouraged by the benefits of offloading application tasks to network processors, we explore the possibilities of performing Bloom filter operations on network processors. We combine the offloading of Bloom filter operations with the proposed hardware schemes to achieve up to a 50% reduction in execution time.
The latter part of the thesis presents introductory experiments conducted with the Community Atmospheric Model (CAM), a large-scale parallel application used for global weather and climate prediction. CAM is a communication-intensive application that involves collective communication of large messages. In our limited experiments, we selected CAM to study the effect of compression techniques and offloading techniques (as formulated for the database workloads) on the performance of communication-intensive applications. Due to time constraints, we considered only compression techniques for improving application performance; offloading could be taken up as a full-fledged research problem for further investigation.
In our experiments, we found that compressing messages reduces message latencies and hence improves the execution time and scalability of the application. Without compression, performance measured on a 64-processor cluster resulted in a speedup of only 15.6. While lossless compression retains the accuracy and correctness of the program, it does not achieve a high compression ratio. We therefore propose a lossy compression technique that achieves higher compression yet retains the accuracy and numerical stability of the application while delivering scalable performance. This leads to a speedup of 31.7 on 64 processors, compared to a speedup of 15.6 without message compression. We establish that the accuracy of CAM, within a prescribed limit of variation, and its numerical stability are retained under lossy compression.
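The abstract does not say which lossy scheme is used; purely as an illustration of how lossy message compression can trade a bounded numerical error for a much better compression ratio, the sketch below truncates low-order mantissa bits of double-precision values before applying ordinary lossless compression. The field values, bit counts, and error behavior are assumptions, not CAM's.

```python
import random
import struct
import zlib

def truncate_mantissa(values, keep_bits=20):
    """Zero the low-order mantissa bits of IEEE-754 doubles (a simple lossy step)."""
    drop = 52 - keep_bits
    mask = (~((1 << drop) - 1)) & ((1 << 64) - 1)
    out = []
    for v in values:
        raw = struct.unpack("<Q", struct.pack("<d", v))[0]
        out.append(struct.unpack("<d", struct.pack("<Q", raw & mask))[0])
    return out

random.seed(0)
field = [random.uniform(200.0, 300.0) for _ in range(10_000)]  # stand-in for one CAM field

original = struct.pack(f"<{len(field)}d", *field)
lossy_vals = truncate_mantissa(field)
lossy = struct.pack(f"<{len(field)}d", *lossy_vals)

print("lossless compressed:", len(zlib.compress(original)))
print("lossy    compressed:", len(zlib.compress(lossy)))
print("max abs error:", max(abs(a - b) for a, b in zip(field, lossy_vals)))
# Dropping 32 mantissa bits keeps roughly 6 significant decimal digits here,
# while the zeroed bytes compress far better than the original payload.
```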
26
Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code. Jang, Jiyong, 01 August 2013.
Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap has brought an urgent challenge to our security community—scalability. If our techniques cannot cope with an ever increasing volume of software, we will always be one step behind attackers. Thus developing scalable analysis to bridge the gap is essential.
In this dissertation, we argue that automatic code reuse detection enables an efficient data reduction of a high volume of incoming malware for downstream analysis and enhances software security by efficiently finding known vulnerabilities across large code bases. In order to demonstrate the benefits of automatic software similarity detection, we discuss two representative problems that are remedied by scalable analysis: malware triage and unpatched code clone detection.
First, we tackle the onslaught of malware. Although over one million new malware samples are reported each day, existing research shows that most malware is not written from scratch; instead, it consists of automatically generated variants of existing malware. When groups of highly similar variants are clustered together, genuinely new malware stands out more easily. Unfortunately, current systems struggle to handle this high volume of malware. We scale clustering using feature hashing and perform semantic analysis using co-clustering. Our evaluation demonstrates that these techniques are an order of magnitude faster than previous systems and automatically discover highly correlated features and malware groups. Furthermore, we design algorithms to infer evolutionary relationships among malware, which helps analysts understand trends over time and make informed decisions about which malware to analyze first.
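The abstract names feature hashing as the scaling mechanism for clustering; the sketch below is a generic illustration of that technique, hashing the byte n-gram features of a sample into a fixed-width bit vector so that pairwise similarity stays cheap no matter how many distinct features exist. The sample contents and similarity measure are illustrative, not a reconstruction of the dissertation's feature set.

```python
import zlib

def hashed_features(data: bytes, n=4, width=1 << 16):
    """Map the sample's byte n-grams into a fixed-size feature bitmap (feature hashing)."""
    bitmap = bytearray(width // 8)
    for i in range(len(data) - n + 1):
        h = zlib.crc32(data[i:i + n]) % width
        bitmap[h // 8] |= 1 << (h % 8)
    return bitmap

def jaccard(a, b):
    """Approximate Jaccard similarity between two hashed feature bitmaps."""
    inter = sum(bin(x & y).count("1") for x, y in zip(a, b))
    union = sum(bin(x | y).count("1") for x, y in zip(a, b))
    return inter / union if union else 0.0

# Hypothetical byte strings standing in for malware samples.
v1 = b"mov eax, ebx; call decrypt; jmp payload" * 20
v2 = b"mov eax, ebx; call decrypt; jmp payload2" * 20   # near-duplicate variant
v3 = b"completely different binary contents here!!" * 20

f1, f2, f3 = (hashed_features(x) for x in (v1, v2, v3))
print(jaccard(f1, f2), jaccard(f1, f3))   # the variant pair scores far higher
```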
Second, we address the problem of detecting unpatched code clones at scale. When buggy code gets copied from project to project, eventually all projects will need to be patched. We call clones of buggy code that have been fixed in only a subset of projects unpatched code clones. Unfortunately, code copying is usually ad-hoc and is often not tracked, which makes it challenging to identify all unpatched vulnerabilities in code bases at the scale of entire OS distributions. We scale unpatched code clone detection to spot over 15,000 latent security vulnerabilities in 2.1 billion lines of code from the Linux kernel, all Debian and Ubuntu packages, and all C/C++ projects in SourceForge in three hours on a single machine. To the best of our knowledge, this is the largest set of bugs ever reported in a single paper.
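The dissertation's exact clone-detection pipeline is not described in this abstract; as a rough illustration of the general approach, the sketch below normalizes source lines (dropping comments and whitespace), hashes fixed-size windows of consecutive lines into fingerprints, and flags a file that still contains the fingerprints of a known buggy snippet. The function names and snippet are made up.

```python
import hashlib

def normalize(src: str):
    """Strip comments and whitespace so superficial edits don't hide a clone."""
    lines = []
    for line in src.splitlines():
        line = line.split("//")[0].strip().replace(" ", "")
        if line:
            lines.append(line)
    return lines

def fingerprints(src: str, window=3):
    """Hash every window of consecutive normalized lines."""
    lines = normalize(src)
    return {hashlib.sha1("\n".join(lines[i:i + window]).encode()).hexdigest()
            for i in range(max(len(lines) - window + 1, 1))}

buggy_snippet = """
    size_t len = strlen(input);          // no bounds check
    memcpy(buf, input, len);
    buf[len] = 0;
"""

project_file = """
    int copy_name(char *buf, const char *input) {
        size_t len = strlen(input);          // no bounds check
        memcpy(buf, input, len);
        buf[len] = 0;
        return 0;
    }
"""

shared = fingerprints(buggy_snippet) & fingerprints(project_file)
print("possible unpatched clone" if shared else "no overlap")
```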
27
Efficient algorithms for de novo assembly of alternative splicing events from RNA-seq data. Tominaga Sacomoto, Gustavo Akio, 06 March 2014.
In this thesis, we address the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for the variants in such graphs. We then introduce an exact method, called KisSplice, to extract alternative splicing events, and show that it outperforms general-purpose transcriptome assemblers. We put extra effort into making KisSplice as scalable as possible. In order to improve the running time, we propose a new polynomial-delay algorithm to enumerate bubbles, and show that it is several orders of magnitude faster than previous approaches. In order to reduce memory consumption, we propose a new compact way to build and represent a de Bruijn graph, and show that our approach uses 30% to 40% less memory than the state of the art, with an insignificant impact on construction time. Additionally, we apply the techniques developed to list bubbles to two classical problems: cycle enumeration and the K-shortest paths problem. We give the first optimal algorithm to list cycles in undirected graphs, improving over Johnson's algorithm; this is the first improvement to this problem in almost 40 years. We then consider a different parameterization of the K-shortest (simple) paths problem: instead of bounding the number of st-paths, we bound the weight of the st-paths. We present new algorithms using exponentially less memory than previous approaches.
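In this setting a bubble is a pair of internally vertex-disjoint paths between the same pair of nodes in the de Bruijn graph, each path typically corresponding to one variant (for instance, the inclusion and exclusion forms of an exon). The sketch below builds a toy de Bruijn graph from made-up reads and reports one such bubble by simple exhaustive path search; it illustrates only the pattern being enumerated, not the polynomial-delay algorithm or the compact graph representation contributed by the thesis.

```python
from collections import defaultdict
from itertools import combinations

def de_bruijn(reads, k=4):
    """Directed graph on (k-1)-mers: one edge per k-mer seen in the reads."""
    g = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            g[kmer[:-1]].add(kmer[1:])
    return g

def simple_paths(g, s, t, limit=12):
    """All simple s->t paths up to a small length bound (toy-scale only)."""
    stack = [(s, [s])]
    while stack:
        node, path = stack.pop()
        if node == t:
            yield path
            continue
        if len(path) >= limit:
            continue
        for nxt in g[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))

def find_bubble(g, s, t):
    """Return two s->t paths that are vertex-disjoint except at the endpoints."""
    paths = list(simple_paths(g, s, t))
    for p, q in combinations(paths, 2):
        if set(p[1:-1]).isdisjoint(q[1:-1]):
            return p, q
    return None

# Two made-up reads sharing flanking sequence but differing in the middle,
# mimicking an alternative-splicing event between nodes CGT and TCG.
reads = ["ACGTCGA", "ACGTTTCGA"]
g = de_bruijn(reads, k=4)
print(find_bubble(g, "CGT", "TCG"))
```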
28
Towards Accurate and Scalable Recommender Systems. Pozo, Manuel, 12 October 2016.
Recommender systems aim at pre-selecting information and presenting first the items in which users may be interested. This has attracted the attention of e-commerce, where the interests of users are analysed in order to predict their future interests and to personalize the offers (a.k.a. items) presented to them. Recommender systems exploit the current preferences of users and the features of items and users in order to predict their future preferences for items. Although they have demonstrated accuracy in many domains, these systems still face great challenges for both academia and industry: they require distributed techniques to deal with a huge volume of data, they aim to exploit very heterogeneous data, and they suffer from cold-start, the situation in which the system has no (or not enough) information about (new) users or items to provide accurate recommendations. Among popular techniques, matrix factorization has demonstrated highly accurate predictions and the scalability to parallelize the analysis across multiple machines. However, it has two main drawbacks: (1) the difficulty of integrating external heterogeneous data, such as item features, and (2) the cold-start issue. The objective of this thesis is to address several challenges in the field of recommender systems: (1) recommendation techniques involve complex analysis over a huge volume of data, and need to parallelize the process across multiple machines to keep analysis time acceptable; (2) collaborative filtering techniques do not naturally take item descriptions into account, although this information may help produce more accurate recommendations; (3) user and item descriptions in very large datasets can become large and memory-consuming, which makes the analysis more complex; and (4) the new-user cold-start problem is particularly important in order to serve new users well and retain them. Our contributions to this area are fourfold: (1) we improve the distribution of a matrix factorization recommendation algorithm in order to achieve better scalability; (2) we enhance the recommendations performed by matrix factorization by studying the implicit interest of users in the attributes of items; (3) we propose an accurate, low-space binary vector based on Bloom filters for representing users and items with a large number of features in low memory; and (4) we address the new-user cold-start in collaborative filtering by using active-learning techniques. The experimentation phase uses the publicly available MovieLens dataset and the IMDb database, which allows fair comparisons with the state of the art. Our contributions demonstrate good performance in terms of both accuracy and efficiency.
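Contribution (3) above represents an item's (or user's) many features as a Bloom-filter-style binary vector; the sketch below illustrates that idea with hypothetical movie genres and keywords hashed into a short fixed-length bit vector whose overlaps can then feed a factorization model, trading a controlled rate of false overlaps for a large memory saving. Vector length, hash count, and feature names are assumptions, not the thesis's parameters.

```python
import hashlib

def feature_vector(features, m=64, k=2):
    """Encode a set of categorical features as an m-bit Bloom-filter-style vector."""
    bits = [0] * m
    for f in features:
        d = hashlib.md5(f.encode()).digest()
        for i in range(k):
            bits[int.from_bytes(d[4 * i:4 * i + 4], "big") % m] = 1
    return bits

def overlap(a, b):
    """Shared set bits: a cheap proxy for how many features two items share."""
    return sum(x & y for x, y in zip(a, b))

# Hypothetical item descriptions (e.g. drawn from IMDb metadata).
item_a = feature_vector({"sci-fi", "space", "ai", "director:kubrick"})
item_b = feature_vector({"sci-fi", "space", "aliens", "director:scott"})
item_c = feature_vector({"romance", "paris", "director:someone"})

print(overlap(item_a, item_b), overlap(item_a, item_c))
# items a and b typically share more set bits; occasional extra overlaps are the
# false positives inherent to hashing many features into a short vector
```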
29
Collaborative and decentralized detection and mitigation of network attacks. Guerid, Hachem, 06 December 2014.
The problem of botnets, networks of infected hosts controlled remotely by attackers, is a major concern because of the number of infected hosts and the associated threats: distributed denial-of-service (DDoS) attacks, spam, and data theft. State-of-the-art solutions to fight botnets have major limitations in the context of a network operator (scalability of the solution, confidentiality and privacy of users). In this thesis, we propose four network-based contributions to fight botnets, each addressing a different and complementary issue in this area. The first contribution traces back the source of denial-of-service attacks that threaten network availability, thereby identifying the infected devices used to perpetrate these attacks. The second contribution detects the communications between infected computers and their command-and-control (C&C) servers in a large-scale network, offering the opportunity to block these servers and so limit the risk of future attacks. The third contribution enables collaborative detection of botnets in an inter-domain, inter-operator context, in order to counter the highly distributed nature of these botnets. Finally, the last contribution mitigates botnets by slowing down the communication between infected hosts and their C&C server, providing a countermeasure against the evasion techniques developed by cybercriminals to make their botnets more resilient.
30
Workload- and Data-based Automated Design for a Hybrid Row-Column Storage Model and Bloom Filter-Based Query Processing for Large-Scale DICOM Data Management. Nguyen, Cong-Danh, 04 May 2018.
In the health-care industry, ever-increasing medical image data, the development of imaging technologies, the long-term retention of medical data and the increase in image resolution are causing tremendous growth in data volume. In addition, the variety of acquisition devices and the differing preferences of physicians and other health-care professionals have led to high variety in the data. Although the DICOM (Digital Imaging and Communications in Medicine) standard is today widely adopted to store and transfer medical data, DICOM data still has the 3V characteristics of Big Data: high volume, high variety and high velocity. Besides, there is a variety of workloads, including Online Transaction Processing (OLTP), Online Analytical Processing (OLAP) and mixed workloads. Existing systems have limitations in dealing with these characteristics of the data and workloads. In this thesis, we propose new, efficient methods for storing and querying DICOM data. We propose a hybrid storage model of row and column stores, called HYTORMO, together with data storage and query processing strategies. First, HYTORMO is designed and implemented to be deployed in a large-scale environment, making it possible to manage big medical data. Second, the data storage strategy combines the use of vertical partitioning and a hybrid store to create data storage configurations that can reduce storage space demand and increase workload performance. To achieve such a data storage configuration, one of two data storage design approaches can be applied: (1) expert-based design and (2) automated design. In the former approach, experts manually create data storage configurations by grouping attributes and selecting a suitable data layout for each column group. In the latter approach, we propose a hybrid automated design framework, called HADF. HADF depends on similarity measures (between attributes) that take into consideration the combined impact of both workload- and data-specific information to generate data storage configurations: Hybrid Similarity (a weighted combination of Attribute Access and Density Similarity measures) is used to group the attributes into column groups; Inter-Cluster Access Similarity is used to determine whether two column groups will be merged or not (to reduce the number of joins); and Intra-Cluster Access Similarity is applied to decide whether a column group will be stored in a row store or a column store. Finally, we propose a suitable and efficient query processing strategy built on top of HYTORMO. It considers the use of both inner joins and left-outer joins, to prevent the loss of data that would result from using only inner joins between vertically partitioned tables. Furthermore, an Intersection Bloom filter is applied to remove irrelevant data from the input tables of join operations, which reduces network I/O cost. We provide experimental evaluations to validate the benefits of the proposed methods over real DICOM datasets. Experimental results show that the mixed use of row and column stores outperforms both a pure row store and a pure column store. The combined impact of workload- and data-specific information helps HADF produce good data storage configurations. Moreover, the query processing strategy with the Intersection Bloom filter can improve the execution time of an experimental query by up to 50% compared to the case where no such filter is applied.
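The exact definitions of HADF's similarity measures are not given in this abstract; the sketch below is a toy interpretation in which attribute-access similarity is taken to be the cosine similarity between the columns of a binary workload matrix (queries by attributes), and attributes are greedily grouped whenever their similarity crosses a threshold. It is meant only to make the "group attributes by how the workload accesses them" idea concrete; HADF's Hybrid, Inter-Cluster, and Intra-Cluster measures are richer than this, and the attribute names are hypothetical.

```python
from math import sqrt

# Toy workload matrix: rows are queries, columns are attributes (1 = accessed).
attributes = ["patient_id", "study_date", "modality", "pixel_data", "image_comments"]
usage = [
    [1, 1, 1, 0, 0],   # OLAP-style query over metadata
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],   # OLTP-style retrieval of a full image record
    [1, 0, 0, 1, 0],
]

def access_similarity(i, j):
    """Cosine similarity between the binary access vectors of attributes i and j."""
    col_i = [row[i] for row in usage]
    col_j = [row[j] for row in usage]
    dot = sum(a * b for a, b in zip(col_i, col_j))
    norm = sqrt(sum(col_i)) * sqrt(sum(col_j))
    return dot / norm if norm else 0.0

def group_attributes(threshold=0.7):
    """Greedy grouping: an attribute joins the first group it is similar enough to."""
    groups = []
    for i, name in enumerate(attributes):
        for g in groups:
            if all(access_similarity(i, j) >= threshold for _, j in g):
                g.append((name, i))
                break
        else:
            groups.append([(name, i)])
    return [[name for name, _ in g] for g in groups]

print(group_attributes())
# [['patient_id', 'study_date', 'modality'], ['pixel_data', 'image_comments']]
```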