21 |
Towards a Flexible High-efficiency Storage System for Containerized Applications. Zhao, Nannan, 08 October 2020
Due to their tight isolation, low overhead, and efficient packaging of the execution environment, Docker containers have become a prominent solution for deploying modern applications. Consequently, a large number of Docker images are created, and this massive image dataset presents challenges to the registry and container storage infrastructure that have so far remained largely unexplored. Hence, there is a need for a characterization of Docker images that can help optimize and improve the storage systems for containerized applications. Moreover, existing deduplication techniques significantly degrade the performance of registries, which slows down container startup. Therefore, there is growing demand for registry storage systems that combine high storage efficiency with high performance. Last but not least, different storage systems can be integrated with containers as backend storage systems and provide persistent storage for containerized applications. It is therefore important to analyze the performance of different backend storage systems and storage drivers and draw out the implications for container storage system design. These observations and challenges motivate my dissertation.
In this dissertation, we aim to improve the flexibility, performance, and efficiency of storage systems for containerized applications. To this end, we focus on three important aspects: Docker images, the Docker registry storage system, and Docker container storage drivers together with their backend storage systems. Specifically, this dissertation proceeds in three steps: (1) analyzing the Docker image dataset; (2) deriving design implications; and (3) designing a new storage framework for Docker registries and proposing different optimizations for container storage systems.
In the first part of this dissertation (Chapter 3), we analyze over 167TB of uncompressed Docker Hub images, characterize them using multiple metrics, and evaluate the potential of file-level deduplication in Docker Hub. In the second part (Chapter 4), we conduct a comprehensive performance analysis of container storage systems based on the key insights from our image characterization and derive several design implications. In the third part (Chapter 5), we propose DupHunter, a new Docker registry architecture, which not only natively deduplicates layers for space savings but also reduces layer restore overhead. DupHunter supports several configurable deduplication modes, which provide different levels of storage efficiency, durability, and performance, to support a range of uses. In the fourth part (Chapter 6), we explore an innovative holistic approach, Chameleon, that employs data redundancy techniques such as replication and erasure coding, coupled with endurance-aware write offloading, to mitigate wear-level imbalance in distributed SSD-based storage systems. This high-performance flash cluster can be used by registries to speed up performance. / Doctor of Philosophy / The number of Docker images stored in Docker registries is increasing rapidly and presents challenges for the underlying storage infrastructure. Before optimizing the storage system, we should first analyze this large Docker image dataset. To this end, in this dissertation we perform the first large-scale characterization and redundancy analysis of the images and layers stored in the Docker Hub registry. Based on the findings, this dissertation presents a series of practical and efficient techniques, algorithms, and optimizations to achieve a high-performance, flexible, and space-efficient storage system for containerized applications. The experimental evaluation demonstrates the effectiveness of our optimizations and techniques in making storage systems flexible and space-efficient.
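To make the file-level analysis concrete, the sketch below estimates the deduplication potential of a set of layer tarballs by hashing file contents. It is an illustration only, not the dissertation's actual analysis pipeline, and the layer paths are hypothetical.

```python
import hashlib
import tarfile

def file_digests(layer_tar_path):
    """Yield (sha256, size) for every regular file in a layer tarball."""
    with tarfile.open(layer_tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            yield hashlib.sha256(data).hexdigest(), member.size

def dedup_ratio(layer_tars):
    """Estimate the file-level deduplication ratio across a set of layer tarballs."""
    unique = {}                      # digest -> size of the single stored copy
    total = 0                        # bytes before deduplication
    for path in layer_tars:
        for digest, size in file_digests(path):
            total += size
            unique.setdefault(digest, size)
    stored = sum(unique.values())    # bytes after deduplication
    return total / stored if stored else 1.0

# Example (hypothetical paths): ratio = dedup_ratio(["layer1.tar", "layer2.tar"])
```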
|
22 |
Towards Secure Outsourced Data Services in the Public Cloud. Sun, Wenhai, 25 July 2018
The past few years have witnessed a dramatic shift of IT infrastructures from a self-sustained model to a centralized, multi-tenant elastic computing paradigm: Cloud Computing, which significantly reshapes the landscape of existing data utilization services. Public cloud service providers (CSPs), e.g., Google and Amazon, offer unprecedented benefits such as ubiquitous and flexible access, considerable capital expenditure savings, and on-demand resource allocation. The cloud has also become the virtual "brain" that supports and propels many important applications and system designs, for example artificial intelligence and the Internet of Things. On the flip side, security and privacy are among the primary concerns with the adoption of cloud-based data services, since the user loses control of her/his outsourced data. Encrypting sensitive user information certainly ensures confidentiality. However, encryption adds an extra layer of obscurity, and its direct use may be at odds with practical requirements and defeat the purpose of cloud computing technology. We believe that security should not, by nature, contravene the cloud outsourcing model. Rather, it is expected to complement current achievements and further fuel the wide adoption of public cloud services. This, in turn, requires us not to decouple security from the very beginning of the system design. Drawing on successes and failures from both academia and industry, we attempt to answer the challenges of realizing efficient and useful secure data services in the public cloud. In particular, we pay attention to security and privacy in two essential functions of the cloud "brain": data storage and processing. Our first work centers on secure chunk-based deduplication of encrypted data for cloud backup; it achieves performance comparable to plaintext cloud storage deduplication while effectively mitigating the information leakage from low-entropy chunks. We also comprehensively study the promising yet challenging issue of search over encrypted data in the cloud environment, which allows a user to delegate her/his search task to a CSP server that hosts a collection of encrypted files while still guaranteeing some measure of query privacy. To accomplish this vision, we explore both software-based secure computation, which often relies on cryptography and concentrates on algorithmic design and theoretical proofs, and trusted execution solutions, which depend on hardware-based isolation and trusted computing. We hope that, through the lens of our efforts, insights can be furnished into future research in related areas. / Ph. D. / The past few years have witnessed a dramatic shift of IT infrastructures from a self-sustained model to a centralized, multi-tenant elastic computing paradigm: Cloud Computing, which significantly reshapes the landscape of existing data utilization services. Public cloud service providers (CSPs), e.g., Google and Amazon, offer unprecedented benefits such as ubiquitous and flexible access, considerable capital expenditure savings, and on-demand resource allocation.
The cloud has also become the virtual "brain" that supports and propels many important applications and system designs, for example artificial intelligence and the Internet of Things. On the flip side, security and privacy are among the primary concerns with the adoption of cloud-based data services, since the user loses control of her/his outsourced data. Encryption definitely provides strong protection for sensitive user data, but it also disables the direct use of cloud data services and may defeat the purpose of cloud computing technology. We believe that security should not, by nature, contravene the cloud outsourcing model. Rather, it is expected to complement current achievements and further fuel the wide adoption of public cloud services. This, in turn, requires us not to decouple security from the very beginning of the system design. Drawing on successes and failures from both academia and industry, we attempt to answer the challenges of realizing efficient and useful secure data services in the public cloud. In particular, we pay attention to security and privacy in two essential functions of the cloud "brain", i.e., data storage and processing. The first part of this research aims to provide a privacy-preserving data deduplication scheme with performance comparable to existing cloud backup storage deduplication. In the second part, we attempt to secure the fundamental information retrieval functions and offer effective solutions in various contexts of cloud data services.
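As an illustration of the kind of building block involved in deduplicating encrypted data, the sketch below shows baseline convergent encryption, where the key is derived from the chunk itself so that identical chunks can be deduplicated by a deterministic tag. This is not the dissertation's scheme (which specifically mitigates the leakage such baselines suffer for low-entropy chunks), and it assumes the third-party `cryptography` package is available.

```python
import base64
import hashlib
from cryptography.fernet import Fernet  # assumed available (pip install cryptography)

def convergent_key(chunk: bytes) -> bytes:
    """Key derived from the chunk itself, so identical chunks yield identical keys."""
    return base64.urlsafe_b64encode(hashlib.sha256(chunk).digest())

def dedup_tag(chunk: bytes) -> str:
    """Deterministic tag the server uses to detect duplicates without seeing plaintext."""
    return hashlib.sha256(hashlib.sha256(chunk).digest()).hexdigest()

server_store = {}  # tag -> ciphertext (one stored copy per unique chunk)

def upload(chunk: bytes) -> str:
    tag = dedup_tag(chunk)
    if tag not in server_store:                     # duplicate chunks are stored only once
        server_store[tag] = Fernet(convergent_key(chunk)).encrypt(chunk)
    return tag

def download(tag: str, chunk_key: bytes) -> bytes:
    return Fernet(chunk_key).decrypt(server_store[tag])

# Two clients uploading the same chunk share one stored ciphertext:
tag = upload(b"same data"); upload(b"same data")
assert len(server_store) == 1
assert download(tag, convergent_key(b"same data")) == b"same data"
```

Note that this baseline leaks chunk equality and is brute-forceable for predictable, low-entropy chunks, which is exactly the weakness the dissertation's first work addresses.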
|
23 |
Deduplicerings påverkan på effektförbrukningen: en studie av deduplicering i ZFS / The impact of deduplication on power consumption: a study of deduplication in ZFS. Andersson, Tommy; Carlsson, Marcus, January 2011
Uppsatsen beskriver arbetet och undersökning för hur deduplicering i filsystemet ZFS påverkar effektförbrukningen. En större mängd redundant data förekommer i centraliserade lagringssystem som förser virtualiserade servrar med lagringsutrymme. Deduplicering kan för den typen av lagringsmiljö eliminera redundant data och ger en stor besparing av lagringsutrymme. Frågan som undersökningen avsåg att besvara var hur ett lagringssystem påverkas av det extra arbete som det innebär att deduplicera data i realtid. Metoden för att undersöka problemet var att utföra fem experiment med olika typer av scenarion. Varje scenario innebar att filer kopierades till ett lagringssystem med eller utan deduplicering för att senare kunna analysera skillnaden. Dessutom varierades mängden deduplicerbar data under experimenten vilket skulle visa om belastningen på hårddiskarna förändrades. Resultatet av experimenten visar att deduplicering ökar effektförbrukning och processorbelastning medan antalet I/O-operationer minskar. Analysen av resultatet visar att med en stigande andel deduplicerbar data som skrivs till hårddiskarna så stiger också effektförbrukning och processorbelastning. / This report describes the process and outcome of research on how power consumption is affected by deduplication in a ZFS file system. A large amount of redundant data exists in centralized storage systems that provide virtualized servers with storage space. In this kind of storage environment, deduplication can be used to eliminate redundant data and improve utilization of the available space. The question the study sought to answer was how a storage system's power consumption is affected by the extra workload deduplication introduces. The method used to investigate the problem was to perform five experiments with different scenarios. In each scenario, files were copied to a storage system with or without deduplication so that the difference could be analyzed afterwards. The amount of deduplicable data was also varied across the experiments to show whether the load on the disks changed. The results show that deduplication increases power consumption and CPU load while the number of I/O operations decreases. The analysis of the results shows that increasing the share of deduplicable data written to the disks also increases power consumption and CPU load.
|
24 |
Network compression via network memory: realization principles and coding algorithms. Sardari, Mohsen, 13 January 2014
The objective of this dissertation is to investigate both the theoretical and practical aspects of redundancy elimination methods in data networks. Redundancy elimination provides a powerful technique to improve the efficiency of network links in the face of redundant data. In this work, the concept of network compression is introduced to address the redundancy elimination problem. Network compression aspires to exploit the statistical correlation in data to better suppress redundancy. In a nutshell, network compression enables memorization of data packets in some nodes in the network. These nodes can learn the statistics of the information source generating the packets, which can then be used to reduce the length of the codewords describing the packets emitted by the source. Memory elements facilitate the compression of individual packets using the side information obtained from memorized data, an approach called "memory-assisted compression". Network compression improves upon de-duplication methods that only remove duplicate strings from flows.
The first part of the work includes the design and analysis of practical algorithms for memory-assisted compression. These algorithms are designed based on the theoretical foundation proposed in our group by Beirami et al. The performance of these algorithms is compared to existing compression techniques on real Internet traffic traces. Then, novel clustering techniques are proposed that can identify different information sources and apply the compression accordingly. This approach results in superior performance for memory-assisted compression when the input data comprises sequences generated by various, unrelated information sources.
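A rough analogy to memory-assisted compression, not the algorithms developed in the dissertation: zlib's preset-dictionary feature lets previously memorized traffic serve as side information that shortens the codeword for a new packet. The memory and packet contents below are hypothetical.

```python
import zlib

def compress_with_memory(packet: bytes, memory: bytes) -> bytes:
    """Compress a packet using previously memorized traffic as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=memory)
    return comp.compress(packet) + comp.flush()

def decompress_with_memory(blob: bytes, memory: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=memory)
    return decomp.decompress(blob) + decomp.flush()

# The "memory" stands for traffic the node has already observed from this source.
memory = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\nUser-Agent: demo\r\n\r\n" * 4
packet = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\nUser-Agent: other\r\n\r\n"

with_mem = compress_with_memory(packet, memory)
without_mem = zlib.compress(packet, 9)
assert decompress_with_memory(with_mem, memory) == packet
print(len(packet), len(without_mem), len(with_mem))  # side information typically shortens the codeword
```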
In the second part of the work, the application of memory-assisted compression in wired networks is investigated. In particular, networks with random and power-law graphs are studied. Memory-assisted compression is applied in these graphs and the routing problem for compressed flows is addressed. Furthermore, the network-wide gain of memorization is defined and its scaling behavior versus the number of memory nodes is characterized. In particular, our analysis of these graphs shows that a non-vanishing network-wide gain of memorization is obtained even when the number of memory units is a tiny fraction of the total number of nodes in the network.
In the third part of the work, the application of memory-assisted compression in wireless networks is studied. For wireless networks, a novel network compression approach via memory-enabled helpers is proposed. Helpers provide side information that is obtained via overhearing.
The performance of network compression in wireless networks is characterized and the following benefits are demonstrated: offloading the wireless gateway, increasing the maximum number of mobile nodes served by the gateway, reducing the average packet delay, and improving the overall throughput in the network.
Furthermore, the effect of wireless channel loss on the performance of the network compression scheme is studied. Finally, the performance of memory-assisted compression working in tandem with de-duplication is investigated and simulation results on real data traces from wireless users are provided.
|
25 |
Reduzindo custos da deduplicação de dados utilizando heurísticas e computação em nuvem / Reducing data deduplication costs using heuristics and cloud computing. NASCIMENTO FILHO, Dimas Cassimiro do, 02 May 2018
Na era de Big Data, na qual a escala dos dados provê inúmeros desafios para algoritmos clássicos, a tarefa de avaliar a qualidade dos dados pode se tornar custosa e apresentar tempos de execução elevados. Por este motivo, gerentes de negócio podem optar por terceirizar o monitoramento da qualidade de bancos de dados para um serviço específico, usualmente baseado em computação em nuvem. Neste contexto, este trabalho propõe abordagens para redução de custos da tarefa de deduplicação de dados, a qual visa detectar entidades duplicadas em bases de dados, no contexto de um serviço de qualidade de dados em nuvem. O trabalho tem como foco a tarefa de deduplicação de dados devido a sua importância em diversos contextos e sua elevada complexidade. É proposta a arquitetura em alto nível de um serviço de monitoramento de qualidade de dados que emprega o provisionamento dinâmico de recursos computacionais por meio da utilização de heurísticas e técnicas de aprendizado de máquina. Além disso, são propostas abordagens para a adoção de algoritmos incrementais de deduplicação de dados e controle do tamanho de blocos gerados na etapa de indexação do problema investigado. Foram conduzidos quatro experimentos diferentes visando avaliar a eficácia dos algoritmos de provisionamento de recursos propostos e das heurísticas empregadas no contexto de algoritmos incrementais de deduplicação de dados e de controle de tamanho dos blocos. Os resultados dos experimentos apresentam uma gama de opções englobando diferentes relações de custo e benefício, envolvendo principalmente: custo de infraestrutura do serviço e quantidade de violações de SLA ao longo do tempo. Outrossim, a avaliação empírica das heurísticas propostas para o problema de deduplicação incremental de dados também apresentou uma série de padrões nos resultados, envolvendo principalmente o tempo de execução das heurísticas e os resultados de eficácia produzidos. Por fim, foram avaliadas diversas heurísticas para controlar o tamanho dos blocos produzidos em uma tarefa de deduplicação de dados, cujos resultados de eficácia são bastante influenciados pelos valores dos parâmetros empregados. Além disso, as heurísticas apresentaram resultados de eficiência que variam significativamente, dependendo da estratégia de poda de blocos adotada. Os resultados dos quatro experimentos conduzidos apresentam suporte para demonstrar que diferentes estratégias (associadas ao provisionamento de recursos computacionais e aos algoritmos de qualidade de dados) adotadas por um serviço de qualidade de dados podem influenciar significativamente nos custos do serviço e, consequentemente, os custos repassados aos usuários do serviço. / In the era of Big Data, in which the scale of the data provides many challenges for classical algorithms, the task of assessing the quality of datasets may become costly and complex. For this reason, business managers may opt to outsource data quality monitoring to a specific cloud service for this purpose. In this context, this work proposes approaches for reducing the costs of solutions to the data deduplication problem, which aims to detect duplicate entities in datasets, in the context of a service for data quality monitoring. This work investigates the deduplication task due to its importance in a variety of contexts and its high complexity. We propose a high-level architecture of a service for data quality monitoring that employs provisioning algorithms based on heuristics and machine learning techniques. Furthermore, we propose approaches for the adoption of incremental data quality algorithms and heuristics for controlling the size of the blocks produced in the indexing phase of the investigated problem. Four different experiments were conducted to evaluate the effectiveness of the proposed provisioning algorithms, the heuristics for incremental record linkage, and the heuristics to control block sizes for entity resolution. The results of the experiments show a range of options covering different tradeoffs, involving the infrastructure costs of the service and the number of SLA violations over time. In turn, the empirical evaluation of the proposed heuristics for incremental record linkage also revealed a number of patterns in the results, involving tradeoffs between the runtime of the heuristics and the efficacy obtained. Lastly, the evaluation of the heuristics proposed to control block sizes revealed a large number of tradeoffs regarding execution time, pruning approaches, and the efficacy obtained. Moreover, the efficiency of these heuristics may vary significantly, depending on the adopted pruning strategy. The results of the four experiments support the conclusion that the different strategies (associated with computational resource provisioning and the employed data quality algorithms) adopted by a data quality service can significantly influence the service costs and, consequently, the costs forwarded to the service customers.
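As a minimal illustration of the indexing phase and of block-size control (not the thesis's actual heuristics), the sketch below groups records by a blocking key and prunes blocks that exceed a size cap before pairwise comparison; the records and the key function are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, key_fn, max_block_size=50):
    """Group records by a blocking key; oversized blocks are pruned (a simple size-control heuristic)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return {k: v for k, v in blocks.items() if 1 < len(v) <= max_block_size}

def candidate_pairs(blocks):
    """Only records sharing a block are compared, reducing the quadratic comparison cost."""
    for recs in blocks.values():
        yield from combinations(recs, 2)

# Hypothetical records: (id, name, city); blocking key = first 3 letters of name + city.
records = [
    (1, "Dimas Nascimento", "Campina Grande"),
    (2, "Dimas C. Nascimento", "Campina Grande"),
    (3, "Maria Silva", "Recife"),
]
blocks = block_records(records, lambda r: (r[1][:3].lower(), r[2].lower()))
print(list(candidate_pairs(blocks)))  # only records 1 and 2 become a candidate pair
```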
|
26 |
Sûreté de fonctionnement dans le nuage de stockage / Dependability in cloud storage. Obame Meye, Pierre, 01 December 2016
La quantité de données stockées dans le monde ne cesse de croître et cela pose des challenges aux fournisseurs de service de stockage qui doivent trouver des moyens de faire face à cette croissance de manière scalable, efficace, tout en optimisant les coûts. Nous nous sommes intéressés aux systèmes de stockage de données dans le nuage qui est une grande tendance dans les solutions de stockage de données. L'International Data Corporation (IDC) prédit notamment que d'ici 2020, environ 40% des données seront stockées et traitées dans le nuage. Cette thèse adresse les challenges liés aux performances d'accès aux données et à la sûreté de fonctionnement dans les systèmes de stockage dans le nuage. Nous avons proposé Mistore, un système de stockage distribué que nous avons conçu pour assurer la disponibilité des données, leur durabilité, ainsi que de faibles latences d'accès aux données en exploitant des zones de stockage dans les box, les Points de Présence (POP), et les centres de données dans une infrastructure Digital Subscriber Line (xDSL) d'un Fournisseur d'Accès à Internet (FAI). Dans Mistore, nous adressons aussi les problèmes de cohérence de données en fournissant plusieurs critères de cohérence des données ainsi qu'un système de versioning. Nous nous sommes aussi intéressés à la sécurité des données dans le contexte de systèmes de stockage appliquant une déduplication des données, qui est l'une des technologies les plus prometteuses pour réduire les coûts de stockage et de bande passante réseau. Nous avons conçu une méthode de déduplication en deux phases qui est sécurisée contre des attaques d'utilisateurs malicieux tout en étant efficace en termes d'économie de bande passante réseau et d'espace de stockage. / The quantity of data in the world is steadily increasing, bringing challenges to storage system providers, who must find ways of handling data efficiently in terms of dependability and in a cost-effective manner. We have been interested in cloud storage, which is a growing trend in data storage solutions. For instance, the International Data Corporation (IDC) predicts that by 2020, nearly 40% of the data in the world will be stored or processed in a cloud. This thesis addresses challenges around data access latency and dependability in cloud storage. We propose Mistore, a distributed storage system that we designed to ensure data availability, durability, and low access latency by leveraging the Digital Subscriber Line (xDSL) infrastructure of an Internet Service Provider (ISP). Mistore uses the available storage resources of a large number of home gateways and Points of Presence for content storage and caching facilities. Mistore also targets data consistency by providing multiple types of consistency criteria on content and a versioning system. We also consider data security and confidentiality in the context of storage systems applying data deduplication, which is becoming one of the most popular technologies to reduce storage costs, and we design a two-phase data deduplication scheme that is secure against malicious clients while remaining efficient in terms of network bandwidth and storage space savings.
|
27 |
Implementierung eines File Managers für das Hadoop Distributed Filesystem und Realisierung einer MapReduce Workflow Submission-Komponente / Implementation of a file manager for the Hadoop Distributed Filesystem and realization of a MapReduce workflow submission component. Fischer, Axel, 02 February 2018
This bachelor's thesis describes the development of a file manager for the Hadoop Distributed Filesystem (HDFS) in the context of the development of the Dedoop prototype. The file manager covers the use cases refresh, rename, move, and delete. In addition, it allows uploads from and downloads to the user's local file system. Particular attention had to be paid to the special requirements of multi-user operation. Furthermore, the thesis describes the development of a MapReduce workflow submission component for Dedoop, which is responsible for transferring and executing the workflows created by the user. Here, too, the requirements of multi-user and multi-cluster operation had to be taken into account.
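A rough sketch of the file-manager use cases against HDFS, written in Python over WebHDFS rather than the thesis's Dedoop code base; it assumes the third-party `hdfs` client package, and the namenode URL and user are hypothetical.

```python
# Illustrative only: the thesis component is part of Dedoop, not this script.
from hdfs import InsecureClient  # assumed: third-party WebHDFS client (pip install hdfs)

client = InsecureClient("http://namenode:9870", user="alice")  # hypothetical endpoint and user

def refresh(path):
    """List a directory with file status (the 'refresh' use case)."""
    return client.list(path, status=True)

def rename_or_move(src, dst):
    """HDFS treats rename and move as the same metadata operation."""
    client.rename(src, dst)

def delete(path):
    client.delete(path, recursive=True)

def upload(local_path, hdfs_dir):
    client.upload(hdfs_dir, local_path)

def download(hdfs_path, local_dir):
    client.download(hdfs_path, local_dir)
```

In a multi-user setting like Dedoop's, each of these calls would additionally need to run under the requesting user's identity and per-user home directories, which is the kind of requirement the thesis highlights.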
|
28 |
Přibližná shoda znakových řetězců a její aplikace na ztotožňování metadat vědeckých publikací / Approximate equality of character strings and its application to record linkage in metadata of scientific publications. Dobiášovský, Jan, January 2020
The thesis explores the application of approximate string matching in the scientific publication record linkage process. An introduction to record matching, along with five commonly used string distance metrics (Levenshtein, Jaro, Jaro-Winkler, and Cosine distances, and the Jaccard coefficient), is provided. These metrics are applied to publication metadata from the V3S current research information system of the Czech Technical University in Prague. Based on the findings, optimal thresholds under the F1, F2, and F3 measures are determined for each metric.
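For illustration, minimal implementations of two of the five metrics (Levenshtein distance and a Jaccard coefficient over character trigrams) are sketched below; the thesis evaluates all five on V3S metadata, and the example strings here are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard_trigrams(a: str, b: str) -> float:
    """Jaccard coefficient over character 3-grams."""
    grams = lambda s: {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

print(levenshtein("record linkage", "record linkage"))                   # 0
print(jaccard_trigrams("Approximate equality", "Aproximate equality"))   # close to 1
```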
|
29 |
Free-text Informed Duplicate Detection of COVID-19 Vaccine Adverse Event Reports. Turesson, Erik, January 2022
To increase medicine safety, researchers use adverse event reports to assess causal relationships between drugs and suspected adverse reactions. VigiBase, the world's largest database of such reports, collects data from numerous sources, introducing the risk of several records referring to the same case. These duplicates negatively affect the quality of the data and its analysis; thus, efforts should be made to detect and clean them automatically. Today, VigiBase holds more than 3.8 million COVID-19 vaccine adverse event reports, making deduplication a challenging problem for the existing solutions employed in VigiBase. This thesis project explores methods for this task, focusing explicitly on records with a COVID-19 vaccine. We implement Jaccard similarity, TF-IDF, and BERT to leverage the abundance of information contained in the free-text narratives of the reports. Mean pooling is applied to create sentence embeddings from word embeddings produced by a pre-trained SapBERT model fine-tuned to maximise the cosine similarity between narratives of duplicate reports. Narrative similarity is quantified by the cosine similarity between sentence embeddings. We apply a Gradient Boosted Decision Tree (GBDT) model to classify report pairs as duplicates or non-duplicates. For a better-calibrated model, logistic regression fine-tunes the leaf values of the GBDT. In addition, the model successfully implements a ruleset to find reports whose narratives mention a unique identifier of their duplicate. The best-performing model achieves 73.3% recall and zero false positives on a controlled testing dataset, for an F1-score of 84.6%, vastly outperforming the 60.1% F1-score of the model previously implemented in VigiBase. Further, when manually assessed by three reviewers, it reached an average precision of 87% when fully deduplicating 11,756 reports among records relating to hearing disorders.
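A minimal sketch of the TF-IDF narrative-similarity signal described above, assuming scikit-learn is available; the thesis combines such signals with SapBERT embeddings and a GBDT classifier, whereas the narratives and the threshold below are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

narratives = [                                              # hypothetical report narratives
    "Patient developed tinnitus two days after the second vaccine dose.",
    "Two days after dose 2 the patient reported tinnitus.",
    "Mild fever and fatigue resolved within 24 hours.",
]

tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
vectors = tfidf.fit_transform(narratives)
similarity = cosine_similarity(vectors)                     # pairwise narrative similarity

THRESHOLD = 0.3                                             # illustrative cut-off, not the tuned model
for i in range(len(narratives)):
    for j in range(i + 1, len(narratives)):
        if similarity[i, j] >= THRESHOLD:
            print(f"reports {i} and {j} are duplicate candidates ({similarity[i, j]:.2f})")
```

Narratives describing the same case share many content terms and score far above unrelated ones, which is the intuition the thesis carries over to embedding-based similarity.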
|
30 |
The Cost of Confidentiality in Cloud Storage. Henziger, Eric, January 2018
Cloud storage services allow users to store and access data in a secure and flexible manner. In recent years, cloud storage services have seen rapid growth in popularity as well as in technological progress, and hundreds of millions of users use these services to store thousands of petabytes of data. Additionally, the synchronization of data that is essential for these types of services accounts for a significant share of total internet traffic. In this thesis, seven cloud storage applications were tested under controlled experiments during the synchronization process to determine feature support and measure performance metrics. Special focus was put on comparing applications that perform client-side encryption of user data to applications that do not. The results show great variation in feature support and performance between the different applications, and that client-side encryption introduces some limitations to other features but does not necessarily impact performance negatively. The results provide insights and enhance the understanding of the advantages and disadvantages that come with certain design choices of cloud storage applications. These insights will help future technological development of cloud storage services.
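One limitation hinted at above can be illustrated directly: client-side encrypted data looks random to the synchronization layer, so server-side compression (and, similarly, deduplication) gains largely disappear. The snippet below is a toy demonstration, not the thesis's measurement setup.

```python
import os
import zlib

plain = b"The quick brown fox jumps over the lazy dog. " * 200   # compressible user data
pseudo_cipher = os.urandom(len(plain))                           # ciphertext is indistinguishable from random bytes

# The plaintext shrinks dramatically; the random stand-in for ciphertext does not compress at all.
print(len(plain), len(zlib.compress(plain)), len(zlib.compress(pseudo_cipher)))
```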
|