1

RDSS: A Reliable and Efficient Distributed Storage System

Li, Xiaodong January 2004 (has links)
No description available.
2

Distributed large-scale data storage and processing

Papailiopoulos, Dimitrios 16 March 2015 (has links)
This thesis makes progress towards the fundamental understanding of heterogeneous and dynamic information systems and the way that we store and process massive data-sets. Reliable large-scale data storage: Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. However, traditional erasure codes are associated with high repair cost that is often considered an unavoidable price to pay. In this thesis, we show how to overcome these limitations. We construct novel families of erasure codes that are optimal under various repair cost metrics, while achieving the best possible reliability. We show how these modern storage codes significantly outperform traditional erasure codes. Low-rank approximations for large-scale data processing: A central goal in data analytics is extracting useful and interpretable information from massive data-sets. A challenge that arises from the distributed and large-scale nature of the data at hand is having algorithms that are good in theory but can also scale up gracefully to large problem sizes. Using ideas from prior work, we develop a scalable low-rank optimization framework with provable guarantees for problems like the densest k-subgraph (DkS) and sparse PCA. Our experimental findings indicate that this low-rank framework can outperform the state-of-the-art, by offering higher quality and more interpretable solutions, and by scaling up to problem inputs with billions of entries.
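
As a minimal illustration of the replication-versus-erasure-coding tradeoff mentioned in the abstract (a sketch, not the code families constructed in the thesis), the following encodes a block with a single XOR parity: storage overhead drops from 3x to (k+1)/k, but repairing one lost block must read k survivors instead of one.

```python
# Minimal (k+1, k) single-parity erasure code: tolerates one lost block.
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal chunks and append one XOR parity chunk."""
    assert len(data) % k == 0, "pad data to a multiple of k first"
    size = len(data) // k
    chunks = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_blocks, chunks)
    return chunks + [parity]          # n = k + 1 blocks, overhead (k+1)/k

def repair(blocks: list[bytes | None]) -> list[bytes]:
    """Recover a single lost block by XOR-ing the k survivors.
    Note the repair cost: k blocks must be read, versus 1 for replication."""
    lost = blocks.index(None)
    survivors = [b for b in blocks if b is not None]
    blocks[lost] = reduce(xor_blocks, survivors)
    return blocks

blocks = encode(b"abcdefgh", k=4)     # 1.25x storage vs 3x for replication
blocks[2] = None                      # simulate one failed storage node
repaired = repair(blocks)
assert b"".join(repaired[:4]) == b"abcdefgh"
```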
3

Amélioration de la prédictibilité des performances pour les environnements de stockage de données dans les nuages / Improving Performance Predictability in Cloud Data Stores

Jaiman, Vikas 30 April 2019 (has links)
Today, users of interactive services such as e-commerce and web search have increasingly high expectations on the performance and responsiveness of these services. Indeed, studies have shown that a slow service (even for short periods of time) directly impacts revenue. Enforcing predictable performance has thus been a priority of major service providers in the last decade. But avoiding latency variability in distributed storage systems is challenging, since end-user requests go through hundreds of servers, and performance hiccups at any of these servers may inflate the observed latency. Even in well-provisioned systems, factors such as contention on shared resources or load imbalance between servers affect the latencies of requests, and in particular the tail (95th and 99th percentile) of their distribution.
The goal of this thesis is to develop mechanisms for reducing latencies and achieving performance predictability in cloud data stores. One effective countermeasure for reducing tail latency in cloud data stores is to provide efficient replica selection algorithms. In replica selection, a request attempting to access a given piece of data (also called a value) identified by a unique key is directed to the presumably best replica. However, under heterogeneous workloads, these algorithms lead to increased latencies for requests with short execution times that get scheduled behind requests with long execution times. We propose Héron, a replica selection algorithm that supports workloads with heterogeneous request execution times. We evaluate Héron in a cluster of machines using a synthetic dataset inspired by the Facebook dataset as well as two real datasets from Flickr and WikiMedia. Our results show that Héron outperforms state-of-the-art algorithms by reducing both median and tail latency by up to 41%. In the second contribution of the thesis, we focus on multiget workloads to reduce latency in cloud data stores. The challenge is to estimate the bottleneck operations and schedule them on uncoordinated backend servers with minimal overhead. To reach this objective, we present TailX, a task-aware multiget scheduling algorithm that reduces tail latencies under heterogeneous workloads. We implement TailX in Cassandra, a widely used key-value store. The result is an improved overall performance of cloud data stores for a wide variety of heterogeneous workloads.
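
As a sketch of the replica selection idea (hypothetical structures and a plausible scoring rule, not Héron's actual algorithm), one can steer short requests away from replicas currently busy with long ones:

```python
# Size-aware replica selection sketch: avoid head-of-line blocking by not
# sending short requests to replicas already serving a large value.
from dataclasses import dataclass

LONG_VALUE_BYTES = 1 << 20            # assumed threshold for a "long" request

@dataclass
class Replica:
    name: str
    outstanding: int = 0              # in-flight requests on this replica
    serving_long: bool = False        # currently serving a large value?

def select_replica(replicas: list[Replica], value_size: int) -> Replica:
    """Pick the replica with the fewest outstanding requests, but steer
    short requests away from replicas stuck behind a long one."""
    short = value_size < LONG_VALUE_BYTES
    def score(r: Replica) -> tuple:
        # For short requests, penalize replicas serving a long request first;
        # break ties by queue length.
        return (short and r.serving_long, r.outstanding)
    return min(replicas, key=score)

replicas = [Replica("r1", 3, True), Replica("r2", 5), Replica("r3", 4)]
assert select_replica(replicas, 100).name == "r3"      # short avoids r1
assert select_replica(replicas, 2 << 20).name == "r1"  # long takes shortest queue
```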
4

Secure Store: A Secure Distributed Storage Service

Lakshmanan, Subramanian 12 August 2004 (has links)
As computers become pervasive in environments that include the home and community, new applications are emerging that will create and manipulate sensitive and private information. These applications span systems ranging from personal to mobile and handheld devices. They would benefit from a data storage service that protects the integrity and confidentiality of the stored data and is highly available. Such a data repository would have to meet the needs of a variety of applications, handling data with varying security and performance requirements. Providing both high levels of security and high levels of performance simultaneously may not be possible when many nodes in the system are under attack. The agility approach to building secure distributed services advocates the principle that the overhead of providing strong security guarantees should be incurred only by those applications that require such high levels of security, and only at times when it is necessary to defend against high threat levels. A storage service that is designed for a variety of applications must follow the principles of agility, offering applications a range of options to choose from for their security and performance requirements. This research presents secure store, a secure and highly available distributed store to meet the performance and security needs of a variety of applications. Secure store is designed to guarantee integrity, confidentiality and availability of stored data even in the face of a limited number of compromised servers. Secure store is designed based on the principles of agility. Secure store integrates two well-known techniques, namely replication and secret-sharing, and exploits the tradeoffs that exist between security and performance to offer applications a range of options to choose from to suit their needs. This thesis makes several contributions, including (1) an illustration of the principles of agility, (2) a novel gossip-style secure dissemination protocol whose performance is comparable to the best possible benign-case protocol in the absence of any malicious activity, (3) a demonstration of the performance benefits of using weaker consistency models for data access, and (4) a technique called collective endorsement that can be used in other secure distributed applications.
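
A minimal sketch of the secret-sharing ingredient, assuming a simple n-of-n XOR scheme (Secure Store combines replication with secret-sharing and offers richer tradeoffs than this illustration):

```python
# n-of-n XOR secret sharing: all n shares are needed to reconstruct,
# and any n-1 shares reveal nothing about the secret.
import secrets

def split(secret: bytes, n: int) -> list[bytes]:
    """Split a secret into n shares: n-1 random pads plus one masked share."""
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    last = secret
    for s in shares:
        last = bytes(a ^ b for a, b in zip(last, s))
    return shares + [last]

def combine(shares: list[bytes]) -> bytes:
    """XOR all shares together to recover the secret."""
    out = bytes(len(shares[0]))
    for s in shares:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

shares = split(b"account-record-42", 5)   # store each share on a different server
assert combine(shares) == b"account-record-42"
```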
5

A Distributed Pool Architecture for Genetic Algorithms

Roy, Gautam December 2009 (has links)
The genetic algorithm paradigm is a well-known heuristic for solving many problems in science and engineering in which candidate solutions, or “individuals”, are manipulated in ways analogous to biological evolution, to produce new solutions until one with the desired quality is found. As problem sizes increase, a natural question is how to exploit advances in distributed and parallel computing to speed up the execution of genetic algorithms. This thesis proposes a new distributed architecture for genetic algorithms, based on distributed storage of the individuals in a persistent pool. Processors extract individuals from the pool in order to perform the computations and then insert the resulting individuals back into the pool. Unlike previously proposed approaches, the new approach is tailored for distributed systems in which processors are loosely coupled, failure-prone and can run at different speeds. Proof-of-concept simulation results are presented for four benchmark functions and for a real-world Product Lifecycle Design problem. We have experimented with both the crash failure model and the Byzantine failure model. The results indicate that the approach can deliver improved performance due to the distribution and tolerates a large fraction of processor failures subject to both models.
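
A worker's step in the pool model might look like the following sketch (illustrative only; the thesis's simulations, including the crash and Byzantine failure models, are more elaborate):

```python
# Pool-based GA sketch: loosely coupled workers repeatedly draw individuals
# from a shared persistent pool, recombine and mutate them, and write
# offspring back, so a slow or crashed worker never blocks the others.
import random

def worker_step(pool: list[list[int]], fitness, mutation_rate: float = 0.05):
    """One asynchronous step: sample two parents, recombine, mutate,
    and insert the child back, evicting a weaker individual."""
    p1, p2 = random.sample(pool, 2)
    cut = random.randrange(1, len(p1))             # one-point crossover
    child = p1[:cut] + p2[cut:]
    child = [g ^ 1 if random.random() < mutation_rate else g for g in child]
    i, j = random.sample(range(len(pool)), 2)      # tournament eviction
    victim = min((i, j), key=lambda idx: fitness(pool[idx]))
    pool[victim] = child

# Toy run on OneMax (maximize the number of 1-bits):
fitness = sum
pool = [[random.randint(0, 1) for _ in range(32)] for _ in range(50)]
for _ in range(5000):
    worker_step(pool, fitness)
print(max(map(fitness, pool)))                     # approaches 32
```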
6

Implementing Distributed Storage System by Network Coding in Presence of Link Failure

Chareonvisal, Tanakorn January 2012 (has links)
Nowadays, the increasing use of multimedia applications, e.g., video and voice over IP, social networks and email, poses higher demands on server storage and bandwidth in networks. There is a concern that existing resources may not be able to support these higher demands reliably. Network coding was introduced to improve distributed storage systems. This thesis proposes ways to improve a distributed storage system, such as increasing the chance of recovering data when a storage node or a link fails in the network. In this thesis, we study the concept of network coding in distributed storage systems. We start our description with a simple code, replication, and then move to more complex codes such as erasure codes. We then implement these concepts in our test bed and measure performance by the probability of success of download and repair operations. Moreover, we compare the probability of successful reconstruction of the original data between the minimum storage regenerating (MSR) and minimum bandwidth regenerating (MBR) methods, and we increase the field size to increase the probability of success. Finally, link failures were added to the test bed to measure the reliability of the network. The results are analyzed and show that using maximum distance separable codes and increasing the field size can improve the performance of a network. They also show improved reliability of the network in case a link fails during the repair process.
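
For context, the MSR/MBR comparison refers to the two extreme points of the standard storage versus repair-bandwidth tradeoff for (n, k, d) regenerating codes; these formulas come from the regenerating-codes literature (Dimakis et al.), not from the thesis itself. For a file of size M:

```latex
% Per-node storage \alpha and repair bandwidth \gamma at the two extremes
% of the (n, k, d) regenerating-codes tradeoff, for a file of size M:
\alpha_{\mathrm{MSR}} = \frac{M}{k}, \qquad
\gamma_{\mathrm{MSR}} = \frac{M d}{k\,(d - k + 1)}
\quad \text{(minimum storage)}

\alpha_{\mathrm{MBR}} = \gamma_{\mathrm{MBR}} = \frac{2 M d}{k\,(2d - k + 1)}
\quad \text{(minimum repair bandwidth)}
```

MBR nodes store more than M/k each but repair a failed node with the least possible network traffic, which is exactly the tradeoff the test-bed measurements above explore.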
7

Exploitation du contenu pour l'optimisation du stockage distribué / Leveraging content properties to optimize distributed storage systems

Kloudas, Konstantinos 06 March 2013 (has links)
Cloud service providers, social networks and data-management companies are witnessing a tremendous increase in the amount of data they receive every day. All this data creates new opportunities to expand human knowledge in fields like healthcare and human behavior, and to improve offered services like search, recommendation, and many others. It is not by accident that many academics, as well as the public media, refer to our era as the “Big Data” era. But these huge opportunities come with a requirement for better data-management systems that, on the one hand, can safely accommodate this huge and constantly increasing volume of data and, on the other, serve it in a timely and useful manner so that applications can benefit from processing it. This document focuses on these two challenges that come with “Big Data”. In more detail, we study (i) backup storage systems as a means to safeguard data against a number of factors that may render them unavailable, and (ii) data placement strategies on geographically distributed storage systems, with the goal of reducing user-perceived latencies while utilizing network and storage resources efficiently. Throughout our study, data is placed at the centre of our design choices, as we try to leverage content properties for both placement and efficient storage.
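
One standard way to leverage content properties in backup storage, shown here only as an illustrative sketch rather than the design developed in the thesis, is content-addressed deduplication, where identical chunks are stored once and referenced by their hash:

```python
# Content-addressed deduplication sketch: a backup is stored as a "recipe"
# of chunk hashes; repeated content costs no extra storage.
import hashlib

class DedupStore:
    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}    # hash -> unique chunk

    def put(self, data: bytes) -> list[str]:
        """Store data as a recipe of chunk hashes; duplicates are free."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)  # keep only the first copy
            recipe.append(h)
        return recipe

    def get(self, recipe: list[str]) -> bytes:
        return b"".join(self.chunks[h] for h in recipe)

store = DedupStore(chunk_size=8)
r1 = store.put(b"hello world" * 16)           # first backup
r2 = store.put(b"hello world" * 16)           # identical second backup
assert store.get(r2) == b"hello world" * 16
print(len(store.chunks))                      # far fewer chunks than two backups
```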
8

Engineering and legal aspects of a distributed storage flood mitigation system in Iowa

Baxter, Travis 01 December 2011 (has links)
This document presents a sketch of the engineering and legal considerations necessary to implement a distributed storage flood mitigation system in Iowa. It first presents the results of a simulation done to assess the advantages of active storage reservoirs over passive reservoirs for flood mitigation. Next, it considers how forecasts improve the operation of a single reservoir in preventing floods. After demonstrating the effectiveness of accurate forecasts on a single active storage reservoir, the thesis moves on to a discussion of distributed storage, with the idea that the advantages of active reservoirs with accurate forecasting could be applied to the distributed storage system. The analysis of distributed storage begins with a determination of suitable locations for reservoirs in the Clear Creek Watershed, near Coralville, Iowa, using two separate algorithms. The first algorithm selected reservoirs based on the highest average reservoir depth, while the second located reservoirs based on maximizing the storage in two specific travel bands within the watershed. This paper also discusses the results of a land cover analysis on the reservoirs, determining that, based on the land cover inundated, several reservoirs would cause too much damage to be practical. The ultimate goal of a distributed storage system is to use the reservoirs to protect an urban area from significant flood damage. For this thesis, the Clear Creek data were extrapolated to the Cedar River basin with the intention of evaluating the feasibility, and gaining a rough approximation of the requirements, of a distributed storage system to protect Cedar Rapids. Discussion then centered on an approximation of the distributed storage system that could have prevented the catastrophic Flood of 2008 in Cedar Rapids. There is significant potential for a distributed storage system to be a cost-effective way of protecting Cedar Rapids from future flooding on the scale of the Flood of 2008. However, more analysis is needed to determine more accurately the costs and benefits of a distributed storage system in the Cedar River basin. This paper also recommends that a large-scale distributed storage system be controlled by an entity created within the Iowa Department of Natural Resources, while a smaller distributed storage system could be managed by a soil and water conservation subdistrict. Iowa allows for condemnation of the land needed for the gate structures and the flowage easements necessary to build and operate a distributed storage system. Finally, this paper discusses the environmental law concerns with a distributed storage system, particularly the Clean Water Act requirement for a National Pollutant Discharge Elimination System permit.
9

Managing Applications and Data in Distributed Computing Infrastructures

Toor, Salman Zubair January 2012 (has links)
During the last decades the demand for large-scale computational and storage resources in science has increased dramatically. New computational infrastructures enable scientists to enter a new mode of science, e-science, which complements traditional theory and experiments. E-science is inherently interdisciplinary, involving researchers from several disciplines, and also opens up for large-scale collaborative efforts where physically distributed groups of scientists share software tools and data to make scientific progress. Within the field of e-science, new challenges are emerging in managing large-scale distributed computing efforts and distributed data sets. Different models, e.g. grids and clouds, have been introduced over the years, but new solutions built on these models are needed to enable easy and flexible use of distributed computing infrastructures by application scientists. In the first part of the thesis, application execution environments are studied. The goal is to hide technical details of the underlying distributed computing infrastructure and expose secure and user-friendly environments to the end users. First, a general-purpose solution using portal technology is described, enabling transparent and easy usage of a variety of grid systems. Then a problem-solving environment for genetic analysis is presented. Here the statistical software R is used as a workflow engine, enhanced with grid-enabled routines for performing the computationally demanding parts of the analysis. Finally, the issue of resource allocation in grid systems is briefly studied and certain modifications to the distributed resource-brokering model for the ARC middleware are proposed. The second part of the thesis presents solutions for managing and analyzing scientific data using distributed storage resources. First, a new reliable and secure file-oriented distributed storage system, Chelonia, is presented. The architectural design of the system is described and implementation issues are considered. The stability and scalable performance of Chelonia are also verified using several test scenarios. Then, tools for providing an efficient and easy-to-use platform for data analysis built on Chelonia are presented. Here, a database-driven approach is explored. An extended architecture where Chelonia is combined with the Web-Service MEDiator (WSMED) system is implemented, providing web service tools to query data without any further programming. This approach is then developed further and Chelonia is combined with SciSPARQL, a query language that extends SPARQL to queries over numeric scientific data. This results in a system that is capable of interactive analysis of distributed data sets. Advanced application-specific analysis requirements can be fulfilled by writing customized modules in Java, Python or C. The viability of the approach is demonstrated by applying the system to data produced by URDME, a computational environment for systems biology, and results for sample queries expressed in SciSPARQL are presented. Finally, the use of an open-source storage cloud, OpenStack Swift, for analysis of data from CERN experiments is considered. Here, a pilot implementation for the ROOT data analysis framework is presented together with a performance evaluation.
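
A rough sketch of staging data through Swift, in the spirit of the ROOT pilot mentioned above (this assumes the python-swiftclient package; the endpoint, credentials, and object names are placeholders, and the actual pilot implementation differs):

```python
# Upload a data file to an OpenStack Swift container and stream it back.
# Hypothetical endpoint and credentials; replace with a real Swift deployment.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://swift.example.org/auth/v1.0",   # placeholder endpoint
    user="analysis:reader",                          # placeholder account
    key="secret",                                    # placeholder key
)

conn.put_container("root-data")
with open("events.root", "rb") as f:
    conn.put_object("root-data", "run42/events.root", contents=f)

# Later, an analysis job fetches the object:
headers, body = conn.get_object("root-data", "run42/events.root")
print(headers.get("content-length"), "bytes fetched")
```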
10

Répartition des moyens complémentaires de production et de stockage dans les réseaux faiblement interconnectés ou isolés / Distribution of supplementary means of storage and production in isolated or weakly interconnected networks

Vu, Thang 14 February 2011 (has links)
This thesis concerns isolated or weakly interconnected networks (with limited power exchange), powered essentially by renewable sources. To balance production and consumption at every instant, generators and storage systems are inserted. The work focuses on two main objectives. The first is to determine an operating mode for the generators and the storage system that runs the system at minimal cost depending on the weather (forecast of renewable generation), pricing and consumption. A second optimization method, which also takes network constraints into account, is developed. The second objective is to find the best places to install these resources on the network. A good location helps reduce line losses and improve voltage quality, which limits the need to reinforce the network at critical points. The concept of distributed (or decentralized) storage is introduced. The distribution of the overall storage capacity and the choice of operating parameters for the inverters (to share the demanded power) are proposed. Simulations of an application case (the Corsica network) validate the developed tools.
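
A toy version of the first objective, operating storage at minimal cost against price and consumption forecasts, might use a greedy rule like the sketch below (illustrative only; the thesis develops a proper optimization, including network constraints, and all figures here are made up):

```python
# Greedy storage dispatch sketch: charge in the cheapest hours, discharge
# to cover load in the most expensive ones, within energy and power limits.
def dispatch(prices, net_load, capacity_kwh, power_kw):
    """Return an hourly plan of charge (+) / discharge (-) power in kW."""
    soc, plan = 0.0, []
    cheap = sorted(prices)[len(prices) // 3]        # bottom-third price cutoff
    dear = sorted(prices)[2 * len(prices) // 3]     # top-third price cutoff
    for price, load in zip(prices, net_load):
        if price <= cheap and soc < capacity_kwh:
            p = min(power_kw, capacity_kwh - soc)   # charge while cheap
        elif price >= dear and soc > 0:
            p = -min(power_kw, soc, max(load, 0))   # discharge to cover load
        else:
            p = 0.0                                 # idle at mid prices
        soc += p
        plan.append(p)
    return plan

hourly_prices = [30, 28, 25, 24, 26, 35, 48, 55, 52, 45, 40, 38]  # EUR/MWh
hourly_load = [2, 2, 1, 1, 2, 4, 6, 7, 6, 5, 4, 3]                # kW net
print(dispatch(hourly_prices, hourly_load, capacity_kwh=10, power_kw=3))
```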
