Global ETD Search

21	Cloud-native storage solutions for Kubernetes : A performance comparison Andersson, Filip January 2023 (has links) Kubernetes is a container orchestration system that has been rising in popularity in recent years. The modular nature of Kubernetes allows the use of different storage solutions, and for cloud environments, cloud-native distributed storage solutions may be attractive due to their redundant nature. There are many tools for cloud-native distributed storage available on the market today, with differing features and performance, and choosing the right one for an organisation can be difficult. Organisations using Kubernetes in cloud environments want to be as performance-efficient as possible to save on costs and resources. This study aims to offer a benchmark and analysis of some of the most popular tools, to help organisations choose the ‘best’ solution for their operational needs from a performance perspective. The benchmarks compare three cloud-native distributed storage solutions, OpenEBS, Portworx, and Rook-Ceph, on both Amazon Elastic Kubernetes Service (EKS) and Azure Kubernetes Service (AKS). As a baseline, the study also benchmarks the cloud providers' own solutions: Azure Disk Storage and Amazon Elastic Block Store. The study compares these solutions on three key metrics (bandwidth, latency, and IOPS) for both read and write performance. / There is other digital material (e.g. film, image or audio files) or models/artifacts belonging to the thesis that need to be archived. Kubernetes distributed storage storage performance benchmark cloud-native bandwidth latency IOPS Information Systems, Social aspects
22	Coding Schemes For Distributed Subspace Computation, Distributed Storage And Local Correctability Vadlamani, Lalitha 02 1900 (has links) (PDF) In this thesis, three problems are considered and new coding schemes are devised for each of them. The first is related to distributed function computation, the second to coding for distributed storage, and the third to locally correctable codes. A common theme of the first two problems is distributed computation. The first problem is motivated by the distributed function computation problem considered by Körner and Marton, where the goal is to compute the XOR of two binary sources at the receiver. It has been shown that linear encoders give better sum rates for some source distributions than the usual Slepian-Wolf scheme. We generalize this distributed function computation setting to the case where there are more than two sources and the receiver is interested in computing multiple linear combinations of the sources. Consider m random variables, each of which takes values from a finite field, associated with a certain joint probability distribution. The receiver is interested in the lossless computation of s linear combinations of the m random variables. By considering the set of all linear combinations of the m random variables as a vector space V, this problem can be interpreted as a subspace-computation problem. For this problem, we develop three increasingly refined approaches, all based on linear encoders. The first two approaches, termed the common code approach and the selected subspace approach, use a common matrix to encode all the sources. In the common code approach, the desired subspace W is computed at the receiver, whereas in the selected subspace approach, a possibly larger subspace U that contains the desired subspace is computed. 
The larger subspace U that gives the minimum sum rate is itself based on a decomposition of the vector space V into a chain of subspaces. The chain of subspaces is determined by the joint probability distribution of the m random variables and a notion of normalized measure of entropy. The third approach is a nested code approach, where all the encoding matrices are nested and the same subspace U identified in the selected subspace approach is computed. We characterize the sum rates under all three approaches. The sum rate under the nested code approach is no larger than that of either the selected subspace approach or the Slepian-Wolf approach. For a large class of joint distributions and subspaces W, the nested code scheme is shown to improve upon the Slepian-Wolf scheme. Additionally, a class of source distributions and subspaces is identified for which the nested code approach is sum-rate optimal. In the second problem, we consider a distributed storage network, where data is stored across failure-prone nodes in a network. The goal is to store data reliably and efficiently. For a required level of reliability, it is of interest both to minimise storage overhead and to perform node repair efficiently. Conventionally, replication and maximum distance separable (MDS) codes are employed in such systems. Though replication is very efficient in terms of node repair, the storage overhead is high. MDS codes have low storage overhead, but even the repair of a single failed node requires contacting a large number of nodes and downloading all their data. We consider two recently proposed coding solutions that enable efficient node repair in the case of a single node failure. The first solution, called regenerating codes, seeks to minimize the amount of data downloaded for node repair, while the second, codes with locality, attempts to minimize the number of helper nodes accessed. We extend these results in two directions. 
In the first one, we introduce the notion of codes with locality where the local codes have minimum distance greater than 2 and hence can recover a code symbol locally even in the presence of multiple erasures. These codes are termed codes with local erasure correction. We say that a code has information locality if there exists a set of message symbols, each of which is covered by local codes. A code is said to have all-symbol locality if all the code symbols are covered by local codes. An upper bound on the minimum distance of codes with information locality is presented, and codes that are optimal with respect to this bound are constructed. We make a connection between codes with local erasure correction and concatenated codes. The second direction seeks to build codes that combine the advantages of both codes with locality and regenerating codes. These codes, termed here codes with local regeneration, are codes with locality over a vector alphabet, in which the local codes themselves are regenerating codes. There are two well-known classes of regenerating codes: minimum storage regenerating (MSR) codes and minimum bandwidth regenerating (MBR) codes. We derive two upper bounds on the minimum distance of vector-alphabet codes with locality, one for the case when the local codes are MSR codes and the second for the case when the local codes are MBR codes. We also provide several optimal constructions of both classes of codes which achieve their respective minimum distance bounds with equality. The third problem deals with locally correctable codes. A block code of length n is said to be locally correctable if there exists a randomized algorithm such that any one of the coordinates of the codeword can be recovered by querying at most r coordinates, even in the presence of some fraction of errors. We study the local correctability of linear codes whose duals contain 4-designs. 
We also derive a bound relating r and the fraction of errors that can be tolerated, when each instance of the randomized algorithm is t-error correcting instead of a simple parity computation. Coding Theory Distributed Function Computation Distributed Storage Coding Locally Correctable Codes Linear Codes Encoding Regeneration Codes Error Correcting Codes Information Theory Local Erasure Correction Minimum Storage Regenerating (MSR) Codes Distributed Storage Network Distributed Subspace Computation Computer Science
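The idea of a code with locality described in the abstract above can be illustrated with a minimal sketch. It assumes the simplest baseline case, in which each local code is a single XOR parity (minimum distance 2), so one erased symbol is repaired by reading only the other symbols of its group. The function names and the group size r are hypothetical choices for illustration and are not the constructions of the thesis, which targets local codes with minimum distance greater than 2.

```python
from functools import reduce
from operator import xor


def encode_with_local_parity(data, r):
    """Split data into groups of r symbols and append one XOR parity per group.

    Assumption: symbols are small non-negative integers and len(data) is a
    multiple of r. Each group plus its parity forms one local code.
    """
    groups = [data[i:i + r] for i in range(0, len(data), r)]
    return [group + [reduce(xor, group)] for group in groups]


def repair_symbol(group, lost_index):
    """Recover one erased symbol using only the other symbols of its group.

    The XOR of all symbols in a group (parity included) is zero, so the
    missing symbol equals the XOR of the remaining ones.
    """
    helpers = [s for i, s in enumerate(group) if i != lost_index]
    return reduce(xor, helpers)


codeword = encode_with_local_parity([5, 9, 12, 7, 3, 14], r=3)
recovered = repair_symbol(codeword[0], lost_index=1)
```

Here a repair contacts r = 3 helper symbols instead of the whole codeword, which is the trade-off that codes with locality aim for; a second erasure in the same group cannot be repaired by this single-parity sketch.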
23	Autonomic management in a distributed storage system Tauber, Markus January 2010 (has links) This thesis investigates the application of autonomic management to a distributed storage system. Effects on performance and resource consumption were measured in experiments, which were carried out in a local area test-bed. The experiments were conducted with components of one specific distributed storage system, but seek to be applicable to a wide range of such systems, in particular those exposed to varying conditions. The perceived characteristics of distributed storage systems depend on their configuration parameters and on various dynamic conditions. For a given set of conditions, one specific configuration may be better than another with respect to measures such as resource consumption and performance. Here, configuration parameter values were set dynamically and the results compared with a static configuration. It was hypothesised that under non-changing conditions this would allow the system to converge on a configuration that was more suitable than any that could be set a priori. Furthermore, the system could react to a change in conditions by adopting a more appropriate configuration. Autonomic management was applied to the peer-to-peer (P2P) and data retrieval components of ASA, a distributed storage system. The effects were measured experimentally for various workload and churn patterns. The management policies and mechanisms were implemented using a generic autonomic management framework developed during this work. The motivation for both groups of experiments was to test management policies with the objective to avoid unsatisfactory situations with respect to resource consumption and performance. Such unsatisfactory situations occur when either the P2P layer or the data retrieval mechanism is configured statically. In a statically configured P2P system two unsatisfactory situations can be identified. 
The first arises when the frequency with which P2P node states are verified is low and membership churn is high. The P2P node state becomes inaccurate due to a high membership churn, leading to errors during the routing process and a reduction in performance. In this situation it is desirable to increase the frequency to increase P2P state accuracy. The converse situation arises when the frequency is high and churn is low. In this situation network resources are used unnecessarily, which may also reduce performance, making it desirable to decrease the frequency. In ASA’s data retrieval mechanism similar unsatisfactory situations can be identified with respect to the degree of concurrency (DOC). The DOC controls the eagerness with which multiple redundant replicas are retrieved. An unsatisfactory situation arises when the DOC is low and there is a large variation in the times taken to retrieve replicas. In this situation it is desirable to increase the DOC, because by retrieving more replicas in parallel a result can be returned to the user sooner. The converse situation arises when the DOC is high, there is little variation in retrieval time and there is a network bottleneck close to the requesting client. In this situation it is desirable to decrease the DOC, since the low variation removes any benefit in parallel retrieval, and the bottleneck means that decreasing parallelism reduces both bandwidth consumption and elapsed time for the user. The experimental evaluations of autonomic management show promising results, and suggest several future research topics. These include optimisations of the managed mechanisms, alternative management policies, different evaluation methods, and the application of developed management mechanisms to other facets of a distributed storage system. 
The findings of this thesis could be exploited in building other distributed storage systems that focus on harnessing storage on user workstations, since these are particularly likely to be exposed to varying, unpredictable conditions. 621.382
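The policy described in the abstract above (verify P2P node state more often under high churn, less often under low churn) can be sketched as a small feedback rule. This is only an illustration under stated assumptions: the function name, the use of a "stale fraction" as the churn signal, the thresholds and the interval bounds are all hypothetical and are not taken from the thesis or from ASA.

```python
def next_interval(current_interval, stale_fraction, low=0.05, high=0.20,
                  min_interval=1.0, max_interval=60.0):
    """Return the next state-verification interval in seconds.

    stale_fraction is an assumed metric: the share of recently checked
    peer entries that turned out to be out of date.
    """
    if stale_fraction > high:
        # High churn: state is often wrong, so verify twice as frequently.
        proposed = current_interval / 2
    elif stale_fraction < low:
        # Low churn: checks are mostly wasted, so halve the frequency.
        proposed = current_interval * 2
    else:
        proposed = current_interval
    # Keep the interval inside the configured bounds.
    return max(min_interval, min(max_interval, proposed))
```

The same shape of rule could be applied to the degree of concurrency, with the variation in replica retrieval times as the input signal, although that variant is not shown here.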
24	Etude des codes en graphes pour le stockage de données / Study of Sparse-Graph for Distributed Storage Systems Jule, Alan 07 March 2014 (has links) Depuis deux décennies, la révolution technologique est avant tout numérique entrainant une forte croissance de la quantité de données à stocker. Le rythme de cette croissance est trop important pour les solutions de stockage matérielles, provoquant une augmentation du coût de l'octet. Il est donc nécessaire d'apporter une amélioration des solutions de stockage ce qui passera par une augmentation de la taille des réseaux et par la diminution des copies de sauvegarde dans les centres de stockage de données. L'objet de cette thèse est d'étudier l'utilisation des codes en graphe dans les réseaux de stockage de données. Nous proposons un nouvel algorithme combinant construction de codes en graphe et allocation des noeuds de ce code sur le réseau. Cet algorithme permet d'atteindre les hautes performances des codes MDS en termes de rapport entre le nombre de disques de parité et le nombre de défaillances simultanées pouvant être corrigées sans pertes (noté R). Il bénéficie également des propriétés de faible complexité des codes en graphe pour l'encodage et la reconstruction des données. De plus, nous présentons une étude des codes LDPC Spatialement-Couplés permettant d'anticiper le comportement de leur décodage pour les applications de stockage de données. Il est généralement nécessaire de faire des compromis entre différents paramètres lors du choix du code correcteur d'effacement. Afin que ce choix se fasse avec un maximum de connaissances, nous avons réalisé deux études théoriques comparatives pour compléter l'état de l'art. La première étude s'intéresse à la complexité de la mise à jour des données dans un réseau dynamique établi et nous déterminons si les codes linéaires utilisés ont une complexité de mise à jour optimale. 
Dans notre seconde étude, nous nous sommes intéressés à l'impact sur la charge du réseau de la modification des paramètres du code correcteur utilisé. Cette opération peut être réalisée lors d'un changement du statut du fichier (passage d'un caractère hot à cold par exemple) ou lors de la modification de la taille du réseau. L'ensemble de ces études, associé au nouvel algorithme de construction et d'allocation des codes en graphe, pourrait mener à la construction de réseaux de stockage dynamiques, flexibles avec des algorithmes d'encodage et de décodage peu complexes. / Over the past two decades, the digital revolution has intensified. The spread of digital solutions, together with improvements in the quality of these products, is driving growth in the amount of data stored. The cost per byte shows that the evolution of hardware storage solutions cannot keep pace with this expansion. Data storage solutions therefore need substantial improvement, which can be achieved by increasing the size of the storage network and by reducing data duplication in the data center. In this thesis, we introduce a new algorithm that combines sparse-graph code construction and node allocation. This algorithm can match the high performance of MDS codes in terms of the ratio R between the number of parity disks and the number of simultaneous failures that can be recovered without loss. In addition, encoding and decoding with sparse-graph codes lowers complexity. This algorithm makes it possible to generalize coding in the data center in order to reduce the number of copies of the original data. We also study Spatially-Coupled LDPC (SC-LDPC) codes, which are known to have optimal asymptotic performance over the binary erasure channel, to anticipate the decoding behavior of these codes in distributed storage applications. It is usually necessary to compromise between different parameters in a distributed storage system. To complete the state of the art, we include two theoretical studies. 
The first study deals with the computational complexity of data updates, and we determine whether the linear codes used for data storage are update-efficient. In the second study, we examine the impact on the network load when the code parameters are changed. This can happen when the file status changes (from a hot status to a cold status, for example) or when the size of the network is modified by adding disks. All these studies, combined with the new algorithm for sparse-graph codes, could lead to the construction of new flexible and dynamic networks with low encoding and decoding complexities. Réseau de Stockage de données Codes LDPC Codes en graphe Codes Spatialement couplés Complexité Mise à Jour Extension du réseau Sparse Graph Codes Distributed Storage systems LDPC codes Spatially Coupled codes Update Complexity Load Rebalancing
25	Efficient Usage Of Flash Memories In High Performance Scenarios Srimugunthan, * 10 1900 (has links) (PDF) New PCI-e flash cards and SSDs supporting over 100,000 IOPS are now available, with several use cases in the design of high-performance storage systems. Large capacities are achieved by using an array of flash chips arranged in multiple banks. Such multi-banked architectures allow parallel read, write and erase operations. In a raw PCI-e flash card, this parallelism is directly available to the software layer. The devices also have restrictions, for example that pages within a block can only be written sequentially, and they have larger minimum write sizes (>4KB). Current flash translation layers (FTLs) in Linux are not well suited to such devices because of the high device speeds, the architectural restrictions, and other factors such as high lock contention. We present an FTL for Linux that takes the hardware restrictions into account and exploits the parallelism to achieve high speeds. We also consider leveraging the parallelism for garbage collection by scheduling garbage collection activities on idle banks. We propose and evaluate an adaptive method that varies the amount of garbage collection according to the current I/O load on the device. For large-scale distributed storage systems, flash memories are an excellent choice because they consume less power, take less floor space for a target throughput, and provide faster access to data. In a traditional distributed filesystem, even distribution is required to ensure load balancing, balanced space utilisation and failure tolerance. In the presence of flash memories, we should additionally ensure that the number of writes is evenly distributed across the different flash storage nodes, so that the nodes wear evenly and unpredictable failures of storage nodes are avoided. 
This requires that we distribute updates and perform garbage collection across the flash storage nodes. We motivate the distributed wear-levelling problem by considering the replica placement algorithm of HDFS. Viewing wear-levelling across flash storage nodes as a distributed co-ordination problem, we present an alternative design that reduces the message communication cost across participating nodes. We demonstrate the effectiveness of our design through simulation. Flash Cards Flash Memory Scaling High Performance Computing Distributed Flash Memories Word Recognition Flash Translation Layer (FTL) Distributed File System Parallelism Flash Cards Flash Cards - Garbage Collection Flash Storage Nodes Cluster Storage Distributed Storage System Computer Science
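The goal of spreading writes evenly across flash storage nodes, stated in the abstract above, can be illustrated with a minimal sketch. It assumes a centralized greedy policy that places each new block on the least-written nodes; the function name and the in-memory write counter are hypothetical. This is not the distributed co-ordination design of the thesis and not the HDFS replica placement algorithm, only a simple baseline for the wear-levelling objective.

```python
import heapq


def place_replicas(write_counts, replicas):
    """Pick the `replicas` least-written nodes and charge one write to each.

    write_counts maps a node name to the number of writes it has absorbed
    so far (an assumed, centrally known counter). Ties are broken by the
    dictionary's insertion order.
    """
    chosen = heapq.nsmallest(replicas, write_counts, key=write_counts.get)
    for node in chosen:
        write_counts[node] += 1
    return chosen


counts = {"node-a": 5, "node-b": 1, "node-c": 3}
placement = place_replicas(counts, replicas=2)
```

A real system would also have to respect capacity, rack awareness and failure domains, and would need a way to share the counters between nodes, which is where the message communication cost discussed in the abstract arises.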
26	Towards more scalability and flexibility for distributed storage systems / Vers un meilleur passage à l'échelle et une plus grande flexibilité pour les systèmes de stockage distribué Ruty, Guillaume 15 February 2019 (has links) Les besoins en termes de stockage, en augmentation exponentielle, sont difficilement satisfaits par les systèmes de stockage distribué traditionnels. Alors que les performances des disques ont rattrapé celles des cartes réseau en termes d'ordre de grandeur, leur capacité ne croît pas à la même vitesse que l'ensemble des données requérant d'être stockées, notamment à cause de l'avènement des applications de big data. Par ailleurs, l'équilibre de performances entre disques, cartes réseau et processeurs a changé et les états de fait sur lesquels se basent la plupart des systèmes de stockage distribué actuels ne sont plus vrais. Cette dissertation explique de quelle manière certains aspects de tels systèmes de stockage peuvent être modifiés et repensés pour faire une utilisation plus efficace des ressources qui les composent. Elle présente une architecture de stockage nouvelle qui se base sur une couche de métadonnées distribuée afin de fournir du stockage d'objet de manière flexible tout en passant à l'échelle. Elle détaille ensuite un algorithme d'ordonnancement des requêtes permettant à un système de stockage générique de traiter les requêtes de clients en parallèle de manière plus équitable. Enfin, elle décrit comment améliorer le cache générique du système de fichier dans le contexte de systèmes de stockage distribué basés sur des codes correcteurs avant de présenter des contributions effectuées dans le cadre de courts projets de recherche. / The exponentially growing demand for storage puts a huge stress on traditional distributed storage systems. 
While the performance of storage devices has caught up with that of network devices in the last decade, their capacity does not grow as fast as the rate of data growth, especially with the rise of cloud big data applications. Furthermore, the performance balance between storage, network and compute devices has shifted, and the assumptions that are the foundation of most distributed storage systems no longer hold. This dissertation explains how several aspects of such storage systems can be modified and rethought to make more efficient use of the resources at their disposal. It presents an original architecture that uses a distributed layer of metadata to provide flexible and scalable object-level storage, then proposes a scheduling algorithm that improves how a generic storage system handles concurrent requests. Finally, it describes how to improve legacy filesystem-level caching for erasure-code-based distributed storage systems, before presenting a few other contributions made in the context of short research projects. Stockage à définition logicielle - SDS Réseau à définition logicielle - SDN Cloud Data Centers Ordonnancement d'E/S Stockage distribué Software Defined Storage - SDS Software Defined Network - SDN Cloud Data Centers I/O scheduling Distributed storage
27	[en] NEW NETWORK SOLUTIONS AND NEXT GENERATION ENTERTAINMENT SERVICES / [pt] NOVAS SOLUÇÕES DE REDES E SERVIÇOS DE ENTRETENIMENTO DE ÚLTIMA GERAÇÃO CARLOS ALBERTO GAROFALO 28 December 2005 (has links) [pt] O principal objetivo desta dissertação consiste na proposta de implementação de uma rede de telecomunicações utilizando novas tecnologias, enfatizando as aplicações de entretenimento. As soluções adotadas foram orientadas pelas características econômicas verificadas nas áreas nobres das regiões metropolitanas brasileiras e também pelas novas tecnologias de roteamento, chaveamento, armazenamento e distribuição local. A avaliação do custo de investimento e operacional da rede, bem como a formulação de um modelo de negócios associado a uma estrutura de serviços oferecidos foram apresentadas e desenvolvidas. A construção de um plano de negócio hipotético para avaliar a relação custo-benefício resultante da utilização da infra-estrutura da rede proposta associado ao modelo e estrutura dos serviços elaborados foi implementado e executado. Quatro alternativas de implementação de rede foram avaliadas. / [en] This dissertation proposes the implementation of a telecommunications network using new technologies, with an emphasis on entertainment applications. The solutions adopted were guided by the economic characteristics observed in the upscale areas of Brazilian metropolitan regions and by new routing, switching, storage and local distribution technologies. The evaluation of the network's investment and operating costs, as well as the formulation of a business model associated with a structure of offered services, are presented and developed. A hypothetical business plan was then built and executed to evaluate the cost-benefit ratio resulting from the use of the proposed network infrastructure together with the service model and structure that were devised. 
Four alternative network implementations were evaluated and discussed. [pt] GIGABIT ETHERNET [pt] REDES MULTIMIDIA [pt] INTELIGENCIA DE REDE [pt] ARMAZENAMENTO DISTRIBUIDO [pt] IP [pt] REDES OPTICAS [pt] ENTRETENIMENTO [en] GIGABIT ETHERNET [en] MULTIMEDIA NETWORKS [en] INTELLIGENT NETWORKS [en] DISTRIBUTED STORAGE [en] IP [en] OPTICAL NETWORKS [en] ENTERTAINMENT
28	Efficient techniques for large-scale Web data management / Techniques efficaces de gestion de données Web à grande échelle Camacho Rodriguez, Jesus 25 September 2014 (has links) Le développement récent des offres commerciales autour du cloud computing a fortement influé sur la recherche et le développement des plateformes de distribution numérique. Les fournisseurs du cloud offrent une infrastructure de distribution extensible qui peut être utilisée pour le stockage et le traitement des données. En parallèle avec le développement des plates-formes de cloud computing, les modèles de programmation qui parallélisent de manière transparente l'exécution des tâches gourmandes en données sur des machines standards ont suscité un intérêt considérable, à commencer par le modèle MapReduce très connu aujourd'hui puis par d'autres frameworks plus récents et complets. Puisque ces modèles sont de plus en plus utilisés pour exprimer les tâches de traitement de données analytiques, la nécessité se fait ressentir dans l'utilisation des langages de haut niveau qui facilitent la charge de l'écriture des requêtes complexes pour ces systèmes. Cette thèse porte sur des modèles et techniques d'optimisation pour le traitement efficace de grandes masses de données du Web sur des infrastructures à grande échelle. Plus particulièrement, nous étudions la performance et le coût d'exploitation des services de cloud computing pour construire des entrepôts de données Web ainsi que la parallélisation et l'optimisation des langages de requêtes conçus sur mesure selon les données déclaratives du Web. Tout d'abord, nous présentons AMADA, une architecture d'entreposage de données Web à grande échelle dans les plateformes commerciales de cloud computing. AMADA opère comme logiciel en tant que service, permettant aux utilisateurs de télécharger, stocker et interroger de grands volumes de données Web. 
Sachant que les utilisateurs du cloud prennent en charge les coûts monétaires directement liés à leur consommation de ressources, notre objectif n'est pas seulement la minimisation du temps d'exécution des requêtes, mais aussi la minimisation des coûts financiers associés aux traitements de données. Plus précisément, nous étudions l'applicabilité de plusieurs stratégies d'indexation de contenus et nous montrons qu'elles permettent non seulement de réduire le temps d'exécution des requêtes mais aussi, et surtout, de diminuer les coûts monétaires liés à l'exploitation de l'entrepôt basé sur le cloud. Ensuite, nous étudions la parallélisation efficace de l'exécution de requêtes complexes sur des documents XML mis en œuvre au sein de notre système PAXQuery. Nous fournissons de nouveaux algorithmes montrant comment traduire ces requêtes dans des plans exprimés par le modèle de programmation PACT (PArallelization ConTracts). Ces plans sont ensuite optimisés et exécutés en parallèle par le système Stratosphere. Nous démontrons l'efficacité et l'extensibilité de notre approche à travers des expérimentations sur des centaines de Go de données XML. Enfin, nous présentons une nouvelle approche pour l'identification et la réutilisation des sous-expressions communes qui surviennent dans les scripts Pig Latin. Notre algorithme, nommé PigReuse, agit sur les représentations algébriques des scripts Pig Latin, identifie les possibilités de fusion des sous-expressions, sélectionne les meilleurs à exécuter en fonction du coût et fusionne d'autres expressions équivalentes pour partager leurs résultats. Nous apportons plusieurs extensions à l'algorithme afin d’améliorer sa performance. Nos résultats expérimentaux démontrent l'efficacité et la rapidité de nos algorithmes basés sur la réutilisation et des stratégies d'optimisation. / The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. 
Cloud providers offer a distributed, shared-nothing infrastructure that may be used for data storage and processing. In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the now well-known MapReduce model and continuing through other novel and more expressive frameworks. As these models are increasingly used to express analytical-style data processing tasks, the need arises for higher-level languages that ease the burden of writing complex queries for these systems. This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages that are tailored towards querying Web data declaratively. First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users bear monetary costs directly connected to their consumption of resources, our focus is not only on query performance from an execution time perspective, but also on the monetary costs associated with this processing. In particular, we study the applicability of several content indexing strategies, and show that they lead not only to reducing query evaluation time, but also, importantly, to reducing the monetary costs associated with the exploitation of the cloud-based warehouse. Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. 
We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data. Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions to share their results. We bring several extensions to the algorithm to improve its performance. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies. Données Web XML Stratégies Traitement des requêtes Entreposage distribué XQuery Optimisation multi-requête Pig Latin Web data XML Commercial cloud services Indexing strategies Query processing Distributed storage Query parallelization XQuery Multi-query optimization Pig Latin
29	Scalable algorithms for cloud-based Semantic Web data management / Algorithmes passant à l’échelle pour la gestion de données du Web sémantique sur les platformes cloud Zampetakis, Stamatis 21 September 2015 (has links) Afin de construire des systèmes intelligents, où les machines sont capables de raisonner exactement comme les humains, les données avec sémantique sont une exigence majeure. Ce besoin a conduit à l’apparition du Web sémantique, qui propose des technologies standards pour représenter et interroger les données avec sémantique. RDF est le modèle répandu destiné à décrire de façon formelle les ressources Web, et SPARQL est le langage de requête qui permet de rechercher, d’ajouter, de modifier ou de supprimer des données RDF. Être capable de stocker et de rechercher des données avec sémantique a engendré le développement des nombreux systèmes de gestion des données RDF. L’évolution rapide du Web sémantique a provoqué le passage de systèmes de gestion des données centralisées à ceux distribués. Les premiers systèmes étaient fondés sur les architectures pair-à-pair et client-serveur, alors que récemment l’attention se porte sur le cloud computing. Les environnements de cloud computing ont fortement impacté la recherche et développement dans les systèmes distribués. Les fournisseurs de cloud offrent des infrastructures distribuées autonomes pouvant être utilisées pour le stockage et le traitement des données. Les principales caractéristiques du cloud computing impliquent l’évolutivité, la tolérance aux pannes et l’allocation élastique des ressources informatiques et de stockage en fonction des besoins des utilisateurs. Cette thèse étudie la conception et la mise en œuvre d’algorithmes et de systèmes passant à l’échelle pour la gestion des données du Web sémantique sur des platformes cloud.
Plus particulièrement, nous étudions la performance et le coût d’exploitation des services de cloud computing pour construire des entrepôts de données du Web sémantique, ainsi que l’optimisation de requêtes SPARQL pour les cadres massivement parallèles. Tout d’abord, nous introduisons les concepts de base concernant le Web sémantique et les principaux composants des systèmes fondés sur le cloud. En outre, nous présentons un aperçu des systèmes de gestion des données RDF (centralisés et distribués), en mettant l’accent sur les concepts critiques de stockage, d’indexation, d’optimisation des requêtes et d’infrastructure. Ensuite, nous présentons AMADA, une architecture de gestion de données RDF utilisant les infrastructures de cloud public. Nous adoptons le modèle de logiciel en tant que service (software as a service - SaaS), où la plateforme réside dans le cloud et des APIs appropriées sont mises à disposition des utilisateurs, afin qu’ils soient capables de stocker et de récupérer des données RDF. Nous explorons diverses stratégies de stockage et d’interrogation, et nous étudions leurs avantages et inconvénients au regard de la performance et du coût monétaire, qui est une nouvelle dimension importante à considérer dans les services de cloud public. Enfin, nous présentons CliqueSquare, un système distribué de gestion des données RDF basé sur Hadoop. CliqueSquare intègre un nouvel algorithme d’optimisation qui est capable de produire des plans massivement parallèles pour des requêtes SPARQL. Nous présentons une famille d’algorithmes d’optimisation, s’appuyant sur les équijointures n-aires pour générer des plans plats, et nous comparons leur capacité à trouver les plans les plus plats possibles. Inspirés par des techniques de partitionnement et d’indexation existantes, nous présentons une stratégie de stockage générique appropriée au stockage de données RDF dans HDFS (Hadoop Distributed File System).
Nos résultats expérimentaux valident l’effectivité et l’efficacité de l’algorithme d’optimisation démontrant également la performance globale du système. / In order to build smart systems, where machines are able to reason exactly like humans, data with semantics is a major requirement. This need led to the advent of the Semantic Web, proposing standard ways for representing and querying data with semantics. RDF is the prevalent data model used to describe web resources, and SPARQL is the query language that allows expressing queries over RDF data. Being able to store and query data with semantics triggered the development of many RDF data management systems. The rapid evolution of the Semantic Web provoked the shift from centralized data management systems to distributed ones. The first systems to appear relied on P2P and client-server architectures, while recently the focus moved to cloud computing. Cloud computing environments have strongly impacted research and development in distributed software platforms. Cloud providers offer distributed, shared-nothing infrastructures that may be used for data storage and processing. The main features of cloud computing involve scalability, fault-tolerance, and elastic allocation of computing and storage resources following the needs of the users. This thesis investigates the design and implementation of scalable algorithms and systems for cloud-based Semantic Web data management. In particular, we study the performance and cost of exploiting commercial cloud infrastructures to build Semantic Web data repositories, and the optimization of SPARQL queries for massively parallel frameworks. First, we introduce the basic concepts around Semantic Web and the main components and frameworks interacting in massively parallel cloud-based systems.
In addition, we provide an extended overview of existing RDF data management systems in the centralized and distributed settings, emphasizing the critical concepts of storage, indexing, query optimization, and infrastructure. Second, we present AMADA, an architecture for RDF data management using public cloud infrastructures. We follow the Software as a Service (SaaS) model, where the complete platform is running in the cloud and appropriate APIs are provided to the end-users for storing and retrieving RDF data. We explore various storage and querying strategies, revealing pros and cons with respect to performance and also to monetary cost, which is an important new dimension to consider in public cloud services. Finally, we present CliqueSquare, a distributed RDF data management system built on top of Hadoop, incorporating a novel optimization algorithm that is able to produce massively parallel plans for SPARQL queries. We present a family of optimization algorithms, relying on n-ary (star) equality joins to build flat plans, and compare their ability to find the flattest possible plans. Inspired by existing partitioning and indexing techniques, we present a generic storage strategy suitable for storing RDF data in HDFS (Hadoop’s Distributed File System). Our experimental results validate the efficiency and effectiveness of the optimization algorithm, also demonstrating the overall performance of the system. Web sémantique RDF Stratégies d’indexation Systèmes distribués Stockage distribué Traitement des requêtes Optimisation des requêtes MapReduce Hadoop HDFS CliqueSquare AMADA Gestion des données RDF Jointures n-aires Plans plats Semantic Web RDF Commercial cloud services Indexing strategies Distributed systems Distributed storage Query processing Query optimization Query parallelization MapReduce Hadoop HDFS CliqueSquare AMADA RDF data management N-ary joins Flat plans
30	Étude des problèmes d’ordonnancement sur des plates-formes hétérogènes en modèle multi-port Rejeb, Hejer 30 August 2011 (has links) Les travaux menés dans cette thèse concernent les problèmes d'ordonnancement sur des plates-formes de calcul dynamiques et hétérogènes et s'appuient sur le modèle de communication "multi-port" pour les communications. Nous avons considéré le problème de l'ordonnancement des tâches indépendantes sur des plates-formes maîtres-esclaves, dans les contextes statique et dynamique. Nous nous sommes également intéressé au problème de la redistribution de fichiers répliqués dans le cadre de l'équilibrage de charge. Enfin, nous avons étudié l'importance des mécanismes de partage de bande passante pour obtenir une meilleure efficacité du système. / The results presented in this document deal with scheduling problems on dynamic and heterogeneous computing platforms under the "multiport" communication model. We have considered the problem of scheduling independent tasks on master-slave platforms, in both offline and online contexts. We have also proposed algorithms for the redistribution of replicated files to achieve load balancing. Finally, we have studied the importance of bandwidth sharing mechanisms to achieve better efficiency. Ordonnancement Modèle multi-port Re-construction des fichiers Systèmes distribués Système de stockage distribué Qualité de service TCP Partage de bande passante Re-configuration de système Tâches indépendantes Diffusion Scheduling Multiport Model Distributed systems Distributed Storage System Quality of Service TCP Bandwidth Sharing System Reconfiguration Independent Tasks Scheduling Broadcasting
