Global ETD Search

81	Towards a Framework for DHT Distributed Computing Rosen, Andrew 12 August 2016 (has links) Distributed Hash Tables (DHTs) are protocols and frameworks used by peer-to-peer (P2P) systems. They are used as the organizational backbone for many P2P file-sharing systems due to their scalability, fault-tolerance, and load-balancing properties. These same properties are highly desirable in a distributed computing environment, especially one that wants to use heterogeneous components. We show that DHTs can be used not only as the framework to build a P2P file-sharing service, but also as a P2P distributed computing platform. We propose creating a P2P distributed computing framework using distributed hash tables, based on our prototype system ChordReduce. This framework would make it simple and efficient for developers to create their own distributed computing applications. Unlike Hadoop and similar MapReduce frameworks, our framework can be used both in the context of a datacenter and as part of a P2P computing platform. This opens up new possibilities for building platforms to address distributed computing problems. One advantage our system will have is an autonomous load-balancing mechanism. Nodes will be able to independently acquire work from other nodes in the network, rather than sitting idle. More powerful nodes in the network will be able to use the mechanism to acquire more work, exploiting the heterogeneity of the network. By utilizing the load-balancing algorithm, a datacenter could easily leverage additional P2P resources at runtime on an as-needed basis. Our framework will allow MapReduce-like or distributed machine learning platforms to be easily deployed in a greater variety of contexts. Distributed Hash Tables MapReduce Distributed Computing Load Balancing Cryptographic Hash Functions
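As a rough illustration of how a DHT can hand out work, the sketch below places nodes and tasks on a small identifier ring with a cryptographic hash and gives each task to its successor node, in the style of Chord. It is a minimal sketch under stated assumptions, not ChordReduce itself: the 16-bit ring size and the names `ring_id`, `successor` and `assign` are hypothetical and chosen only for this example.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 16  # assumption: a tiny identifier space, for illustration only


def ring_id(name):
    """Hash a node or task name onto the identifier ring using SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


def successor(sorted_node_ids, key_id):
    """Return the first node id at or after key_id, wrapping around the ring."""
    idx = bisect_left(sorted_node_ids, key_id)
    return sorted_node_ids[idx % len(sorted_node_ids)]


def assign(tasks, nodes):
    """Map every task name to the node responsible for its ring position."""
    owner_of = {ring_id(n): n for n in nodes}
    ids = sorted(owner_of)
    return {t: owner_of[successor(ids, ring_id(t))] for t in tasks}
```

Because placement depends only on hash values, a node that joins the ring takes over part of one neighbour's range without any central coordinator, which is the property the abstract relies on for load balancing.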
82	Forest aboveground biomass and carbon mapping with computational cloud Guan, Aimin 26 April 2017 (has links) In the last decade, advances in sensor and computing technology have been revolutionary. The latest generation of hyperspectral and synthetic aperture radar (SAR) instruments has increased spectral, spatial, and temporal resolution. Consequently, the data sets collected are increasing rapidly in size and frequency of acquisition, and remote sensing applications require more computing resources for data analysis. High performance computing (HPC) infrastructure, such as clusters, distributed networks, grids, clouds and specialized hardware components, has been used to disseminate large volumes of remote sensing data and to accelerate the processing of raw images and the extraction of information from remote sensing data. In previous research we have shown that we can improve the computational efficiency of a hyperspectral image denoising algorithm by parallelizing the algorithm on a distributed computing grid. In recent years, computational cloud technology has emerged, bringing more flexibility and simplicity to data processing. Hadoop MapReduce is a software framework for distributed commodity computing clusters, allowing parallel processing of massive datasets. In this project, we implement a software application to map forest aboveground biomass (AGB) with normalized difference vegetation indices (NDVI) using Landsat Thematic Mapper’s bands 4 and 5 (ND45). We present observations and experimental results on the performance and the algorithmic complexity of the implementation. This thesis answers three research questions. 1) How do we implement remote sensing algorithms, such as forest AGB mapping, in a computer cloud environment? 2) What are the requirements to implement distributed processing of remote sensing images using the cloud programming model? 
3) What is the performance increase for large area remote sensing image processing in a cloud environment? / Graduate / 0799 / 0984 Remote Sensing Computational Cloud Forest Above Ground Biomass Carbon Mapping Hadoop MapReduce Parallel Computing Computational Grid
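The per-pixel index computation described in this abstract fits the map/reduce pattern naturally. The sketch below assumes that ND45 has the usual normalized-difference form (band 4 minus band 5, divided by their sum); the tile layout, the zero-division convention and the function names are hypothetical, and the regression from ND45 to biomass is not given in the abstract, so it is left out.

```python
def nd45(band4, band5):
    """Normalized difference of Landsat TM band 4 and band 5 for one pixel."""
    total = band4 + band5
    if total == 0:
        return 0.0  # assumption: pixels with no signal are reported as 0
    return (band4 - band5) / total


def map_tile(tile):
    """Map step: compute ND45 for every pixel of one tile.

    A tile is (tile_id, [(band4, band5), ...]).
    """
    tile_id, pixels = tile
    return tile_id, [nd45(b4, b5) for b4, b5 in pixels]


def reduce_tiles(mapped_tiles):
    """Reduce step: collect the per-tile results into one mapping by tile id."""
    return dict(mapped_tiles)
```

Each tile is processed independently, so the map step can run on as many workers as there are tiles.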
83	[en] DISTRIBUTED RDF GRAPH KEYWORD SEARCH / [pt] BUSCA DISTRIBUÍDA EM GRAFO RDF POR PALAVRA-CHAVE DANILO MORET RODRIGUES 26 December 2014 (has links) [en] The goal of this dissertation is to improve RDF keyword search. We propose a scalable approach, based on a tensor representation, that allows for distributed storage and thus the use of parallel techniques to speed up search over large RDF datasets, in particular those published as Linked Data. An unprecedented amount of information is becoming available following the principles of Linked Data, forming what is called the Web of Data. This information, typically encoded as RDF subject-predicate-object triples, is commonly abstracted as a graph in which subjects and objects are nodes, and predicates are edges connecting them. As a consequence of the widespread adoption of search engines on the World Wide Web, users are familiar with keyword search. For RDF graphs, however, extracting a coherent subset of the graph to enrich search results is a time-consuming and expensive task, and users expect it to be executed on the fly. This dissertation aims to address that problem. A recent proposal indexes RDF graphs as a sparse matrix holding the pre-computed information needed for faster retrieval of sub-graphs, and uses tensor-based queries over the sparse matrix. The tensor approach can leverage modern distributed computing techniques, e.g., non-relational database sharding and the MapReduce model. In this dissertation, we propose a design and explore the viability of the tensor-based approach to build a distributed datastore and speed up keyword search with a parallel approach. [pt] LINKED DATA [en] LINKED DATA [pt] MAPREDUCE [pt] CLOUD COMPUTING [pt] KEYWORD SEARCH
84	Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance / Unsupervised learning of massive data streams : application to Big Data in insurance Ghesmoune, Mohammed 25 November 2016 (has links) The research outlined in this thesis concerns the development of approaches based on growing neural gas (GNG) for clustering massive data streams. We propose three algorithmic extensions of the GNG approach: sequential, distributed and parallel, and hierarchical; as well as a model for scalability using MapReduce and its application to learning clusters from real insurance Big Data in the form of a data stream. We first propose the G-Stream method. G-Stream, as a "sequential" clustering method, is a one-pass data stream clustering algorithm that incrementally discovers clusters of arbitrary shapes without any assumption on the number of clusters. G-Stream uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time. The links between the nodes (clusters) are also weighted by an exponential function. A reservoir temporarily holds observations that lie far from the current prototypes, which reduces the movement of the nodes nearest to those observations. The batchStream algorithm is a micro-batch method for clustering data streams; it defines a new cost function that takes into account that subsets of observations arrive in discrete batches. The minimization of this function, which leads to a topological clustering, is carried out using dynamic clusters in two steps: an assignment step, which assigns each observation to a cluster, followed by an optimization step, which computes the prototype for each node, with a weighting that penalizes old data. A scalable model using MapReduce is then proposed. It consists of decomposing the data stream clustering problem into the elementary functions Map and Reduce. The observations received in each sub-dataset (within a time interval) are processed through deterministic parallel operations (Map and Reduce) to produce the intermediate states or the final clusters. The proposed model is implemented on the Spark platform. The batchStream algorithm is validated on the insurance Big Data. A predictive and analysis system is proposed by combining the clustering results of batchStream with decision trees. This architecture and its modules form the computational core of our Big Data project, called Square Predict. GH-Stream, our third extension, addresses both visualization and clustering of massive data streams. It uses a hierarchical and topological structure, consisting of several hierarchical trees that represent clusters, for both of these tasks. Unsupervised learning Clustering of data streams Topological clustering MapReduce
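The two-step scheme in this abstract (an assignment step, then an optimization step, expressed as Map and Reduce over one micro-batch) can be sketched as follows. This is a deliberately simplified illustration and not the batchStream algorithm: it ignores the topological neighbourhood, and the constant `decay` factor that discounts older data, like all function names, is an assumption made for the example.

```python
from collections import defaultdict


def nearest(prototypes, x):
    """Index of the prototype closest to observation x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda i: sum((p - v) ** 2 for p, v in zip(prototypes[i], x)),
    )


def map_assign(prototypes, batch):
    """Map (assignment step): emit one (prototype index, observation) pair per observation."""
    return [(nearest(prototypes, x), x) for x in batch]


def reduce_update(prototypes, weights, pairs, decay=0.5):
    """Reduce (optimization step): each prototype becomes the weighted mean of its
    discounted previous value and the observations assigned to it in this batch."""
    groups = defaultdict(list)
    for idx, x in pairs:
        groups[idx].append(x)
    new_prototypes, new_weights = [], []
    for i, proto in enumerate(prototypes):
        old_weight = weights[i] * decay  # older data counts for less
        points = groups.get(i, [])
        total = old_weight + len(points)
        if total == 0:
            new_prototypes.append(proto)
            new_weights.append(0.0)
            continue
        mean = tuple(
            (old_weight * proto[d] + sum(p[d] for p in points)) / total
            for d in range(len(proto))
        )
        new_prototypes.append(mean)
        new_weights.append(total)
    return new_prototypes, new_weights
```

Running `map_assign` and `reduce_update` once per incoming batch gives a single pass over the stream in which recent batches dominate the prototypes.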
85	Uma arquitetura para processamento de grande volumes de dados integrando sistemas de workflow científicos e o paradigma mapreduce Zorrilla Coz, Rocío Milagros 13 September 2012 (has links) Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) / With the exponential growth of computational power and of the data generated by scientific experiments and simulations, it is now possible to find simulations that generate terabytes of data and scientific experiments that gather petabytes of data. The type of processing required for this data is currently known as data-intensive computing. The MapReduce paradigm, which is included in the Hadoop framework, is an alternative parallelization technique for the execution of distributed applications that is being increasingly used. This framework is responsible for scheduling the execution of jobs in clusters, provides fault tolerance and manages all necessary communication between machines. For many types of complex applications, Scientific Workflow Systems offer advanced functionalities that can be leveraged for the development, execution and evaluation of scientific experiments in different computational environments. In the Query Evaluation Framework (QEF), workflow activities are represented as algebraic operators, and application-specific data types are encapsulated in a common tuple structure. 
QEF aims to automate computational processes and data management, supporting scientists so that they can concentrate on the scientific problem. Several Scientific Workflow Systems now provide components and task parallelization strategies for distributed environments. However, scientific experiments tend to generate large volumes of data, which may limit execution scalability with respect to data locality. For instance, there could be delays in transferring data for process execution, or a failure when consolidating results. In this work, I present a proposal for the integration of QEF with Hadoop. The main objective is to manage the execution of a workflow with an orientation towards data locality. In this proposal, Hadoop is responsible for scheduling tasks in a distributed environment, while the workflow activities and data sources are managed by QEF. The proposed environment is evaluated using a scientific workflow from the astronomy field as a case study. Then, I describe in detail the deployment of the application in a virtualized environment. Finally, experiments that evaluate the impact of the proposed environment on the perceived performance of the application are presented, and future work is discussed. Scientific workflows MapReduce and Hadoop Virtualization
86	Big Data : le nouvel enjeu de l'apprentissage à partir des données massives / Big Data : the new challenge of learning from massive data Adjout Rehab, Moufida 01 April 2016 (has links) The combination of globalization and the continuous development of information technologies has led to an explosion in the volume of available data. The capacities for producing, storing and processing data have crossed such a threshold that a new term has been coined: Big Data. The growing quantity of data to be considered requires new processing tools. Classical learning tools are poorly suited to this change in scale, in terms of both computational complexity and processing time. Processing is most often centralized and sequential, which makes learning methods dependent on the capacity of the machine used. As a result, analyzing a large dataset raises multiple difficulties. In this thesis, we address the problems encountered by supervised learning on large volumes of data. To meet these new challenges, new processes and methods must be developed to make the best use of all the available data. The objective of this thesis is to explore the design of scalable versions of these classical methods. This approach relies on distributing both the processing and the data to increase the capacity of the methods without harming their accuracy. Our contribution consists of two parts, each proposing a new learning approach for massive data processing. 
Both contributions fall within the field of supervised predictive learning from large volumes of data, covering Multiple Linear Regression and ensemble methods such as Bagging. The first contribution, named MLR-MR, concerns scaling up Multiple Linear Regression by distributing the processing over a cluster of machines. The goal is to optimize the processing and the resulting computational load without changing the underlying computation (QR factorization), so that the coefficients obtained are the same as those of the classical method. The second contribution, called "Bagging MR_PR_D" (Bagging based Map Reduce with Distributed PRuning), implements a scalable approach to Bagging that distributes processing at two levels: learning and pruning of the models. Its goal is to design an algorithm that is efficient and scalable in all processing phases (learning and pruning), and thus to guarantee a wide range of applications. Both approaches were tested on a variety of datasets associated with regression problems, with several million observations. Our experimental results demonstrate the efficiency and speed of our approaches based on distributed processing in the Cloud. / In recent years we have witnessed a tremendous growth in the volume of data generated, partly due to the continuous development of information technologies. Managing these amounts of data requires fundamental changes in the architecture of data management systems in order to adapt to large and complex data. Single machines do not have the capacity required to process such massive data, which motivates the need for scalable solutions. This thesis focuses on building scalable data management systems for processing large amounts of data. 
Our objective is to study the scalability of supervised machine learning methods in large-scale scenarios. In most existing algorithms and data structures, there is a trade-off between efficiency, complexity, and scalability. To address these issues, we explore recent techniques for distributed learning in order to overcome the limitations of current learning algorithms. Our contribution consists of two new machine learning approaches for large-scale data. The first contribution tackles the scalability of Multiple Linear Regression in distributed environments; it makes it possible to learn quickly from massive volumes of existing data using parallel computing and a divide-and-conquer approach, while providing the same coefficients as the classical approach. The second contribution introduces a new scalable approach for ensembles of models, which allows both learning and pruning to be deployed in a distributed environment. Both approaches have been evaluated on a variety of regression datasets ranging from a few thousand to several million examples. The experimental results show that the proposed approaches are competitive in terms of predictive performance while significantly reducing training and prediction time. Big data Large scale data MapReduce Multiple linear regression Bagging
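To show how a regression fit can be split across partitions and still return the coefficients of the centralized fit, the sketch below uses a different and simpler technique than the QR factorization named in the abstract: each map call computes the partial sums X^T X and X^T y for its partition, the reduce step adds them, and the small resulting system is solved once. It is an illustrative substitute rather than the MLR-MR method, and all function names are hypothetical.

```python
def map_partition(rows):
    """Map: partial sums X^T X and X^T y for one partition.

    Each row is (features, target); features already include the constant 1.
    """
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def reduce_partials(partials):
    """Reduce: add the partial sums of all partitions element by element."""
    k = len(partials[0][1])
    xtx = [[sum(p[0][i][j] for p in partials) for j in range(k)] for i in range(k)]
    xty = [sum(p[1][i] for p in partials) for i in range(k)]
    return xtx, xty


def solve(a, b):
    """Solve a * w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]
```

The partial sums have a fixed size that depends only on the number of features, so the amount of data sent to the reducer does not grow with the number of observations. Normal equations are less numerically stable than a QR-based solution, which is one reason a production method would prefer the latter.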
88	Méthode de Partitionnement pour le traitement distribué et parallèle de données XML. Malla, Noor 21 September 2012 (has links) (PDF) Over the last decade, the spread of the XML format for representing data generated by and exchanged on the Web has been accompanied by the implementation of many engines for evaluating XQuery queries and updates. Among these engines, main-memory systems play a very important role in many applications. They are very easy to manage and to integrate into programming environments. However, these systems have scalability problems because they require documents to be fully loaded into main memory before processing. This thesis presents a technique for partitioning XML documents that allows main-memory engines to evaluate XQuery expressions (queries and updates) over very large documents. The partitioning method applies to a class of relevant and frequent queries and updates, called iterative queries and updates. The thesis proposes a static analysis technique to recognize "iterative" expressions. The static analysis is based on extracting paths from the XQuery expression, without using additional schema information. Algorithms are specified that use the extracted paths to partition the input documents into several parts, so that the query or update can be evaluated on each part separately and the final result computed by simply concatenating the results obtained for each part. 
These algorithms are implemented in a streaming fashion and their efficiency is validated experimentally. In addition, the partitioning method can easily be implemented using the MapReduce paradigm, which makes it possible to evaluate a query or update in parallel over the partitioned data. [INFO:INFO_OH] Computer Science/Other XML XQuery queries XQuery updates Projection Data partitioning MapReduce
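The idea of evaluating a query on each part of a document and concatenating the partial results can be illustrated with a toy example. The sketch below is not the thesis's method: it loads the whole document before splitting it (the thesis partitions in a streaming fashion using paths extracted from the query), and the `<book>` document layout, the example query and the function names are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET


def partition(xml_text, size):
    """Split the children of the root into parts of at most `size` elements.

    Each part is a small document with the same root tag and attributes.
    """
    root = ET.fromstring(xml_text)
    children = list(root)
    parts = []
    for start in range(0, len(children), size):
        part = ET.Element(root.tag, root.attrib)
        part.extend(children[start:start + size])
        parts.append(part)
    return parts


def query(doc):
    """An 'iterative' query: every result depends on a single <book> element only."""
    return [
        book.findtext("title")
        for book in doc.findall("book")
        if float(book.findtext("price")) < 20
    ]


def evaluate_partitioned(xml_text, size):
    """Evaluate the query on each part and concatenate the partial results."""
    results = []
    for part in partition(xml_text, size):
        results.extend(query(part))
    return results
```

The concatenation is only correct because the query never relates two different `<book>` elements; a query that compared or joined books across the document would not be iterative in this sense.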
89	Optimization of video Delivery in Telco-CDN LI, Zhe 25 January 2013 (has links) (PDF) The exploding HD video streaming traffic calls for deploying content servers deeper inside network operators' infrastructures. Telco-CDNs are new content distribution services that are managed by Internet Service Providers (ISPs). Since the network operator controls both the infrastructure and the content delivery overlay, it is in a position to engineer the Telco-CDN so that networking resources are optimally utilized. In this thesis, we focus on optimal resource placement in Telco-CDN. We first investigated the placement of application components in Telco-CDN. Popular services like Facebook or Twitter, with a size on the order of hundreds of terabytes, cannot be fully replicated in a single data center. Instead, the idea is to partition the service into smaller components and to locate the components on distinct sites. The same approach applies to Telco-CDN operators. We addressed this k-Component Multi-Site Placement Problem from an optimization standpoint. We developed linear programming models and designed approximation and heuristic algorithms to minimize the overall service delivery cost. Thereafter, we extended our work to address the problem of optimal video placement for Telco-CDN. We modeled this problem as a k-Product Capacitated Facility Location Problem, which takes into account network conditions and users' preferences. We designed a genetic algorithm in order to obtain near-optimal performance for this "push" approach, then we implemented it on the MapReduce framework in order to deal with very large data sets. The evaluation shows that our optimal placement is on par with cooperative LRU caching in terms of storage efficiency, while its impact on the network infrastructure is less severe. We then explore the caching decision problem in the context of Information Centric Network (ICN), which could be a revolutionary design for Telco-CDN. 
In ICN, routers are endowed with caching capabilities. So far, only a basic Least Recently Used (LRU) policy implemented on every router has been proposed. Our first contribution is a cooperative caching protocol designed for handling large video streams with on-demand access. We integrated our new protocol into the main router software (CCNx) and developed a platform that automatically deploys our augmented CCNx implementation on real machines. Experiments show that our cooperative caching significantly reduces the inter-domain traffic for an ISP with acceptable overhead. Finally, we aim at better understanding the behavior of caching policies other than LRU. We built an analytical model that approximates the performance of a set of policies ranging from LRU to Least Frequently Used (LFU) in any type of network topology. We also designed a multi-policy in-network caching scheme, where every router implements its own caching policy according to its location in the network. Compared to the single LRU policy, the multi-policy caching strategy considerably increases the hit ratio of the in-network caching system in the context of Video-on-Demand applications. All in all, this thesis explores different aspects of resource placement in Telco-CDN. The aim is to explore the optimal and near-optimal performance of various approaches. Facility Location Problem MapReduce Caching Policies CDN ICN
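For readers unfamiliar with the baseline policy mentioned in this abstract, the sketch below shows a single-router LRU cache that also tracks its hit ratio. It is a generic textbook illustration, not the cooperative protocol or the CCNx integration from the thesis; the class name, the `fetch` callback and the capacity parameter are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used item."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest entry last
        self.hits = 0
        self.requests = 0

    def request(self, name, fetch):
        """Return the content for `name`, calling `fetch(name)` only on a miss."""
        self.requests += 1
        if name in self.items:
            self.hits += 1
            self.items.move_to_end(name)  # mark as most recently used
            return self.items[name]
        value = fetch(name)
        self.items[name] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
        return value

    def hit_ratio(self):
        """Fraction of requests served from the cache so far."""
        return self.hits / self.requests if self.requests else 0.0
```

An LFU variant would keep a request counter per item and evict the item with the smallest count instead of the oldest one.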
90	Scalable Scientific Computing Algorithms Using MapReduce Xiang, Jingen January 2013 (has links) Cloud computing systems, like MapReduce and Pregel, provide a scalable and fault tolerant environment for running computations at massive scale. However, these systems are designed primarily for data intensive computational tasks, while a large class of problems in scientific computing and business analytics are computationally intensive (i.e., they require a lot of CPU in addition to I/O). In this thesis, we investigate the use of cloud computing systems, in particular MapReduce, for computationally intensive problems, focusing on two classic problems that arise in scientific computing and also in analytics: maximum clique and matrix inversion. The key contribution that enables us to effectively use MapReduce to solve the maximum clique problem on dense graphs is a recursive partitioning method that partitions the graph into several subgraphs of similar size and running time complexity. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant. For the matrix inversion problem, we show that a recursive block LU decomposition allows us to effectively compute in parallel both the lower triangular (L) and upper triangular (U) matrices using MapReduce. After computing the L and U matrices, their inverses are computed using MapReduce. The inverse of the original matrix, which is the product of the inverses of the L and U matrices, is also obtained using MapReduce. Our technique is the first matrix inversion technique that uses MapReduce. 
We show experimentally that our technique has good scalability, and it is simpler and more fault tolerant than MPI implementations such as ScaLAPACK. Scientific Computing Cloud Computing MapReduce Hadoop Matrix Inversion Maximum Clique Computer Science
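The matrix inversion route described in this abstract can be written out with the standard block LU identities. The block names below are notation introduced for this note, and the formulas assume that no pivoting is needed, i.e. that A_1 and the matrix A_4 - L_3 U_2 are invertible; the thesis's exact recursion and its mapping onto MapReduce jobs are not given in the abstract.

```latex
A =
\begin{pmatrix} A_1 & A_2 \\ A_3 & A_4 \end{pmatrix}
=
\begin{pmatrix} L_1 & 0 \\ L_3 & L_4 \end{pmatrix}
\begin{pmatrix} U_1 & U_2 \\ 0 & U_4 \end{pmatrix},
\qquad
\begin{aligned}
L_1 U_1 &= A_1, \\
U_2 &= L_1^{-1} A_2, \\
L_3 &= A_3 U_1^{-1}, \\
L_4 U_4 &= A_4 - L_3 U_2,
\end{aligned}
\qquad
A^{-1} = U^{-1} L^{-1}.
```

The first and last identities are themselves LU decompositions of smaller matrices, which is what makes the scheme recursive, while the two middle identities are independent of each other and can be computed in parallel.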

Search results