Spelling suggestions: "subject:"mapreduce"" "subject:"hmapreduce""
111 |
Performance modeling of MapReduce applications for the cloud / Modelagem de desempenho de aplicações mapreduce para a núvemIzurieta, Iván Carrera January 2014 (has links)
Nos últimos anos, Cloud Computing tem se tornado uma tecnologia importante que possibilitou executar aplicações sem a necessidade de implementar uma infraestrutura física com a vantagem de reduzir os custos ao usuário cobrando somente pelos recursos computacionais utilizados pela aplicação. O desafio com a implementação de aplicações distribuídas em ambientes de Cloud Computing é o planejamento da infraestrutura de máquinas virtuais visando otimizar o tempo de execução e o custo da implementação. Assim mesmo, nos últimos anos temos visto como a quantidade de dados produzida pelas aplicações cresceu mais que nunca. Estes dados contêm informação valiosa que deve ser obtida utilizando ferramentas como MapReduce. MapReduce é um importante framework para análise de grandes quantidades de dados desde que foi proposto pela Google, e disponibilizado Open Source pela Apache com a sua implementação Hadoop. O objetivo deste trabalho é apresentar que é possível predizer o tempo de execução de uma aplicação distribuída, a saber, uma aplicação MapReduce, na infraestrutura de Cloud Computing, utilizando um modelo matemático baseado em especificações teóricas. Após medir o tempo levado na execução da aplicação e variando os parámetros indicados no modelo matemático, e, após utilizar uma técnica de regressão linear, o objetivo é atingido encontrando um modelo do tempo de execução que foi posteriormente aplicado para predizer o tempo de execução de aplicações MapReduce com resultados satisfatórios. Os experimentos foram realizados em diferentes configurações: a saber, executando diferentes aplicações MapReduce em clusters privados e públicos, bem como em infraestruturas de Cloud comercial, e variando o número de nós que compõem o cluster, e o tamanho do workload dado à aplicação. Os experimentos mostraram uma clara relação com o modelo teórico, indicando que o modelo é, de fato, capaz de predizer o tempo de execução de aplicações MapReduce. O modelo desenvolvido é genérico, o que quer dizer que utiliza abstrações teóricas para a capacidade computacional do ambiente e o custo computacional da aplicação MapReduce. Motiva-se a desenvolver trabalhos futuros para estender esta abordagem para atingir outro tipo de aplicações distribuídas, e também incluir o modelo matemático deste trabalho dentro de serviços na núvem que ofereçam plataformas MapReduce, a fim de ajudar os usuários a planejar suas implementações. / In the last years, Cloud Computing has become a key technology that made possible running applications without needing to deploy a physical infrastructure with the advantage of lowering costs to the user by charging only for the computational resources used by the application. The challenge with deploying distributed applications in Cloud Computing environments is that the virtual machine infrastructure should be planned in a way that is time and cost-effective. Also, in the last years we have seen how the amount of data produced by applications has grown bigger than ever. This data contains valuable information that has to be extracted using tools like MapReduce. MapReduce is an important framework to analyze large amounts of data since it was proposed by Google, and made open source by Apache with its Hadoop implementation. The goal of this work is to show that the execution time of a distributed application, namely, a MapReduce application, in a Cloud computing environment, can be predicted using a mathematical model based on theoretical specifications. This prediction is made to help the users of the Cloud Computing environment to plan their deployments, i.e., quantify the number of virtual machines and its characteristics in order to have a lesser cost and/or time. After measuring the application execution time and varying parameters stated in the mathematical model, and after that, using a linear regression technique, the goal is achieved finding a model of the execution time which was then applied to predict the execution time of MapReduce applications with satisfying results. The experiments were conducted in several configurations: namely, private and public clusters, as well as commercial cloud infrastructures, running different MapReduce applications, and varying the number of nodes composing the cluster, as well as the amount of workload given to the application. Experiments showed a clear relation with the theoretical model, revealing that the model is in fact able to predict the execution time of MapReduce applications. The developed model is generic, meaning that it uses theoretical abstractions for the computing capacity of the environment and the computing cost of the MapReduce application. Further work in extending this approach to fit other types of distributed applications is encouraged, as well as including this mathematical model into Cloud services offering MapReduce platforms, in order to aid users plan their deployments.
|
112 |
Performance modeling of MapReduce applications for the cloud / Modelagem de desempenho de aplicações mapreduce para a núvemIzurieta, Iván Carrera January 2014 (has links)
Nos últimos anos, Cloud Computing tem se tornado uma tecnologia importante que possibilitou executar aplicações sem a necessidade de implementar uma infraestrutura física com a vantagem de reduzir os custos ao usuário cobrando somente pelos recursos computacionais utilizados pela aplicação. O desafio com a implementação de aplicações distribuídas em ambientes de Cloud Computing é o planejamento da infraestrutura de máquinas virtuais visando otimizar o tempo de execução e o custo da implementação. Assim mesmo, nos últimos anos temos visto como a quantidade de dados produzida pelas aplicações cresceu mais que nunca. Estes dados contêm informação valiosa que deve ser obtida utilizando ferramentas como MapReduce. MapReduce é um importante framework para análise de grandes quantidades de dados desde que foi proposto pela Google, e disponibilizado Open Source pela Apache com a sua implementação Hadoop. O objetivo deste trabalho é apresentar que é possível predizer o tempo de execução de uma aplicação distribuída, a saber, uma aplicação MapReduce, na infraestrutura de Cloud Computing, utilizando um modelo matemático baseado em especificações teóricas. Após medir o tempo levado na execução da aplicação e variando os parámetros indicados no modelo matemático, e, após utilizar uma técnica de regressão linear, o objetivo é atingido encontrando um modelo do tempo de execução que foi posteriormente aplicado para predizer o tempo de execução de aplicações MapReduce com resultados satisfatórios. Os experimentos foram realizados em diferentes configurações: a saber, executando diferentes aplicações MapReduce em clusters privados e públicos, bem como em infraestruturas de Cloud comercial, e variando o número de nós que compõem o cluster, e o tamanho do workload dado à aplicação. Os experimentos mostraram uma clara relação com o modelo teórico, indicando que o modelo é, de fato, capaz de predizer o tempo de execução de aplicações MapReduce. O modelo desenvolvido é genérico, o que quer dizer que utiliza abstrações teóricas para a capacidade computacional do ambiente e o custo computacional da aplicação MapReduce. Motiva-se a desenvolver trabalhos futuros para estender esta abordagem para atingir outro tipo de aplicações distribuídas, e também incluir o modelo matemático deste trabalho dentro de serviços na núvem que ofereçam plataformas MapReduce, a fim de ajudar os usuários a planejar suas implementações. / In the last years, Cloud Computing has become a key technology that made possible running applications without needing to deploy a physical infrastructure with the advantage of lowering costs to the user by charging only for the computational resources used by the application. The challenge with deploying distributed applications in Cloud Computing environments is that the virtual machine infrastructure should be planned in a way that is time and cost-effective. Also, in the last years we have seen how the amount of data produced by applications has grown bigger than ever. This data contains valuable information that has to be extracted using tools like MapReduce. MapReduce is an important framework to analyze large amounts of data since it was proposed by Google, and made open source by Apache with its Hadoop implementation. The goal of this work is to show that the execution time of a distributed application, namely, a MapReduce application, in a Cloud computing environment, can be predicted using a mathematical model based on theoretical specifications. This prediction is made to help the users of the Cloud Computing environment to plan their deployments, i.e., quantify the number of virtual machines and its characteristics in order to have a lesser cost and/or time. After measuring the application execution time and varying parameters stated in the mathematical model, and after that, using a linear regression technique, the goal is achieved finding a model of the execution time which was then applied to predict the execution time of MapReduce applications with satisfying results. The experiments were conducted in several configurations: namely, private and public clusters, as well as commercial cloud infrastructures, running different MapReduce applications, and varying the number of nodes composing the cluster, as well as the amount of workload given to the application. Experiments showed a clear relation with the theoretical model, revealing that the model is in fact able to predict the execution time of MapReduce applications. The developed model is generic, meaning that it uses theoretical abstractions for the computing capacity of the environment and the computing cost of the MapReduce application. Further work in extending this approach to fit other types of distributed applications is encouraged, as well as including this mathematical model into Cloud services offering MapReduce platforms, in order to aid users plan their deployments.
|
113 |
Implementation of the HadoopMapReduce algorithm on virtualizedshared storage systemsNethula, Shravya January 2016 (has links)
Context Hadoop is an open-source software framework developed for distributed storage and distributed processing of large sets of data. The implementation of the Hadoop MapReduce algorithm on virtualized shared storage by eliminating the concept of Hadoop Distributed File System (HDFS) is a challenging task. In this study, the Hadoop MapReduce algorithm is implemented on the Compuverde software that deals with virtualized shared storage of data. Objectives In this study, the effect of using virtualized shared storage with Hadoop framework is identified. The main objective of this study is to design a method to implement the Hadoop MapReduce algorithm on Compuverde software that deals with virtualized shared storage of big data. Finally, the performance of the MapReduce algorithm on Compuverde shared storage (Compuverde File System - CVFS) is evaluated and compared to the performance of the MapReduce algorithm on HDFS. Methods Initially a literature study is conducted to identify the effect of Hadoop implementation on virtualized shared storage. The Compuverde software is analyzed in detail during this literature study. The concepts of the MapReduce algorithms and the functioning of HDFS are scrutinized in detail. The next main research method that is adapted for this study is the implementation of a method where the Hadoop MapReduce algorithm is applied on the Compuverde software that deals with the virtualized shared storage by eliminating the HDFS. The next step is experimentation in which the performance of the implementation of the MapReduce algorithm on Compuverde shared storage (CVFS) in comparison with implementation of the MapReduce algorithm on Hadoop Distributed File System. Results The experiment is conducted in two different scenarios namely the CPU bound scenario and I/O bound scenario. In CPU bound scenario, the average execution time of WordCount program has a linear growth with respect to size of data set. This linear growth is observed for both the file systems, HDFS and CVFS. The same is the case with I/O bound scenario. There is linear growth for both the file systems. When the averages of execution time are plotted on the graph, both the file systems perform similarly in CPU bound scenario(multi-node environment). In the I/O bound scenario (multi-node environment), HDFS slightly out performs CVFS when the size of 1.0GB and both the file systems performs without much difference when the size of data set is 0.5GB and 1.5GB. Conclusions The MapReduce algorithm can be implemented on live data present in the virtualized shared storage systems without copying data into HDFS. In single node environment, distributed storage systems perform better than shared storage systems. In multi-node environment, when the CPU bound scenario is considered, both HDFS and CVFS file systems perform similarly. On the other hand, HDFS performs slightly better than CVFS for 1.0GB of data set in the I/O bound scenario. Hence we can conclude that distributed storage systems perform similar to the shared storage systems in both CPU bound and I/O bound scenarios in multi-node environment.
|
114 |
MsSpark: Implementation of Molecular Simulation Queries Using Apache SparkKaur, Parneet 24 June 2016 (has links)
Huge amount of data is being generated in almost every field and it cannot be avoided, rather is essential for the advancement of the field. Analysis of this data requires intensive computing power. Molecular Simulation is a powerful tool for understanding the behavior of natural systems. The simulation generates large amount data while observing the spatial and temporal relationships. The challenge is to handle the analytical queries that are often compute intensive.
Although various tools exist to tackle this problem, but in this paper we have tried an alternate approach that uses Apache Spark- a modern big data platform – to parallelize the computation of analytical queries. MsSpark consists of three layers: Apache Spark layer, MS RDD layer and MS Query Processing layer. MS RDD layers supports data that is specific to Molecular Simulation. MS Query Processing layer provides functionality of executing analytical queries. Caching is used to improve the performance. The system can be further extended to cover more analytical queries.
|
115 |
Cloud Computing Frameworks for Food Recognition from ImagesPeddi, Sri Vijay Bharat January 2015 (has links)
Distributed cloud computing, when integrated with smartphone capabilities, contribute to building an efficient multimedia e-health application for mobile devices. Unfortunately, mobile devices alone do not possess the ability to run complex machine learning algorithms, which require large amounts of graphic processing and computational power. Therefore, offloading the computationally intensive part to the cloud, reduces the overhead on the mobile device. In this thesis, we introduce two such distributed cloud computing models, which implement machine learning algorithms in the cloud in parallel, thereby achieving higher accuracy. The first model is based on MapReduce SVM, wherein, through the use of Hadoop, the system distributes the data and processes it across resizable Amazon EC2 instances. Hadoop uses a distributed processing architecture called MapReduce, in which a task is mapped to a set of servers for processing and the results are then reduced back to a single set. In the second method, we implement cloud virtualization, wherein we are able to run our mobile application in the cloud using an Android x86 image. We describe a cloud-based virtualization mechanism for multimedia-assisted mobile food recognition, which allow users to control their virtual smartphone operations through a dedicated client application installed on their smartphone. The application continues to be processed on the virtual mobile image even if the user is disconnected for some reason. Using these two distributed cloud computing models, we were able to achieve higher accuracy and reduced timings for the overall execution of machine learning algorithms and calorie measurement methodologies, when implemented on the mobile device.
|
116 |
An Ensemble Method for Large Scale Machine Learning with Hadoop MapReduceLiu, Xuan January 2014 (has links)
We propose a new ensemble algorithm: the meta-boosting algorithm. This algorithm enables the original Adaboost algorithm to improve the decisions made by different WeakLearners utilizing the meta-learning approach. Better accuracy results are achieved since this algorithm reduces both bias and variance. However, higher accuracy also brings higher computational complexity, especially on big data. We then propose the parallelized meta-boosting algorithm: Parallelized-Meta-Learning (PML) using the MapReduce programming paradigm on Hadoop. The experimental results on the Amazon EC2 cloud computing infrastructure show that PML reduces the computation complexity enormously while retaining lower error rates than the results on a single computer. As we know MapReduce has its inherent weakness that it cannot directly support iterations in an algorithm, our approach is a win-win method, since it not only overcomes this weakness, but also secures good accuracy performance. The comparison between this approach and a contemporary algorithm AdaBoost.PL is also performed.
|
117 |
Análisis de archivos Logs semi-estructurados de ambientes Web usando tecnologías Big-DataVillalobos Luengo, César Alexis January 2016 (has links)
Magíster en Tecnologías de la Información / Actualmente el volumen de datos que las empresas generan es mucho más grande del
que realmente pueden procesar, por ende existe un gran universo de información que se
pierde implícito en estos datos. Este proyecto de tesis logró implementar tecnologías Big
Data capaces de extraer información de estos grandes volúmenes de datos existentes en
la organización y que no eran utilizados, de tal forma de transformarlos en valor para el
negocio.
La empresa elegida para este proyecto se dedicada al pago de cotizaciones previsionales
de forma electrónica por internet. Su función es ser el medio por el cual se recaudan las
cotizaciones de los trabajadores del país. Cada una de estas cotizaciones es informada,
rendida y publicada a las instituciones previsionales correspondientes (Mutuales, Cajas de
Compensación, AFPs, etc.). Para realizar su función, la organización ha implementado a
lo largo de sus 15 años una gran infraestructura de alto rendimiento orientada a servicios
web. Actualmente esta arquitectura de servicios genera una gran cantidad de archivos
logs que registran los sucesos de las distintas aplicaciones y portales web. Los archivos
logs tienen la característica de poseer un gran tamaño y a la vez no tener una estructura
rigurosamente definida. Esto ha causado que la organización no realice un eficiente
procesamiento de estos datos, ya que las actuales tecnologías de bases de datos
relaciones que posee no lo permiten. Por consiguiente, en este proyecto de tesis se buscó
diseñar, desarrollar, implementar y validar métodos que sean capaces de procesar
eficientemente estos archivos de logs con el objetivo de responder preguntas de negocio
que entreguen valor a la compañía.
La tecnología Big Data utilizada fue Cloudera, la que se encuentra en el marco que la
organización exige, como por ejemplo: Que tenga soporte en el país, que esté dentro de
presupuesto del año, etc. De igual forma, Cloudera es líder en el mercado de soluciones
Big Data de código abierto, lo cual entrega seguridad y confianza de estar trabajando
sobre una herramienta de calidad. Los métodos desarrollados dentro de esta tecnología
se basan en el framework de procesamiento MapReduce sobre un sistema de archivos
distribuido HDFS.
Este proyecto de tesis probó que los métodos implementados tienen la capacidad de
escalar horizontalmente a medida que se le agregan nodos de procesamiento a la
arquitectura, de forma que la organización tenga la seguridad que en el futuro, cuando los
archivos de logs tengan un mayor volumen o una mayor velocidad de generación, la
arquitectura seguirá entregando el mismo o mejor rendimiento de procesamiento, todo
dependerá del número de nodos que se decidan incorporar.
|
118 |
Distribuované zpracování dat o IP tocích / Distributed Processing of IP flow DataKrobot, Pavel January 2015 (has links)
This thesis deals with the subject of distributed processing of IP flow. Main goal is to provide an implementation of a software collector which allows storing and processing huge amount of a network data in particular. There was studied an open-source implementation of a framework for the distributed processing of large data sets called Hadoop, which is based on MapReduce paradigm. There were made some experiments with this system which provided the comparison with the current systems and shown weaknesses of this framework. Based on this knowledge there was created a specification and scheme for an extension of current software collector within this work. In terms of the created scheme there was created an implementation of query framework for formed collector, which is considered as most critical in the field of distributed processing of IP flow data. Results of experiments with created implementation show significant performance growth and ability of linear scalability with some types of queries.
|
119 |
Evaluierung und Erweiterung von MapReduce-Algorithmen zur Berechnung der transitiven Hülle ungerichteter Graphen für Entity ResolutionWorkflowsZiad, Sehili 16 April 2018 (has links)
Im Bereich von Entity-Resolution oder deduplication werden aufgrund fehlender global eindeutiger Identifikatoren Match-Techniken verwendet, um zu bestimmen, ob verschiedene Datensätze dasselbe Realweltobjekt darstellen. Die inhärente quadratische Komplexität führt zu sehr langen Laufzeiten für große Datenmengen, was eine Parallelisierung dieses Prozesses erfordert.
MapReduce ist wegen seiner Skalierbarkeit und Einsetzbarkeit in Cloud- Infrastrukturen eine gute Lösung zur Verbesserung der Laufzeit. Außerdem kann unter bestimmten Voraussetzungen die Qualität des Match-Ergebnisses durch die Berechnung der transitiven Hülle verbessert werden.
|
120 |
ZipThru: A software architecture that exploits Zipfian skew in datasets for accelerating Big Data analysisEjebagom J Ojogbo (9529172) 16 December 2020 (has links)
<div>In the past decade, Big Data analysis has become a central part of many industries including entertainment, social networking, and online commerce. MapReduce, pioneered by Google, is a popular programming model for Big Data analysis, famous for its easy programmability due to automatic data partitioning, fault tolerance, and high performance. Majority of MapReduce workloads are summarizations, where the final output is a per-key ``reduced" version of the input, highlighting a shared property of each key in the input dataset.</div><div><br></div><div>While MapReduce was originally proposed for massive data analyses on networked clusters, the model is also applicable to datasets small enough to be analyzed on a single server. In this single-server context the intermediate tuple state generated by mappers is saved to memory, and only after all Map tasks have finished are reducers allowed to process it. This Map-then-Reduce sequential mode of execution leads to distant reuse of the intermediate state, resulting in poor locality for memory accesses. In addition the size of the intermediate state is often too large to fit in the on-chip caches, leading to numerous cache misses as the state grows during execution, further degrading performance. It is well known, however, that many large datasets used in these workloads possess a Zipfian/Power Law skew, where a minority of keys (e.g., 10\%) appear in a majority of tuples/records (e.g., 70\%). </div><div><br></div><div>I propose ZipThru, a novel MapReduce software architecture that exploits this skew to keep the tuples for the popular keys on-chip, processing them on the fly and thus improving reuse of their intermediate state and curtailing off-chip misses. ZipThru achieves this using four key mechanisms: 1) Concurrent execution of both Map and Reduce phases; 2) Holding only the small, reduced state of the minority of popular keys on-chip during execution; 3) Using a lookup table built from pre-processing a subset of the input to distinguish between popular and unpopular keys; and 4) Load balancing the concurrently executing Map and Reduce phases to efficiently share on-chip resources. </div><div><br></div><div>Evaluations using Phoenix, a shared-memory MapReduce implementation, on 16- and 32-core servers reveal that ZipThru incurs 72\% fewer cache misses on average over traditional MapReduce while achieving average speedups of 2.75x and 1.73x on both machines respectively.</div>
|
Page generated in 0.047 seconds