Application-aware software-defined networking to accelerate MapReduce applications

Made available in DSpace on 2015-04-14T14:50:19Z (GMT). No. of bitstreams: 1
466322.pdf: 4102408 bytes, checksum: d0728ba001c22ab7a016962b0a3e122f (MD5)
Previous issue date: 2015-01-22 / The MapReduce (MR) programming model, as implemented by Hadoop, has become the de facto standard for large-scale data analysis in data centers, and it is also the basis for a wide variety of Big Data technologies in use today. In this context, Hadoop is a scalable framework that allows a large number of servers to be used to handle the growing data sets of the Big Data area. While processing capacity and I/O can be scaled by adding more servers, doing so generates heavy network traffic. In MR, the phase that communicates over the network accounts for a significant share of the total execution time. This problem is aggravated further when communication patterns are unbalanced, which is not uncommon for many MR applications. MR normally runs in large data centers (DCs) built from commodity hardware. The network of such DCs normally uses dense topologies that offer multiple alternative paths (multipath) between each pair of hosts. This type of topology, combined with the emerging technology of software-defined networking (SDN), enables the creation of intelligent protocols that distribute traffic among the available paths and reduce application execution time. This work therefore proposes an application-aware network control (that is, one that knows the application-level semantics and traffic demands) to improve the performance of MR applications compared with traditional network control. To that end, MR was first studied in detail, and the typical communication patterns and frequent causes of network-related performance bottlenecks in this type of application were identified. Next, the state of the art in data center networks was studied, along with its ability to handle the communication patterns found in MR applications. 
Based on the results obtained, an architecture for application-aware network control was proposed. A prototype was developed using an SDN controller and was successfully used to accelerate MR applications. Experiments using popular benchmarks and different network characteristics demonstrated a reduction of 2% to 58% in the total execution time of MR applications. In addition to the performance gain in MR applications, other contributions of this work include a method to predict the traffic demands of MR applications, heuristics for network optimization, and an emulation-based test environment for data center networks. / The rise of Internet of Things sensors, social networking, and mobile devices has led to an explosion of available data. The need to gain insights into this data has given rise to the area of Big Data analytics. The MapReduce (MR) framework, as implemented in Hadoop, has become the de facto standard for Big Data analytics. It also forms a base platform for many of the Big Data technologies in use today. To handle ever-increasing data sizes, Hadoop is designed as a scalable framework that allows a seemingly unbounded number of dedicated servers to participate in the analytics process. The response time of an analytics request is an important factor in time to value/insights. While compute and disk I/O capacity can be scaled with the number of servers, scaling the system leads to increased network traffic. Arguably, the communication-heavy phase of MR contributes significantly to the overall response time. This problem is further aggravated if communication patterns are heavily skewed, which is not uncommon in MR workloads. MR applications normally run in large data centers (DCs) employing dense network topologies (e.g., multi-rooted trees) with multiple paths available between any pair of hosts. 
These DC network designs, combined with recent software-defined networking (SDN) programmability, offer a new opportunity to dynamically and intelligently configure the network to achieve shorter application runtime. The initial intuition motivating our work is that the well-defined structure of MR and the rich traffic demand information available in Hadoop's log and metadata files could be used to guide the network control. We therefore conjecture that an application-aware network control (i.e., one that knows the application-level semantics and traffic demands) can improve the performance of MR applications when compared to state-of-the-art application-agnostic network control. To confirm our thesis, we first studied MR systems in detail and identified typical communication patterns and common causes of network-related performance bottlenecks in MR applications. Then, we studied the state of the art in DC networks and evaluated its ability to handle MapReduce-like communication patterns. Our results confirmed the assumption that existing techniques are not able to deal with MR communication patterns, mainly because they lack visibility into application-level information. Based on these findings, we proposed an architecture for an application-aware network control for DCs running MR applications. We implemented a prototype within an SDN controller and used it to successfully accelerate MR applications. Depending on the network oversubscription ratio, we demonstrated a 2% to 58% reduction in the job completion time for popular MR benchmarks when compared to ECMP (the de facto flow allocation algorithm in multipath DC networks), thus confirming the thesis. Other contributions include a method to predict network demands in MR applications, algorithms to identify the critical communication path in the MR shuffle and to dynamically allocate paths to flows in a multipath network, and an emulation-based testbed for realistic MR workloads.
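
The contrast the abstract draws between ECMP and application-aware flow allocation can be illustrated with a small sketch. This is not the algorithm from the thesis: the function names, the CRC32 hash, the flow tuples, and the greedy largest-first heuristic are all assumptions chosen for illustration. The sketch only shows why knowing predicted flow sizes may help when shuffle traffic is skewed.

```python
import zlib
from typing import List, Tuple

# A flow is (source host, destination host, predicted bytes).
# The predicted size stands in for the application-level demand
# information the abstract mentions; the values used here are made up.
Flow = Tuple[str, str, int]


def ecmp_allocate(flows: List[Flow], num_paths: int) -> List[int]:
    """Application-agnostic sketch: choose a path by hashing the flow
    endpoints and ignore flow size. CRC32 is an assumed stand-in for
    whatever hash a real switch applies to packet headers."""
    load = [0] * num_paths
    for src, dst, size in flows:
        path = zlib.crc32(f"{src}->{dst}".encode()) % num_paths
        load[path] += size
    return load


def demand_aware_allocate(flows: List[Flow], num_paths: int) -> List[int]:
    """Application-aware sketch: place the largest predicted flows first,
    each on the currently least-loaded path (a greedy heuristic, assumed
    here purely for illustration)."""
    load = [0] * num_paths
    for _src, _dst, size in sorted(flows, key=lambda f: f[2], reverse=True):
        path = min(range(num_paths), key=lambda p: load[p])
        load[path] += size
    return load


if __name__ == "__main__":
    # Hypothetical skewed shuffle: one large flow and three small ones.
    shuffle_flows = [
        ("map1", "reduce1", 900),
        ("map2", "reduce1", 300),
        ("map3", "reduce2", 300),
        ("map4", "reduce2", 300),
    ]
    print("ECMP-style path loads:   ", ecmp_allocate(shuffle_flows, 2))
    print("Demand-aware path loads: ", demand_aware_allocate(shuffle_flows, 2))
```

In this sketch the hash-based allocation can place several flows on the same path regardless of their size, while the demand-aware version balances the predicted bytes across paths. A real controller would also have to account for topology, link capacities, and flow arrival times, which this example leaves out.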

Identifier: oai:union.ndltd.org:IBICT/oai:tede2.pucrs.br:tede/5276
Date: 22 January 2015
Creators: Neves, Marcelo Veiga
Contributors: Rose, César Augusto Fonticielha de
Publisher: Pontifícia Universidade Católica do Rio Grande do Sul, Programa de Pós-Graduação em Ciência da Computação, PUCRS, BR, Faculdade de Informática
Source Sets: IBICT Brazilian ETDs
Language: English
Detected Language: English
Type: info:eu-repo/semantics/publishedVersion, info:eu-repo/semantics/doctoralThesis
Format: application/pdf
Source: reponame:Biblioteca Digital de Teses e Dissertações da PUC_RS, instname:Pontifícia Universidade Católica do Rio Grande do Sul, instacron:PUC_RS
Rights: info:eu-repo/semantics/openAccess
Relation: 1974996533081274470, 500, 600, 1946639708616176246
