Análise de benefícios do paralelismo por comunicação unilateral em aplicações com grades não estruturadas / Improvement analysis of parallelism by one-sided communication on unstructured grids applications

Lopes, Pedro Pais 03 September 2010
Parallel computing is used to solve scientific problems that demand intensive computing power. In recent years a new kind of communication between parallel instances has appeared: one-sided communication (OSC), in which only one instance performs the information transfer, which proceeds in the background. In two-sided communication (TSC), by contrast, one instance sends the information and the other receives it. In this context we analyze the benefits that OSC brings to the parallelism of a program that uses an unstructured grid in memory. Two forms of OSC support were used: the Message Passing Interface (MPI) library (specifically the part that describes OSC) and Coarray Fortran (CAF), an extension of the Fortran language. The semantics of MPI OSC is more complex than that of CAF, but the semantics of CAF is more restrictive. To become familiar with the semantics and performance of OSC, MPI OSC and CAF were first applied to a simple program, the Game of Life, which uses a structured grid and already had very good parallel performance with MPI TSC. The MPI OSC code was verbose (more lines of code) and complex, especially when fine-grained control of synchronization between the parallel instances was used. CAF, on the other hand, reduced the number of lines of code by 20% to 40%, and its synchronization is much simpler. The results showed worse performance for MPI OSC, whereas the absolute performance of CAF was better than that of the original implementation up to the number of processor cores that share the same memory. For unstructured grids we used the Ocean Land Atmospheric Model (OLAM), an earth system simulation model with a grid based on triangular prisms, parallelized with MPI TSC. Implementing the communication with MPI OSC within the existing parallel structure showed that its semantics has aspects that can complicate programming, such as the handling of memory exposure (each instance exposes a memory region of a different size) and the way synchronization among instances is performed. In terms of performance, the speedup curves showed that MPI OSC penalized OLAM regardless of the MPI implementation or the equipment used, with a reduction of at least 20% in speedup for seven or more processors. As in the Game of Life, MPI OSC degraded performance.
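
The structured-grid Game of Life case can be illustrated with a minimal single-process sketch. It is not the thesis code: the function names, the row-strip decomposition and the periodic boundary are assumptions made for illustration. Each strip needs one boundary row from each neighbouring strip per generation; in a real parallel code those rows are what would be moved by send/receive pairs (TSC) or by put/get operations on exposed memory (OSC).

```python
def next_state(alive, neighbours):
    """Conway's rule: birth on 3 neighbours, survival on 2 or 3."""
    return 1 if neighbours == 3 or (alive and neighbours == 2) else 0


def life_step(grid):
    """Serial reference: one generation on a periodic (wrap-around) grid."""
    rows, cols = len(grid), len(grid[0])
    new = []
    for r in range(rows):
        row = []
        for c in range(cols):
            n = sum(grid[(r + dr) % rows][(c + dc) % cols]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            row.append(next_state(grid[r][c], n))
        new.append(row)
    return new


def split_rows(grid, parts):
    """Cut the grid into contiguous row strips whose sizes differ by at most one."""
    if not 1 <= parts <= len(grid):
        raise ValueError("need between 1 and len(grid) strips")
    base, extra = divmod(len(grid), parts)
    strips, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        strips.append(grid[start:start + size])
        start += size
    return strips


def strip_step(strip, halo_above, halo_below):
    """Advance one strip using the halo rows obtained from its two neighbours."""
    padded = [halo_above] + strip + [halo_below]
    cols = len(strip[0])
    new = []
    for r in range(1, len(padded) - 1):
        row = []
        for c in range(cols):
            n = sum(padded[r + dr][(c + dc) % cols]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            row.append(next_state(padded[r][c], n))
        new.append(row)
    return new


def decomposed_step(grid, parts):
    """One generation computed strip by strip, as each parallel instance would."""
    strips = split_rows(grid, parts)
    new_strips = []
    for p, strip in enumerate(strips):
        # These two rows are the only data that crosses strip boundaries.
        halo_above = strips[(p - 1) % parts][-1]
        halo_below = strips[(p + 1) % parts][0]
        new_strips.append(strip_step(strip, halo_above, halo_below))
    return [row for strip in new_strips for row in strip]
```

Comparing `decomposed_step(grid, parts)` with `life_step(grid)` on the same grid is a quick way to check that the halo rows carry everything a strip needs.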

Hierarchical message passing through a ProActive/GCM based runtime / Passagem de mensagem hierárquica através de um runtime baseado em ProActive/GCM

Mathias, Elton Nicoletti January 2010
In the past several years, grid computing has emerged as a way to harness computing resources geographically distributed across multiple organizations. Because grids are highly distributed and composed of heterogeneous resources, grid computing has raised the importance of specific requirements such as scalability, performance and the need for an adequate programming model. Several programming models have been proposed for grid programming, but so far none of them has met all the requirements. In high-performance cluster computing, by contrast, the message passing model has become a true standard, with a large number of libraries and legacy applications. This work proposes a hybrid framework that combines the high performance and wide acceptance of the MPI standard, enhanced with intuitive extensions that let developers design grid applications or "gridify" existing ones, with the flexibility of a component-based runtime that models a hierarchy of resources and supports inter-cluster communication. The proposed solution relies on the addition of new MPI communicators and a related API, which support the development of applications that take the hierarchical topology of computational grids into account and are well suited to programmers used to MPI. [...] characteristics (a Monte Carlo simulation, a Mergesort and a Poisson3D solver) have shown that gridification can considerably improve the performance of these applications in grid environments. Although the goal of this work is not to compete with existing MPI distributions, the performance of the proposed solution is comparable to that of MPI, and better in some cases. From the results obtained with the prototype, we conclude that the overhead introduced by the components is not negligible, but is within the expected range. However, the benefits to grid applications are expected to outweigh the additional overhead. In addition, the extensions to the MPI interface offer users the abstractions needed to design parallel algorithms hierarchically for grid environments.
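
To make the idea of a hierarchical topology concrete, the sketch below contrasts a two-level sum reduction with a flat one and counts the messages that cross cluster borders. It is a toy model under stated assumptions, not the API proposed in the thesis: the function names and the dictionary layout are hypothetical, every cluster is assumed to hold at least one process, and the first process of each cluster is assumed to act as its leader.

```python
def hierarchical_sum(values_by_cluster):
    """Two-level reduction: inside each cluster first, then among cluster leaders.

    values_by_cluster maps a cluster name to the values held by its processes.
    Returns (total, intra_cluster_messages, inter_cluster_messages).
    """
    intra = 0
    partials = {}
    for cluster, values in values_by_cluster.items():
        # Stage 1: every non-leader process sends its value to the local leader.
        partials[cluster] = sum(values)
        intra += len(values) - 1
    # Stage 2: every leader except the root leader sends one partial sum.
    inter = len(partials) - 1
    return sum(partials.values()), intra, inter


def flat_sum(values_by_cluster):
    """Flat reduction to a root process located in the first cluster."""
    clusters = list(values_by_cluster)
    total = sum(sum(values) for values in values_by_cluster.values())
    # Processes in the root's cluster send locally; all others cross a border.
    intra = len(values_by_cluster[clusters[0]]) - 1
    inter = sum(len(values_by_cluster[c]) for c in clusters[1:])
    return total, intra, inter
```

Both functions return the same total; they differ only in how many messages travel between clusters, which is typically the expensive path in a grid.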

A Java Founded LOIS-framework and the Message Passing Interface? : An Exploratory Case Study

Strand, Christian January 2006
In this thesis project we have successfully added an MPI extension layer to the LOIS framework. The framework defines an infrastructure for executing and connecting continuous stream processing applications. The MPI extension provides the same amount of stream-based data as the framework’s original transport. We assert that an MPI-2 compatible implementation can be a candidate to extend the given framework with an adaptive and flexible communication subsystem. Adaptability is required since the communication subsystem has to be resilient to changes, either due to optimizations or system requirements.

Simulation des Workflows in einer Kooperation

Telzer, Martin 23 January 2006
The further civilization advances, the more complex its achievements become. Manufacturing processes in turn require complex management during production, since many people and machines take part in the production process. The manager is a "single point of failure" here: successful production depends on the quality and freedom from error of the manager or the managing staff. To remedy this weakness, it is worthwhile to automate certain processes at this point as well, which achieves a higher degree of correctness and reliability. Among other things, the principles of workflow management are used to realize this. The more complex a workflow becomes, the more computing power is needed to execute it in a workflow management system. One technical way to solve this problem is to distribute the workflow management software. Distribution, however, also complicates the software architecture, which in turn makes the software harder to develop. Complex software systems call for complex test programs and simulation environments. To support the development of a distributed workflow management system, this thesis designs and implements a simulation system for workflow management systems. It will be a valuable tool for the developers of a distributed workflow management system during implementation of the software.

Improving the Performance of Selected MPI Collective Communication Operations on InfiniBand Networks

Viertel, Carsten 23 September 2007
The performance of collective communication operations is one of the deciding factors in the overall performance of an MPI application. Open MPI's component architecture offers an easy way to implement new algorithms for collective operations, but current implementations use the point-to-point components to access the InfiniBand network. This work therefore attempts to improve the performance of a collective component by accessing the InfiniBand network directly, which should avoid overhead and make it possible to tune the algorithms to this specific network. The first part of this work gives a short overview of the InfiniBand Architecture and Open MPI. The next part analyzes several models for parallel computation. Various algorithms for the MPI_Scatter, MPI_Gather and MPI_Allgather operations are then presented, and their theoretical performance is analyzed with the LogfP and LogGP models. Selected algorithms are implemented as part of an Open MPI collective component. Finally, the performance of different algorithms and different MPI implementations is compared. The test results show that the performance of the operations could be improved for several ranges of message and communicator sizes.
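
As an illustration of what an allgather algorithm and its model-based cost estimate can look like, the sketch below simulates a ring allgather in a single process and attaches a simplified LogGP estimate. The abstract does not say which algorithms the thesis studies, so the ring scheme is only an assumed example, and all function names are hypothetical. The cost function uses the LogGP time for one k-byte message, o + (k-1)G + L + o, and ignores the gap parameter g and any overlap between steps.

```python
def ring_allgather(blocks):
    """Simulate a ring allgather; blocks[i] is the block initially owned by rank i.

    Returns one buffer per rank; after p - 1 steps every buffer holds all blocks.
    """
    p = len(blocks)
    buffers = [[None] * p for _ in range(p)]
    for rank in range(p):
        buffers[rank][rank] = blocks[rank]
    for step in range(p - 1):
        # All ranks send at the same time, so collect the messages first.
        messages = []
        for rank in range(p):
            idx = (rank - step) % p  # own block at step 0, else the block just received
            messages.append(((rank + 1) % p, idx, buffers[rank][idx]))
        for dest, idx, data in messages:
            buffers[dest][idx] = data
    return buffers


def loggp_message_time(k, L, o, G):
    """LogGP time for one k-byte point-to-point message: o + (k - 1) * G + L + o."""
    return 2 * o + (k - 1) * G + L


def ring_allgather_time(p, k, L, o, G):
    """Simplified estimate: p - 1 sequential steps of one k-byte message each."""
    return (p - 1) * loggp_message_time(k, L, o, G)
```

For example, `ring_allgather(["a", "b", "c", "d"])` leaves every rank with the full list of four blocks after three steps.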

Development, analysis and applications of the technology for parallelization of numerical algorithms for solution of PDE and systems of PDEs / Diferencialinių lygčių ir jų sistemų skaitinio sprendimo algoritmų lygiagretinimo technologijos kūrimas, analizė ir taikymai

Jakušev, Aleksandr 20 June 2008
This work presents a new parallelization technology. The technology is suitable for parallelizing linear algebra problems that arise when solving PDEs and systems of PDEs. It combines the strengths of the "data parallel" and "global memory" parallel programming models. By exploiting the peculiarities of problems in this class, the technology makes it easy to write efficient code and adds the possibility of semi-automatic parallelization. The work consists of three parts: a review of existing technologies, a description of the new one, and various applications.

Lygiagretieji skaičiavimai naudojant vaizdo plokštes / Parallel computing using graphics cards

Juodaitis, Robertas 01 August 2013
This work compares the parallel computing capabilities of a graphics card and MPI on classic parallel algorithms: computing an approximate value of π and matrix multiplication. The MPI library was used for the central processing unit and CUDA for the graphics card. Much attention is paid to choosing a parallelization strategy that uses both the MPI cluster and the graphics card efficiently. Relative speedup was identified as a suitable criterion for comparing these devices; it objectively describes the computing capability the graphics card achieves relative to the central processor. Analysis of the experimental results shows that programmers should aim for less data exchange between processes, because communication lowers the efficiency of parallel algorithms. It was also found that programming in CUDA requires strict adaptation to the parameters of the graphics card and is more complex. As a result, a fully utilized graphics card with CUDA is faster not only than computers with a 4-core processor but also than a small cluster.
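
The π benchmark and the relative speedup criterion can be sketched as follows. This is a single-process illustration under assumptions, not the thesis code: it uses the textbook midpoint-rule integration of 4/(1+x²) over [0, 1], the function names are hypothetical, and relative speedup is taken to mean the reference time divided by the device time, which the abstract does not spell out. Each worker needs no data from the others until the final sum, which matches the advice to keep communication between processes low.

```python
import math


def partial_pi(rank, size, n):
    """Contribution of one worker to the midpoint-rule integral of 4 / (1 + x^2)."""
    h = 1.0 / n
    total = 0.0
    for i in range(rank, n, size):  # cyclic distribution of the n sub-intervals
        x = h * (i + 0.5)
        total += 4.0 / (1.0 + x * x)
    return h * total


def parallel_pi(size, n):
    """Sum the partial results; in MPI this sum would be a single reduction."""
    return sum(partial_pi(rank, size, n) for rank in range(size))


def relative_speedup(t_reference, t_device):
    """Assumed definition: how many times faster the device is than the reference."""
    return t_reference / t_device


if __name__ == "__main__":
    estimate = parallel_pi(size=4, n=100_000)
    print(f"pi ~ {estimate:.10f}, error {abs(estimate - math.pi):.2e}")
```

The only exchanged data is one floating-point partial sum per worker, so the communication cost stays constant as `n` grows.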

