11

Aspects of the efficient implementation of the message passing interface (MPI)

Träff, Jesper January 2009 (has links)
Also published as: Copenhagen, University, habilitation thesis, 2009
12

Uso de auto-tuning para otimização de decomposição de domínios paralela / Optimizing parallel domain decomposition using auto-tuning

Almeida, Alexandre Vinicius January 2011 (has links)
Achieving performance close to the theoretical peak of a given platform requires technical knowledge of the hardware environment, since the software must exploit details specific to that platform. Because such software is platform-specific, the optimizations may no longer exploit the architecture efficiently if the hardware evolves or changes. This performance-portability problem is addressed by software auto-tuning, which emerged over the past decade as an automated technique for adapting software to an underlying architecture. The adaptation is performed by an auto-tuner: an entity that empirically searches for optimal values of application-specific parameters, adjusting them to the characteristics of the hardware, or that generates source code optimized for the target platform. This dissertation proposes an auto-tuner module for the parameterized adaptation of a parallel application that performs stencil computations. The auto-tuner varies the dimensions of the 2D domain, the number of parallel processes, and the width of the overlapping zones between subdomains; for each combination of parameter values, it probes the application on the parallel architecture in search of the best-performing combination. To make the auto-tuning possible, a C++ class called Mesh, based on the Message Passing Interface (MPI) standard, was developed. The class abstracts the domain decomposition of a parallel application by means of object orientation and makes it easy to vary the width of the overlapping zones between subdomains. The experimental results showed that the performance gains were mainly due to varying the number of processes, one of the factors handled by the auto-tuner. The parallel architecture used in the experiments proved unsuitable for optimization by widening the overlapping zones between subdomains.
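The abstract does not reproduce the Mesh class itself. As a rough illustration of the tunable parameter it describes, the sketch below times halo exchanges on a 1D row decomposition of a 2D grid while varying the ghost-zone width — the kind of per-configuration measurement an auto-tuner could take. All names and sizes (widths, NX, the step count) are hypothetical, and the grid height is assumed divisible by the process count.

```cpp
// Hypothetical sketch: timing halo exchanges of varying overlap (ghost-zone)
// width on a 1D row decomposition of a 2D grid. Not the actual Mesh class.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int NX = 1024, rows = 1024 / size;  // local rows (assumes divisibility)
    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    const int widths[] = {1, 2, 4, 8};        // candidate ghost-zone widths
    for (int overlap : widths) {
        std::vector<double> grid((rows + 2 * overlap) * NX, 1.0);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int step = 0; step < 100; ++step) {
            // Exchange 'overlap' boundary rows with each neighbor.
            MPI_Sendrecv(&grid[overlap * NX], overlap * NX, MPI_DOUBLE, up, 0,
                         &grid[(rows + overlap) * NX], overlap * NX, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&grid[rows * NX], overlap * NX, MPI_DOUBLE, down, 1,
                         &grid[0], overlap * NX, MPI_DOUBLE, up, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // A wider overlap permits 'overlap' stencil sweeps per exchange (omitted).
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0) std::printf("overlap=%d time=%.4fs\n", overlap, dt);
    }
    MPI_Finalize();
    return 0;
}
```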
14

Paralelização de programas sisal para sistemas MPI / Parallelization of sisal programs for MPI systems

Raul Junji Nakashima 15 March 1996 (has links)
This work implemented a method for the partial parallelization of programs written in the functional language SISAL, using libraries of the MPI (Message Passing Interface) standard. To that end, SISAL programs are transformed by partitioning the parallel forall loop with the slice partitioning method, adopting the SPMD (Single Program Multiple Data) implementation model in a master/slave style. The proposal was validated by compiling several simple SISAL programs and comparing their results with those of the unmodified versions.
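A minimal sketch of the slice partitioning described above: the index range of a forall is divided into contiguous slices, one per MPI rank, and the partial results are gathered back (shown here in plain SPMD form rather than the thesis's master/slave organization; all names are hypothetical).

```cpp
// Hypothetical sketch: slicing a forall index range [0, N) across MPI ranks
// and gathering the variable-length partial results on rank 0.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1000;
    // Contiguous slice for this rank; the first N % size ranks get one extra element.
    int base = N / size, rem = N % size;
    int lo  = rank * base + (rank < rem ? rank : rem);
    int len = base + (rank < rem ? 1 : 0);

    std::vector<double> local(len);
    for (int i = 0; i < len; ++i)            // the body of the original forall
        local[i] = double(lo + i) * double(lo + i);

    // Gather slice lengths, compute displacements, then gather the slices.
    std::vector<int> counts(size), displs(size);
    MPI_Gather(&len, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        displs[0] = 0;
        for (int r = 1; r < size; ++r) displs[r] = displs[r - 1] + counts[r - 1];
    }
    std::vector<double> result(rank == 0 ? N : 0);
    MPI_Gatherv(local.data(), len, MPI_DOUBLE, result.data(), counts.data(),
                displs.data(), MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("result[N-1] = %.0f\n", result[N - 1]);
    MPI_Finalize();
    return 0;
}
```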
15

Principal Design Criteria Influencing the Performance of a Portable, High Performance Parallel I/O Implementation

Rajaram, Kumaran 11 May 2002 (has links)
MPI-IO, the parallel I/O functionality of MPI-2, is a portable interface designed specifically to achieve high performance. This thesis proposes fundamental design criteria that influence the performance of portable, high-performance I/O middleware. It hypothesizes that overlapping I/O with computation, and agglomerating I/O requests according to an application's access pattern, improve the performance of a portable parallel I/O implementation. The work included the development of MercutIO, a complete, portable, high-performance MPI-IO implementation. MercutIO achieves portability through the Bulldog Abstract File System, a portable, efficient non-collective I/O interface also developed in this thesis. A new data-access model based on non-blocking semantics is presented, along with two new I/O metrics (degree of overlapping and degree of non-contiguity) and parallel I/O benchmarks essential to the performance appraisal of a parallel I/O implementation.
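The overlap-of-I/O-and-computation idea can be illustrated with standard non-blocking MPI-IO calls. This is generic MPI-2 usage, not MercutIO's internals; the file name and sizes are hypothetical.

```cpp
// Sketch of overlapping file I/O with computation using non-blocking MPI-IO.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    std::vector<double> buf(N, rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    // Start the write at this rank's disjoint file region, then keep computing.
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_Request req;
    MPI_File_iwrite_at(fh, offset, buf.data(), N, MPI_DOUBLE, &req);

    double acc = 0.0;                        // computation overlapped with the write
    for (int i = 0; i < N; ++i) acc += i * 1e-9;

    MPI_Wait(&req, MPI_STATUS_IGNORE);       // complete the I/O before reusing buf
    MPI_File_close(&fh);
    if (rank == 0) std::printf("checksum %.3f\n", acc);
    MPI_Finalize();
    return 0;
}
```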
16

Adapting Remote Direct Memory Access Based File System to Parallel Input-/Output

Velusamy, Vijay 13 December 2003 (has links)
Traditional file-access interfaces rely on ubiquitous transports that impose severe restrictions on performance and prove insufficient for adaptation to parallel input/output (I/O). Remote Direct Memory Access (RDMA) based approaches aim to move data between different process address spaces with streamlined mediation and reduced operating-system involvement, using synchronization semantics different from those of ubiquitous transports. This thesis studies the adaptability of RDMA-based transports to parallel I/O. Combining RDMA semantics with parallel I/O reduces overhead by overlapping communication with computation and by enhancing bandwidth. Although parallel I/O tends to increase latency in certain cases, RDMA techniques mitigate this effect.
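The thesis's RDMA transport layer is not shown in the abstract. As an analogy only, MPI's one-sided operations expose the same put-into-remote-memory semantics — data lands in the target's address space without a matching receive call:

```cpp
// Analogy sketch: RDMA-style data movement via MPI one-sided communication.
// (The thesis targets RDMA transports directly; this is not its actual code.)
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 4;
    double local[N];                          // memory exposed for remote access
    for (int i = 0; i < N; ++i) local[i] = -1.0;
    double payload[N] = {1, 2, 3, 4};

    MPI_Win win;
    MPI_Win_create(local, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                    // open an access epoch
    if (rank == 0 && size > 1)
        // Deposit data directly into rank 1's window: no receive call on rank 1.
        MPI_Put(payload, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                    // synchronization completes the put

    if (rank == 1) std::printf("rank 1 got %.0f %.0f %.0f %.0f\n",
                               local[0], local[1], local[2], local[3]);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```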
17

A versatile parallel Lanczos eigensolver solution for MPI compatible AMLS

Haben, Joshua D. 03 September 2009 (has links)
PARPACK is an open-source Arnoldi/Lanczos eigensolver package which is compatible with a number of distributed parallel computing schemes. This thesis concentrates on a set of driver routines for PARPACK that were developed for use in the Automated Multilevel Substructuring (AMLS) vibration analysis software package developed at The University of Texas at Austin. AMLS requires many truncated eigensolutions to symmetric generalized algebraic eigenvalue problems. There is a need in AMLS to solve these problems in several different computing regimes, from serial execution on a single processor, to parallel execution on multiple nodes of a distributed computing cluster. This work is designed to enable evaluation, selection, and development of PARPACK capabilities for the variety of eigensolutions required by AMLS.
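In standard notation (not taken from the thesis), the truncated eigensolutions AMLS requires have the form below: the lowest few eigenpairs of a symmetric generalized eigenproblem, with K symmetric and M symmetric positive definite.

```latex
% Symmetric generalized eigenproblem, truncated to the n_ev lowest modes
% below a cutoff frequency (standard notation, assumed here):
\[
  K\,\phi_j = \lambda_j\, M\,\phi_j , \qquad
  \lambda_1 \le \lambda_2 \le \cdots \le \lambda_{n_{\mathrm{ev}}} \le \lambda_{\mathrm{cut}} .
\]
```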
18

Performance Oriented Partial Checkpoint and Migration of LAM/MPI Applications

Singh, Rajendra 21 January 2011 (has links)
In parallel computing, MPI is heavily used due to its support of popular cluster based parallel machines and the Single Program Multiple Data (SPMD) model. Normally cluster nodes are dedicated to a single parallel job/application but MPI could also be used with nodes that are concurrently shared by multiple users. In this case, nodes could become overloaded with work from other users. Even a few overloaded nodes can result in application slowdown. Thus, it is desirable to relocate affected processes in a running application to lightly loaded nodes by partial checkpointing and migrating of those processes. In some MPI applications, groups of processes communicate frequently with one another. Such groups must be near one another to ensure communication efficiency. Thus, if any member of a group is to be checkpointed and migrated, all should be. It must therefore be possible to identify such groups. I have built a prototype, using LAM/MPI, that supports partial checkpoint, migration and restart of MPI processes. To identify process groups for checkpoint and migration, I adapted TEIRESIAS (an algorithm for pattern discovery from bioinformatics) to identify frequent, recurring patterns of communication using data gathered by LAM/MPI. I then created predictors that use the discovered patterns to predict groups of communicating processes that should be checkpointed and migrated together. I have assessed the effectiveness of my technique using synthetic and real communication data (for a small set of representative applications) to show that my predictors can accurately predict process groups for those applications. Additionally, I have created a simple simulation system to allow me to explore scenarios related to network characteristics and overload conditions under which my system might provide useful speedup. Not all MPI applications will benefit from my approach (e.g. those with unpredictable communication patterns or large groups of frequently communicating processes). However, my experimental and simulation results suggest that my technique should be effective for a number of common application types, network characteristics and overload conditions. Using partial checkpoint and migration should therefore allow many long running applications to finish faster than if a subset of their processes was left running on overloaded nodes.
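TEIRESIAS itself is an involved pattern-discovery algorithm; as a deliberately simplified stand-in, the sketch below merges processes whose pairwise message counts in a trace exceed a threshold into one group via union-find — roughly the kind of "checkpoint and migrate together" grouping the predictors output. The names, trace, and threshold are all hypothetical.

```cpp
// Simplified stand-in for the thesis's TEIRESIAS-based group discovery:
// processes whose pairwise message count exceeds a threshold are merged
// into one migration group via union-find. (Hypothetical illustration.)
#include <cstdio>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

int main() {
    const int nprocs = 6, threshold = 50;     // hypothetical values
    // Trace of (src, dst) -> message count, e.g. gathered by LAM/MPI tracing.
    std::map<std::pair<int,int>, int> counts = {
        {{0,1}, 120}, {{1,2}, 95}, {{3,4}, 80}, {{0,5}, 3}, {{2,4}, 10}};

    UnionFind uf(nprocs);
    for (const auto& [pair, n] : counts)
        if (n >= threshold) uf.unite(pair.first, pair.second);  // frequent pair => same group

    // Report groups: members sharing a union-find root migrate together.
    for (int root = 0; root < nprocs; ++root) {
        bool first = true;
        for (int p = 0; p < nprocs; ++p)
            if (uf.find(p) == root) { std::printf(first ? "group: %d" : " %d", p); first = false; }
        if (!first) std::printf("\n");
    }
    return 0;
}
```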
20

Paralelizações de métodos numéricos em clusters empregando as bibliotecas MPICH, DECK e Pthread / Parallelization of numerical methods on clusters using the MPICH, DECK and Pthread libraries

Picinin Júnior, Delcino January 2003 (has links)
This work develops and applies techniques and data structures for parallelizing Krylov subspace methods, using several tools and approaches, and presents a comparative performance analysis of those tools and approaches. The parallelizations were designed to run on an architecture formed by a cluster of independent multiprocessor machines, so both inter-node and intra-node parallelism are considered. To support parallel programming on clusters, different libraries have been, and are still being, developed to exploit the two levels of parallelism present in this kind of architecture. This work employs message-passing libraries for the inter-node parallelism and a threading library for the intra-node parallelism: DECK, MPICH, and Pthreads. One item analyzed is the comparison of the performance obtained with these libraries; another is the influence on performance of using multiple threads on multiprocessor cluster nodes. The methods parallelized in this work are the Conjugate Gradient (CG) and the Generalized Minimal Residual (GMRES), which can be adopted, respectively, for solving symmetric positive-definite and nonsymmetric systems of linear equations. Such systems arise, for example, from the discretization of the hydrodynamics and mass-transport models under development at GMCPAD. The use of these methods is justified by the fact that they are iterative, which makes them well suited to solving large, sparse systems of equations. Solving such systems with these parallelized iterative methods requires partitioning the problem domain, which should be done so as to balance the load and minimize the boundaries between subdomains. The data structure developed for the parallelized methods allows them to be adopted for systems of equations generated from any kind of partitioning, since the adopted storage format accommodates any data-dependency pattern. In addition, two communication-ordering strategies are adopted; these can be important for the portability of the parallelizations to machines interconnected by networks whose buffers are too small to prevent deadlock. The results obtained in this dissertation contribute to the work of GMCPAD, since the parallelizations are adopted in applications being developed by the group.
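The inter-node parallelism in these Krylov methods comes down to a few global reductions per iteration. Below is a minimal sketch of a row-distributed Conjugate Gradient in which the dot products become MPI_Allreduce calls; a diagonal SPD matrix stands in for the sparse operator so the mat-vec needs no neighbor exchange, and the thesis's actual data structures, communication orderings, and DECK/Pthread variants are not shown.

```cpp
// Minimal row-distributed Conjugate Gradient sketch: each rank owns a block
// of rows; dot products become MPI_Allreduce calls. A diagonal SPD matrix
// stands in for the sparse operator, so the mat-vec is purely local.
#include <mpi.h>
#include <cstdio>
#include <vector>

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double local = 0.0, global = 0.0;
    for (size_t i = 0; i < a.size(); ++i) local += a[i] * b[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;                            // one global reduction per dot product
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nloc = 1000;                    // rows owned by this rank
    std::vector<double> A(nloc), b(nloc, 1.0), x(nloc, 0.0);
    for (int i = 0; i < nloc; ++i) A[i] = 2.0 + (rank * nloc + i) % 5;  // SPD diagonal

    std::vector<double> r = b, p = r, Ap(nloc);
    double rs = dot(r, r);
    for (int it = 0; it < 200 && rs > 1e-20; ++it) {
        for (int i = 0; i < nloc; ++i) Ap[i] = A[i] * p[i];   // local mat-vec
        double alpha = rs / dot(p, Ap);
        for (int i = 0; i < nloc; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rs_new = dot(r, r);
        for (int i = 0; i < nloc; ++i) p[i] = r[i] + (rs_new / rs) * p[i];
        rs = rs_new;
    }
    if (rank == 0) std::printf("final residual norm^2 = %.3e\n", rs);
    MPI_Finalize();
    return 0;
}
```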
