131

Schémas d'adaptations algorithmiques sur les nouveaux supports d'éxécution parallèles / Algorithmic adaptations schemas on the new parallel platforms

Achour, Sami 06 July 2013 (has links)
With the multitude of emerging parallel platforms characterized by their heterogeneity in terms of hardware components (processors, networks, ...), the development of efficient applications and parallel libraries has become a challenge. A method that has proved suitable for meeting this challenge is the adaptive approach, which uses several parameters (architectural, algorithmic, ...) in order to optimize the execution of the application on the target platform. Applications adopting this approach must take advantage of performance-modeling methods to choose between the alternatives available to them (algorithms, implementations, or schedules). The use of these modeling methods in adaptive applications must obey the constraints imposed by this context, namely prediction speed and accuracy. In this work we first propose a framework for developing adaptive parallel applications based on theoretical performance modeling. We then focus on the task of performance prediction for parallel and hierarchical environments, proposing a framework that combines different performance-modeling methods (analytical, experimental, and simulation) to guarantee a balance between the constraints above. This framework takes advantage of the installation phase of the parallel application to discover the execution platform and the traces of the application, in order to model the behavior of its two main components, namely computation kernels and point-to-point communications. For the modeling of these components, we have developed several methods based on experiments and polynomial regression that provide accurate models. The models resulting from the installation phase are then used at runtime by our performance-prediction tool for MPI programs (MPI-PERF-SIM) to predict their behavior. This last framework is validated separately for the different modules, then globally on the matrix-product kernel.
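The abstract does not include the thesis's code. As a rough C/MPI illustration of the experiment-plus-regression idea it describes, the sketch below times ping-pong exchanges at several message sizes and fits a degree-1 model time(n) ≈ a + b·n by ordinary least squares. The variable names and the choice of a first-degree polynomial are our own assumptions, not part of MPI-PERF-SIM.

```c
/* Hypothetical sketch: calibrate a point-to-point communication model
 * time(n) = a + b*n from ping-pong timings (not the thesis's code).
 * Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int sizes[] = {1024, 4096, 16384, 65536, 262144};
    const int nsizes = 5, reps = 100;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    char *buf = malloc(262144);

    for (int i = 0; i < nsizes; i++) {
        int n = sizes[i];
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps); /* one-way time */
        if (rank == 0) {                              /* least-squares sums */
            sx += n; sy += t; sxx += (double)n * n; sxy += (double)n * t;
        }
    }
    if (rank == 0) {
        double b = (nsizes * sxy - sx * sy) / (nsizes * sxx - sx * sx);
        double a = (sy - b * sx) / nsizes;
        printf("model: time(n) ~ %.3e + %.3e * n seconds\n", a, b);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```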
132

Performance Evaluation and Prediction of Parallel Applications / Évaluation et prédiction de performance d'applications parallèles

Markomanolis, Georgios 20 January 2014 (has links)
Analyzing and understanding the performance behavior of parallel applications on various compute infrastructures is a long-standing concern in the High Performance Computing community. When the targeted execution environments are not available, simulation is a reasonable approach to obtain objective performance indicators and explore various "what-if?" scenarios. In this work we present a framework for the off-line simulation of MPI applications. The main originality of our work with regard to the literature is that it relies on time-independent execution traces. This allows for extreme scalability, as heterogeneous and distributed resources can be used to acquire a trace. We propose a format in which, for each event that occurs during the execution of an application, we log the volume of instructions for a computation phase or the bytes and type of a communication. To acquire time-independent traces of the execution of MPI applications, we have to instrument them to log the required data. Many profiling tools can instrument an application. We propose a scoring system that corresponds to our framework's specific requirements and evaluate the most well-known open-source profiling tools according to it. Furthermore, we introduce an original tool called Minimal Instrumentation that was designed to fulfill the requirements of our framework. We study different instrumentation methods and investigate several acquisition strategies. We detail the tools that extract time-independent traces from the instrumentation traces of some well-known profiling tools. Finally, we evaluate the whole acquisition procedure and present the acquisition of large-scale instances. We describe in detail the procedure for providing a realistic simulated-platform file to our trace-replay tool, taking into consideration the topology of the real platform and the calibration procedure with regard to the application to be simulated. Moreover, we present the trace-replay tools implemented and used during this work. We show that our simulator can predict the performance of some MPI benchmarks with less than 11% relative error between the real execution and the simulation in the cases where there is no performance issue. Finally, we identify the reasons for the performance issues and propose solutions.
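Time-independent tracing needs per-event records of computation volume and communication bytes. One standard way to capture the communication side, sketched below for a C MPI application, is the MPI profiling interface (PMPI), which lets a wrapper intercept MPI_Send, record the byte count and destination, and then forward the call. This is only an illustration of the mechanism, not the thesis's Minimal Instrumentation tool.

```c
/* Illustrative PMPI wrapper: log bytes and destination of each MPI_Send.
 * Link this object before the MPI library; not the thesis's actual tool. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    int size, rank;
    PMPI_Type_size(datatype, &size);   /* bytes per element */
    PMPI_Comm_rank(comm, &rank);
    fprintf(stderr, "rank %d: send %ld bytes to %d\n",
            rank, (long)count * size, dest);
    return PMPI_Send(buf, count, datatype, dest, tag, comm); /* real call */
}
```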
133

Análise de benefícios do paralelismo por comunicação unilateral em aplicações com grades não estruturadas / Improvement analysis of parallelism by one-sided communication on unstructured grids applications

Pedro Pais Lopes 03 September 2010 (has links)
Parallel computing is used to solve many scientific problems that demand intensive computing power. Recently a new paradigm of communication between instances of a parallel program has appeared, called one-sided communication (OSC), in which only one instance performs the information-transfer operation, which takes place in the background, as opposed to two-sided communication (TSC), in which one instance sends the information and the other receives it. In this context we analyze the benefits that OSC brings to the parallelism of a program that uses an unstructured grid in memory. Two OSC implementations were used: the Message Passing Interface (MPI) library (specifically the part of the standard that describes OSC), and Coarray Fortran (CAF), an extension of the Fortran language. The semantics of MPI OSC is more complex than that of CAF, but the semantics of CAF is more restrictive. To analyze the semantics and performance of OSC, a simple program called the Game of Life was used on a structured grid, which achieves very good parallel performance with MPI TSC. The MPI OSC version was verbose (increasing the number of lines of code) and complex, especially when using fine-grained synchronization control between the parallel instances. CAF, on the other hand, reduced the number of lines of code (by 20% to 40%), and its synchronization is much simpler. The results showed worse performance for MPI OSC, but for CAF the absolute performance was better than the original implementation up to the number of processor cores that share the same memory. For unstructured grids we used the Ocean Land Atmospheric Model (OLAM), an earth-system simulation model with a grid based on triangular prisms, parallelized with MPI TSC. The implementation with MPI OSC showed that this semantics has some points that can hinder the coding of the communication structure, such as the treatment of memory exposure (each instance exposes a memory region of a different size) and the way synchronization among instances is handled. In terms of performance, the speedup curves showed that MPI OSC penalized OLAM regardless of the MPI implementation or the equipment used, with a reduction of at least 20% in speedup for seven or more processors. As in the Game of Life, MPI one-sided communication degraded performance.
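For readers unfamiliar with the two models, the sketch below shows the basic shape of MPI one-sided communication in C: a window exposes memory, and one rank transfers data with MPI_Put between fence synchronizations, with no matching receive on the target. It is a generic MPI-2 illustration, not code from the thesis.

```c
/* Minimal MPI one-sided example: rank 0 puts a value into rank 1's
 * window; rank 1 never posts a receive. Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = 0;                        /* memory exposed by every rank */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int value = 42;                       /* source buffer on the origin */
    MPI_Win_fence(0, win);                /* open access epoch */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1 /* target */, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                /* close epoch: transfer complete */

    if (rank == 1)
        printf("rank 1 received %d without calling MPI_Recv\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```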
134

A Java Founded LOIS-framework and the Message Passing Interface? : An Exploratory Case Study

Strand, Christian January 2006 (has links)
In this thesis project we have successfully added an MPI extension layer to the LOIS framework. The framework defines an infrastructure for executing and connecting continuous stream-processing applications. The MPI extension delivers the same amount of stream-based data as the framework's original transport. We assert that an MPI-2-compatible implementation is a candidate for extending the given framework with an adaptive and flexible communication subsystem. Adaptability is required because the communication subsystem has to be resilient to changes, whether due to optimizations or to system requirements.
135

Programmation haute performance pour architectures hybrides / High Performance Programming for Hybrid Architectures

Habel, Rachid 19 November 2014 (has links)
Clusters of multicore/GPU nodes connected by a fast network offer very high theoretical peak performance, reaching tens of TeraFlops. Unfortunately, programming such architectures efficiently remains challenging because of their complexity and the diversity of the existing programming models. The purpose of this thesis is to improve the programmability of dense scientific applications on hybrid architectures in three ways: reducing execution times, processing larger data sets, and reducing the programming effort. We propose DSTEP, a directive-based programming model expressing both data and computation distribution. A large set of distribution types is unified in a "dstep distribute" directive, and the replication of some distributed elements can be expressed using a "halo". The "dstep gridify" directive expresses both the distribution of computations and the scheduling constraints of loop iterations. We define a distribution model and demonstrate the correctness of the code transformation from the sequential domain to the distributed domain. From the distribution model, we derive a generic compilation scheme transforming DSTEP-annotated input programs into hybrid parallel ones. We have implemented such a tool as a compiler integrated into the PIPS compilation workbench, together with a library offering the runtime functionality, especially the communication. Our solution is validated on scientific programs from the NAS Parallel Benchmarks and PolyBench, as well as on an industrial signal-processing application.
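The abstract names the directives but not their concrete grammar. Purely as a speculative illustration of what directive-based data and computation distribution of this kind can look like, here is a C stencil kernel annotated in an OpenMP-like pragma style; the actual DSTEP syntax, clause names, and semantics are defined in the thesis, not here.

```c
/* Hypothetical DSTEP-style annotations -- invented syntax, for
 * illustration only; the real directive grammar is in the thesis. */
#define N 1024
double a[N], b[N];

void relax(void) {
    /* assumed: block-distribute both arrays, replicating a 1-wide halo
     * of `a` so each piece can read its neighbors' border elements */
    #pragma dstep distribute a(block) halo(1)
    #pragma dstep distribute b(block)

    /* assumed: distribute the iterations to match the data distribution
     * and let the compiler derive the required halo exchanges */
    #pragma dstep gridify(i)
    for (int i = 1; i < N - 1; i++)
        b[i] = 0.5 * (a[i - 1] + a[i + 1]);
}
```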
136

A Survey of Barrier Algorithms for Coarse Grained Supercomputers

Hoefler, Torsten, Mehlan, Torsten, Mietke, Frank, Rehm, Wolfgang 28 June 2005 (has links) (PDF)
Several different algorithms are available to synchronize multiple processors. Some of them support only shared-memory architectures or very fine-grained supercomputers. This work gives an overview of all currently known algorithms suitable for distributed shared-memory architectures and message-passing-based computer systems (loosely coupled or coarse-grained supercomputers). No absolute choice of barrier algorithm can be made for a given machine; several architectural aspects have to be taken into account. The overview of known barrier algorithms given in this work is mostly targeted at implementors of libraries supporting collective communication (such as MPI).
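One classic algorithm from this design space, well suited to message-passing systems, is the dissemination barrier: in round k each process signals the peer 2^k positions away, so all processes synchronize in ceil(log2 P) rounds. The sketch below implements it over MPI point-to-point operations; it is a textbook illustration, not code from the survey.

```c
/* Dissemination barrier over MPI point-to-point (textbook sketch). */
#include <mpi.h>

void dissemination_barrier(MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int dist = 1; dist < size; dist <<= 1) {  /* rounds: 1,2,4,... */
        int to   = (rank + dist) % size;
        int from = (rank - dist + size) % size;
        char token = 0;
        /* exchange empty messages with this round's partners */
        MPI_Sendrecv(&token, 1, MPI_CHAR, to, 99,
                     &token, 1, MPI_CHAR, from, 99,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    dissemination_barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```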
137

Improving memory consumption and performance scalability of HPC applications with multi-threaded network communications / Amélioration de la consommation mémoire et de l'extensibilité des performances des applications HPC par le multi-threading des communications réseaux

Didelot, Sylvain 12 June 2014 (has links)
A recent trend in high-performance computing is a rising number of cores per compute node, while the total amount of memory per node remains constant. To scale parallel applications on such large machines, one of the major challenges is to keep memory consumption low. This thesis develops a multi-threaded communication layer over InfiniBand which provides both good communication performance and low memory consumption. We target scientific applications parallelized using the MPI standard, either alone or combined with a shared-memory programming model. Starting from the observation that network endpoints and communication buffers are critical for the scalability of MPI runtimes, the first contribution proposes three approaches to control their usage. We introduce a scalable, fully-connected virtual topology for connection-oriented high-speed networks. In the context of multirail configurations, we then detail a runtime technique which reduces the number of network connections. We finally present a protocol for dynamically resizing network buffers over the RDMA technology. The second contribution proposes a runtime optimization that enforces the overlap potential of MPI communications, showing a 2x improvement on communications. The third contribution evaluates the performance of several MPI runtimes running a seismic-modeling application in a hybrid context. On large compute nodes with up to 128 cores, introducing OpenMP into the MPI application saves up to 17% of memory. Moreover, our multi-threaded communication layer improves execution time when several OpenMP threads concurrently participate in the MPI communications.
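The hybrid setting described here relies on an MPI library that allows several threads to call MPI concurrently, negotiated through the MPI_THREAD_MULTIPLE support level. A minimal MPI+OpenMP sketch of that pattern is shown below; it is a generic illustration, not the thesis's communication layer.

```c
/* Hybrid MPI + OpenMP sketch: several threads issue MPI calls at once.
 * Requires an MPI library providing MPI_THREAD_MULTIPLE, and assumes
 * the same thread count on every rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int msg = rank * 100 + tid, recvd;
        /* per-thread tags keep the concurrent message streams apart */
        MPI_Sendrecv(&msg, 1, MPI_INT, (rank + 1) % size, tid,
                     &recvd, 1, MPI_INT, (rank - 1 + size) % size, tid,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```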
138

Avaliação da efetividade dos critérios de teste estruturais no contexto de programas concorrentes / Investigating of the structural testing effectiveness in context of concurrent programs

Maria Adelina Silva Brito 24 November 2011 (has links)
Software engineering has developed techniques and methods to support the development of reliable, flexible software with low development cost and easy maintenance. Testing techniques and criteria contribute in this direction, providing mechanisms to produce high-quality software. This work presents an experimental study to evaluate the cost, effectiveness, and strength of the structural testing criteria proposed for concurrent programs. The evaluation was conducted using eight programs implemented in MPI and the ValiMPI testing tool. Based on a fault taxonomy for concurrent programs, defects were injected into these programs in order to evaluate the effectiveness of the testing criteria at revealing them. The results demonstrate the complementary nature of the criteria, and the information on cost and effectiveness contributed to establishing an incremental testing strategy for applying the criteria with a good cost-effectiveness ratio. In conclusion, the results indicate that the structural testing criteria proposed for MPI concurrent programs are promising and can help detect defects in these applications, improving their quality.
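The abstract does not list the injected defects. As a generic illustration of the kind of communication fault such taxonomies cover, the fragment below contains a message race: the master receives with MPI_ANY_SOURCE, so results are stored in arrival order rather than by worker rank. Structural criteria for communication coverage aim to exercise exactly this nondeterminism. The example is ours, not from the study.

```c
/* Illustrative concurrency defect (not from the study): results are
 * stored by arrival order, not by sender -- a message race.
 * Assumes fewer than 64 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int results[64] = {0};
        for (int i = 1; i < size; i++) {
            int value;
            /* DEFECT: MPI_ANY_SOURCE plus slot i couples the slot to
             * arrival order; should use status.MPI_SOURCE instead */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            results[i] = value;
        }
        printf("results[1] = %d (nondeterministic)\n", results[1]);
    } else {
        int value = rank * rank;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```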
139

Efektivní komunikace v multi-GPU systémech / Efficient Communication in Multi-GPU Systems

Špeťko, Matej January 2018 (has links)
After the introduction of CUDA by Nvidia, GPUs became devices capable of accelerating any general-purpose computation. GPUs are designed as parallel processors with huge computational power, and modern supercomputers are often equipped with GPU accelerators. Sometimes the performance of a single GPU is not enough for a scientific application, which then needs to scale over multiple GPUs. During the computation, the GPUs need to exchange partial results. This communication represents overhead, so it is important to research methods for efficient communication between GPUs, meaning less CPU involvement, lower latency, and shared system buffers. This thesis focuses on inter-node and intra-node GPU-to-GPU communication using Nvidia's GPUDirect technologies and CUDA-aware MPI. Subsequently, the k-Wave toolbox for simulating the propagation of acoustic waves is introduced. This application is accelerated using CUDA-aware MPI, and peer-to-peer transfer support is integrated into k-Wave using CUDA inter-process communication.
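With a CUDA-aware MPI library, device pointers can be passed directly to MPI calls, letting the library use GPUDirect paths instead of staging data through host memory. A minimal sketch of the pattern follows; it is a generic illustration, not k-Wave code, and assumes an MPI build with CUDA support.

```c
/* CUDA-aware MPI sketch: device buffers passed straight to MPI calls.
 * Requires an MPI build with CUDA support; run with at least 2 ranks. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    if (rank == 0) {
        /* device pointer given directly to MPI; with GPUDirect RDMA the
         * transfer can bypass host memory entirely */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1: received %d floats into device memory\n", n);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```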
140

Fast Barrier Synchronization for InfiniBand

Hoefler, Torsten 04 January 2006 (has links)
Barrier synchronization is crucial for many parallel systems. This talk introduces different synchronization mechanisms and demonstrates new approaches that leverage special hardware properties of InfiniBand to lower barrier latency.
