Global ETD Search

161	Résolution de systèmes linéaires et non linéaires creux sur grappes de GPUs / Solving sparse linear and nonlinear systems on GPU clusters Ziane Khodja, Lilia 07 June 2013 Depuis quelques années, les grappes équipées de processeurs graphiques GPUs sont devenues des outils très attrayants pour le calcul parallèle haute performance. Dans cette thèse, nous avons conçu des algorithmes itératifs parallèles pour la résolution de systèmes linéaires et non linéaires creux de très grandes tailles sur grappes de GPUs. Dans un premier temps, nous nous sommes focalisés sur la résolution de systèmes linéaires creux à l'aide des méthodes itératives CG et GMRES. Les expérimentations ont montré qu'une grappe de GPUs est plus performante que son homologue grappe de CPUs pour la résolution de systèmes linéaires de très grandes tailles. Ensuite, nous avons mis en œuvre des algorithmes parallèles synchrones et asynchrones des méthodes itératives Richardson et de relaxation par blocs pour la résolution de systèmes non linéaires creux. Nous avons constaté que les meilleures solutions développées pour les CPUs ne sont pas nécessairement bien adaptées aux GPUs. En effet, les simulations effectuées sur une grappe de GPUs ont montré que les algorithmes Richardson sont largement plus efficaces que ceux de relaxation par blocs. De plus, elles ont aussi montré que la puissance de calcul des GPUs permet de réduire le rapport entre le temps d'exécution et celui de communication, ce qui favorise l'utilisation des algorithmes asynchrones sur des grappes de GPUs. Enfin, nous nous sommes intéressés aux grappes géographiquement distantes pour la résolution de systèmes linéaires creux. Dans ce contexte, nous avons utilisé la méthode de multi-décomposition à deux niveaux avec GMRES parallèle adaptée aux grappes de GPUs. 
Celle-ci utilise des itérations synchrones pour résoudre localement les sous-systèmes linéaires et des itérations asynchrones pour résoudre la globalité du système linéaire. / Over the past few years, clusters equipped with GPUs have become attractive tools for high performance computing. In this thesis, we have designed parallel iterative algorithms for solving large sparse linear and nonlinear systems on GPU clusters. First, we have focused on solving sparse linear systems using the CG and GMRES iterative methods. The experiments have shown that a GPU cluster is more efficient than its pure CPU counterpart for solving large sparse systems of linear equations. Then, we have implemented synchronous and asynchronous algorithms of the Richardson and block relaxation iterative methods for solving sparse nonlinear systems. We have noticed that the best solutions developed for CPUs are not necessarily well suited to GPUs. Indeed, the experiments performed on a GPU cluster have shown that the parallel algorithms of the Richardson method are far more efficient than those of the block relaxation method. In addition, they have shown that the computing power of GPUs reduces the ratio of computation time to communication time, which favors the use of asynchronous iterations on GPU clusters. Finally, we have studied geographically distant clusters for solving large sparse linear systems. In this context, we have used a two-stage multisplitting method with parallel GMRES adapted to GPU clusters. It uses synchronous iterations to solve the linear sub-systems locally and asynchronous iterations to solve the global sparse linear system. Méthodes itératives Parallélisme MPI/CUDA Grappes de GPUs Sparse linear and nonlinear systems Iterative methods MPI/CUDA parallelism GPU clusters 005.1
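For readers unfamiliar with the CG method named in entry 161, the following is a minimal single-process sketch in plain Python, assuming a symmetric positive definite matrix stored in CSR form. It is not the thesis's MPI/CUDA implementation; the function names (csr_matvec, conjugate_gradient) and the 3x3 test matrix are hypothetical and chosen only for illustration.

```python
def csr_matvec(data, indices, indptr, x):
    """Multiply a sparse matrix in CSR form by a dense vector x."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(u, v):
    """Dot product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(data, indices, indptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite CSR matrix A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break          # residual norm is small enough
        Ap = csr_matvec(data, indices, indptr, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical test system: A = [[4,-1,0],[-1,4,-1],[0,-1,4]], exact solution [1, 1, 1].
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
indices = [0, 1, 0, 1, 2, 1, 2]
indptr = [0, 2, 5, 7]
b = [3.0, 2.0, 3.0]
solution = conjugate_gradient(data, indices, indptr, b)
```

In a GPU-cluster setting, the sparse matrix-vector product and the dot products are the parts that would typically be distributed; this sketch keeps them sequential for clarity.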
162	Mouvement de données et placement des tâches pour les communications haute performance sur machines hiérarchiques Moreaud, Stéphanie 12 October 2011 Les architectures des machines de calcul sont de plus en plus complexes et hiérarchiques, avec des processeurs multicœurs, des bancs mémoire distribués, et de multiples bus d'entrées-sorties. Dans le cadre du calcul haute performance, l'efficacité de l'exécution des applications parallèles dépend du coût de communication entre les tâches participantes qui est impacté par l'organisation des ressources, en particulier par les effets NUMA ou de cache. Les travaux de cette thèse visent à l'étude et à l'optimisation des communications haute performance sur les architectures hiérarchiques modernes. Ils consistent tout d'abord en l'évaluation de l'impact de la topologie matérielle sur les performances des mouvements de données, internes aux calculateurs ou au travers de réseaux rapides, et pour différentes stratégies de transfert, types de matériel et plateformes. Dans une optique d'amélioration et de portabilité des performances, nous proposons ensuite de prendre en compte les affinités entre les communications et le matériel au sein des bibliothèques de communication. Ces recherches s'articulent autour de l'adaptation du placement des tâches en fonction des schémas de transfert et de la topologie des calculateurs, ou au contraire autour de l'adaptation des stratégies de mouvement de données à une répartition définie des tâches. Ce travail, intégré aux principales bibliothèques MPI, permet de réduire de façon significative le coût des communications et d'améliorer ainsi les performances applicatives. Les résultats obtenus témoignent de la nécessité de prendre en compte les caractéristiques matérielles des machines modernes pour en exploiter la quintessence. 
/ The emergence of multicore processors has led to increasing complexity inside modern servers, with many cores, distributed memory banks and multiple Input/Output buses. The execution time of parallel applications depends on the efficiency of the communications between computing tasks. On recent architectures, the communication cost is largely impacted by hardware characteristics such as NUMA or cache effects. In this thesis, we propose to study and optimize high performance communication on hierarchical architectures. We first evaluate the impact of hardware affinities on data movement, inside servers or across high-speed networks, and for multiple transfer strategies, technologies and platforms. We then propose to consider affinities between hardware and communicating tasks inside the communication libraries to improve performance and ensure its portability. To do so, we propose adapting the task binding to the transfer method and the topology, or adjusting the data transfer strategies to a given task distribution. Our approaches have been integrated into some of the main MPI implementations. They significantly reduce communication costs and improve overall application performance. These results highlight the importance of considering hardware topology on today's servers. Calcul intensif Communication réseau Mémoire partagée Mpi Multiprocesseur Numa Multicœur Affinité matérielle Topologie High Performance Computing Network communication Shared memory Mpi Multiprocessor Numa Multicore Hardware affinity Topology
163	Scheduling and memory optimizations for sparse direct solver on multi-core/multi-gpu cluster systems / Ordonnancement et optimisations mémoire pour un solveur creux par méthodes directes sur des machines hétérogènes Lacoste, Xavier 18 February 2015 L’évolution courante des machines montre une croissance importante dans le nombre et l’hétérogénéité des unités de calcul. Les développeurs doivent alors trouver des alternatives aux modèles de programmation habituels permettant de produire des codes de calcul à la fois performants et portables. PaStiX est un solveur parallèle de système linéaire creux par méthodes directes. Il utilise un ordonnanceur de tâche dynamique pour être efficace sur les machines modernes multi-coeurs à mémoires hiérarchiques. Dans cette thèse, nous étudions les bénéfices et les limites que peut nous apporter le remplacement de l’ordonnanceur interne, très spécialisé, du solveur PaStiX par deux systèmes d’exécution génériques : PaRSEC et StarPU. Pour cela l’algorithme doit être décrit sous la forme d’un graphe de tâches qui est fourni aux systèmes d’exécution qui peuvent alors calculer une exécution optimisée de celui-ci pour maximiser l’efficacité de l’algorithme sur la machine de calcul visée. Une étude comparative des performances de PaStiX utilisant son ordonnanceur interne, PaRSEC, et StarPU a été menée sur différentes machines et est présentée ici. L’analyse met en évidence les performances comparables des versions utilisant les systèmes d’exécution par rapport à l’ordonnanceur embarqué optimisé pour PaStiX. De plus ces implémentations permettent d’obtenir une accélération notable sur les machines hétérogènes en utilisant les accélérateurs tout en masquant la complexité de leur utilisation au développeur. 
Dans cette thèse nous étudions également la possibilité d’obtenir un solveur distribué de système linéaire creux par méthodes directes efficace sur les machines parallèles hétérogènes en utilisant les systèmes d’exécution à base de tâche. Afin de pouvoir utiliser ces travaux de manière efficace dans des codes parallèles de simulations, nous présentons également une interface distribuée, orientée éléments finis, permettant d’obtenir un assemblage optimisé de la matrice distribuée tout en masquant la complexité liée à la distribution des données à l’utilisateur. / The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this thesis, we study the benefits and the limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. To do so, we have to describe the factorization algorithm as a task graph that we provide to the runtime system. The runtime can then decide how to process and optimize the graph traversal in order to maximize the algorithm's efficiency on the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its original internal scheduler, PaRSEC, and StarPU is performed. The analysis highlights that these generic task-based runtimes achieve results comparable to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. 
In this thesis, we also study the possibility of building a distributed sparse linear solver on top of task-based runtime systems to target heterogeneous clusters. To permit efficient and easy use of these developments in parallel simulations, we also present an optimized distributed interface aimed at hiding the complexity of constructing a distributed matrix from the user. GPU Multi-coeur MPI, Ordonnanceur à base de tâches Sparse direct solver GPU Multi-core MPI Tasks based runtime systems
164	Konstrukční návrh sekvenčního řazení vozidla / Design of Vehicle Sequential Gearbox Šardický, Jakub January 2018 The thesis surveys current sequential gearboxes and covers the structural conversion of the Škoda Felicia 1.6 MPI synchronous transmission into a manual sequential gearbox. Based on the assignment, the schematic principle of the reconstruction is included. This principle is then elaborated into a detailed design, and the thesis focuses on strength analysis of the newly created parts, their optimization, and a check of their fatigue life. The thesis ends with a theoretical continuation of the mechanical conversion into a fully electronic sequential gearbox.
165	Traitement de données multi-spectrales par calcul intensif et applications chez l'homme en imagerie par résonnance magnétique nucléaire / Processing of multi-spectral data by high performance computing and its applications on human nuclear magnetic resonance imaging Angeletti, Mélodie 21 February 2019 L'imagerie par résonance magnétique fonctionnelle (IRMf) étant une technique non invasive pour l'étude du cerveau, elle a été employée pour comprendre les mécanismes cérébraux sous-jacents à la prise alimentaire. Cependant, l'utilisation de stimuli liquides pour simuler la prise alimentaire engendre des difficultés supplémentaires par rapport aux stimulations visuelles habituellement mises en œuvre en IRMf. L'objectif de cette thèse a donc été de proposer une méthode robuste d'analyse des données tenant compte de la spécificité d'une stimulation alimentaire. Pour prendre en compte le mouvement dû à la déglutition, nous proposons une méthode de censure fondée uniquement sur le signal mesuré. Nous avons de plus perfectionné l'étape de normalisation des données afin de réduire la perte de signal. La principale contribution de cette thèse est d'implémenter l'algorithme de Ward de sorte que parcelliser l'ensemble du cerveau soit réalisable en quelques heures et sans avoir à réduire les données au préalable. Comme le calcul de la distance euclidienne entre toutes les paires de signaux des voxels représente une part importante de l'algorithme de Ward, nous proposons un algorithme cache-aware du calcul de la distance ainsi que trois parallélisations sur les architectures suivantes : architecture à mémoire partagée, architecture à mémoire distribuée et GPU NVIDIA. Une fois l'algorithme de Ward exécuté, il est possible d'explorer toutes les échelles de parcellisation. Nous considérons plusieurs critères pour évaluer la qualité de la parcellisation à une échelle donnée. 
À une échelle donnée, nous proposons soit de calculer des cartes de connectivités entre les parcelles, soit d'identifier les parcelles répondant à la stimulation à l'aide du coefficient de corrélation de Pearson. / As a non-invasive technique for studying the brain, functional magnetic resonance imaging (fMRI) has been employed to understand the brain mechanisms underlying food intake. Using liquid stimuli to simulate food intake adds difficulties that are not present in fMRI studies with visual stimuli. This PhD thesis aims to propose a robust method to analyse food-stimulated fMRI data. To correct the data for swallowing movements, we have proposed a censoring method based solely on the measured signal. We have also improved the normalization step of data between subjects to reduce signal loss. The main contribution of this thesis is the implementation of Ward's algorithm without data reduction. Thus, clustering the whole brain in a few hours is now feasible. Because Euclidean distance computation is a major part of Ward's algorithm, we have developed a cache-aware algorithm to compute the distance between the signals of every pair of voxels. Then, we have parallelized this algorithm for three architectures: shared-memory architecture, distributed-memory architecture and NVIDIA GPGPU. Once Ward's algorithm has been applied, it is possible to explore multi-scale clustering of the data. Several criteria are considered in order to evaluate the quality of the clusters. For a given number of clusters, we have proposed to compute connectivity maps between clusters or to use the Pearson correlation coefficient to identify brain regions activated by the stimulation. IRMf alimentaire Parcellisation multi-échelle Algorithme de Ward Distance euclidienne Parallélisation OpenMP MPI CUDA Food fMRI Multi-scale clustering Ward's algorithm Euclidean distance Parallelisation OpenMP MPI CUDA
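The cache-aware distance computation mentioned in entry 165 can be illustrated with a generic blocked (tiled) loop, sketched here in plain Python. This is only an assumption about the general technique, not the thesis's algorithm: the function name pairwise_distances_blocked, the block size, and the toy signals are hypothetical. The idea is that pairs are visited tile by tile, so the signals of one tile can stay in cache while they are reused.

```python
import math


def pairwise_distances_blocked(signals, block=2):
    """Euclidean distance between every pair of signals, visited tile by tile.

    Each (i0, j0) tile only touches `block` rows and `block` columns of the
    input, which is the access pattern a cache-aware implementation exploits.
    """
    n = len(signals)
    dist = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(i0, n, block):          # upper-triangular tiles only
            for i in range(i0, min(i0 + block, n)):
                # inside a diagonal tile, start after i to skip repeated pairs
                for j in range(max(j0, i + 1), min(j0 + block, n)):
                    d = math.sqrt(sum((a - b) ** 2
                                      for a, b in zip(signals[i], signals[j])))
                    dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


# Hypothetical toy "voxel signals" with two time points each.
toy_signals = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
toy_dist = pairwise_distances_blocked(toy_signals, block=2)
```

Because the tiles are independent of each other, they are also a natural unit of work to hand to threads, processes, or GPU thread blocks.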
166	Développements du modèle adjoint de la différentiation algorithmique destinés aux applications intensives en calcul / Extensions of algorithmic differentiation by source transformation inspired by modern scientific computing Taftaf, Ala 17 January 2017 Le mode adjoint de la Différentiation Algorithmique (DA) est particulièrement intéressant pour le calcul des gradients. Cependant, ce mode utilise les valeurs intermédiaires de la simulation d'origine dans l'ordre inverse à un coût qui augmente avec la longueur de la simulation. La DA cherche des stratégies pour réduire ce coût, par exemple en profitant de la structure du programme donné. Dans ce travail, nous considérons d'une part le cas des boucles à point-fixe pour lesquelles plusieurs auteurs ont proposé des stratégies adjointes adaptées. Parmi ces stratégies, nous choisissons celle de B. Christianson. Nous spécifions la méthode choisie et nous décrivons la manière dont nous l'avons implémentée dans l'outil de DA Tapenade. Les expériences sur une application de taille moyenne montrent une réduction importante de la consommation de mémoire. D'autre part, nous étudions le checkpointing dans le cas de programmes parallèles MPI avec des communications point-à-point. Nous proposons des techniques pour appliquer le checkpointing à ces programmes. Nous fournissons des éléments de preuve de correction de nos techniques et nous les expérimentons sur des codes représentatifs. Ce travail a été effectué dans le cadre du projet européen « AboutFlow » / The adjoint mode of Algorithmic Differentiation (AD) is particularly attractive for computing gradients. However, this mode needs to use the intermediate values of the original simulation in reverse order at a cost that increases with the length of the simulation. AD research looks for strategies to reduce this cost, for instance by taking advantage of the structure of the given program. 
In this work, we consider on the one hand the frequent case of fixed-point loops, for which several authors have proposed adapted adjoint strategies. Among these strategies, we select the one introduced by B. Christianson. We further specify the selected method and describe how we implemented it inside the AD tool Tapenade. Experiments on a medium-size application show a major reduction of the memory needed to store trajectories. On the other hand, we study checkpointing in the case of MPI parallel programs with point-to-point communications. We propose techniques to apply checkpointing to these programs. We provide elements of a proof of correctness of our techniques and we test them on representative CFD codes. Différenciation algorithmique Méthode adjointe Algorithmes point-fixe Communication par passage de message MPI Algorithmic differentiation Adjoint methods Fixed-point algorithms Checkpointing Message passing MPI
167	A Survey of Barrier Algorithms for Coarse Grained Supercomputers Hoefler, Torsten, Mehlan, Torsten, Mietke, Frank, Rehm, Wolfgang 28 June 2005 Several different algorithms are available for synchronizing multiple processors. Some of them support only shared memory architectures or very fine grained supercomputers. This work gives an overview of all currently known algorithms that are suitable for distributed shared memory architectures and message passing based computer systems (loosely coupled or coarse grained supercomputers). No absolute decision can be made when choosing a barrier algorithm for a machine; several architectural aspects have to be taken into account. The overview of known barrier algorithms given in this work is mostly aimed at implementors of libraries supporting collective communication (such as MPI). info:eu-repo/classification/ddc/004 ddc:004 MPI <Schnittstelle> Mpi-Sprache Netzwerk <Graphentheorie> Supercomputer Barrier Collective Communication Kollektive Operationen MPI_Barrier
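As an illustration of what a message-passing barrier algorithm looks like, the sketch below simulates the dissemination barrier, a well-known algorithm from the literature, in plain Python. Whether and how entry 167 presents this particular algorithm is an assumption; the function name dissemination_barrier and the set-based bookkeeping are hypothetical and stand in for real messages. In round k, every rank notifies the rank 2^k positions ahead of it, so all ranks learn that everyone has arrived after ceil(log2 P) rounds.

```python
def dissemination_barrier(num_procs):
    """Simulate a dissemination barrier and return (rounds, knowledge).

    knowledge[r] is the set of ranks that rank r knows have reached the
    barrier; the barrier is complete when every set contains all ranks.
    """
    known = [{rank} for rank in range(num_procs)]
    rounds = 0
    distance = 1
    while distance < num_procs:
        # All ranks send simultaneously: rank r receives from rank r - distance.
        received = [known[(rank - distance) % num_procs]
                    for rank in range(num_procs)]
        known = [known[rank] | received[rank] for rank in range(num_procs)]
        distance *= 2
        rounds += 1
    return rounds, known


barrier_rounds, barrier_knowledge = dissemination_barrier(5)
```

The simulation works for any process count, not only powers of two, which is one reason this pattern is often discussed for message-passing systems.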
168	Simulation des Workflows in einer Kooperation Telzer, Martin 19 December 2005 The further civilization advances, the more complex its achievements become. Manufacturing processes also entail complex management during production, since many people and machines are involved in the production process. The manager represents a "single point of failure" here. This means that successful production now depends on the quality and freedom from error of the manager or the supervising staff. To remedy this shortcoming, it is worthwhile to automate certain processes here as well. This achieves a higher degree of correctness and reliability. To realize this, the principles of workflow management are used, among others. The more complex a workflow becomes, the more computing power is needed to execute it in a workflow management system. One technical way of solving this problem is to distribute the workflow management software. At the same time, distribution complicates the software architecture, which in turn makes it harder to develop. Complex software systems entail complex test programs and simulation environments. To support the development of a distributed workflow management system, this thesis designs and implements a simulation system for workflow management systems. It will be a valuable tool for the developers of a distributed workflow management system during implementation of the software. info:eu-repo/classification/ddc/004 ddc:004 MPI Simulation Arbeitssimulation MPI MPICH Simulator Testumgebung Verteilung Workflow Workflow-Management Workflow-Management-System verteilt
169	Improving the Performance of Selected MPI Collective Communication Operations on InfiniBand Networks Viertel, Carsten 30 April 2007 The performance of collective communication operations is one of the deciding factors in the overall performance of an MPI application. Open MPI's component architecture offers an easy way to implement new algorithms for collective operations, but current implementations use the point-to-point components to access the InfiniBand network. Therefore, this work attempts to improve the performance of a collective component by accessing the InfiniBand network directly. This should avoid overhead and make it possible to tune the algorithms to this specific network. The first part of this work gives a short overview of the InfiniBand Architecture and Open MPI. In the next part, several models for parallel computation are analyzed. Afterwards, various algorithms for the MPI_Scatter, MPI_Gather and MPI_Allgather operations are presented. The theoretical performance of the algorithms is analyzed with the LogfP and LogGP models. Selected algorithms are implemented as part of an Open MPI collective component. Finally, the performance of different algorithms and different MPI implementations is compared. The test results show that the performance of the operations could be improved for several message and communicator size ranges. info:eu-repo/classification/ddc/004 ddc:004 Hochleistungsrechnen MPI <Schnittstelle> Netzwerk InfiniBand Kollektive Operationen LogP Modell MPI_Allgather MPI_Gather MPI_Scatter Open MPI
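To make the notion of an allgather algorithm concrete, here is a small plain-Python simulation of the ring algorithm, one common way to realize the semantics of MPI_Allgather. It is a sketch under stated assumptions: the abstract does not say which algorithms the thesis implements, no MPI library is called, and the function name ring_allgather and the list-of-buffers model are hypothetical. Each rank starts with its own block and, in each of P - 1 steps, forwards the block it obtained most recently to its right neighbour.

```python
def ring_allgather(blocks):
    """Simulate a ring allgather; blocks[r] is the block owned by rank r.

    Returns one receive buffer per rank; after P - 1 steps every buffer
    holds all blocks in rank order.
    """
    p = len(blocks)
    buffers = [[None] * p for _ in range(p)]
    for rank in range(p):
        buffers[rank][rank] = blocks[rank]      # own contribution
    for step in range(p - 1):
        for rank in range(p):
            idx = (rank - step) % p             # block received in the previous step
            right = (rank + 1) % p
            # The slot written here is never the slot `right` sends in this
            # step, so updating sequentially matches a simultaneous exchange.
            buffers[right][idx] = buffers[rank][idx]
    return buffers


gathered = ring_allgather(["a", "b", "c", "d"])
```

The ring variant uses P - 1 communication steps with one block per step, whereas tree-like variants use fewer steps with larger messages; cost models such as LogGP are one way to compare such trade-offs.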
170	Integrating SkePU's algorithmic skeletons with GPI on a cluster / Integrering av SkePUs algoritmiska skelett med GPI på ett cluster Almqvist, Joel January 2022 As processors' clock speeds flattened out in the early 2000s, multi-core processors became more prevalent and so did parallel programming. However, this programming paradigm introduces additional complexities, and the SkePU framework was created to combat this. SkePU does so by offering a single-threaded interface which executes the user's code in parallel in accordance with a chosen computational pattern. Furthermore, it allows users themselves to decide which parallel backend should perform the execution, be it OpenMP, CUDA or OpenCL. This modular approach of SkePU thus allows different hardware to be used without changing the code, and it currently supports CPUs, GPUs and clusters. This thesis presents a new SkePU backend made for clusters, using the communication library GPI. It demonstrates that the new backend is able to scale better and handle workload imbalances better than the existing SkePU cluster backend. This is achieved despite it performing worse at low node counts, indicating that it incurs less scaling overhead. Its weaknesses are also analyzed, partially from a design point of view, and clear solutions are presented, combined with a discussion as to why they arose in the first place. SkePU cluster GPI MPI OpenMP parallel programming HPC algorithmic skeletons distributed computing SkePU kluster GPI MPI OpenMP parallellprogrammering HPC algoritmiska skelett Computer Sciences Datavetenskap (datalogi)
