Global ETD Search

261	Implementation of a Hardware-Optimized MPI Library for the SCMP Multiprocessor Poole, Jeffrey Hyatt 16 August 2004 As time progresses, computer architects continue to create faster and more complex microprocessors using techniques such as out-of-order execution, branch prediction, dynamic scheduling, and predication. While these techniques enable greater performance, they also increase the complexity and silicon area of the design. This lengthens development and testing times. The shrinking feature sizes associated with newer technology increase wire resistance and signal propagation delays, further complicating large designs. One potential solution is the Single-Chip Message-Passing (SCMP) Parallel Computer, developed at Virginia Tech. SCMP makes use of an architecture where a number of simple processors are tiled across a single chip and connected by a fast interconnection network. The system is designed to take advantage of thread-level parallelism and to keep wire traces short in preparation for even smaller integrated circuit feature sizes. This thesis presents the implementation of the MPI (Message-Passing Interface) communications library on top of SCMP's hardware communication support. Emphasis is placed on the specific needs of this system with regard to MPI. For example, MPI is designed to operate between heterogeneous systems; however, in the SCMP environment such support is unnecessary and wastes resources. The SCMP network is also designed such that messages can be sent with very low latency, but with cooperative multitasking it is difficult to assure a timely response to messages. Finally, the low-level network primitives have no support for send operations that occur before the receiver is prepared, and that functionality is necessary for MPI support. / Master of Science Message-Passing Systems Single-Chip Systems Parallel Architecture Chip Multiprocessors Message Passing Interface MPI
262	Dynamic fractional flow reserve measurement: potential implications for dynamic first-pass myocardial perfusion imaging Barmby, D., Davies, A., Gislason-Lee, Amber J., Sivananthan, M. January 2015 No Magnetic Resonance (MR) Myocardial Perfusion Imaging (MPI) Fractional Flow Reserve (FFR) Dynamic fractional flow reserve (dFFR)
263	Using Task Parallelism for Distributed Parallel Skeleton Programming : Implementing a StarPU Back-End to SkePU 2 / Distribuerade parallellprogrammeringsskelett genom uppgiftsparallellism : Implementation av en StarPU-baserad SkePU 2 backend Henrik, Henriksson January 2024 We extended the parallel skeleton programming framework SkePU 2 with a new back-end utilizing StarPU, a task programming framework for hybrid and distributed architectures. The aim was to allow SkePU to run on distributed clusters, using MPI through StarPU. The implemented back-end distributes data and work across participating ranks. While we did not implement the full SkePU API, the Map and Reduce1D skeletons were successfully implemented. During the implementation, we discovered some differences in API design between SkePU and StarPU. We combine the type-safe templates used in the SkePU API with the C-style void*-heavy API of StarPU. This requires the implementation to use more complex templates than normally desired. While we could preserve most of the SkePU 2 API when moving to a distributed-memory setting, some parts had to change. In particular, we needed to change the semantics of SkePU 2 containers with regard to iterators and random access. We benchmarked the performance of the implemented back-end against an MPI+OpenMP reference implementation on two problems, n-body and a simple reduction. While the n-body problem demonstrates promising scaling properties, reductions do not scale well to larger numbers of ranks. A performance comparison against the MPI+OpenMP reference implementation reveals that, aside from the higher communication overhead, there may also be some overhead in the work performed between communications, with performance potentially below 60-70% of the reference. In most cases, the new back-end to SkePU exhibits significantly lower performance than the reference.
Extending the implemented solution to cover the full API and improving performance could provide a high-level interface to distributed programming for application programmers. Indeed, subsequent developments of SkePU 3 extend and improve our StarPU back-end. HPC StarPU SkePU parallel programming skeleton programming distributed systems MPI Computer Engineering Datorteknik
264	Optimizing Applications and Message-Passing Libraries for the QPACE Architecture Wunderlich, Simon 18 July 2012 The goal of the QPACE project is to build a novel cost-efficient massively parallel supercomputer optimized for LQCD (Lattice Quantum Chromodynamics) applications. Unlike previous projects which use custom ASICs, this is accomplished by using the general-purpose multi-core PowerXCell 8i processor tightly coupled with a custom network processor implemented on a modern FPGA. The heterogeneous architecture of the PowerXCell 8i processor and its core-independent OS-bypassing access to the custom network hardware and application-oriented 3D torus topology pose interesting challenges for the implementation of the applications. This work will describe and evaluate the implementation possibilities of message passing APIs: the more general MPI, and the more QCD-oriented QMP, and their performance in PPE-centric or SPE-centric scenarios. These results will then be employed to optimize HPL for the QPACE architecture. Finally, the developed approaches and concepts will be briefly discussed regarding their applicability to heterogeneous node/network architectures as is the case in the "High-speed Network Interface with Collective Operation Support for Cell BE (NICOLL)" project. PowerXCell PowerXCell 8i QPACE Cell PPE SPE MPI QCD QMP NICOLL Torus parallel supercomputer ddc:000 Quantenchromodynamik HPL Field programmable gate array
265	ESPGOAL Schneider, Timo, Eckelmann, Sven 18 May 2011 Optimized implementations of blocking and nonblocking collective operations are most important for scalable high-performance applications. Offloading such collective operations into the communication layer can improve performance and asynchronous progression of the operations. However, it is most important that such offloading schemes remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work, we describe an operating system kernel-based architecture for implementing an interpreter for the flexible Group Operation Assembly Language (GOAL) framework to offload collective communications. We describe an optimized scheme to store the schedules that define the collective operations and show an extension to profile the performance of the kernel layer. Our microbenchmarks demonstrate the effectiveness of the approach and we show performance improvements over traditional progression in user-space. We also discuss complications with the design and offloading strategies in general. LINUX Ethernet Kernel Gruppenkommunikation Ethernet Raw Linux Kernel MPI EDP ESP Open MPI GOAL Open-MX Dependency Offload ddc:000 LINUX Ethernet Kernel <Informatik>
266	Implementação de um algoritmo de mecânica dos fluidos computacional projetado para plataformas de processamento paralelo com memória distribuída Angeli, João Paulo de 30 June 2005 Discute a implementação do algoritmo numérico para simulação de escoamento de fluidos incompressíveis, baseado no método de diferenças finitas, projetado para plataformas de processamento paralelo com memória distribuída, particularmente para clusters de estações de trabalho. O algoritmo de solução para as equações de Navier-Stokes utiliza um esquema explícito para pressão e um esquema implícito para as velocidades. A implementação paralela é baseada na decomposição do domínio, onde o domínio computacional do problema é decomposto em vários blocos, sendo um ou mais destinados a nós de processamento distintos. Todos os nós então processam em paralelo as tarefas de computação sobre os blocos a eles designados. O processamento paralelo inclui inicialização, cálculo de coeficientes, solução linear nos subdomínios, e comunicação entre os nós. A troca de informação entre os processos referentes a cada subdomínio é realizada utilizando a biblioteca message passing interface (MPI), o que assegura portabilidade entre diferentes plataformas computacionais, abrangendo desde máquinas maciçamente paralelas (MPP) até clusters de estações de trabalho. Para melhorar os níveis de desempenho obtidos pelo algoritmo, foram investigadas técnicas para a redução do volume de comunicação entre processadores e utilização mais eficiente da memória cache dos microprocessadores.
Para avaliar o desempenho do algoritmo desenvolvido e analisar as diferentes estratégias de paralelização foram executadas simulações com cluster de 2 a 56 processadores, nas quais foram avaliados o tempo de execução, speedup e eficiência paralela. Os resultados experimentais mostram que as otimizações relacionadas aos fatores de comunicação melhoram o speedup em até 165%, e a técnica de utilização mais eficiente da memória cache pode melhorar o speedup em mais 40% acima da otimização da comunicação. / This work discusses the implementation of a numerical algorithm for simulating incompressible fluid flows, based on the finite difference method and designed for parallel computing platforms with distributed memory, particularly clusters of workstations. The solution algorithm for the Navier-Stokes equations uses an explicit scheme for pressure and an implicit scheme for velocities. The parallel implementation is based on domain decomposition, where the original calculation domain is decomposed into several blocks, each of which is assigned to a separate processing node. All nodes then execute computations in parallel, each node on its associated sub-domain. The parallel computations include initialization, coefficient generation, linear solution on the sub-domain, and inter-node communication. The exchange of information across the sub-domains, or processors, is achieved using the message passing interface standard, MPI. The use of MPI ensures portability across different computing platforms ranging from massively parallel machines to clusters of workstations. Three different optimization strategies were evaluated in order to improve the computational performance of the algorithm, including techniques that reduce the communication volume between processors and make more efficient use of the microprocessor's cache memory.
In order to evaluate the performance levels obtained, and to analyze the effectiveness of the optimization strategies adopted, simulations were executed on a 64-node cluster. The simulations were performed using 2 to 56 processors, and execution time and speed-up were measured. The results indicate that the optimizations related to communication factors can improve the speed-up by up to 165%, while the cache memory optimization technique can improve the speed-up by a further 40%. processamento paralelo diferenças finitas Navier-Stokes MPI memória cache parallel processing finite difference method Navier-Stokes MPI cache memory
267	Méthode de décomposition de domaine avec parallélisme hybride et accélération non linéaire pour la résolution de l'équation du transport Sn en géométrie non-structurée / Domain decomposition method using a hybrid parallelism and a low-order acceleration for solving the Sn transport equation on unstructured geometry Odry, Nans 07 October 2016 Les schémas de calcul déterministes permettent une modélisation à moindre coût du comportement de la population de neutrons en réacteur, mais sont traditionnellement construits sur des approximations (décomposition réseau/cœur, homogénéisation spatiale et énergétique…). La thèse revient sur une partie de ces sources d’erreur, de façon à rapprocher la méthode déterministe d’un schéma de référence. L’objectif est de profiter des architectures informatiques modernes (HPC) pour résoudre le problème neutronique à l’échelle du cœur 3D, tout en préservant l’opérateur de transport et une partie des hétérogénéités de la géométrie. Ce travail est réalisé au sein du solveur cœur Sn Minaret de la plateforme de calcul Apollo3® pour des réacteurs à neutrons rapides. Une méthode de décomposition de domaine en espace, est retenue. L'idée consiste à décomposer un problème de grande dimension en sous-problèmes "indépendants" de taille réduite. La convergence vers la solution globale est assurée par échange de flux angulaires entre sous-domaines au cours d'un processus itératif. En favorisant un recours massif au parallélisme, les méthodes de décomposition de domaine contribuent à lever les contraintes en mémoire et temps de calcul. La mise en place d'un parallélisme hybride, couplant les technologies MPI et OpenMP, est en particulier propice au passage sur supercalculateur. Une méthode d'accélération de type Coarse Mesh Rebalance est ajoutée pour pallier à la pénalité de convergence constatée sur la méthode de décomposition de domaine.
Le potentiel du nouveau schéma est finalement mis en évidence sur un coeur CFV 3D, construit en préservant l'hétérogénéité des assemblages absorbants. / Deterministic calculation schemes are devised to numerically solve the neutron transport equation in nuclear reactors. Dealing with core-sized problems is very challenging for computers, so much so that the dedicated core codes have no choice but to rely on simplifying assumptions (assembly- then core-scale steps…). The PhD work aims to correct some of these ‘standard’ approximations, in order to get closer to reference calculations: thanks to large increases in computing capacity (HPC), one can now solve 3D core-sized problems using both high mesh refinement and the transport operator. Developments were performed inside the Sn core solver Minaret, from the new CEA neutronics platform Apollo3®, for fast neutron reactors of the CFV kind. This work focuses on a Domain Decomposition Method in space. The fundamental idea involves splitting a core-sized problem into smaller and 'independent' subproblems. Angular flux is exchanged between adjacent subdomains. In doing so, all combined subproblems converge to the global solution at the outcome of an iterative process. Domain decomposition is well-suited to massive parallelism, allowing much more ambitious computations in terms of both memory requirements and calculation time. A hybrid MPI/OpenMP parallelism is chosen to match supercomputer architectures. A Coarse Mesh Rebalance acceleration technique is added to offset the convergence penalty observed when using Domain Decomposition. The potential of the new calculation scheme is demonstrated on a 3D core of the CFV kind, using a heterogeneous description of the absorbent rods.
Equation du transport des neutrons Schémas déterministes Apollo3 Méthode de Décomposition de Domaine Parallélisme hybride MPI/OpenMP Méthode d'accélération Coarse Mesh Rebalance Neutron transport equation Deterministic schemes Apollo3 Domain Decomposition Method Hybrid parallelism MPI/OpenMP Acceleration technique Coarse Mesh Rebalance 530
268	Schémas numériques adaptés aux accélérateurs multicoeurs pour les écoulements bifluides / Numerical simulations of two-fluid flow on multicores accelerator Jung, Jonathan 28 October 2013 Cette thèse traite de la modélisation et de l'approximation numérique des écoulements liquide-gaz compressibles. La difficulté centrale est la modélisation et l'approximation de l'interface liquide-gaz. Le modèle bifluide est constitué d'un système de lois de conservation fermé par une loi d'état du mélange. La loi d'état conditionne les bonnes propriétés (hyperbolicité, existence d'une entropie de Lax) du système. Les schémas classiques de type Godunov conduisent à des imprécisions les rendant inutilisables en pratique. L'existence de solutions discontinues rend difficile la construction de schémas d'ordre élevé et nécessite des maillages très fins pour une précision acceptable. Il est indispensable de proposer des algorithmes performants pour les calculateurs parallèles les plus récents. Nous aborderons chacune de ces problématiques: construction d'une "bonne" loi de pression, construction de schémas numériques adaptés, programmation sur calculateur massivement multicoeur. / This thesis deals with the modeling and numerical approximation of compressible gas-liquid flows. The main difficulty lies in the modeling and approximation of the liquid-gas interface. The two-fluid model is a system of conservation laws closed with a mixture pressure law. The law has to be chosen carefully, since it determines good properties of the system such as hyperbolicity or the existence of a Lax entropy. Classic conservative Godunov-type schemes lead to inaccuracies that make them unusable in practice. The existence of discontinuous solutions makes it difficult to build high-order schemes and requires very fine meshes to achieve acceptable accuracy. It is therefore essential to provide efficient algorithms for high-performance computing.
In this thesis, we partially address each of these issues: construction of a "good" pressure law, construction of adapted numerical schemes, and programming on a GPU or GPU cluster. Écoulements bifluides Schéma Lagrange-projection Schéma ALE-projection Projection aléatoire Ensemble d'hyperbolicité non convexe Entropie de mélange OpenCL GPU MPI Two-fluid flow Lagrange-projection scheme ALE-projection scheme Random sampling Non convex hyperbolic set Mixture entropy OpenCL GPU MPI 532.5 530.15 620
269	Calcul parallèle et méthodes numériques pour la simulation de plasmas de bords / Parallel computing and numerical methods for boundary plasma simulations Kuhn, Matthieu 29 September 2014 L'amélioration du code Emedge3D (code de bord électromagnétique) est abordée sous plusieurs axes. Premier axe, des innovations sur les méthodes numériques ont été mises en oeuvre. L'avantage des méthodes de type semi-implicite est décrit, leur stabilité inconditionnelle permet l'augmentation du pas de temps, et donc la diminution du nombre d'itérations temporelles requises pour une simulation. Les avantages de la montée en ordre en espace et en temps sont détaillés. Deuxième axe, des réponses sont proposées pour la parallélisation du code. Le cadre de cette étude est proche du problème général d'advection-diffusion non linéaire. Les parties coûteuses ont tout d'abord été optimisées séquentiellement puis fait l'objet d'une parallélisation OpenMP. Pour la partie du code la plus sensible aux contraintes de bande passante mémoire, une solution parallèle MPI sur machine à mémoire distribuée est décrite et analysée. Une bonne extensibilité est observée jusque 384 cœurs. Cette thèse s'inscrit dans le projet interdisciplinaire ANR E2T2 (CEA/IRFM, Université Aix-Marseille/PIIM, Université Strasbourg/Icube). / The main goal of this work is to significantly reduce the computational cost of the scientific application Emedge3D, which simulates the edge of tokamaks. Improvements to this code are made along two axes. First, innovations in numerical methods have been implemented. The advantages of semi-implicit time schemes are described. Their unconditional stability allows larger timestep values, and hence lowers the number of temporal iterations required for a simulation. The benefits of high order (in time and space) are also presented. Second, solutions for the parallelization of the code are proposed.
This study addresses the more general nonlinear advection-diffusion problem. The hot spots of the application have been sequentially optimized and parallelized with OpenMP. Then, a hybrid MPI/OpenMP parallel algorithm for the memory-bound part of the code is described and analyzed. Good scaling is observed up to 384 cores. This Ph.D. thesis is part of the interdisciplinary project ANR E2T2 (CEA/IRFM, University of Aix-Marseille/PIIM, University of Strasbourg/ICube). Fusion nucléaire Simulation numérique Méthodes semi-implicites Discrétisation semi-spectrale Diffusion anisotrope Bande passante mémoire limitée Parallélisation hybride MPI/OpenMP Calcul haute performance Boundary plasma simulations Numerical methods Semi-implicit methods Hybrid MPI OpenMP parallel algorithm 004.6 539.75 518
270	Programmation des architectures hiérarchiques et hétérogènes / Programming hierarxchical and heterogenous machines Hamidouche, Khaled 10 November 2011 Les architectures de calcul haute performance de nos jours sont des architectures hiérarchiques et hétérogènes: hiérarchiques car elles sont composées d’une hiérarchie de mémoire, une mémoire distribuée entre les noeuds et une mémoire partagée entre les coeurs d’un même noeud. Hétérogènes due à l’utilisation des processeurs spécifiques appelés Accélérateurs tel que le processeur CellBE d’IBM et les GPUs de NVIDIA. La complexité de maîtrise de ces architectures est double. D’une part, le problème de programmabilité: la programmation doit rester simple, la plus proche possible de la programmation séquentielle classique et indépendante de l’architecture cible. D’autre part, le problème d’efficacité: les performances doivent êtres proches de celles qu’obtiendrait un expert en écrivant le code à la main en utilisant des outils de bas niveau. Dans cette thèse, nous avons proposé une plateforme de développement pour répondre à ces problèmes. Pour cela, nous proposons deux outils : BSP++ est une bibliothèque générique utilisant des templates C++ et BSPGen est un framework permettant la génération automatique de code hybride à plusieurs niveaux de la hiérarchie (MPI+OpenMP ou MPI + Cell BE). Basée sur un modèle hiérarchique, la bibliothèque BSP++ prend les architectures hybrides comme cibles natives. Utilisant un ensemble réduit de primitives et de concepts intuitifs, BSP++ offre une simplicité d'utilisation et un haut niveau d'abstraction de la machine cible. Utilisant le modèle de coût de BSP++, BSPGen estime et génère le code hybride hiérarchique adéquat pour une application donnée sur une architecture cible. BSPGen génère un code hybride à partir d'une liste de fonctions séquentielles et d'une description de l'algorithme parallèle.
Nos outils ont été validés sur différentes applications de différents domaines allant de la vérification et du calcul scientifique au traitement d'images en passant par la bioinformatique. En utilisant une large sélection d’architecture cible allant de simple machines à mémoire partagée au machines Petascale en passant par les architectures hétérogènes équipées d’accélérateurs de type Cell BE. / Today’s high-performance computing architectures are hierarchical and heterogeneous. They are hierarchical because they comprise a memory hierarchy: distributed memory between nodes and shared memory between the cores of a node. They are heterogeneous because they use specialized processors called accelerators, such as the IBM CellBE processor and NVIDIA GPUs. The programming complexity of these architectures is twofold. On the one hand, there is the problem of programmability: programming should remain simple, as close as possible to conventional sequential programming, and independent of the target architecture. On the other hand, there is the problem of efficiency: performance should be close to what an expert would obtain by writing the code by hand with low-level tools. In this thesis, we propose a development platform to address these problems. It consists of two tools: BSP++, a generic library using C++ templates, and BSPGen, a framework for automatically generating hybrid code at several levels of the hierarchy (MPI + OpenMP or MPI + Cell BE). Based on a hierarchical model, the BSP++ library takes hybrid architectures as native targets. Using a small set of primitives and intuitive concepts, BSP++ is simple to use and provides a high level of abstraction of the target machine. Using the cost model of BSP++, BSPGen predicts and generates the appropriate hierarchical hybrid code for a given application on a target architecture.
BSPGen generates hybrid code from a list of sequential functions and a description of the parallel algorithm. Our tools have been validated on various applications in different fields, ranging from verification and scientific computing to image processing and bioinformatics. We used a wide selection of target architectures, ranging from simple shared-memory machines to petascale machines, including heterogeneous architectures equipped with Cell BE accelerators. BSP Génération automatique Programmation parallèle MPI OpenMP Cell BE BSP Automatic code generation Parallel computing MPI OpenMP Cell BE

Search results