91 |
Parallelization of a thermal elastohydrodynamic lubricated contacts simulation using OpenMP
Alrheis, Ghassan, January 2020
Computers with multiple cores that share a common memory (SMP) have become the norm since Moore's law stopped holding. To exploit the performance that multiple cores offer, the software engineer must write programs that explicitly use several cores. In smaller projects this is easily overlooked, which produces programs that use only one core. In such cases there are large gains to be had from parallelizing the code. This thesis improved the performance of a computationally heavy simulation program, written to use only one core, by finding regions of the code that are suitable for parallelization. These regions were identified with Intel's VTune Amplifier and parallelized with OpenMP. The work also replaced a particular computational routine that was especially demanding, particularly for larger problems. The end result is a simulation program that gives the same results as the original program but considerably faster and with fewer computer resources. The program will be used in future research projects. / Multi-core Shared Memory Parallel (SMP) systems have been the norm ever since the performance trend predicted by Moore’s law ended. Fully exploiting the performance these systems offer usually requires a conscious effort by the software developer to introduce concurrency into the program. This is easy to neglect in small software projects and can leave a large amount of potential parallelism unused in the resulting code. This thesis attempted to improve the performance of a computationally demanding Thermal Elastohydrodynamic Lubrication (TEHL) simulation written in Fortran by finding such parallelism. The parallelization effort focused on the most demanding parts of the program, identified using Intel’s VTune Amplifier, and was implemented using OpenMP. The thesis also documents an algorithm change that led to further improvements in terms of execution time and scalability with respect to problem size. 
The end result is a faster, lighter and more efficient TEHL simulator that can further support the research in its domain.
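As a rough guide to why the effort concentrates on the most demanding parts of the program, Amdahl's law bounds the speedup that is available when only a fraction of the runtime is parallelized. The sketch below is an illustration added here and is not taken from the thesis; the function name and the 90% figure are assumptions.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the runtime is parallelized.

    parallel_fraction: share of the serial runtime that can run in parallel (0..1).
    cores: number of cores applied to that share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical profile: the hot spots account for 90% of the runtime.
    for cores in (1, 2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.90, cores), 2))
```

The bound flattens quickly as cores are added, which is why shrinking the serial remainder (for instance through an algorithm change) can matter as much as adding threads.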
|
92 |
A PAIRWISE COMPARISON OF DNA SEQUENCE ALIGNMENT USING AN OPENMP IMPLEMENTATION OF THE SWAMP PARALLEL SMITH-WATERMAN ALGORITHM
Cuevas, Tristan Lee, 22 April 2015
No description available.
|
93 |
Directive-Based Data Partitioning and Pipelining and Auto-Tuning for High-Performance GPU Computing
Cui, Xuewen, 15 December 2020
The computer science community needs simpler mechanisms to achieve the performance potential of accelerators, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and co-processors (e.g., Intel Xeon Phi), due to their increasing use in state-of-the-art supercomputers. Over the past 10 years, we have seen a significant improvement in both computing power and memory connection bandwidth for accelerators. However, we also observe that the computation power has grown significantly faster than the interconnection bandwidth between the central processing unit (CPU) and the accelerator.
Given that accelerators generally have their own discrete memory space, data needs to be copied from the CPU host memory to the accelerator (device) memory before computation starts on the accelerator. Moreover, programming models like CUDA, OpenMP, OpenACC, and OpenCL can efficiently offload compute-intensive workloads to these accelerators. However, overlapping data transfers with computation in a kernel is neither simple nor straightforward with these models. Instead, codes either copy data to and from the device without any overlap or require explicit user design and refactoring to achieve it.
Achieving performance can require extensive refactoring and hand-tuning to apply data transfer optimizations, and users must manually partition their dataset whenever its size is larger than device memory, which can be highly difficult when the device memory size is not exposed to the user. As systems become more and more complex in terms of heterogeneity, CPUs are responsible for handling many tasks related to other accelerators, computation and data movement tasks, task dependency checking, and task callbacks. Leaving all control logic to the CPU not only costs extra communication delay over the PCI-e bus but also consumes CPU resources, which may affect the performance of other CPU tasks. This thesis work aims to provide efficient directive-based data pipelining approaches for GPUs that tackle these issues and improve performance, programmability, and memory management. / Doctor of Philosophy / Over the past decade, parallel accelerators have become increasingly prominent in this emerging era of "big data, big compute, and artificial intelligence." In more recent supercomputers and datacenter clusters, we find multi-core central processing units (CPUs), many-core graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and co-processors (e.g., Intel Xeon Phi) being used to accelerate many kinds of computation tasks.
While many new programming models have been proposed to support these accelerators, scientists or developers without domain knowledge usually find existing programming models not efficient enough to port their code to accelerators. Because accelerator on-chip memory is limited, the data array is often too large to fit in it, especially in deep learning tasks. The data need to be partitioned and managed properly, which requires more hand-tuning effort. Moreover, developers who lack domain knowledge find it difficult to tune specific applications for high performance. To handle these problems, this dissertation aims to propose a general approach to provide better programmability, performance, and data management for the accelerators. Accelerator users often prefer to keep their existing verified C, C++, or Fortran code rather than grapple with unfamiliar code.
Since 2013, OpenMP has provided a straightforward way to adapt existing programs to accelerated systems. We propose multiple associated clauses to help developers easily partition and pipeline the accelerated code. Specifically, the proposed extension can efficiently overlap kernel computation with data transfer between host and device. The extension supports memory over-subscription, meaning the memory required by the tasks can be larger than the GPU memory size. The internal scheduler guarantees that the data is swapped out correctly and efficiently. Machine learning methods are also leveraged to help with auto-tuning accelerator performance.
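The benefit of partitioning data into chunks and overlapping transfer with computation can be estimated with a simple two-stage pipeline model. The sketch below is an illustration under stated assumptions, not the scheduler proposed in the dissertation: chunks are uniform, only the host-to-device copy is modeled (copy-back is ignored), and the function names and timings are hypothetical.

```python
def serial_time(chunks, transfer, compute):
    """Total time when each chunk is copied and then computed, with no overlap."""
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    return chunks * (transfer + compute)


def pipelined_time(chunks, transfer, compute):
    """Total time when the copy of chunk i+1 overlaps the kernel of chunk i.

    Two-stage pipeline with uniform chunks: the first copy and the last kernel
    cannot be hidden, and every other step costs the slower of the two stages.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    return transfer + compute + (chunks - 1) * max(transfer, compute)


if __name__ == "__main__":
    # Hypothetical per-chunk costs in milliseconds.
    print(serial_time(4, 2.0, 3.0), pipelined_time(4, 2.0, 3.0))
```

In this model the gain is largest when transfer and compute take similar time per chunk; when one stage dominates, the pipeline can hide only the cheaper one.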
|
94 |
Parallel implementation and application of particle scale heat transfer in the Discrete Element Method
Amritkar, Amit Ravindra, 25 July 2013
Dense fluid-particulate systems are widely encountered in the pharmaceutical, energy, environmental and chemical processing industries. Prediction of the heat transfer characteristics of these systems is challenging. Use of a high fidelity Discrete Element Method (DEM) for particle scale simulations coupled to Computational Fluid Dynamics (CFD) requires large simulation times and limits application to small particulate systems. The overall goal of this research is to develop and implement parallelization techniques which can be applied to large systems with O(10^5-10^6) particles to investigate particle scale heat transfer in rotary kiln and fluidized bed environments.
The strongly coupled CFD and DEM calculations are parallelized using the OpenMP paradigm which provides the flexibility needed for the multimodal parallelism encountered in fluid-particulate systems. The fluid calculation is parallelized using domain decomposition, whereas N-body decomposition is used for DEM. It is shown that OpenMP-CFD with the first touch policy, appropriate thread affinity and careful tuning scales as well as MPI up to 256 processors on a shared memory SGI Altix. To implement DEM in the OpenMP framework, ghost particle transfers between grid blocks, which consume a substantial amount of time in DEM, are eliminated by a suitable global mapping of the multi-block data structure. The global mapping, together with enforcing perfect particle load balance across OpenMP threads, results in computational times 2 to 5 times faster than an equivalent MPI implementation.
Heat transfer studies are conducted in a rotary kiln as well as in a fluidized bed equipped with a single horizontal tube heat exchanger. Two rotary kiln cases are investigated, one with mono-disperse 2 mm particles rotating at 20 RPM and another with a poly-disperse distribution ranging from 1 to 2.8 mm and rotating at 1 RPM. It is shown that heat transfer to the mono-disperse 2 mm particles is dominated by convective heat transfer from the thermal boundary layer that forms on the heated surface of the kiln. In the second case, during the first 24 seconds, the heat transfer to the particles is dominated by conduction to the larger particles that settle at the bottom of the kiln. The results compare reasonably well with experiments. In the fluidized bed, the highly energetic transitional flow and thermal field in the vicinity of the tube surface and the limits placed on the grid size by the volume-averaged nature of the governing equations result in gross underprediction of the heat transfer coefficient at the tube surface. It is shown that the inclusion of a subgrid stress model and the application of an LES wall function (WMLES) at the tube surface improve the prediction to within ±20% of the experimental measurements. / Ph. D.
|
95 |
Escalabilidade Paralela de um Algoritmo de Migração Reversa no Tempo (RTM) Pré-empilhamento / PARALLEL SCALABILITY OF A PRESTACK REVERSE TIME MIGRATION (RTM) ALGORITHM
Rosário, Desnes Augusto Nunes do, 21 December 2012
Made available in DSpace on 2014-12-17.
Previous issue date: 2012-12-21 / The seismic method is of extreme importance in geophysics. Mainly associated with oil exploration, this line of research concentrates a large share of all investment in this broad area. The acquisition, processing and interpretation of seismic data are the parts that make up a seismic study. Seismic processing in particular is focused on the imaging that represents the geological structures in the subsurface.
Seismic processing has evolved significantly in recent decades due to the demands of the oil industry, and also due to technological advances in hardware that brought higher storage and digital information processing capabilities, which enabled the development of more sophisticated processing algorithms such as those that use parallel architectures.
One of the most important steps in seismic processing is imaging. Migration of seismic data is one of the techniques used for imaging, with the goal of obtaining a seismic section image that represents the geological structures as accurately and faithfully as possible. The result of migration is a 2D or 3D image in which it is possible to identify faults and salt domes among other structures of interest, such as potential hydrocarbon reservoirs.
However, a high-quality, accurate migration can be a very time-consuming process, owing to the mathematical heuristics of the algorithm and the extensive amount of data input and output involved. It may take days, weeks and even months of uninterrupted execution on supercomputers, which represents large computational and financial costs that could make these methods impractical.
Aiming at performance improvement, this work parallelized the core of a Reverse Time Migration (RTM) algorithm using the parallel programming model Open Multi-Processing (OpenMP), because of the large computational effort required by this migration technique. Furthermore, analyses such as speedup and efficiency were performed, and ultimately the degree of algorithmic scalability was identified with respect to the technological advancement expected of future processors. / Seismics is an area of extreme importance in geophysics. Associated mainly with oil exploration, this line of research concentrates a large part of all the investment made in this broad area. The acquisition, processing, and interpretation of seismic data are the parts that make up a seismic study. Seismic processing in particular aims to obtain an image that represents the geological structures in the subsurface.
Seismic processing has evolved significantly in recent decades owing to the demands of the oil industry and to technological advances in hardware that provided greater capacity for storing and processing digital information, which in turn enabled the development of more sophisticated processing algorithms, such as those that use parallel processing architectures.
One of the important steps in seismic processing is imaging. Migration is one of the techniques used in imaging, with the goal of obtaining a seismic section that represents the geological structures as precisely and faithfully as possible. The result of migration is a 2D or 3D image in which it is possible to identify faults and salt domes, among other structures of interest such as possible hydrocarbon reservoirs.
However, a migration rich in quality and precision can be an excessively long process, owing to the mathematical heuristics of the algorithm and the extensive amount of data input and output involved. It can take days, weeks, and even months of uninterrupted execution on supercomputers, which represents a large computational and financial cost that can make the application of these methods unfeasible.
With the goal of improving performance, this work parallelized the core of a Reverse Time Migration (RTM) algorithm using the OpenMP (Open Multi-Processing) parallel programming model, because of the high computational effort demanded by this migration technique. In addition, performance analyses such as speedup and efficiency were carried out and, finally, the degree of algorithmic scalability was identified with respect to the technological advance expected for future processors.
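The speedup and efficiency analyses mentioned in this abstract follow the standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of threads. The sketch below only illustrates those definitions; the function names and the timings are hypothetical and are not results from the dissertation.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup S = T_serial / T_parallel."""
    if serial_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, threads):
    """Parallel efficiency E = S / threads, where 1.0 means ideal scaling."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return speedup(serial_seconds, parallel_seconds) / threads


if __name__ == "__main__":
    # Hypothetical wall-clock timings in seconds, keyed by thread count.
    timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 25.0}
    for threads, seconds in timings.items():
        s = speedup(timings[1], seconds)
        e = efficiency(timings[1], seconds, threads)
        print(threads, round(s, 2), round(e, 2))
```

Plotting efficiency against thread count is one common way to judge scalability: a curve that stays near 1.0 indicates that additional cores are still being used well.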
|
96 |
Méthode de décomposition de domaine avec parallélisme hybride et accélération non linéaire pour la résolution de l'équation du transport Sn en géométrie non-structurée / Domain decomposition method using a hybrid parallelism and a low-order acceleration for solving the Sn transport equation on unstructured geometry
Odry, Nans, 07 October 2016
Deterministic calculation schemes allow low-cost modeling of the behavior of the neutron population in a reactor, but are traditionally built on approximations (lattice/core decomposition, spatial and energy homogenization…). The thesis revisits some of these sources of error in order to bring the deterministic method closer to a reference scheme. The objective is to take advantage of modern computing architectures (HPC) to solve the neutronics problem at the scale of the 3D core while preserving the transport operator and part of the heterogeneities of the geometry. This work is carried out within the Sn core solver Minaret of the Apollo3® computing platform for fast neutron reactors. A spatial domain decomposition method is adopted. The idea is to decompose a large problem into smaller "independent" subproblems. Convergence to the global solution is ensured by exchanging angular fluxes between subdomains during an iterative process. By encouraging massive use of parallelism, domain decomposition methods help lift the constraints on memory and computation time. A hybrid parallelism coupling the MPI and OpenMP technologies is particularly well suited to running on supercomputers. A Coarse Mesh Rebalance acceleration method is added to compensate for the convergence penalty observed with the domain decomposition method. The potential of the new scheme is finally demonstrated on a 3D CFV core, built while preserving the heterogeneity of the absorber assemblies. / Deterministic calculation schemes are devised to numerically solve the neutron transport equation in nuclear reactors. 
Dealing with core-sized problems is very challenging for computers, so much so that dedicated core codes have no choice but to accept simplifying assumptions (assembly- then core-scale steps…). The PhD work aims to correct some of these ‘standard’ approximations in order to get closer to reference calculations: thanks to large increases in computing capacity (HPC), one can nowadays solve 3D core-sized problems using both high mesh refinement and the transport operator. Developments were performed inside the Sn core solver Minaret, from the new CEA neutronics platform Apollo3®, for fast neutron reactors of the CFV kind. This work focuses on a Domain Decomposition Method in space. The fundamental idea involves splitting a core-sized problem into smaller, 'independent' subproblems. Angular flux is exchanged between adjacent subdomains. In doing so, the combined subproblems converge to the global solution over the course of an iterative process. Domain decomposition is well suited to massive parallelism, allowing much more ambitious computations in terms of both memory requirements and calculation time. A hybrid MPI/OpenMP parallelism is chosen to match supercomputer architectures. A Coarse Mesh Rebalance acceleration technique is added to offset the convergence penalty observed with Domain Decomposition. The potential of the new calculation scheme is demonstrated on a 3D core of the CFV kind, using a heterogeneous description of the absorber rods.
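The iterative exchange of interface values between subdomains can be illustrated on a much simpler problem than Sn transport. The sketch below applies the classical alternating Schwarz iteration to the 1D Laplace equation u'' = 0 on [0, 1] with u(0) = 0 and u(1) = 1, whose exact solution is u(x) = x. It is a toy stand-in chosen for this illustration, not the Minaret scheme: the overlapping subdomains [0, b] and [a, 1], the function name, and the parameter values are all assumptions.

```python
def alternating_schwarz(a, b, iterations):
    """Alternating Schwarz for u'' = 0 on [0, 1], u(0) = 0, u(1) = 1.

    Subdomain 1 is [0, b], subdomain 2 is [a, 1], with 0 < a < b < 1.
    Each local solve is exact because the local solution is a straight line.
    Returns the interface values (u1(a), u2(b)) after the given iterations.
    """
    if not 0.0 < a < b < 1.0:
        raise ValueError("need 0 < a < b < 1 for overlapping subdomains")
    value_at_a = 0.0  # u1(a), passed from subdomain 1 to subdomain 2
    value_at_b = 0.0  # u2(b), passed from subdomain 2 to subdomain 1 (initial guess)
    for _ in range(iterations):
        # Solve on [0, b] with u1(0) = 0 and u1(b) = value_at_b.
        value_at_a = value_at_b * a / b
        # Solve on [a, 1] with u2(a) = value_at_a and u2(1) = 1.
        value_at_b = value_at_a + (1.0 - value_at_a) * (b - a) / (1.0 - a)
    return value_at_a, value_at_b


if __name__ == "__main__":
    for sweeps in (1, 5, 20):
        print(sweeps, alternating_schwarz(0.4, 0.6, sweeps))
```

In this toy problem the interface error shrinks by the factor (a/b)·((1−b)/(1−a)) per sweep, so a wider overlap converges faster. That slowdown of the iteration is the kind of convergence penalty an acceleration step is meant to offset.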
|
97 |
Calcul parallèle et méthodes numériques pour la simulation de plasmas de bords / Parallel computing and numerical methods for boundary plasma simulations
Kuhn, Matthieu, 29 September 2014
The improvement of the Emedge3D code (an electromagnetic edge code) is addressed along several axes. First, innovations in the numerical methods have been implemented. The advantage of semi-implicit methods is described: their unconditional stability allows the time step to be increased, and therefore reduces the number of time iterations required for a simulation. The advantages of raising the order in space and in time are detailed. Second, answers are proposed for the parallelization of the code. The setting of this study is close to the general nonlinear advection-diffusion problem. The costly parts were first optimized sequentially and then parallelized with OpenMP. For the part of the code most sensitive to memory bandwidth constraints, an MPI parallel solution on a distributed-memory machine is described and analyzed. Good scalability is observed up to 384 cores. This thesis is part of the interdisciplinary ANR project E2T2 (CEA/IRFM, Université Aix-Marseille/PIIM, Université Strasbourg/Icube). / The main goal of this work is to significantly reduce the computational cost of the scientific application Emedge3D, which simulates the edge of tokamaks. Improvements to this code are made along two axes. First, innovations in the numerical methods have been implemented. The advantages of semi-implicit time schemes are described: their unconditional stability allows larger timestep values, and hence fewer time iterations per simulation. The benefits of high order (in time and space) are also presented. Second, solutions for the parallelization of the code are proposed. This study addresses the more general nonlinear advection-diffusion problem. The hot spots of the application have been sequentially optimized and then parallelized with OpenMP. 
Then, a hybrid MPI/OpenMP parallel algorithm for the memory-bound part of the code is described and analyzed. Good scaling is observed up to 384 cores. This Ph.D. thesis is part of the interdisciplinary ANR project E2T2 (CEA/IRFM, University of Aix-Marseille/PIIM, University of Strasbourg/ICube).
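The link between unconditional stability and larger time steps can be seen on the linear test equation dy/dt = -k·y. The sketch below compares explicit Euler with fully implicit Euler on that equation. It is a deliberately simplified stand-in for the semi-implicit schemes of the thesis, not the Emedge3D discretization; the function names and parameter values are assumptions.

```python
def explicit_euler(decay_rate, dt, steps, y0=1.0):
    """Explicit Euler for dy/dt = -decay_rate * y; stable only if dt <= 2 / decay_rate."""
    y = y0
    for _ in range(steps):
        y = y + dt * (-decay_rate * y)
    return y


def implicit_euler(decay_rate, dt, steps, y0=1.0):
    """Implicit Euler for the same equation; the update damps y for any dt > 0."""
    y = y0
    for _ in range(steps):
        # Solve y_new = y + dt * (-decay_rate * y_new) for y_new.
        y = y / (1.0 + decay_rate * dt)
    return y


if __name__ == "__main__":
    # Stiff decay with a time step far above the explicit stability limit of 0.02.
    print(explicit_euler(100.0, 0.1, 10))  # magnitude grows at every step
    print(implicit_euler(100.0, 0.1, 10))  # decays toward zero
```

The explicit update multiplies the solution by (1 − k·dt) at each step, which exceeds 1 in magnitude once dt > 2/k. The implicit update multiplies by 1/(1 + k·dt), which is below 1 for every positive dt, so the time step can be chosen for accuracy rather than stability.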
|
98 |
Programmation des architectures hiérarchiques et hétérogènes / Programming hierarchical and heterogeneous machines
Hamidouche, Khaled, 10 November 2011
Today's high-performance computing architectures are hierarchical and heterogeneous: hierarchical because they are built on a memory hierarchy, with memory distributed between nodes and memory shared between the cores of a node; heterogeneous because of the use of specialized processors called accelerators, such as IBM's CellBE processor and NVIDIA GPUs. The difficulty of mastering these architectures is twofold. On the one hand, the problem of programmability: programming must remain simple, as close as possible to classical sequential programming, and independent of the target architecture. On the other hand, the problem of efficiency: performance must be close to what an expert would obtain by writing the code by hand with low-level tools. In this thesis we proposed a development platform to address these problems. To that end, we propose two tools: BSP++, a generic library using C++ templates, and BSPGen, a framework for the automatic generation of hybrid code at several levels of the hierarchy (MPI+OpenMP or MPI + Cell BE). Based on a hierarchical model, the BSP++ library takes hybrid architectures as native targets. Using a small set of primitives and intuitive concepts, BSP++ offers ease of use and a high level of abstraction of the target machine. Using the BSP++ cost model, BSPGen estimates and generates the appropriate hierarchical hybrid code for a given application on a target architecture. BSPGen generates hybrid code from a list of sequential functions and a description of the parallel algorithm. Our tools have been validated on applications from different domains, ranging from verification and scientific computing to image processing and bioinformatics, 
using a wide selection of target architectures, from simple shared-memory machines to Petascale machines, including heterogeneous architectures equipped with Cell BE accelerators. / Today’s high-performance computing architectures are hierarchical and heterogeneous. They are hierarchical because they have a hierarchy of memory, composed of distributed memory between nodes and shared memory between cores of the same node. They are heterogeneous due to the use of specific processors called accelerators, such as the IBM CellBE processor and/or NVIDIA GPUs. The programming complexity of these architectures is twofold. On the one hand, the problem of programmability: the programming should be simple, as close as possible to conventional sequential programming, and independent of the target architecture. On the other hand, the problem of efficiency: performance should be similar to that obtained by an expert writing code by hand using low-level tools. In this thesis, we proposed a development platform to address these problems. For this, we propose two tools: BSP++ is a generic library using C++ templates, and BSPGen is a framework for the automatic generation of hybrid code at several levels of the hierarchy (MPI + OpenMP or MPI + Cell BE). Based on a hierarchical model, the BSP++ library takes hybrid architectures as native targets. Using a small set of primitives and intuitive concepts, BSP++ is simple to use and provides a high level of abstraction of the target machine. Using the cost model of BSP++, BSPGen predicts and generates the appropriate hierarchical hybrid code for a given application on a target architecture. BSPGen generates hybrid code from a list of sequential functions and a description of the parallel algorithm. Our tools have been validated with various applications in different fields ranging from verification to scientific computing and image processing through bioinformatics, 
using a wide selection of target architectures ranging from simple shared memory machines to Petascale machines through heterogeneous architectures equipped with Cell BE accelerators.
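For readers unfamiliar with BSP cost models, the classical Bulk Synchronous Parallel model charges each superstep for its slowest local computation, its largest communication volume, and one barrier. The sketch below implements that textbook formula only. The actual cost model of BSP++ and BSPGen may differ, and the function names and machine parameters here are hypothetical.

```python
def superstep_cost(work_per_process, words_per_process, g, latency):
    """Classical BSP cost of one superstep: max work + g * max h + latency.

    work_per_process: local computation cost of each process.
    words_per_process: words sent or received by each process (whichever is larger).
    g: communication cost per word; latency: barrier synchronization cost.
    """
    w = max(work_per_process)
    h = max(words_per_process)
    return w + g * h + latency


def program_cost(supersteps, g, latency):
    """Sum the costs of a sequence of (work_per_process, words_per_process) supersteps."""
    return sum(superstep_cost(w, h, g, latency) for w, h in supersteps)


if __name__ == "__main__":
    # Hypothetical three-process program with two supersteps.
    steps = [([10, 12, 8], [3, 5, 4]), ([6, 6, 6], [0, 0, 0])]
    print(program_cost(steps, g=2, latency=7))
```

A code generator can evaluate such a formula with different values of g and latency for each level of the machine (for example one set for MPI between nodes and another for OpenMP within a node) to compare candidate mappings before emitting code.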
|
99 |
High performance computing for the discontinuous Galerkin methods
Mukhamedov, Farukh, January 2018
Discontinuous Galerkin methods form a class of numerical methods for solving partial differential equations that combine features of finite element and finite volume methods. The methods are defined using a weak form of a particular model problem, allowing for discontinuities in the discrete trial and test spaces. Using a discontinuous discrete space provides mesh flexibility and a compact discretisation pattern, allowing multidomain and multiphysics simulation. Discontinuous Galerkin methods with a higher approximation polynomial order, the so-called p-version, perform better in terms of convergence rate than the low-order h-version with smaller element sizes and a bigger mesh. However, the condition number of the Galerkin system grows as a result. This causes a surge in the required storage, the computational complexity, and the computation time. We use the following three approaches to keep the advantages and eliminate the disadvantages. The first approach is a specific choice of basis functions, which we call C1 polynomials. These ensure that the majority of integrals over the edges of the mesh elements vanish. This reduces the total number of non-zero elements in the resulting system and decreases the computational complexity without loss of precision. This approach does not affect the number of iterations required by the chosen Conjugate Gradients method when compared with other choices of basis functions; it decreases the total number of algebraic operations performed. The second approach is the introduction of suitable preconditioners. In our case, the Additive two-layer Schwarz method, developed in [4], for the iterative Conjugate Gradients method is considered. This directly affects the spectral condition number of the system matrix and decreases the number of iterations required for the computation. 
This approach, however, increases the total number of algebraic operations and might require more computation time. To tackle the rise in the number of algebraic operations, we introduced a modified Additive two-layer non-overlapping Schwarz method with a Multigrid process, which uses a fixed low-order approximation polynomial degree on a coarse grid. We show that this approach is spectrally equivalent to the first preconditioner and requires less computation time. The third approach is the development of an efficient mathematical framework for a distributed data structure. This allows a high-performance, massively parallel implementation of the discontinuous Galerkin method. We demonstrate that it is possible to exploit properties of the system matrix and of C1 polynomials as basis functions to optimize the parallel structures. This parallel data structure allows us to parallelize, at the solver level, both the matrix-vector multiplication routines for the Conjugate Gradients method and the preconditioner routines. This minimizes data transfer across the distributed system. Finally, we combined all three approaches in a framework that allowed us to implement all of the above successfully.
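The role of a preconditioner inside Conjugate Gradients can be shown with a small dense example. The sketch below uses a simple Jacobi (diagonal) preconditioner in place of the two-layer additive Schwarz preconditioner of the thesis, which is far more elaborate. The helper names and the 2×2 test system are assumptions made for illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Preconditioned Conjugate Gradients for symmetric positive definite A.

    The preconditioner is M = diag(A), so applying M^-1 is a pointwise scaling.
    Returns (solution, iterations performed).
    """
    inv_diag = [1.0 / A[i][i] for i in range(len(b))]
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the zero initial guess
    z = [d * ri for d, ri in zip(inv_diag, r)]  # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    solution, iterations = jacobi_pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
    print(solution, iterations)
```

Each iteration needs one matrix-vector product and one preconditioner application, which are the two routines the abstract describes parallelizing at the solver level.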
|
100 |
Scalable Community Detection using Distributed Louvain Algorithm
Sattar, Naw Safrin, 23 May 2019
Community detection (or clustering) in large-scale graphs is an important problem in graph mining. Communities reveal interesting characteristics of a network. Louvain is an efficient sequential algorithm but fails to scale to emerging large-scale data. Developing distributed-memory parallel algorithms is challenging because of inter-process communication and load-balancing issues. In this work, we design a shared memory-based algorithm using OpenMP, which shows a 4-fold speedup but is limited to the available physical cores. Our second algorithm is an MPI-based parallel algorithm that scales to a moderate number of processors. We also implement a hybrid algorithm combining both. Finally, we incorporate dynamic load-balancing in our final algorithm, DPLAL (Distributed Parallel Louvain Algorithm with Load-balancing). DPLAL overcomes the performance bottleneck of the previous algorithms and shows around a 12-fold speedup while scaling to a larger number of processors. Overall, we present the challenges, our solutions, and the empirical performance of our algorithms for several large real-world networks.
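The Louvain algorithm works by greedily increasing modularity, a score that compares the number of edges inside each community with the number expected from node degrees alone. The sketch below only evaluates that score for a given partition of an undirected, unweighted graph. It is not the DPLAL algorithm, and the function name and the example graph are assumptions.

```python
def modularity(edges, community_of):
    """Modularity Q = sum over communities of (L_c / m - (d_c / (2 m))^2).

    edges: list of (u, v) pairs of an undirected, unweighted graph.
    community_of: dict mapping each node to its community label.
    L_c is the number of edges inside community c, d_c the sum of its node
    degrees, and m the total number of edges.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("the graph needs at least one edge")
    internal_edges = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - (d / (2.0 * m)) ** 2
        for c, d in degree_sum.items()
    )


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge.
    graph = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    print(modularity(graph, split))
```

Because the score is a sum over communities, each process in a parallel implementation can accumulate the terms for the communities it owns. The remaining difficulty is that moving a node changes the degree sums of two communities, which may live on different processes.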
|