401 |
Graph-Based Whole Genome Phylogenomics. Fujimoto, Masaki Stanley, 01 June 2020 (has links)
Understanding others is a deeply human urge, basic to our existential quest. It requires knowing where someone has come from and where they sit amongst their peers. Phylogenetic analysis and genome-wide association studies seek to tell us where we have come from and where we stand relative to one another through evolutionary history and genetic makeup. Current methods do not address the computational complexity introduced by new, increasingly abundant forms of genomic data, namely long-read DNA sequencing and the growing number of assembled genomes. To address this, we explore specialized data structures for storing and comparing genomic information. This work resulted in novel data structures for storing multiple genomes that can be used to identify structural variations and other types of polymorphisms. Using these methods we illuminate the genetic history of organisms in our effort to understand the world around us.
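The abstract does not spell out the data structures; as one hedged illustration of the general idea, the sketch below shows a minimal sequence (variation) graph in C++ in which nodes hold DNA segments, each genome is stored as a path of node ids, and alternative paths through a shared region surface as candidate structural variants. All names and the layout are assumptions for illustration, not the thesis's actual structures.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal sequence-graph sketch: nodes hold sequence segments, edges connect
// adjacent segments, and each genome is stored as a path (list of node ids).
// Alternative paths through the same region correspond to structural variants.
struct SequenceGraph {
    std::vector<std::string> node_seq;                 // node id -> DNA segment
    std::vector<std::vector<uint32_t>> adj;            // node id -> successor ids
    std::unordered_map<std::string, std::vector<uint32_t>> genome_path;  // genome -> path

    uint32_t add_node(const std::string& seq) {
        node_seq.push_back(seq);
        adj.emplace_back();
        return static_cast<uint32_t>(node_seq.size() - 1);
    }
    void add_edge(uint32_t from, uint32_t to) { adj[from].push_back(to); }

    // Two genomes diverge where their paths pass through different nodes;
    // shared flanks on either side mark a candidate structural variant ("bubble").
    bool differ_at(const std::string& g1, const std::string& g2, std::size_t step) const {
        const auto& p1 = genome_path.at(g1);
        const auto& p2 = genome_path.at(g2);
        return step < p1.size() && step < p2.size() && p1[step] != p2[step];
    }
};

int main() {
    SequenceGraph g;
    uint32_t a = g.add_node("ACGT");  // shared left flank
    uint32_t b = g.add_node("TTAG");  // segment present only in genomeA (insertion)
    uint32_t c = g.add_node("GGCA");  // shared right flank
    g.add_edge(a, b); g.add_edge(b, c); g.add_edge(a, c);
    g.genome_path["genomeA"] = {a, b, c};
    g.genome_path["genomeB"] = {a, c};
    return g.differ_at("genomeA", "genomeB", 1) ? 0 : 1;  // paths diverge after the flank
}
```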
|
402 |
Paralelní syntaktická analýza / Parallel Syntax Analysis. Otáhal, Jiří, January 2012 (has links)
This thesis focuses on modern methods of language description. It introduces several controlled grammars and describes the tree controlled grammar in detail. The thesis builds on a relatively new technique of syntax analysis using tree controlled grammars. The process of this analysis is described in detail, followed by a design for parallel processing of the analysis. We managed to successfully implement this design, speed up the syntax analysis, and thereby achieve the main goal of the thesis.
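As a hedged illustration of where parallelism can enter such an analysis, the sketch below assumes the standard definition of a tree controlled grammar, in which every level of the derivation tree except the last must belong to a regular control language: once an ordinary parser has produced the tree, the independent level checks can run in parallel. The control expression and names are illustrative, not the thesis's implementation.

```cpp
#include <regex>
#include <string>
#include <vector>

// After an ordinary CFG parser has built a derivation tree, a tree controlled
// grammar additionally requires every level of the tree (except the last) to
// belong to a regular control language R. The level checks are independent of
// one another, so they can be performed in parallel.
bool levels_satisfy_control(const std::vector<std::string>& levels,
                            const std::regex& control) {
    bool ok = true;
    #pragma omp parallel for reduction(&& : ok)
    for (long i = 0; i < static_cast<long>(levels.size()) - 1; ++i)
        ok = ok && std::regex_match(levels[i], control);
    return ok;
}

int main() {
    // Hypothetical control language over nonterminal levels.
    std::regex control("S|A*B*");
    std::vector<std::string> levels = {"S", "AB", "AABB", "aabb"};  // last level: terminals
    return levels_satisfy_control(levels, control) ? 0 : 1;
}
```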
|
403 |
Evoluční návrh kombinačních obvodů na počítačovém clusteru / Evolutionary Design of Combinational Circuits on Computer Cluster. Pánek, Richard, January 2015 (has links)
This master's thesis deals with evolutionary algorithms and how to use them to design combinational circuits. Genetic programming, especially CGP, is the most applicable technique for this type of task. Furthermore, the thesis deals with computation on a computer cluster and the use of evolutionary algorithms on it. Island models with CGP are the best fit for this kind of computation. A new way of performing recombination in CGP is then designed to improve these models. The design is implemented and tested on the computer cluster.
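As a hedged sketch of how an island model with CGP might be laid out on a cluster (not the thesis's actual implementation), each MPI rank below evolves its own population with a simple (1+λ)-style mutation step and periodically migrates its best genome around a ring; the fitness function is a placeholder rather than a real combinational-circuit evaluator.

```cpp
#include <mpi.h>
#include <algorithm>
#include <random>
#include <vector>

// Island-model sketch: each MPI rank ("island") evolves its own CGP-style
// population and periodically passes its best genome to the next rank in a
// ring. The genome is a flat integer vector standing in for CGP function and
// connection genes; the fitness function is a placeholder, not a circuit evaluator.
static double fitness(const std::vector<int>& genome) {
    double s = 0;
    for (int g : genome) s += g;  // placeholder objective
    return s;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int genome_len = 64, pop_size = 8, generations = 200, migrate_every = 20;
    std::mt19937 rng(1234 + rank);
    std::uniform_int_distribution<int> gene(0, 9);
    std::uniform_int_distribution<int> pos(0, genome_len - 1);

    std::vector<std::vector<int>> pop(pop_size, std::vector<int>(genome_len));
    for (auto& ind : pop)
        for (int& g : ind) g = gene(rng);

    for (int gen = 1; gen <= generations; ++gen) {
        // (1+lambda)-style step: offspring are mutated copies of the current best.
        auto best = *std::max_element(pop.begin(), pop.end(),
            [](const std::vector<int>& a, const std::vector<int>& b) {
                return fitness(a) < fitness(b);
            });
        for (auto& ind : pop) {
            ind = best;
            ind[pos(rng)] = gene(rng);  // single point mutation
        }
        // Ring migration: exchange best genomes with neighbouring islands.
        if (gen % migrate_every == 0 && size > 1) {
            int next = (rank + 1) % size, prev = (rank + size - 1) % size;
            std::vector<int> incoming(genome_len);
            MPI_Sendrecv(best.data(), genome_len, MPI_INT, next, 0,
                         incoming.data(), genome_len, MPI_INT, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            pop[0] = incoming;  // immigrant replaces one individual
        }
    }
    MPI_Finalize();
    return 0;
}
```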
|
404 |
Optimizing MPI Collective Communication by Orthogonal Structures. Kühnemann, Matthias; Rauber, Thomas; Rünger, Gudula, 28 June 2007 (has links)
Many parallel applications from scientific computing use MPI collective communication operations to collect or distribute data. Since the execution times of these communication operations increase with the number of participating processors, scalability problems can occur. In this article, we show for different MPI implementations how the execution time of collective communication operations can be significantly improved by a restructuring based on orthogonal processor structures with two or more levels. As platforms, we consider a dual Xeon cluster, a Beowulf cluster and a Cray T3E with different MPI implementations. We show that the execution time of operations like MPI_Bcast or MPI_Allgather can be reduced by 40% and 70% on the dual Xeon cluster and the Beowulf cluster, respectively. A significant improvement can also be obtained on the Cray T3E by a careful selection of the processor groups. We demonstrate that the optimized communication operations can be used to reduce the execution time of data-parallel implementations of complex application programs without any other change to the computation and communication structure. Furthermore, we investigate how the execution time of the orthogonal realization can be modeled using runtime functions. In particular, we consider the modeling of two-phase realizations of communication operations. We present runtime functions for the modeling and verify that these runtime functions can predict the execution time both for communication operations in isolation and in the context of application programs.
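A minimal sketch of the two-phase idea for a broadcast on a p = rows × cols processor grid is shown below, assuming the global root is rank 0; the paper's actual group layouts and MPI-implementation-specific optimizations may differ. In practice the row and column communicators would be created once and reused rather than rebuilt on every call.

```cpp
#include <mpi.h>

// Two-phase ("orthogonal") broadcast sketch for a p = rows x cols processor
// grid, assuming the global root is rank 0. Phase 1 broadcasts along the
// root's row communicator; phase 2 broadcasts down every column.
// Error handling is omitted for brevity.
void orthogonal_bcast(void* buf, int count, MPI_Datatype type,
                      int cols, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    int row = rank / cols, col = rank % cols;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);   // group: processes in the same row
    MPI_Comm_split(comm, col, row, &col_comm);   // group: processes in the same column

    if (row == 0)                                 // phase 1: along the root's row
        MPI_Bcast(buf, count, type, 0, row_comm);
    MPI_Bcast(buf, count, type, 0, col_comm);     // phase 2: down every column

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```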
|
405 |
Efficient Implementation of 3D Finite Difference Schemes on Recent Processor Architectures / Effektiv implementering av finita differensmetoder i 3D på senaste processorarkitekturer. Ceder, Frederick, January 2015 (has links)
In this paper a solver is introduced that solves a problem set modelled by the Burgers equation using the finite difference method: forward in time and central in space (FTCS). The solver is parallelized and optimized for the Intel Xeon Phi 7120P as well as the Intel Xeon E5-2699v3 processor to investigate differences in performance between the two architectures. Optimized data access and layout have been implemented to ensure good cache utilization. Loop tiling strategies are used to adjust data access with respect to the L2 cache size. Compiler hints describing aligned memory access are used to support vectorization on both processors. Additionally, prefetching strategies and streaming stores have been evaluated for the Intel Xeon Phi. Parallelization was done using OpenMP and MPI. The parallelization for native execution on the Xeon Phi is based on OpenMP and yielded a raw performance of nearly 100 GFLOP/s, reaching a speedup of almost 50 at 83% parallel efficiency. An OpenMP implementation on the E5-2699v3 (Haswell) processors produced up to 292 GFLOP/s, reaching a speedup of almost 31 at 85% parallel efficiency. For comparison, a mixed implementation interleaving communication with computation reached 267 GFLOP/s at a speedup of 28 with 87% parallel efficiency. Running a pure MPI implementation on PDC's Beskow supercomputer with 16 nodes yielded a total performance of 1450 GFLOP/s, and for a larger problem set it yielded a total of 2325 GFLOP/s, reaching speedups of 170 and 290 at parallel efficiencies of 33.3% and 56%, respectively. An analysis based on the roofline performance model shows that the computations were memory bound to the L2 cache bandwidth, suggesting good L2 cache utilization on both the Haswell and Xeon Phi architectures. Xeon Phi performance can probably be improved by also using MPI. Keeping technological progress for computational cores in the Haswell processor in mind for the comparison, both processors perform well. Rewriting the stencil computations in a more compiler-friendly form might improve performance further, as the compiler can then optimize more for the target platform. The experiments on the Cray system Beskow showed an increased efficiency from 33.3% to 56% for the larger problem, illustrating good weak scaling. This suggests that problem sizes should increase accordingly with larger numbers of nodes in order to achieve high efficiency.
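As a hedged sketch of the kind of kernel described above, the code below applies an FTCS-style update with OpenMP threading, loop tiling over the two inner dimensions, and a SIMD hint on the innermost loop; the stencil is a generic 7-point diffusion-like update rather than the full Burgers right-hand side, and the tile sizes are illustrative, not tuned values from the thesis.

```cpp
#include <algorithm>
#include <vector>

// FTCS-style 3D stencil sketch with OpenMP threading and loop tiling. The
// update is a generic 7-point, diffusion-like stencil rather than the full
// Burgers right-hand side, and the tile sizes are illustrative, not values
// tuned for a particular L2 cache.
constexpr int N = 128;            // interior points per dimension (plus a halo of 1)
constexpr int TJ = 16, TK = 64;   // tile sizes for the two inner dimensions

inline int idx(int i, int j, int k) { return (i * (N + 2) + j) * (N + 2) + k; }

void ftcs_step(const std::vector<double>& u, std::vector<double>& u_new, double alpha) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int jj = 1; jj <= N; jj += TJ)
        for (int kk = 1; kk <= N; kk += TK)
            for (int i = 1; i <= N; ++i)
                for (int j = jj; j < std::min(jj + TJ, N + 1); ++j) {
                    const int kmax = std::min(kk + TK, N + 1);
                    #pragma omp simd
                    for (int k = kk; k < kmax; ++k)
                        u_new[idx(i, j, k)] = u[idx(i, j, k)] + alpha * (
                            u[idx(i + 1, j, k)] + u[idx(i - 1, j, k)] +
                            u[idx(i, j + 1, k)] + u[idx(i, j - 1, k)] +
                            u[idx(i, j, k + 1)] + u[idx(i, j, k - 1)] -
                            6.0 * u[idx(i, j, k)]);
                }
}

int main() {
    std::vector<double> u((N + 2) * (N + 2) * (N + 2), 1.0), u_new(u);
    for (int step = 0; step < 10; ++step) {
        ftcs_step(u, u_new, 0.1);  // forward-in-time step on the central-in-space stencil
        u.swap(u_new);
    }
    return 0;
}
```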
|
406 |
Dynamic Task Prediction for an SpMT Architecture Based on Control Independence. Jothi, Komal, 01 January 2009 (has links)
Extracting better performance from computer programs translates to finding more instructions to execute in parallel. Since most general-purpose programs are written in an imperative, sequential manner, closely spaced instructions are always data dependent, forcing the designer to look far ahead into the program for parallelism. This necessitates wider superscalar processors with larger instruction windows. But superscalars suffer from three key limitations: their inability to scale, the sequential fetch bottleneck, and the high branch misprediction penalty. Recent studies indicate that current superscalars have reached the end of the road and that designers will have to look for new ideas to build computer processors.
Speculative Multithreading (SpMT) is one of the most recent techniques to exploit parallelism from applications. Most SpMT architectures partition a sequential program into multiple threads (or tasks) that can be executed concurrently on multiple processing units. It is desirable that these tasks are sufficiently distant from each other so as to facilitate parallelism. It is also desirable that these tasks are control independent of each other, so that execution of a future task is guaranteed in case of local control-flow misspeculations. Some task prediction mechanisms rely on the compiler, requiring recompilation of programs. Current dynamic mechanisms either rely on program constructs such as loop iterations and function and loop boundaries, resulting in unbalanced loads, or predict tasks that are too short to be of use in an SpMT architecture. This thesis is the first proposal of a predictor that dynamically predicts control-independent tasks that are consistently wide apart, and executes them on a novel SpMT architecture.
|
407 |
Integrating SkePU's algorithmic skeletons with GPI on a cluster / Integrering av SkePUs algoritmiska skelett med GPI på ett cluster. Almqvist, Joel, January 2022 (has links)
As processor clock speeds flattened out in the early 2000s, multi-core processors became more prevalent, and so did parallel programming. However, this programming paradigm introduces additional complexity, and the SkePU framework was created to combat it. SkePU does this by offering a single-threaded interface that executes the user's code in parallel according to a chosen computational pattern. Furthermore, it allows the users themselves to decide which parallel backend should perform the execution, be it OpenMP, CUDA or OpenCL. This modular approach thus allows different hardware to be used without changing the code, and SkePU currently supports CPUs, GPUs and clusters. This thesis presents a new SkePU backend for clusters, built on the communication library GPI. It demonstrates that the new backend scales better and handles workload imbalances better than the existing SkePU cluster backend, despite performing worse at low node counts, which indicates that it requires less scaling overhead. Its weaknesses are also analyzed, partly from a design point of view, and clear solutions are presented, combined with a discussion of why they arose in the first place.
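The sketch below is a conceptual illustration of the skeleton programming model only, not SkePU's actual API: the user supplies an ordinary element-wise function and a backend choice, and a map skeleton decides how to execute it, here either sequentially or with OpenMP.

```cpp
#include <cstddef>
#include <vector>

// Conceptual "map" skeleton in the spirit of SkePU's single-threaded interface:
// the user writes an ordinary element-wise function and the skeleton decides
// how it is executed, sequentially or via an OpenMP backend. This illustrates
// the programming model only; it is not SkePU's real API.
enum class Backend { Sequential, OpenMP };

template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& a, const std::vector<T>& b,
                            F f, Backend backend) {
    std::vector<T> out(a.size());
    if (backend == Backend::OpenMP) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < static_cast<long>(a.size()); ++i)
            out[i] = f(a[i], b[i]);
    } else {
        for (std::size_t i = 0; i < a.size(); ++i)
            out[i] = f(a[i], b[i]);
    }
    return out;
}

int main() {
    std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
    // The calling code looks single-threaded; only the backend selection changes.
    auto z = map_skeleton(x, y, [](float a, float b) { return a + b; }, Backend::OpenMP);
    return z[0] == 3.0f ? 0 : 1;
}
```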
|
408 |
Comparison of the Performance of NVIDIA Accelerators with SIMD and Associative Processors on Real-Time Applications. Shaker, Alfred M., 27 July 2017 (has links)
No description available.
|
409 |
HDArray: Parallel Array Interface for Distributed Heterogeneous Devices. Hyun Dok Cho (18620491), 30 May 2024 (has links)
<p dir="ltr">Heterogeneous clusters with nodes containing one or more accelerators, such as GPUs, have become common. While MPI provides inter-address space communication, and OpenCL provides a process with access to heterogeneous computational resources, programmers are forced to write hybrid programs that manage the interaction of both of these systems. This paper describes an array programming interface that provides users with automatic and manual distributions of data and work. Using work distribution and kernel def and use information, communication among processes and devices in a process is performed automatically. By providing a unified programming model to the user, program development is simplified.</p>
|
410 |
Hybrid Parallel Computing Strategies for Scientific Computing Applications. Lee, Joo Hong, 10 October 2012 (has links)
Multi-core, multi-processor, and Graphics Processing Unit (GPU) computer architectures pose significant challenges with respect to the efficient exploitation of parallelism for large-scale, scientific computing simulations. For example, a simulation of the human tonsil at the cellular level involves the computation of the motion and interaction of millions of cells over extended periods of time. Also, the simulation of Radiative Heat Transfer (RHT) effects by the Photon Monte Carlo (PMC) method is an extremely computationally demanding problem. The PMC method is an example of the Monte Carlo simulation method, an approach used extensively in a wide range of application areas. Although the basic algorithmic framework of these Monte Carlo methods is simple, they can be extremely computationally intensive. Therefore, an efficient parallel realization of these simulations depends on a careful analysis of the nature of these problems and the development of an appropriate software framework. The overarching goal of this dissertation is to develop and understand what the appropriate parallel programming model should be to exploit these disparate architectures, both in terms of efficiency and from a software engineering perspective.
In this dissertation we examine these issues through a performance study of PathSim2, a software framework for the simulation of large-scale biological systems, on two different parallel architectures: distributed memory and shared memory. First, a message-passing implementation of a multiple-germinal-center simulation in PathSim2 is developed and analyzed for distributed-memory architectures. Second, a germinal center simulation is implemented on a shared-memory architecture with two parallelization strategies, based on Pthreads and OpenMP.
Finally, we present work targeting a complete hybrid, parallel computing architecture. With this work we develop and analyze a software framework for generic Monte Carlo simulations implemented on multiple, distributed memory nodes consisting of a multi-core architecture with attached GPUs. This simulation framework is divided into two asynchronous parts: (a) a threaded, GPU-accelerated pseudo-random number generator (or producer), and (b) a multi-threaded Monte Carlo application (or consumer). The advantage of this approach is that this software framework can be directly used within any Monte Carlo application code, without requiring application-specific programming of the GPU. We examine this approach through a performance study of the simulation of RHT effects by the PMC method on a hybrid computing architecture. We present a theoretical analysis of our proposed approach, discuss methods to optimize performance based on this analysis, and compare this analysis to experimental results obtained from simulations run on two different hybrid, parallel computing architectures. / Ph. D.
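The sketch below illustrates the producer/consumer split in plain C++ threads, with a CPU random-number generator standing in for the GPU-accelerated producer and a trivial pi estimate standing in for the photon transport kernel; the queue layout, block sizes and names are assumptions for illustration, not the dissertation's framework.

```cpp
#include <atomic>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <random>
#include <thread>
#include <vector>

// Producer/consumer sketch: one producer thread stands in for the
// GPU-accelerated random-number generator and fills a queue with blocks of
// uniform deviates; consumer threads run a Monte Carlo kernel (here a trivial
// pi estimate). The application code only ever touches the queue, never the generator.
struct BlockQueue {
    std::queue<std::vector<double>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::vector<double> block) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(block)); }
        cv.notify_one();
    }
    bool pop(std::vector<double>& block) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;           // producer finished and queue drained
        block = std::move(q.front()); q.pop();
        return true;
    }
    void finish() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};

int main() {
    const int blocks = 64, block_size = 1 << 16, consumers = 4;
    BlockQueue queue;
    std::atomic<long> hits{0}, samples{0};

    std::thread producer([&] {                  // stand-in for the GPU RNG producer
        std::mt19937_64 rng(42);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        for (int b = 0; b < blocks; ++b) {
            std::vector<double> block(block_size);
            for (double& x : block) x = u(rng);
            queue.push(std::move(block));
        }
        queue.finish();
    });

    std::vector<std::thread> workers;
    for (int c = 0; c < consumers; ++c)
        workers.emplace_back([&] {              // Monte Carlo consumer threads
            std::vector<double> block;
            while (queue.pop(block)) {
                for (std::size_t i = 0; i + 1 < block.size(); i += 2)
                    if (block[i] * block[i] + block[i + 1] * block[i + 1] <= 1.0)
                        ++hits;
                samples += static_cast<long>(block.size() / 2);
            }
        });

    producer.join();
    for (auto& w : workers) w.join();
    std::cout << "pi ~ " << 4.0 * hits / samples << "\n";
    return 0;
}
```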
|