Global ETD Search

1	Efficient Parallel Text Compression on GPUs Zhang, Xiaoxi 2011 December 1900 (has links) This paper demonstrates an efficient text compressor with parallel Lempel-Ziv-Markov chain algorithm (LZMA) on graphics processing units (GPUs). We divide LZMA into two parts, match finder and range encoder. We parallel both parts and achieve competitive performance with freeArc on AMD 6-core 2.81 GHz CPU. We measure match finder time, range encoder compression time and demonstrate realtime performance on a large dataset: 10 GB web pages crawled by IRLbot. Our parallel range encoder is 15 times faster than sequential algorithm (FastAC) with static model. GPUs parallel computing compression
2	Compile-time and Run-time Optimizations for Enhancing Locality and Parallelism on Multi-core and Many-core Systems Baskaran, Muthu Manikandan 05 November 2009 (has links) No description available. Computer Science Compilers. Multi-cores GPUs
3	An enhanced GPU architecture for not-so-regular parallelism with special implications for database search Narasiman, Veynu Tupil 27 June 2014 (has links) Graphics Processing Units (GPUs) have become a popular platform for executing general purpose (i.e., non-graphics) applications. To run efficiently on a GPU, applications must be parallelized into many threads, each of which performs the same task but operates on different data (i.e., data parallelism). Previous work has shown that some applications experience significant speedup when executed on a GPU instead of a CPU. The applications that benefit most tend to have certain characteristics such as high computational intensity, regular control-flow and memory access patterns, and little to no communication among threads. However, not all parallel applications have these characteristics. Applications with a more balanced compute to memory ratio, divergent control flow, irregular memory accesses, and/or frequent communication (i.e., not-so-regular applications) will not take full advantage of the GPU's resources, resulting in performance far short of what could be delivered. The goal of this dissertation is to enhance the GPU architecture to better handle not-so-regular parallelism. This is accomplished in two parts. First, I analyze a diverse set of data parallel applications that suffer from divergent control-flow and/or significant stall time due to memory. I propose two microarchitectural enhancements to the GPU called the Large Warp Microarchitecture and Two-Level Warp Scheduling to address these problems respectively. When combined, these mechanisms increase performance by 19% on average. Second, I examine one of the most important and fundamental applications in computing: database search. Database search is an excellent example of an application that is rich in parallelism, but rife with not-so-regular characteristics. I propose enhancements to the GPU architecture including new instructions that improve intra-warp thread communication and decision making, and also a row-buffer locality hint bit to better handle the irregular memory access patterns of index-based tree search. These proposals improve performance by 21% for full table scans, and 39% for index-based search. The result of this dissertation is an enhanced GPU architecture that better handles not-so-regular parallelism. This increases the scope of applications that run efficiently on the GPU, making it a more viable platform not only for current parallel workloads such as databases, but also for future and emerging parallel applications. / text Graphics processing units GPU GPUs GPGPU Branch divergence Warp scheduling Database search on GPUs
4	Ordonnancement pour les nouvelles plateformes de calcul avec GPUs / Scheduling for new computing platforms with GPUs Monna, Florence 25 November 2014 (has links) De plus en plus d'ordinateurs utilisent des architectures hybrides combinant des processeurs multi-cœurs (CPUs) et des accélérateurs matériels comme les GPUs (Graphics Processing Units). Ces plates-formes parallèles hybrides exigent de nouvelles stratégies d'ordonnancement adaptées. Cette thèse est consacrée à une caractérisation de ce nouveau type de problèmes d'ordonnancement. L'objectif le plus étudié dans ce travail est la minimisation du makespan, qui est un problème crucial pour atteindre le potentiel des nouvelles plates-formes en Calcul Haute Performance.Le problème central étudié dans ce travail est le problème d'ordonnancement efficace de n tâches séquentielles indépendantes sur une plateforme de m CPUs et k GPUs, où chaque tâche peut être exécutée soit sur un CPU ou sur un GPU, avec un makespan minimal. Ce problème est NP-difficiles, nous proposons donc des algorithmes d'approximation avec des garanties de performance allant de 2 à (2q + 1)/(2q) +1/(2qk), q> 0, et des complexités polynomiales. Il s'agit des premiers algorithmes génériques pour la planification sur des machines hybrides avec une garantie de performance et une fin pratique. Des variantes du problème central ont été étudiées : un cas particulier où toutes les tâches sont accélérées quand elles sont affectées à un GPU, avec un algorithme avec un ratio de 3/2, un cas où les préemptions sont autorisées sur CPU, mais pas sur GPU, le modèle des tâches malléables, avec un algorithme avec un ratio de 3/2. Enfin, le problème avec des tâches dépendantes a été étudié, avec un algorithme avec un ratio de 6. Certains des algorithmes ont été intégré dans l'ordonnanceur du système xKaapi. / More and more computers use hybrid architectures combining multi-core processors (CPUs) and hardware accelerators like GPUs (Graphics Processing Units). These hybrid parallel platforms require new scheduling strategies. This work is devoted to a characterization of this new type of scheduling problems. The most studied objective in this work is the minimization of the makespan, which is a crucial problem for reaching the potential of new platforms in High Performance Computing. The core problem studied in this work is scheduling efficiently n independent sequential tasks with m CPUs and k GPUs, where each task of the application can be processed either on a CPU or on a GPU, with minimum makespan. This problem is NP-hard, therefore we propose approximation algorithms with performance ratios ranging from 2 to (2q+1)/(2q)+1/(2qk), q>0, and corresponding polynomial time complexities. The proposed solving method is the first general purpose algorithm for scheduling on hybrid machines with a theoretical performance guarantee that can be used for practical purposes. Some variants of the core problem are studied: a special case where all the tasks are accelerated when assigned to a GPU, with a 3/2-approximation algorithm, a case where preemptions are allowed on CPUs, the same problem with malleable tasks, with an algorithm with a ratio of 3/2. Finally, we studied the problem with dependent tasks, providing a 6-approximation algorithm. Experiments based on realistic benchmarks have been conducted. Some algorithms have been integrated into the scheduler of the xKaapi runtime system for linear algebra kernels, and compared to the state-of-the-art algorithm HEFT. Ordonnancement GPUs Algorithmes d'approximation Plateformes hétérogènes Calcul haute performance Approximation duale Job scheduling GPUs 004.3
5	Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach Nath, Rajib Kumar 01 August 2010 (has links) Dense linear algebra(DLA) is one of the most seven important kernels in high performance computing. The introduction of new machines from vendors provides us opportunities to optimize DLA libraries for the new machines and thus exploit their power. Unfortunately the optimization phase is not straightforward. The optimum code of a certain Basic Linear Algebra Subprogram (BLAS) kernel, which is the core of DLA algorithms, in two different machines with different semiconductor process can be different even if they share the same features in terms of instruction set architecture, memory hierarchy and clock speed. It has become a tradition to optimize BLAS for new machines. Vendors maintain highly optimized BLAS libraries targeting their CPUs. Unfortunately the existing BLAS for GPUs is not highly optimized for DLA algorithms. In my research, I have provided new algorithms for several important BLAS kernels for different generation of GPUs and introduced a pointer redirecting approach to make BLAS run faster in generic problem size. I have also presented an auto-tuning approach to parameterize the developed BLAS algorithms and select the best set of parameters for a given card. The hardware trends have also brought up the need for updates on existing legacy DLA software packages, such as the sequential LAPACK. To take advantage of the new computational environment, successors of LAPACK must incorporate algorithms of three main characteristics: high parallelism, reduced communication, and heterogeneity-awareness. On multicore architectures, Parallel Linear Algebra Software for Multicore Architectures (PLASMA) has been developed to meet the challenges in multicore. On the other extreme, Matrix Algebra on GPU and Multicore Architectures (MAGMA) library demonstrated a hybridization approach that indeed streamlined the development of high performance DLA for multicores with GPU accelerators. The performance of these two libraries depend upon right choice of parameters for a given problem size and given number of cores and/or GPUs. In this work, the issue of automatically tuning these two libraries is presented. A prune based empirical auto-tuning method has been proposed for tuning PLASMA. Part of the tuning method for PLASMA was considered to tune hybrid MAGMA library. Optimization GPUs Multicore Dense Linear Algebra BLAS Hybrid Architecture
6	Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach Nath, Rajib Kumar 01 August 2010 (has links) Dense linear algebra(DLA) is one of the most seven important kernels inhigh performance computing. The introduction of new machines from vendorsprovides us opportunities to optimize DLA libraries for the new machinesand thus exploit their power. Unfortunately the optimization phase is notstraightforward. The optimum code of a certain Basic Linear AlgebraSubprogram (BLAS) kernel, which is the core of DLA algorithms, in twodifferent machines with different semiconductor process can be differenteven if they share the same features in terms of instruction setarchitecture, memory hierarchy and clock speed. It has become a traditionto optimize BLAS for new machines. Vendors maintain highly optimized BLASlibraries targeting their CPUs. Unfortunately the existing BLAS for GPUsis not highly optimized for DLA algorithms. In my research, I haveprovided new algorithms for several important BLAS kernels for differentgeneration of GPUs and introduced a pointer redirecting approach to makeBLAS run faster in generic problem size. I have also presented anauto-tuning approach to parameterize the developed BLAS algorithms andselect the best set of parameters for a given card.The hardware trends have also brought up the need for updates on existinglegacy DLA software packages, such as the sequential LAPACK. To takeadvantage of the new computational environment, successors of LAPACK mustincorporate algorithms of three main characteristics: high parallelism,reduced communication, and heterogeneity-awareness. On multicorearchitectures, Parallel Linear Algebra Software for MulticoreArchitectures (PLASMA) has been developed to meet the challenges inmulticore. On the other extreme, Matrix Algebra on GPU and MulticoreArchitectures (MAGMA) library demonstrated a hybridization approach thatindeed streamlined the development of high performance DLA for multicoreswith GPU accelerators. The performance of these two libraries depend uponright choice of parameters for a given problem size and given number ofcores and/or GPUs. In this work, the issue of automatically tuning thesetwo libraries is presented. A prune based empirical auto-tuning method hasbeen proposed for tuning PLASMA. Part of the tuning method for PLASMA wasconsidered to tune hybrid MAGMA library. Optimization GPUs Multicore Dense Linear Algebra BLAS Hybrid Architecture
7	Accelerating Mixed-Abstraction SystemC Models on Multi-Core CPUs and GPUs Kaushik, Anirudh Mohan January 2014 (has links) Functional verification is a critical part in the hardware design process cycle, and it contributes for nearly two-thirds of the overall development time. With increasing complexity of hardware designs and shrinking time-to-market constraints, the time and resources spent on functional verification has increased considerably. To mitigate the increasing cost of functional verification, research and academia have been engaged in proposing techniques for improving the simulation of hardware designs, which is a key technique used in the functional verification process. However, the proposed techniques for accelerating the simulation of hardware designs do not leverage the performance benefits offered by multiprocessors/multi-core and heterogeneous processors available today. With the growing ubiquity of powerful heterogeneous computing systems, which integrate multi-processor/multi-core systems with heterogeneous processors such as GPUs, it is important to utilize these computing systems to address the functional verification bottleneck. In this thesis, I propose a technique for accelerating SystemC simulations across multi-core CPUs and GPUs. In particular, I focus on accelerating simulation of SystemC models that are described at both the Register-Transfer Level (RTL) and Transaction Level (TL) abstractions. The main contributions of this thesis are: 1.) a methodology for accelerating the simulation of mixed abstraction SystemC models defined at the RTL and TL abstractions on multi-core CPUs and GPUs and 2.) An open-source static framework for parsing, analyzing, and performing source-to-source translation of identified portions of a SystemC model for execution on multi-core CPUs and GPUs.
8	Algoritmos paralelos exatos e otimizações para alinhamento de sequências biológicas longas em plataformas de alto desempenho Sandes, Edans Flávius de Oliveira 09 September 2015 (has links) Tese (doutorado)—Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2015. / Submitted by Albânia Cézar de Melo (albania@bce.unb.br) on 2016-01-21T13:09:08Z No. of bitstreams: 1 2015_EdansFlaviusOliveiraSandes.pdf: 8651626 bytes, checksum: eb6970a8085ba3a4dc141481620451c6 (MD5) / Approved for entry into archive by Patrícia Nunes da Silva(patricia@bce.unb.br) on 2016-05-15T13:57:40Z (GMT) No. of bitstreams: 1 2015_EdansFlaviusOliveiraSandes.pdf: 8651626 bytes, checksum: eb6970a8085ba3a4dc141481620451c6 (MD5) / Made available in DSpace on 2016-05-15T13:57:40Z (GMT). No. of bitstreams: 1 2015_EdansFlaviusOliveiraSandes.pdf: 8651626 bytes, checksum: eb6970a8085ba3a4dc141481620451c6 (MD5) / O alinhamento de sequências biológicas é uma das operações mais importantes em Bioinformática, sendo executado milhares de vezes a cada dia ao redor do mundo. Os algoritmos exatos existentes para este fim possuem complexidade quadrática de tempo. Logo, quando a comparação é realizada com sequências muito longas, tais como no escopo do genoma humano, matrizes na ordem de petabytes devem ser calculadas, algo considerado inviável pela maioria dos pesquisadores. O principal objetivo desta tese de Doutorado é propor e avaliar algoritmos e otimizações que permitam que o alinhamento ótimo de sequências muito longas de DNA seja obtido em tempo reduzido em plataformas de alto desempenho. Os algoritmos propostos utilizam técnicas paralelas de dividir e conquistar com complexidade de memória reduzida mantendo a complexidade quadrática do tempo de execução. O CUDAlign, em suas versões 2.0, 2.1, 3.0 e 4.0, é a principal contribuição desta tese, onde os algoritmos propostos estão integrados na mesma ferramenta, permitindo a recuperação eficiente do alinhamento ótimo entre duas sequências longas de DNA em múltiplas GPUs (Graphics Processing Unit) da NVIDIA. As otimizações propostas neste trabalho permitem que o nível máximo de paralelismo seja mantido durante quase todo o processamento. No cálculo do alinhamento em uma GPU, as otimizações Orthogonal Execution, Balanced Partition e Block Pruning foram propostas, aumentando o desempenho no cálculo da matriz e descartando áreas que não contribuem para o alinhamento ótimo. A análise formal do Block Pruning mostra que sua eficácia depende de vários fatores, tais como a similaridade entre as sequências e a forma de processamento da matriz. No cálculo do alinhamento com várias GPUs, a otimização Incremental Speculative Traceback é proposta para acelerar a obtenção do alinhamento utilizando valores especulados com alta taxa de acerto. Também são propostos métodos de balanceamento dinâmico de carga que se mostraram eficientes em ambientes simulados. A arquitetura de software chamada de Multi-Platform Architecture for Sequence Aligners (MASA) foi proposta para facilitar a portabilidade do CUDAlign para diferentes plataformas de hardware ou software. Com esta arquitetura, foi possível portar o CUDAlign para plataformas de hardware como CPUs e Intel Phi e utilizando plataformas de software como OpenMP e OmpSs. Nesta tese, sequências reais são utilizadas para validar a eficácia dos algoritmos e otimizações nas várias arquiteturas suportadas. Por meio do desempenho das ferramentas implementadas, avançou-se o estado da arte para permitir o alinhamento, em tempo viável, de todos os cromossomos homólogos do homem e do chimpanzé, utilizando algoritmos exatos de comparação de sequências com um desempenho de até 10,35 TCUPS (Trilhões de Células Atualizadas por Segundo). Até onde sabemos, esta foi a primeira vez que tal tipo de comparação foi realizada com métodos exatos. / Biological sequence alignment is one of the most important operations in Bioinformatics, executing thousands of times every day around the world. The exact algorithms for this purpose have quadratic time complexity. So when the comparison involves very long sequences, such as in the human genome, matrices with petabytes must be calculated, and this is still considered unfeasible by most researchers. The main objective of this Thesis is to propose and evaluate algorithms and optimizations that produce the optimal alignment of very long DNA sequences in a short time using high-performance computing platforms. The proposed algorithms use parallel divide-and-conquer techniques with reduced memory complexity, whilst with quadratic time complexity. CUDAlign, in its versions 2.0, 2.1, 3.0 and 4.0, is the main contribution of this Thesis. The proposed algorithms are integrated into the same tool, allowing efficient retrieval of the optimal alignment between two long DNA sequences using multiple GPUs (Graphics Processing Unit) from NVIDIA. The proposed optimizations maintain the maximum parallelism during most of the processing time. To accelerate the matrix calculation in a single GPU, the Orthogonal Execution, Balanced Partition and Block Pruning optimizations were proposed, increasing the performance of the matrix computation and discarding areas that do not contribute to the optimal alignment. The formal analysis of Block Pruning shows that its effectiveness depends on factors such as the sequences similarity and the matrix processing order. During the alignment computation with multiple GPUs, the Incremental Speculative Traceback optimization is proposed to accelerate the alignment retrieval, using speculated values with high accuracy rate. A dynamic load balancing method has also been proposed and its effectiveness has been shown in simulated environments. Finally, the software architecture called Multi-Platform Architecture for Sequence aligners (MASA) was proposed to simplify the portability of CUDAlign to different hardware and software platforms. With this architecture, it was possible to port CUDAlign to hardware platforms such as CPU and Intel Phi, and using software platforms such as OpenMP and OmpSs. In this Thesis, real sequences are used to validate the effectiveness of the proposed algorithms and optimizations in several supported architectures. Our proposed tools were able to advance the state-of-the-art of sequence alignment algorithms, allowing a fast retrieval of all human and chimpanzee homologous chromosomes, using exact algorithms at an unprecedented rate of up to 10.35 TCUPS (Trillions of Cells Updated Per Second). As far as we know, this was the first time that this type of comparison was carried out with exact sequence comparison algorithms. Bioinformática Comparação de sequências Biologia computacional
9	MASA-OpenCL : comparação paralela de sequências biológicas longas em GPU Figueirêdo Júnior, Marco Antônio Caldas de 05 August 2015 (has links) Dissertação (mestrado)—Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2015. / Submitted by Raquel Viana (raquelviana@bce.unb.br) on 2016-02-04T15:52:54Z No. of bitstreams: 1 2015_MarcoAntônioCaldasdeFigueirêdoJúnior.pdf: 2211162 bytes, checksum: 999b7a9af378fd239a06877f9dbd003b (MD5) / Approved for entry into archive by Raquel Viana(raquelviana@bce.unb.br) on 2016-02-04T15:56:38Z (GMT) No. of bitstreams: 1 2015_MarcoAntônioCaldasdeFigueirêdoJúnior.pdf: 2211162 bytes, checksum: 999b7a9af378fd239a06877f9dbd003b (MD5) / Made available in DSpace on 2016-02-04T15:56:38Z (GMT). No. of bitstreams: 1 2015_MarcoAntônioCaldasdeFigueirêdoJúnior.pdf: 2211162 bytes, checksum: 999b7a9af378fd239a06877f9dbd003b (MD5) / A comparação de sequências biológicas é uma tarefa importante executada com frequência na análise genética de organismos. Algoritmos que realizam este procedimento utilizando um método exato possuem complexidade quadrática de tempo, demandando alto poder computacional e uso de técnicas de paralelização. Muitas soluções têm sido propostas para tratar este problema em GPUs, mas a maioria delas são implementadas em CUDA, restringindo sua execução a GPUs NVidia. Neste trabalho, propomos e avaliamos o MASA-OpenCL, solução desenvolvida em OpenCL capaz de executar a comparação paralela de sequências biológicas em plataformas heterogêneas de computação. O MASA-OpenCL foi testado em diferentes modelos de CPUs e GPUs, avaliando pares de sequências de DNA cujos tamanhos variam entre 10 KBP (milhares de pares de bases) e 47 MBP (milhões de pares de bases), com desempenho superior a outras soluções existentes baseadas em CUDA. A solução obteve um máximo de 179,2 GCUPS (bilhões de células atualizadas por segundo) em uma GPU AMD R9 280X. Até onde temos conhecimento, esta é única solução implementada em OpenCL que realiza a comparação de sequências longas de DNA, e o desempenho alcançado é, até o momento, o melhor já obtido com uma única GPU. ______________________________________________________________________________________________ ABSTRACT / The comparison of biological sequences is an important task performed frequently in the genetic analysis of organisms. Algorithms that perform biological comparison using an exact method require quadratic time complexity, demanding high computational power and use of parallelization techniques. Many solutions have been proposed to address this problem on GPUs, but most of them are implemented in CUDA, restricting its execution to NVidia GPUs. In this work, we propose and evaluate MASA-OpenCL, which is developed in OpenCL and capable of performing parallel comparison of biological sequences in heterogeneous computing platforms. The application was tested in different families of CPUs and GPUs, evaluating pairs of DNA sequences whose sizes range between 10 KBP (thousands of base pairs) and 47 MBP (millions of base pairs) with superior performance to other existing solutions based on CUDA. Our solution achieved a maximum of 179.2 GCUPS (billions of cells updated per second) on an AMD R9 280X GPU. As far as we know, this is the only solution implemented in OpenCL that performs long DNA sequence comparison, and the achieved performance is, so far, the best ever obtained on a single GPU. Programação paralela (Computação) Sequenciamento genômico
10	Parallel computation techniques for virtual acoustics and physical modelling synthesis Webb, Craig Jonathan January 2014 (has links) The numerical simulation of large-scale virtual acoustics and physical modelling synthesis is a computationally expensive process. Time stepping methods, such as finite difference time domain, can be used to simulate wave behaviour in models of three-dimensional room acoustics and virtual instruments. In the absence of any form of simplifying assumptions, and at high audio sample rates, this can lead to simulations that require many hours of computation on a standard Central Processing Unit (CPU). In recent years the video game industry has driven the development of Graphics Processing Units (GPUs) that are now capable of multi-teraflop performance using highly parallel architectures. Whilst these devices are primarily designed for graphics calculations, they can also be used for general purpose computing. This thesis explores the use of such hardware to accelerate simulations of three-dimensional acoustic wave propagation, and embedded systems that create physical models for the synthesis of sound. Test case simulations of virtual acoustics are used to compare the performance of workstation CPUs to that of Nvidia’s Tesla GPU hardware. Using representative multicore CPU benchmarks, such simulations can be accelerated in the order of 5X for single precision and 3X for double precision floating-point arithmetic. Optimisation strategies are examined for maximising GPU performance when using single devices, as well as for multiple device codes that can compute simulations using billions of grid points. This allows the simulation of room models of several thousand cubic metres at audio rates such as 44.1kHz, all within a useable time scale. The performance of alternative finite difference schemes is explored, as well as strategies for the efficient implementation of boundary conditions. Creating physical models of acoustic instruments requires embedded systems that often rely on sparse linear algebra operations. The performance efficiency of various sparse matrix storage formats is detailed in terms of the fundamental operations that are required to compute complex models, with an optimised storage system achieving substantial performance gains over more generalised formats. An integrated instrument model of the timpani drum is used to demonstrate the performance gains that are possible using the optimisation strategies developed through this thesis. 005.2

Search results