Global ETD Search

1	Runtime Adaptation for Autonomic Heterogeneous Computing Scogland, Thomas R. 12 December 2014 Heterogeneity is increasing across all levels of computing, with the rise of accelerators such as GPUs, FPGAs, and other coprocessors into everything from cell phones to supercomputers. More quietly it is increasing with the rise of NUMA systems, hierarchical caching, OS noise, and a myriad of other factors. As heterogeneity becomes a fact of life, efficiently managing heterogeneous compute resources is becoming a critical, and ever more complex, task. The focus of this dissertation is to lay the foundation for an autonomic system for heterogeneous computing, employing runtime adaptation to improve performance portability and performance consistency while maintaining or increasing programmability. We investigate heterogeneity arising from a myriad of factors, grouped into the dimensions of locality and capability. This work has resulted in runtime schedulers capable of automatically detecting and mitigating heterogeneity in physically homogeneous systems through MPI and adaptive coscheduling for physically heterogeneous accelerator-based systems, as well as a synthesis of the two to address multiple levels of heterogeneity as a coherent whole. We also discuss our current work towards the next generation of fine-grained scheduling and synchronization across heterogeneous platforms in the design of a highly scalable and portable concurrent queue for many-core systems. Each component addresses aspects of the urgent need for automated management of the extreme and ever expanding complexity introduced by heterogeneity. / Ph. D. Scheduling Graphics Processing Unit (GPU) OpenMP
2	Development of High-Performance GPUs in Accelerated Computing Liu, Zhuren 12 1900 This dissertation conducted an in-depth analysis of graphics processing unit (GPU) performance models and configurations. It profiled GPU systems across a range of hardware configurations and workloads, employing machine learning techniques such as support vector machines (SVM) and Random Forest algorithms to develop predictive models and identify key system parameters impacting performance. This work also presents the Genomics-GPU benchmark suite, comprising ten critical algorithms representative of genomics analysis tasks. It performed extensive evaluations using both hardware and software simulators, accompanied by microarchitectural characterizations examining memory access patterns, thread scaling behavior, pipeline stalls, and interconnect network impacts. It proposes a novel optimization: a lightweight processing-in-memory (PIM) data pre-processing unit (DPU) integrated within the GPU’s memory hierarchy. This architectural enhancement offloads preliminary filtering operations to memory units, thereby reducing data movement between memory and processing units, a significant bottleneck in handling large genomic datasets. The specialized DPU performs sequence alignment pre-processing directly within DRAM, improving data throughput and overall performance by 1.23X-8.37X. High-Performance GPUs Accelerated Computing graphics processing unit (GPU)
3	High performance bioinformatics and computational biology on general-purpose graphics processing units Ling, Cheng January 2012 Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of the fields of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes these more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion of the size of biological data at a rate which outpaces the rate of increases in the computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high performance and efficient implementation of BCB applications in order to meet the demands of biological data increases at affordable cost. The thesis presents detailed design and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. On the other hand, phylogenetic analysis is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of system biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes and then phylogenetic trees are constructed to illustrate evolutionary relationships among genes and organisms. 
However, both biological sequence alignment and phylogenetic analysis are computationally expensive applications as their computing and memory requirements grow polynomially or even worse with the size of sequence databases. The thesis first presents a multi-threaded parallel design of the Smith-Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to solve the restriction on the length of the query sequence in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between two main task parallelization approaches (inter-task and intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. flexible length of sequences in real world applications. It also outperforms an equivalent GPP-based implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically. This achieved up to 3x speed-up compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA). This achieves 8x-20x speed-up compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method, however, only gives one possible tree, which strongly depends on the evolutionary model used. A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference. 
The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speed up compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Arrays (FPGA) technology. 572.8
4	Cellular matrix for parallel k-means and local search to Euclidean grid matching Wang, Hongjian 03 December 2015 In this thesis, we propose a parallel computing model, called cellular matrix, to address the challenges of parallel computation when applied to Euclidean graph matching problems. These NP-hard optimization problems involve data distributed in the plane and elastic structures represented by graphs that must match the data. They include problems known under various names, such as geometric k-means, elastic net, topographic mapping, and elastic image matching. The Euclidean traveling salesman problem (TSP), the median cycle problem, and the image matching problem are also examples that can be modeled by graph matching. The contribution presented is divided into three parts. In the first part, we present the cellular matrix model that partitions data and defines the level of granularity of parallel computation. We present a generic loop for parallel computations, and this loop models the projection between graphs and their matching. In the second part, we apply the parallel computing model to k-means algorithms in the plane extended with topology. The proposed algorithms are applied to the TSP, structured mesh generation, and image segmentation following the concept of superpixel. The approach is called superpixel adaptive segmentation map (SPASM). In the third part, we propose a parallel local search algorithm, called distributed local search (DLS). The solution results from the many local operations, including local evaluation, neighborhood search, and structured move, performed on the distributed data in the plane. The algorithm is applied to Euclidean graph matching problems including stereo matching and optical flow. 
Cellular matrix Graph matching K-means Local search Parallel algorithms Graphics processing unit (GPU) 620
5	Design of a Multi-Core Multi-thread Floating-Point Processor and Its Application in Computer Graphics Yeh, Chia-Yu 06 September 2011 Graphics processing unit (GPU) designs usually adopt various computer architecture techniques to boost computation speed, including single-instruction multiple-data (SIMD), very-long-instruction-word (VLIW), multi-threading, and/or multi-core. In OpenGL ES 2.0, the user-programmable vertex shader (VS) hardware unit can be designed using a vector SIMD computation unit so that it can efficiently compute the matrix-vector multiplication, one of the key operations in vertex transformation. Recently, high-performance GPUs, such as the Tesla series from NVIDIA, have been designed with many-core architectures in which each core is responsible for scalar operations. The intention is to allow for efficient execution of general-purpose computations in addition to the specialized graphics computations. In this thesis, we design a scalar-based multi-threaded GPU that is composed of four scalar processors and one special-function unit, and can execute multi-threaded instructions. We use the example of vertex transformation to demonstrate the execution efficiency of the scalar-based multi-threaded GPU. We also make a comparison with the vector-based SIMD GPU. multi-threading graphics processing unit (GPU) vertex shader SIMD matrix-vector multiplication OpenGL ES 2.0
6	Delaunay triangulation: a GPU-based implementation and its use in real-time problems of computer vision and graphics Βασιλείου, Πέτρος 01 February 2013 A fast solver for the Delaunay triangulation (DT) problem is one of the basic ingredients of many practical and scientific applications. Existing graphics processing unit (GPU)-based implementations of DT algorithms suffer from two serious drawbacks. The first is the algorithm's dependence on CPU guidance of the GPU calculations: although modern GPUs have high computational throughput, if feedback from the CPU is necessary for the algorithm to progress, the overhead caused by CPU-GPU communication can seriously degrade performance. The second, more serious drawback is their dependence on the distribution of the input point set. Most GPU-based implementations run well only on uniformly distributed point sets; in many practical applications, however, this is not the case. We propose a new algorithm that does not suffer from these problems. Graphics cards 006.6 Delaunay Graphics processing unit (GPU) Computational geometry Triangulation
7	Pencil beam dose calculation for proton therapy on graphics processing units da Silva, Joakim January 2016 Radiotherapy delivered using scanned beams of protons enables greater conformity between the dose distribution and the tumour than conventional radiotherapy using X rays. However, the dose distributions are more sensitive to changes in patient anatomy, and tend to deteriorate in the presence of motion. Online dose calculation during treatment delivery offers a way of monitoring the delivered dose in real time, and could be used as a basis for mitigating the effects of motion. The aim of this work has therefore been to investigate how the computational power offered by graphics processing units can be harnessed to enable fast analytical dose calculation for online monitoring in proton therapy. The first part of the work consisted of a systematic investigation of various approaches to implementing the most computationally expensive step of the pencil beam algorithm to run on graphics processing units. As a result, it was demonstrated how the kernel superposition operation, or convolution with a spatially varying kernel, can be efficiently implemented using a novel scatter-based approach. For the intended application, this outperformed the conventional gather-based approach suggested in the literature, permitting faster pencil beam dose calculation and potential speedups of related algorithms in other fields. In the second part, a parallelised proton therapy dose calculation engine employing the scatter-based kernel superposition implementation was developed. Such a dose calculation engine, running all of the principal steps of the pencil beam algorithm on a graphics processing unit, had not previously been presented in the literature. The accuracy of the calculation in the high- and medium-dose regions matched that of a clinical treatment planning system whilst the calculation was an order of magnitude faster than previously reported. 
Importantly, the calculation times were short, both compared to the dead time available during treatment delivery and to the typical motion period, making the implementation suitable for online calculation. In the final part, the beam model of the dose calculation engine was extended to account for the low-dose halo caused by particles travelling at large angles with the beam, making the algorithm comparable to those in current clinical use. By reusing the workflow of the initial calculation but employing a lower resolution for the halo calculation, it was demonstrated how the improved beam model could be included without prohibitively prolonging the calculation time. Since the implementation was based on a widely used algorithm, it was further predicted that by careful tuning, the dose calculation engine would be able to reproduce the dose from a general beamline with sufficient accuracy. Based on the presented results, it was concluded that, by using a single graphics processing unit, dose calculation using the pencil beam algorithm could be made sufficiently fast for online dose monitoring, whilst maintaining the accuracy of current clinical systems. 616.99
8	Efficient Dynamic Automatic Memory Management And Concurrent Kernel Execution For General-Purpose Programs On Graphics Processing Units Pai, Sreepathi 11 1900 Modern supercomputers now use accelerators to achieve their performance, with the most widely used accelerator being the Graphics Processing Unit (GPU). However, achieving the performance potential of systems that combine a GPU and CPU is an arduous task which could be made easier with the assistance of the compiler or runtime. In particular, exploiting two features of GPU architectures -- distributed memory and concurrent kernel execution -- is critical to achieving good performance, but in current GPU programming systems, programmers must exploit them manually. This can lead to poor performance. In this thesis, we propose automatic techniques that: i) perform data transfers between the CPU and GPU, ii) allocate resources for concurrent kernels, and iii) schedule concurrent kernels efficiently without programmer intervention. Most GPU programs access data in GPU memory for performance. Manually inserting data transfers that move data to and from this GPU memory is an error-prone and tedious task. In this work, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale data accesses and uses a runtime to initiate transfers as necessary. This avoids redundant transfers that are exhibited by all other existing automatic memory management proposals for general purpose programs. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. 
Tested on eight programs ported from the Rodinia benchmark suite it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler--runtime automatic memory management system. Compared to other existing runtime-only (ADSM) and compiler-only (OpenMPC) proposals, it also transfers 2.2x to 13.3x less data on average. Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite. Current GPUs therefore allow concurrent execution of kernels to improve utilization. We study concurrent execution of GPU kernels using multiprogrammed workloads on current NVIDIA Fermi GPUs. On two-program workloads from Parboil2 we find concurrent execution is often no better than serialized execution. We identify lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware runtime concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads over the current CUDA concurrency implementation. Recent NVIDIA GPUs use a FIFO policy in their thread block scheduler (TBS) to schedule thread blocks of concurrent kernels. 
We show that FIFO leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose use of the Shortest Remaining Time First (SRTF) policy instead. Since SRTF requires an estimate of runtime (i.e. execution time), we introduce Structural Runtime Prediction that uses the grid structure of GPU programs for predicting runtimes. Using a novel Staircase model of GPU kernel execution, we show that kernel runtime can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench and find that predictions made after the execution of a single thread block are between 0.48x and 1.08x of actual runtime. We implement the SRTF policy for concurrent kernels that uses this predictor and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. Compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive which controls resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves STP to within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF. GPGPU Automatic Memory Management Concurrent Kernel Graphics Processing Unit (GPU) Elastic Kernels Computer and Information Science
9	Generalizing the Utility of Graphics Processing Units in Large-Scale Heterogeneous Computing Systems Xiao, Shucai 03 July 2013 Today, heterogeneous computing systems are widely used to meet the increasing demand for high-performance computing. These systems commonly use powerful and energy-efficient accelerators to augment general-purpose processors (i.e., CPUs). The graphics processing unit (GPU) is one such accelerator. Originally designed solely for graphics processing, GPUs have evolved into programmable processors that can deliver massive parallel processing power for general-purpose applications. Using SIMD (Single Instruction Multiple Data) based components as building units, the current GPU architecture is well suited for data-parallel applications where the execution of each task is independent. With the delivery of programming models such as Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), programming GPUs has become much easier than before. However, developing and optimizing an application on a GPU is still a challenging task, even for well-trained computing experts. Such programming tasks will be even more challenging in large-scale heterogeneous systems, particularly in the context of utility computing, where GPU resources are used as a service. These challenges are largely due to the limitations in the current programming models: (1) there are no intra- and inter-GPU cooperative mechanisms that are natively supported; (2) current programming models only support the utilization of GPUs installed locally; and (3) to use GPUs on another node, application programs need to explicitly call application programming interface (API) functions for data communication. To reduce the mapping efforts and to better utilize the GPU resources, we investigate generalizing the utility of GPUs in large-scale heterogeneous systems with GPUs as accelerators. 
We generalize the utility of GPUs through the transparent virtualization of GPUs, which can enable applications to view all GPUs in the system as if they were installed locally. As a result, all GPUs in the system can be used as local GPUs. Moreover, GPU virtualization is a key capability to support the notion of "GPU as a service." Specifically, we propose the virtual OpenCL (or VOCL) framework for the transparent virtualization of GPUs. To achieve good performance, we optimize and extend the framework in three aspects: (1) optimize VOCL by reducing the data transfer overhead between the local node and remote node; (2) propose GPU synchronization to reduce the overhead of switching back and forth if multiple kernel launches are needed for data communication across different compute units on a GPU; and (3) extend VOCL to support live virtual GPU migration for quick system maintenance and load rebalancing across GPUs. With the above optimizations and extensions, we thoroughly evaluate VOCL along three dimensions: (1) show the performance improvement for each of our optimization strategies; (2) evaluate the overhead of using remote GPUs via several microbenchmark suites as well as a few real-world applications; and (3) demonstrate the overhead as well as the benefit of live virtual GPU migration. Our experimental results indicate that VOCL can generalize the utility of GPUs in large-scale systems at a reasonable virtualization and migration cost. / Ph. D. Graphics Processing Unit (GPU) CUDA OpenCL BLAST Smith-Waterman Fine-Grained Parallelization GPU Virtualization
10	Performance Modeling, Optimization, and Characterization on Heterogeneous Architectures Panwar, Lokendra Singh 21 October 2014 Today, heterogeneous computing has truly reshaped the way scientists think about and approach high-performance computing (HPC). Hardware accelerators such as general-purpose graphics processing units (GPUs) and the Intel Many Integrated Core (MIC) architecture continue to make inroads in accelerating large-scale scientific applications. These advancements, however, introduce new sets of challenges to the scientific community, such as selecting the best processor for an application, finding effective performance optimization strategies, and maintaining performance portability across architectures. In this thesis, we present our techniques and approach to address some of these significant issues. Firstly, we present a fully automated approach to project the relative performance of an OpenCL program over different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run faster for a given kernel. Use cases of this technique include scheduling or migrating GPU workloads over a heterogeneous cluster with different types of GPUs. We then present our approach to accelerate a seismology modeling application that is based on the finite difference method (FDM), using MPI and CUDA over a hybrid CPU+GPU cluster. We describe the generic computational complexities involved in porting such applications to the GPUs and present our strategy of efficient performance optimization and characterization. We also show how performance modeling can be used to reason about and drive the hardware-specific optimizations on the GPU. 
The performance evaluation of our approach delivers a maximum speedup of 23-fold with a single GPU and 33-fold with dual GPUs per node over the serial version of the application, which in turn results in a many-fold speedup when coupled with the MPI distribution of the computation across the cluster. We also study the efficacy of GPU-integrated MPI, with MPI-ACC as an example implementation, in a seismology modeling application and discuss the lessons learned. / Master of Science Heterogeneous Computing Graphics Processing Unit (GPU) GPU Emulation Performance Modeling Finite Difference Method Seismology Modeling

Search results