Global ETD Search

1	Dynamic warp formation : exploiting thread scheduling for efficient MIMD control flow on SIMD graphics hardware Fung, Wilson Wai Lun 11 1900 (has links) Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this thesis, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by an average of 47% for an estimated area increase of 8%. GPU SIMD Control flow Graphics processing unit
2	Dynamic warp formation : exploiting thread scheduling for efficient MIMD control flow on SIMD graphics hardware Fung, Wilson Wai Lun 11 1900 (has links) Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this thesis, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by an average of 47% for an estimated area increase of 8%. GPU SIMD Control flow Graphics processing unit
3	Dynamic warp formation : exploiting thread scheduling for efficient MIMD control flow on SIMD graphics hardware Fung, Wilson Wai Lun 11 1900 (has links) Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this thesis, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by an average of 47% for an estimated area increase of 8%. / Applied Science, Faculty of / Electrical and Computer Engineering, Department of / Graduate GPU SIMD Control flow Graphics processing unit
4	Runtime Adaptation for Autonomic Heterogeneous Computing Scogland, Thomas R. 12 December 2014 (has links) Heterogeneity is increasing across all levels of computing, with the rise of accelerators such as GPUs, FPGAs, and other coprocessors into everything from cell phones to supercomputers. More quietly it is increasing with the rise of NUMA systems, hierarchical caching, OS noise, and a myriad of other factors. As heterogeneity becomes a fact of life, efficiently managing heterogeneous compute resources is becoming a critical, and ever more complex, task. The focus of this dissertation is to lay the foundation for an autonomic system for heterogeneous computing, employing runtime adaptation to improve performance portability and performance consistency while maintaining or increasing programmability. We investigate heterogeneity arising from a myriad of factors, grouped into the dimensions of locality and capability. This work has resulted in runtime schedulers capable of automatically detecting and mitigating heterogeneity in physically homogeneous systems through MPI and adaptive coscheduling for physically heterogeneous accelerator based systems as well as a synthesis of the two to address multiple levels of heterogeneity as a coherent whole. We also discuss our current work towards the next generation of fine-grained scheduling and synchronization across heterogeneous platforms in the design of a highly-scalable and portable concurrent queue for many-core systems. Each component addresses aspects of the urgent need for automated management of the extreme and ever expanding complexity introduced by heterogeneity. / Ph. D. Scheduling Graphics Processing Unit (GPU) OpenMP
5	Enabling rapid iterative model design within the laboratory environment Clayton, Thomas F. January 2009 (has links) This thesis presents a proof of concept study for the better integration of the electrophysiological and modelling aspects of neuroscience. Members of these two sub-disciplines collaborate regularly, but due to differing resource requirements, and largely incompatible spheres of knowledge, cooperation is often impeded by miscommunication and delays. To reduce the model design time, and provide a platform for more efficient experimental analysis, a rapid iterative model design method is proposed. The main achievement of this work is the development of a rapid model evaluation method based on parameter estimation, utilising a combination of evolutionary algorithms (EAs) and graphics processing unit (GPU) hardware acceleration. This method is the primary force behind the better integration of modelling and laboratorybased electrophysiology, as it provides a generic model evaluation method that does not require prior knowledge of model structure, or expertise in modelling, mathematics, or computer science. If combined with a suitable intuitive and user targeted graphical user interface, the ideas presented in this thesis could be developed into a suite of tools that would enable new forms of experimentation to be performed. The latter part of this thesis investigates the use of excitability-based models as the basis of an iterative design method. They were found to be computationally and structurally simple, easily extensible, and able to reproduce a wide range of neural behaviours whilst still faithfully representing underlying cellular mechanisms. A case study was performed to assess the iterative design process, through the implementation of an excitability-based model. The model was extended iteratively, using the rapid model evaluation method, to represent a vasopressin releasing neuron. Not only was the model implemented successfully, but it was able to suggest the existence of other more subtle cell mechanisms, in addition to highlighting potential failings in previous implementations of the class of neuron. 612.8
6	Genetic programming and cellular automata for fast flood modelling on multi-core CPU and many-core GPU computers Gibson, Michael John January 2015 (has links) Many complex systems in nature are governed by simple local interactions, although a number are also described by global interactions. For example, within the field of hydraulics the Navier-Stokes equations describe free-surface water flow, through means of the global preservation of water volume, momentum and energy. However, solving such partial differential equations (PDEs) is computationally expensive when applied to large 2D flow problems. An alternative which reduces the computational complexity, is to use a local derivative to approximate the PDEs, such as finite difference methods, or Cellular Automata (CA). The high speed processing of such simulations is important to modern scientific investigation especially within urban flood modelling, as urban expansion continues to increase the number of impervious areas that need to be modelled. Large numbers of model runs or large spatial or temporal resolution simulations are required in order to investigate, for example, climate change, early warning systems, and sewer design optimisation. The recent introduction of the Graphics Processor Unit (GPU) as a general purpose computing device (General Purpose Graphical Processor Unit, GPGPU) allows this hardware to be used for the accelerated processing of such locally driven simulations. A novel CA transformation for use with GPUs is proposed here to make maximum use of the GPU hardware. CA models are defined by the local state transition rules, which are used in every cell in parallel, and provide an excellent platform for a comparative study of possible alternative state transition rules. Writing local state transition rules for CA systems is a difficult task for humans due to the number and complexity of possible interactions, and is known as the ‘inverse problem’ for CA. Therefore, the use of Genetic Programming (GP) algorithms for the automatic development of state transition rules from example data is also investigated in this thesis. GP is investigated as it is capable of searching the intractably large areas of possible state transition rules, and producing near optimal solutions. However, such population-based optimisation algorithms are limited by the cost of many repeated evaluations of the fitness function, which in this case requires the comparison of a CA simulation to given target data. Therefore, the use of GPGPU hardware for the accelerated learning of local rules is also developed. Speed-up factors of up to 50 times over serial Central Processing Unit (CPU) processing are achieved on simple CA, up to 5-10 times speedup over the fully parallel CPU for the learning of urban flood modelling rules. Furthermore, it is shown GP can generate rules which perform competitively when compared with human formulated rules. This is achieved with generalisation to unseen terrains using similar input conditions and different spatial/temporal resolutions in this important application domain. 004
7	Development of High-Performance GPUs in Accelerated Computing Liu, Zhuren 12 1900 (has links) This dissertation conducted an in-depth analysis of graphics processing unit (GPU) performance models and configurations. It profiled GPU systems across a range of hardware configurations and workloads, employing machine learning techniques such as support vector machines (SVM) and Random Forest algorithms to develop predictive models and identify key system parameters impacting performance. This work also presents the Genomics-GPU benchmark suite, comprising ten critical algorithms representative of genomics analysis tasks. It performed extensive evaluations using both hardware and software simulators, accompanied by microarchitectural characterizations examining memory access patterns, thread scaling behavior, pipeline stalls, and interconnect network impacts. It proposes a novel optimization: a lightweight process-in-memory (PIM) data pre-processing unit (DPU) integrated within the GPU’s memory hierarchy. This architectural enhancement offloads preliminary filtering operations to memory units, thereby reducing data movement between memory and processing units—a significant bottleneck in handling large genomic datasets. The specialized data pre-processing unit (DPU) performs sequence alignment pre-processing directly within DRAM, improving data throughput and overall performance by 1.23X-8.37X. High-Performance GPUs Accelerated Computing graphics processing unit (GPU)
8	A General-Purpose GPU Reservoir Computer Keith, Tūreiti January 2013 (has links) The reservoir computer comprises a reservoir of possibly non-linear, possibly chaotic dynamics. By perturbing and taking outputs from this reservoir, its dynamics may be harnessed to compute complex problems at “the edge of chaos”. One of the first forms of reservoir computer, the Echo State Network (ESN), is a form of artificial neural network that builds its reservoir from a large and sparsely connected recurrent neural network (RNN). The ESN was initially introduced as an innovative solution to train RNNs which, up until that point, was a notoriously difficult task. The innovation of the ESN is that, rather than train the RNN weights, only the output is trained. If this output is assumed to be linear, then linear regression may be used. This work presents an effort to implement the Echo State Network, and an offline linear regression training method based on Tikhonov regularisation. This implementation targeted the general purpose graphics processing unit (GPU or GPGPU). The behaviour of the implementation was examined by comparing it with a central processing unit (CPU) implementation, and by assessing its performance against several studied learning problems. These assessments were performed using all 4 cores of the Intel i7-980 CPU and an Nvidia GTX480. When compared with a CPU implementation, the GPU ESN implementation demonstrated a speed-up starting from a reservoir size of between 512 and 1,024. A maximum speed-up of approximately 6 was observed at the largest reservoir size tested (2,048). The Tikhonov regularisation (TR) implementation was also compared with a CPU implementation. Unlike the ESN execution, the GPU TR implementation was largely slower than the CPU implementation. Speed-ups were observed at the largest reservoir and state history sizes, the largest of which was 2.6813. The learning behaviour of the GPU ESN was tested on three problems, a sinusoid, a Mackey-Glass time-series, and a multiple superimposed oscillator (MSO). The normalised root-mean squared errors of the predictors were compared. The best observed sinusoid predictor outperformed the best MSO predictor by 4 orders of magnitude. In turn, the best observed MSO predictor outperformed the best Mackey-Glass predictor by 2 orders of magnitude. graphics processing unit echo state network reservoir computer tikhonov regularisation linear regression
9	High performance bioinformatics and computational biology on general-purpose graphics processing units Ling, Cheng January 2012 (has links) Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of the fields of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes these more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion of the size of biological data at a rate which outpaces the rate of increases in the computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high performance and efficient implementation of BCB applications in order to meet the demands of biological data increases at affordable cost. The thesis presents detailed design and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. On the other hand, phylogenetic analysis is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of system biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes and then phylogenetic trees are constructed to illustrate evolutionary relationships among genes and organisms. However, both biological sequence alignment and phylogenetic analysis are computationally expensive applications as their computing and memory requirements grow polynomially or even worse with the size of sequence databases. The thesis firstly presents a multi-threaded parallel design of the Smith- Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to solve the restriction on the length of the query sequence in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between two main task parallelization approaches (Inter-task and Intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. flexible length of sequences in real world applications. It also outperforms an equivalent GPPbased implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically. This achieved up to 3x speed-up improvements compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA). This achieves 8x-20x speed up compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method however only gives one possible tree which strongly depends on the evolutionary model used. A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference. The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speed up compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Arrays (FPGA) technology. 572.8
10	Parallélisation de simulations physiques utilisant un modéle de Boltzmann mullti-phases et multi-composants en vue d'un épandage de GNL sur sol / Parallelisation of physical simulations using Boltzmann method multiphase and multicomponent with the aim of manuring GNL on ground Duchateau, Julien 09 December 2015 (has links) Cette thèse a pour but de définir et de développer des solutions informatiques de manière à permettre la mise en place de simulations physiques sur des domaines de simulation très grands tels qu'un site industriel comme le terminal méthanier de Dunkerque. Le modèle d'écoulement mis en place est basé sur la méthode de Boltzmann sur réseau et permet de gérer de nombreux cas de simulation. Différentes architectures de calculs sont étudiées dans ce travail de thèse. L'utilisation du processeur central ainsi que de processeurs graphiques pour la parallélisation des calculs est abordée. Des solutions sont mises en place de manière à obtenir une parallélisation efficace du modèle de calcul sur plusieurs GPUS pouvant calculer en parallèle. Une approche de maillage progressif du maillage de simulation est également abordée pour gérer dynamiquement la quantité de mémoire nécessaire pour simuler en fonction des besoins de la simulation et de sa progression. Son intégration sur une architecture de calcul composée de plusieurs processeurs graphiques est également mise en avant. Finalement, une solution de type "Out-of-core" a été mise en place pour traiter des cas où la mémoire liée aux processeurs graphiques est insuffisante pour simuler. En effet, les processeurs graphiques disposent généralement d'une quantité de mémoire nettement inférieure à celle de la RAM du processeur central. La mise en place d'un système d'échange efficace entre les processeurs graphiques et la RAM est donc essentielle. / This thesis has for goal to define and develop solutions in order to achieve physical simulations on large simulation domains such as industrial sites (Dunkerque LNG Terminal). The simulation model is based on the lattice Boltzmann method (LBM) and allows to treat several simulation cases. The use of several computing architectures are studied in this work. The use of a multicore central processing unit (CPU) and also several graphics processing units (GPUS) is considered. An efficient parllelization of the simulation model is obtained by the use of several GPUS able to calculate in parallel. A progressive mesh algorithm is also defined in order to automatically mesh the simulation domain according to fluids propagation. Its integration on a multi-GPU architecture is studied. Finally, an "out-of-core" method is introduced in order to handle cases that require more memory than all GPUS have. Indeed, GPU memory is generally significantly inferior to the CPU memory. The definition of an exchange system between GPUS and the CPU is therefore essential. Parallélisme GPU Simulation Méthode de Boltzmann Parallelism Graphics Processing Unit Simulation Boltzmann method

Search results