About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Random Forests for CUDA GPUs

Lapajne, Mikael Hellborg, Slat, Daniel January 2010 (has links)
Context. Machine learning is a complex and resource-consuming process that requires a lot of computing power. With the constant growth of information, the need for efficient, high-performance algorithms is increasing. Today's commodity graphics cards are parallel multiprocessors with high computing capacity at an attractive price, and are usually pre-installed in new PCs. The graphics cards provide an additional resource to be used in machine learning applications. The Random Forest learning algorithm, which has been shown to be competitive within machine learning, has good potential for performance gains through parallelization. Objectives. In this study we implement and review a revised Random Forest algorithm for GPU execution using CUDA. Methods. A review of previous work in the area was done by studying articles from several sources, including Compendex, Inspec, IEEE Xplore, ACM Digital Library and Springer Link. Additional information regarding GPU architecture and implementation-specific details was obtained mainly from documentation available from Nvidia and the Nvidia developer forums. The implemented algorithm was benchmarked against two state-of-the-art CPU implementations of the Random Forest algorithm, comparing both the time consumed for training and classification and the classification accuracy. Results. Measurements from benchmarks of the three algorithms are gathered, showing their performance on two publicly available data sets. Conclusion. We conclude that, under the right conditions, our implementation is able to outperform its competitors. We also conclude that this holds only for certain data sets, depending on their size. Moreover, there is potential for further improvement of the algorithm, both in performance and in adaptation towards a wider range of real-world applications.
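The training-and-voting structure that makes Random Forests amenable to GPU parallelization can be sketched on the CPU side. The code below is an illustrative toy (depth-1 trees, a single exhaustive threshold search), not the thesis's CUDA implementation:

```python
import random
from collections import Counter

def train_stump(X, y):
    # Fit a depth-1 tree: try every (feature, midpoint-threshold) split
    # and keep the one with the fewest misclassifications.
    best = None
    for f in range(len(X[0])):
        vals = sorted(set(row[f] for row in X))
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = (sum(yi != l_lab for yi in left)
                   + sum(yi != r_lab for yi in right))
            if best is None or err < best[0]:
                best = (err, f, t, l_lab, r_lab)
    if best is None:                  # degenerate sample: no split exists
        lab = Counter(y).most_common(1)[0][0]
        return lambda row: lab
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab

def train_forest(X, y, n_trees=25, seed=0):
    # Each tree is trained on an independent bootstrap sample -- on a GPU,
    # every tree (and later every vote) is an independent unit of work.
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict(forest, row):
    # Majority vote over all trees (a parallel reduction on the GPU).
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]
```

The per-tree independence of both training and voting is the property that the parallelized algorithm exploits.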
32

Agent-based crowd simulation using GPU computing

O’Reilly, Sean Patrick January 2014 (has links)
M.Sc. (Information Technology) / The purpose of the research is to investigate agent-based approaches to virtual crowd simulation. Crowds are ubiquitous and are becoming an increasingly common phenomenon in modern society, particularly in urban settings. As such, crowd simulation systems are becoming increasingly popular in training simulations, pedestrian modelling, emergency simulations, and multimedia. One of the primary challenges in crowd simulation is the ability to model realistic, large-scale crowd behaviours in real time. This is a challenging problem, as the size, visual fidelity, and complex behaviour models of the crowd all have an impact on the available computational resources. In the last few years, the graphics processing unit (GPU) has presented itself as a viable computational resource for general-purpose computation. Traditionally, GPUs were used solely for their ability to efficiently compute operations related to graphics applications. However, the modern GPU is a highly parallel programmable processor, with substantially higher peak arithmetic and memory bandwidth than its central processing unit (CPU) counterpart. The GPU's architecture makes it a suitable processing resource for computations that are parallel or distributed in nature. One attribute of multi-agent systems (MASs) is that they are inherently decentralised. As such, a MAS that leverages advancements in GPU computing may provide a solution for crowd simulation. The research investigates techniques and methods for general-purpose crowd simulation, including topics in agent behavioural models, path-planning, collision avoidance and agent steering. The research also investigates how GPU computing has been utilised to address these computationally intensive problem domains. Based on the outcomes of the research, an agent-based model, Massively Parallel Crowds (MPCrowds), is proposed to address virtual crowd simulation, using the GPU as an additional resource for agent computation.
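A single steering tick of the kind described (goal attraction plus neighbour repulsion for collision avoidance) can be sketched in a few lines. The function below is an illustrative toy, not MPCrowds itself, and the weights are arbitrary:

```python
import math

def steer(agents, goal, sep_radius=1.0, sep_weight=0.5, speed=0.1):
    # One simulation tick: every agent moves toward the goal while being
    # pushed away from neighbours closer than sep_radius.  Each agent's
    # update reads only the previous positions, so all updates are
    # independent -- the decentralisation that maps one agent per GPU thread.
    new_positions = []
    for i, (x, y) in enumerate(agents):
        # Attraction toward the shared goal (unit vector).
        dx, dy = goal[0] - x, goal[1] - y
        dist = math.hypot(dx, dy) or 1e-9
        vx, vy = dx / dist, dy / dist
        # Repulsion from nearby agents (collision avoidance).
        for j, (ox, oy) in enumerate(agents):
            if i == j:
                continue
            d = math.hypot(x - ox, y - oy)
            if 0 < d < sep_radius:
                vx += sep_weight * (x - ox) / d
                vy += sep_weight * (y - oy) / d
        norm = math.hypot(vx, vy) or 1e-9
        new_positions.append((x + speed * vx / norm, y + speed * vy / norm))
    return new_positions
```

The naive neighbour scan is O(N²) per tick; real crowd simulators replace it with spatial hashing, but the per-agent independence is the same.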
33

GPF : a framework for general packet classification on GPU co-processors / GPU Packet Filter : framework for general packet classification on Graphics Processing Unit co-processors

Nottingham, Alastair January 2012 (has links)
This thesis explores the design and experimental implementation of GPF, a novel protocol-independent, multi-match packet classification framework. This framework is targeted and optimised for flexible, efficient execution on NVIDIA GPU platforms through the CUDA API, but should not be difficult to port to other platforms, such as OpenCL, in the future. GPF was conceived and developed in order to accelerate classification of large packet capture files, such as those collected by Network Telescopes. It uses a multiphase SIMD classification process which exploits both the parallelism of packet sets and the redundancy in filter programs, in order to classify packet captures against multiple filters at extremely high rates. The resultant framework - comprised of classification, compilation and buffering components - efficiently leverages GPU resources to classify arbitrary protocols, and return multiple filter results for each packet. The classification functions described were verified and evaluated by testing an experimental prototype implementation against several filter programs, of varying complexity, on devices from three GPU platform generations. In addition to the significant speedup achieved in processing results, analysis indicates that the prototype classification functions perform predictably, and scale linearly with respect to both packet count and filter complexity. Furthermore, classification throughput (packets/s) remained essentially constant regardless of the underlying packet data, and thus the effective data rate when classifying a particular filter was heavily influenced by the average size of packets in the processed capture. For example: in the trivial case of classifying all IPv4 packets ranging in size from 70 bytes to 1KB, the observed data rate achieved by the GPU classification kernels ranged from 60Gbps to 900Gbps on a GTX 275, and from 220Gbps to 3.3Tbps on a GTX 480. 
In the less trivial case of identifying all ARP, TCP, UDP and ICMP packets for both IPv4 and IPv6 protocols, the effective data rates ranged from 15Gbps to 220Gbps (GTX 275), and from 50Gbps to 740Gbps (GTX 480), for 70B and 1KB packets respectively.
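The observation that classification throughput in packets/s stays roughly constant explains the wide data-rate ranges quoted: the effective bit rate is just the packet rate times the average packet size. A quick check, assuming a sustained rate of about 107 million packets/s (a figure consistent with the GTX 275 numbers above, not one reported directly):

```python
def data_rate_gbps(packets_per_second, packet_bytes):
    # Effective data rate implied by a fixed classification throughput
    # (packets/s): the bit rate grows linearly with average packet size.
    return packets_per_second * packet_bytes * 8 / 1e9

# At ~107 Mpps, the implied data rate spans the quoted trivial-case range:
small = data_rate_gbps(107e6, 70)    # 70-byte packets -> ~60 Gbps
large = data_rate_gbps(107e6, 1024)  # 1 KB packets    -> ~880 Gbps
```

This is why the same kernel reports a 15x spread in Gbps: the packet rate is the invariant, and the capture's average packet size sets the data rate.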
34

Classification of the difficulty in accelerating problems using GPUs

Tristram, Uvedale Roy January 2014 (has links)
Scientists continually require additional processing power, as this enables them to compute larger problem sizes, use more complex models and algorithms, and solve problems previously thought computationally impractical. General-purpose computation on graphics processing units (GPGPU) can help in this regard, as there is great potential in using graphics processors to accelerate many scientific models and algorithms. However, some problems are considerably harder to accelerate than others, and it may be challenging for those new to GPGPU to ascertain the difficulty of accelerating a particular problem or seek appropriate optimisation guidance. Through what was learned in the acceleration of a hydrological uncertainty ensemble model, large numbers of k-difference string comparisons, and a radix sort, problem attributes have been identified that can assist in the evaluation of the difficulty in accelerating a problem using GPUs. The identified attributes are inherent parallelism, branch divergence, problem size, required computational parallelism, memory access pattern regularity, data transfer overhead, and thread cooperation. Using these attributes as difficulty indicators, an initial problem difficulty classification framework has been created that aids in GPU acceleration difficulty evaluation. This framework further facilitates directed guidance on suggested optimisations and required knowledge based on problem classification, which has been demonstrated for the aforementioned accelerated problems. It is anticipated that this framework, or a derivative thereof, will prove to be a useful resource for new or novice GPGPU developers in the evaluation of potential problems for GPU acceleration.
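A scoring scheme over the seven identified attributes can be sketched as follows; the ratings scale, the equal weighting, and the bucket boundaries are illustrative stand-ins, not the thesis's actual framework:

```python
# The seven attributes named above, each rated 0 (favourable for GPU
# acceleration) to 2 (problematic) for a candidate problem.
ATTRIBUTES = [
    "inherent_parallelism", "branch_divergence", "problem_size",
    "required_computational_parallelism", "memory_access_regularity",
    "data_transfer_overhead", "thread_cooperation",
]

def difficulty(ratings):
    # Sum the per-attribute ratings (missing attributes default to 0)
    # and bucket the total into a coarse difficulty class.
    if set(ratings) - set(ATTRIBUTES):
        raise ValueError("unknown attribute")
    score = sum(ratings.get(a, 0) for a in ATTRIBUTES)
    if score <= 4:
        return "easy"
    if score <= 9:
        return "moderate"
    return "hard"
```

The value of such a classification is less the label itself than the directed guidance it enables: each high-rated attribute points to a specific family of optimisations (e.g. high divergence suggests branch restructuring, high transfer overhead suggests overlapping transfers with computation).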
35

Monte Carlo simulations on a graphics processor unit with applications in inertial navigation

Roets, Sarel Frederik 12 March 2012 (has links)
M.Ing. / The Graphics Processor Unit (GPU) has been in the gaming industry for several years now. Of late, though, programmers and scientists have started to use the parallel processing, or stream processing, capabilities of the GPU in general numerical applications. The Monte Carlo method is a processing-intensive method, as it evaluates systems with stochastic components. The stochastic components require several iterations of the system to develop an idea of how it reacts to the stochastic inputs. The stream processing capabilities of GPUs are used for the analysis of such systems. Evaluating low-cost Inertial Measurement Units (IMUs) for use in Inertial Navigation Systems (INSs) is a processing-intensive task. The non-deterministic, or stochastic, error components of an IMU's output signal require multiple simulation runs to properly evaluate the IMU's performance when applied as input to an INS. The GPU makes use of stream processing, which allows simultaneous execution of the same algorithm on multiple data sets. Accordingly, Monte Carlo techniques are applied to create trajectories for multiple possible outputs of the INS based on stochastically varying inputs from the IMU. The processing power of the GPU allows simultaneous Monte Carlo analysis of several IMUs. Each IMU requires a sensor error model, which entails calibrating each IMU to obtain numerical values for the main error sources of low-cost IMUs, namely scale factor, non-orthogonality, bias, random walk and white noise. Three low-cost MEMS IMUs were calibrated to obtain numerical values for their sensor error models. Simultaneous Monte Carlo analysis of each of the IMUs is then done on the GPU, with a resulting circular error probability (CEP) plot. The circular error probability indicates the accuracy and precision of each IMU relative to a reference trajectory and to the other IMUs' trajectories.
Results obtained indicate that the GPU is an alternative to the CPU as a processing platform for large amounts of data. Monte Carlo simulations on the GPU were performed 200% faster than Monte Carlo simulations on the CPU. Results obtained from the Monte Carlo simulations indicated the random walk error to be the main source of error in low-cost IMUs. The CEP results were used to determine the effect of the various error sources on the INS output.
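The per-run independence that makes this workload stream-friendly is easy to see in a scalar sketch. The error model below (fixed bias plus white noise only) is a simplification of the five-term model described above, and all numbers are illustrative:

```python
import math
import random

def monte_carlo_cep(n_runs=2000, n_steps=100, dt=0.1,
                    bias=0.02, noise_sd=0.05, seed=1):
    # Dead-reckon a 2-D position by double-integrating accelerometer
    # readings corrupted by a fixed bias and white noise.  The true
    # acceleration is zero, so any final displacement is pure sensor
    # error.  Every run is independent of every other run -- exactly the
    # structure that lets the GPU execute one run per thread.
    rng = random.Random(seed)
    radii = []
    for _ in range(n_runs):
        vx = vy = x = y = 0.0
        for _ in range(n_steps):
            ax = bias + rng.gauss(0.0, noise_sd)
            ay = bias + rng.gauss(0.0, noise_sd)
            vx += ax * dt
            vy += ay * dt
            x += vx * dt
            y += vy * dt
        radii.append(math.hypot(x, y))
    radii.sort()
    return radii[len(radii) // 2]   # CEP: radius containing 50% of runs
```

With these parameters the bias alone displaces each axis by ½·0.02·10² = 1 m over the 10 s run, so the median miss radius clusters near √2 m, with the white-noise random walk spreading individual runs around it.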
36

Advancements in Computational Small Molecule Binding Affinity Prediction Methods

Devlaminck, Pierre January 2023 (has links)
Computational methods for predicting the binding affinity of small organic molecules to biological macromolecules cover a vast range of theoretical and physical complexity. Generally, as the required accuracy increases, so does the computational cost, forcing the user to choose a method that suits their needs within the parameters of the project. We present how WScore, a rigid-receptor docking program normally consigned to structure-based hit discovery in drug design projects, is systematically improved to perform accurately enough for lead optimization with a set of ROCK1 complexes and congeneric ligands from a structure-activity relationship study. Initial WScore results from the Schrödinger 2019-3 release show poor correlation (R² ∼0.0), large errors in predicted binding affinity (RMSE = 2.30 kcal/mol), and poor native pose prediction (two RMSD > 4 Å) for the six ROCK1 crystal structures and associated active congeneric ligands. Improvements to WScore's treatment of desolvation, myriad code fixes, and a simple ensemble consensus scoring protocol improved the correlation (R² = 0.613), the predicted affinity accuracy (RMSE = 1.34 kcal/mol), and native pose prediction (one RMSD > 1.5 Å). We then evaluate a physically and thermodynamically rigorous free energy perturbation (FEP) method, FEP+, against cryo-EM structures of the Machilis hrabei olfactory receptor, MhOR5, and associated dose-response assays of a panel of small molecules with the wild type and mutants. Augmented with an induced-fit docking method, IFD-MD, FEP+ performs well for ligand-mutating relative binding FEP (RBFEP) calculations, which correlate with experimental log(EC50) with R² = 0.551. Ligand absolute binding FEP (ABFEP) on a set of disparate ligands from the MhOR5 panel has poor correlation (R² = 0.106) for ligands with log(EC50) within the assay range, but qualitative predictions correctly identify the ligands with the lowest potency.
Protein mutation calculations have no log(EC50) correlation and consistently fail to predict the loss of potency for a majority of MhOR5 single point mutations. Prediction of ligand efficacy (the magnitude of receptor response) is also an unsolved problem, as the canonical active and inactive conformations of the receptor are absent from the FEP simulations. We believe that structural insights into the mutants for both the bound and unbound (apo) states are required to better understand the shortcomings of the current FEP+ methods for protein mutation RBFEP. Finally, improvements to GPU-accelerated linear algebra functions in an Auxiliary-Field Quantum Monte Carlo (AFQMC) program effect an average 50-fold reduction in GPU kernel compute time, using optimized GPU library routines instead of custom-made GPU kernels. MPI parallelization of the population control algorithm that destroys low-weight walkers also removes a bottleneck in large, multi-node AFQMC calculations.
37

Programming High-Performance Clusters with Heterogeneous Computing Devices

Aji, Ashwin M. 19 May 2015 (has links)
Today's high-performance computing (HPC) clusters are seeing an increase in the adoption of accelerators like GPUs, FPGAs and co-processors, leading to heterogeneity in the computation and memory subsystems. To program such systems, application developers typically employ a hybrid programming model of MPI across the compute nodes in the cluster and an accelerator-specific library (e.g., CUDA, OpenCL, OpenMP, OpenACC) across the accelerator devices within each compute node. Such explicit management of disjoint computation and memory resources leads to reduced productivity and performance. This dissertation focuses on designing, implementing and evaluating a runtime system for HPC clusters with heterogeneous computing devices. This work also explores extending existing programming models to make use of our runtime system for easier code modernization of existing applications. Specifically, we present MPI-ACC, an extension to the popular MPI programming model and runtime system for efficient data movement and automatic task mapping across the CPUs and accelerators within a cluster, and discuss the lessons learned. MPI-ACC's task-mapping runtime subsystem performs fast and automatic device selection for a given task. MPI-ACC's data-movement subsystem includes careful optimizations for end-to-end communication among CPUs and accelerators, which are seamlessly leveraged by the application developers. MPI-ACC provides a familiar, flexible and natural interface for programmers to choose the right computation or communication targets, while its runtime system achieves efficient cluster utilization. / Ph. D.
38

Accelerating Quantum Monte Carlo via Graphics Processing Units

Himberg, Benjamin Evert 01 January 2017 (has links)
An exact quantum Monte Carlo algorithm for interacting particles in the spatial continuum is extended to exploit the massive parallelism offered by graphics processing units. Its efficacy is tested on the Calogero-Sutherland model describing a system of bosons interacting in one spatial dimension via an inverse square law. Due to the long range nature of the interactions, this model has proved difficult to simulate via conventional path integral Monte Carlo methods running on conventional processors. Using Graphics Processing Units, optimal speedup factors of up to 640 times are obtained for N = 126 particles. The known results for the ground state energy are confirmed and, for the first time, the effects of thermal fluctuations at finite temperature are explored.
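The hot loop that benefits from GPU parallelism in such a simulation is the O(N²) pair sum over the inverse-square potential. A scalar sketch of that sum (illustrative only, not the thesis's path-integral code; the coupling g is a placeholder):

```python
def pair_energy(positions, g=1.0):
    # Total interaction energy of particles on a line with the
    # Calogero-Sutherland inverse-square pair potential V(r) = g / r^2.
    # On a GPU, each pair (i, j) is evaluated by its own thread and the
    # partial results are combined with a parallel reduction -- the long
    # range of the interaction means no pair can be neglected, which is
    # why conventional CPU path-integral codes struggle with this model.
    n = len(positions)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = positions[i] - positions[j]
            total += g / (r * r)
    return total
```

For N = 126 particles this sum has 7,875 terms per evaluation and must be recomputed at every Monte Carlo update, which is where the reported speedups of up to 640x come from.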
39

General Purpose Computing in GPU - A Watermarking Case Study

Hanson, Anthony 08 1900 (has links)
The purpose of this project is to explore the GPU for general-purpose computing. The GPU is a massively parallel computing device that has high throughput, exhibits high arithmetic intensity, has a large market presence, and, with the computing power added to it each year through innovations, is a perfect candidate to complement the CPU in performing computations. The GPU follows the single instruction, multiple data (SIMD) model for applying operations on its data. This model allows the GPU to be very useful for assisting the CPU in performing computations on data that is highly parallel in nature. The compute unified device architecture (CUDA) is a parallel computing and programming platform for NVIDIA GPUs. The main focus of this project is to show the power, speed, and performance of a CUDA-enabled GPU for digital video watermark insertion in the H.264 video compression domain. Digital video watermarking in general is a highly computationally intensive process that is strongly dependent on the video compression format in place. The H.264/MPEG-4 AVC video compression format has high compression efficiency at the expense of high computational complexity, leaving little room for an imperceptible watermark to be inserted. Employing a human visual model to limit the distortion and degradation of visual quality introduced by the watermark is a good choice when designing a video watermarking algorithm, though it does add computational complexity. Research is being conducted into how CPU-GPU execution of the digital watermarking application, optimized with the NVIDIA Visual Profiler, can boost its speed several times compared to running the application on a standalone CPU.
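The SIMD pattern being exploited here is element-wise, independent work per transform coefficient. A toy additive embedding with a crude visibility gate (a stand-in for the human visual model; all parameters hypothetical) looks like:

```python
def embed_watermark(coeffs, bits, alpha=2, threshold=10):
    # Element-wise additive embedding: each transform coefficient is
    # nudged up or down by alpha according to one watermark bit, skipping
    # coefficients whose magnitude is below a visibility threshold (a
    # crude stand-in for human-visual-model gating).  Every coefficient
    # is processed independently of the others -- the SIMD structure
    # that lets the GPU apply the same operation across all of them.
    out = []
    for c, b in zip(coeffs, bits):
        if abs(c) >= threshold:
            out.append(c + alpha if b else c - alpha)
        else:
            out.append(c)        # too weak to hide a bit imperceptibly
    return out
```

In the real H.264 setting, the gating decision depends on block type and quantisation state, which introduces the branch divergence and format dependence the abstract alludes to; the independence across coefficients is what survives and maps to the GPU.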
40

Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

January 2018 (has links)
General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of their interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity.
ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPGPUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2018
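The ReMAP idea of choosing eviction victims with memory access cost in mind can be illustrated with a toy policy; this is a sketch of the stated principle (prefer dead lines, then cheap-to-refetch lines), not the dissertation's actual algorithm:

```python
def pick_victim(lines):
    # Choose an eviction victim from a cache set.  Each line is a dict
    # with a "tag", a predicted "reuse" flag, and a "mem_cost" estimate
    # for refetching it from main memory.  Lines with no expected reuse
    # are free to evict; among the remaining candidates, evict the line
    # that is cheapest to bring back -- minimising net memory access cost
    # rather than using reuse information in isolation (e.g. plain LRU).
    dead = [l for l in lines if not l["reuse"]]
    pool = dead if dead else lines
    return min(pool, key=lambda l: l["mem_cost"])["tag"]
```

The contrast with an isolated policy is the point: a reuse-only policy would treat the two dead lines below as interchangeable, while a cost-aware one prefers the victim whose refetch is cheap.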
