31

Power-constrained performance optimization of GPU graph traversal

McLaughlin, Adam Thomas 13 January 2014 (has links)
Graph traversal represents an important class of graph algorithms that is the nucleus of many large-scale graph analytics applications. While improving the performance of such algorithms using GPUs has received attention, understanding and managing performance under power constraints has not yet received similar attention. This thesis first explores the power and performance characteristics of breadth-first search (BFS) via measurements on a commodity GPU. We use this analysis to address the problem of minimizing execution time below a predefined power limit, or power cap, exposing key relationships between graph properties and power consumption. We modify the firmware on a commodity GPU to measure power usage and use the GPU as an experimental system to evaluate future architectural enhancements for the optimization of graph algorithms. Specifically, we propose and evaluate power management algorithms that scale i) the GPU frequency or ii) the number of active GPU compute units for a diverse set of real-world and synthetic graphs. Compared to scaling either frequency or compute units individually, our proposed schemes reduce execution time by an average of 18.64% by adjusting the configuration based on inter- and intra-graph characteristics.
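As a rough illustration of the kind of power-capped configuration search the abstract describes, the sketch below enumerates hypothetical (frequency, compute-unit) settings and picks the fastest one that fits under the cap; the model functions and all constants are invented for the sketch, not taken from the thesis:

    # Hypothetical power-capped configuration search (all constants invented).
    def exec_time(freq_mhz, cus, work_units=1e6):
        # Simple model: time falls with frequency and with active compute units.
        return work_units / (freq_mhz * cus)

    def power(freq_mhz, cus, static_w=30.0):
        # Simple model: dynamic power grows with frequency and active units.
        return static_w + 0.05 * freq_mhz * cus / 10.0

    def best_config(power_cap_w, freqs=(500, 700, 900, 1100), cu_counts=(4, 8, 12, 16)):
        feasible = [(exec_time(f, c), f, c)
                    for f in freqs for c in cu_counts
                    if power(f, c) <= power_cap_w]
        return min(feasible, default=None)  # fastest config under the cap

    print(best_config(power_cap_w=120.0))

Real graphs would change both model functions per input, which is why the thesis adapts the configuration to inter- and intra-graph characteristics rather than fixing one setting.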
32

Many-core architecture for programmable hardware accelerator

Lee, Junghee 13 January 2014 (has links)
As the further development of single-core architectures faces seemingly insurmountable physical and technological limitations, computer designers have turned their attention to alternative approaches. One such promising alternative is the use of several smaller cores working in unison as a programmable hardware accelerator. It is clear that the vast – and, as yet, largely untapped – potential of hardware accelerators is coming to the forefront of computer architecture. Many challenges must be addressed for the programmable hardware accelerator to be realized in practice. In this thesis, load balancing, on-chip communication, and an execution model are studied. Imbalanced distribution of workloads across the processing elements constitutes wasteful use of resources and degrades the performance of the system. A hardware-based load-balancing technique is proposed, which is demonstrated to be more scalable than state-of-the-art load-balancing techniques. To facilitate efficient communication among an ever-increasing number of cores, a scalable communication network is imperative. Packet-switched networks-on-chip (NoCs) are considered a viable candidate for a scalable communication fabric. The flit size, the unit of flow control in a NoC, is one of the key design parameters that determine the latency, throughput, and cost of NoC routers. This thesis studies how to determine an optimal flit size and proposes a novel router architecture that overcomes a problem related to flit size. The thesis also includes a new execution model and its supporting architecture. An event-driven model that extends a hardware description language is employed as the execution model, and the dynamic scheduling and module-level prefetching that support it are evaluated.
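The flit-size trade-off mentioned above can be made concrete with a standard wormhole-switching latency model; the sketch below uses illustrative constants (not the thesis's router) to show how smaller flits cheapen links but lengthen serialization:

    import math

    # Smaller flits -> narrower, cheaper links but more serialization cycles.
    def wormhole_latency_cycles(packet_bytes, flit_bytes, hops, router_delay=3):
        flits = math.ceil(packet_bytes / flit_bytes)
        # The head flit pays the per-hop router delay; body flits follow
        # pipelined behind it, one per cycle.
        return hops * router_delay + (flits - 1)

    def link_cost(flit_bytes):
        # Wire/buffer area grows roughly with flit (phit) width; crude proxy.
        return flit_bytes * 8

    for flit in (4, 8, 16, 32):
        print(flit, wormhole_latency_cycles(64, flit, hops=5), link_cost(flit))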
33

Shared resource management for efficient heterogeneous computing

Lee, Jaekyu 13 January 2014 (has links)
The demand for heterogeneous computing, because of its performance and energy efficiency, has made on-chip heterogeneous chip multiprocessors (HCMPs) the mainstream computing platform, as recent trends show across a wide spectrum of platforms, from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have integrated more powerful GPUs and this trend will continue in upcoming processors. In this architecture, several system resources are shared between CPUs and GPUs. The sharing of system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention problems between cores. The resource sharing problem has existed since the homogeneous (CPU-only) chip multiprocessor (CMP) was introduced. However, resource sharing in HCMPs shows different aspects because of the different nature of CPU and GPU cores. In order to solve the resource sharing problem in HCMPs, we consider efficient shared resource management schemes, in particular tackling the problem in the shared last-level cache and the interconnection network. In this thesis, we propose four resource sharing mechanisms. First, we propose an efficient cache sharing mechanism that exploits the different characteristics of CPU and GPU cores to effectively share cache space between them. Second, adaptive virtual channel partitioning for the on-chip interconnection network is proposed to isolate inter-application interference. By partitioning virtual channels between CPUs and GPUs, we can prevent the interference problem while guaranteeing quality-of-service (QoS) for both cores. Third, we propose a dynamic frequency control mechanism to efficiently share system resources. When both cores are active, the degree of resource contention as well as the system throughput will be affected by the operating frequencies of CPUs and GPUs. The proposed mechanism tries to find optimal operating frequencies for both cores, reducing resource contention while improving system throughput. Finally, we propose a second cache sharing mechanism that exploits GPU-semantic information. The programming and execution models of GPUs are stricter and simpler than those of CPUs, and programmers are asked to provide more information to the hardware. By exploiting these characteristics, GPUs can use the cache more energy-efficiently, and simpler but more effective cache partitioning can be enabled for HCMPs.
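One common way to divide cache space of the kind this abstract discusses is utility-based way partitioning; the greedy sketch below assigns last-level-cache ways using invented marginal-hit curves and is an illustration of the general idea, not the thesis's specific mechanism:

    # Greedy utility-based LLC way partitioning (illustrative numbers only).
    cpu_gain = [100, 80, 60, 30, 10, 5, 2, 1]   # marginal hits per added CPU way
    gpu_gain = [40, 35, 30, 25, 20, 15, 10, 5]  # GPUs often tolerate misses better

    def partition(total_ways=8):
        cpu_ways = gpu_ways = 0
        for _ in range(total_ways):
            # Give the next way to whichever side gains more hits from it.
            c = cpu_gain[cpu_ways] if cpu_ways < len(cpu_gain) else 0
            g = gpu_gain[gpu_ways] if gpu_ways < len(gpu_gain) else 0
            if c >= g:
                cpu_ways += 1
            else:
                gpu_ways += 1
        return cpu_ways, gpu_ways

    print(partition())  # (4, 4) with the numbers above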
34

Energy conservation techniques for GPU computing

Mei, Xinxin 29 August 2016 (has links)
Emerging general-purpose GPU (GPGPU) computing has tremendously accelerated a great variety of commercial and scientific applications, and GPUs have become prevalent accelerators in current high-performance clusters. Though the computational capacity per watt of GPUs is much higher than that of CPUs, hybrid GPU clusters still consume enormous power, so conserving energy on such clusters is of critical importance. In this thesis, we pursue energy-conserving computing on GPU-accelerated servers. Our studies proceed as follows. First, we dissect the GPU memory hierarchy, since most GPU applications suffer from memory bottlenecks. We find that conventional CPU cache models cannot be applied to modern GPU caches, and that microbenchmarks designed for CPU caches become invalid on the GPU. We propose GPU-specific microbenchmarks to examine the GPU memory structures and properties. Our benchmark results verify that the design goal of the GPU has shifted from pure computational performance to better energy efficiency. Second, we investigate the impact of dynamic voltage and frequency scaling (DVFS), a successful energy management technique for CPUs, on GPU platforms. Our experimental results suggest that GPU DVFS is still promising for conserving energy, but its energy-saving patterns differ strongly from those of the CPU, and its effect depends on individual application characteristics. Third, we derive GPU DVFS power and performance models from our experimental results, based on which we find the optimal GPU voltage and frequency setting to minimize the energy consumption of a single GPU task. We then study the problem of scheduling multiple tasks on a hybrid CPU-GPU cluster to minimize total energy consumption via GPU DVFS, and design an effective offline scheduling algorithm that reduces energy consumption significantly. Finally, we combine GPU DVFS with dynamic resource sleep (DRS), another energy management technique, to further conserve energy in online task scheduling on hybrid clusters. Though idle energy consumption increases significantly compared to the offline problem, our online scheduling algorithm still achieves more than 30% energy savings with appropriate runtime GPU DVFS readjustment.
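The DVFS energy trade-off described above can be sketched with a simple analytical model: energy is power times runtime, and the two pull in opposite directions as frequency rises. The functional forms and constants below are assumptions for illustration, not the thesis's fitted models:

    # Toy DVFS energy model: E(f, V) = P(f, V) * T(f); constants invented.
    def runtime_s(f_ghz, t_mem=2.0, comp_cycles=4e9):
        # Memory-bound portion is frequency-insensitive; compute scales with f.
        return t_mem + comp_cycles / (f_ghz * 1e9)

    def power_w(f_ghz, v_volts, p_static=25.0, c_eff=20.0):
        # Dynamic power ~ C * V^2 * f, plus static power.
        return p_static + c_eff * v_volts**2 * f_ghz

    def energy_j(f_ghz, v_volts):
        return power_w(f_ghz, v_volts) * runtime_s(f_ghz)

    # Assume voltage must rise roughly with frequency on this hypothetical part.
    settings = [(0.6, 0.85), (0.8, 0.95), (1.0, 1.05), (1.2, 1.2)]
    best = min(settings, key=lambda s: energy_j(*s))
    print(best, energy_j(*best))  # interior optimum: neither slowest nor fastest

With these numbers the minimum-energy setting is the mid-range (0.8 GHz, 0.95 V) pair, illustrating why the optimal setting depends on how memory-bound the application is.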
35

Random Forests for CUDA GPUs

Lapajne, Mikael Hellborg, Slat, Daniel January 2010 (has links)
Context. Machine learning is a complex and resource-consuming process that requires a lot of computing power. With the constant growth of information, the need for efficient, high-performance algorithms is increasing. Today's commodity graphics cards are parallel multiprocessors with high computing capacity at an attractive price, and are usually pre-installed in new PCs. They provide an additional resource to be used in machine learning applications. The Random Forest learning algorithm, which has been shown to be competitive within machine learning, has good potential for performance gains through parallelization. Objectives. In this study we implement and review a revised Random Forest algorithm for GPU execution using CUDA. Methods. A review of previous work in the area has been done by studying articles from several sources, including Compendex, Inspec, IEEE Xplore, ACM Digital Library and Springer Link. Additional information regarding GPU architecture and implementation-specific details has been obtained mainly from documentation available from Nvidia and the Nvidia developer forums. The implemented algorithm has been benchmarked and compared with two state-of-the-art CPU implementations of the Random Forest algorithm, regarding both time consumed for training and classification, and classification accuracy. Results. Benchmark measurements of the three algorithms are gathered, showing their performance on two publicly available data sets. Conclusion. We conclude that, under the right conditions, our implementation is able to outperform its competitors, though this holds only for certain data sets, depending on their size. Moreover, we conclude that there is potential for further improvement of the algorithm, both in performance and in adaptation to a wider range of real-world applications.
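The parallelism that makes Random Forests attractive for GPUs is visible even in a toy model: every (sample, tree) pair is independent work, followed by a per-sample vote reduction. The numpy sketch below uses invented depth-1 stub trees purely to show that structure, not the thesis's CUDA implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))            # 1000 samples, 4 features

    # Hypothetical forest: each "tree" is (feature index, threshold,
    # class if below, class if above) -- a depth-1 stub for brevity.
    forest = [(rng.integers(4), 0.0, 0, 1) for _ in range(50)]

    def forest_predict(X, forest):
        votes = np.zeros((X.shape[0], 2), dtype=int)
        for feat, thr, left, right in forest:     # each tree: independent work
            pred = np.where(X[:, feat] <= thr, left, right)
            votes[np.arange(X.shape[0]), pred] += 1
        return votes.argmax(axis=1)               # majority vote per sample

    print(forest_predict(X, forest)[:10])

On a GPU, the per-tree loop and the per-sample work both map naturally onto threads, which is the parallelization opportunity the thesis exploits.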
36

Agent-based crowd simulation using GPU computing

O’Reilly, Sean Patrick January 2014 (has links)
M.Sc. (Information Technology) / The purpose of the research is to investigate agent-based approaches to virtual crowd simulation. Crowds are ubiquitous and are becoming an increasingly common phenomenon in modern society, particularly in urban settings. As such, crowd simulation systems are becoming increasingly popular in training simulations, pedestrian modelling, emergency simulations, and multimedia. One of the primary challenges in crowd simulation is the ability to model realistic, large-scale crowd behaviours in real time. This is a challenging problem, as the size, visual fidelity, and behavioural complexity of the crowd all place demands on the available computational resources. In the last few years, the graphics processing unit (GPU) has presented itself as a viable computational resource for general-purpose computation. Traditionally, GPUs were used solely for their ability to efficiently compute operations related to graphics applications. However, the modern GPU is a highly parallel programmable processor, with substantially higher peak arithmetic and memory bandwidth than its central processing unit (CPU) counterpart. The GPU's architecture makes it a suitable processing resource for computations that are parallel or distributed in nature. One attribute of multi-agent systems (MASs) is that they are inherently decentralised; a MAS that leverages advancements in GPU computing may therefore provide a solution for crowd simulation. The research investigates techniques and methods for general-purpose crowd simulation, including topics in agent behavioural models, path planning, collision avoidance, and agent steering. It also investigates how GPU computing has been utilised to address these computationally intensive problem domains. Based on the outcomes of the research, an agent-based model, Massively Parallel Crowds (MPCrowds), is proposed to address virtual crowd simulation, using the GPU as an additional resource for agent computation.
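A minimal data-parallel steering step of the kind such a system would map to one GPU thread per agent might look like the numpy sketch below: goal seeking plus pairwise separation, with all parameters invented and no claim to MPCrowds's actual behaviour model:

    import numpy as np

    rng = np.random.default_rng(1)
    pos = rng.uniform(0, 100, size=(512, 2))    # agent positions
    goal = np.array([50.0, 50.0])               # shared goal point

    def step(pos, dt=0.1, speed=1.5, repulse=5.0):
        to_goal = goal - pos
        seek = to_goal / np.linalg.norm(to_goal, axis=1, keepdims=True)
        # Pairwise separation: push away from close neighbours. O(n^2) here;
        # real GPU implementations use spatial hashing to cut the cost.
        diff = pos[:, None, :] - pos[None, :, :]
        dist = np.linalg.norm(diff, axis=2) + 1e-9
        push = (diff / dist[..., None] * (dist < repulse)[..., None]).sum(axis=1)
        vel = speed * seek + 0.2 * push
        return pos + dt * vel

    pos = step(pos)
    print(pos[:3])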
37

GPF : a framework for general packet classification on GPU co-processors / GPU Packet Filter : framework for general packet classification on Graphics Processing Unit co-processors

Nottingham, Alastair January 2012 (has links)
This thesis explores the design and experimental implementation of GPF, a novel protocol-independent, multi-match packet classification framework. The framework is targeted and optimised for flexible, efficient execution on NVIDIA GPU platforms through the CUDA API, but should not be difficult to port to other platforms, such as OpenCL, in the future. GPF was conceived and developed to accelerate classification of large packet capture files, such as those collected by network telescopes. It uses a multi-phase SIMD classification process which exploits both the parallelism of packet sets and the redundancy in filter programs to classify packet captures against multiple filters at extremely high rates. The resultant framework, comprising classification, compilation and buffering components, efficiently leverages GPU resources to classify arbitrary protocols and return multiple filter results for each packet. The classification functions described were verified and evaluated by testing an experimental prototype implementation against several filter programs of varying complexity, on devices from three GPU platform generations. In addition to the significant speedup achieved, analysis indicates that the prototype classification functions perform predictably and scale linearly with respect to both packet count and filter complexity. Furthermore, classification throughput (packets/s) remained essentially constant regardless of the underlying packet data; the effective data rate for a particular filter was thus heavily influenced by the average size of packets in the processed capture. For example, in the trivial case of classifying all IPv4 packets ranging in size from 70 bytes to 1 KB, the observed data rate achieved by the GPU classification kernels ranged from 60 Gbps to 900 Gbps on a GTX 275, and from 220 Gbps to 3.3 Tbps on a GTX 480. In the less trivial case of identifying all ARP, TCP, UDP and ICMP packets for both IPv4 and IPv6, the effective data rates ranged from 15 Gbps to 220 Gbps (GTX 275) and from 50 Gbps to 740 Gbps (GTX 480), for 70 B and 1 KB packets respectively.
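The abstract's observation that packet throughput stays essentially constant can be checked from its own quoted figures, since data rate divided by packet size gives packets per second:

    # Consistency check on the quoted GTX 275 numbers for the trivial filter:
    # 60 Gbps at 70-byte packets vs 900 Gbps at 1 KB packets.
    def pkts_per_sec(gbps, pkt_bytes):
        return gbps * 1e9 / (pkt_bytes * 8)

    print(pkts_per_sec(60, 70))     # ~1.07e8 packets/s
    print(pkts_per_sec(900, 1024))  # ~1.10e8 packets/s -> essentially constant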
38

Classification of the difficulty in accelerating problems using GPUs

Tristram, Uvedale Roy January 2014 (has links)
Scientists continually require additional processing power, as this enables them to compute larger problem sizes, use more complex models and algorithms, and solve problems previously thought computationally impractical. General-purpose computation on graphics processing units (GPGPU) can help in this regard, as there is great potential in using graphics processors to accelerate many scientific models and algorithms. However, some problems are considerably harder to accelerate than others, and it may be challenging for those new to GPGPU to ascertain the difficulty of accelerating a particular problem or to seek appropriate optimisation guidance. Drawing on what was learned in accelerating a hydrological uncertainty ensemble model, large numbers of k-difference string comparisons, and a radix sort, problem attributes have been identified that can assist in evaluating how difficult a problem will be to accelerate using GPUs. The identified attributes are inherent parallelism, branch divergence, problem size, required computational parallelism, memory access pattern regularity, data transfer overhead, and thread cooperation. Using these attributes as difficulty indicators, an initial problem difficulty classification framework has been created that aids in GPU acceleration difficulty evaluation. This framework further facilitates directed guidance on suggested optimisations and required knowledge based on problem classification, which has been demonstrated for the aforementioned accelerated problems. It is anticipated that this framework, or a derivative thereof, will prove a useful resource for new or novice GPGPU developers in evaluating potential problems for GPU acceleration.
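A toy illustration of how such attribute-based classification might be scored follows; the weights, scales, and class boundaries are invented for the sketch and are not the thesis's actual framework:

    # Hypothetical rubric over the seven attributes named in the abstract.
    ATTRIBUTES = ["inherent_parallelism", "branch_divergence", "problem_size",
                  "required_computational_parallelism", "memory_access_regularity",
                  "data_transfer_overhead", "thread_cooperation"]

    def difficulty(scores):
        """scores: attribute -> 0 (GPU-friendly) .. 2 (GPU-hostile)."""
        total = sum(scores.get(a, 1) for a in ATTRIBUTES)  # default: neutral
        if total <= 4:
            return "easy"
        if total <= 9:
            return "moderate"
        return "hard"

    radix_sort = {"inherent_parallelism": 0, "branch_divergence": 0,
                  "memory_access_regularity": 1, "thread_cooperation": 2}
    print(difficulty(radix_sort))  # "moderate" with these made-up scores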
39

Monte Carlo simulations on a graphics processor unit with applications in inertial navigation

Roets, Sarel Frederik 12 March 2012 (has links)
M.Ing. / The graphics processing unit (GPU) has been used in the gaming industry for several years. Of late, though, programmers and scientists have started to use its parallel, or stream, processing capabilities in general numerical applications. The Monte Carlo method is processing-intensive, as it evaluates systems with stochastic components: these components require many iterations of a system to build up a picture of how it reacts to stochastic inputs. The stream processing capabilities of GPUs are used for the analysis of such systems. Evaluating low-cost inertial measurement units (IMUs) for use in inertial navigation systems (INSs) is one such processing-intensive task. The non-deterministic, or stochastic, error components of an IMU's output signal require multiple simulation runs to properly evaluate the IMU's performance as input to an INS. The GPU's stream processing allows simultaneous execution of the same algorithm on multiple data sets. Accordingly, Monte Carlo techniques are applied to create trajectories for multiple possible INS outputs based on stochastically varying IMU inputs. The processing power of the GPU allows simultaneous Monte Carlo analysis of several IMUs. Each IMU requires a sensor error model, which entails calibrating it to obtain numerical values for the main error sources of low-cost IMUs, namely scale factor, non-orthogonality, bias, random walk, and white noise. Three low-cost MEMS IMUs were calibrated to obtain numerical values for their sensor error models. Simultaneous Monte Carlo analysis of the IMUs was then performed on the GPU, producing a circular error probability (CEP) plot. The CEP indicates the accuracy and precision of each IMU relative to a reference trajectory and to the other IMUs' trajectories. The results indicate that the GPU is a viable alternative to the CPU as a processing platform for large amounts of data: Monte Carlo simulations ran 200% faster on the GPU than on the CPU. The Monte Carlo results identified random walk as the main source of error in low-cost IMUs, and the CEP results were used to determine the effect of the various error sources on the INS output.
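A minimal sketch of the Monte Carlo evaluation loop described above follows, with invented error-model constants: corrupt a true acceleration signal with the named error terms, integrate to position, and take the median radial error as the CEP. The GPU's contribution in the thesis is running the independent trials in parallel:

    import numpy as np

    rng = np.random.default_rng(2)
    dt, n = 0.01, 1000
    true_acc = np.zeros((n, 2))                  # stationary reference case

    def run_once():
        M = np.eye(2) + np.array([[0.001, 0.0002],   # scale factor on diagonal,
                                  [0.0002, 0.001]])  # non-orthogonality off it
        bias = rng.normal(0, 0.02, size=2)
        walk = np.cumsum(rng.normal(0, 0.001, size=(n, 2)), axis=0)
        noise = rng.normal(0, 0.05, size=(n, 2))     # white noise
        meas = true_acc @ M.T + bias + walk + noise
        vel = np.cumsum(meas, axis=0) * dt           # integrate acc -> vel
        pos = np.cumsum(vel, axis=0) * dt            # integrate vel -> pos
        return pos[-1]                               # final position error

    ends = np.array([run_once() for _ in range(500)])  # GPU runs these in parallel
    cep = np.median(np.linalg.norm(ends, axis=1))      # 50% circular error probability
    print(f"CEP: {cep:.3f} m")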
40

Advancements in Computational Small Molecule Binding Affinity Prediction Methods

Devlaminck, Pierre January 2023 (has links)
Computational methods for predicting the binding affinity of small organic molecules to biological macromolecules cover a vast range of theoretical and physical complexity. Generally, as the required accuracy increases, so does the computational cost, making the user choose a method that suits their needs within the parameters of the project. We present how WScore, a rigid-receptor docking program normally consigned to structure-based hit discovery in drug design projects, is systematically ameliorated to perform accurately enough for lead optimization with a set of ROCK1 complexes and congeneric ligands from a structure-activity relationship study. Initial WScore results from the Schrödinger 2019-3 release show poor correlation (R² ∼0.0), large errors in predicted binding affinity (RMSE = 2.30 kcal/mol), and poor native pose prediction (two RMSD > 4 Å) for the six ROCK1 crystal structures and associated active congeneric ligands. Improvements to WScore's treatment of desolvation, myriad code fixes, and a simple ensemble consensus scoring protocol improved the correlation (R² = 0.613), the predicted affinity accuracy (RMSE = 1.34 kcal/mol), and native pose prediction (one RMSD > 1.5 Å). We then evaluate a physically and thermodynamically rigorous free energy perturbation (FEP) method, FEP+, against cryo-EM structures of the Machilis hrabei olfactory receptor, MhOR5, and associated dose-response assays of a panel of small molecules with the wild type and mutants. Augmented with an induced-fit docking method, IFD-MD, FEP+ performs well for ligand-mutating relative binding FEP (RBFEP) calculations, which correlate with experimental log(EC50) with R² = 0.551. Ligand absolute binding FEP (ABFEP) on a set of disparate ligands from the MhOR5 panel shows poor correlation (R² = 0.106) for ligands with log(EC50) within the assay range, but qualitative predictions correctly identify the ligands with the lowest potency. Protein mutation calculations have no log(EC50) correlation and consistently fail to predict the loss of potency for a majority of MhOR5 single-point mutations. Prediction of ligand efficacy (the magnitude of receptor response) also remains an unsolved problem, as the canonical active and inactive conformations of the receptor are absent from the FEP simulations. We believe that structural insights into the mutants for both bound and unbound (apo) states are required to better understand the shortcomings of the current FEP+ methods for protein mutation RBFEP. Finally, improvements to GPU-accelerated linear algebra functions in an Auxiliary-Field Quantum Monte Carlo (AFQMC) program effect an average 50-fold reduction in GPU kernel compute time by using optimized GPU library routines instead of custom-made GPU kernels. MPI parallelization of the population control algorithm that destroys low-weight walkers also removes a bottleneck in large, multi-node AFQMC calculations.
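The headline metrics quoted above (R² and RMSE between predicted and experimental affinities) are straightforward to compute; the sketch below uses invented data and the squared-Pearson-correlation convention for R², one of the common conventions in affinity prediction:

    import numpy as np

    pred = np.array([-8.1, -9.3, -7.4, -10.0, -8.8])   # predicted dG, kcal/mol (invented)
    expt = np.array([-7.9, -9.0, -8.0, -10.4, -8.5])   # experimental dG (invented)

    rmse = np.sqrt(np.mean((pred - expt) ** 2))         # average error magnitude
    r = np.corrcoef(pred, expt)[0, 1]                   # Pearson correlation
    print(f"RMSE = {rmse:.2f} kcal/mol, R^2 = {r**2:.3f}")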
