281

Application of machine learning potential to predict grain boundary properties and development of its performant implementation

Nishiyama, Takayuki 23 March 2022 (has links)
Kyoto University / New-system doctoral program / Doctor of Engineering / Kō No. 23899 / Kōhaku No. 4986 / New system||Eng||1778 (University Library) / Kyoto University Graduate School of Engineering, Department of Materials Science and Engineering / (Chief examiner) Prof. Isao Tanaka, Prof. Hiroyuki Nakamura, Prof. Hiroshi Okuda / Qualifies under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Philosophy (Engineering) / Kyoto University / DFAM
282

Fast 3D Deformable Image Registration on a GPU Computing Platform

Mousazadeh, Mohammad Hamed 10 1900 (has links)
<p>Image registration has become an indispensable tool in medical diagnosis and intervention. The increasing need for speed and accuracy in clinical applications has motivated researchers to focus on developing fast and reliable registration algorithms. In particular, advanced deformable registration routines are emerging for medical applications involving soft-tissue organs such as the brain, breast, kidney, liver, and prostate. The computational complexity of such algorithms is significantly higher than that of conventional rigid and affine methods, leading to substantial increases in execution time. In this thesis, we present a parallel implementation of a newly developed deformable image registration algorithm by Marami et al. [1] using the Compute Unified Device Architecture (CUDA). The focus of this study is on accelerating the computations on a Graphics Processing Unit (GPU) to reduce the execution time to nearly real-time for diagnostic and interventional applications. The algorithm co-registers preoperative and intraoperative 3-dimensional magnetic resonance (MR) images of a deforming organ. It employs a linear elastic dynamic finite-element model of the deformation and distance measures such as mutual information and the sum of squared differences (SSD) to align volumetric image data sets. In this study, we report a parallel implementation of the algorithm for 3D-3D MR registration based on SSD on a CUDA-capable NVIDIA GTX 480 GPU. Computationally expensive tasks such as interpolation, displacement, and force calculation are significantly accelerated using the GPU. The results of experiments carried out with a realistic breast phantom tissue show a 37-fold speedup for the GPU-based implementation compared with an optimized CPU-based implementation in high-resolution MR image registration. The CPU is a 3.20 GHz Intel Core i5 650 processor with 4 GB RAM that also hosts the GTX 480 GPU. This GPU has 15 streaming multiprocessors, each with 32 streaming processors, i.e., a total of 480 cores. The GPU implementation registers 3D-3D high-resolution (512×512×136) image sets in just over 2 seconds, compared to 1.38 and 23.25 minutes for the CPU- and MATLAB-based implementations, respectively. Most of the GPU kernels employed in the 3D-3D registration algorithm can also be employed to accelerate the 2D-3D registration algorithm in [1].</p> / Master of Applied Science (MASc)
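The SSD similarity measure at the heart of the accelerated registration can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the thesis code; the toy volume shapes are arbitrary.

```python
import numpy as np

def ssd(fixed: np.ndarray, moving: np.ndarray) -> float:
    """Sum of squared differences between two intensity volumes.

    Both arrays must have the same shape; lower values indicate
    a better alignment of the moving image to the fixed image.
    """
    diff = fixed.astype(np.float64) - moving.astype(np.float64)
    return float(np.sum(diff * diff))

# Toy 3D volumes: a perfectly aligned pair scores 0.
fixed = np.zeros((4, 4, 4))
fixed[1:3, 1:3, 1:3] = 1.0
# A one-voxel shift of the bright cube raises the score.
moving = np.roll(fixed, 1, axis=0)
```

In the registration loop this score (or mutual information) would be re-evaluated after each candidate deformation, which is exactly the voxel-wise, embarrassingly parallel work a GPU accelerates well.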
283

An evaluation of GPU virtualization

Vilestad, Josef January 2024 (has links)
There has been extensive research and progress on CPU virtualization for some time. More recently, the focus on GPU virtualization has increased as processing power doubles roughly every 2.5 years. Coupled with advances in memory management and the PCIe standard, the first hardware-assisted virtualization solutions became available in the 2010s. Very recently, a new virtualization mode called Multi-Instance GPU (MIG) has made it possible to isolate partitions with memory in hardware rather than just software. This thesis focuses on virtual GPU performance and capabilities for AI training in a multi-tenant setting. It explores the technologies currently used for GPU virtualization, including Single Root I/O Virtualization (SR-IOV) and mediated devices. It also covers a proposed new standard for I/O virtualization called SIOV that addresses some of the limitations of the SR-IOV standard. The main limitations of time-sliced virtualization are the lack of per-partition customization compared to CPU virtualization and the problem of overhead. MIG virtualization is more customizable in how compute power and memory can be allocated; its biggest limitation is that fast intercommunication is not currently possible between partitions, making MIG better suited for applications that can run on just one partition. It is also not suited for graphical applications, as it currently does not support any graphical APIs. The experimental results showed that in compute workloads the overhead of time-sliced virtualization is around 5%, while the maximum intercommunication bandwidth is lowered by 11% and latency is increased by 25%. Time-slice windows of 4 ms, compared to 2 ms, can decrease scheduling overhead to nearly 0.5% at the cost of increased latency for the end user; this can be beneficial for applications where user interactivity is not important.
284

Analysis and Implementation Considerations of Krylov Subspace Methods on Modern Heterogeneous Computing Architectures

Higgins, Andrew, 0009-0007-5527-9263 05 1900 (has links)
Krylov subspace methods are the state-of-the-art iterative algorithms for solving the large, sparse systems of equations that are ubiquitous throughout scientific computing. Even with Krylov methods, these problems are often infeasible to solve on standard workstation computers and must instead be solved on supercomputers. Most modern supercomputers fall into the category of "heterogeneous architectures", typically meaning a combination of CPU and GPU processors. Thus, the development and analysis of Krylov subspace methods on these heterogeneous architectures is of fundamental importance to modern scientific computing. This dissertation focuses on several specific problems in this area. The first analyzes the performance of block GMRES (BGMRES) compared to GMRES for linear systems with multiple right-hand sides (RHS) on both CPUs and GPUs, and models when BGMRES is most advantageous over GMRES on the GPU. On CPUs, the current paradigm is that if one wishes to solve a system of equations with multiple RHS, BGMRES can indeed outperform GMRES, but not always. Our original goal was to see if there are cases for which BGMRES is slower in execution time than GMRES on the CPU, while on the GPU the reverse holds. This is true, and we generally observe much faster execution times and larger improvements for BGMRES on the GPU. We also observe that for any fixed matrix, as the number of RHS increases, there is a point at which the improvements start to decrease and eventually any advantage of the (unrestarted) block method is lost. We present a new computational model which helps explain why this is so. The significance of this analysis is that it demonstrates, for the first time, greater potential for block Krylov methods on heterogeneous architectures than on the previously studied CPU-only machines. Moreover, the theoretical runtime model can be used to identify an optimal partitioning strategy of the RHS for solving systems with many RHS.
The second problem studies the s-step GMRES method, which is an implementation of GMRES that attains high performance on modern heterogeneous machines by generating s Krylov basis vectors per iteration, and then orthogonalizing the vectors in a block-wise fashion. The use of s-step GMRES is currently limited because the algorithm is prone to numerical instabilities, partially due to breakdowns in a tall-and-skinny QR subroutine. Further, a conservatively small step size must be used in practice, limiting the algorithm’s performance. To address these issues, first a novel randomized tall-and-skinny QR factorization is presented that is significantly more stable than the current practical algorithms without sacrificing performance on GPUs. Then, a novel two-stage block orthogonalization scheme is introduced that significantly improves the performance of the s-step GMRES algorithm when small step sizes are used. These contributions help make s-step GMRES a more practical method in heterogeneous, and therefore exascale, environments. / Mathematics
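The randomized tall-and-skinny QR idea can be illustrated with a generic sketch-and-whiten scheme: factor a small sketched copy of A cheaply, then use its R factor to produce a well-conditioned basis. This is a textbook-style sketch under assumed dimensions, not the dissertation's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 10000, 20, 80  # tall-and-skinny A; sketch size k >= n

A = rng.standard_normal((m, n))
S = rng.standard_normal((k, m)) / np.sqrt(k)  # Gaussian sketch operator

# QR of the small k-by-n sketched matrix yields an R factor cheaply.
_, R = np.linalg.qr(S @ A)

# "Whitening" A by R gives a well-conditioned basis Q with A = Q R.
Q = np.linalg.solve(R.T, A.T).T  # Q = A @ inv(R), without forming inv(R)
```

Because Q is already nearly orthonormal, a single cheap re-orthogonalization pass (e.g. Cholesky QR) suffices to finish the factorization stably, which is why such schemes map well to GPUs dominated by matrix-matrix products.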
285

Massively Parallel Hidden Markov Models for Wireless Applications

Hymel, Shawn 03 January 2012 (has links)
Cognitive radio is a growing field in communications which allows a radio to automatically configure its transmission or reception properties in order to reduce interference, provide better quality of service, or allow for more users in a given spectrum. Such processes require several complex features that are currently being utilized in cognitive radio. Two such features, spectrum sensing and identification, have been implemented in numerous ways; however, they generally suffer from high computational complexity. Additionally, Hidden Markov Models (HMMs) are a widely used mathematical modeling tool in various fields of engineering and science. In electrical and computer engineering, they are used in several areas, including speech recognition, handwriting recognition, artificial intelligence, and queuing theory, and to model fading in communication channels. The research presented in this thesis proposes a new approach to spectrum identification using a parallel implementation of Hidden Markov Models. Algorithms involving HMMs are usually implemented in the traditional serial manner, which can have prohibitively long runtimes. In this work, we study their use in parallel implementations and compare our approach to traditional serial implementations. Timing and power measurements are taken and used to show that the parallel implementation can achieve well over 100× speedup in certain situations. To demonstrate the utility of this new parallel algorithm using graphics processing units (GPUs), a new method for signal identification is proposed for both serial and parallel implementations using HMMs. The method achieved high recognition rates at -10 dB Eb/N0. HMMs can benefit from parallel implementation in certain circumstances, specifically in models that have many states or when multiple models are used in conjunction. / Master of Science
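The serial baseline that such a GPU implementation competes with is the classic forward algorithm for HMM likelihood evaluation; a minimal NumPy reference (our illustration, not the thesis code) might look like:

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: P(observation sequence | HMM).

    pi: initial state probabilities, shape (N,)
    A:  state transition matrix, shape (N, N)
    B:  emission matrix, shape (N, M), B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # per-step recurrence
    return float(alpha.sum())

# Two-state toy model: likelihoods over all length-2 sequences sum to 1.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
total = sum(forward([a, b], pi, A, B) for a in (0, 1) for b in (0, 1))
```

The per-step recurrence is a matrix-vector product over all states, so models with many states (or many models evaluated at once) expose exactly the kind of parallelism a GPU exploits.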
286

Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems

Daga, Mayank 06 June 2011 (has links)
The emergence of scientific applications embedded with multiple modes of parallelism has made heterogeneous computing systems indispensable in high-performance computing. The popularity of such systems is evident from the fact that three of the top five fastest supercomputers in the world employ heterogeneous computing, i.e., they use dissimilar computational units. A closer look at the performance of these supercomputers reveals that they achieve only around 50% of their theoretical peak performance. This suggests that applications tuned for erstwhile homogeneous computing may not be efficient on today's heterogeneous systems, and hence novel optimization strategies need to be devised. However, optimizing an application for heterogeneous computing systems is extremely challenging, primarily due to the architectural differences between the computational units in such systems. This thesis intends to act as a cookbook for optimizing applications on heterogeneous computing systems that employ graphics processing units (GPUs) as the preferred mode of accelerator. We discuss optimization strategies for multicore CPUs as well as for the two popular GPU platforms, i.e., GPUs from AMD and NVIDIA. Optimization strategies for NVIDIA GPUs have been well studied, but when applied to AMD GPUs they fail to measurably improve performance because of differences in the underlying architecture. To the best of our knowledge, this research is the first to propose optimization strategies for AMD GPUs. Even on NVIDIA GPUs, there exists a lesser-known but extremely severe performance pitfall called partition camping, which can affect application performance by up to seven-fold. To facilitate the detection of this phenomenon, we have developed a performance prediction model that analyzes and characterizes the effect of partition camping in GPU applications.
We have used a large-scale, molecular modeling application to validate and verify all the optimization strategies. Our results illustrate that if appropriately optimized, AMD and NVIDIA GPUs can provide 371-fold and 328-fold improvement, respectively, over a hand-tuned, SSE-optimized serial implementation. / Master of Science
287

On the Complexity of Robust Source-to-Source Translation from CUDA to OpenCL

Sathre, Paul Daniel 12 June 2013 (has links)
The use of hardware accelerators in high-performance computing has grown increasingly prevalent, particularly due to the growth of graphics processing units (GPUs) as general-purpose (GPGPU) accelerators. Much of this growth has been driven by NVIDIA's CUDA ecosystem for developing GPGPU applications on NVIDIA hardware. However, with the increasing diversity of GPUs (including those from AMD, ARM, and Qualcomm), OpenCL has emerged as an open and vendor-agnostic environment for programming GPUs as well as other parallel computing devices such as the CPU (central processing unit), APU (accelerated processing unit), FPGA (field programmable gate array), and DSP (digital signal processor). The above, coupled with the broader array of devices supporting OpenCL and the significant conceptual and syntactic overlap between CUDA and OpenCL, motivated the creation of a CUDA-to-OpenCL source-to-source translator. However, there exist sufficient differences that make the translation non-trivial, providing practical limitations to both manual and automatic translation efforts. In this thesis, the performance, coverage, and reliability of a prototype CUDA-to-OpenCL source translator are addressed via extensive profiling of a large body of sample CUDA applications. An analysis of the sample body of applications is provided, which identifies and characterizes general CUDA source constructs and programming practices that obstruct our translation efforts. This characterization then led to more robust support for the translator, followed by an evaluation that demonstrated the performance of our automatically-translated OpenCL is on par with the original CUDA for a subset of sample applications when executed on the same NVIDIA device. / Master of Science
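A flavor of the rewriting such a translator performs can be given with a naive token-substitution sketch over a handful of genuine CUDA-to-OpenCL correspondences; a robust translator, as the thesis argues, needs far more than string replacement (kernel launch syntax, memory qualifiers in context, host API calls):

```python
# A few of the syntactic correspondences a CUDA-to-OpenCL translator
# must rewrite. The mappings themselves are standard; the translation
# strategy below is a deliberately naive illustration.
CUDA_TO_OPENCL = {
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "__global__": "__kernel",
    "__shared__": "__local",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
}

def translate(src: str) -> str:
    """Token-level rewrite; a robust translator needs a real parser."""
    for cuda, ocl in CUDA_TO_OPENCL.items():
        src = src.replace(cuda, ocl)
    return src

kernel = "__global__ void add(float *a) { a[threadIdx.x] += 1.0f; }"
```

Constructs that defeat this kind of textual substitution (macros, templates, vendor intrinsics) are precisely the obstructing source constructs the thesis characterizes.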
288

Characterization and Exploitation of GPU Memory Systems

Lee, Kenneth Sydney 25 October 2012 (has links)
Graphics Processing Units (GPUs) are workhorses of modern high-performance computing due to their ability to achieve massive speedups on parallel applications. The massive number of threads that can run concurrently on these systems allows applications with data-parallel computations to achieve better performance than on traditional CPU systems. However, the GPU is not ideal for all types of computation. The massively parallel SIMT architecture of the GPU can still be constraining in terms of achievable performance: GPU-based systems will typically achieve only between 40% and 60% of their peak performance. One of the major problems affecting this efficiency is the GPU memory system, which is tailored to the needs of graphics workloads rather than general-purpose computation. This thesis intends to show the importance of memory optimizations for GPU systems. In particular, this work addresses the problems of data transfer and global atomic memory contention. Using the novel AMD Fusion architecture, we gain overall performance improvements over discrete GPU systems for data-intensive applications. The fused-architecture systems offer an interesting trade-off, increasing data transfer rates at the cost of some raw computational power. We characterize the performance of the different memory paths that are possible because of the shared memory space present on the fused architecture. In addition, we provide a theoretical model which can be used to correctly predict the comparative performance of memory-movement techniques for a given data-intensive application and system. In terms of global atomic memory contention, we show improvements in scalability and performance for global synchronization primitives by avoiding contentious global atomic memory accesses. In general, this work shows the importance of understanding the memory system of the GPU architecture in order to achieve better application performance. / Master of Science
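The benefit of avoiding contentious global atomics can be shown in miniature: rather than every thread updating one shared accumulator, each block reduces privately and issues a single global update. The sketch below (our illustration, modeling the arithmetic rather than real GPU code) counts how many global updates each strategy would issue:

```python
import numpy as np

def contended_sum(values):
    """Every element hits the single accumulator: N 'atomic' updates."""
    total = 0.0
    for v in values:          # models one global atomic per thread
        total += v
    return total, len(values)

def hierarchical_sum(values, block: int = 256):
    """Each block reduces privately, then issues ONE global update."""
    total, global_updates = 0.0, 0
    for start in range(0, len(values), block):
        total += float(np.sum(values[start:start + block]))  # local reduction
        global_updates += 1   # models a single atomic per block
    return total, global_updates

data = np.ones(10_000)
```

Both strategies compute the same sum, but the hierarchical version serializes on the global accumulator roughly `N / block` times instead of `N` times, which is the essence of the contention-avoidance argument.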
289

Accelerating a Coupled SPH-FEM Solver through Heterogeneous Computing for use in Fluid-Structure Interaction Problems

Gilbert, John Nicholas 08 June 2015 (has links)
This work presents a partitioned approach to simulating free-surface flow interaction with hyper-elastic structures in which a smoothed particle hydrodynamics (SPH) solver is coupled with a finite-element (FEM) solver. SPH is a mesh-free, Lagrangian numerical technique frequently employed to study physical phenomena involving large deformations, such as fragmentation or breaking waves. As a mesh-free Lagrangian method, SPH makes an attractive alternative to traditional grid-based methods for modeling free-surface flows and/or problems with rapid deformations where frequent re-meshing and additional free-surface tracking algorithms are non-trivial. This work continues and extends the earlier coupled 2D SPH-FEM approach of Yang et al. [1,2] by linking a double-precision GPU implementation of a 3D weakly compressible SPH formulation [3] with the open source finite element software Code_Aster [4]. Using this approach, the fluid domain is evolved on the GPU, while the CPU updates the structural domain. Finally, the partitioned solutions are coupled using a traditional staggered algorithm. / Ph. D.
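A common ingredient of weakly compressible SPH formulations is a compactly supported smoothing kernel such as the cubic spline; the sketch below (a standard textbook kernel in 1D, not necessarily the exact kernel of the cited solver) checks its unit-mass normalization numerically:

```python
import numpy as np

def cubic_spline_w(r: np.ndarray, h: float) -> np.ndarray:
    """Standard 1D cubic-spline SPH kernel with support radius 2h."""
    q = np.abs(r) / h
    sigma = 2.0 / (3.0 * h)  # 1D normalization constant
    w = np.zeros_like(q)
    inner = q < 1.0
    outer = (q >= 1.0) & (q < 2.0)
    w[inner] = 1.0 - 1.5 * q[inner] ** 2 + 0.75 * q[inner] ** 3
    w[outer] = 0.25 * (2.0 - q[outer]) ** 3
    return sigma * w

# The kernel must integrate to one over its support.
x = np.linspace(-2.5, 2.5, 20001)
w = cubic_spline_w(x, h=1.0)
mass = float(np.sum(w) * (x[1] - x[0]))
```

Evaluating such kernels over every particle's neighbor list is the dominant cost of an SPH step, and it is independent per particle, which is why the fluid domain maps naturally onto the GPU while the FEM structural solve stays on the CPU.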
290

Mining Rare Features in Fingerprints using Core points and Triplet-based Features

Munagani, Indira Priya Darshini 04 January 2014 (has links)
A fingerprint matching algorithm with a novel set of matching parameters based on core points and triangular descriptors is proposed to discover rarity in fingerprints. The algorithm uses a mathematical and statistical approach to discover rare features in fingerprints, which provides scientific validation for both ten-print and latent fingerprint evidence. A feature is considered rare if it is statistically uncommon; that is, the rare feature should be unique among N (N > 100) randomly sampled prints. A rare feature in a fingerprint has higher discriminatory power when it is identified in a print (latent or otherwise). In the case of latent fingerprint matching, the enhanced discriminatory power from the rare features can help in delivering a confident court judgment. In addition to mining the rare features, a parallel algorithm for fingerprint matching on GPUs is also proposed to reduce the run-time of fingerprint matching on larger databases. Results show that 1) the matching algorithm is useful in eliminating false matches; 2) each of the 30 fingerprints randomly selected to mine rare features has a small set of highly distinctive, statistically rare features, some of which occur in only one in 1,000 fingerprints; and 3) the parallel algorithm implemented on GPUs for larger databases is around 40 times faster than the sequential algorithm. / Master of Science
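A rotation- and translation-invariant triangular descriptor of the kind built from minutiae triplets can be sketched as the sorted side lengths of the triangle formed by three points; this illustration is ours, not the thesis's exact feature set:

```python
import math

def triplet_descriptor(p1, p2, p3):
    """Sorted side lengths of a minutiae triangle: invariant to
    translation, rotation, and the order the three points are given in."""
    pts = (p1, p2, p3)
    sides = sorted(
        math.dist(pts[i], pts[j]) for i, j in ((0, 1), (1, 2), (0, 2))
    )
    return tuple(round(s, 6) for s in sides)

tri = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
# Rotating the whole triangle by 90 degrees leaves the descriptor unchanged.
rot = [(-y, x) for x, y in tri]
```

Because each triplet is scored independently against the gallery, comparing millions of such descriptors is a natural fit for the data-parallel GPU matching the thesis proposes.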
