91

Deep Learning with Go

Stinson, Derek L. 05 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Current research in deep learning primarily uses Python as its supporting language. Go, an emerging language with many benefits, including native support for concurrency, has seen a rise in adoption over the past few years. However, it is not widely used for developing learning models, owing to the lack of supporting libraries and frameworks for model development. This thesis explores the use of Go for developing neural network models in general and convolutional neural networks in particular. The study is based on GoCuNets, a Go-CUDA implementation of neural network models, which is compared to ConvNetGo, a Go-CPU deep learning implementation that takes advantage of Go's built-in concurrency. A comparison of the two implementations shows a significant performance gain for GoCuNets over ConvNetGo.
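The abstract does not reproduce the GoCuNets or ConvNetGo APIs; as a minimal sketch of the kind of built-in concurrency a Go-CPU framework like ConvNetGo can exploit, the example below fans a 2D convolution out across goroutines, one per output row. The function name, shapes, and kernel here are all hypothetical, not taken from the thesis.

```go
package main

import (
	"fmt"
	"sync"
)

// conv2D computes a valid (no-padding) 2D convolution of in with kernel,
// processing each output row in its own goroutine. Each goroutine writes
// a distinct element of out, so no locking is needed.
func conv2D(in, kernel [][]float64) [][]float64 {
	kh, kw := len(kernel), len(kernel[0])
	oh, ow := len(in)-kh+1, len(in[0])-kw+1

	out := make([][]float64, oh)
	var wg sync.WaitGroup
	for y := 0; y < oh; y++ {
		wg.Add(1)
		go func(y int) { // one goroutine per output row
			defer wg.Done()
			row := make([]float64, ow)
			for x := 0; x < ow; x++ {
				var sum float64
				for ky := 0; ky < kh; ky++ {
					for kx := 0; kx < kw; kx++ {
						sum += in[y+ky][x+kx] * kernel[ky][kx]
					}
				}
				row[x] = sum
			}
			out[y] = row
		}(y)
	}
	wg.Wait()
	return out
}

func main() {
	in := [][]float64{
		{1, 2, 3, 0},
		{4, 5, 6, 1},
		{7, 8, 9, 2},
		{1, 1, 1, 1},
	}
	k := [][]float64{{1, 0}, {0, -1}} // a simple 2x2 difference kernel
	for _, row := range conv2D(in, k) {
		fmt.Println(row)
	}
}
```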
92

Optimizing Lempel-Ziv Factorization for the GPU Architecture

Ching, Bryan 01 June 2014 (has links) (PDF)
Lossless data compression is used to reduce storage requirements, relieving I/O channels and making better use of bandwidth. The Lempel-Ziv lossless compression algorithms form the basis for many of the most commonly used compression schemes. General-purpose computing on graphics processing units (GPGPU) allows us to take advantage of the massively parallel nature of GPUs for computations other than their original purpose of rendering graphics. Our work targets the use of GPUs for general lossless data compression. Specifically, we developed and ported an algorithm that constructs the Lempel-Ziv factorization directly on the GPU. Our implementation bypasses the sequential nature of LZ factorization and attempts to compute the factorization in parallel. By breaking the LZ factorization down into what we call the PLZ, we are able to outperform the fastest serial CPU implementations by up to 24x and perform comparably to a parallel multicore CPU implementation. To achieve these speeds, our implementation outputs LZ factorizations that are on average only 0.01 percent larger than the optimal factorization that could be computed sequentially. We also reevaluate the fastest GPU suffix array construction algorithm, which is needed to compute the LZ factorization, and find speedups of up to 5x over the fastest CPU implementations.
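The abstract does not define the factorization itself; for reference, here is a minimal serial sketch (not the thesis's PLZ) that greedily parses a string into LZ factors, each factor being either a fresh literal character or the longest match beginning earlier in the string. A practical factorizer would derive matches from a suffix array, as the thesis notes, rather than this quadratic scan.

```go
package main

import "fmt"

// Factor is one phrase of the LZ factorization: either a literal
// character (Len == 0) or a reference (Pos, Len) to an earlier match.
type Factor struct {
	Pos, Len int
	Lit      byte
}

// lzFactorize greedily parses s into LZ factors using a naive scan for
// the longest previous match. Matches may overlap the current position,
// which is standard for LZ self-referential factors.
func lzFactorize(s string) []Factor {
	var fs []Factor
	for i := 0; i < len(s); {
		bestPos, bestLen := 0, 0
		for j := 0; j < i; j++ { // candidate earlier start
			l := 0
			for i+l < len(s) && s[j+l] == s[i+l] {
				l++
			}
			if l > bestLen {
				bestPos, bestLen = j, l
			}
		}
		if bestLen == 0 {
			fs = append(fs, Factor{Lit: s[i]}) // new character
			i++
		} else {
			fs = append(fs, Factor{Pos: bestPos, Len: bestLen})
			i += bestLen
		}
	}
	return fs
}

func main() {
	// "abababb" parses as: 'a', 'b', (pos 0, len 4), (pos 1, len 1).
	fmt.Println(lzFactorize("abababb"))
}
```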
93

Astro – A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing Using the Jetson TK1

Sheen, Sean Kai 01 June 2016 (has links) (PDF)
With the rising costs of large-scale distributed systems, many researchers have begun looking at low-power architectures for clusters. In this paper, we describe our Astro cluster, which consists of 46 NVIDIA Jetson TK1 nodes, each equipped with an ARM Cortex-A15 CPU, a 192-core Kepler GPU, 2 GB of RAM, and 16 GB of flash storage. The cluster has a number of advantages over conventional clusters, including lower power usage, ambient cooling, shared memory between the CPU and GPU, and affordability. The cluster is built from commodity hardware and can be set up at relatively low cost while providing up to 190 single-precision GFLOPS of computing power per node thanks to its combined GPU/CPU architecture. The cluster currently uses one 48-port Gigabit Ethernet switch and runs Linux for Tegra, a modified version of Ubuntu provided by NVIDIA, as its operating system. Common file systems such as PVFS, Ceph, and NFS are supported by the cluster, and benchmarks such as HPL, LAPACK, and LAMMPS are used to evaluate the system. At peak, the cluster produces 328 GFLOPS of double-precision performance while drawing 810 W on the LINPACK benchmark (roughly 405 MFLOPS per watt), placing it 324th on the Green500. Single-precision benchmarks reach a peak of 6800 GFLOPS. The Astro cluster aims to be a proof of concept for future low-power clusters that utilize a similar architecture. It is installed with many of the same applications used by top supercomputers and is validated using several standard supercomputing benchmarks. We show that, with the rise of low-power CPUs and GPUs and the need for lower server costs, this cluster provides insight into how ARM and CPU-GPU hybrid chips will perform in high-performance computing.
94

Real-time Rendering of Burning Objects in Video Games

Amarasinghe, Dhanyu Eshaka 08 1900 (has links)
In recent years there has been growing interest in ever-greater realism in computer graphics applications. My foremost concentration is on complex physical simulation and modeling, with diverse applications in the gaming industry. Various simulations have succeeded by replicating the details of a physical process, and some were convincing enough to draw the user into believable virtual worlds without breaking the sense of presence. In this research, I focus on fire simulation and its deformation of various virtual objects. In most game engines, model loading takes place at the beginning of the game or when the game transitions between levels, and game models are stored in large data structures. Because changing or adjusting a large data structure while the game is running can adversely affect performance, developers may choose to avoid procedural simulations to save resources and avoid interruptions. I introduce a process for real-time model deformation that maintains performance; a sketch of the idea follows below. Achieving a high-quality simulation while using minimal resources to represent multiple events in a timely manner is challenging, especially in video games, where the effect must be robust enough to sustain the player's willing suspension of disbelief. I have implemented and tested my method on a relatively modest GPU using CUDA. My experiments show that this method gives a believable visual effect while using a small fraction of CPU and GPU resources.
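The thesis itself runs its deformation in CUDA kernels, which the abstract does not reproduce; as a hedged CPU analogue of one-thread-per-vertex deformation, the sketch below applies a time-dependent displacement to each vertex of a mesh in place, across goroutines, so the model's large vertex buffer is never reallocated mid-frame. The function names and the "charring" displacement rule are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// Vec3 is a mesh vertex position.
type Vec3 struct{ X, Y, Z float32 }

// deform displaces every vertex in place, chunk by chunk across
// goroutines. Mutating the existing buffer avoids rebuilding the
// model's data structures while the game is running.
func deform(verts []Vec3, t float32, workers int) {
	var wg sync.WaitGroup
	chunk := (len(verts) + workers - 1) / workers
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > len(verts) {
			hi = len(verts)
		}
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(vs []Vec3) {
			defer wg.Done()
			for i := range vs {
				// Hypothetical rule: burning vertices sag over time
				// and ripple as a function of position.
				vs[i].Y -= 0.01 * t
				vs[i].X += 0.002 * float32(math.Sin(float64(vs[i].Z+t)))
			}
		}(verts[lo:hi])
	}
	wg.Wait()
}

func main() {
	mesh := []Vec3{{0, 1, 0}, {1, 1, 0}, {0, 1, 1}, {1, 1, 1}}
	for frame := 0; frame < 3; frame++ {
		deform(mesh, float32(frame)*0.016, 2) // ~60 fps timestep
	}
	fmt.Println(mesh)
}
```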
95

A Fast Poisson Solver with Periodic Boundary Conditions for GPU Clusters in Various Configurations

Rattermann, Dale N. 27 October 2014 (has links)
No description available.
96

Sparse Matrix-Vector Multiplication on GPU

Ashari, Arash January 2014 (has links)
No description available.
97

On Improving Sparse Matrix-Matrix Multiplication on GPUs

Kunchum, Rakshith 15 August 2017 (has links)
No description available.
98

Accelerating and Predicting Map Projections with CUDA and MLP

Zhang, Jiaqi 15 August 2018 (has links)
No description available.
99

A Highly Parallelized Approach to Silhouette Edge Detection for Shadow Volumes in Three Dimensional Triangular Meshes

Mourning, Chad L. 29 December 2008 (has links)
No description available.
100

Fast 3D Deformable Image Registration on a GPU Computing Platform

Mousazadeh, Mohammad Hamed 10 1900 (has links)
Image registration has become an indispensable tool in medical diagnosis and intervention. The increasing need for speed and accuracy in clinical applications has motivated researchers to focus on developing fast and reliable registration algorithms. In particular, advanced deformable registration routines are emerging for medical applications involving soft-tissue organs such as the brain, breast, kidney, liver, and prostate. The computational complexity of such algorithms is significantly higher than that of conventional rigid and affine methods, leading to substantially longer execution times. In this thesis, we present a parallel implementation of a newly developed deformable image registration algorithm by Marami et al. [1] using the Compute Unified Device Architecture (CUDA). The focus of this study is on accelerating the computations on a graphics processing unit (GPU) to reduce the execution time to nearly real time for diagnostic and interventional applications. The algorithm co-registers preoperative and intraoperative 3-dimensional magnetic resonance (MR) images of a deforming organ. It employs a linear elastic dynamic finite-element model of the deformation, together with distance measures such as mutual information and the sum of squared differences (SSD), to align volumetric image data sets. In this study, we report a parallel implementation of the algorithm for 3D-3D MR registration based on SSD on a CUDA-capable NVIDIA GTX 480 GPU. Computationally expensive tasks such as interpolation and displacement and force calculation are significantly accelerated on the GPU. Experiments carried out with a realistic breast phantom show a 37-fold speedup for the GPU-based implementation over an optimized CPU-based implementation in high-resolution MR image registration. The CPU is a 3.20 GHz Intel Core i5 650 processor with 4 GB of RAM that also hosts the GTX 480 GPU; the GPU has 15 streaming multiprocessors, each with 32 streaming processors, for a total of 480 cores. The GPU implementation registers 3D-3D high-resolution (512×512×136) image sets in just over 2 seconds, compared to 1.38 and 23.25 minutes for the CPU- and MATLAB-based implementations, respectively. Most of the GPU kernels employed in the 3D-3D registration algorithm can also be used to accelerate the 2D-3D registration algorithm in [1]. / Master of Applied Science (MASc)
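The SSD measure is simple enough to state in code. Below is a minimal CPU sketch in Go, an analogue of the kind of reduction the thesis runs as a CUDA kernel, not its actual implementation: it computes the sum of squared differences between two volumes, parallelized across Z-slices. The flat-array layout and per-slice worker scheme are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// ssd computes the sum of squared differences between two volumes of
// identical dimensions, stored flat in x-fastest order. Each Z-slice is
// reduced in its own goroutine -- a CPU analogue of mapping one CUDA
// thread block per slice -- and the partial sums are combined at the end.
func ssd(a, b []float32, nx, ny, nz int) float64 {
	partial := make([]float64, nz) // one partial sum per slice
	var wg sync.WaitGroup
	for z := 0; z < nz; z++ {
		wg.Add(1)
		go func(z int) {
			defer wg.Done()
			var s float64
			base := z * nx * ny
			for i := base; i < base+nx*ny; i++ {
				d := float64(a[i] - b[i])
				s += d * d
			}
			partial[z] = s
		}(z)
	}
	wg.Wait()
	var total float64
	for _, s := range partial {
		total += s
	}
	return total
}

func main() {
	nx, ny, nz := 4, 4, 2
	fixed := make([]float32, nx*ny*nz)
	moving := make([]float32, nx*ny*nz)
	moving[5] = 3 // one differing voxel -> SSD of 9
	fmt.Println(ssd(fixed, moving, nx, ny, nz))
}
```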
