201

GPU Implementation of a Novel Approach to Cramer’s Algorithm for Solving Large Scale Linear Systems

West, Rosanne Lane 01 May 2010 (has links)
Scientific computing often requires solving systems of linear equations. Most software packages for solving large-scale linear systems use Gaussian elimination methods such as LU-decomposition. An alternative method, recently introduced by K. Habgood and I. Arel, applies Cramer's Rule and Chio's condensation to achieve a better-performing scheme for solving linear systems on parallel computing platforms. This thesis describes an implementation of this algorithm on an nVidia graphics processor card using the CUDA language. Increased performance, relative to the serial implementation, is demonstrated, paving the way for future parallel realizations of the scheme.
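The key to the Habgood–Arel approach is Chio's condensation, which shrinks an n x n determinant to an (n-1) x (n-1) one: with pivot a11 != 0, each entry of the condensed matrix B is a 2 x 2 minor against the pivot row and column, and det(A) = det(B) / a11^(n-2). Every entry of B is independent of the others, which is what makes the step GPU-friendly. Below is a minimal CUDA sketch of one condensation step under these standard definitions — an illustration of the technique, not the thesis's actual code.

```cuda
// Hypothetical sketch: one Chio condensation step on the GPU.
// Each thread computes one entry of the (n-1) x (n-1) condensed matrix B
// from the n x n row-major input A, assuming the pivot A[0][0] is nonzero
// (a real solver would pivot first).
__global__ void chio_step(const double* A, double* B, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row in B
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column in B
    if (i < n - 1 && j < n - 1) {
        double a11 = A[0];
        // 2x2 minor formed by the pivot row/column and row i+1, column j+1:
        // b_ij = a11 * a_(i+1)(j+1) - a_(i+1)1 * a_1(j+1)
        B[i * (n - 1) + j] = a11 * A[(i + 1) * n + (j + 1)]
                           - A[(i + 1) * n] * A[j + 1];
    }
}
```

Repeating the step n-2 times condenses the determinant to a single 2 x 2 case while accumulating the pivot powers for the final division.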
202

Computational kinetics of a large scale biological process on GPU workstations : DNA bending

Ruymgaart, Arnold Peter 30 October 2013 (has links)
It has only recently become possible to study the dynamics of large-time-scale biological processes computationally in explicit solvent and atomic detail. This required a combination of advances in computer hardware, utilization of parallel and special-purpose hardware, and new numerical and theoretical approaches. In this work we report advances in these areas that make a study of this scope feasible in a reasonable time. We then use them to study an interesting model system, the action of the DNA-bending protein IHF, and demonstrate that such an effort can now be performed on GPU-equipped PC workstations.

Many cellular processes require DNA bending. In the crowded compartment of the cell, DNA must be efficiently stored, but this is just one example where bending is observed; others include the effects of DNA structural features involved in transcription, gene regulation, and recombination. IHF is a bacterial protein that binds and kinks DNA at sequence-specific sites, bending the double helix by almost 180 degrees. Most sequence-specific DNA-binding proteins bind in the major groove of the DNA, and sequence specificity results from direct readout. IHF is an exception: it binds in the minor groove. The final structure of the binding/bending reaction was crystallized (PDB entry 1IHF) and shows the protein's arm-like features "latched" in place, wrapping the DNA in the minor grooves and intercalating the tips between base pairs at the kink sites. This sequence-specific, mostly indirect-readout protein-DNA binding/bending interaction is therefore an interesting test case for studying the mechanism of protein-DNA binding and bending in general.

Kinetic schemes have been proposed, and numerous experimental studies have been carried out to validate them, including rapid-kinetics laser T-jump studies providing unprecedented temporal resolution and time-resolved (quench-flow) DNA footprinting. Here we complement and add to those studies by investigating the mechanism and dynamics of the final latching/initial unlatching at an atomic level, using the computational tools of molecular dynamics and the theory of Milestoning.

Our investigation begins by generating a reaction coordinate from the crystal structure of the DNA-protein complex and other images generated through modelling based on biochemical intuition. The initial path is generated by steepest-descent minimization, providing over 100 anchor images along the Steepest Descent Path (SDP) reaction coordinate. We then use the tools of Milestoning to sample hypersurfaces (milestones) between reaction-coordinate anchors. Launching multiple trajectories from each milestone allowed us to accumulate average passage times to adjacent milestones and obtain transition probabilities. A complete set of rates was obtained this way, allowing us to draw important conclusions about the mechanism of DNA bending.

We uncover two possible metastable intermediates in the dissociation/unkinking process. The first is an unexpected stable intermediate formed by initial unlatching of the IHF arms, accompanied by a complete "psi-0" to "psi+140" conformational change of the IHF arm-tip prolines. This unlatching (de-intercalation of the IHF tips from the kink sites) is required for any unkinking to occur. The second intermediate is formed by the IHF protein arms sliding over the DNA phosphate backbone and refolding in the next groove. The formation of this intermediate occurs on the millisecond timescale, which is consistent with experimental unkinking rates. We show that our code optimization and parallelization enhancements allow the entire computation for these millisecond-timescale events to be completed in about one month on 10 or fewer GPU-equipped workstations or cluster nodes, bringing such studies within reach of researchers without access to supercomputer clusters.
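For context, the sampled quantities described above turn into rates through the standard Milestoning relations (a textbook result of the theory, not something specific to this thesis): given the transition-probability matrix K, where K_ij is the probability that a trajectory leaving milestone i next crosses milestone j, and the vector of mean local passage times t-bar, the overall mean first passage time from an initial milestone distribution p_0 is

```latex
\langle \tau \rangle \;=\; \mathbf{p}_0^{\mathsf{T}}\,(\mathbf{I}-\mathbf{K})^{-1}\,\bar{\mathbf{t}}
```

and the corresponding overall rate is its inverse, 1/<tau>.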
203

Performance-efficient mechanisms for managing irregularity in throughput processors

Rhu, Minsoo 01 July 2014 (has links)
Recent graphics processing units (GPUs) have emerged as a promising platform for general-purpose computing and have been shown to be very efficient in executing parallel applications with regular control and memory-access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model, which balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads, and each thread is supported with conditional branches and fine-grained load/store instructions for ease of programming. At the same time, the hardware and software collaboratively group scalar threads for execution in a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design.

As GPUs gain momentum in various application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and wasted off-chip memory bandwidth for highly irregular programs with divergent control and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures so that they can better manage irregularity.

I first identify that previously suggested optimizations to the divergent control-flow problem suffer from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular/irregular codes, and 3) limited applicability. Based on these observations, I propose and evaluate three novel mechanisms that resolve these issues, providing significant performance improvements while minimizing implementation overhead.

In the second half of the dissertation, I observe that conventional coarse-grained memory-hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste in off-chip memory bandwidth. I design and evaluate a locality-aware memory hierarchy for throughput processors that retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality, without sacrificing control efficiency or prefetching potential for data with high spatial locality.
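To make the divergence problem concrete, the following toy CUDA kernel (an illustration of the pattern, not one of the dissertation's mechanisms) shows what forces a SIMT machine to serialize: lanes of the same 32-wide warp take different branches, so the warp executes both paths with complementary lanes masked off.

```cuda
// Illustrative only: a divergent kernel of the kind SIMT hardware handles
// poorly. If flags[] differs within a warp, the warp runs path A with the
// path-B lanes idle, then path B with the path-A lanes idle.
__global__ void divergent(const int* flags, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (flags[tid]) {                     // data-dependent branch
        out[tid] = sqrtf((float)tid);     // path A
    } else {
        out[tid] = (float)tid * 2.0f;     // path B
    }
}
```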
204

A Case Study of Parallel Bilateral Filtering on the GPU

Larsson, Jonas January 2015 (has links)
Smoothing and noise reduction of images is often an important first step in image-processing applications. Simple smoothing algorithms like the Gaussian filter have the unfortunate side effect of blurring the image, which can obscure important information and negatively affect subsequent processing. The bilateral filter is a widely used non-linear smoothing algorithm that seeks to preserve edges and contours while removing noise. It comes at a heavy cost in computational speed, especially on larger images, since it does a greater amount of work per pixel than simpler smoothing algorithms. In applications where timing is important, this may push developers toward a simpler filter at the cost of quality. However, the time cost of the bilateral filter can be greatly reduced through parallelization, as the work for each pixel can in principle be done simultaneously.

This work uses Nvidia's Compute Unified Device Architecture (CUDA) to implement and evaluate some of the most common and effective methods for parallelizing the bilateral filter on a graphics processing unit (GPU), including use of the constant and shared memories and a technique called 1 x N tiling. These techniques are evaluated on newer hardware, and the results are compared to a sequential version and to a naive parallel version that uses no advanced techniques. The report also aims to give a detailed and comprehensible explanation of these techniques so that readers can implement them on their own.

The greatest speedup is achieved in the initial parallelization step, where the algorithm is simply converted to run in parallel on a GPU. Storing some data in constant memory provides a slight but reliable speedup for a small amount of work. Additional time can be gained by using shared memory. However, memory transactions did not account for as much of the execution time as expected, so the memory optimizations yielded only small improvements. Test results showed 1 x N tiling to be mostly non-beneficial on the hardware used in this work, although there may have been problems with the implementation.
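As a point of reference for the techniques discussed, here is a minimal sketch of the naive parallel bilateral filter (one thread per pixel, global memory only) that the optimized versions improve upon. The grayscale float layout and the parameter names are illustrative assumptions, not the thesis's code.

```cuda
// Naive bilateral filter: one thread per output pixel, row-major grayscale
// image. sigma_s controls the spatial falloff, sigma_r the range (intensity)
// falloff -- the range term is what preserves edges.
__global__ void bilateral_naive(const float* in, float* out,
                                int width, int height,
                                int radius, float sigma_s, float sigma_r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float center = in[y * width + x];
    float sum = 0.0f, norm = 0.0f;

    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = min(max(x + dx, 0), width - 1);   // clamp at borders
            int ny = min(max(y + dy, 0), height - 1);
            float v = in[ny * width + nx];
            // spatial weight: falls off with pixel distance
            float ws = expf(-(dx * dx + dy * dy)
                            / (2.0f * sigma_s * sigma_s));
            // range weight: falls off with intensity difference
            float wr = expf(-(v - center) * (v - center)
                            / (2.0f * sigma_r * sigma_r));
            sum  += ws * wr * v;
            norm += ws * wr;
        }
    }
    out[y * width + x] = sum / norm;
}
```

The constant-memory variant would move precomputed spatial weights into `__constant__` storage, and the shared-memory variant would stage the block's input tile (plus apron) in `__shared__` memory before filtering.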
205

Enhanced Ultrasound Visualization for Procedure Guidance

Brattain, Laura 04 December 2014 (has links)
Intra-cardiac procedures often involve fast-moving anatomic structures with large spatial extent and high geometric complexity. Real-time visualization of the moving structures and of instrument-tissue contact is crucial to the success of these procedures. Real-time 3D ultrasound is a promising modality for procedure guidance, as it offers improved spatial-orientation information relative to 2D ultrasound. Imaging rates of 30 fps enable good visualization of instrument-tissue interactions, far faster than the volumetric imaging alternatives (MR/CT). Unlike fluoroscopy, 3D ultrasound also provides better soft-tissue contrast and avoids the use of ionizing radiation. / Engineering and Applied Sciences
206

Parallel Discrete Event Simulation on Many Core Platforms Using Parallel Heap Event Queues

Tanniru, Govardhan 10 May 2014 (has links)
Discrete event simulation on GPUs employing a parallel heap data structure is the focus of this thesis. Two traditional algorithms for parallel discrete event simulation, one conservative and the other optimistic, have been implemented on GPUs using CUDA. The first is the safe-window algorithm (conservative); it produced the expected performance relative to sequential simulation. The second, known as SyncSim, is an optimistic simulation algorithm previously designed to be space efficient and to reduce rollbacks. This algorithm is re-implemented on the GPU platform with the necessary changes to the logic simulator and the parallel heap implementation. The performance of the parallel heap when working with a logic simulator has also been validated against the results reported in a previous research paper on the parallel heap without the logic simulator.
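The conservative safe window mentioned above rests on a simple rule: an event is safe to process in parallel if no other pending event could causally precede it, i.e., its timestamp lies within the minimum pending timestamp plus the model's lookahead. A minimal CUDA sketch of that test, assuming this generic formulation rather than the thesis's implementation:

```cuda
// Mark events that fall inside the conservative safe window. Events with
// timestamp <= t_min + lookahead cannot be affected by any unprocessed
// earlier event, so they may all be executed in the same parallel round.
__global__ void mark_safe(const float* timestamps, int* safe,
                          int n, float t_min, float lookahead)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        safe[i] = (timestamps[i] <= t_min + lookahead) ? 1 : 0;
}
```

In a parallel-heap setting, t_min is simply the root of the heap, so the window test costs one comparison per pending event.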
207

Performance Analysis of kNN on large datasets using CUDA & Pthreads : Comparing between CPU & GPU

Kankatala, Sriram January 2015 (has links)
Several organizations have large databases that grow rapidly and need regular maintenance. Content-based searches are similarity searches based on features extracted from multimedia data. For applications such as multimedia content retrieval, data mining, and pattern recognition, performing a nearest-neighbor search over multidimensional data is a challenging task; the important factors in k-nearest-neighbor (kNN) search are search speed and accuracy. Implementing kNN on the GPU has been an active research topic in recent years, focused on improving kNN performance, and this thesis addresses a gap in that area. It demonstrates effective and efficient parallelism on multi-core CPUs and GPUs and compares the performance with a single-core CPU. The thesis presents an experimental implementation of kNN on a single-core CPU, a multi-core CPU, and a GPU using C, Pthreads, and CUDA, respectively, evaluated over inputs of different sizes and dimensionalities. The experiments show that for kNN the GPU outperforms the single-core CPU by a factor of approximately 5.8 to 16, and the multi-core CPU by a factor of approximately 1.2 to 3, across the different input levels.
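The core of a brute-force GPU kNN is an embarrassingly parallel distance computation followed by a selection of the k smallest values. A minimal CUDA sketch of the distance stage, with illustrative names and a row-major layout for the reference points (not the thesis's code):

```cuda
// Each thread computes the squared Euclidean distance from one reference
// point to the query. A selection step (not shown) then keeps the k
// smallest distances.
__global__ void knn_distances(const float* refs, const float* query,
                              float* dist, int n, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float d = 0.0f;
    for (int j = 0; j < dim; ++j) {
        float diff = refs[i * dim + j] - query[j];
        d += diff * diff;
    }
    dist[i] = d;  // squared distance preserves ordering, so sqrtf is skipped
}
```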
208

Performance Benchmarking of Fast Multipole Methods

Al-Harthi, Noha A. 06 1900 (has links)
The current trends in computer architecture are shifting towards smaller byte/flop ratios, while available parallelism is increasing at all levels of granularity: vector length, core count, and MPI process count. Intel's Xeon Phi coprocessor, NVIDIA's Kepler GPU, and IBM's BlueGene/Q all have a byte/flop ratio close to 0.2, which makes it very difficult for most algorithms to extract a high percentage of the theoretical peak flop/s from these architectures. Popular algorithms in scientific computing such as the FFT are continuously evolving to keep up with this trend in hardware. In the meantime, it is also necessary to invest in novel algorithms that are more suitable for the computer architectures of the future.

The fast multipole method (FMM) was originally developed as a fast algorithm for approximating the N-body interactions that appear in astrophysics, molecular dynamics, and vortex-based fluid dynamics simulations. The FMM possesses a unique combination of being an efficient O(N) algorithm while having an operational intensity higher than that of a matrix-matrix multiplication. In fact, the FMM can reduce the byte/flop requirement to around 0.01, which means it will remain compute-bound until 2020 even if the current trend in microprocessors continues. Despite these advantages, there have not been any benchmarks of FMM codes on modern architectures such as Xeon Phi, Kepler, and BlueGene/Q.

This study aims to provide a comprehensive benchmark of a state-of-the-art FMM code, "exaFMM", on the latest architectures, in the hope of providing a useful reference for deciding when the FMM becomes useful as the computational engine in a given application code. It may also serve as a warning for problem-size domains where the FMM will exhibit insignificant performance improvements. Such issues depend strongly on the asymptotic constants rather than the asymptotics themselves, and are therefore strongly implementation- and hardware-dependent. The primary objective of this study is to provide these constants on various computer architectures.
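The byte/flop argument can be made concrete with the standard roofline model (a generic model; the 0.2 and 0.01 figures are taken from the abstract):

```latex
P_{\text{attainable}} \;=\; \min\!\bigl(P_{\text{peak}},\; I \cdot B_{\text{peak}}\bigr),
\qquad I = \frac{\text{flop}}{\text{byte moved}}
```

A machine balance of 0.2 byte/flop corresponds to an operational intensity of 1/0.2 = 5 flop/byte, while the FMM's roughly 0.01 byte/flop corresponds to about 100 flop/byte. Since 100 is far above 5, the minimum is taken at P_peak: the FMM sits firmly on the compute-bound side of the roofline.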
209

IMPLEMENTATION OF FILTERING BEAMFORMING ALGORITHMS FOR SONAR DEVICES USING GPU

Kamali, Shahrokh 27 June 2013 (has links)
Beamforming is a signal-processing technique used in sensor arrays to direct signal transmission or reception. A beamformer combines the input signals in the array to achieve constructive interference at particular angles (beams) and destructive interference at other angles. Several facts motivate this work:

1. Beamforming can be computationally intensive, so real-time beamforming algorithms for sonar devices are important.
2. Parallel computing became a critical component of computing technology in the 1990s, and it is likely to have as much impact over the next 20 years as microprocessors have had over the past 20 [5].
3. The high-performance computing community has been developing parallel programs for decades, but these programs run on large-scale, expensive computers, and only a few elite applications can justify their use [2].
4. GPU computing provides parallel computing capability and is available on personal computers.

The objective of this thesis is to use the Graphics Processing Unit (GPU) as a real-time digital beamformer to accelerate this intensive signal processing.
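The operation described above is canonically the delay-and-sum beamformer (stated here in its textbook form, which may differ in detail from the filtering beamformers the thesis implements): each of the N sensor signals x_i(t) is delayed so that a wavefront arriving from the steering direction theta adds in phase, then weighted and summed,

```latex
y(t) \;=\; \sum_{i=1}^{N} w_i\, x_i\!\bigl(t - \tau_i(\theta)\bigr)
```

where the tau_i(theta) are the steering delays and the w_i are shading weights. Each output sample and each beam angle is independent of the others, which is exactly the parallelism a GPU exploits.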
210

A system for real-time rendering of compressed time-varying volume data

She, Biao Unknown Date
No description available.
