About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

GPU Acceleration of 3D MRSI using CUDA

Chen, Chun-Cheng 04 August 2010 (has links)
Using the graphics processing unit (GPU) for parallel computation via the Compute Unified Device Architecture (CUDA) is a relatively recent technology. GPUs had been used for parallel computation before, but programming them was difficult, so they could not be widely applied. CUDA is a development environment based on the C language that greatly reduces this programming complexity. Thanks to IEEE floating-point support and hardware costs far below those of supercomputers, GPU applications built on CUDA have gradually expanded into many fields. Magnetic resonance spectroscopy (MRS) non-invasively probes the distribution of metabolite concentrations in vivo and can assist physicians in clinical diagnosis. Magnetic resonance spectroscopic imaging (MRSI) assembles many single-voxel spectroscopy (SVS) measurements into a multi-dimensional MRS image, and therefore offers more information than SVS alone. CUDA has been widely applied to MR imaging, for example to accelerate image reconstruction and improve image quality, but related applications in MRS are rare. In this work we apply CUDA to MRS, specifically to MRSI data pre-processing, to accelerate the spatial localization in MRSI. We first use random data in one, two, and three dimensions (1D, 2D, 3D) to evaluate the performance of the Fourier transform under CUDA, and then apply GE 2D/3D MRSI data to measure the acceleration CUDA provides. Our results show that the speedup of the fast Fourier transform (FFT) with CUDA on 1D, 2D, and 3D random data grows markedly as the data size increases. In the 2D/3D MRSI experiments, we find that using CUDA to accelerate the MRSI RAW-file generation procedure avoids repeated data transfers, but that the parallel 1D CUDA FFT performs poorly when the amount of data processed per kernel launch is too small.
How to reconcile the MRSI data format with the CUDA FFT library, and how to reduce the data transfer time, are therefore discussed in this study.
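The pre-processing step described above — Fourier transforming the spatial axes of the MRSI data — can be sketched with NumPy's FFT standing in for the CUDA FFT library (cuFFT). This is an illustrative sketch, not the thesis's code; array shapes and function names are assumptions:

```python
import numpy as np

def mrsi_spatial_reconstruction(kspace):
    """Reconstruct the spatial dimensions of an MRSI data set.

    kspace: complex array of shape (nx, ny, nz, nt) -- three spatial
    k-space axes plus one spectral (time) axis.  Only the spatial axes
    are transformed; the spectral axis is left untouched, mirroring the
    pre-processing the thesis accelerates with the CUDA FFT library.
    """
    return np.fft.ifftn(kspace, axes=(0, 1, 2))

def mrsi_spatial_reconstruction_batched(kspace):
    """The same 3D transform built from batched 1D transforms, which is
    how it maps onto a 1D FFT kernel launched per axis on the GPU."""
    out = kspace
    for axis in (0, 1, 2):
        out = np.fft.ifft(out, axis=axis)
    return out
```

The two forms are mathematically identical; the batched decomposition is the one that exposes the per-kernel data volume the abstract identifies as critical for 1D CUDA FFT performance.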

CUDA-Based Modified Genetic Algorithms for Solving Fuzzy Flow Shop Scheduling Problems

Huang, Yi-chen 23 August 2010 (has links)
The flow shop scheduling problems with fuzzy processing times and fuzzy due dates are investigated in this paper. Earliness and tardiness are interpreted through the possibility and necessity measures developed in fuzzy set theory, and the objective function is formed from different combinations of these measures. A genetic algorithm is invoked to tackle these objective functions, with a new idea based on the longest common substring introduced at the best-keeping step. This new algorithm reduces the number of generations needed to reach the stopping criterion. We also implement the algorithm on CUDA, and numerical experiments show that the performance of the CUDA program on the GPU compares favorably to traditional programs on the CPU.
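The longest-common-substring idea used at the best-keeping step can be illustrated with the standard dynamic-programming computation over two job sequences. This is a generic sketch, not the thesis's implementation; the fuzzy-objective machinery is omitted:

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of jobs shared by two schedules
    (sequences), via the standard O(len(a)*len(b)) dynamic program."""
    best_len, best_end = 0, 0
    # prev[j+1] = length of the common suffix of a[:i] and b[:j+1]
    prev = [0] * (len(b) + 1)
    for i, x in enumerate(a):
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b):
            if x == y:
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best_len:
                    best_len, best_end = cur[j + 1], i + 1
        prev = cur
    return a[best_end - best_len:best_end]
```

In a GA setting, such a shared run between two elite schedules identifies a promising job ordering worth preserving across generations.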

Optical Flow Computation on Compute Unified Device Architecture / Optiskt flödeberäkning med CUDA

Ringaby, Erik January 2008 (has links)
There has been rapid progress in graphics processors in recent years, largely driven by the demands computer games place on speed and image quality. Because of its special architecture, the graphics processor is much faster than an ordinary processor at solving parallel problems, and its increasing programmability makes it usable for tasks beyond those it was originally designed for.

Even though graphics processors have been programmable for some time, it has been quite difficult to learn how to use them. CUDA enables the programmer to use C code, with a few extensions, to program NVIDIA's graphics processors and skip the traditional graphics programming models entirely. This thesis investigates whether the graphics processor can be used for calculations without knowledge of how the hardware mechanisms work. An image processing algorithm calculating the optical flow has been implemented. The results show that it is rather easy to implement programs using CUDA, but some knowledge of how the graphics processor works is required to achieve high performance.
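As a toy illustration of the kind of optical flow computation involved (not the thesis's algorithm), a single-vector block-matching estimator over integer displacements can be written directly. Each candidate displacement's cost is independent of the others, which is exactly the parallelism a GPU implementation exploits:

```python
import numpy as np

def block_flow(f0, f1, max_d=3):
    """Estimate one translational flow vector (dy, dx) between frames
    f0 and f1 such that f1[p + d] ~ f0[p], by exhaustive search over
    integer displacements using mean squared difference on the
    overlapping region."""
    best, best_d = None, (0, 0)
    h, w = f0.shape
    for dy in range(-max_d, max_d + 1):
        for dx in range(-max_d, max_d + 1):
            # crop both frames to their common overlap for this shift
            a = f0[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
            b = f1[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            cost = np.mean((a - b) ** 2)
            if best is None or cost < best:
                best, best_d = cost, (dy, dx)
    return best_d
```

Real dense optical flow computes one such vector per pixel neighbourhood, multiplying the parallel work accordingly.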

GPU Implementation of a Novel Approach to Cramer’s Algorithm for Solving Large Scale Linear Systems

West, Rosanne Lane 01 May 2010 (has links)
Scientific computing often requires solving systems of linear equations. Most software packages for solving large-scale linear systems use Gaussian elimination methods such as LU-decomposition. An alternative method, recently introduced by K. Habgood and I. Arel, involves an application of Cramer's Rule and Chio's condensation to achieve a better-performing system for solving linear systems on parallel computing platforms. This thesis describes an implementation of this algorithm on an nVidia graphics processor card using the CUDA language. Increased performance, relative to the serial implementation, is demonstrated, paving the way for future parallel realizations of the scheme.
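Chio's condensation, the core of the Habgood/Arel scheme, shrinks an n x n determinant to an (n-1) x (n-1) one using only the first row and column: with pivot a11 != 0, each entry of the condensed matrix is the 2x2 determinant a11*aij - ai1*a1j, and det(A) = det(B) / a11^(n-2). A minimal sequential NumPy sketch, without the paper's pivoting/mirroring refinements:

```python
import numpy as np

def det_chio(A):
    """Determinant via recursive Chio condensation.  A simplified
    sketch: a production version (and the Habgood/Arel scheme) handles
    zero pivots more carefully than the row-swap fallback here."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    if A[0, 0] == 0:
        k = np.flatnonzero(A[:, 0])
        if k.size == 0:
            return 0.0          # whole first column zero => singular
        A = A.copy()
        A[[0, k[0]]] = A[[k[0], 0]]   # row swap flips the sign
        return -det_chio(A)
    # every entry of B is an independent 2x2 determinant -- the step
    # that parallelizes naturally on a GPU
    B = A[0, 0] * A[1:, 1:] - np.outer(A[1:, 0], A[0, 1:])
    return det_chio(B) / A[0, 0] ** (n - 2)
```

Cramer's Rule then expresses each solution component as a ratio of two such determinants, giving many independent condensations to run in parallel.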

An Application Developed for Simulation of Electrical Excitation and Conduction in a 3D Human Heart

Yu, Di 01 January 2013 (has links)
This thesis first reviews the history of general-purpose computing on graphics processing units (GPGPU) and introduces the classes of problems suited to GPGPU algorithms. The GPGPU architecture is compared against modern CPU architecture, and the fundamental differences are outlined. The programming challenges faced by GPGPU and the techniques used to overcome them are evaluated and discussed. The second part of the thesis presents an application developed with GPGPU technology to simulate electrical excitation and conduction in a 3D human heart model based on a cellular automata model. The algorithm and implementation are discussed in detail, and the performance of the GPU is compared against the CPU.

Computational kinetics of a large scale biological process on GPU workstations : DNA bending

Ruymgaart, Arnold Peter 30 October 2013 (has links)
It has only recently become possible to study the dynamics of large-timescale biological processes computationally in explicit solvent and atomic detail. This required a combination of advances in computer hardware, utilization of parallel and special-purpose hardware, and numerical and theoretical approaches. In this work we report advances in these areas that make a study of this scope feasible in a reasonable time. We then use them to study an interesting model system, the action of the DNA-bending protein IHF (crystal structure 1IHF), and demonstrate that such an effort can now be performed on GPU-equipped PC workstations. Many cellular processes require DNA bending. In the crowded compartment of the cell, DNA must be efficiently stored, but this is just one example where bending is observed; others include the DNA structural features involved in transcription, gene regulation, and recombination. IHF is a bacterial protein that binds and kinks DNA at sequence-specific sites, and its binding is accompanied by bending of the double helix by almost 180 degrees. Most sequence-specific DNA-binding proteins bind in the major groove of the DNA, and sequence specificity results from direct readout. IHF is an exception; it binds in the minor groove. The final structure of the binding/bending reaction was crystallized and shows the protein's arm-like features "latched" in place, wrapping the DNA in the minor grooves and intercalating the tips between base pairs at the kink sites. This sequence-specific, mostly indirect-readout protein-DNA binding/bending interaction is therefore an interesting test case for studying the mechanism of protein-DNA binding and bending in general. Kinetic schemes have been proposed and numerous experimental studies have been carried out to validate these schemes.
Experiments have included rapid kinetics laser T jump studies providing unprecedented temporal resolution and time resolved (quench flow) DNA foot-printing. Here we complement and add to those studies by investigating the mechanism and dynamics of the final latching/initial unlatching at an atomic level. This is accomplished with the computational tools of molecular dynamics and the theory of Milestoning. Our investigation begins by generating a reaction coordinate from the crystal structure of the DNA-protein complex and other images generated through modelling based on biochemical intuition. The initial path is generated by steepest descent minimization providing us with over 100 anchor images along the Steepest Descent Path (SDP) reaction coordinate. We then use the tools of Milestoning to sample hypersurfaces (milestones) between reaction coordinate anchors. Launching multiple trajectories from each milestone allowed us to accumulate average passage times to adjacent milestones and obtain transition probabilities. A complete set of rates was obtained this way allowing us to draw important conclusions about the mechanism of DNA bending. We uncover two possible metastable intermediates in the dissociation unkinking process. The first is an unexpected stable intermediate formed by initial unlatching of the IHF arms accompanied by a complete "psi-0" to "psi+140" conformational change of the IHF arm tip prolines. This unlatching (de-intercalation of the IHF tips from the kink sites) is required for any unkinking to occur. The second intermediate is formed by the IHF protein arms sliding over the DNA phosphate backbone and refolding in the next groove. The formation of this intermediate occurs on the millisecond timescale which is within experimental unkinking rate results. 
We show that our code optimization and parallelization enhancements allow the entire computational treatment of these millisecond-timescale events to complete in about one month on ten or fewer GPU-equipped workstations/cluster nodes, bringing such studies within reach of researchers who do not have access to supercomputer clusters.
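The rate calculation sketched above — milestone lifetimes plus milestone-to-milestone transition probabilities yielding overall kinetics — can be illustrated with standard absorbing-Markov-chain algebra. This is a simplified sketch, not the thesis's Milestoning code; the matrix `K`, lifetimes `t`, and milestone indices are hypothetical:

```python
import numpy as np

def mean_first_passage(K, t, target):
    """Mean first-passage time to milestone `target` from every
    milestone, given transition probabilities K (rows sum to 1,
    estimated from trajectories launched at each milestone) and the
    average lifetime t[i] spent at milestone i before a transition.
    Solves tau = (I - Q)^-1 t on the transient (non-target) milestones."""
    n = K.shape[0]
    trans = [i for i in range(n) if i != target]
    Q = K[np.ix_(trans, trans)]
    tau_trans = np.linalg.solve(np.eye(len(trans)) - Q, np.asarray(t)[trans])
    tau = np.zeros(n)
    tau[trans] = tau_trans
    return tau
```

The payoff is that the short trajectories needed to estimate `K` and `t` fit on workstation GPUs, while the assembled first-passage times can reach the millisecond regime directly.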

A Case Study of Parallel Bilateral Filtering on the GPU

Larsson, Jonas January 2015 (has links)
Smoothing and noise reduction of images is often an important first step in image processing applications. Simple smoothing algorithms like the Gaussian filter have the unfortunate side effect of blurring the image, which can obscure important information and harm subsequent processing. The bilateral filter is a widely used non-linear smoothing algorithm that seeks to preserve edges and contours while removing noise. It comes at a heavy cost in computational speed, especially on larger images, since it does more work per pixel than simpler smoothing algorithms. In applications where timing is important, this may push developers toward a simpler filter at the cost of quality. However, the time cost of the bilateral filter can be greatly reduced through parallelization, as the work for each pixel can in principle be done simultaneously. This work uses Nvidia's Compute Unified Device Architecture (CUDA) to implement and evaluate some of the most common and effective methods for parallelizing the bilateral filter on a graphics processing unit (GPU), including use of the constant and shared memories and a technique called 1 x N tiling. These techniques are evaluated on newer hardware, and the results are compared to a sequential version and to a naive parallel version that uses no advanced techniques. The report also aims to give a detailed and comprehensible explanation of these techniques so that readers can implement them on their own. The greatest speedup is achieved in the initial parallelization step, where the algorithm is simply converted to run on a GPU. Storing some data in constant memory provides a slight but reliable speedup for a small amount of work, and additional time can be gained by using shared memory.
However, memory transactions did not account for as much of the execution time as expected, so the memory optimizations yielded only small improvements. Test results showed 1 x N tiling to be mostly non-beneficial for the hardware used in this work, though there may have been problems with the implementation.
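The brute-force form of the filter can be sketched in a few lines; this NumPy version (standing in for the CUDA kernels, with illustrative parameter values) makes the per-pixel independence explicit:

```python
import numpy as np

def bilateral(img, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Brute-force bilateral filter on a 2D grayscale image with values
    in [0,1].  Each output pixel is a normalized sum over a (2r+1)^2
    window, weighted by both spatial distance (sigma_s) and intensity
    difference (sigma_r) -- the double weighting that preserves edges."""
    h, w = img.shape
    pad = np.pad(img, radius, mode='edge')
    out = np.zeros_like(img)
    norm = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = pad[radius + dy:radius + dy + h,
                          radius + dx:radius + dx + w]
            w_spatial = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
            w_range = np.exp(-((shifted - img) ** 2) / (2 * sigma_r ** 2))
            weight = w_spatial * w_range
            out += weight * shifted
            norm += weight
    return out / norm
```

Because each output pixel depends only on its own window, the natural GPU mapping is one thread per pixel, with shared memory caching the window data that neighbouring threads re-read.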

Soft MIMO Detection on Graphics Processing Units and Performance Study of Iterative MIMO Decoding

Arya, Richeek August 2011 (has links)
In this thesis we present an implementation of a soft multiple-input multiple-output (MIMO) detection, single tree search algorithm on graphics processing units (GPUs), compare its performance on different GPUs and a central processing unit (CPU), and study the performance of iterative decoding algorithms, showing that increasing the number of outer iterations further improves error rate performance. GPUs are specialized devices designed to accelerate graphics processing; they are massively parallel and can run thousands of threads simultaneously. Because of this tremendous processing power there is increasing interest in using them for scientific and general-purpose computation, and companies like Nvidia and Advanced Micro Devices (AMD) have started supporting general-purpose GPU (GPGPU) applications. Nvidia created the Compute Unified Device Architecture (CUDA) to program its GPUs, and efforts have been made to produce a standard cross-platform language for parallel computing; OpenCL is the first such language supported by all major GPU and CPU vendors. The MIMO detector has high computational complexity. We implemented a soft MIMO detector on GPUs and studied its throughput and latency, showing that a GPU can reach a throughput of up to 4 Mbps for a soft detection algorithm — more than sufficient for most general-purpose tasks such as voice communication, and a throughput increase of about 7x over the CPU. We also compared two GPUs, one with low and one with high computational power; the comparison shows the effect of thread serialization, with the lower-end GPU's execution time curve exhibiting a slope of 1/2. To further improve error rate performance, iterative decoding techniques are employed in which a feedback path connects the detector and the decoder.
With an eye towards a GPU implementation, we explored these algorithms. Better error rate performance, however, comes at the price of higher power dissipation and more latency. Simulations show that, based on the signal-to-noise ratio (SNR), one can predict how many iterations are needed to reach an acceptable bit error rate (BER) and frame error rate (FER). The iterative decoding results show an SNR gain of about 1.5 dB as the number of outer iterations is increased from zero. To reduce complexity, one can adjust the number of candidates the algorithm generates: while a candidate list of 128 is not sufficient for acceptable error rate performance on a 4x4 MIMO system with 16-QAM modulation, performance with list sizes of 512 and 1024 is comparable.
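The detection problem underlying the thesis can be illustrated with a deliberately simplified sketch: a hard-decision, exhaustive maximum-likelihood detector over a small QPSK constellation, rather than the soft single tree search algorithm actually implemented. All names and parameters here are illustrative:

```python
import itertools
import numpy as np

# unit-energy QPSK constellation (stand-in for the thesis's 16-QAM)
QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def ml_detect(H, y, constellation=QPSK):
    """Exhaustive maximum-likelihood detection for y = Hx + n: evaluate
    every candidate vector x and keep the one minimizing ||y - Hx||^2.
    This enumeration is what tree-search detectors prune, and what maps
    naturally onto one-candidate-per-thread GPU evaluation."""
    n_tx = H.shape[1]
    best, best_x = np.inf, None
    for cand in itertools.product(constellation, repeat=n_tx):
        x = np.array(cand)
        cost = np.sum(np.abs(y - H @ x) ** 2)
        if cost < best:
            best, best_x = cost, x
    return best_x
```

A soft detector additionally turns the candidate costs into per-bit log-likelihood ratios for the decoder; the candidate-list sizes discussed above (128, 512, 1024) bound how many of these costs are retained.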

Využití GPU výpočtů pro rozpoznání dopravních značek / Using GPU Computing for Traffic Sign Recognition

Zídek, Karel January 2015 (has links)
The thesis deals with GPU acceleration of algorithms for traffic sign recognition. The theoretical part outlines methods for object detection, with emphasis on the traffic sign detection problem, and then compares two well-known tools for GPU programming: CUDA and OpenCL. On the basis of this review, the architecture of the author's solution is proposed. Finally, the thesis describes the implementation and evaluates the results.

Adaptação e avaliação de triagem virtual em arquiteturas paralelas híbridas / Adaptation and Evaluation of Virtual Screening on Hybrid Parallel Architectures

Jesus, Éverton Mendonça de 22 November 2016 (has links)
Virtual screening is a computational drug discovery methodology that evaluates the interaction between small molecules (ligands) and macromolecular targets. This work adapted a virtual screening tool to parallel architectures with GPUs and multicore CPUs and evaluated the results, aiming to increase screening performance, reduce execution time and, consequently, allow the number of molecules involved in the process to scale. The tool chosen for this purpose was AutoDock, because of its wide adoption among drug discovery researchers who use virtual screening. Three implementations were created, covering different parallelization techniques: a multicore version using OpenMP, a GPU version using CUDA and, finally, a hybrid implementation combining the multicore and GPU versions. All approaches achieved good total execution times, but the hybrid version performed best: the multicore version reached speedups of about 10x, the GPU version about 28x, and the hybrid version about 85x.
These results show that parallel execution platforms can effectively improve virtual screening performance.
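The embarrassingly parallel structure described above — independent per-ligand docking evaluations — can be sketched with a worker pool. Python threads here stand in for the thesis's OpenMP and CUDA back-ends, and `score_ligand` is a hypothetical placeholder, not AutoDock:

```python
from concurrent.futures import ThreadPoolExecutor

def score_ligand(ligand):
    """Placeholder docking score -- a real pipeline would invoke
    AutoDock here.  Any deterministic per-ligand function works; the
    point is that ligands are scored independently of one another."""
    return sum(ord(c) for c in ligand) % 100

def screen(ligands, workers=4):
    """Score every ligand, fanning the independent evaluations out over
    a worker pool (a thread-based stand-in for the multicore/GPU/hybrid
    versions compared in the dissertation)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ligands, pool.map(score_ligand, ligands)))
```

Because each evaluation is independent, adding more workers (CPU cores, GPUs, or both in the hybrid case) scales throughput until the scoring function itself becomes the bottleneck.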
