71

LARGE-SCALE MICROARRAY DATA ANALYSIS USING GPU-ACCELERATED LINEAR ALGEBRA LIBRARIES

Zhang, Yun 01 August 2012 (has links)
The biological datasets produced by high-throughput genomic research, such as microarrays, contain vast amounts of information about entire genomes and their expression relationships. Gene clustering from such data is a challenging task due to the huge data size, the high complexity of the algorithms, and the visualization needs. Most existing analysis methods for genome-wide gene expression profiles are sequential programs using greedy algorithms and require subjective human decisions. Recently, Zhu et al. proposed a parallel random matrix theory (RMT) based approach for generating transcriptional networks, which is much more resistant to high levels of noise in the data [9] and requires no human intervention. Modern GPUs are designed to be used efficiently for general-purpose computing [1] and are vastly superior to CPUs [6] in terms of threading performance. Our kernel functions running on the GPU utilize routines from both the Compute Unified Basic Linear Algebra Subroutines (CUBLAS) library and the Compute Unified Linear Algebra (CULA) library, which implements the Linear Algebra Package (LAPACK). Our experimental results show that the GPU program can achieve an average speed-up of 2-3 times on some simulated datasets.
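As a concrete illustration of how such libraries fit together, the sketch below computes a gene-gene Pearson correlation matrix, the usual input to RMT-based network construction, with a single cuBLAS call. This is a minimal sketch, not the thesis code: it assumes the expression matrix has already been z-scored column-wise, and all names are illustrative. The eigenvalue step that RMT requires would then be handled by a LAPACK-style routine such as those CULA provides.

```cuda
// Hedged sketch: gene-gene correlation via cuBLAS. Assumes dX holds an
// m x n expression matrix (m samples, n genes) in column-major order
// whose columns are already z-scored (zero mean, unit variance).
#include <cublas_v2.h>
#include <cuda_runtime.h>

// C = (1/m) * X^T * X  ->  n x n Pearson correlation matrix on the GPU.
void correlation_matrix(cublasHandle_t handle,
                        const float* dX, float* dC, int m, int n)
{
    const float alpha = 1.0f / m, beta = 0.0f;
    // SGEMM with op(A) = X^T (n x m) and op(B) = X (m x n): result is n x n.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, n, m, &alpha, dX, m, dX, m, &beta, dC, n);
}
```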
72

Real-time point-based simulation of deformable bodies through general-purpose programming on a graphics device

William Santos Almeida, Mozart 31 January 2010 (has links)
Point-based physical simulation models have, over the years, become an alternative to mesh-based approaches: besides enabling the simulation of more realistic physical behaviour, they can do so more efficiently than mesh-based models. This master's thesis presents the development of a solution for real-time point-based simulation of deformable objects through the implementation of a meshless technique known as Point-Based Animation. The technique uses only points as simulation units, removing the need to maintain connectivity information between them through edges. This approach enables more efficient simulation of certain behaviours, such as topology changes. The simulation model is therefore well suited to parallelization and can be optimized for real-time execution. A parallel version of the algorithm was implemented in this thesis in order to turn the interactive frame rates of the sequential version into real-time ones. A comparative analysis between an implementation on a general-purpose processor (CPU) and one on a graphics card (GPU), using the massively parallel approach provided by the NVIDIA Compute Unified Device Architecture (CUDA), shows a significant performance gain. The GPU was able to simulate ten simultaneous objects at a higher frame rate (FPS) than the CPU running a single object, despite some problems with precision and stability, partly due to limitations imposed by the CUDA architecture.
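A minimal sketch of why the meshless formulation parallelizes so well: each thread advances one point independently, with no edge connectivity to consult. The spring force toward the rest position is an assumed stand-in for the elasticity actually derived in Point-Based Animation; none of this is the thesis's code.

```cuda
// Hedged sketch: per-point integration, one thread per point. The toy
// restoring force below is a placeholder for the real strain-based force.
#include <cuda_runtime.h>

__global__ void integratePoints(float3* pos, float3* vel,
                                const float3* rest,
                                float stiffness, float damping,
                                float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Restoring force toward the rest configuration, minus damping.
    float3 f;
    f.x = stiffness * (rest[i].x - pos[i].x) - damping * vel[i].x;
    f.y = stiffness * (rest[i].y - pos[i].y) - damping * vel[i].y;
    f.z = stiffness * (rest[i].z - pos[i].z) - damping * vel[i].z;
    // Explicit Euler step; every point is independent of the others.
    vel[i].x += dt * f.x;  vel[i].y += dt * f.y;  vel[i].z += dt * f.z;
    pos[i].x += dt * vel[i].x;
    pos[i].y += dt * vel[i].y;
    pos[i].z += dt * vel[i].z;
}
```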
73

GPU-Accelerated Frame Pre-Processing for Use in Low Latency Computer Vision Applications

Tarassu, Jonas January 2017 (has links)
Attention to low-latency computer vision and video processing applications is growing every year, not least for VR and AR applications. In this thesis the Contrast Limited Adaptive Histogram Equalization (CLAHE) and Radial Distortion algorithms are implemented using both CUDA and OpenCL to determine whether these types of algorithms are suitable for GPU implementations when low latency is of utmost importance. The result is an implementation of the block version of the CLAHE algorithm, which utilizes the built-in interpolation hardware that resides on the GPU to reduce block effects, and an implementation of the Radial Distortion algorithm that corrects a 1920x1080 frame in 0.3 ms. Further, this thesis concludes that the GPU platform may be a good choice if the data to be processed can be transferred to, and possibly from, the GPU fast enough, and that the choice of compute API is mostly a matter of taste.
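A hedged sketch of what a GPU radial-distortion kernel of this kind can look like. The two-coefficient radial model and all parameter names are assumptions, not the thesis implementation, and a latency-tuned version would sample `src` through a texture object so the GPU's interpolation hardware performs the bilinear filtering.

```cuda
// Hedged sketch: one thread per output pixel gathers from the distorted
// source image. Nearest-neighbour sampling keeps the example short.
__global__ void undistort(const unsigned char* src, unsigned char* dst,
                          int w, int h, float k1, float k2)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // Normalized coordinates relative to the image center.
    float cx = w * 0.5f, cy = h * 0.5f;
    float nx = (x - cx) / cx, ny = (y - cy) / cy;
    float r2 = nx * nx + ny * ny;
    float scale = 1.0f + k1 * r2 + k2 * r2 * r2;  // radial model
    // Source pixel to gather from.
    int sx = (int)(cx + nx * scale * cx);
    int sy = (int)(cy + ny * scale * cy);
    dst[y * w + x] = (sx >= 0 && sx < w && sy >= 0 && sy < h)
                         ? src[sy * w + sx] : 0;
}
```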
74

GPGPU-LOD (General Purpose Graphics Processing Unit - Level Of Detail) : A graphics-card-driven terrain LOD algorithm

Jansson, Karl January 2009 (has links)
Today's graphics cards are built from powerful multiprocessors, which makes them excellent for handling parallelizable problems that would take a long time on an ordinary processor, such as level-of-detail or ray tracing. This report presents a parallelizable level-of-detail algorithm for terrain height maps and implements it for graphics cards using NVIDIA's CUDA API. The algorithm divides the full height map into sections, which are further divided into smaller blocks that are computed in parallel on the graphics card. The algorithm computes vertex positions, normals, and texture coordinates for each block and sends the data to the application, which creates vertex and index buffers and renders the sections. The implementation's performance and triangle-reduction capability are analyzed with two different culling methods: one that culls triangles at the section level and one that culls at the block level. The results show that it is very advantageous to let the graphics card handle level-of-detail computations in this way, even though memory copying over the graphics bus is a problem, taking up roughly eighty-five percent of the total time for processing a section. The computations themselves take very little time, and there is ample room for further work toward as good a distribution of triangles over the terrain area as possible.
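A sketch, under assumed buffer layouts and names, of the per-block work the report describes: one thread per vertex reads the height map and emits a position plus a central-difference normal (texture coordinates are omitted for brevity).

```cuda
// Hedged sketch: fill one terrain block's vertex data in parallel.
#include <cuda_runtime.h>

__global__ void buildBlock(const float* height, float3* posOut,
                           float3* nrmOut, int mapW, int mapH,
                           int blockX0, int blockY0,
                           int blockW, int blockH, float cellSize)
{
    int lx = blockIdx.x * blockDim.x + threadIdx.x;  // x within terrain block
    int ly = blockIdx.y * blockDim.y + threadIdx.y;  // y within terrain block
    if (lx >= blockW || ly >= blockH) return;
    int x = blockX0 + lx, y = blockY0 + ly;
    if (x >= mapW || y >= mapH) return;

    float hC = height[y * mapW + x];
    // Clamped central differences give the surface normal.
    float hL = height[y * mapW + max(x - 1, 0)];
    float hR = height[y * mapW + min(x + 1, mapW - 1)];
    float hD = height[max(y - 1, 0) * mapW + x];
    float hU = height[min(y + 1, mapH - 1) * mapW + x];

    int idx = ly * blockW + lx;                      // output slot in block
    posOut[idx] = make_float3(x * cellSize, hC, y * cellSize);
    float3 n = make_float3(hL - hR, 2.0f * cellSize, hD - hU);
    float inv = rsqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    nrmOut[idx] = make_float3(n.x * inv, n.y * inv, n.z * inv);
}
```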
75

Parallel parsing of context-free grammars

Skrzypczak, Piotr January 2012 (has links)
During the last decade, increasing interest in parallel programming could be observed, driven by the tendency to build microprocessors as multicore units that can perform instructions simultaneously. A popular and widely used example of such a platform is the graphics processing unit (GPU). Its ability to perform calculations simultaneously is being investigated as a way to improve the performance of complex algorithms. GPUs now have architectures that allow programmers and software developers to use their computational power in the same way as a CPU's. One of these architectures is the CUDA platform, developed by NVIDIA. The aim of this thesis is to implement the parallel CYK algorithm, one of the most popular parsing algorithms, on the CUDA platform, gaining a significant speed-up in comparison with the sequential CYK algorithm. The thesis presents a review of existing parallelisations of the CYK algorithm, descriptions of the implemented algorithms (a basic version and a few modifications), and an experimental stage that tests these versions on various inputs to establish which gives the best performance. Three versions of the algorithm are presented, of which one was selected as the best (giving about 10 times better performance on the longest inputs). A limited version of the algorithm is also presented that gives the best performance (up to 100 times better in comparison with the non-limited sequential version) but requires certain conditions to be fulfilled by the grammar. The motivation for the thesis is to use the developed algorithm in GCS.
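The parallel structure exploited here can be made concrete with a small sketch: all CYK table cells for a fixed substring length are independent, so one kernel launch per length fills a whole diagonal of the table. The bitmask representation (at most 32 nonterminals, grammar in Chomsky normal form) and all names are illustrative assumptions, not the thesis's implementation.

```cuda
// Hedged sketch: one thread per substring start position; the set of
// nonterminals deriving each substring is packed into a 32-bit mask.
// `rules` holds productions A -> B C as index triples.
struct Rule { int a, b, c; };

__global__ void cykLength(unsigned* table, const Rule* rules, int nRules,
                          int n, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // substring start
    if (i + len > n) return;
    unsigned mask = 0;
    for (int k = 1; k < len; ++k) {                    // split point
        unsigned left  = table[(k - 1) * n + i];           // T[i, k]
        unsigned right = table[(len - k - 1) * n + i + k]; // T[i+k, len-k]
        for (int r = 0; r < nRules; ++r)
            if ((left >> rules[r].b & 1) && (right >> rules[r].c & 1))
                mask |= 1u << rules[r].a;
    }
    table[(len - 1) * n + i] = mask;
}
// Host side launches this kernel once per length = 2 .. n; cells of the
// same length are independent, which is where the parallelism comes from.
```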
76

Simulation of Modelica Models on the CUDA Architecture

Östlund, Per January 2009 (has links)
Simulations are very important for many reasons, and finding ways of accelerating them is therefore interesting. In this thesis, the feasibility of automatically generating simulation code that can be executed on NVIDIA's CUDA architecture is studied for a limited set of Modelica models. The OpenModelica compiler, an open-source Modelica compiler, was extended for this purpose to generate CUDA code. This thesis presents an overview of the CUDA architecture and looks at the problems that need to be solved to generate efficient simulation code for it. Methods of finding parallelism in models that can be exploited on the highly parallel CUDA architecture are shown, and methods of efficiently using the available memory spaces on the architecture are also presented. This thesis shows that it is possible to generate CUDA simulation code for the chosen set of Modelica models. It also shows that for models with a large amount of parallelism it is possible to get significant speedups compared with simulation on a normal processor, and a speedup of 4.6 was reached for one of the models used in the thesis. Several suggestions are also given on how the CUDA architecture can be used even more efficiently for Modelica simulations.
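A sketch of the shape such generated code can take, assuming an explicit-Euler solver: one thread per state variable, with a placeholder derivative standing in for the compiler-generated model equations. This is an illustration of the parallelization pattern, not OpenModelica's actual output.

```cuda
// Hedged sketch: the compiler would inline the model's equations here;
// the coupled-decay dynamics below are an assumed placeholder.
__device__ float deriv(const float* x, int i, float t)
{
    return -x[i] + (i > 0 ? 0.1f * x[i - 1] : 0.0f);
}

__global__ void eulerStep(const float* x, float* xNext,
                          int nStates, float t, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nStates) return;
    xNext[i] = x[i] + dt * deriv(x, i, t);  // each state advances independently
}
```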
77

High-performance particle simulation using CUDA

Kalms, Mikael January 2015 (has links)
Over the past 15 years, modern PC graphics cards (GPUs) have changed from being pure graphics accelerators into parallel computing platforms. Several new parallel programming languages have emerged, including NVIDIA's parallel programming language for GPUs (CUDA). This report explores two related problems in parallel: How well suited is CUDA for implementing algorithms that utilize non-trivial data structures? And, how does one develop a complex algorithm that uses a CUDA system efficiently? A guide for how to implement complex algorithms in CUDA is presented. Simulation of a dense 2D particle system is chosen as the problem domain for algorithm optimization. Two algorithmic optimization strategies are presented which reduce the computational workload when simulating the particle system. The strategies can either be used independently, or combined for slightly improved results. Finally, the resulting implementations are benchmarked against a simpler implementation on a normal PC processor (CPU) as well as a simpler GPU algorithm. A simple GPU solution is shown to run at least 10 times faster than a simple CPU solution. An improved GPU solution can then yield another 10 times speed-up, while sacrificing some accuracy.
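One classic workload-reduction strategy for dense particle systems is uniform-grid binning, and a hedged sketch of it follows; whether it matches either of the report's two strategies is an assumption. The `cellStart`/`cellEnd` arrays are presumed to come from a prior sort of particles by cell index (e.g. with `thrust::sort_by_key`).

```cuda
// Hedged sketch: each particle only tests neighbours in the 3x3 cells
// around it instead of all N particles.
#include <cuda_runtime.h>

__global__ void collide(const float2* pos, float2* force,
                        const int* cellStart, const int* cellEnd,
                        int gridW, int gridH, float cellSize,
                        float radius, float stiffness, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int cx = (int)(pos[i].x / cellSize), cy = (int)(pos[i].y / cellSize);
    float2 f = make_float2(0.0f, 0.0f);
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || nx >= gridW || ny < 0 || ny >= gridH) continue;
            int c = ny * gridW + nx;
            for (int j = cellStart[c]; j < cellEnd[c]; ++j) {
                if (j == i) continue;
                float rx = pos[i].x - pos[j].x, ry = pos[i].y - pos[j].y;
                float d = sqrtf(rx * rx + ry * ry);
                if (d > 0.0f && d < 2.0f * radius) {  // overlapping pair
                    float push = stiffness * (2.0f * radius - d) / d;
                    f.x += push * rx;  f.y += push * ry;
                }
            }
        }
    force[i] = f;
}
```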
78

Optimization of a particle-detection process from video images through parallelization

Silva Leal, Juan Sebastián January 2012 (has links)
Object detection from images has become a very powerful tool for different disciplines. The Laboratorio de Materia Fuera del Equilibrio of the Physics Department of the Faculty has a C implementation of the χ² method, using ad-hoc libraries compatible with Mac OS X, to detect particles in quasi-two-dimensional granular systems composed of thousands of 1 mm diameter steel particles; it can detect the particles in a 1-megapixel image in about 10 seconds. However, these images come from videos that need to be analyzed, and a single working session may require analyzing around 100,000 images in total, so processing and subsequently analyzing these video images takes several days. It was therefore necessary to speed up this image processing and produce a robust solution. The main objective of the thesis was to reduce particle-detection times by building new software based on the previous one, facilitating future extensions and using the maximum computing power available in the laboratory. The student designed a distributed system that uses all the available computers for image processing, reimplementing the software from C to C++ with design patterns to facilitate future extensions, and with threads to increase performance. CUDA technology was also added for data processing, considerably reducing execution times. As the final result of the thesis, a speedup of about 5x was obtained through distribution of the computational load, parallel processes, execution threads, and CUDA technology; a more robust solution, extensible to future changes or new processing algorithms, was also achieved. The whole research process, from data acquisition to hypothesis validation, takes a long time, and particle detection is only one part of the computation that must be performed; it is therefore advisable to implement the other data-processing stages in faster, non-interpreted languages such as C++ and, where possible, to distribute the computation and use CUDA.
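A sketch of the parallel heart of a χ²-style detector, with all names assumed: each thread scores the squared mismatch between an ideal-particle template and the patch around one pixel, and detections are local minima of the resulting map. Production implementations typically evaluate this via FFT-based convolution; the direct form below is for clarity only.

```cuda
// Hedged sketch: brute-force chi-squared map, one thread per pixel.
__global__ void chi2Map(const float* img, const float* tmpl,
                        float* out, int w, int h, int tw, int th)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float acc = 0.0f;
    for (int ty = 0; ty < th; ++ty)
        for (int tx = 0; tx < tw; ++tx) {
            int ix = x + tx - tw / 2, iy = y + ty - th / 2;
            if (ix < 0 || ix >= w || iy < 0 || iy >= h) continue;
            float d = img[iy * w + ix] - tmpl[ty * tw + tx];
            acc += d * d;  // accumulate squared template mismatch
        }
    out[y * w + x] = acc;  // local minima of this map mark particles
}
```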
79

Parallelization of Push-based System for Molecular Simulation Data Analysis with GPU

Akhmedov, Iliiazbek 19 October 2016 (has links)
Modern simulation systems generate large amounts of data, which consequently have to be analyzed in a timely fashion. Traditional database management systems follow the principle of pulling the needed data, processing it, and then returning the results. This approach is optimized by means of caching, storing the data in different structures, or sacrificing some precision in the results to make processing faster. For queries that require analysis of the whole dataset, this design has the following disadvantages: considerable overhead from the traditional random-I/O disk framework while reading the simulation output files, and low data throughput that consequently results in long latency; and if any indexing is used to optimize selections, the overhead of storing the indexes becomes too big as well. Besides, indexing also delays write operations, and since most queries work with the entire dataset, indexing loses its point. A different approach to this problem, the Push-based System for Molecular Simulation Data Analysis proposed in a previous paper, processes a network of queries in two primary steps: i) it uses a traditional scan-based I/O framework to load the data from files into main memory, and then ii) the data is pushed through a network of queries, which filter it and collect all the needed information, increasing efficiency and data throughput. This is a considerable advantage for the analysis of molecular simulation data, which normally requires the whole dataset to be processed by the queries. In this paper, we propose an improved version of the Push-based System for Molecular Simulation Data Analysis. Its major difference from the previous design is the use of the GPU for the actual processing stage of the data flow. Using the same scan-based I/O framework, the data is pushed through the network of queries, which are processed by the GPU; due to the nature of scientific simulation data, this gives a big advantage in processing it faster and more easily (explained further in later sections). The old approach relied on custom data structures, such as a quad-tree for the calculation of histograms, to speed up processing; these involved loss of data and assumptions about the nature of the data. In the new approach, thanks to the high performance of GPU processing, such custom data structures were largely unnecessary, with no loss in precision or performance.
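A minimal sketch of pushing one resident frame through a single filter operator on the GPU; the record layout and predicate are assumptions, not the paper's code. A full query network would chain several such operators over the same in-memory frame.

```cuda
// Hedged sketch: every thread tests one record against the operator's
// predicate and appends matches via an atomic counter.
struct Atom { float x, y, z; int type; };

__global__ void filterFrame(const Atom* frame, int n,
                            Atom* out, int* outCount, int wantedType)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (frame[i].type == wantedType) {      // this operator's predicate
        int slot = atomicAdd(outCount, 1);  // claim an output slot
        out[slot] = frame[i];
    }
}
```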
80

Addressing software-managed cache development effort in GPGPUs

Lashgar, Ahmad 29 August 2017 (has links)
GPU computing promises very high performance per watt for highly parallelizable workloads. Various programming models have been developed to utilize the computational power of GPGPUs. Low-level programming models provide full control over GPU resources and allow programmers to achieve the peak performance of the chip. In contrast, high-level programming models hide GPU-specific programming details and allow programmers to mainly express parallelism. The compiler then parses the parallelization annotations and translates them to a low-level programming model. This saves tremendous development effort and improves productivity, though often at the cost of sacrificing performance. In this dissertation, we investigate the limitations of high-level programming models in achieving performance near that of low-level models. Specifically, we study the performance and productivity gap between the high-level OpenACC and low-level CUDA programming models, and aim to reduce the performance gap while maintaining the productivity advantages. We start this study by developing our in-house OpenACC compiler. Our compiler, called IPMACC, translates OpenACC for C to CUDA and uses the system compiler to generate GPU binaries. We develop various micro-benchmarks to understand GPU structure and implement a more efficient OpenACC compiler. Using IPMACC, we evaluate the performance and productivity gap between a wide set of OpenACC and CUDA kernels. From our findings, we conclude that one of the major reasons behind the big performance gap between OpenACC and CUDA is CUDA's flexibility in exploiting the GPU software-managed cache. Identifying this key benefit of low-level CUDA, we follow three effective paths to utilizing the software-managed cache as CUDA does, but at a lower development effort (e.g. using OpenACC instead). In the first path, we explore the possibility of employing existing OpenACC directives to utilize the software-managed cache. Specifically, the cache directive is devised in the OpenACC API standard to allow the use of the software-managed cache in GPUs. We introduce an efficient implementation of the OpenACC cache directive that performs very close to CUDA. However, we show that the use of the cache directive is limited and that the directive may not offer the full functionality associated with the software-managed cache as it exists in CUDA. In the second path, we build on our observations on the limitations of the cache directive and propose a new OpenACC directive, called the fcw directive, to address the shortcomings of the cache directive while maintaining OpenACC's productivity advantages. We show that the fcw directive overcomes the cache directive's limitations and narrows the performance gap between CUDA and OpenACC significantly. In the third path, we propose a fully automated hardware/software approach, called TELEPORT, for software-managed cache programming. On the software side, TELEPORT statically analyzes CUDA kernels and identifies opportunities for utilizing the software-managed cache. The required information is passed to the GPU via API calls. Based on this information, on the hardware side, TELEPORT prefetches the data to the software-managed cache at runtime. We show that TELEPORT can improve performance by 32% on average, while lowering the development effort by 2.5X, compared to the hand-written CUDA equivalent.
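For reference, a minimal sketch of the CUDA facility the dissertation targets: a thread block stages a tile into the software-managed cache (shared memory) once, synchronizes, and then reuses it, which is the pattern the cache and proposed fcw directives aim to express from OpenACC. The 1D three-point stencil is an illustrative assumption, not taken from the dissertation.

```cuda
// Hedged sketch: launch with blockDim.x == TILE. Each block stages a
// tile plus halo into shared memory, then all threads read from it.
#define TILE 256

__global__ void stencil3(const float* in, float* out, int n)
{
    __shared__ float s[TILE + 2];            // tile plus two halo cells
    int g = blockIdx.x * TILE + threadIdx.x; // global index
    int l = threadIdx.x + 1;                 // local index inside the tile
    s[l] = (g < n) ? in[g] : 0.0f;           // stage interior element
    if (threadIdx.x == 0)
        s[0] = (g > 0) ? in[g - 1] : 0.0f;   // left halo
    if (threadIdx.x == TILE - 1)
        s[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;  // right halo
    __syncthreads();                         // whole block sees the tile
    if (g < n)
        out[g] = 0.25f * s[l - 1] + 0.5f * s[l] + 0.25f * s[l + 1];
}
```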
