Global ETD Search

191	Analysis of Hardware Usage Of Shuffle Instruction Based Performance Optimization in the Blinds-II Image Quality Assessment Algorithm January 2017 (has links) abstract: With the advent of GPGPU, many applications are being accelerated using the CUDA programming paradigm. Speedups of around 10x to 100x can be achieved simply by porting an application to the GPU and running the parallel portion of the code on its many-core SIMT (single instruction, multiple thread) architecture. For optimal performance, however, all GPU resources must be used efficiently and the latencies in the application minimized. This requires monitoring the hardware usage of the algorithm in order to diagnose the compute and memory bottlenecks in the implementation. This thesis analyzes how a CUDA implementation of the BLIINDS-II algorithm maps onto the underlying GPU hardware and proposes a Kepler-architecture-specific solution that uses the shuffle instruction, via the CUB library, to tackle the two major bottlenecks in the algorithm. Experiments were conducted to show the advantage of using the shuffle instruction over using only shared memory as a buffer to global memory. The new implementation of the BLIINDS-II algorithm using the CUB library achieved a speedup of around 13.7%. / Dissertation/Thesis / Masters Thesis Engineering 2017 Engineering Computer engineering Computer science BLIINDS-II CUDA C++ GPGPU IQA KEPLER NVIDIA
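The register-to-register exchange that entry 191 relies on can be pictured with a small emulation. The sketch below is plain Python, not CUDA and not the thesis's code: it only mimics the data movement of a shuffle-down tree reduction across one 32-lane warp, the pattern that CUDA exposes through its warp shuffle intrinsics and that CUB wraps in its warp-level reductions. The names `shfl_down` and `warp_reduce_sum` are hypothetical and chosen for illustration.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def shfl_down(values, offset):
    """Emulate a shuffle-down: lane i reads the register of lane i + offset.

    Lanes whose source would fall outside the warp keep their own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    assert len(values) == WARP_SIZE
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Every lane adds the value held by the lane `offset` positions above it.
        shifted = shfl_down(lanes, offset)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        offset //= 2
    # Only lane 0 is guaranteed to hold the full sum; upper lanes hold partial garbage.
    return lanes[0]
```

The reduction finishes in five exchange steps (offsets 16, 8, 4, 2, 1) and never touches a shared buffer, which is the property the abstract contrasts with staging data through shared memory.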
192	Implementação e análise de algoritmos para estimação de movimento em processadores paralelos tipo GPU (Graphics Processing Units) / Implementation and analysis of algorithms for motion estimation onto parallels processors type GPU Monteiro, Eduarda Rodrigues January 2012 (has links) The demand for applications that process digital video has attracted attention in industry and academia. Given the high volume of data handled in high-resolution video, video compression is a fundamental tool for reducing the amount of information while maintaining quality, thus enabling transmission and storage. Different video coding standards, such as H.264/AVC, were developed to drive the development of advanced techniques for this purpose. This standard is considered the state of the art because it provides higher coding efficiency than earlier standards (MPEG-4). Among all the innovative tools introduced by the latest video coding standards, Motion Estimation (ME) is the technique that provides the largest share of the coding gains. ME seeks the similarity relation between neighboring frames of a scene. However, these gains come at a high computational cost, which represents the greater part of the total complexity of current encoders. The goal of this work is to accelerate the ME process, mainly when high-resolution video is encoded. This acceleration focuses on the use of a massively parallel platform, the GPU (Graphics Processing Unit). ME block-matching algorithms present a high potential for parallelization and are suitable for implementation on parallel architectures. Accordingly, different algorithms have been proposed to decrease the computational cost of this module. This work presents the implementation and parallelism exploitation of two ME algorithms on GPU, focused on high-definition video encoding and real-time processing. The Full Search (FS) algorithm is known as optimal because it finds the best match by exhaustively searching between frames. The fast Diamond Search (DS) algorithm significantly reduces ME complexity while keeping video quality close to that of FS. By exploiting the maximum inherent parallelism of FS and DS and the parallel processing capability of GPUs, this work presents a method to map these algorithms onto GPU using the CUDA architecture (Compute Unified Device Architecture). For performance evaluation, the CUDA solutions are compared with the respective multi-core (using the OpenMP library) and distributed (using MPI as the supporting infrastructure) versions. All versions were evaluated at different video resolutions, and the results were compared with algorithms from the literature. The proposed GPU implementations show significant performance increases over the H.264/AVC encoder reference software and, moreover, substantial gains over the respective multi-core and distributed versions and over GPGPU alternatives proposed in the literature. Video compression Algorithms Microelectronics Motion estimation Full search Diamond search GPU CUDA
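The exhaustive matching that entry 192 calls Full Search can be sketched in a few lines. This is a sequential Python illustration under stated assumptions, not the thesis's CUDA mapping: frames are lists of rows of integer luma values, the matching cost is the sum of absolute differences (SAD, a common choice that the abstract does not name), and the function names and parameters are hypothetical.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between a block of `cur` and one of `ref`."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[by + dy][bx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Return (best_cost, (mvx, mvy)) for the block at (bx, by) in `cur`."""
    height, width = len(ref), len(ref[0])
    best = None
    for mvy in range(-search_range, search_range + 1):
        for mvx in range(-search_range, search_range + 1):
            rx, ry = bx + mvx, by + mvy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, size)
            # Keep the first candidate with the strictly lowest cost.
            if best is None or cost < best[0]:
                best = (cost, (mvx, mvy))
    return best
```

Every candidate position is scored independently of the others, which is why this search lends itself to one-thread-per-candidate or one-block-per-macroblock mappings on a GPU. Diamond Search would replace the two nested loops with a small pattern of candidates that moves toward the lowest cost.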
193	Um estudo aplicado de paralelismo para o problema do subgrafo planar de peso máximo / An applied study using parallelism for the maximum weight planar subgraph problem Coelho, Vinícius de Sousa 27 April 2018 (has links) Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / The Maximum Weight Planar Subgraph Problem (MWPSP) consists of identifying a planar subgraph of maximum weight in a given edge-weighted graph. This work proposes new heuristic solutions, mainly using Graphics Processing Units (GPUs), based on local transformations of the graph topology that consist of vertex and edge insertion and relocation moves. Sequential and parallel implementations were built and applied to various numerical instances with promising results. For a 100-vertex instance, one of the approaches requires only 25 seconds of execution, more than 200 times faster than its corresponding sequential version. In terms of solution quality, the proposed solutions obtained better results than state-of-the-art proposals. Planarity MWPSP Parallelism CUDA
195	Um pipeline para renderização fotorrealística de tempo real com ray tracing para a realidade aumentada / A pipeline for real-time photorealistic rendering with ray tracing for augmented reality Lemos de Almeida Melo, Diego 31 January 2012 (has links) Augmented Reality is a research field concerned with techniques for integrating virtual information with the real world. Some Augmented Reality applications require photorealism, in which virtual elements are inserted into the real scene so coherently that the user cannot distinguish the virtual from the real. Several techniques exist for synthesizing 3D scenes, among them ray tracing. It is an algorithm based on basic concepts of optics whose main characteristic is high visual quality at a high computational cost, which used to restrict its use to offline applications. However, with the growth in the computational power of GPUs, the algorithm has become viable for real-time applications, mainly because it can be massively parallelized. With this in mind, this dissertation proposes a pipeline for real-time photorealistic rendering using ray tracing in Augmented Reality applications. The ray tracer used was the Real Time Ray Tracer, or RT2, by Santos et al., which served as the basis for building a pipeline with support for shading, synthesis of several types of materials, occlusion, reflection, refraction, and some camera effects. To obtain a system that runs at interactive rates, the entire rendering pipeline was implemented on the GPU using NVIDIA's CUDA language. Another important contribution of this work is the integration of this pipeline with Microsoft's Kinect device, which makes it possible to obtain real scene information in real time and thus eliminates the need to know in advance the objects belonging to the real scene. RT2 CUDA GPGPU Ray Tracing Visualization Augmented Reality
196	Computação paralela em cluster de GPU aplicado a problema da engenharia nuclear / Parallel computing on a GPU cluster applied to a nuclear engineering problem MORAES, Sérgio Ricardo dos Santos 2012 (has links) Cluster computing has been widely used as a relatively low-cost alternative for parallel processing in scientific applications. With the Message-Passing Interface (MPI) standard, development became even more accessible and widespread in the scientific community. A more recent trend is the use of Graphics Processing Units (GPUs), which are powerful co-processors able to perform hundreds of instructions at the same time, reaching a processing capacity hundreds of times that of a CPU. However, a standard PC does not, in general, accommodate more than two GPUs. This work therefore proposes the development and evaluation of a low-cost hybrid parallel approach to the solution of a typical nuclear engineering problem. The idea is to use cluster parallelism technology (MPI) together with GPU programming (CUDA, Compute Unified Device Architecture) to develop a system that simulates neutron transport through a shield (a slab) using the Monte Carlo method. Programs using MPI and CUDA were developed on a cluster of four computers, each with a quad-core processor and 2 GPUs. Experiments with different configurations, from 1 to 8 GPUs, were run and compared with one another and with the sequential (non-parallel) program. A speedup of about 2,000 times was observed when comparing the 8-GPU parallel version with the sequential version. The results are discussed and analyzed with the aim of highlighting gains and possible limitations of the proposed approach. Parallel computing Monte Carlo method Neutron transport GPU CUDA MPI Shielding
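The kind of simulation entry 196 parallelizes can be illustrated with a deliberately simplified model. The sketch below is sequential Python and rests on assumptions that the abstract does not state: a one-dimensional slab, a single energy group, neutrons entering perpendicular to the left face, and isotropic scattering. The function name `simulate_slab` and all of its parameters are hypothetical.

```python
import math
import random


def simulate_slab(n_histories, thickness, sigma_t, absorb_prob, seed=0):
    """Count transmitted, reflected and absorbed neutrons for a 1D slab.

    sigma_t is the total macroscopic cross section (1/length) and absorb_prob
    is the probability that a collision absorbs the neutron.
    """
    rng = random.Random(seed)
    transmitted = reflected = absorbed = 0
    for _ in range(n_histories):
        x, mu = 0.0, 1.0  # start on the left face, moving straight in
        while True:
            # Distance to the next collision; 1 - random() lies in (0, 1].
            step = -math.log(1.0 - rng.random()) / sigma_t
            x += mu * step
            if x >= thickness:
                transmitted += 1
                break
            if x < 0.0:
                reflected += 1
                break
            if rng.random() < absorb_prob:
                absorbed += 1
                break
            mu = rng.uniform(-1.0, 1.0)  # isotropic scattering: new direction cosine
    return transmitted, reflected, absorbed
```

Each history is independent of every other, so the outer loop can be split across GPU threads and across cluster nodes, with only the three counters combined at the end. For a purely absorbing slab (`absorb_prob=1.0`) the transmitted fraction should approach exp(-sigma_t * thickness), which gives a quick sanity check on the sampling.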
197	Um Pipeline Para Renderização Fotorrealística de Tempo Real com Ray Tracing para Realidade Aumentada / A pipeline for real-time photorealistic rendering with ray tracing for augmented reality Melo, Diego Lemos de Almeida 09 March 2012 (has links) Augmented Reality is a research field concerned with techniques for integrating virtual information with the real world. Some Augmented Reality applications require photorealism, in which virtual elements are inserted into the real scene so coherently that the user cannot distinguish the virtual from the real. Several techniques exist for synthesizing 3D scenes, among them ray tracing. It is an algorithm based on basic concepts of optics whose main characteristic is high visual quality at a high computational cost, which used to restrict its use to offline applications. However, with the growth in the computational power of GPUs, the algorithm has become viable for real-time applications, mainly because it can be massively parallelized. With this in mind, this dissertation proposes a pipeline for real-time photorealistic rendering using ray tracing in Augmented Reality applications. The ray tracer used was the Real Time Ray Tracer, or RT2, by Santos et al., which served as the basis for building a pipeline with support for shading, synthesis of several types of materials, occlusion, reflection, refraction, and some camera effects. To obtain a system that runs at interactive rates, the entire rendering pipeline was implemented on the GPU using NVIDIA's CUDA language. Another important contribution of this work is the integration of this pipeline with Microsoft's Kinect device, which makes it possible to obtain real scene information in real time and thus eliminates the need to know in advance the objects belonging to the real scene. Augmented Reality Visualization Ray Tracing GPGPU CUDA RT²
198	The Implementation of A Fingerprint Enhancement System Based on GPU via CUDA Yang, Kaiyuan, Wang, Fuliang January 2017 (has links) To reduce the long execution time of an existing fingerprint enhancement system, a parallel implementation method based on GPU via CUDA is proposed. First, the necessity and feasibility of employing parallel programming for the whole system are analyzed. Then the pre-processing, global analysis, local analysis and matched filtering stages of the fingerprint enhancement system are each designed, optimized and implemented using parallel computing technology via CUDA. Finally, numerous fingerprints from the FVC2000 databases are tested, and the execution time obtained is compared with that of the CPU-based system. The results show that the GPU-based parallel implementation significantly reduces the execution time. Adaptive Fingerprint Enhancement CUDA Parallel Programming GPU Programming Signal Processing
199	A GPU-based framework for efficient image processing Karlsson, Per January 2014 (has links) This thesis addresses how to design a framework for image processing on the GPU that supports the common environments OpenGL GLSL, OpenCL and CUDA. A generalized view of GPU image processing is presented. The framework is called gpuip; it is implemented in C++ and also wrapped with Python bindings. The framework is cross-platform and works on Windows, Mac OS X and Unix operating systems. The thesis also covers the creation of two executable programs that use the gpuip framework. One of the programs has a graphical user interface and the other is command-line only. Both programs are developed in Python. Performance tests were created to compare the GPU environments against a single-core CPU implementation. All the GPU implementations in the gpuip framework are significantly faster than the CPU when executing the presented test cases. On average, the framework is two orders of magnitude faster than the single-core CPU. gpu image processing c++ glsl opencl cuda Media and Communication Technology
200	Performance analysis of GPGPU and CPU on AES Encryption Neelap, Akash Kiran January 2014 (has links) Advances in computing have led to a tremendous increase in the amount of data generated every minute, which needs to be stored or transferred while maintaining a high level of security. Military and armed forces today rely heavily on computers to store huge amounts of important and secret data that matter greatly for national security. The standard AES encryption algorithm, which is at the heart of almost every application today, provides a high level of security but is time-consuming with the traditional sequential approach. Implementation of AES on GPUs has been an area of ongoing research for some years, but existing implementations are still either inefficient or incomplete and call for optimization to improve performance. Treating the limitations of previous work as a research gap, this paper aims to exploit efficient parallelism on the GPU and on a multi-core CPU in order to make a fair and reliable comparison. It also aims to derive implementation techniques for multi-core CPUs and GPUs that can be used in future implementations. The paper experimentally examines the performance of a CPU and a GPGPU at different levels of optimization using Pthreads, CUDA and CUDA streams. It examines the behaviour of a GPU at different granularity levels and grid dimensions to determine their effect on performance. The results show considerable acceleration on an NVIDIA GPU (Quadro K4000) over single-threaded and multi-threaded implementations on a CPU (Intel® Xeon® E5-1650). AES algorithm CUDA GPU computing Pthreads Computer Sciences Telecommunications Software Engineering

Search results