Global ETD Search

841	[en] HETEROGENEOUS PARALLELIZATION OF QUANTUM-INSPIRED LINEAR GENETIC PROGRAMMING / [pt] PARALELIZAÇÃO HETEROGÊNEA DA PROGRAMAÇÃO GENÉTICA LINEAR COM INSPIRAÇÃO QUÂNTICA
CRISTIAN ENRIQUE MUNOZ VILLALOBOS, 27 October 2016
One of the main challenges of computer science is to get a computer to carry out a task without being told how to do it. Genetic Programming (GP) addresses this challenge by starting from a high-level statement of what needs to be done and automatically creating a computer program that solves the problem. This dissertation develops an extension of the Quantum-Inspired Linear Genetic Programming (QILGP) model that improves its efficiency and effectiveness in the search for solutions.
First, the algorithm is structured as a heterogeneous parallel system that targets acceleration on Graphics Processing Units (GPUs) and execution on multiple CPU processors, maximizing processing speed while using optimized techniques to reduce data transfer times. Second, graphic visualization techniques are used to interpret the structure and the processes that the algorithm evolves, in order to understand the effect of parallelizing the model and the behavior of QILGP.
The heterogeneous parallelization relies on parallel computing resources such as the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), which are vital when working with multiple processes: the topology must provide multiple levels of parallelism so that transferring data to the local computer where the visualization is projected does not delay the process. In addition to plotting the QILGP parameters to show their behavior over the generations, a 3D visualization for evolutionary robotics cases is presented, using the Bullet SDK for dynamic simulation and the OGRE graphics engine for rendering. This visualization is used as a tool for a case study in this dissertation.
HIGH-PERFORMANCE COMPUTING; HETEROGENEOUS PARALLELIZATION; LINEAR GENETIC PROGRAMMING; GPU COMPUTING; GRAPHIC VISUALIZATION
842	Enhancing GPGPU Performance through Warp Scheduling, Divergence Taming and Runtime Parallelizing Transformations
Anantpur, Jayvant P, January 2017
The use of Graphics Processing Units (GPUs) to accelerate general-purpose applications has grown tremendously, primarily because of the computing power GPUs offer and the emergence of programming languages such as CUDA and OpenCL. A typical GPU consists of several hundred to a few thousand Single Instruction Multiple Data (SIMD) cores, organized as tens of Streaming Multiprocessors (SMs), each having several SIMD cores that operate in lock-step, offering a few TeraFLOPS of performance in a single socket. SMs execute instructions from groups of consecutive threads, called warps. At each cycle, an SM schedules a warp from a group of active warps and can context switch among the active warps to hide various stalls. However, factors such as global memory latency, divergence among warps of a thread block (TB), branch divergence among threads of a warp (control divergence), and the number of active warps can significantly impair the ability of a warp scheduler to hide stalls, which reduces the speedup of applications running on the GPU. Further, applications containing loops with potential cross-iteration dependences do not use the available resources (SIMD cores) effectively and hence suffer in performance.
In this thesis, we propose several mechanisms which address these issues and enhance the performance of GPU applications through efficient warp scheduling, taming of branch and warp divergence, and runtime parallelization. First, we propose RLWS, a Reinforcement Learning (RL) based Warp Scheduler which uses unsupervised learning to schedule warps based on the current state of the core and the long-term benefits of scheduling actions. As the design space involving the state variables used by the RL and the RL parameters (such as learning and exploration rates, reward and penalty values, etc.) is large, we use a Genetic Algorithm to identify the useful subset of state variables and RL parameter values. We evaluated the proposed RL-based scheduler using the GPGPU-SIM simulator on a large number of applications from the Rodinia, Parboil, CUDA-SDK and GPGPU-SIM benchmark suites. Our RL-based implementation achieved an average speedup of 1.06x over the Loose Round Robin (LRR) strategy and 1.07x over the Two-Level (TL) strategy. A salient feature of RLWS is that it is robust, i.e., it performs nearly as well as the best performing warp scheduler, consistently across a wide range of applications. Using the insights obtained from RLWS, we designed PRO, a heuristic warp scheduler which, in addition to hiding the long latencies of certain operations, reduces the waiting time of warps at synchronization points. Evaluation of the proposed algorithm using the GPGPU-SIM simulator on a diverse set of applications showed an average speedup of 1.07x over the LRR warp scheduler and 1.08x over the TL warp scheduler.
In the second part of the thesis, we address problems due to warp and branch divergence. First, many GPU kernels exhibit warp divergence for reasons such as different amounts of work, cache misses, and thread divergence. We also observed that some kernels contain code which is redundant across TBs, i.e., all TBs execute the code identically and hence compute the same results. To improve the performance of such kernels, we propose a solution based on the concept of virtual TBs and loop-independent code motion. We propose the code transformations needed to let one virtual TB execute the kernel code for multiple real TBs. We evaluated this technique using the GPGPU-SIM simulator on a diverse set of applications and observed an average improvement of 1.08x over the LRR and 1.04x over the Greedy Then Old (GTO) warp scheduling algorithms. Second, branch divergence causes diverging branches to be serialized so that only one control flow path executes at a time. The existing stack-based hardware mechanism for reconverging threads causes duplicate execution of code for unstructured control flow graphs (CFGs). We propose a simple and elegant transformation to convert an unstructured CFG to a structured CFG. The transformation eliminates duplicate execution of user code while incurring only a linear increase in the number of basic blocks and in the number of instructions. We implemented the proposed transformation at the PTX level using the Ocelot compiler infrastructure and demonstrate that the proposed technique is effective in handling the performance problem due to divergence in unstructured CFGs.
Our third proposal is to enable efficient execution of loops with indirect memory accesses that can potentially cause cross-iteration dependences. Such dependences are hard to detect using existing compilation techniques. We present an algorithm that computes the cross-iteration dependences in such loops at run time, using both the CPU and the GPU. It uses the compute capabilities of the GPU to collect the memory accesses performed by the iterations. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Experimental evaluation on real hardware (NVIDIA GPUs) reveals that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross-iteration dependences.
Computer Graphics; Graphics Processing Units (GPU); Runtime Parallelization Transformation; Warp Scheduler; Taming Warp Divergence; Warp Scheduling; Reinforcement Learning; Control Divergence; Warp Divergence; Computer Science
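The levelization step in the third proposal of record 842 can be illustrated with a small sketch. It assumes a loop of the form a[w[i]] = f(a[r[i]]) with index arrays w and r known only at run time; the loop shape and the names levelize and group_by_level are hypothetical and chosen for illustration. The thesis collects the memory accesses on the GPU, whereas this sketch does everything sequentially on the host.

```python
def levelize(write_idx, read_idx):
    """Assign a level to each iteration of
        for i: a[write_idx[i]] = f(a[read_idx[i]])
    so that iterations sharing a level touch no conflicting locations."""
    last_write = {}  # location -> level of the latest iteration that wrote it
    last_read = {}   # location -> highest level of any iteration that read it
    levels = []
    for w, r in zip(write_idx, read_idx):
        level = 0
        if r in last_write:   # flow dependence: read after an earlier write
            level = max(level, last_write[r] + 1)
        if w in last_write:   # output dependence: write after an earlier write
            level = max(level, last_write[w] + 1)
        if w in last_read:    # anti dependence: write after an earlier read
            level = max(level, last_read[w] + 1)
        levels.append(level)
        last_write[w] = level
        last_read[r] = max(last_read.get(r, -1), level)
    return levels


def group_by_level(levels):
    """Return the iteration indices grouped by level, lowest level first."""
    groups = {}
    for i, lv in enumerate(levels):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]


if __name__ == "__main__":
    w = [0, 1, 2, 3, 1]
    r = [4, 0, 5, 2, 6]
    print(group_by_level(levelize(w, r)))
```

Each inner list can be run as one parallel step, and the steps themselves run in order. Iterations that neither read nor write a location touched by an earlier iteration stay in level 0.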
843	An efficient GPU-based implementation of recursive linear filters and its application to realistic real-time re-synthesis for interactive virtual worlds / Uma implementação eficiente de filtros lineares recursivos e sua aplicação a re-síntese realistica em tempo real para mundos virtuais interativos
Trebien, Fernando, January 2009
Many researchers have been interested in exploring the vast computational power of recent graphics processing units (GPUs) in applications outside the graphics domain. This trend towards General-Purpose GPU (GPGPU) development has intensified with the release of non-graphics APIs for GPU programming, such as NVIDIA's Compute Unified Device Architecture (CUDA). With them, the GPU has been widely studied for solving many 2D and 3D signal processing problems involving linear algebra and partial differential equations, but little attention has been given to 1D signal processing, which can also demand significant computational resources. It had previously been demonstrated that the GPU can be used for real-time signal processing, but several processes did not fit the GPU architecture well.
In this work, a new technique for implementing a digital recursive linear filter on the GPU is presented. To the best of my knowledge, the solution presented here is the first in the literature. A comparison between this approach and an equivalent CPU-based implementation demonstrates that, when used in a real-time audio processing system, this technique supports processing of two to four times more coefficients than was previously possible. The technique also eliminates the need to process the filter on the CPU, avoiding additional memory transfers between CPU and GPU, when one wishes to use the filter in conjunction with other processes such as sound synthesis.
The recursion established by the filter equation makes it difficult to obtain an efficient implementation on a parallel architecture like the GPU. Since every output sample is computed in parallel, the necessary values of previous output samples are unavailable at the time the computation takes place. One could force the GPU to execute the filter sequentially using synchronization, but this would be a very inefficient use of GPU resources. This problem is solved by unrolling the equation and "trading" dependences on samples close to the current output for dependences on earlier ones, thus requiring only the storage of a limited number of previous output samples. The resulting equation contains convolutions, which are then efficiently computed using the FFT. The implementation of the technique is general and works for any time-invariant recursive linear filter.
To demonstrate its relevance, an LPC filter is designed to synthesize, in real time, realistic sounds of collisions between objects made of different materials, such as glass, plastic, and wood. The synthesized sounds can be parameterized by the objects' materials, velocities and collision angles. Despite its flexibility, this approach uses very little memory, requiring only a few coefficients to represent the impulse response of the filter for each material. This makes the approach an attractive alternative to traditional CPU-based techniques that only play back pre-recorded sounds.
Electronic music; Computer music; Signal processing; Digital filters; Linear filters; Recursive filters; Sound synthesis; Sound effects; GPU; GPGPU; Real-time systems
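The unrolling idea in record 843 can be shown on the simplest recursive filter. This is a minimal sketch, assuming a first-order filter y[n] = x[n] + a*y[n-1], which is not the general filter of the thesis. Unrolling it over a block gives y[n0+j] = sum_{k=0..j} a^k x[n0+j-k] + a^(j+1) y[n0-1], so every output in the block depends only on inputs and on one stored output from the previous block. The function names and the block size are hypothetical, and the convolution is evaluated directly here rather than with the FFT.

```python
def filter_sequential(x, a):
    """Reference implementation: y[n] = x[n] + a * y[n-1], with y[-1] = 0."""
    y, prev = [], 0.0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def filter_blocked(x, a, block):
    """Same filter, unrolled so that outputs within a block are independent.

    Each output is a convolution of the inputs with the truncated impulse
    response h[k] = a**k, plus a term carrying the last output of the
    previous block. The list comprehension over j has no dependence between
    its elements, so it could be evaluated in parallel.
    """
    h = [a ** k for k in range(block)]
    y = []
    carry = 0.0  # y[n0 - 1]: the only stored previous output
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        out = [
            sum(h[k] * chunk[j - k] for k in range(j + 1)) + a ** (j + 1) * carry
            for j in range(len(chunk))
        ]
        y.extend(out)
        carry = out[-1]
    return y


if __name__ == "__main__":
    signal = [1.0, 0.5, -0.25, 2.0, 0.0, 1.5, -1.0]
    print(filter_sequential(signal, 0.5))
    print(filter_blocked(signal, 0.5, block=3))
```

Larger blocks expose more parallel work per step but make the convolution longer, which is where an FFT-based convolution becomes attractive.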
845	[en] INTERACTIVE VOLUME VISUALIZATION OF UNSTRUCTURED MESHES USING PROGRAMMABLE GRAPHICS CARDS / [pt] VISUALIZAÇÃO VOLUMÉTRICA INTERATIVA DE MALHAS NÃO-ESTRUTURADAS UTILIZANDO PLACAS GRÁFICAS PROGRAMÁVEIS
RODRIGO DE SOUZA LIMA ESPINHA, 15 June 2005
Volume visualization is an important technique for the exploration of complex three-dimensional data sets, such as the results of numerical analyses using the finite element method. The efficient application of this technique to unstructured meshes has been an important area of research in the past few years. There are two basic methods to visualize volumetric data: surface extraction and direct volume rendering. In the first, iso-surfaces of a scalar field are explicitly extracted. In the second, which is the one used in this work, scalar data are classified by a transfer function, which maps the scalar values to color and opacity, to be visualized.
With the evolution of personal computer graphics cards (GPUs), new techniques for interactive volume visualization of unstructured meshes have been developed. The new algorithms take advantage of the acceleration and programmability of these cards, whose processing power increases at a faster rate than that of conventional processors (CPUs). This work evaluates and compares two GPU-based algorithms for volume visualization of unstructured meshes: view-independent cell projection (VICP) and ray-tracing. In addition, two adaptations of the studied algorithms are proposed. For the cell projection algorithm, we propose a GPU data structure that eliminates the high cost of CPU-to-GPU data transfer. For the ray-tracing algorithm, we propose integrating the transfer function on the GPU, which improves the quality of the generated image and allows the transfer function to be changed interactively.
TRANSFER FUNCTIONS; VOLUME RENDERING; INTERACTIVE VISUALIZATION; GPU PROGRAMMING; UNSTRUCTURED MESHES
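The role of the transfer function in direct volume rendering (record 845) can be sketched in a few lines. This is only an illustration under simplifying assumptions: a piecewise-linear transfer function given by hypothetical control points, scalar samples already taken along one ray, and plain front-to-back compositing that ignores the length of each ray segment. It is not the VICP or ray-tracing algorithm evaluated in the dissertation, both of which work on unstructured meshes on the GPU.

```python
def transfer_function(s, points):
    """Map a scalar value to (r, g, b, alpha) by linear interpolation.

    points is a list of (scalar, (r, g, b, alpha)) control points sorted by
    scalar value; values outside the range are clamped to the end points.
    """
    if s <= points[0][0]:
        return points[0][1]
    for (s0, c0), (s1, c1) in zip(points, points[1:]):
        if s <= s1:
            t = (s - s0) / (s1 - s0)
            return tuple(u + t * (v - u) for u, v in zip(c0, c1))
    return points[-1][1]


def composite_ray(samples, points):
    """Accumulate color and opacity front to back along one ray."""
    color = [0.0, 0.0, 0.0]
    alpha = 0.0
    for s in samples:
        r, g, b, a = transfer_function(s, points)
        weight = (1.0 - alpha) * a  # contribution not yet hidden by nearer samples
        color = [c + weight * v for c, v in zip(color, (r, g, b))]
        alpha += weight
        if alpha >= 0.999:  # early ray termination: the ray is already opaque
            break
    return tuple(color), alpha


if __name__ == "__main__":
    # Hypothetical transfer function: transparent blue at 0, opaque red at 1.
    tf = [(0.0, (0.0, 0.0, 1.0, 0.0)), (1.0, (1.0, 0.0, 0.0, 1.0))]
    print(composite_ray([0.5, 0.5], tf))
```

Because the control points are plain data, changing them and re-running the compositing is all that an interactive transfer-function edit requires in this simplified setting.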
846	HPSM: uma API em linguagem c++ para programas com laços paralelos com suporte a multi-CPUs e Multi-GPUs / HPSM: a c++ API for parallel loops programs Supporting multi-CPUs and multi-GPUs
Di Domenico, Daniel, 21 December 2016
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
Parallel architectures have been ubiquitous for some time now. The same cannot be said of parallel programs, which are more complex to write than ordinary programs. The difficulty increases when the programming also involves accelerators such as GPUs, which demand tools with very specific resources. Although there are programming models that make it easier to write parallel applications that exploit accelerators, we are not aware of APIs that allow implementing programs with parallel loops that can be processed simultaneously by multiple CPUs and multiple GPUs.
This work presents HPSM, a high-level C++ API that aims to make writing parallel programs for multi-CPU and multi-GPU architectures easier and more efficient. The goal is to gain performance by summing the available resources. HPSM is based on parallel loops and reductions implemented by three parallel back-ends: Serial, OpenMP and StarPU. Our hypothesis is that scientific applications can exploit heterogeneous processing on multi-CPU and multi-GPU systems to achieve better performance than with accelerators alone.
Comparisons with other parallel programming interfaces showed that HPSM can reduce the size of a multi-CPU and multi-GPU program by more than 50%. Using the new API can affect program performance: experiments showed an overhead that varies with the application and reaches at most 16.4%. The experimental results confirmed the hypothesis: the N-Body, Hotspot and CFD applications achieved gains using only CPUs and only GPUs, and the combination of multi-CPU and multi-GPU outperformed accelerators (GPUs) alone.
C++ API; Parallel programming; Parallel loops; Heterogeneous computing; GPU
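The idea in record 846 of a parallel reduction that can be dispatched to interchangeable back-ends can be sketched as follows. This is not the HPSM API, which is a C++ library whose interface is not shown in the abstract. The names split_range and parallel_reduce and the two back-end labels are hypothetical, and Python threads stand in for the OpenMP and StarPU back-ends purely to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, parts):
    """Split range(n) into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def parallel_reduce(n, body, combine, identity, backend="serial", workers=4):
    """Reduce body(i) for i in range(n) with an associative `combine`.

    The same loop body runs unchanged on either back-end; only the way the
    chunks are scheduled differs.
    """
    def run_chunk(bounds):
        acc = identity
        for i in range(*bounds):
            acc = combine(acc, body(i))
        return acc

    chunks = split_range(n, workers)
    if backend == "serial":
        partials = [run_chunk(c) for c in chunks]
    elif backend == "threads":
        with ThreadPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(run_chunk, chunks))
    else:
        raise ValueError(f"unknown backend: {backend}")

    result = identity
    for partial in partials:
        result = combine(result, partial)
    return result


if __name__ == "__main__":
    add = lambda p, q: p + q
    print(parallel_reduce(100, lambda i: i * i, add, 0, backend="threads"))
```

A heterogeneous scheduler would give chunks of different sizes to devices of different speeds. The equal split above is the simplest possible policy and is used only to keep the sketch short.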
847	Método de detecção massiva de sistemas LS-MIMO empregando o método de Richardson modificado em aceleradores gráficos / Massive detection method for LS-MIMO systems using the modified Richardson method on graphics accelerators Costa, Haulisson Jody Batista da 22 August 2016 (has links) The evolution of wireless communication brings support for multiple devices that simultaneously transmit at high data rates. Emerging Large-Scale MIMO (LS-MIMO) techniques make it possible to exploit the resulting capacity increase in next-generation communication systems. Although the multipath channel itself provides spectral efficiency, the computational complexity of LS-MIMO detection becomes prohibitive in systems with large numbers of antennas. To increase the number of antennas that can be used in MIMO detection, we propose adapting the Richardson iterative method to LS-MIMO by combining the random matrix theory established by Marchenko-Pastur with parallel execution. The Richardson method imposes conditions on the linear system that restrict its general applicability. However, knowledge of the channel makes it possible to establish adaptations that satisfy the method's requirements. 
The channel effects described by Marchenko-Pastur allow the method to be modified so that its stability is tied to the number of antennas: increasing the antenna ratio both improves convergence and reduces the relative number of iterations. Furthermore, sharing execution with the decoding method distributes the workload and yields a throughput that surpasses other detection methods. Comparative analysis against other parallel proposals showed an unprecedented capacity for large-scale detection. The proposal also demonstrated a level of adaptability that allows the trade-off between data throughput and complexity to be varied. In this sense, with transmission rates comparable to other proposals, the method can increase its throughput by 150% at the cost of 1.74 dB in signal-to-noise ratio. Based on this approach, the system outperforms other strategies executed on GPU, indicating a significant increase in parallel transmission capacity. The proposal is also scalable: performance on the order of Gb/s can be reached by adding further devices (GPUs) operating in parallel in the system. LS-MIMO system Richardson method Marchenko-Pastur law Massive detection Wireless communication GPU algorithm accelerators
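The link between the Richardson iteration and the Marchenko-Pastur law can be illustrated with a minimal sketch. It is not the modified method of the thesis: it shows the classical Richardson update x ← x + ω(b − Ax) on the normal equations, for a real-valued channel with i.i.d. unit-variance entries (an assumption made here). Under that assumption the eigenvalues of HᵀH concentrate between n_rx(1 − √β)² and n_rx(1 + √β)² with β = n_tx/n_rx, so the textbook relaxation ω = 2/(λ_min + λ_max) simplifies to 1/(n_rx + n_tx). The function name `richardson_detect` and its parameters are hypothetical.

```python
import random


def richardson_detect(H, y, iterations=60):
    """Approximately solve (H^T H) x = H^T y with the classical Richardson iteration.

    Assumption: H has i.i.d. real entries with unit variance, so the
    Marchenko-Pastur law bounds the spectrum of the Gram matrix.
    """
    n_rx, n_tx = len(H), len(H[0])
    # Gram matrix A = H^T H and matched-filter output b = H^T y.
    A = [[sum(H[r][i] * H[r][j] for r in range(n_rx)) for j in range(n_tx)]
         for i in range(n_tx)]
    b = [sum(H[r][i] * y[r] for r in range(n_rx)) for i in range(n_tx)]
    # lambda_min + lambda_max ~ 2 * (n_rx + n_tx) by Marchenko-Pastur,
    # so omega = 2 / (lambda_min + lambda_max) = 1 / (n_rx + n_tx).
    omega = 1.0 / (n_rx + n_tx)
    x = [0.0] * n_tx
    for _ in range(iterations):
        # Each residual component is independent of the others, which is
        # what makes the update attractive for parallel hardware.
        residual = [b[i] - sum(A[i][j] * x[j] for j in range(n_tx))
                    for i in range(n_tx)]
        x = [x[i] + omega * residual[i] for i in range(n_tx)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    H = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(128)]
    sent = [rng.choice([-1.0, 1.0]) for _ in range(4)]
    received = [sum(H[r][k] * sent[k] for k in range(4)) for r in range(128)]
    print(richardson_detect(H, received))
```

The more receive antennas there are relative to transmitters, the narrower the eigenvalue band becomes and the faster this iteration contracts, which matches the abstract's remark that a larger antenna ratio improves convergence.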
848	An efficient GPU-based implementation of recursive linear filters and its application to realistic real-time re-synthesis for interactive virtual worlds / Uma implementação eficiente de filtros lineares recursivos e sua aplicação a re-síntese realística em tempo real para mundos virtuais interativos Trebien, Fernando January 2009 (has links) Many researchers have been interested in exploring the vast computational power of recent graphics processing units (GPUs) in applications outside the graphics domain. This trend towards General-Purpose GPU (GPGPU) development has been intensified by the release of non-graphics APIs for GPU programming, such as NVIDIA's Compute Unified Device Architecture (CUDA). With them, the GPU has been widely studied for solving many 2D and 3D signal processing problems involving linear algebra and partial differential equations, but little attention has been given to 1D signal processing, which can also demand significant computational resources. 
It has previously been demonstrated that the GPU can be used for real-time signal processing, but several processes did not fit the GPU architecture well. In this work, a new technique for implementing a digital recursive linear filter on the GPU is presented. To the best of my knowledge, the solution presented here is the first in the literature. A comparison between this approach and an equivalent CPU-based implementation demonstrates that, when used in a real-time audio processing system, this technique supports processing two to four times more coefficients than was previously possible. The technique also eliminates the need to process the filter on the CPU, avoiding additional memory transfers between CPU and GPU, when the filter is used in conjunction with other processes such as sound synthesis. The recursion established by the filter equation makes it difficult to obtain an efficient implementation on a parallel architecture like the GPU. Since every output sample is computed in parallel, the necessary values of previous output samples are unavailable at the time the computation takes place. One could force the GPU to execute the filter sequentially using synchronization, but this would be a very inefficient use of GPU resources. This problem is solved by unrolling the equation and "trading" dependences on samples close to the current output for dependences on earlier ones, thus requiring only the storage of a limited number of previous output samples. The resulting equation contains convolutions, which are then computed efficiently using the FFT. The implementation of the proposed technique is general and works for any time-invariant recursive linear filter. To demonstrate its relevance, an LPC filter is designed to synthesize, in real time, realistic sounds of collisions between objects made of different materials, such as glass, plastic, and wood. 
The synthesized sounds can be parameterized by the objects' materials, velocities and collision angles. Despite its flexibility, this approach uses very little memory, requiring only a few coefficients to represent the impulse response of the filter for each material. This makes the approach an attractive alternative to traditional CPU-based techniques that only play back pre-recorded sounds. Electronic music Computer music Digital filters Linear filters Recursive filters Signal processing Sound synthesis Sound effects GPU GPGPU Realtime systems
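The unrolling idea can be illustrated on the simplest recursive filter, y[n] = x[n] + a·y[n−1]. This is a simplified sketch, not the thesis's implementation: it assumes a first-order filter, a hypothetical block size, and direct summation where the thesis uses FFT-based convolution. Unrolling the recursion m + 1 times gives y[n0+m] = Σ_{j=0..m} a^j·x[n0+m−j] + a^(m+1)·y[n0−1], so every output in a block depends only on input samples and on one stored output from the previous block.

```python
def recursive_filter(x, a):
    """Sequential reference: y[n] = x[n] + a * y[n-1], with y[-1] = 0."""
    y, prev = [], 0.0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def blockwise_filter(x, a, block=4):
    """Same filter with the recursion unrolled across each block.

    Within a block, no output depends on another output of that block,
    so each one could be computed by its own GPU thread.
    """
    y = []
    carry = 0.0  # y[n0 - 1]: the only previous output that must be stored
    for n0 in range(0, len(x), block):
        size = min(block, len(x) - n0)
        out = [
            # Convolution of the input with the truncated impulse response a^j,
            # plus the contribution of the stored output from the previous block.
            sum(a ** j * x[n0 + m - j] for j in range(m + 1)) + a ** (m + 1) * carry
            for m in range(size)
        ]
        y.extend(out)
        carry = out[-1]
    return y


if __name__ == "__main__":
    signal = [1.0, 0.0, 0.0, 0.5, -0.25, 2.0, 0.0, 1.0, 3.0]
    print(recursive_filter(signal, 0.5))
    print(blockwise_filter(signal, 0.5, block=4))
```

Both functions produce the same output up to floating-point rounding; the block-wise form exposes the parallelism at the price of extra multiply-adds, which is where a fast convolution becomes worthwhile.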
849	Projeto e avaliação de algoritmos paralelos para sistemas Multicore e Manycore aplicados no processamento de documentos / Design and evaluation of parallel algorithms for Multicore and Manycore systems applied on document processing Freitas, Mateus Ferreira e 30 August 2017 (has links) Several applications process documents in different ways, aiming to filter them, organize them or learn from them. Nowadays, considerable computational power is needed to do that efficiently, due to the large and growing number of documents. Documents are usually independent of each other, which makes it easier to use parallelism to speed up this processing. This work explores three problems: active learning, learning to rank (L2R) and top-k search. Using the parallelism of multicore CPUs and manycore GPUs (Graphics Processing Units), parallel algorithms were proposed and evaluated for each problem, and implemented with the OpenMP and CUDA APIs. For the active learning problem a multicore algorithm was proposed, which achieved a speedup of 10.8x in the best case with 12 threads. 
The proposed manycore version achieved a speedup of 128x over the serial version, and a solution with 4 GPUs achieved a speedup of 3.5x over 1 GPU. For the L2R problem a manycore algorithm was proposed, which follows a thread-block approach based on the Combinadic concept and uses a cache with fingerprints to speed up processing. The best-case speedups were 508x over the serial version, 9x over a GPU baseline, and 4x over our single-GPU solution when using 4 GPUs. Compared with a version without Combinadic, the speedup was 4.4x with both versions using 1 GPU and 3.9x using 4 GPUs. These solutions used bitmap structures to speed up the creation of association rules. For top-k search, a serial and a multicore solution were implemented from a state-of-the-art manycore algorithm for exact searches. These implementations served as baselines for our extension of this algorithm, which adds multi-GPU support, group searches and intra-block load balancing. The speedups were 2.7x over the original algorithm, 17x over the serial version, 4x over the multicore version, and 4x over our version when using 4 GPUs. Parallelism Learning to rank GPU Association rules Active learning Top-K search
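The Combinadic concept mentioned for the L2R algorithm refers to the combinatorial number system, in which every integer rank in [0, C(n, k)) corresponds to exactly one k-element combination. The sketch below is a generic illustration of that mapping, not the thesis's CUDA kernel; the function name `unrank_combination` is hypothetical. A mapping like this is useful in parallel code because a thread can derive its own combination directly from its index, without enumerating the preceding ones.

```python
from math import comb


def unrank_combination(rank, n, k):
    """Return the k-subset of range(n) with the given combinadic rank.

    The rank is decomposed greedily as
    rank = C(c_k, k) + C(c_(k-1), k-1) + ... + C(c_1, 1) with c_k > ... > c_1 >= 0,
    and the elements are returned in decreasing order.
    """
    if not 0 <= rank < comb(n, k):
        raise ValueError("rank out of range for the given n and k")
    combo = []
    for i in range(k, 0, -1):
        # Find the largest c such that C(c, i) <= remaining rank.
        c = i - 1
        while comb(c + 1, i) <= rank:
            c += 1
        combo.append(c)
        rank -= comb(c, i)
    return combo


if __name__ == "__main__":
    # Each "thread index" maps to a distinct 3-element combination of 5 items.
    for thread_index in range(comb(5, 3)):
        print(thread_index, unrank_combination(thread_index, 5, 3))
```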
850	Real-time Object Recognition on a GPU Pettersson, Johan January 2007 (has links) Shape-Based Matching (SBM) is a known method for 2D object recognition that is rather robust against illumination variations, noise, clutter and partial occlusion. The objects to be recognized can be translated, rotated and scaled. The translation of an object is determined by evaluating a similarity measure for all possible positions (similar to cross-correlation). The similarity measure is based on dot products between normalized gradient directions at edges. Rotation and scale are determined by evaluating all possible combinations, which spans a huge search space. A resolution pyramid is used to form a heuristic for the search, which makes real-time performance attainable. In standard SBM, a model consisting of normalized edge gradient directions is constructed for every possible combination of rotation and scale. We have avoided this by using (bilinear) interpolation in the search gradient map, which greatly reduces the amount of storage required. SBM is highly parallelizable by nature, and with our suggested improvements it becomes well suited for running on a GPU. This has been implemented and tested, and the results clearly outperform those of our reference CPU implementation (by factors in the hundreds). The approach is also very scalable and benefits from future devices without extra effort. Extensive evaluation material and tools for evaluating object recognition algorithms have been developed, and the implementation is evaluated and compared to two commercial 2D object recognition solutions. The results show that the method is very powerful when dealing with the distortions listed above and competes well with the commercial solutions. object recognition pattern matching GPU CUDA transformation rotation scale noise illumination occlusion clutter evaluation
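The similarity measure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's GPU implementation: it handles translation only (no rotation, scale, pyramid or interpolation), the model format `(d_row, d_col, mx, my)` with non-negative offsets is hypothetical, and the gradient maps are plain nested lists. Because both the model and the image gradients are normalized to unit length before the dot product, scaling the image gradients, as a global illumination change would, leaves the score unchanged.

```python
import math


def unit(gx, gy):
    """Normalize a gradient vector; a zero gradient stays zero."""
    mag = math.hypot(gx, gy)
    return (gx / mag, gy / mag) if mag > 0.0 else (0.0, 0.0)


def similarity(model, grad_x, grad_y, row, col):
    """Mean dot product between model directions and image gradient directions.

    model: list of (d_row, d_col, mx, my), where (mx, my) is a unit vector
    and the offsets are non-negative (assumed format).
    """
    total = 0.0
    for d_row, d_col, mx, my in model:
        ex, ey = unit(grad_x[row + d_row][col + d_col],
                      grad_y[row + d_row][col + d_col])
        total += mx * ex + my * ey
    return total / len(model)


def best_translation(model, grad_x, grad_y):
    """Score every position where the model fits and return (score, row, col)."""
    height, width = len(grad_x), len(grad_x[0])
    max_dr = max(point[0] for point in model)
    max_dc = max(point[1] for point in model)
    # Every position is scored independently, which is why the search
    # maps naturally onto one GPU thread per candidate position.
    return max((similarity(model, grad_x, grad_y, r, c), r, c)
               for r in range(height - max_dr)
               for c in range(width - max_dc))
```

A score of 1 means every model direction matches the image exactly; occluded or cluttered points lower the mean without dominating it, which is one way to read the robustness claims in the abstract.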
