  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
41

Aceleração por GPU de serviços em sistemas robóticos focado no processamento de tempo real de nuvem de pontos 3D / GPU Acceleration of robotic systems services focused in real-time processing of 3D point clouds

Christino, Leonardo Milhomem Franco 03 February 2016 (has links)
This master's project, abbreviated as GPUServices, fits in the context of research and development of processing methods for three-dimensional sensor data applied to mobile robotics. These methods, called services in this project, include 3D point cloud preprocessing algorithms with data segmentation, the separation and identification of planar zones (ground, roads), and the detection of elements of interest (curbs, obstacles). Because of the large amount of data to be processed in a short time, these services use GPU parallel processing to carry out partial or complete processing of the data. The application area in focus is the provision of services for an ADAS system of autonomous, intelligent vehicles, which pushes the services toward real-time processing because of the autonomous-driving context. The services are divided into stages according to the project methodology, always seeking acceleration through the inherent parallelism. A preliminary stage organizes a development environment able to coordinate all the technologies used, to exploit parallelism, and to integrate with the system already running on the autonomous car. The first service intelligently extracts the data from the sensor used in the project (a multi-beam Velodyne laser scanner), which is necessary because of numerous reading errors and the raw reception format, and delivers the data in a matrix structure. The second service, in cooperation with the first, corrects the spatial instability of the sensor caused by a mounting base that is not perfectly parallel to the ground and by the damping of the vehicle. The third service separates the environment into semantic zones, such as the ground plane and the regions below and above the ground. The fourth service, similar to the previous one, performs a pre-segmentation of the street curbs. The fifth service segments the objects of the environment, separating them into blobs. The sixth service uses all the previous ones to detect and segment the street curbs. The data received from the sensor form a 3D point cloud with great potential for exploiting parallelism based on the locality of the information. The main difficulty, however, is the high data rate delivered by the sensor (around 700,000 points per second), which is the motivation of this project: to use the full potential of the sensor efficiently through GPU parallel programming, providing the user with data-processing services that make the implementation of ADAS systems easier and faster.
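As an illustration only (not code from the thesis), the sketch below shows the kind of per-point CUDA kernel the third service describes, classifying each point of the matrix-structured cloud as below, on, or above the ground plane. The height-array layout, thresholds, and launch configuration are assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical per-point labels for the semantic-zone service.
enum Zone : unsigned char { BELOW_GROUND = 0, GROUND = 1, ABOVE_GROUND = 2 };

// One thread per point: classify by height against an estimated ground plane.
// 'z' holds point heights laid out as the rings-by-azimuth matrix produced by
// the first service; 'groundZ' and 'tol' are assumed calibration values.
__global__ void classifyZones(const float* z, Zone* zone, int n,
                              float groundZ, float tol)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float dz = z[i] - groundZ;
    if (dz < -tol)      zone[i] = BELOW_GROUND;
    else if (dz > tol)  zone[i] = ABOVE_GROUND;
    else                zone[i] = GROUND;
}

int main()
{
    const int n = 1 << 20;                 // roughly one sensor sweep of points
    float* z;  Zone* zone;
    cudaMallocManaged(&z, n * sizeof(float));
    cudaMallocManaged(&zone, n * sizeof(Zone));
    for (int i = 0; i < n; ++i) z[i] = (i % 3 == 0) ? 0.02f : 1.5f;  // toy heights

    int threads = 256, blocks = (n + threads - 1) / threads;
    classifyZones<<<blocks, threads>>>(z, zone, n, 0.0f, 0.1f);
    cudaDeviceSynchronize();

    printf("point 0 zone = %d, point 1 zone = %d\n", zone[0], zone[1]);
    cudaFree(z); cudaFree(zone);
    return 0;
}
```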
42

Implementing method of moments on a GPGPU using Nvidia CUDA

Virk, Bikram 12 April 2010 (has links)
This thesis concentrates on the algorithmic aspects of the Method of Moments (MoM) and Locally Corrected Nyström (LCN) numerical methods in electromagnetics. The data dependency in each step of the algorithm is analyzed in order to implement a parallel version that can harness the processing power of a General Purpose Graphics Processing Unit (GPGPU). The GPGPU programming model provided by NVIDIA's Compute Unified Device Architecture (CUDA) is described to introduce the software tools that enable C code to run on the GPGPU. Various optimizations, such as performing a partial update at every iteration, inter-block synchronization, and the use of shared memory, yield an overall speedup of approximately 10. The study also brings out the strengths and weaknesses of implementing methods such as Crout's LU decomposition and triangular matrix inversion on a GPGPU architecture. The results suggest future directions of study for different algorithms and their effectiveness in a parallel-processor environment. The performance data collected show how different features of the GPGPU architecture can be exploited to yield higher speedup.
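The thesis text here does not reproduce its kernels; the following minimal sketch only illustrates the kind of shared-memory tiling it mentions, applied to a dense matrix-vector product of the sort that appears in an MoM solve. The matrix size, tile width, and kernel name are assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define TILE 128

// y = A*x for a dense NxN row-major matrix. Each thread owns one row; the
// vector x is staged through shared memory tile by tile so every thread in a
// block reuses the same loads instead of reading x from global memory n times.
__global__ void matvecTiled(const float* A, const float* x, float* y, int n)
{
    __shared__ float xs[TILE];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        int col = t + threadIdx.x;
        xs[threadIdx.x] = (col < n) ? x[col] : 0.0f;   // cooperative load
        __syncthreads();
        if (row < n)
            for (int k = 0; k < TILE && t + k < n; ++k)
                acc += A[row * n + t + k] * xs[k];
        __syncthreads();
    }
    if (row < n) y[row] = acc;
}

int main()
{
    const int n = 512;
    float *A, *x, *y;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n * n; ++i) A[i] = 1.0f / n;
    for (int i = 0; i < n; ++i)     x[i] = 1.0f;

    matvecTiled<<<(n + TILE - 1) / TILE, TILE>>>(A, x, y, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected 1.0)\n", y[0]);   // each row sums n * (1/n) * 1
    cudaFree(A); cudaFree(x); cudaFree(y);
    return 0;
}
```

The design point is simply that staging x in shared memory replaces per-thread global reads with one cooperative load per tile per block, which is the sort of reuse the abstract credits for part of its speedup.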
43

Paralelização em CUDA/GLSL do algoritmo SIFT para reconhecimento de íris / A CUDA/GLSL parallelization of SIFT algorithm for iris recognition

Luiz Fernando Rosalba Telles de Sousa 28 February 2012 (has links)
Conselho Nacional de Desenvolvimento Científico e Tecnológico / This work studies the feasibility of a parallel implementation of the scale-invariant feature transform (SIFT) algorithm for iris identification. The code was implemented using the Compute Unified Device Architecture (CUDA) parallel computing platform and the OpenGL Shading Language (GLSL). The algorithm was tested on three eye and iris databases: the Noisy Visible Wavelength Iris Image Database (UBIRIS), Michal-Libor, and CASIA. Tests were performed to measure the processing time needed to verify whether an individual is present in a database, to evaluate the efficiency of the search algorithms implemented in GLSL and CUDA, and to find calibration values that improve the positioning and distribution of the keypoints in the region of interest (the iris) and the robustness of the final program.
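The record does not include the CUDA source; as a rough sketch of the search step such a matcher needs, the kernel below performs a brute-force nearest-neighbour match of 128-dimensional SIFT descriptors, one thread per query descriptor. The data layout and sizes are illustrative assumptions, and a real matcher would also apply Lowe's ratio test between the best and second-best match.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cfloat>

#define DESC_LEN 128   // standard SIFT descriptor length

// One thread per query descriptor: find the index of the closest gallery
// descriptor by squared Euclidean distance.
__global__ void matchDescriptors(const float* query, int nq,
                                 const float* gallery, int ng,
                                 int* bestIdx)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nq) return;

    float best = FLT_MAX;
    int   arg  = -1;
    for (int g = 0; g < ng; ++g) {
        float d = 0.0f;
        for (int k = 0; k < DESC_LEN; ++k) {
            float diff = query[q * DESC_LEN + k] - gallery[g * DESC_LEN + k];
            d += diff * diff;
        }
        if (d < best) { best = d; arg = g; }
    }
    bestIdx[q] = arg;
}

int main()
{
    const int nq = 64, ng = 256;
    float *q, *g;  int* best;
    cudaMallocManaged(&q, nq * DESC_LEN * sizeof(float));
    cudaMallocManaged(&g, ng * DESC_LEN * sizeof(float));
    cudaMallocManaged(&best, nq * sizeof(int));
    for (int i = 0; i < nq * DESC_LEN; ++i) q[i] = 0.5f;
    for (int i = 0; i < ng * DESC_LEN; ++i) g[i] = (i < DESC_LEN) ? 0.5f : 0.0f;

    matchDescriptors<<<(nq + 63) / 64, 64>>>(q, nq, g, ng, best);
    cudaDeviceSynchronize();
    printf("query 0 matched gallery descriptor %d\n", best[0]);  // expects 0
    cudaFree(q); cudaFree(g); cudaFree(best);
    return 0;
}
```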
45

Um filtro adaptativo de alto desempenho instanciado do algoritmo GAADT para o processamento de sinais de eletrocardiograma / A high-performance adaptive filter instantiated from the GAADT algorithm for the processing of electrocardiogram signals

MACIEL, Andrilene Ferreira 09 September 2015 (has links)
Hardware implementations of genetic algorithms (GAs) inspired by Holland's model for signal filtering aim to speed up the convergence of these algorithms by implementing in hardware the modules considered a bottleneck in a software implementation. However, these modules still suffer from the same problems: the representation of the chromosome, the dependence of the genetic operators on the representation adopted for the chromosome and the population, and the loss of chromosomes carrying features relevant to the solution of the problem to which the GA is applied. This thesis presents an adaptive filter that adopts the genetic algorithm based on abstract data types (GAADT) for the processing of ECG signals, called CGAADT, implemented on the GPU/CUDA platform. The CGAADT developed provides a high-performance solution. The choice of this genetic algorithm model is justified by the fact that GAADT was defined to avoid the problems of the GA models found so far in the evolutionary computation literature. GAADT works with an open architecture that takes into account the dynamics of the environment in which the chromosomes are inserted: its adaptation function searches for the chromosome of the population best adapted to the environment, and if the environment changes, the search is redirected at run time to the chromosome best adapted to the new environment, without interrupting the current execution of GAADT. The results obtained by GAADT are of better quality than those of other GA models because it works with the notion of a dominant gene, the information in the chromosomes that is relevant to the solution of the problem. This causes an exponential explosion of the GAADT population in the search for a better-adapted chromosome containing as many dominant genes as possible, which can take months of processing and data collection on conventional CPU architectures. A comparative study of the quality of the results obtained by CGAADT and by other models when filtering ECG signals from patients with sinus arrhythmia, atrial flutter, and atrial fibrillation is presented. The experiments evaluated in this study indicate that CGAADT is an optimized version of GAADT that allows all the processing of the genetic algorithm to be performed on the GPU, yielding a reduction in the average total processing time of 17.43% in selection, 1.39% in crossover, 1.12% in mutation, 9.02% in reproduction, and 15.11% in the insertion of offspring into the population. These figures represent a 73.6% reduction in processing time relative to Holland's genetic algorithm. Finally, some relevant considerations on CGAADT are made and some interesting questions are suggested for future work.
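The CGAADT source is not part of this record; purely as an illustration of why the GPU pays off for a genetic algorithm, the sketch below evaluates the adaptation of an entire population in parallel, one thread per chromosome, against a reference signal. The chromosome encoding (plain FIR coefficients), the fitness measure, and all sizes are assumptions and are not the actual GAADT operators.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// One thread per chromosome: adaptation = negative squared error between the
// filter output encoded by the chromosome and a reference (clean) signal.
// A chromosome here is just a vector of FIR coefficients, an illustrative
// stand-in for GAADT's abstract-data-type chromosomes.
__global__ void evaluatePopulation(const float* pop, int popSize, int genes,
                                   const float* noisy, const float* clean,
                                   int samples, float* fitness)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= popSize) return;

    float err = 0.0f;
    for (int t = genes; t < samples; ++t) {
        float y = 0.0f;                       // FIR filter with this chromosome
        for (int k = 0; k < genes; ++k)
            y += pop[c * genes + k] * noisy[t - k];
        float d = y - clean[t];
        err += d * d;
    }
    fitness[c] = -err;                        // higher is better
}

int main()
{
    const int popSize = 1024, genes = 16, samples = 4096;
    float *pop, *noisy, *clean, *fit;
    cudaMallocManaged(&pop,   popSize * genes * sizeof(float));
    cudaMallocManaged(&noisy, samples * sizeof(float));
    cudaMallocManaged(&clean, samples * sizeof(float));
    cudaMallocManaged(&fit,   popSize * sizeof(float));
    for (int i = 0; i < popSize * genes; ++i) pop[i] = 1.0f / genes;
    for (int i = 0; i < samples; ++i) { clean[i] = 1.0f; noisy[i] = 1.0f; }

    evaluatePopulation<<<(popSize + 255) / 256, 256>>>(pop, popSize, genes,
                                                       noisy, clean, samples, fit);
    cudaDeviceSynchronize();
    printf("fitness of chromosome 0 = %f\n", fit[0]);
    cudaFree(pop); cudaFree(noisy); cudaFree(clean); cudaFree(fit);
    return 0;
}
```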
46

Transformations de programme automatiques et source-à-source pour accélérateurs matériels de type GPU / Source-to-Source Automatic Program Transformations for GPU-like Hardware Accelerators

Amini, Mehdi 13 December 2012 (has links)
Since the beginning of the 2000s, the raw performance of processor cores has stopped increasing exponentially. Modern graphics processing units (GPUs) have been designed as arrays of hundreds or thousands of compute units. Their compute capacity quickly led them to be diverted from their original purpose of rendering and used as accelerators for general-purpose computation. However, programming a GPU efficiently for anything other than 3D rendering remains a challenge. The current jungle in the hardware ecosystem is mirrored in the software world, with more and more programming models, languages, and APIs, and no universal solution emerging. This thesis proposes a compiler-based solution to partially address the three "P" properties: Performance, Portability, and Programmability. The goal is to automatically transform a sequential program into an equivalent program accelerated with a GPU. A prototype, Par4All, is implemented and validated by numerous experiments. Programmability and portability are guaranteed by construction, and while the performance is not always at the level of what an expert developer would obtain, it remains excellent over a wide range of kernels and applications. A survey of GPU architectures and of the trends in language and framework design is presented. Data placement between the host and the accelerator is handled without involving the developer. A communication-optimization algorithm is proposed to send data to the GPU as early as possible and keep it there as long as it is not required by the host. Loop transformation techniques are used for kernel code generation, and even well-known, proven ones must be adapted to the constraints imposed by GPUs. They are combined in a coherent way and scheduled within the flow of an interprocedural compiler. Preliminary work on extending the approach to target multiple GPUs is presented.
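Par4All's generated code is not shown in this record; the sketch below only illustrates the kind of rewrite such a source-to-source compiler performs, mapping a sequential loop (kept as a comment) onto an equivalent CUDA kernel plus the host-to-device transfers that the communication-optimization pass tries to hoist and minimize. Names and sizes are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Original sequential loop a compiler like Par4All would start from:
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
// One possible generated kernel: each loop iteration becomes one thread.
__global__ void saxpyKernel(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Host/device transfers: the communication-optimization pass described in
    // the abstract tries to issue these as early as possible and keep the data
    // resident on the GPU for as long as the host does not need it.
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpyKernel<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected 5.0)\n", hy[0]);
    cudaFree(dx); cudaFree(dy); delete[] hx; delete[] hy;
    return 0;
}
```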
47

Un environnement parallèle de développement haut niveau pour les accélérateurs graphiques : mise en œuvre à l’aide d’OPENMP / A high-level parallel development framework for graphic accelerators : an implementation based on OPENMP

Noaje, Gabriel 07 March 2013 (has links)
Graphics processing units (GPUs), originally dedicated to accelerating graphics processing, have a highly parallel structure. Hardware and programming-language innovations opened the GPGPU domain, in which graphics cards are used as compute accelerators for general-purpose HPC applications. The objective of this work is to make these new architectures easier to use for high-performance computing; it follows two complementary directions. The first research axis concerns automatic code transformation, starting from high-level code and transforming it into equivalent low-level code that can run on accelerators. To this end we implemented a code transformer able to handle the parallel "for" loops (simple or nested) of an OpenMP code and to convert them into equivalent CUDA code that is readable enough to allow further optimization. Furthermore, the future of HPC architectures lies in distributed architectures based on nodes equipped with accelerators. To let users exploit multi-GPU nodes, appropriate execution schemes must be put in place. We conducted a comparative study and showed that OpenMP threads make it possible to manage several graphics cards, and the communications within a multi-GPU compute node, efficiently.
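The thesis code is not included in this record; the sketch below shows the execution scheme its comparative study favours, one OpenMP thread per GPU with each thread bound to its own device, using a trivial kernel as a stand-in for the real workload (compiled with something like `nvcc -Xcompiler -fopenmp`).

```cuda
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

// Trivial kernel standing in for the real per-GPU workload.
__global__ void fill(float* data, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main()
{
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);
    if (nGpus == 0) { printf("no CUDA device found\n"); return 0; }

    const int n = 1 << 20;

    // One OpenMP thread per GPU: each thread binds to its own device, launches
    // work there, and handles its own transfers, which is the scheme the thesis
    // found effective for controlling the GPUs of a multi-GPU node.
    #pragma omp parallel num_threads(nGpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        float* d;
        cudaMalloc(&d, n * sizeof(float));
        fill<<<(n + 255) / 256, 256>>>(d, n, (float)dev);
        cudaDeviceSynchronize();

        float first = 0.0f;
        cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);
        printf("GPU %d wrote %.1f\n", dev, first);
        cudaFree(d);
    }
    return 0;
}
```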
49

Efektivní komunikace v multi-GPU systémech / Efficient Communication in Multi-GPU Systems

Špeťko, Matej January 2018 (has links)
After the introduction of CUDA by Nvidia, GPUs became devices capable of accelerating any general-purpose computation. GPUs are designed as parallel processors that possess huge computational power. Modern supercomputers are often equipped with GPU accelerators. Sometimes the performance or the memory capacity of a single GPU is not enough for a scientific application, and the application needs to be scaled across multiple GPUs. During the computation the GPUs need to exchange partial results, and this communication represents computational overhead. For this reason it is important to research methods for effective communication between GPUs, meaning less CPU involvement, lower latency, and shared system buffers. Both inter-node and intra-node communication are examined. The main focus is on Nvidia's GPUDirect technologies and CUDA-aware MPI. Subsequently, the k-Wave toolbox for simulating the propagation of acoustic waves is introduced; this application is accelerated using CUDA-aware MPI.
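The k-Wave integration itself is not shown here; the sketch below only illustrates the basic CUDA-aware MPI idea the thesis builds on: with a CUDA-aware MPI installation (for example, Open MPI built with CUDA support), a device pointer can be passed directly to MPI_Send/MPI_Recv, avoiding explicit staging through host memory. It assumes exactly two ranks (run with, e.g., `mpirun -np 2`).

```cuda
#include <cuda_runtime.h>
#include <mpi.h>
#include <cstdio>

// Two ranks exchange a device buffer directly. With a CUDA-aware MPI build the
// device pointer is handed straight to MPI_Send/MPI_Recv; a non-aware build
// would require an explicit copy to a host buffer first.
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float* dbuf;
    cudaMalloc(&dbuf, n * sizeof(float));

    if (rank == 0) {
        cudaMemset(dbuf, 0, n * sizeof(float));   // stands in for a partial result
        MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats directly into GPU memory\n", n);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```

Whether the transfer actually bypasses the host depends on the MPI build and on GPUDirect support in the node, which is exactly the design space the thesis examines.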
50

A Massively Parallel Algorithm for Cell Classification Using CUDA

Schmidt, Samuel January 2015 (has links)
No description available.
