  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
111

Implementation of an object-detection algorithm on a CPU+GPU target

Berthou, Gautier January 2016 (has links)
Systems like autonomous vehicles may require real-time embedded image processing under hardware constraints. This paper provides directions for designing time- and resource-efficient Haar cascade detection algorithms. It also reviews some software architecture and hardware aspects. The considered algorithms were meant to run on platforms equipped with a CPU and a GPU under power consumption limitations. The main aim of the project was to design and develop real-time underwater object detection algorithms. However, the concepts presented in this paper are generic and can be applied to other domains where object detection is required, face detection for instance. The results show how the solutions outperform OpenCV's cascade detector in terms of execution time while having the same accuracy.
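The Haar cascade detectors discussed above owe their speed to the integral image, which turns any rectangular feature sum into a constant-time lookup. A minimal sketch of that core building block (illustrative only; the function names are not from the thesis):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[y, x] = img[:y, :x].sum()."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle at (x, y), in O(1)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_edge_feature(ii, x, y, w, h):
    """A two-rectangle Haar feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

A cascade evaluates thousands of such features per candidate window, so the O(1) rectangle sum is what makes real-time detection feasible at all.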
112

Resolução de um problema térmico inverso utilizando processamento paralelo em arquiteturas de memória compartilhada / Resolution of an inverse thermal problem using parallel processing on shared memory architectures

Ansoni, Jonas Laerte 03 September 2010 (has links)
Parallel programming has frequently been adopted for developing applications that demand high computational performance. With the advent of multi-core architectures and the existence of several levels of parallelism, it is important to define parallel programming strategies that take advantage of the processing power of these architectures. In this context, this study evaluates the performance of multi-core architectures, mainly graphics processing units (GPUs) and multi-core CPUs, in the resolution of an inverse thermal problem. Parallel algorithms for the GPU and the CPU were developed using, respectively, the shared-memory programming tools NVIDIA CUDA (Compute Unified Device Architecture) and the POSIX Threads API. The preconditioned conjugate gradient method for solving sparse linear systems was implemented entirely in the GPU's global memory using CUDA. The algorithm was evaluated on two GPU models and proved efficient, achieving a speedup of four over the serial version. The POSIX Threads application was evaluated on several multi-core CPUs with distinct microarchitectures. Compiler optimization flags were used to obtain higher performance from the parallelized code and proved very effective in the developed application: with them, the parallelized code ran about twelve times faster than the unoptimized serial version on the same processor. Thus both the approach using the GPU as a generic coprocessor to the CPU and the parallel application employing multi-core CPUs proved to be efficient tools for solving the inverse thermal problem.
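The solver at the heart of this entry, the preconditioned conjugate gradient method, fits in a few lines. A minimal Jacobi-preconditioned version in dense NumPy, for illustration only (the thesis implements the same iteration over sparse matrices in the GPU's global memory):

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned conjugate gradient for a symmetric
    positive-definite system A x = b. M_inv_diag holds 1/diag(A)."""
    x = np.zeros_like(b)
    r = b - A @ x                  # initial residual
    z = M_inv_diag * r             # apply the diagonal preconditioner
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)      # step length along the search direction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p  # next conjugate search direction
        rz = rz_new
    return x
```

Every step is built from matrix-vector products, dot products, and vector updates, which is why the method maps well onto both GPU kernels and multi-threaded CPU code.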
113

[en] ANNCOM: ARTIFICIAL NEURAL NETWORK LIBRARY FOR HIGH PERFORMANCE COMPUTING USING GRAPHIC CARDS / [pt] ANNCOM: BIBLIOTECA DE REDES NEURAIS ARTIFICIAIS PARA ALTO DESEMPENHO UTILIZANDO PLACAS DE VÍDEO

DANIEL SALLES CHEVITARESE 24 May 2019 (has links)
[en] Artificial neural networks have been used quite successfully in prediction, inference, and pattern classification problems. For this reason, several libraries that facilitate modeling and training networks are already available, such as Matlab's NNtool or WEKA. While these libraries are widely used, they have limitations in mobility, flexibility, and performance. The last limitation is due mainly to training, which can take a long time when there is a large amount of data with many attributes. This work proposes the development of an easy-to-use, flexible, multi-platform library (ANNCOM) that uses the CUDA (Compute Unified Device Architecture) architecture to reduce network training times. This architecture is a form of GPGPU (General-Purpose computing on Graphics Processing Units) and has been adopted as a parallel-computing solution in high-performance computing, since the technology used in current processors is reaching its speed limit. Additionally, a graphical tool was created that helps develop solutions applying neural network techniques easily and clearly using the developed library. To evaluate ANNCOM's performance, six training runs were conducted for classifying low-voltage customers of an electricity distribution company. Training the networks with ANNCOM using CUDA achieved performance nearly 30 times greater than ANNCOM backed by Intel's MKL (Math Kernel Library), which is also used by Matlab.
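The training cost that ANNCOM targets is dominated by dense matrix products. A toy sketch of one gradient-descent step for a single sigmoid layer (hypothetical data and names, not the ANNCOM API) shows where those products appear:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, X, y, lr=1.0):
    """One gradient-descent step for a single sigmoid layer with a
    squared-error loss. The products X @ W and X.T @ (...) are the
    operations a CUDA backend or a tuned BLAS such as MKL accelerates."""
    p = sigmoid(X @ W)
    grad = X.T @ ((p - y) * p * (1.0 - p)) / len(X)
    return W - lr * grad

# Hypothetical toy problem: separate samples by the sign of feature 0.
X = rng.normal(size=(64, 4))
y = (X[:, :1] > 0).astype(float)
W = np.zeros((4, 1))
for _ in range(500):
    W = train_step(W, X, y)
```

With many samples and attributes, X grows large and the two matrix products dominate the runtime, which is why offloading them to a GPU pays off.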
114

SIMULAÇÃO CLIMÁTICA DE DADOS DE VENTO EM REDES P2P UTILIZANDO GPU / CLIMATE SIMULATION OF WIND DATA ON P2P NETWORKS USING A GPU

Baron Neto, Ciro 28 February 2014 (has links)
This paper presents an evaluation of GPGPU (General-Purpose Computing on Graphics Processing Unit) technology and P2P (peer-to-peer) networks as a means to improve the response time of climate data simulations. An application using the CUDA (Compute Unified Device Architecture) architecture and the wind data simulation model of the Venthor simulator were initially adopted and then integrated into the P2PComp framework. The results indicate an acceleration factor of 70 for single computers. Furthermore, with the possibility of using a P2P network to share processing, higher acceleration factors can be obtained. Computer simulation models usually demand high processing power, and this work showed that the use of parallelism on GPUs and P2P networks is an alternative that allows better performance when compared to sequential computing.
115

Trumpųjų bangų sklidimo modelis daugiaprocesorinėje aplinkoje / Development of the model of short wave propagation by using multi-processor environment

Mickus, Mykolas 04 November 2013 (has links)
Understanding elastic wave phenomena (or acoustic waves, or any other type of wave for that matter) is of great importance in areas such as seismology or non-destructive testing (NDT). In an elastic medium this phenomenon is described by the dynamic elastic differential equations. However, computational models like the finite element method consume huge amounts of computational power, as even relatively small problems require dividing the area of interest into millions of elements. With the advent of general-purpose GPU computing, new opportunities for speeding up computations arise, as well as challenges in developing high-performance algorithms suited to these new kinds of processors. Therefore this work concentrates on developing a finite-element-based short elastic wave propagation model on the GPU as well as the CPU. A central difference explicit integration scheme for the wave equation was chosen. It was then slightly modified in order to separate the integration algorithm into three phases: evaluation of external forces, evaluation of forces that occur due to element stresses, and recalculation of node displacements, velocities, and forces. A parallel algorithm was developed for executing the second and third phases, based on the strategy suggested in [1]. The second-phase algorithm was then optimized twice: first the array of element node indices was eliminated, yielding a 20% performance boost; then modifications were made to process elements in blocks using the strategy described in [12] and [22]... [to full text]
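The central-difference scheme named above amounts to a single explicit update per time step. A one-dimensional sketch, under the assumption of a lumped (diagonal) mass matrix (illustrative, not the thesis code):

```python
import numpy as np

def central_difference_step(u, u_prev, f_ext, K, M_inv, dt):
    """One explicit central-difference step for M u'' + K u = f_ext:
        u_next = 2 u - u_prev + dt^2 M^{-1} (f_ext - K u).
    The terms mirror the three phases in the thesis: external forces,
    internal (stress) forces, then the update of nodal displacements."""
    f_int = K @ u                     # internal forces from element stresses
    accel = M_inv * (f_ext - f_int)   # lumped mass matrix: elementwise inverse
    return 2.0 * u - u_prev + dt**2 * accel
```

Each node's update depends only on local data from the previous two time levels, which is why the stress-force and update phases map naturally onto one GPU thread per element or node.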
117

A Haptic Device Interface for Medical Simulations using OpenCL / Ett haptiskt gränssnitt för medicinska simuleringar med OpenCL

Machwirth, Mattias January 2013 (has links)
The project evaluates how well a haptic device can be used to interact with a visualization of volumetric data. Since the interface to the haptic device requires explicit surface descriptions, triangles had to be constructed from the volumetric data. The algorithm used to extract these triangles is marching cubes. The triangles produced by marching cubes are then transmitted to the haptic device to enable force feedback. Marching cubes was suitable for parallelization, and it was executed using OpenCL. Graphs in the report show how this parallelization ran almost 70 times faster than the sequential CPU counterpart of the same algorithm. Further development of the project would give medical students the opportunity to practice difficult procedures on a realistic and accurate simulation instead of a real patient.
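Marching cubes parallelizes well because its first step, classifying each cell against the isovalue, is independent per cell. A NumPy sketch of that classification (an OpenCL kernel would do the equivalent per work-item; the corner ordering below is one possible convention, not necessarily the thesis's):

```python
import numpy as np

# Corner offsets of one voxel cell; real marching-cubes tables fix
# a specific ordering convention, this one is only for illustration.
CORNERS = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
           (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]

def cube_indices(volume, isovalue):
    """Classify every cell of a 3D scalar field: one 8-bit index per
    cell, bit i set when corner i lies below the isovalue. The index
    later selects the triangle configuration from a lookup table.
    Each cell is independent, which makes this step embarrassingly
    parallel on a GPU."""
    nx, ny, nz = (s - 1 for s in volume.shape)
    idx = np.zeros((nx, ny, nz), dtype=np.uint8)
    for bit, (dx, dy, dz) in enumerate(CORNERS):
        corner = volume[dx:dx + nx, dy:dy + ny, dz:dz + nz]
        idx |= (corner < isovalue).astype(np.uint8) << bit
    return idx
```

The expensive lookup-table triangulation that follows is likewise per-cell, so the whole extraction runs as one kernel launch over the grid.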
118

Calcul hautes performances pour les formulations intégrales en électromagnétisme basses fréquences. Intégration, compression matricielle par ondelettes et résolution sur architecture GPGPU / High performance computing for integral formulations in low frequencies electromagnetism – Integration, wavelets matrix compression and solving on GPGPU architecture

Rubeck, Christophe 18 December 2012 (has links)
Integral equation methods are particularly well suited to modeling electromagnetic systems because, unlike finite element methods, they do not require meshing inactive materials such as air. The resulting models are therefore light in terms of the number of degrees of freedom. However, they are full-interaction methods that generate dense systems of equations, whose matrices are slow to compute and expensive to store in the computer's main memory. This work reduces computation times through parallelism, that is, the use of several processors, in particular on graphics cards (GPGPU). It also reduces the memory cost via wavelet matrix compression (an algorithm similar to image compression). Since this compression is lossy, a criterion was developed to control the error it introduces. The methods are applied to an electrostatic formulation for capacitance computation, but in principle they are also applicable to other integral formulations.
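Wavelet matrix compression of the kind described here can be illustrated with a single level of the 2D Haar transform followed by thresholding of small coefficients. A hedged sketch, not the thesis implementation:

```python
import numpy as np

def haar_2d(A):
    """One level of the orthonormal 2D Haar transform
    (A must have even dimensions)."""
    s = 1.0 / np.sqrt(2.0)
    lo_r = (A[:, ::2] + A[:, 1::2]) * s        # row averages
    hi_r = (A[:, ::2] - A[:, 1::2]) * s        # row details
    rows = np.hstack([lo_r, hi_r])
    lo_c = (rows[::2, :] + rows[1::2, :]) * s  # column averages
    hi_c = (rows[::2, :] - rows[1::2, :]) * s  # column details
    return np.vstack([lo_c, hi_c])

def ihaar_2d(W):
    """Inverse of haar_2d: undo the column pass, then the row pass."""
    s = 1.0 / np.sqrt(2.0)
    n, m = W.shape
    rows = np.empty_like(W)
    rows[::2, :] = (W[:n // 2, :] + W[n // 2:, :]) * s
    rows[1::2, :] = (W[:n // 2, :] - W[n // 2:, :]) * s
    A = np.empty_like(W)
    A[:, ::2] = (rows[:, :m // 2] + rows[:, m // 2:]) * s
    A[:, 1::2] = (rows[:, :m // 2] - rows[:, m // 2:]) * s
    return A

def compress(A, threshold):
    """Lossy compression: drop wavelet coefficients below threshold."""
    W = haar_2d(A)
    W[np.abs(W) < threshold] = 0.0
    return W
```

Because the transform is orthonormal, the Frobenius norm of the reconstruction error equals the norm of the discarded coefficients, which is what makes an explicit error-control criterion like the thesis's possible.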
119

Computational fluid dynamics on wildly heterogeneous systems

Huismann, Immo 23 February 2021 (has links)
In the last decade, high-order methods have gained increased attention. These combine the convergence properties of spectral methods with the geometrical flexibility of low-order methods. However, the time step is restrictive, necessitating the implicit treatment of diffusion terms in addition to the pressure. Therefore, efficient solution of elliptic equations is of central importance for fast flow solvers. As the operators scale with O(p · N), where N is the number of degrees of freedom and p the polynomial degree, the runtime of the best available multigrid algorithms scales with O(p · N) as well. This super-linear scaling limits the applicability of high-order methods to mid-range polynomial orders and constitutes a major road block on the way to faster flow solvers. This work reduces the super-linear scaling of elliptic solvers to a linear one. First, the static condensation method improves the condition of the system, then the associated operator is cast into matrix-free tensor-product form and factorized to linear complexity. The low increase in the condition and the linear runtime of the operator lead to linearly scaling solvers when increasing the polynomial degree, albeit with low robustness against the number of elements. A p-multigrid with overlapping Schwarz smoothers regains the robustness, but requires inverse operators on the subdomains and in the condensed case these are neither linearly scaling nor matrix-free. Embedding the condensed system into the full one leads to a matrix-free operator and factorization thereof to a linearly scaling inverse. In combination with the previously gained operator a multigrid method with a constant runtime per degree of freedom results, regardless of whether the polynomial degree or the number of elements is increased. Computing on heterogeneous hardware is investigated as a means to attain a higher performance and future-proof the algorithms. 
A two-level parallelization extends the traditional hybrid programming model by using a coarse-grain layer implementing domain decomposition and a fine-grain, hardware-specific parallelization. Thereafter, load balancing is investigated on a preconditioned conjugate gradient solver, and functional performance models are adapted to account for the communication barriers in the algorithm. With the new model, runtime prediction and measurement agree closely, with an error margin near 5 %. The devised methods are combined into a flow solver that attains the same throughput when computing with p = 16 as with p = 8, preserving the linear scaling. Furthermore, the multigrid method reduces the cost of implicit treatment of the pressure to that of explicit treatment of the convection terms. Lastly, benchmarks confirm that the solver outperforms established high-order codes.
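The matrix-free tensor-product form mentioned above rests on a standard identity: a Kronecker-product operator can be applied through small dense products without ever forming the full matrix. For a 2D element of order p, the naive product with the p²-by-p² matrix costs O(p⁴), while the factored form costs O(p³). A sketch of the identity (illustrative, not the thesis code):

```python
import numpy as np

def apply_kron(A, B, x):
    """Apply (A kron B) to a vector without forming the p^2-by-p^2
    Kronecker product: reshape x to a p-by-p matrix X, then use the
    identity (A kron B) vec(X) = vec(A X B^T) for row-major vec.
    This sum-factorization trick is what reduces the operator cost
    from O(p^4) to O(p^3) per 2D element."""
    p = A.shape[0]
    X = x.reshape(p, p)
    return (A @ X @ B.T).reshape(-1)
```

Applied dimension by dimension, the same factorization gives the linearly scaling operators and inverses that the thesis builds its multigrid on.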
120

Optimizing a Water Simulation based on Wavefront Parameter Optimization

Lundgren, Martin January 2017 (has links)
DICE, a Swedish game company, wanted a more realistic water simulation. Currently, most large-scale water simulations used in games are based on ocean simulation technology. These techniques falter when used in other scenarios, such as coastlines. In order to produce a more realistic simulation, a new one was created based on the water simulation technique "Wavefront Parameter Interpolation". This technique involves a rather extensive preprocess that enables ocean simulations to interact with the terrain. This paper is about optimizing DICE's current implementation of the water simulation. The goal is to achieve better runtime GPU performance. After implementing various optimizations, a speedup of roughly 4-6x was achieved. Performance was evaluated on the PlayStation 4 gaming console.
