Global ETD Search

281	Throughput-oriented analytical models for performance estimation on programmable hardware accelerators / Analyse de performance potentielle d'une simulation de QCD sur réseau sur processeur Cell et GPU Lai, Junjie 15 February 2013 (has links) Durant cette thèse, nous avons principalement travaillé sur deux sujets liés à l'analyse de la performance GPU (Graphics Processing Unit - Processeur graphique). Dans un premier temps, nous avons développé une méthode analytique et un outil d'estimation temporel (TEG) pour prédire les performances d'applications CUDA s’exécutant sur des GPUs de la famille GT200. Cet outil peut prédire les performances avec une précision approchant celle des outils précis au cycle près. Dans un second temps, nous avons développé une approche pour estimer la borne supérieure des performances d'une application GPU, en se basant sur l'analyse de l'application et de son code assembleur. Avec cette borne, nous connaissons la marge d'optimisation restante, et nous pouvons décider des efforts d'optimisation à fournir. Grâce à cette analyse, nous pouvons aussi comprendre quels paramètres sont critiques à la performance. / In this thesis work, we have mainly worked on two topics of GPU performance analysis. First, we have developed an analytical method and a timing estimation tool (TEG) to predict CUDA application's performance for GT200 generation GPUs. TEG can predict GPU applications' performance in cycle-approximate level. Second, we have developed an approach to estimate GPU applications' performance upper bound based on application analysis and assembly code level benchmarking. With the performance upper bound of an application, we know how much optimization space is left and can decide the optimization effort. Also with the analysis we can understand which parameters are critical to the performance. GPGPU Multi-coeurs Processeurs graphiques GPGPU CUDA Fermi GPU Kepler GPU Performance upper bound Performance Prediction Performance Analysis
282	Parallelizing Tabu Search Based Optimization Algorithm on GPUs Malleypally, Vinaya 14 March 2018 (has links) There are many combinatorial optimization problems such as traveling salesman problem, quadratic-assignment problem, flow shop scheduling, that are computationally intractable. Tabu search based simulated annealing is a stochastic search algorithm that is widely used to solve combinatorial optimization problems. Due to excessive run time, there is a strong demand for a parallel version that can be applied to any problem with minimal modifications. Existing advanced and/or parallel versions of tabu search algorithms are specific to the problem at hand. This leads to a drawback of optimization only for that particular problem. In this work, we propose a parallel version of tabu search based SA on the Graphics Processing Unit (GPU) platform. We propose two variants of the algorithm based on where the tabu list is stored (global vs. local). In the first version, the list is stored in the global shared memory such that all threads can access this list. Multiple random walks in solution space are carried out. Each walk avoids the moves made in rest of the walks due to their access to global tabu list at the expense of more time. In the second version, the list is stored at the block level and is shared by only the block threads. Groups of random walks are performed in parallel and a walk in a group avoids the moves made by the rest of the walks within that group due to their access to shared local tabu list. This version is better than the first version in terms of execution time. On the other hand, the first version finds the global optima more often. We present experimental results for six difficult optimization functions with known global optima. Compared to the CPU implementation with similar workload, the proposed GPU versions are faster by approximately three orders of magnitude and often find the global optima. Hill Climbing Simulated Annealing Parallel Computing CUDA Tabu List Aspiration Criteria Neighborhood Search Global Optimum Local Minima Computer Sciences
283	Suivi d'objets d'intérêt dans une séquence d`images : des points saillants aux mesures statistiques Garcia, Vincent 11 December 2008 (has links) (PDF) Le problème du suivi d'objets dans une vidéo se pose dans des domaines tels que la vision par ordinateur (vidéo-surveillance par exemple) et la post-production télévisuelle et cinématographique (effets spéciaux). Il se décline en deux variantes principales : le suivi d'une région d'intérêt, qui désigne un suivi grossier d'objet, et la segmentation spatio-temporelle, qui correspond à un suivi précis des contours de l'objet d'intérêt. Dans les deux cas, la région ou l'objet d'intérêt doivent avoir été préalablement détourés sur la première, et éventuellement la dernière, image de la séquence vidéo. Nous proposons dans cette thèse une méthode pour chacun de ces types de suivi ainsi qu'une implémentation rapide tirant partie du Graphics Processing Unit (GPU) d'une méthode de suivi de régions d'intérêt développée par ailleurs. La première méthode repose sur l'analyse de trajectoires temporelles de points saillants et réalise un suivi de régions d'intérêt. Des points saillants (typiquement des lieux de forte courbure des lignes isointensité) sont détectés dans toutes les images de la séquence. Les trajectoires sont construites en liant les points des images successives dont les voisinages sont cohérents. Notre contribution réside premièrement dans l'analyse des trajectoires sur un groupe d'images, ce qui améliore la qualité d'estimation du mouvement. De plus, nous utilisons une pondération spatio-temporelle pour chaque trajectoire qui permet d'ajouter une contrainte temporelle sur le mouvement tout en prenant en compte les déformations géométriques locales de l'objet ignorées par un modèle de mouvement global. La seconde méthode réalise une segmentation spatio-temporelle. Elle repose sur l'estimation du mouvement du contour de l'objet en s'appuyant sur l'information contenue dans une couronne qui s'étend de part et d'autre de ce contour. Cette couronne nous renseigne sur le contraste entre le fond et l'objet dans un contexte local. C'est là notre première contribution. De plus, la mise en correspondance par une mesure de similarité statistique, à savoir l'entropie du résiduel, d'une portion de la couronne et d'une zone de l'image suivante dans la séquence permet d'améliorer le suivi tout en facilitant le choix de la taille optimale de la couronne. Enfin, nous proposons une implémentation rapide d'une méthode de suivi de régions d'intérêt existante. Cette méthode repose sur l'utilisation d'une mesure de similarité statistique : la divergence de Kullback-Leibler. Cette divergence peut être estimée dans un espace de haute dimension à l'aide de multiples calculs de distances au k-ème plus proche voisin dans cet espace. Ces calculs étant très coûteux, nous proposons une implémentation parallèle sur GPU (grâce à l'interface logiciel CUDA de NVIDIA) de la recherche exhaustive des k plus proches voisins. Nous montrons que cette implémentation permet d'accélérer le suivi des objets, jusqu'à un facteur 15 par rapport à une implémentation de cette recherche nécessitant au préalable une structuration des données. Suivi d'objets point d'intérêt images points saillants mesures statistiques GPU CUDA entropie kulback-leibler
284	Throughput-oriented analytical models for performance estimation on programmable hardware accelerators Lai, Junjie 15 February 2013 (has links) (PDF) In this thesis work, we have mainly worked on two topics of GPU performance analysis. First, we have developed an analytical method and a timing estimation tool (TEG) to predict CUDA application's performance for GT200 generation GPUs. TEG can predict GPU applications' performance in cycle-approximate level. Second, we have developed an approach to estimate GPU applications' performance upper bound based on application analysis and assembly code level benchmarking. With the performance upper bound of an application, we know how much optimization space is left and can decide the optimization effort. Also with the analysis we can understand which parameters are critical to the performance. [INFO:INFO_OH] Computer Science/Other GPGPU CUDA Fermi GPU Kepler GPU Performance upper bound Performance Prediction Performance Analysis
285	GPU-accelerated Model Checking of Periodic Self-Suspending Real-Time Tasks Liberg, Tim, Måhl, Per-Erik January 2012 (has links) Efficient model checking is important in order to make this type of software verification useful for systems that are complex in their structure. If a system is too large or complex then model checking does not simply scale, i.e., it could take too much time to verify the system. This is one strong argument for focusing on making model checking faster. Another interesting aim is to make model checking so fast that it can be used for predicting scheduling decisions for real-time schedulers at runtime. This of course requires the model checking to complete within a order of milliseconds or even microseconds. The aim is set very high but the results of this thesis will at least give a hint on whether this seems possible or not. The magic card for (maybe) making this possible is called Graphics Processing Unit (GPU). This thesis will investigate if and how a model checking algorithm can be ported and executed on a GPU. Modern GPU architectures offers a high degree of processing power since they are equipped with up to 1000 (NVIDIA GTX 590) or 3000 (NVIDIA Tesla K10) processor cores. The drawback is that they offer poor thread-communication possibilities and memory caches compared to CPU. This makes it very difficult to port CPU programs to GPUs.The example model (system) used in this thesis represents a real-time task scheduler that can schedule up to three periodic self-suspending tasks. The aim is to verify, i.e., find a feasible schedule for these tasks, and do it as fast as possible with the help of the GPU. GPU Model Checking Verification CUDA STS Tree search GPGPU Periodic Self-Suspending Tasks Real-Time Scheduling Timed automata
286	Real-time Object Recognition on a GPU Pettersson, Johan January 2007 (has links) <p>Shape-Based matching (SBM) is a known method for 2D object recognition that is rather robust against illumination variations, noise, clutter and partial occlusion.</p><p>The objects to be recognized can be translated, rotated and scaled.</p><p>The translation of an object is determined by evaluating a similarity measure for all possible positions (similar to cross correlation).</p><p>The similarity measure is based on dot products between normalized gradient directions in edges.</p><p>Rotation and scale is determined by evaluating all possible combinations, spanning a huge search space.</p><p>A resolution pyramid is used to form a heuristic for the search that then gains real-time performance.</p><p>For SBM, a model consisting of normalized edge gradient directions, are constructed for all possible combinations of rotation and scale.</p><p>We have avoided this by using (bilinear) interpolation in the search gradient map, which greatly reduces the amount of storage required.</p><p>SBM is highly parallelizable by nature and with our suggested improvements it becomes much suited for running on a GPU.</p><p>This have been implemented and tested, and the results clearly outperform those of our reference CPU implementation (with magnitudes of hundreds).</p><p>It is also very scalable and easily benefits from future devices without effort.</p><p>An extensive evaluation material and tools for evaluating object recognition algorithms have been developed and the implementation is evaluated and compared to two commercial 2D object recognition solutions.</p><p>The results show that the method is very powerful when dealing with the distortions listed above and competes well with its opponents.</p> object recognition pattern matching GPU CUDA transformation rotation scale noise illumination occlusion clutter evaluation Image analysis Bildanalys
287	Ordonnancement dynamique, adapté aux architectures hétérogènes, de la méthode multipôle pour les équations de Maxwell, en électromagnétisme Bordage, Cyril 20 December 2013 (has links) (PDF) La méthode multipôle permet d'accélérer les produits matrices-vecteurs, utilisés par les solveurs itératifs pour déterminer le comportement électromagnétique, d'un objet soumis à une onde incidente. Nos travaux ont pour but d'adapter cette méthode pour la rendre efficace sur les architectures hétérogènes contenant des GPU. Pour cela, nous utilisons une ordonnanceur dynamique, StarPU, qui effectuera la distribution des tâches de calcul au sein d'un nœud. Pour la parallélisation en mémoire distribuée, nous effectuerons un ordonnancement statique des boîtes, couplé à un ordonnancement dynamique des interactions proches. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Méthode multipôle Fmm Ordonnancement dynamique Électromagnétique Helmholtz Mpi StarPU Cuda
288	Résolution de systèmes linéaires et non linéaires creux sur grappes de GPUs Ziane Khodja, Lilia 07 June 2013 (has links) (PDF) Depuis quelques années, les grappes équipées de processeurs graphiques GPUs sont devenues des outils très attrayants pour le calcul parallèle haute performance. Dans cette thèse, nous avons conçu des algorithmes itératifs parallèles pour la résolution de systèmes linéaires et non linéaires creux de très grandes tailles sur grappes de GPUs. Dans un premier temps, nous nous sommes focalisés sur la résolution de systèmes linéaires creux à l'aide des méthodes itératives CG et GMRES. Les expérimentations ont montré qu'une grappe de GPUs est plus performante que son homologue grappe de CPUs pour la résolution de systèmes linéaires de très grandes tailles. Ensuite, nous avons mis en oeuvre des algorithmes parallèles synchrones et asynchrones des méthodes itératives Richardson et de relaxation par blocs pour la résolution de systèmes non linéaires creux. Nous avons constaté que les meilleurs solutions développées pour les CPUs ne sont pas nécessairement bien adaptées aux GPUs. En effet, les simulations effectuées sur une grappe de GPUs ont montré que les algorithmes Richardson sont largement plus efficaces que ceux de relaxation par blocs. De plus, elles ont aussi montré que la puissance de calcul des GPUs permet de réduire le rapport entre le temps d'exécution et celui de communication, ce qui favorise l'utilisation des algorithmes asynchrones sur des grappes de GPUs. Enfin, nous nous sommes intéressés aux grappes géographiquement distantes pour la résolution de systèmes linéaires creux. Dans ce contexte, nous avons utilisé la méthode de multi-décomposition à deux niveaux avec GMRES parallèle adaptée aux grappes de GPUs. Celle-ci utilise des itérations synchrones pour résoudre localement les sous-systèmes linéaires et des itérations asynchrones pour résoudre la globalité du système linéaire. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Méthodes itératives Parallélisme MPI/CUDA Grappes de GPUs
289	UN ENVIRONNEMENT PARALLÈLE DE DÉVELOPPEMENT HAUT NIVEAU POUR LES ACCÉLÉRATEURS GRAPHIQUES : MISE EN OEUVRE À L'AIDE D'OPENMP Noaje, Gabriel 07 March 2013 (has links) (PDF) Les processeurs graphiques (GPU), originellement dédiés à l'accélération de traitements graphiques, ont une structure hautement parallèle. Les innovations matérielles et de langage de programmation ont permis d'ouvrir le domaine du GPGPU, où les cartes graphiques sont utilisées comme des accélérateurs de calcul pour des applications HPC généralistes. L'objectif de nos travaux est de faciliter l'utilisation de ces nouvelles architectures pour les besoins du calcul haute performance ; ils suivent deux objectifs complémentaires. Le premier axe de nos recherches concerne la transformation automatique de code, permettant de partir d'un code de haut niveau pour le transformer en un code de bas niveau, équivalent, pouvant être exécuté sur des accélérateurs. Dans ce but nous avons implémenté un transformateur de code capable de prendre en charge les boucles " pour " parallèles d'un code OpenMP (simples ou imbriquées) et de le transformer en un code CUDA équivalent, qui soit suffisamment lisible pour permettre de le retravailler par des optimisations ultérieures. Par ailleurs, le futur des architectures HPC réside dans les architectures distribuées basées sur des noeuds dotés d'accélérateurs. Pour permettre aux utilisateurs d'exploiter les noeuds multiGPU, il est nécessaire de mettre en place des schémas d'exécution appropriés. Nous avons mené une étude comparative et mis en évidence que les threads OpenMP permettent de gérer de manière efficace plusieurs cartes graphiques et les communications au sein d'un noeud de calcul multiGPU. OpenMP CUDA compilateur transformation de code manycoeurs multiGPU
290	Optical Flow Computation on Compute Unified Device Architecture / Optiskt flödeberäkning med CUDA Ringaby, Erik January 2008 (has links) There has been a rapid progress of the graphics processor the last years, much because of the demands from computer games on speed and image quality. Because of the graphics processor’s special architecture it is much faster at solving parallel problems than the normal processor. Due to its increasing programmability it is possible to use it for other tasks than it was originally designed for. Even though graphics processors have been programmable for some time, it has been quite difficult to learn how to use them. CUDA enables the programmer to use C-code, with a few extensions, to program NVIDIA’s graphics processor and completely skip the traditional programming models. This thesis investigates if the graphics processor can be used for calculations without knowledge of how the hardware mechanisms work. An image processing algorithm calculating the optical flow has been implemented. The result shows that it is rather easy to implement programs using CUDA, but some knowledge of how the graphics processor works is required to achieve high performance. optical flow GPU GPGPU CUDA Engineering and Technology Teknik och teknologier

Search results