Global ETD Search

361	A GPU Accelerated Tensor Spectral Method for Subspace Clustering Pai, Nithish January 2016 (has links) (PDF) In this thesis we consider the problem of clustering the data lying in a union of subspaces using spectral methods. Though the data generated may have high dimensionality, in many of the applications, such as motion segmentation and illumination invariant face clustering, the data resides in a union of subspaces having small dimensions. Furthermore, for a number of classification and inference problems, it is often useful to identify these subspaces and work with data in this smaller dimensional manifold. If the observations in each cluster were to be distributed around a centric, applying spectral clustering on an a nifty matrix built using distance based similarity measures between the data points have been used successfully to solve the problem. But it has been observed that using such pair-wise distance based measure between the data points to construct a similarity matrix is not sufficient to solve the subspace clustering problem. Hence, a major challenge is to end a similarity measure that can capture the information of the subspace the data lies in. This is the motivation to develop methods that use an affinity tensor by calculating similarity between multiple data points. One can then use spectral methods on these tensors to solve the subspace clustering problem. In order to keep the algorithm computationally feasible, one can employ column sampling strategies. However, the computational costs for performing the tensor factorization increases very quickly with increase in sampling rate. Fortunately, the advances in GPU computing has made it possible to perform many linear algebra operations several order of magnitudes faster than traditional CPU and multicourse computing. In this work, we develop parallel algorithms for subspace clustering on a GPU com-putting environment. We show that this gives us a significant speedup over the implementations on the CPU, which allows us to sample a larger fraction of the tensor and thereby achieve better accuracies. We empirically analyze the performance of these algorithms on a number of synthetically generated subspaces con gyrations. We ally demonstrate the effectiveness of these algorithms on the motion segmentation, handwritten digit clustering and illumination invariant face clustering and show that the performance of these algorithms are comparable with the state of the art approaches. Subspace Clustering Tensors Spectral Method Hypergraphs and Tensors Tensor Factorization Spectral Clustering based Algorithms GPU Accelerated Algorithm GPU Computing Computer Science
362	Segmatation multi-agents en imagerie biologique et médicale : application aux IRM 3D / Multi-agent segmentation for biological and medical imaging : 3D MRI application Moussa, Richard 12 December 2011 (has links) La segmentation d’images est une opération cruciale pour le traitement d’images. Elle est toujours le point de départ des processus d’analyse de formes, de détection de mouvement, de visualisation, des estimations quantitatives de distances linéaires, de surfaces et de volumes. À ces fins, la segmentation consiste à catégoriser les voxels en des classes basées sur leurs intensités locales, leur localisation spatiale et leurs caractéristiques de forme ou de voisinage. La difficulté de la stabilité des résultats des méthodes de segmentation pour les images médicales provient des différents types de bruit présents.Dans ces images, le bruit prend deux formes : un bruit physique dû au système d’acquisition, dans notre cas l’IRM (Imagerie par Résonance Magnétique), et le bruit physiologique dû au patient. Ces bruits doivent être pris en compte pour toutes les méthodes de segmentation d’images. Durant cette thèse,nous nous sommes focalisés sur des modèles Multi-Agents basés sur les comportements biologiques des araignées et des fourmis pour effectuer la tâche de segmentation. Pour les araignées, nous avons proposé une approche semi-automatique utilisant l’histogramme de l’image pour déterminer le nombre d’objets à détecter. Tandis que pour les fourmis, nous avons proposé deux approches : la première dite classique qui utilise le gradient de l’image et la deuxième, plus originale, qui utilise une partition intervoxel de l’image. Nous avons également proposé un moyen pour accélérer le processus de segmentation grâce à l’utilisation des GPU (Graphics Processing Unit). Finalement, ces deux méthodes ont été évaluées sur des images d’IRM de cerveau et elles ont été comparées aux méthodes classiques de segmentation : croissance de régions et Otsu pour le modèle des araignées et le gradientde Sobel pour les fourmis. / Image segmentation is a crucial operation for image processing. It is always the starting point of shape analysis process, motion detection, visualization, and quantitative estimation of linear distances, surfaces and volumes. For this, the segmentation consists on classifying the voxels into classes based on their local strengths, their spatial location and shape characteristics or neighborhood. The difficulty of the results stability of segmentation methods for medical images comes from the different types of noise present inside every image. In these images, the noise takes two forms: a physical noise due to the acquisition system, in our case, MRI (Magnetic Resonance Imaging), and a physiological noise due to the patient. These noises should be considered for all methods of segmentation. In this thesis, we focused on Multi-Agent models based on the biological behavior of spiders and ants to perform the task of segmentation. For spiders, we proposed a semi-automatic method using the histogram of the image to determine the number of objects to be detected. As for ants, we proposed two approaches: one that uses the so-called classical gradient of the image and the second, more original, which uses an intervoxel partition of the image. We also proposed a way to speed up the segmentation process through the use of the GPU (Graphics Processing Unit). Finally, these two methods were evaluated on MR images of brain and were compared with conventional methods of segmentation: region growing and Otsu for the model of spiders and Sobel gradient for the ants. Systèmes Multi-Agents Agents sociaux GPU Segmentation d’images IRM Imagerie médicale Multi-Agent Systems Social agents GPU Image segmentation MRI Médical imaging
363	Adaptation de l’algorithmique aux architectures parallèles / Adapting algorithms to parallel architectures Borghi, Alexandre 10 October 2011 (has links) Dans cette thèse, nous nous intéressons à l'adaptation de l'algorithmique aux architectures parallèles. Les plateformes hautes performances actuelles disposent de plusieurs niveaux de parallélisme et requièrent un travail considérable pour en tirer parti. Les superordinateurs possèdent de plus en plus d'unités de calcul et sont de plus en plus hétérogènes et hiérarchiques, ce qui complexifie d'autant plus leur utilisation.Nous nous sommes intéressés ici à plusieurs aspects permettant de tirer parti des architectures parallèles modernes. Tout au long de cette thèse, plusieurs problèmes de natures différentes sont abordés, de manière plus théorique ou plus pratique selon le cadre et l'échelle des plateformes parallèles envisagées.Nous avons travaillé sur la modélisation de problèmes dans le but d'adapter leur formulation à des solveurs existants ou des méthodes de résolution existantes, en particulier dans le cadre du problème de la factorisation en nombres premiers modélisé et résolu à l'aide d'outils de programmation linéaire en nombres entiers.La contribution la plus importante de cette thèse correspond à la conception d'algorithmes pensés dès le départ pour être performants sur les architectures modernes (processeurs multi-coeurs, Cell, GPU). Deux algorithmes pour résoudre le problème du compressive sensing ont été conçus dans ce cadre : le premier repose sur la programmation linéaire et permet d'obtenir une solution exacte, alors que le second utilise des méthodes de programmation convexe et permet d'obtenir une solution approchée.Nous avons aussi utilisé une bibliothèque de parallélisation de haut niveau utilisant le modèle BSP dans le cadre de la vérification de modèles pour implémenter de manière parallèle un algorithme existant. A partir d'une unique implémentation, cet outil rend possible l'utilisation de l'algorithme sur des plateformes disposant de différents niveaux de parallélisme, tout en ayant des performances de premier ordre sur chacune d'entre elles. En l'occurrence, la plateforme de plus grande échelle considérée ici est le cluster de machines multiprocesseurs multi-coeurs. De plus, dans le cadre très particulier du processeur Cell, une implémentation a été réécrite à partir de zéro pour tirer parti de celle-ci. / In this thesis, we are interested in adapting algorithms to parallel architectures. Current high performance platforms have several levels of parallelism and require a significant amount of work to make the most of them. Supercomputers possess more and more computational units and are more and more heterogeneous and hierarchical, which make their use very difficult.We take an interest in several aspects which enable to benefit from modern parallel architectures. Throughout this thesis, several problems with different natures are tackled, more theoretically or more practically according to the context and the scale of the considered parallel platforms.We have worked on modeling problems in order to adapt their formulation to existing solvers or resolution methods, in particular in the context of integer factorization problem modeled and solved with integer programming tools.The main contribution of this thesis corresponds to the design of algorithms thought from the beginning to be efficient when running on modern architectures (multi-core processors, Cell, GPU). Two algorithms which solve the compressive sensing problem have been designed in this context: the first one uses linear programming and enables to find an exact solution, whereas the second one uses convex programming and enables to find an approximate solution.We have also used a high-level parallelization library which uses the BSP model in the context of model checking to implement in parallel an existing algorithm. From a unique implementation, this tool enables the use of the algorithm on platforms with different levels of parallelism, while obtaining cutting edge performance for each of them. In our case, the largest-scale platform that we considered is the cluster of multi-core multiprocessors. More, in the context of the very particular Cell processor, an implementation has been written from scratch to take benefit from it. Parallélisation Vectorisation Architectures parallèles Multi-coeur GPU Cell Programmation linéaire Programmation convexe Parallelization Vectorization Parallel architectures Multi-core GPU Cell Linear programming Convex programming
364	Scheduling and memory optimizations for sparse direct solver on multi-core/multi-gpu duster systems / Ordonnancement et optimisations mémoire pour un solveur creux par méthodes directes sur des machines hétérogènes Lacoste, Xavier 18 February 2015 (has links) L’évolution courante des machines montre une croissance importante dans le nombre et l’hétérogénéité des unités de calcul. Les développeurs doivent alors trouver des alternatives aux modèles de programmation habituels permettant de produire des codes de calcul à la fois performants et portables. PaStiX est un solveur parallèle de système linéaire creux par méthodes directe. Il utilise un ordonnanceur de tâche dynamique pour être efficaces sur les machines modernes multi-coeurs à mémoires hiérarchiques. Dans cette thèse, nous étudions les bénéfices et les limites que peut nous apporter le remplacement de l’ordonnanceur interne, très spécialisé, du solveur PaStiX par deux systèmes d’exécution génériques : PaRSEC et StarPU. Pour cela l’algorithme doit être décrit sous la forme d’un graphe de tâches qui est fournit aux systèmes d’exécution qui peuvent alors calculer une exécution optimisée de celui-ci pour maximiser l’efficacité de l’algorithme sur la machine de calcul visée. Une étude comparativedes performances de PaStiX utilisant ordonnanceur interne, PaRSEC, et StarPU a été menée sur différentes machines et est présentée ici. L’analyse met en évidence les performances comparables des versions utilisant les systèmes d’exécution par rapport à l’ordonnanceur embarqué optimisé pour PaStiX. De plus ces implémentations permettent d’obtenir une accélération notable sur les machines hétérogènes en utilisant lesaccélérateurs tout en masquant la complexité de leur utilisation au développeur. Dans cette thèse nous étudions également la possibilité d’obtenir un solveur distribué de système linéaire creux par méthodes directes efficace sur les machines parallèles hétérogènes en utilisant les systèmes d’exécution à base de tâche. Afin de pouvoir utiliser ces travaux de manière efficace dans des codes parallèles de simulations, nous présentons également une interface distribuée, orientée éléments finis, permettant d’obtenir un assemblage optimisé de la matrice distribuée tout en masquant la complexité liée à la distribution des données à l’utilisateur. / The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this thesis, we study the benefits and the limits of replacing the highly specialized internal scheduler of the PaStiX solver by two generic runtime systems: PaRSEC and StarPU. Thus, we have to describe the factorization algorithm as a tasks graph that we provide to the runtime system. Then it can decide how to process and optimize the graph traversal in order to maximize the algorithm efficiency for thetargeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its original internal scheduler, PaRSEC, and StarPU frameworks is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. In this thesis, we also study the possibilities to build a distributed sparse linear solver on top of task-based runtime systems to target heterogeneous clusters. To permit an efficient and easy usage of these developments in parallel simulations, we also present an optimized distributed interfaceaiming at hiding the complexity of the construction of a distributed matrix to the user. GPU Multi-coeur MPI, Ordonnanceur à base de tâches Sparse direct solver GPU Multi-core MPI Tasks based runtime systems
365	Proposta para aceleração de desempenho de algoritmos de visão computacional em sistemas embarcados / Proposed algorithms performance acceleration computer vision in embedded systems André Márcio de Lima Curvello 10 June 2016 (has links) O presente trabalho apresenta um benchmark para avaliar o desempenho de uma plataforma embarcada WandBoard Quad no processamento de imagens, considerando o uso da sua GPU Vivante GC2000 na execução de rotinas usando OpenGL ES 2.0. Para esse fim, foi tomado por base a execução de filtros de imagem em CPU e GPU. Os filtros são as aplicações mais comumente utilizadas em processamento de imagens, que por sua vez operam por meio de convoluções, técnica esta que faz uso de sucessivas multiplicações matriciais, o que justifica um alto custo computacional dos algoritmos de filtros de imagem em processamento de imagens. Dessa forma, o emprego da GPU em sistemas embarcados é uma interessante alternativa que torna viável a realização de processamento de imagem nestes sistemas, pois além de fazer uso de um recurso presente em uma grande gama de dispositivos presentes no mercado, é capaz de acelerar a execução de algoritmos de processamento de imagem, que por sua vez são a base para aplicações de visão computacional tais como reconhecimento facial, reconhecimento de gestos, dentre outras. Tais aplicações tornam-se cada vez mais requisitadas em um cenário de uso e consumo em aplicações modernas de sistemas embarcados. Para embasar esse objetivo foram realizados estudos comparativos de desempenho entre sistemas e entre bibliotecas capazes de auxiliar no aproveitamento de recursos de processadores multicore. Para comprovar o potencial do assunto abordado e fundamentar a proposta do presente trabalho, foi realizado um benchmark na forma de uma sequência de testes, tendo como alvo uma aplicação modelo que executa o algoritmo do Filtro de Sobel sobre um fluxo de imagens capturadas de uma webcam. A aplicação foi executada diretamente na CPU e também na GPU embarcada. Como resultado, a execução em GPU por meio de OpenGL ES 2.0 alcançou desempenho quase 10 vezes maior com relação à execução em CPU, e considerando tempos de readback, obteve ganho de desempenho total de até 4 vezes. / This work presents a benchmark for evaluating the performance of an embedded WandBoard Quad platform in image processing, considering the use of its GPU Vivante GC2000 in executing routines using OpenGL ES 2.0. To this goal, it has relied upon the execution of image filters in CPU and GPU. The filters are the most commonly applications used in image processing, which in turn operate through convolutions, a technique which makes use of successive matrix multiplications, which justifies a high computational cost of image filters algorithms for image processing. Thus, the use of the GPU for embedded systems is an interesting alternative that makes it feasible to image processing performing in these systems, as well as make use of a present feature in a wide range of devices on the market, it is able to accelerate image processing algorithms, which in turn are the basis for computer vision applications such as facial recognition, gesture recognition, among others. Such applications become increasingly required in a consumption and usage scenario in modern applications of embedded systems. To support this goal were carried out a comparative studies of performance between systems and between libraries capable of assisting in the use of multicore processors resources. To prove the potential of the subject matter and explain the purpose of this study, it was performed a benchmark in the form of a sequence of tests, targeting a model application that runs Sobel filter algorithm on a stream of images captured from a webcam. The application was performed directly on the embbedded CPU and GPU. As a result, running on GPU via OpenGL ES 2.0 performance achieved nearly 10 times higher with respect to the running CPU, and considering readback times, achieved total performance gain of up to 4 times. Multicore GPU Linux embarcado OpenGL ES 2.0 Sistemas embarcados Visão computacional Computer vision Embedded Linux Embedded systems GPU Multicore OpenGLES 2.0
366	Processamento de áudio em tempo real em dispositivos computacionais de alta disponibilidade e baixo custo / Real time digital audio processing using highly available, low cost devices André Jucovsky Bianchi 21 October 2013 (has links) Neste trabalho foi feita uma investigação sobre a realização de processamento de áudio digital em tempo real utilizando três dispositivos com características computacionais fundamentalmente distintas porém bastante acessíveis em termos de custo e disponibilidade de tecnologia: Arduino, GPU e Android. Arduino é um dispositivo com licenças de hardware e software abertas, baseado em um microcontrolador com baixo poder de processamento, muito utilizado como plataforma educativa e artística para computações de controle e interface com outros dispositivos. GPU é uma arquitetura de placas de vídeo com foco no processamento paralelo, que tem motivado o estudo de modelos de programação específicos para sua utilização como dispositivo de processamento de propósito geral. Android é um sistema operacional para dispositivos móveis baseado no kernel do Linux, que permite o desenvolvimento de aplicativos utilizando linguagem de alto nível e possibilita o uso da infraestrutura de sensores, conectividade e mobilidade disponível nos aparelhos. Buscamos sistematizar as limitações e possibilidades de cada plataforma através da implementação de técnicas de processamento de áudio digital em tempo real e da análise da intensidade computacional em cada ambiente. / This dissertation describes an investigation about real time audio signal processing using three platforms with fundamentally distinct computational characteristics, but which are highly available in terms of cost and technology: Arduino, GPU boards and Android devices. Arduino is a device with open hardware and software licences, based on a microcontroller with low processing power, largely used as educational and artistic platform for control computations and interfacing with other devices. GPU is a video card architecture focusing on parallel processing, which has motivated the study of specific programming models for its use as a general purpose processing device. Android is an operating system for mobile devices based on the Linux kernel, which allows the development of applications using high level language and allows the use of sensors, connectivity and mobile infrastructures available on devices. We search to systematize the limitations and possibilities of each platform through the implementation of real time digital audio processing techinques and the analysis of computational intensity in each environment. android arduino gpu processamento de áudio processamento em tempo real processamento sonoro android arduino audio processing digital signal processing gpu real time
367	Vision stéréoscopique temps-réel pour la navigation autonome d'un robot en environnement dynamique / Real-time stereovision for autonomous robot navigation in dynamic environment Derome, Maxime 22 June 2017 (has links) L'objectif de cette thèse est de concevoir un système de perception stéréoscopique embarqué, permettant une navigation robotique autonome en environnement dynamique (i.e. comportant des objets mobiles). Pour cela, nous nous sommes imposé plusieurs contraintes : 1) Puisque l'on souhaite pouvoir naviguer en terrain inconnu et en présence de tout type d'objets mobiles, nous avons adopté une approche purement géométrique. 2) Pour assurer une couverture maximale du champ visuel nous avons choisi d'employer des méthodes d'estimation denses qui traitent chaque pixel de l'image. 3) Puisque les algorithmes utilisés doivent pouvoir s'exécuter en embarqué sur un robot, nous avons attaché le plus grand soin à sélectionner ou concevoir des algorithmes particulièrement rapides, pour nuire au minimum à la réactivité du système. La démarche présentée dans ce manuscrit et les contributions qui sont faites sont les suivantes. Dans un premier temps, nous étudions plusieurs algorithmes d’appariement stéréo qui permettent d'estimer une carte de disparité dont on peut déduire, par triangulation, une carte de profondeur. Grâce à cette évaluation nous mettons en évidence un algorithme qui ne figure pas sur les benchmarks KITTI, mais qui offre un excellent compromis précision/temps de calcul. Nous proposons également une méthode pour filtrer les cartes de disparité. En codant ces algorithmes en CUDA pour profiter de l’accélération des calculs sur cartes graphiques (GPU), nous montrons qu’ils s’exécutent très rapidement (19ms sur les images KITTI, sur GPU GeForce GTX Titan).Dans un deuxième temps, nous souhaitons percevoir les objets mobiles et estimer leur mouvement. Pour cela nous calculons le déplacement du banc stéréo par odométrie visuelle pour pouvoir isoler dans le mouvement apparent 2D ou 3D (estimé par des algorithmes de flot optique ou de flot de scène) la part induite par le mouvement propre à chaque objet. Partant du constat que seul l'algorithme d'estimation du flot optique FOLKI permet un calcul en temps-réel, nous proposons plusieurs modifications de celui-ci qui améliorent légèrement ses performances au prix d'une augmentation de son temps de calcul. Concernant le flot de scène, aucun algorithme existant ne permet d'atteindre la vitesse d'exécution souhaitée, nous proposons donc une nouvelle approche découplant structure et mouvement pour estimer rapidement le flot de scène. Trois algorithmes sont proposés pour exploiter cette décomposition structure-mouvement et l’un d’eux, particulièrement efficace, permet d'estimer très rapidement le flot de scène avec une précision relativement bonne. A notre connaissance, il s'agit du seul algorithme publié de calcul du flot de scène capable de s'exécuter à cadence vidéo sur les données KITTI (10Hz).Dans un troisième temps, pour détecter les objets en mouvement et les segmenter dans l'image, nous présentons différents modèles statistiques et différents résidus sur lesquels fonder une détection par seuillage d'un critère chi2. Nous proposons une modélisation statistique rigoureuse qui tient compte de toutes les incertitudes d'estimation, notamment celles de l'odométrie visuelle, ce qui n'avait pas été fait à notre connaissance dans le contexte de la détection d'objets mobiles. Nous proposons aussi un nouveau résidu pour la détection, en utilisant la méthode par prédiction d’image qui permet de faciliter la propagation des incertitudes et l'obtention du critère chi2. Le gain apporté par le résidu et le modèle d'erreur proposés est démontré par une évaluation des algorithmes de détection sur des exemples tirés de la base KITTI. Enfin, pour valider expérimentalement notre système de perception en embarqué sur une plateforme robotique, nous implémentons nos codes sous ROS et certains codes en CUDA pour une accélération sur GPU. Nous décrivons le système de perception et de navigation utilisé pour la preuve de concept qui montre que notre système de perception, convient à une application embarquée. / This thesis aims at designing an embedded stereoscopic perception system that enables autonomous robot navigation in dynamic environments (i.e. including mobile objects). To do so, we need to satisfy several constraints: 1) We want to be able to navigate in unknown environment and with any type of mobile objects, thus we adopt a geometric approach. 2) We want to ensure the best possible coverage of the field of view, so we employ dense methods that process every pixel in the image. 3) The algorithms must be compliant with an embedded platform, therefore we must carefully design the algorithms so they are fast enough to keep a certain level of reactivity. The approach presented in this thesis manuscript and the contributions are summarized below. First, we study several stereo matching algorithms that estimate a disparity map from which we can deduce a depth map, by triangulation. This comparative study highlights one algorithm that is not in the KITTI benchmarks, but that gives a great accuracy/processing time tradeoff. We also propose a filtering method to post-process the disparity maps. By coding these algorithm in CUDA to benefit from hardware acceleration on Graphics Processing Unit, we show that they can perform very fast (19ms on KITTI images, with a GPU GeForce GTX Titan).Second, we want to detect mobile objects and estimate their motion. To do so we compute the stereo rig motion using visual odometry, in order to isolate the part induced by moving objects in the 2D or 3D apparent motion (estimated by optical flow or scene flow algorithms). Considering that the only optical flow algorithm able to perform in real-time is FOLKI, we propose several modifications of it to slightly improve its performances at the cost of a slower processing time. Regarding the scene flow estimation, existing algorithms cannot reach the desired computation speed, so we propose a new approach by decoupling structure and motion for a fast scene flow estimation. Three algorithms are proposed to use this structure-motion decomposition, and one of them, particularly efficient, enables very fast scene flow computing with a relatively good accuracy. To our knowledge it is the only published scene flow algorithm able to perform at framerate on KITTI dataset (10 Hz).Third, to detect moving objects and segment them in the image, we show several statistical models and residual quantities on which we can base the detection by thresholding a chi2 criterion. We propose a rigorous statistical modeling that takes into account all the uncertainties occurring during the estimation, in particular during the visual odometry, which had not been done to our knowledge, in the context of moving object detection. We also propose a new residual quantity for the detection, using an image prediction approach to facilitate uncertainty propagation and the chi2 criterion modeling. The benefit brought by the proposed residual quantity and error model is demonstrated by evaluating detection algorithms on a samples of annotated KITTI data. Finally, we implement our algorithms on ROS to run the perception system on en embedded platform, and we code some algorithms in CUDA to accelerate the computing using GPU. We describe the perception and the navigation system that we use for the experimental validation. We show in our experiments that the proposed stereovision perception system is suitable for embedded robotic applications. Vision stéréoscopique Robotique Flot optique Flot de scène Détection d’objets mobiles Temps-réel GPU Stereovision Robotics Optical flow Scene flow Mobile object detection Real-time GPU
368	Méthodes de génération automatique de code appliquées à l’algèbre linéaire numérique dans le calcul haute performance / Automatic code generation methods applied to numerical linear algebra in high performance computing Masliah, Ian 26 September 2016 (has links) Les architectures parallèles sont aujourd'hui présentes dans tous les systèmes informatiques, allant des smartphones aux supercalculateurs en passant par les ordinateurs de bureau. Programmer efficacement ces architectures en fonction des applications requiert un effort pluridisciplinaire portant sur les langages dédiés (Domain Specific Languages - DSL), les techniques de génération de code et d'optimisation, et les algorithmes numériques propres aux applications. Dans cette thèse, nous présentons une méthode de programmation haut niveau prenant en compte les caractéristiques des architectures hétérogènes et les propriétés existantes des matrices pour produire un solveur générique d'algèbre linéaire dense. Notre modèle de programmation supporte les transferts explicites et implicites entre un processeur (CPU) et un processeur graphique qui peut être généraliste (GPU) ou intégré (IGP). Dans la mesure où les GPU sont devenus un outil important pour le calcul haute performance, il est essentiel d'intégrer leur usage dans les plateformes de calcul. Une architecture récente telle que l'IGP requiert des connaissances supplémentaires pour pouvoir être programmée efficacement. Notre méthodologie a pour but de simplifier le développement sur ces architectures parallèles en utilisant des outils de programmation haut niveau. À titre d'exemple, nous avons développé un solveur de moindres carrés en précision mixte basé sur les équations semi-normales qui n'existait pas dans les bibliothèques actuelles. Nous avons par la suite étendu nos travaux à un modèle de programmation multi-étape ("multi-stage") pour résoudre les problèmes d'interopérabilité entre les modèles de programmation CPU et GPU. Nous utilisons cette technique pour générer automatiquement du code pour accélérateur à partir d'un code effectuant des opérations point par point ou utilisant des squelettes algorithmiques. L'approche multi-étape nous assure que le typage du code généré est valide. Nous avons ensuite montré que notre méthode est applicable à d'autres architectures et algorithmes. Les routines développées ont été intégrées dans une bibliothèque de calcul appelée NT2.Enfin, nous montrons comment la programmation haut niveau peut être appliquée à des calculs groupés et des contractions de tenseurs. Tout d'abord, nous expliquons comment concevoir un modèle de container en utilisant des techniques de programmation basées sur le C++ moderne (C++-14). Ensuite, nous avons implémenté un produit de matrices optimisé pour des matrices de petites tailles en utilisant des instructions SIMD. Pour ce faire, nous avons pris en compte les multiples problèmes liés au calcul groupé ainsi que les problèmes de localité mémoire et de vectorisation. En combinant la programmation haut niveau avec des techniques avancées de programmation parallèle, nous montrons qu'il est possible d'obtenir de meilleures performances que celles des bibliothèques numériques actuelles. / Parallelism in today's computer architectures is ubiquitous whether it be in supercomputers, workstations or on portable devices such as smartphones. Exploiting efficiently these systems for a specific application requires a multidisciplinary effort that concerns Domain Specific Languages (DSL), code generation and optimization techniques and application-specific numerical algorithms. In this PhD thesis, we present a method of high level programming that takes into account the features of heterogenous architectures and the properties of matrices to build a generic dense linear algebra solver. Our programming model supports both implicit or explicit data transfers to and from General-Purpose Graphics Processing Units (GPGPU) and Integrated Graphic Processors (IGPs). As GPUs have become an asset in high performance computing, incorporating their use in general solvers is an important issue. Recent architectures such as IGPs also require further knowledge to program them efficiently. Our methodology aims at simplifying the development on parallel architectures through the use of high level programming techniques. As an example, we developed a least-squares solver based on semi-normal equations in mixed precision that cannot be found in current libraries. This solver achieves similar performance as other mixed-precision algorithms. We extend our approach to a new multistage programming model that alleviates the interoperability problems between the CPU and GPU programming models. Our multistage approach is used to automatically generate GPU code for CPU-based element-wise expressions and parallel skeletons while allowing for type-safe program generation. We illustrate that this work can be applied to recent architectures and algorithms. The resulting code has been incorporated into a C++ library called NT2. Finally, we investigate how to apply high level programming techniques to batched computations and tensor contractions. We start by explaining how to design a simple data container using modern C++14 programming techniques. Then, we study the issues around batched computations, memory locality and code vectorization to implement a highly optimized matrix-matrix product for small sizes using SIMD instructions. By combining a high level programming approach and advanced parallel programming techniques, we show that we can outperform state of the art numerical libraries. C++ Programmation générique CUDA Meta programmation GPU Languages dédiés Programmation générative Algèbre linéaire C++ Generic programming CUDA Meta-Programming GPU Domain specific languages Generative programming Linear algebra
369	Akcelerace ultrazvukových simulací pomocí multi-GPU systémů / Acceleration of Ultrasound Simulations on Multi-GPU Systems Stodůlka, Martin January 2021 (has links) The main focus of this project is usage of multi - GPU systems and usage of CUDA unified memory . Its goal is to accelerate computation of 2D and 3D FFT, which is the main part of simulations in k- Wave library .K- Wave is a C++/ Matlab library used for simulations of propagation of ultrasonic waves in 1D , 2D or 3D space . Acceleration of these functions is necessary , because the simulations are computationally intensive .
370	Design and Performance Analysis of Parallel Processing of SRTP Packets / Design and Performance Analysis of Parallel Processing of SRTP Packets Wozniak, Jan January 2013 (has links) Šifrování multimediálních datových přenosů v reálném čase je jednou z úloh telekomunikační infrastruktury pro dosažení nezbytné úrovně zabezpečení. Rychlost provedení šifrovacího algoritmu může hrát klíčovou roli ve velikosti zpoždění jednotlivých paketů a proto je tento úkol zajímavým z hlediska optimalizačních metod. Tato práce se zaměřuje na možnosti paralelizace zpracování SRTP pro účely telefonní ústředny s využitím OpenCL frameworku a následnou analýzu potenciálního zlepšení.

Search results