Global ETD Search

61	DistributedCL: middleware de processamento distribuído em GPU com interface da API OpenCL. Andre Luiz Rocha Tupinamba 10 July 2013 Este trabalho apresenta a proposta de um middleware, chamado DistributedCL, que torna transparente o processamento paralelo em GPUs distribuídas. Com o suporte do middleware DistributedCL uma aplicação, preparada para utilizar a API OpenCL, pode executar de forma distribuída, utilizando GPUs remotas, de forma transparente e sem necessidade de alteração ou nova compilação do seu código. A arquitetura proposta para o middleware DistributedCL é modular, com camadas bem definidas e um protótipo foi construído de acordo com a arquitetura, onde foram empregados vários pontos de otimização, incluindo o envio de dados em lotes, comunicação assíncrona via rede e chamada assíncrona da API OpenCL. O protótipo do middleware DistributedCL foi avaliado com o uso de benchmarks disponíveis e também foi desenvolvido o benchmark CLBench, para avaliação de acordo com a quantidade dos dados. O desempenho do protótipo se mostrou bom, superior às propostas semelhantes, tendo alguns resultados próximos do ideal, sendo o tamanho dos dados para transmissão através da rede o maior fator limitante. / This work proposes a middleware, called DistributedCL, which makes parallel processing on distributed GPUs transparent. With DistributedCL middleware support, an OpenCL-enabled application can run in a distributed manner, using remote GPUs, transparently and without code changes or recompilation. The proposed architecture for the DistributedCL middleware is modular, with well-defined layers. A prototype was built according to the architecture, incorporating several optimizations, including batch data transfer, asynchronous network communication and asynchronous OpenCL API invocation. 
The prototype was evaluated using available benchmarks, and a dedicated benchmark, CLBench, was developed to evaluate performance as a function of the amount of data processed. The prototype performed well, outperforming similar proposals, with some results close to the ideal. The size of the data transmitted over the network proved to be the biggest limiting factor. Engenharia Eletrônica OpenCL GPGPU GPU middleware processamento distribuído Electronic Engineering OpenCL GPGPU GPU middleware distributed systems ENGENHARIAS
62	Méthodes numériques pour la résolution accélérée des systèmes linéaires de grandes tailles sur architectures hybrides massivement parallèles / Numerical methods for the accelerated resolution of large scale linear systems on massively parallel hybrid architecture Cheik Ahamed, Abal-Kassim 07 July 2015 (has links) Les progrès en termes de puissance de calcul ont entraîné de nombreuses évolutions dans le domaine de la science et de ses applications. La résolution de systèmes linéaires survient fréquemment dans le calcul scientifique, comme par exemple lors de la résolution d'équations aux dérivées partielles par la méthode des éléments finis. Le temps de résolution découle alors directement des performances des opérations algébriques mises en jeu.Cette thèse a pour but de développer des algorithmes parallèles innovants pour la résolution de systèmes linéaires creux de grandes tailles. Nous étudions et proposons comment calculer efficacement les opérations d'algèbre linéaire sur plateformes de calcul multi-coeur hétérogènes-GPU afin d'optimiser et de rendre robuste la résolution de ces systèmes. Nous proposons de nouvelles techniques d'accélération basées sur la distribution automatique (auto-tuning) des threads sur la grille GPU suivant les caractéristiques du problème et le niveau d'équipement de la carte graphique, ainsi que les ressources disponibles. Les expérimentations numériques effectuées sur un large spectre de matrices issues de divers problèmes scientifiques, ont clairement montré l'intérêt de l'utilisation de la technologie GPU, et sa robustesse comparée aux bibliothèques existantes comme Cusp.L'objectif principal de l'utilisation du GPU est d'accélérer la résolution d'un problème dans un environnement parallèle multi-coeur, c'est-à-dire "Combien de temps faut-il pour résoudre le problème?". 
Dans cette thèse, nous nous sommes également intéressés à une autre question concernant la consommation énergétique, c'est-à-dire "Quelle quantité d'énergie est consommée par l'application?". Pour répondre à cette seconde question, un protocole expérimental est établi pour mesurer la consommation d'énergie d'un GPU avec précision pour les opérations fondamentales d'algèbre linéaire. Cette méthodologie favorise une "nouvelle vision du calcul haute performance" et apporte des réponses à certaines questions rencontrées dans l'informatique verte ("green computing") lorsque l'on s'intéresse à l'utilisation de processeurs graphiques.Le reste de cette thèse est consacré aux algorithmes itératifs synchrones et asynchrones pour résoudre ces problèmes dans un contexte de calcul hétérogène multi-coeur-GPU. Nous avons mis en application et analysé ces algorithmes à l'aide des méthodes itératives basées sur les techniques de sous-structurations. Dans notre étude, nous présentons les modèles mathématiques et les résultats de convergence des algorithmes synchrones et asynchrones. La démonstration de la convergence asynchrone des méthodes de sous-structurations est présentée. Ensuite, nous analysons ces méthodes dans un contexte hybride multi-coeur-GPU, qui devrait ouvrir la voie vers les méthodes hybrides exaflopiques.Enfin, nous modifions la méthode de Schwarz sans recouvrement pour l'accélérer à l'aide des processeurs graphiques. La mise en oeuvre repose sur l'accélération par les GPUs de la résolution locale des sous-systèmes linéaires associés à chaque sous-domaine. Pour améliorer les performances de la méthode de Schwarz, nous avons utilisé des conditions d'interfaces optimisées obtenues par une technique stochastique basée sur la stratégie CMA-ES (Covariance Matrix Adaptation Evolution Strategy). 
Les résultats numériques attestent des bonnes performances, de la robustesse et de la précision des algorithmes synchrones et asynchrones pour résoudre de grands systèmes linéaires creux dans un environnement de calcul hétérogène multi-coeur-GPU. / Advances in computational power have led to many developments in science and its applications. Linear systems must be solved frequently in scientific computing, for example in the finite element discretization of partial differential equations. The overall solution time follows directly from the performance of the algebraic operations involved. This dissertation develops innovative parallel algorithms for solving large sparse linear systems. We study and propose how to compute linear algebra operations efficiently in a heterogeneous multi-core-GPU environment in order to make solvers such as iterative methods more robust and to reduce their computing time. We propose new acceleration techniques based on auto-tuning of the thread distribution on the GPU grid, according to the characteristics of the problem, the capabilities of the graphics card and the available resources. Numerical experiments performed on a set of large sparse matrices arising from diverse engineering and scientific problems clearly show the benefit of using GPU technology to solve large sparse systems of linear equations, and its robustness and accuracy compared to existing libraries such as Cusp. The main goal of using the GPU is to reduce the time needed to obtain the solution in a parallel multi-core environment, i.e., "How much time is needed to solve the problem?". In this thesis, we also address another question regarding energy consumption, i.e., "How much energy is consumed by the application?". To answer this question, an experimental protocol is established to accurately measure the energy consumption of a GPU for fundamental linear algebra operations. 
This methodology fosters a "new vision of high-performance computing" and answers some of the questions raised in green computing when GPUs are used. The remainder of this thesis is devoted to synchronous and asynchronous iterative algorithms for solving linear systems in the context of a multi-core-GPU system. We have implemented and analyzed these algorithms using iterative methods based on sub-structuring techniques. Mathematical models and convergence results for the synchronous and asynchronous algorithms are presented, including a proof of the asynchronous convergence of the sub-structuring methods. We then analyze these methods in a hybrid multi-core-GPU context, which should pave the way for exascale hybrid methods. Lastly, we modify the non-overlapping Schwarz method to accelerate it using GPUs. The implementation relies on GPU acceleration of the local solution of the linear sub-systems associated with each sub-domain. To improve the performance of the Schwarz method, we use optimized interface conditions obtained by a stochastic technique based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Numerical results demonstrate the good performance, robustness and accuracy of the synchronous and asynchronous algorithms for solving large sparse linear systems in a heterogeneous multi-core-GPU environment. Calcul parallèle GPU OpenCL CUDA Auto-tuning Eco-calcul Parallel algorithm GPU OpenCL CUDA Auto-tuning Green computing
63	Hardware Accelerated Digital Image Stabilization in a Video Stream Pacura, Dávid January 2016 The aim of this thesis is to design a new image stabilization technique that uses hardware acceleration through GPGPU. The technique makes it possible to stabilize video sequences in real time, even for high-resolution video. This is needed to ease further processing in computer vision or in military applications. Because several programming models for GPGPU exist, the proposed stabilization algorithm is implemented in the three most widely used of them. Their performance and results are then compared and discussed.
64	Seminar Hochleistungsrechnen und Benchmarking: x264-Encoder als Benchmark Naumann, Stefan January 2014 Modern video encoding requires many computations. Among other things, the picture is split into macroblocks, motion vectors are computed and motion predictions are made in order to save storage space for the compressed file. The x264 encoder tries to achieve this in various ways, which makes the actual encoding process slow; on older or slower PCs it takes considerably longer than other methods. In addition, the x264 encoder uses standards such as SSE, AVX or OpenCL to save time by computing several data items simultaneously. x264 is therefore also suitable for evaluating such standards and for investigating the speedup gained through vector operations or graphics acceleration. info:eu-repo/classification/ddc/004 ddc:004 Video; Benchmark; x264; encoding; OpenCL; VideoLan
65	Cooperative Execution of Opencl Programs on Multiple Heterogeneous Devices Pandit, Prasanna Vasant January 2013 Computing systems have become heterogeneous with the increasing prevalence of multi-core CPUs, Graphics Processing Units (GPU) and other accelerators in them. OpenCL has emerged as an attractive programming framework for heterogeneous systems. However, utilizing multiple devices in OpenCL is a challenge as it requires the programmer to explicitly map data and computation to each device. Utilizing multiple devices simultaneously to speed up execution of a kernel is even more complex, as the relative execution time of the kernel on different devices can vary significantly. Also, after each kernel execution, a coherent version of the data needs to be established. This means that, in order to utilize all devices effectively, the programmer has to spend considerable time and effort to distribute work across all devices, keep track of modified data in these devices and correctly perform a merging step to put the data together. Further, the relative performance of a program may vary across different inputs, which means a statically determined work distribution may not work well. In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses multiple heterogeneous devices to execute each kernel. The runtime performs dynamic work distribution and cooperatively executes each kernel on all available devices. Since we consider a setup with devices having discrete address spaces, our solution ensures that execution of OpenCL work-groups on devices is adjusted by taking into account the overheads for data management. The data transfers and data merging needed to ensure coherence are handled transparently without requiring any effort from the programmer. FluidiCL also does not require prior training or profiling and is completely portable across different machines. 
Because it is dynamic, the runtime is able to adapt to system load. We have developed several optimizations for improving the performance of FluidiCL. We evaluate the runtime across different sets of devices. On a machine with an Intel quad-core processor and an NVidia Fermi GPU, FluidiCL shows a geomean speedup of nearly 64% over the GPU, 88% over the CPU and 14% over the best of the two devices in each benchmark. In all benchmarks, performance of our runtime comes to within 13% of the best of the two devices. FluidiCL shows similar results on a machine with a quad-core CPU and an NVidia Kepler GPU, with up to 26% speedup over the best of the two. We also present results considering an Intel Xeon Phi accelerator and a CPU and find that FluidiCL performs up to 45% faster than the best of the two devices. We extend FluidiCL from a CPU–GPU scenario to a three-device setup having a quad-core CPU, an NVidia Kepler GPU and an Intel Xeon Phi accelerator and find that FluidiCL obtains a geomean improvement of 6% in kernel execution time over the best of the three devices considered in each case. Heterogeneous Computers Open Computing Language FluidiCL Fluidic Kernels OpenCL Application Programming Interface Graphics Processing Unit (GPU) Central Processing Unit (CPU) Computer Architecture FluidiCL Runtime Heterogeneous OpenCL Runtime OpenCL Programs CPU–GPU Systems Computer Engineering
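The "geomean" figures quoted in the FluidiCL abstract are geometric means of per-benchmark speedups. The following is only a minimal sketch of how such a figure is typically computed; the function names and the timings are hypothetical and are not taken from the thesis.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def geomean_speedup(baseline_times, runtime_times):
    """Geometric mean of per-benchmark speedups (baseline time / new time)."""
    ratios = [b / r for b, r in zip(baseline_times, runtime_times)]
    return geomean(ratios)


# Hypothetical kernel times in seconds for two benchmarks (illustration only).
gpu_only = [2.0, 8.0]
cooperative = [1.0, 2.0]

speedup = geomean_speedup(gpu_only, cooperative)
# A ratio of 1.64 would be reported as a "64% speedup".
print(f"geomean speedup: {(speedup - 1.0) * 100:.0f}%")
```

The geometric mean is the usual choice for averaging ratios, because a single benchmark with a very large speedup does not dominate the result the way it would in an arithmetic mean.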
66	Trace-based Performance Analysis for Hardware Accelerators / Leistungsanalyse hardwarebeschleunigter Anwendungen mittels Programmspuren Juckeland, Guido 14 February 2013 This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness of the power consumption of computing devices has led to growing interest in hybrid computing architectures. High-end computers, workstations, and mobile devices have started to employ hardware accelerators to offload computationally intense, parallel tasks, while retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous, so that the scalar unit can continue with other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the case of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a proposal for a generic performance API for hardware accelerators and describes its implementation with NVIDIA CUPTI. Next, the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. 
This novel approach uses program states to display similarities and differences between the potentially very large number of event streams and thus provides a fast way to spot load imbalances. The thesis concludes with an in-depth case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation that benefited greatly from the developed performance analysis methods. / Diese Dissertation zeigt, wie der Ablauf von Anwendungsteilen, die auf Hardwarebeschleuniger ausgelagert wurden, als Programmspur mit aufgezeichnet werden kann. Damit wird die bekannte Technik der Leistungsanalyse von Anwendungen mittels Programmspuren so erweitert, dass auch diese neue Parallelitätsebene mit erfasst wird. Die Beschränkungen von Computersystemen bezüglich der elektrischen Leistungsaufnahme hat zu einer steigenden Anzahl von hybriden Computerarchitekturen geführt. Sowohl Hochleistungsrechner, aber auch Arbeitsplatzcomputer und mobile Endgeräte nutzen heute Hardwarebeschleuniger um rechenintensive, parallele Programmteile auszulagern und so den skalaren Hauptprozessor zu entlasten und nur für nicht parallele Programmteile zu verwenden. Dieses Ausführungsschema ist typischerweise asynchron: der Skalarprozessor kann, während der Hardwarebeschleuniger rechnet, selbst weiterarbeiten. Die Leistungsanalyse-Werkzeuge der Hersteller von Hardwarebeschleunigern decken den Standardfall (ein Host-System mit einem Hardwarebeschleuniger) sehr gut ab, scheitern aber an einer Unterstützung von hochparallelen Rechnersystemen. Die vorliegende Dissertation untersucht, in wie weit auch multi-hybride Anwendungen die Aktivität von Hardwarebeschleunigern aufzeichnen können. Dazu wird die vorhandene Methode zur Erzeugung von Programmspuren für hochparallele Anwendungen entsprechend erweitert. 
In dieser Untersuchung wird zuerst eine allgemeine Methodik entwickelt, mit der sich für jede API-gestützte Hardwarebeschleunigung eine Programmspur erstellen lässt. Darauf aufbauend wird eine eigene Programmierschnittstelle entwickelt, die es ermöglicht weitere leistungsrelevante Daten aufzuzeichnen. Die Umsetzung dieser Schnittstelle wird am Beispiel von NVIDIA CUPTI darstellt. Ein weiterer Teil der Arbeit beschäftigt sich mit der Darstellung von Programmspuren, welche Aufzeichnungen von den unterschiedlichen Parallelitätsebenen enthalten. Um die Einschränkungen klassischer Leistungsprofile oder Zeitachsendarstellungen zu überwinden, wird mit den parallelen Programmablaufgraphen (PPFGs) eine neue graphenbasisierte Darstellungsform eingeführt. Dieser neuartige Ansatz zeigt eine Programmspur als eine Folge von Programmzuständen mit gemeinsamen und unterchiedlichen Abläufen. So können divergierendes Programmverhalten und Lastimbalancen deutlich einfacher lokalisiert werden. Die Arbeit schließt mit der detaillierten Analyse von PIConGPU -- einer multi-hybriden Simulation aus der Plasmaphysik --, die in großem Maße von den in dieser Arbeit entwickelten Analysemöglichkeiten profiert hat. Leistungsanalyse Hardwarebeschleuniger GPUs CUDA OpenCL Tracing Particle-in-Cell Performance Analysis Hardware accelerators GPUs CUDA OpenCL Tracing Particle-in-Cell ddc:004 rvk:ST 150
67	Trumpųjų bangų sklidimo modelis daugiaprocesorinėje aplinkoje / Development of the model of short wave propagation by using multi-processor environment Mickus, Mykolas 04 November 2013 (has links) Tampriosios bangos (arba akustinės ar bet kokios kitos bangos) sklidimo tyrimai yra svarbūs tokiose srityse kaip seismologija arba neardantis medžiagos testavimas. Tamprioje srityje šis reiškinys aprašomas tampriosios bangos dinamine diferencialine lygtimi. Tačiau šios lygties sprendimas naudojant tokius skaitinius metodus kaip baigtiniai elementai reikalauja sritį padalinti į milijonus elementų. Naujų skaičiavimo technologijų kaip bendros paskirties grafiniai procesoriai (GPU) atsiradimas skaičiavimų laiką leidžia ženkliai sumažinti, tačiau algoritmai turi būti specialiai pritaikomi. Todėl šiame darbe koncentruojamasi į trumpos tampriosios bangos baigtinių elementų modelio sukūrimą ir algoritmų tobulinimą naudojant GPU bei pagrindinį procesorių (CPU). Lygties integravimui buvo pasirinktas centrinių skirtumų metodo (CSM) schema. Ši integravimo schema buvo modifikuota taip, kad būtų galima išskirti tris integravimo algoritmo etapus: išorinės jėgos įvertinimas, elementų deformacijos sąlygotų jėgų įvertinimas bei magų poslinkių, greičių ir jėgų perskaičiavimas. Remiantis strategija pasiūlyta [1] šaltinyje, buvo sukurti lygiagretūs algoritmai 2 ir 3 etapo skaičiavimams atlikti. Toliau antrojo etapo algoritmas buvo optimizuotas 2 kartus. Pirmiausia buvo atsisakyta elementų mazgų indeksų masyvo: tai skaičiavimo laiką sumažino 20%. Po to algoritmas buvo modifikuotas taip, kad elementus būtų galima apdoroti blokais kaip siūloma [12] ir [22] šaltiniuose. Skaičiavimo laiką tai leido... [toliau žr. visą tekstą] / Understanding elastic wave (or acoustic or any other type of wave for that matter) phenomenon is of great importance in areas such as seismology or non destructive testing (NDT). 
In an elastic medium, this phenomenon is described by the dynamic elastic wave differential equation. However, numerical methods such as the finite element method consume large amounts of computational power, since even relatively small problems require dividing the region of interest into millions of elements. The advent of general-purpose GPU computing brings new opportunities for speeding up computations, as well as the challenge of developing high-performance algorithms suited to the new kinds of processors. This work therefore concentrates on developing a finite element model of short elastic wave propagation for the GPU as well as the CPU. The explicit central difference scheme was chosen for integrating the wave equation. It was then slightly modified in order to separate the integration algorithm into three phases: evaluation of external forces, evaluation of the forces caused by element deformation, and recalculation of node displacements, velocities and forces. Parallel algorithms were developed for the second and third phases, based on the strategy suggested in [1]. The algorithm of the second phase was then optimized twice: first the array of element node indices was eliminated, which reduced computation time by 20%; then the algorithm was modified to process elements in blocks, using the strategy described in [22]... [to full text] Informatics Baigtinių elementų metodas Trumpoji tamprioji banga Daugiaprocesorinė aplinka OpenCL GPGPU Finite element method Short elastic wave Multiprocessor environment OpenCL GPGPU
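The three-phase explicit central difference step described in this abstract can be illustrated on a toy problem. The sketch below is not the thesis code: it assumes a one-dimensional chain of equal lumped masses joined by linear spring elements, omits the velocity bookkeeping, and all names (`central_difference_step`, `simulate`, `load`) are hypothetical.

```python
def central_difference_step(u_prev, u, t, dt, k, m, load):
    """Advance nodal displacements by one explicit central difference step.

    u_prev, u: displacements at the previous and current time step
    k, m:      element stiffness and lumped nodal mass (assumed uniform)
    load:      function of time giving the external force on node 0
    """
    n = len(u)

    # Phase 1: external forces (here a single point load on the first node).
    f = [0.0] * n
    f[0] = load(t)

    # Phase 2: forces caused by element deformation. Each element is
    # independent of the others, which is what makes this phase parallel.
    for e in range(n - 1):
        stretch = u[e + 1] - u[e]
        f[e] += k * stretch
        f[e + 1] -= k * stretch

    # Phase 3: per-node update, u_next = 2u - u_prev + dt^2 * f / m.
    return [2.0 * u[i] - u_prev[i] + dt * dt * f[i] / m for i in range(n)]


def simulate(n_nodes, steps, dt, k, m, load):
    """Run the scheme from rest and return the final displacements."""
    u_prev = [0.0] * n_nodes
    u = [0.0] * n_nodes
    for s in range(steps):
        u_next = central_difference_step(u_prev, u, s * dt, dt, k, m, load)
        u_prev, u = u, u_next
    return u


# Hypothetical parameters: 5 nodes, constant unit load on the first node.
final = simulate(n_nodes=5, steps=10, dt=0.1, k=1.0, m=1.0, load=lambda t: 1.0)
```

The scheme is only conditionally stable: the time step must not exceed 2/ω_max, where ω_max is the highest natural frequency of the discretized system. For this uniform chain, dt ≤ sqrt(m/k) is a safe choice.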
68	A Haptic Device Interface for Medical Simulations using OpenCL / Ett haptiskt gränssnitt för medicinska simuleringar med OpenCL Machwirth, Mattias January 2013 The project evaluates how well a haptic device can be used to interact with a visualization of volumetric data. Since the interface to the haptic device requires explicit surface descriptions, triangles had to be constructed from the volumetric data. The algorithm used to extract these triangles is marching cubes. The triangles produced by marching cubes are then transmitted to the haptic device to enable force feedback. Marching cubes is well suited to parallelization and was executed using OpenCL. Graphs in the report show that the parallelized version ran almost 70 times faster than the sequential CPU implementation of the same algorithm. With further development, the project would give medical students the opportunity to practice difficult procedures on a realistic and accurate simulation instead of a real patient. / Projektet går ut på att utvärdera hur väl en haptisk utrustning går att använda för att interagera med en visualisering av volumetrisk data. Eftersom haptikutrustningen krävde explicit beskrivna ytor, krävdes först en triangelgenerering utifrån den volymetriska datan. Algoritmen som används till detta är marching cubes. Trianglarna som producerades med hjälp av marching cubes skickas sedan vidare till den haptiska utrustningen för att kunna få gensvar i form av krafter för att utnyttja sig av känsel och inte bara syn. Eftersom marching cubes lämpas för en parallelisering användes OpenCL för att snabba upp algoritmen. Grafer i projektet visar hur denna algoritm exekveras upp emot 70 gånger snabbare när algoritmen körs som en kernel i OpenCL istället för sekventiellt på CPUn. 
Tanken är att när vidareutveckling av projektet har gjorts i god mån, kan detta användas av läkarstuderande där övning av svåra snitt kan ske i en verklighetstrogen simulering innan samma ingrepp utförs på en individ. 3D ultrasound marching cubes OpenCL haptics GPU GPGPU simulation parallel 3D ultraljud marching cubes OpenCL haptik GPU GPGPU simulering parallell Software Engineering Programvaruteknik
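Marching cubes parallelizes well because every cell of the volume is classified independently of its neighbours. As a rough sketch of that first step only, and not of the project's OpenCL kernel, the function below computes the 8-bit case index for one cell. The name `cube_index` and the convention that a bit is set when the corner value lies below the iso level are assumptions; the triangle lookup tables are omitted.

```python
def cube_index(corner_values, iso_level):
    """Return the 8-bit marching cubes case index for a single cell.

    corner_values: scalar field samples at the 8 cell corners, in a fixed order
    iso_level:     threshold that defines the extracted surface
    """
    if len(corner_values) != 8:
        raise ValueError("a cell has exactly 8 corners")
    index = 0
    for bit, value in enumerate(corner_values):
        if value < iso_level:  # assumed convention: bit set means "below"
            index |= 1 << bit
    return index


def cell_is_crossed(corner_values, iso_level):
    """A cell produces triangles only if its corners are not all on one side."""
    return cube_index(corner_values, iso_level) not in (0, 255)


# Hypothetical cell: only the first corner lies below the iso level.
example = cube_index([0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 0.5)
```

In a GPU version each work-item would typically run this classification for one cell, and only the crossed cells would go on to generate triangles.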
69	Approche de conception haut-niveau pour l'accélération matérielle de calcul haute performance en finance / High-level approach for hardware acceleration of high-performance computing in finance Mena morales, Valentin 12 July 2017 (has links) Les applications de calcul haute-performance (HPC) nécessitent des capacités de calcul conséquentes, qui sont généralement atteintes à l'aide de fermes de serveurs au détriment de la consommation énergétique d'une telle solution. L'accélération d'applications sur des plateformes hétérogènes, comme par exemple des FPGA ou des GPU, permet de réduire la consommation énergétique et correspond donc à un compromis architectural plus séduisant. Elle s'accompagne cependant d'un changement de paradigme de programmation et les plateformes hétérogènes sont plus complexes à prendre en main pour des experts logiciels. C'est particulièrement le cas des développeurs de produits financiers en finance quantitative. De plus, les applications financières évoluent continuellement pour s'adapter aux demandes législatives et concurrentielles du domaine, ce qui renforce les contraintes de programmabilité de solutions d'accélérations. Dans ce contexte, l'utilisation de flots haut-niveaux tels que la synthèse haut-niveau (HLS) pour programmer des accélérateurs FPGA n'est pas suffisante. Une approche spécifique au domaine peut fournir une réponse à la demande en performance, sans que la programmabilité d'applications accélérées ne soit compromise.Nous proposons dans cette thèse une approche de conception haut-niveau reposant sur le standard de programmation hétérogène OpenCL. Cette approche repose notamment sur la nouvelle implémentation d'OpenCL pour FPGA introduite récemment par Altera. 
Quatre contributions principales sont apportées : (1) une étude initiale d'intégration de cœurs de calculs matériels à une librairie logicielle de calcul financier (QuantLib), (2) une exploration d'architectures et de leur performances respectives, ainsi que la conception d'une architecture dédiée pour l'évaluation d'option américaine et l'évaluation de volatilité implicite à partir d'un flot haut-niveau de conception, (3) la caractérisation détaillée d'une plateforme Altera OpenCL, des opérateurs élémentaires, des surcouches de contrôle et des liens de communication qui la compose, (4) une proposition d'un flot de compilation spécifique au domaine financier, reposant sur cette dernière caractérisation, ainsi que sur une description des applications financières considérées, à savoir l'évaluation d'options. / The need for resources in High Performance Computing (HPC) is generally met by scaling up server farms, at the expense of energy consumption. Accelerating HPC applications on heterogeneous platforms, such as FPGAs or GPUs, reduces the energy consumption of a deployed system and therefore offers a better architectural compromise. However, this acceleration requires a change of programming paradigm, and heterogeneous platforms are more complex for software experts to master. This is most notably the case for developers in quantitative finance. Applications in this field are constantly evolving and increasing in complexity to stay competitive and comply with legislative changes. This puts even more pressure on the programmability of acceleration solutions. In this context, the use of high-level development and design flows, such as High-Level Synthesis (HLS) for programming FPGAs, is not enough. 
A domain-specific approach can help meet performance requirements without impairing the programmability of accelerated applications. We propose in this thesis a high-level design approach that relies on OpenCL as a heterogeneous programming standard. More precisely, a recent implementation of OpenCL for Altera FPGAs is used. Four main contributions are made: (1) an initial study of the integration of hardware computing cores into a software library for quantitative finance (QuantLib), (2) an exploration of different architectures and their respective performance, as well as the design of a dedicated architecture for the pricing of American options and the evaluation of implied volatility, based on a high-level design flow, (3) a detailed characterization of an Altera OpenCL platform, from the elementary operators, memory accesses and control overlays up to the communication links it is made of, (4) a proposed compilation flow specific to the quantitative finance domain, relying on the aforementioned characterization and on a description of the financial applications considered (option pricing). Conception haut-Niveau OpenCL FPGA GPU Finance Accélération matérielle HPC HLS Prototypage High-Level design OpenCL FPGA GPU Quantitative finance Hardware acceleration HPC HLS Prototyping 004
70	Simulace tekutin v reálném čase / Real-Time Fluid Simulation Fedorko, Matúš January 2015 The primary concern of this work is real-time fluid simulation on modern programmable graphics hardware. It starts by introducing fundamental fluid simulation principles, with a focus on the smoothed-particle hydrodynamics (SPH) technique. The following discussion then gives a brief introduction to OpenCL and contemporary GPU hardware and outlines their programming specifics in comparison with CPUs. Finally, the last two chapters detail the problem analysis and its implementation.

Search results