About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Použití OpenCl v AVG na platformě Windows / Using of OpenCl at AVG in Windows Platform

Bajcar, Martin January 2012 (has links)
The main topic of this thesis is the practical use of OpenCL at the AVG company. AVG is looking for ways to decrease the hardware requirements of its security products and to reduce the computation time of some algorithms, and OpenCL is one way to achieve this. A significant part of this thesis deals with optimization strategies for AMD and NVIDIA graphics cards, as they are the most common cards among users. The practical part of the thesis describes the parallelization of two algorithms, their analysis and implementation. The obtained results are then presented, and the cases in which the use of OpenCL is beneficial are identified. As part of the implementation, a library of utility functions that help programmers write OpenCL-based code was developed.
52

Akcelerace mikroskopické simulace dopravy za použití OpenCL / Acceleration of Microscopic Urban Traffic Simulation Using OpenCL

Urminský, Andrej January 2011 (has links)
As the number of vehicles on our roads increases, the problems related to this phenomenon emerge more dramatically. These problems include car accidents, congestion and CO2 emissions, which increase CO2 concentrations in the atmosphere. In order to minimize these impacts and to use the road infrastructure effectively, traffic simulators can come in handy. Thanks to these tools, it is possible to evaluate the evolution of a traffic flow from various initial states of the simulation and thus know what to do and how to react in different real-world traffic situations. This thesis deals with accelerating microscopic urban traffic simulation using OpenCL. When a large traffic network must be simulated, accelerating the simulation becomes necessary. For this purpose it is possible, for example, to use graphics processing units (GPUs) and the GPGPU technique for general-purpose computation, which is the approach taken in this work. The results show that the performance gains of GPUs are significant compared to a parallel implementation on the CPU.
53

Akcelerace kryptografie pomocí GPU / Cryptography Acceleration Using GPU

Potěšil, Josef January 2011 (has links)
This work familiarizes the reader with selected concepts of cryptography. The AES algorithm was selected, in conjunction with a description of the architecture and software for programming graphics cards (CUDA, OpenCL), in order to create a GPU-accelerated version of it. The thesis maps the APIs for communication with crypto-coprocessors that exist in the kernels of Linux/BSD operating systems (CryptoAPI, OCF), and examines this support in the cross-platform OpenSSL library. Subsequently, the work discusses the implementation details, the achieved results and the integration with the OpenSSL library. The conclusion suggests how the developed application could be used, including its use directly by the operating system kernel.
54

Contributions to Parallel Simulation of Equation-Based Models on Graphics Processing Units

Stavåker, Kristian January 2011 (has links)
In this thesis we investigate techniques and methods for the parallel simulation of equation-based, object-oriented (EOO) Modelica models on graphics processing units (GPUs). Modelica is being developed through an international effort via the Modelica Association. With Modelica it is possible to build computationally heavy models; simulating such models, however, might take a considerable amount of time. Therefore, techniques for utilizing parallel multi-core architectures for simulation are desirable. The main goal of this work is automatic parallelization of equation-based models; that is, it is up to the compiler, not the end-user modeler, to ensure that the generated code can efficiently utilize parallel multi-core architectures. Not only does the code generation process have to be altered, but the accompanying run-time system has to be modified as well. Adding explicit parallel language constructs to Modelica is also discussed to some extent. GPUs can be used for general-purpose scientific and engineering computing, and their theoretical processing power has surpassed that of CPUs due to their highly parallel structure. GPUs are, however, only good at solving certain problems of a data-parallel nature. In this thesis we relate several contributions, by the author and co-workers, to each other. We conclude that the massively parallel GPU architectures are currently suitable only for a limited set of Modelica models; this might change with future GPU generations. CUDA, for instance, the main software platform used in this thesis for general-purpose computing on graphics processing units (GPGPU), is changing rapidly and gaining features such as recursion, function pointers and C++ templates; the underlying hardware architecture, however, is still optimized for data-parallelism.
55

Benchmarking linear-algebra algorithms on CPU- and FPGA-based platforms / Utvärdering av linjär-algebra-algoritmer på CPU- och FPGA-baserade plattformar

Askar Vergara, Omar, Törnblom Bartholf, Karl January 2023 (has links)
Moore's law is the main driving factor behind the rapid evolution of computers observed over the past 50 years. However, the law is approaching its end due to heat- and sizing-related issues. One way to continue the evolution is to utilize alternative computer hardware, with parallel hardware being especially interesting. The Field Programmable Gate Array (FPGA) is one such piece of hardware. This study compares the runtime of two linear-algebra benchmarks executed on a traditional CPU-based platform and on an FPGA-based platform. The benchmarks are called cholesky and durbin: the cholesky benchmark performs Cholesky decomposition and the durbin benchmark computes the solution to a Yule-Walker equation. The CPU implementations of the benchmarks were provided in the C programming language and the FPGA implementations were written using OpenCL, a high-level-synthesis framework. The results highlighted a clear advantage for the CPU implementations, which had a shorter runtime than the FPGA implementations in both benchmarks for every test case. This was caused by data dependencies in both benchmarks, which required them to be executed sequentially. Since the CPU operates at a clock frequency more than ten times higher than the FPGA's, it executed sequential instructions faster than the FPGA.
56

Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems

Daga, Mayank 06 June 2011 (has links)
The emergence of scientific applications embedded with multiple modes of parallelism has made heterogeneous computing systems indispensable in high-performance computing. The popularity of such systems is evident from the fact that three of the top five fastest supercomputers in the world employ heterogeneous computing, i.e., they use dissimilar computational units. A closer look at the performance of these supercomputers reveals that they achieve only around 50% of their theoretical peak performance. This suggests that applications tuned for erstwhile homogeneous computing may not be efficient on today's heterogeneous systems and that novel optimization strategies are required. However, optimizing an application for heterogeneous computing systems is extremely challenging, primarily due to the architectural differences between the computational units in such systems. This thesis is intended as a cookbook for optimizing applications on heterogeneous computing systems that employ graphics processing units (GPUs) as the preferred accelerators. We discuss optimization strategies for multicore CPUs as well as for the two popular GPU platforms, i.e., GPUs from AMD and NVIDIA. Optimization strategies for NVIDIA GPUs have been well studied, but when applied to AMD GPUs they fail to measurably improve performance because of differences in the underlying architecture. To the best of our knowledge, this research is the first to propose optimization strategies for AMD GPUs. Even on NVIDIA GPUs there exists a lesser-known but extremely severe performance pitfall called partition camping, which can degrade application performance by up to seven-fold. To facilitate the detection of this phenomenon, we have developed a performance prediction model that analyzes and characterizes the effect of partition camping in GPU applications. 
We have used a large-scale, molecular modeling application to validate and verify all the optimization strategies. Our results illustrate that if appropriately optimized, AMD and NVIDIA GPUs can provide 371-fold and 328-fold improvement, respectively, over a hand-tuned, SSE-optimized serial implementation. / Master of Science
57

On the Complexity of Robust Source-to-Source Translation from CUDA to OpenCL

Sathre, Paul Daniel 12 June 2013 (has links)
The use of hardware accelerators in high-performance computing has grown increasingly prevalent, particularly due to the growth of graphics processing units (GPUs) as general-purpose (GPGPU) accelerators. Much of this growth has been driven by NVIDIA's CUDA ecosystem for developing GPGPU applications on NVIDIA hardware. However, with the increasing diversity of GPUs (including those from AMD, ARM, and Qualcomm), OpenCL has emerged as an open and vendor-agnostic environment for programming GPUs as well as other parallel computing devices such as the CPU (central processing unit), APU (accelerated processing unit), FPGA (field-programmable gate array), and DSP (digital signal processor). This, coupled with the broader array of devices supporting OpenCL and the significant conceptual and syntactic overlap between CUDA and OpenCL, motivated the creation of a CUDA-to-OpenCL source-to-source translator. However, sufficient differences exist to make the translation non-trivial, imposing practical limitations on both manual and automatic translation efforts. In this thesis, the performance, coverage, and reliability of a prototype CUDA-to-OpenCL source translator are addressed via extensive profiling of a large body of sample CUDA applications. An analysis of this body of applications is provided, which identifies and characterizes the general CUDA source constructs and programming practices that obstruct translation. This characterization then led to more robust support in the translator, followed by an evaluation demonstrating that the performance of the automatically translated OpenCL is on par with the original CUDA for a subset of the sample applications when executed on the same NVIDIA device. / Master of Science
58

Enabling Development of OpenCL Applications on FPGA platforms

Shagrithaya, Kavya Subraya 17 September 2012 (has links)
FPGAs can potentially deliver tremendous acceleration in high-performance server and embedded computing applications. Whether used to augment a processor or as stand-alone devices, these reconfigurable architectures are being deployed in a large number of implementations owing to the massive amounts of parallelism they offer. At the same time, a significant challenge to their widespread acceptance is the laborious effort required to program these devices. The increased development time, the level of experience needed by developers, the lower number of turns per day and the difficulty of iterating quickly over designs affect the time-to-market of many solutions. High-level synthesis aims at increasing the productivity of FPGAs and bringing them within the reach of software developers and domain experts. OpenCL is a specification introduced for parallel programming across platforms. Applications written in OpenCL consist of two parts: a host program for initialization and management, and kernels that define the compute-intensive tasks. In this thesis, a compilation flow to generate customized, application-specific hardware descriptions from OpenCL computation kernels is presented. The flow uses the Xilinx AutoESL tool to obtain the design specification for the compute cores. A provided architecture integrates the cores with memory and host interfaces. The host program in the application is compiled and executed to demonstrate a proof-of-concept implementation towards an end-to-end flow that abstracts the hardware at the front end. / Master of Science
59

Unified system of code transformation and execution for heterogeneous multi-core architectures. / Système unifié de transformation de code et d'éxécution pour un passage aux architectures multi-coeurs hétérogènes

Li, Pei 17 December 2015 (has links)
Heterogeneous architectures are widely used in the domain of high-performance computing. However, developing applications for heterogeneous architectures is time-consuming and error-prone, because going from a single accelerator to multiple ones requires dealing with potentially non-uniform domain decomposition, inter-accelerator data movement, and dynamic load balancing. 
The aim of this thesis is to propose a parallel programming solution for novice developers that eases the complex coding process and guarantees the quality of the code. We highlighted and analysed the shortcomings of existing solutions and proposed a new programming tool called STEPOCL, along with a new domain-specific language designed to simplify the development of applications for heterogeneous architectures. We evaluated both the performance and the usefulness of STEPOCL on three classical applications: a 2D stencil, a matrix multiplication and an N-body problem. The results show that: (i) the performance of an application written with STEPOCL scales linearly with the number of accelerators, (ii) the performance of an application written using STEPOCL competes with a handwritten version, (iii) workloads too large for the memory of a single device can be run on multiple devices, and (iv) thanks to STEPOCL, the number of lines of code required to write an application for multiple accelerators is roughly divided by ten.
60

DistributedCL: middleware de processamento distribuído em GPU com interface da API OpenCL / DistributedCL: middleware for distributed GPU processing with an OpenCL API interface

Andre Luiz Rocha Tupinamba 10 July 2013 (has links)
This work proposes a middleware, called DistributedCL, which makes parallel processing on distributed GPUs transparent. With DistributedCL middleware support, an OpenCL-enabled application can run in a distributed manner, using remote GPUs, transparently and without code changes or recompilation. The proposed architecture for the DistributedCL middleware is modular, with well-defined layers. A prototype was built according to this architecture, incorporating several optimizations, including batched data transfer, asynchronous network communication and asynchronous OpenCL API invocation. The prototype was evaluated using available benchmarks, and a specific benchmark, CLBench, was developed to facilitate evaluation as a function of the amount of processed data. The prototype showed good performance, higher than that of similar proposals, with some results close to ideal; the size of the data transmitted over the network was the biggest limiting factor.
