1. GPU-acceleration of image rendering and sorting algorithms with the OpenCL framework
Anders Söderholm and Justus Sörman, January 2016
Today's computer systems often contain several processing units aside from the CPU. Among these, the GPU is a very common processing unit with immense compute power, available in almost all computer systems. How do we make use of the processing power that lies within our machines? One answer is the OpenCL framework, which is designed for exactly this: to open up the possibility of using all the different types of processing units in a computer system. This thesis discusses the advantages and disadvantages of using the integrated GPU available in a basic workstation computer for image processing and sorting algorithms. These tasks are computationally intensive, and the authors analyze whether an integrated GPU is up to the task of accelerating them. The OpenCL framework makes it possible to run one implementation on different processing units; to provide perspective, we benchmark our implementations on both the GPU and the CPU and compare the results. A heterogeneous approach that combines the two processing units is also tested and discussed. Finally, the OpenCL framework is analyzed from a development perspective, and the advantages and disadvantages it brings to the development process are presented.
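
As a minimal sketch of the portability described above, assuming an installed OpenCL implementation and using illustrative names (this is not code from the thesis), the host program below requests the platform's GPU and CPU devices; the same kernel source can then be built and benchmarked on either device. Error handling is omitted, and not every platform exposes a CPU device.

```c
#include <CL/cl.h>
#include <stdio.h>

/* Pick the first device of the requested type from the first platform. */
static cl_device_id pick_device(cl_device_type type)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, type, 1, &device, NULL);
    return device;
}

int main(void)
{
    /* The same OpenCL kernel source can be compiled for and timed on both. */
    cl_device_id gpu = pick_device(CL_DEVICE_TYPE_GPU);
    cl_device_id cpu = pick_device(CL_DEVICE_TYPE_CPU);

    char name[256];
    clGetDeviceInfo(gpu, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("GPU device: %s\n", name);
    clGetDeviceInfo(cpu, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("CPU device: %s\n", name);
    return 0;
}
```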

2. Jit4OpenCL: a compiler from Python to OpenCL
Xunhao Li
Heterogeneous computing platforms that use GPUs and CPUs in tandem have become an important choice for building low-cost, high-performance computing systems. The computing ability of modern GPUs surpasses what CPUs can offer for certain classes of applications; GPUs can deliver several teraflops of peak performance. However, programmers must adopt a more complicated and more difficult new programming paradigm.
To alleviate the burden of programming for heterogeneous systems, Garg and Amaral developed a Python compiling framework that combines an ahead-of-time compiler called unPython with a just-in-time compiler called jit4GPU. This compilation framework generates code for systems with AMD GPUs. We extend the framework to retarget it to generate OpenCL code, an industry standard implemented for most GPUs. By generating OpenCL code, this new compiler, called jit4OpenCL, enables the execution of the same program on a wider selection of heterogeneous platforms. To further improve target-code performance on NVIDIA GPUs, we developed an array-access analysis tool that helps exploit data reuse by utilizing the shared (local) memory hierarchy in OpenCL.
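
The kernel below is a hand-written sketch of the kind of code such an analysis aims to produce (it is not jit4OpenCL output): tiles of the input matrices are staged in __local (shared) memory so that each global-memory element is loaded once per tile rather than once per use. The tile size and the row-major matrix layout are illustrative assumptions.

```c
#define TILE 16   /* illustrative tile size; launch with a TILE x TILE work-group */

__kernel void matmul_tiled(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           const int N)          /* N assumed divisible by TILE */
{
    __local float As[TILE][TILE];                /* tiles staged in shared memory */
    __local float Bs[TILE][TILE];

    int row = get_global_id(1);
    int col = get_global_id(0);
    int ly  = get_local_id(1);
    int lx  = get_local_id(0);

    float acc = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        /* Each work-item loads one element of each tile from global memory. */
        As[ly][lx] = A[row * N + (t + lx)];
        Bs[ly][lx] = B[(t + ly) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Every element loaded above is reused TILE times from local memory. */
        for (int k = 0; k < TILE; ++k)
            acc += As[ly][k] * Bs[k][lx];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;
}
```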
The thesis presents an experimental performance evaluation indicating that, in comparison with jit4GPU, jit4OpenCL suffers some performance degradation because of the current state of OpenCL implementations and because of the extra time needed for the additional just-in-time compilation. However, the portable code generated by jit4OpenCL still achieves performance gains in some applications compared to highly optimized CPU code.

3. Jit4OpenCL: a compiler from Python to OpenCL
Xunhao Li, unknown date
No description available.

4. An in-depth performance analysis of irregular workloads on VLIW APU
Matthew Doerksen, 26 May 2014
Heterogeneous multi-core architectures have a higher performance/power ratio than traditional homogeneous architectures. Due to their heterogeneity, these architectures support diverse applications, but developing parallel algorithms for them can be difficult. Implementing algorithms for heterogeneous systems often requires proprietary languages, limiting portability. Although general-purpose graphics processing units (GPUs) have shown great promise in accelerating throughput computing applications, they are still limited by the memory wall. The memory wall can greatly affect application performance for problems that involve amorphous parallelism or irregular workloads. AMD's Fusion series of Accelerated Processing Units (APUs) attempts to solve this problem. The APU is a radical change from the traditional systems of a few years ago: it gives consumers a capable CPU connected to a powerful, compute-capable GPU that uses a Very Long Instruction Word (VLIW) architecture.
In this thesis, I address the suitability of irregular workload problems on APU architectures. I consider four scientific computing problems of varying characteristics and map them onto the architectural features of the APU. I develop several software optimizations for each problem by making effective use of VLIW static scheduling through techniques such as loop unrolling and vectorization. Using AMD's OpenCL profiler, I analyze the execution of the various optimizations and provide an in-depth performance analysis using metrics such as kernel occupancy, ALUFetchRatio, ALUBusy Percentage and ALUPacking. Finally, I show the effect of register pressure due to vectorization and the limitations associated with the APU architecture for irregular workloads.
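
As an illustration of the two optimizations named above (not the thesis code, and with hypothetical names), the kernel below vectorizes an element-wise update with float4 so the compiler can fill VLIW slots and improve ALUPacking, and unrolls the loop by four to expose independent operations to the static scheduler.

```c
__kernel void saxpy_vliw(__global const float4 *x,
                         __global float4 *y,
                         const float a,
                         const int n4)        /* number of float4 elements */
{
    int i = get_global_id(0) * 4;             /* each work-item handles 4 float4s   */
    if (i + 3 >= n4)                          /* n4 assumed to be a multiple of 4   */
        return;

    /* Unrolled by 4: the multiply-adds are independent and can be packed
     * into VLIW bundles. Launch with a global size of n4 / 4 work-items. */
    y[i + 0] = a * x[i + 0] + y[i + 0];
    y[i + 1] = a * x[i + 1] + y[i + 1];
    y[i + 2] = a * x[i + 2] + y[i + 2];
    y[i + 3] = a * x[i + 3] + y[i + 3];
}
```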

5. Optimización de software de visualización y detección de patrones de drenaje y terrazas fluviales en superficies de terreno (Optimization of software for visualization and detection of drainage patterns and fluvial terraces on terrain surfaces)
José Tomás Pefaur Pumarino, January 2016
Fluvial geomorphology is the study of the formation and sedimentation processes of rivers and of their interaction with the surrounding environment, which provides information about the "life history" of a terrain. In this context, two elements of study are of particular interest: drainage networks and fluvial terraces.
Runnel is a software tool whose goal is to visualize and detect drainage patterns and fluvial terraces on terrains represented by grids or by triangulations derived from them. Although Runnel works correctly, it has one major problem: the execution time of its main algorithms. As the terrain size grows, the execution time increases considerably. To address this, the drainage-network detection algorithms Peucker, Callaghan, RWFlood, and Dihedral Angle (Ángulo Diedro) were parallelized. At the same time, a simplified triangulation was implemented, which reduces the number of triangles and, consequently, the execution time of the algorithms that use the triangulation.
The parallelization was done with OpenCL, a tool that allows the same parallel code to run on either the GPU or the CPU interchangeably. For the simplified triangulation, an algorithm was used that removes one vertex from each triangle satisfying a predetermined condition.
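
As a hedged sketch of how one of these algorithms maps onto OpenCL (the names and grid layout are assumptions, not Runnel's actual code), a data-parallel formulation in the spirit of the Peucker detector assigns one work-item to each 2x2 window of the elevation grid and marks the window's highest cell; cells that are never marked belong to the drainage network.

```c
__kernel void peucker_mark(__global const float *height,  /* row-major elevation grid */
                           __global int *marked,          /* 0/1 flag per cell        */
                           const int width,
                           const int rows)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width - 1 || y >= rows - 1)
        return;

    /* The four cells of this 2x2 window. */
    int idx[4] = { y * width + x,        y * width + x + 1,
                   (y + 1) * width + x,  (y + 1) * width + x + 1 };

    int highest = idx[0];
    for (int k = 1; k < 4; ++k)
        if (height[idx[k]] > height[highest])
            highest = idx[k];

    marked[highest] = 1;   /* concurrent writers all store 1, so no atomics are needed */
}
```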
The parallelization yielded significant improvements for the Peucker (minimum speed-up 1.19, maximum 13.63), Callaghan (minimum 5.87, maximum 19.82), and Dihedral Angle (minimum 1.43, maximum 23.53) algorithms. The simplified triangulation also improved performance, though with less impact than the parallelization (minimum speed-up 1.14, maximum 1.94). The only algorithm whose execution time did not improve in most test cases was RWFlood (minimum speed-up 0.16, maximum 2.93).
Along with developing the parallel algorithms, knowledge was gained about the performance differences between a CPU and a GPU. It was necessary to delve into the architecture of each and to recognize the kind of problem that each can solve optimally.
Proposed future work includes fixing the excessive memory use of the triangulation, analyzing the impact of the simplified triangulation on the terrains, addressing the execution-time problem of RWFlood, and parallelizing further algorithms.

6. Comparative study of parallel programming models for multicore computing
Akhtar Ali, January 2013
Shared-memory multi-core processor technology has seen drastic development, with faster and increasing numbers of processors per chip. This new architecture challenges programmers to write code that scales over these many cores to exploit the full computational power of these machines. Shared-memory parallel programming paradigms such as OpenMP and Intel Threading Building Blocks (TBB) are two recognized models that offer a higher level of abstraction, shield programmers from the low-level details of thread management, and scale computation over all available resources. At the same time, the need for high-performance, power-efficient computing is compelling developers to exploit GPGPU computing due to the GPU's massive computational power and comparatively faster multi-core growth. This trend leads to systems with heterogeneous architectures containing multicore CPUs and one or more programmable accelerators such as programmable GPUs. Different programming models exist to program these architectures, and code written for one architecture is often not portable to another. OpenCL is a relatively new industry-standard framework, defined by the Khronos group, which addresses the portability issue. It offers a portable interface to exploit the computational power of a heterogeneous set of processors such as CPUs, GPUs, DSP processors, and other accelerators.

In this work, we evaluate the effectiveness of OpenCL for programming multi-core CPUs in a comparative case study with two CPU-specific stable frameworks, OpenMP and Intel TBB, for five benchmark applications, namely matrix multiplication, LU decomposition, image convolution, Pi value approximation, and image histogram generation. The evaluation includes a performance comparison of the three frameworks and a study of the relative effects of applying compiler optimizations on performance numbers. OpenCL performance on two vendor-dependent platforms, Intel and AMD, is also evaluated. Then the same OpenCL code is ported to a modern GPU, and its code correctness and performance portability are investigated. Finally, the usability experience of coding with the three multi-core frameworks is presented.
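
To make the comparison concrete, the sketch below shows one of the benchmarks named above, Pi approximation by the midpoint rule, written in two of the compared models: an OpenMP loop with a reduction, and an equivalent OpenCL kernel in which each work-item accumulates a partial sum for the host to reduce. The identifiers are illustrative, not the thesis code, and the kernel uses single precision to avoid depending on the cl_khr_fp64 extension.

```c
#include <omp.h>

/* OpenMP version: midpoint-rule integration of 4/(1+x^2) over [0,1]. */
double pi_openmp(long n)
{
    double h = 1.0 / (double)n, sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        double x = h * ((double)i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Equivalent OpenCL kernel: work-item i integrates every stride-th slice
 * and writes one partial sum, which the host then adds up. */
static const char *pi_kernel_src =
    "__kernel void pi_partial(const int n, __global float *partial) { \n"
    "    int i = get_global_id(0);                                    \n"
    "    int stride = get_global_size(0);                             \n"
    "    float h = 1.0f / (float)n, sum = 0.0f;                       \n"
    "    for (int k = i; k < n; k += stride) {                        \n"
    "        float x = h * ((float)k + 0.5f);                         \n"
    "        sum += 4.0f / (1.0f + x * x);                            \n"
    "    }                                                            \n"
    "    partial[i] = h * sum;                                        \n"
    "}\n";
```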

7. Reusable OpenCL FPGA Infrastructure
Stephen Alexander Chin, 25 July 2012
OpenCL has emerged as a standard programming model for heterogeneous systems. Recent work combining OpenCL and FPGAs has focused on high-level synthesis, but building a complete OpenCL FPGA system requires more than just high-level synthesis. This work introduces a reusable OpenCL infrastructure for FPGAs that complements previous work and specifically targets a key architectural element: the memory interface. An Aggregating Memory Controller, which aims to maximize bandwidth to external, large, high-latency, high-bandwidth memories, and a template Processing Array with soft-processor and hand-coded hardware elements are designed, simulated, and implemented on an FPGA. Two micro-benchmarks were run on both the soft-processor elements and the hand-coded hardware elements to exercise the Aggregating Memory Controller; they were simulated as well as implemented in a hardware prototype. Memory bandwidth results show that the external memory interface can be saturated and that the high latency can be effectively hidden by the Aggregating Memory Controller.
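
The abstract does not spell out the micro-benchmarks; purely as an illustration of the kind of kernel used to saturate an external memory interface, a bandwidth-style copy kernel with wide, sequential accesses (which an aggregating controller can coalesce into large bursts) might look like the following. All names are assumptions.

```c
__kernel void stream_copy(__global const float4 *src,
                          __global float4 *dst,
                          const int n4)          /* number of float4 elements */
{
    int i = get_global_id(0);
    if (i < n4)
        dst[i] = src[i];   /* wide, sequential accesses the controller can aggregate */
}
```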

8. KFusion: obtaining modularity and performance with regards to general purpose GPU computing and co-processors
Liam Kiemele, 14 December 2012
Concurrency has recently come to the forefront of computing as multi-core processors become more and more common. General-purpose graphics processing unit computing brings with it new language support for dealing with co-processor environments, such as OpenCL and CUDA. Programming language support for multi-core architectures introduces a fundamentally new mechanism for modularity: the kernel. Developers attempting to leverage this mechanism to separate concerns often incur unanticipated performance penalties. My proposed solution aims to preserve the benefits of kernel boundaries for modularity while eliminating these inherent costs at compile time and during execution.

KFusion is a prototype tool for transforming programs written in OpenCL to make them more efficient. By leveraging loop fusion and deforestation, it can eliminate the costs associated with compositions of kernels that share data. Case studies show that KFusion can address key memory bandwidth and latency bottlenecks and result in substantial performance improvements.

9. Překlad OpenCL aplikací pro vestavěné systémy / Compilation of OpenCL Applications for Embedded Systems
Pavel Šnobl, January 2016
This master's thesis deals with support for compiling and executing programs written with the OpenCL framework on embedded systems. OpenCL is a framework for programming heterogeneous systems comprising processors, graphics accelerators, and other computing devices. It also finds use on systems composed of just one computing unit, where it allows writing parallel programs (task and data parallelism) and working with a hierarchical system of memories. In this thesis, various available open-source OpenCL implementations are compared, and the one selected is then integrated into the LLVM compiler infrastructure. This compiler is generated as part of the toolchain provided by Codasip Studio, a development environment for application-specific instruction set processors. Optimizations for architectures with SIMD instructions and for VLIW architectures are also designed and implemented. The result is tested and demonstrated on a set of test applications.
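
As a conceptual sketch of how an OpenCL implementation typically lowers a kernel for a single embedded core (this is illustrative, not Codasip Studio output), the per-work-item body is wrapped in explicit work-group loops that the backend can then unroll or vectorize for SIMD or VLIW units.

```c
/* Original kernel: the body executes once per work-item. */
__kernel void add(__global const float *a, __global const float *b,
                  __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

/* Lowered form run by the runtime for one work-group on one core
 * (plain C, names are illustrative): the explicit work-item loop
 * is the part the backend can unroll or vectorize. */
void add_workgroup(const float *a, const float *b, float *c,
                   size_t group_base, size_t local_size)
{
    for (size_t lid = 0; lid < local_size; ++lid) {   /* work-item loop */
        size_t i = group_base + lid;
        c[i] = a[i] + b[i];                           /* candidate for SIMD/VLIW packing */
    }
}
```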