31 |
Hardwarebeschleunigung von Matrixberechnungen auf Basis von GPU Verarbeitung / Hardware Acceleration of Matrix Computations Based on GPU Processing
Götze, Johannes 02 July 2019 (has links)
Matrix computations are ubiquitous in today's sound-localization algorithms, so this thesis analyzes matrix computations and their possible realization on embedded systems. To this end, the common acceleration technologies, namely processors, graphics acceleration, and parallelization with the help of FPGAs, are analyzed. The results show that a graphics chip is able to accelerate such a matrix-vector multiplication compared to an implementation on a processor. An implementation on an FPGA, whose development effort is considerably higher than that of acceleration with a graphics chip, cannot be matched by a GPU in terms of runtime.
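The core operation the thesis accelerates is a matrix-vector product. As a language-neutral illustration (not code from the thesis), the row-wise loop below shows why the operation parallelizes naturally: each output element is an independent dot product, so a GPU can assign one thread per row.

```python
import numpy as np

# Toy illustration: y = A @ x computed row by row. On a GPU, each
# iteration of the outer loop would run as one independent thread.
def matvec(A, x):
    rows, cols = A.shape
    y = np.zeros(rows)
    for i in range(rows):          # one GPU thread per row
        acc = 0.0
        for j in range(cols):
            acc += A[i, j] * x[j]
        y[i] = acc
    return y

A = np.arange(6, dtype=float).reshape(2, 3)
x = np.array([1.0, 0.0, 2.0])
```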
|
32 |
Comparison of Technologies for General-Purpose Computing on Graphics Processing Units
Sörman, Torbjörn January 2016 (has links)
The computational capacity of graphics cards for general-purpose computing has progressed fast over the last decade. A major reason is computationally heavy computer games, where the standard of performance and high-quality graphics constantly rises. Another reason is better-suited technologies for programming the graphics cards. Combined, the result is devices with high raw performance and the means to access that performance. This thesis investigates some of the current technologies for general-purpose computing on graphics processing units. Technologies are primarily compared by benchmarking performance and secondarily by factors concerning programming and implementation. The choice of technology can have a large impact on performance. The benchmark application found the difference in execution time between the fastest technology, CUDA, and the slowest, OpenCL, to be a factor of two. The benchmark application also found that the older technologies, OpenGL and DirectX, are competitive with CUDA and OpenCL in terms of resulting raw performance.
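As an illustration of the benchmarking methodology, comparing technologies amounts to timing the same computation under each implementation and reporting the execution-time ratio. The harness and workload below are invented for illustration, not the thesis benchmark:

```python
import time
import numpy as np

# Hypothetical micro-benchmark: time the same reduction implemented two
# ways and report the slow/fast ratio, as the thesis does across
# CUDA, OpenCL, OpenGL, and DirectX implementations.
def bench(fn, data, repeats=3):
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - t0)
    return best  # best-of-N damps scheduling noise

data = np.random.rand(100_000)
t_fast = bench(np.sum, data)                               # vectorized path
t_slow = bench(lambda d: sum(float(v) for v in d), data)   # interpreted loop
ratio = t_slow / t_fast
```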
|
33 |
Mapping parallel programs to heterogeneous multi-core systems
Grewe, Dominik January 2014 (has links)
Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile to high-performance computing. They promise to deliver increased performance at lower energy cost than purely homogeneous, CPU-based systems. In recent years GPU-based heterogeneous systems have become increasingly popular. They combine a programmable GPU with a multi-core CPU. GPUs have become flexible enough to not only handle graphics workloads but also various kinds of general-purpose algorithms. They are thus used as a coprocessor or accelerator alongside the CPU. Developing applications for GPU-based heterogeneous systems involves several challenges. Firstly, not all algorithms are equally suited for GPU computing. It is thus important to carefully map the tasks of an application to the most suitable processor in a system. Secondly, current frameworks for heterogeneous computing, such as OpenCL, are low-level, requiring a thorough understanding of the hardware by the programmer. This high barrier to entry could be lowered by automatically generating and tuning this code from a high-level and thus more user-friendly programming language. Both challenges are addressed in this thesis. For the task mapping problem a machine learning-based approach is presented in this thesis. It combines static features of the program code with runtime information on input sizes to predict the optimal mapping of OpenCL kernels. This approach is further extended to also take contention on the GPU into account. Both methods are able to outperform competing mapping approaches by a significant margin. Furthermore, this thesis develops a method for targeting GPU-based heterogeneous systems from OpenMP, a directive-based framework for parallel computing. OpenMP programs are translated to OpenCL and optimized for GPU performance. At runtime a predictive model decides whether to execute the original OpenMP code on the CPU or the generated OpenCL code on the GPU. 
This approach is shown to outperform both a competing approach and hand-tuned code.
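The task-mapping idea can be sketched as follows. This toy predictor is purely illustrative: the feature names, weights, and threshold are invented, whereas the thesis trains a machine-learning model on real static code features and runtime input sizes.

```python
# Toy device-mapping predictor (features and weights are hypothetical):
# combine static kernel features with runtime input size into a score;
# positive scores favour the GPU, negative the CPU.
def predict_device(features):
    score = (0.5 * features["arithmetic_ops"] / max(features["memory_ops"], 1)
             + 0.000001 * features["input_size"]
             - 1.0)
    return "GPU" if score > 0 else "CPU"

small_kernel = {"arithmetic_ops": 4, "memory_ops": 8, "input_size": 1_000}
big_kernel = {"arithmetic_ops": 64, "memory_ops": 8, "input_size": 10_000_000}
```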
|
34 |
Accelerating Computational AlgorithmsRisley, Michael 10 December 2013 (has links)
Mathematicians and computational scientists are often limited in their ability to model complex phenomena by the time it takes to run simulations. This thesis will inform interested researchers on how the development of highly parallel computer graphics hardware and the compiler frameworks to exploit it are expanding the range of algorithms that can be explored on affordable commodity hardware. We will discuss the complexities that have prevented researchers from exploiting advanced hardware as well as the obstacles that remain for the non-computer scientist.
|
35 |
An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor
Parker, Samuel J. January 2015 (has links)
Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and have thus evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and the target hardware architecture, and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well on the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (9x for the bitonic sort benchmark with 8 dual-issue cores), with further improvements from compiler optimisations (14x for bitonic sort with the same configuration).
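Work-item coalescing, which the framework uses to better utilise the CPU cores, can be sketched as follows (an illustrative Python analogue, not the LE1 toolchain): instead of launching one hardware thread per OpenCL work-item, the compiler wraps the kernel body in a loop so one core executes a whole work-group sequentially.

```python
# Original per-work-item kernel body (one logical OpenCL thread).
def kernel_body(gid, a, b, out):
    out[gid] = a[gid] + b[gid]

# Coalesced form: the compiler-inserted "thread loop" iterates over all
# work-items of the group on a single core, avoiding per-thread overhead.
def coalesced_kernel(group_size, a, b, out):
    for gid in range(group_size):
        kernel_body(gid, a, b, out)

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * 4
coalesced_kernel(4, a, b, out)
```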
|
36 |
Parallel data-processing on GPGPU
Vansa, Radim January 2012 (has links)
Modern graphics cards are no longer limited to 3D image rendering. Frameworks such as OpenCL enable developers to harness the power of many-core architectures for general-purpose data processing. This thesis focuses on elementary primitives often used in database management systems, particularly sorting and set intersection. We present several approaches to these problems and evaluate the results of benchmarked implementations. Our conclusion is that both tasks can be successfully solved using graphics cards, with significant speedup compared to traditional applications computing solely on a multicore CPU.
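For set intersection, a minimal sketch of the merge-based approach on sorted inputs (a sequential CPU analogue; the thesis's GPU version would partition this work across many threads):

```python
# Two-pointer intersection of sorted arrays: each step advances the
# pointer behind the smaller value, so the scan is linear in len(a)+len(b).
def intersect_sorted(a, b):
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result
```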
|
37 |
Оптимизација CFD симулације на групама вишејезгарних хетерогених архитектура / Optimization of CFD simulations on groups of many-core heterogeneous architectures
Tekić, Jelena 07 October 2019 (has links)
The research in this dissertation belongs to the field of parallel programming: the implementation of the CFD (Computational Fluid Dynamics) method on several heterogeneous multi-core devices simultaneously. The thesis presents several algorithms aimed at accelerating CFD simulation on common computers. It is also shown that the described solution achieves satisfactory performance on HPC devices (Tesla graphics cards). The simulation is built in a micro-service architecture that is portable and flexible and makes it easy to test CFD simulations on common computers.
|
38 |
Real-Time Systems with Radiation-Hardened Processors: A GPU-based Framework to Explore Tradeoffs
Alhowaidi, Mohammad January 2012 (has links)
Radiation-hardened processors are designed to be resilient against soft errors, but such processors are slower than Commercial Off-The-Shelf (COTS) processors as well as significantly costlier. In order to mitigate the high costs, software techniques such as task re-executions must be deployed together with adequately hardened processors to provide reliability. This leads to a huge design space comprising the hardening level of the processors and the number of re-executions of each task in the system. Each configuration in this design space represents a tradeoff between processor load, reliability and costs. The reliability comes at the price of higher costs due to higher levels of hardening, and of performance degradation due to hardening or re-executions. Thus, the tradeoffs between performance, reliability and costs must be carefully studied. Pertinent questions that arise in such a design scenario are (i) how many times must a task be re-executed and (ii) what should the hardening level be, such that the system reliability is satisfied? In order to evaluate such tradeoffs efficiently, in this thesis we propose a novel framework that harnesses the computational power of Graphics Processing Units (GPUs). Our framework is based on a system failure probability analysis that connects the probability of failure of tasks to the overall system reliability. Based on the characteristics of this probabilistic analysis as well as real-time deadlines, we derive bounds on the design space to prune infeasible solutions. Finally, we illustrate the benefits of our proposed framework with several experiments.
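The failure-probability analysis can be illustrated with a toy model (a simplified sketch assuming independent attempts, not the thesis's full analysis): a task that fails with probability p per execution fails overall only if all k+1 attempts fail, and the system succeeds only if every task succeeds.

```python
# Toy reliability model for re-execution tradeoffs (illustrative only).
def task_failure_prob(p, re_executions):
    # Task fails only if the original run and every re-execution fail,
    # assuming independent attempts.
    return p ** (re_executions + 1)

def system_reliability(task_params):
    # task_params: list of (per-execution failure prob, re-execution count).
    # The system succeeds only if every task succeeds.
    r = 1.0
    for p, k in task_params:
        r *= 1.0 - task_failure_prob(p, k)
    return r
```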
|
39 |
Lane-Based Front Vehicle Detection and Its Acceleration
Chen, Jie-Qi 02 January 2013 (has links)
Based on the .NET Framework 4.0 development platform and the Visual C# language, this thesis presents various methods of performing lane detection and preceding-vehicle detection/tracking with code optimization and acceleration to reduce the execution time. The thesis consists of two major parts: vehicle detection and tracking. In the detection part, driving lanes are identified first and then the preceding vehicles between the left lane and right lane are detected using the shadow information beneath vehicles. In vehicle tracking, a three-pass search method is used to find the matched vehicles based on the detection results in the previous frames. According to our experiments, the preprocessing (including color-intensity conversion) takes a significant portion of the total execution time. We propose different methods to optimize the code and speed up the software execution using pure C# pointers, OpenCV, OpenCL, etc. Experimental results show that the fastest detection/tracking speed can reach more than 30 frames per second (fps) on a PC with an i7-2600 3.4 GHz CPU. Except for OpenCV, with an execution rate of 18 fps, the rest of the methods reach processing rates of up to 28 fps, which is almost real-time speed. We also add auxiliary vehicle information, such as preceding-vehicle distance and a vehicle-offset warning.
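The color-intensity conversion named above as a preprocessing hotspot can be sketched as a weighted sum over the RGB channels (a hedged sketch assuming standard ITU-R BT.601 luma weights; the thesis's exact formula is not given):

```python
import numpy as np

# Vectorized color-to-intensity conversion: one dot product per pixel.
# BT.601 luma weights are assumed; the weights sum to 1.0, so a pure
# white pixel (255, 255, 255) maps to intensity 255.
def to_intensity(rgb):
    # rgb: H x W x 3 uint8 image -> H x W float intensity
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = (255, 255, 255)   # one white pixel, rest black
gray = to_intensity(img)
```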
|
40 |
Steady State Analysis of Nonlinear Circuits using the Harmonic Balance on GPU
Bandali, Bardia 16 October 2013 (has links)
This thesis describes a new approach to accelerate the simulation of the steady-state response of nonlinear circuits using the Harmonic Balance (HB) technique. The approach presented in this work focuses on direct factorization of the sparse Jacobian matrix of the HB nonlinear equations using a Graphics Processing Unit (GPU) platform, and exploits the heterogeneous structure of the Jacobian matrix. The computational core of the proposed approach is a block-wise version of the KLU factorization algorithm, where scalar arithmetic operations are replaced by block-aware matrix operations. For a large number of harmonics, or excitation tones, or both, the Block-KLU (BKLU) approach effectively raises the ratio of floating-point operations to other operations and therefore becomes an ideal vehicle for implementation on a GPU-based platform. Motivated by this fact, a GPU-based hybrid framework, named Hybrid-BKLU, is developed to implement the BKLU. The Hybrid-BKLU is implemented in two parts, on the host CPU and on the graphics card's GPU, using the OpenCL heterogeneous parallel programming language. To show the efficiency of the Hybrid-BKLU approach, its performance is compared with the BKLU approach performing HB analysis on several test circuits. The Hybrid-BKLU approach yields speedups of up to 89 times over conventional BKLU on the CPU.
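The block-wise idea behind BKLU can be sketched on a dense matrix (a simplified illustration without pivoting or sparsity handling, unlike real KLU): the scalar divide and multiply-subtract steps of LU factorization become a small block solve and a block matrix product, which is what raises the floating-point intensity.

```python
import numpy as np

# Minimal block LU factorization without pivoting, on b x b blocks.
# Returns the packed factors: strictly-below-diagonal blocks hold L
# (unit-diagonal blocks implied), diagonal and above hold U.
def block_lu(A, b):
    n = A.shape[0] // b            # number of blocks per dimension
    LU = A.astype(float).copy()
    for k in range(n):
        kk = slice(k * b, (k + 1) * b)
        for i in range(k + 1, n):
            ii = slice(i * b, (i + 1) * b)
            # Scalar "divide" becomes a block solve: L[i,k] = A[i,k] U[k,k]^-1
            LU[ii, kk] = LU[ii, kk] @ np.linalg.inv(LU[kk, kk])
            for j in range(k + 1, n):
                jj = slice(j * b, (j + 1) * b)
                # Scalar "multiply-subtract" becomes a block GEMM update
                LU[ii, jj] -= LU[ii, kk] @ LU[kk, jj]
    return LU
```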
|