Global ETD Search

181	Akcelerace neuronových sítí s využitím GPU / The GPU Based Acceleration of Neural Networks Šimíček, Ondřej January 2015 The thesis deals with the acceleration of backpropagation neural networks using graphics chips. OpenCL was used to solve this problem because it allows work with graphics chips from different manufacturers. The main goal was to accelerate the time-consuming learning and classification processes. The acceleration was achieved by training a large number of neural networks simultaneously. The speed gain was used to find the best settings and topology of a neural network for a given task using a genetic algorithm.
182	Akcelerace částicových rojů PSO pomocí GPU / Acceleration of Particle Swarm Optimization Using GPUs Krézek, Vladimír January 2012 This work deals with the PSO technique (Particle Swarm Optimization), which is capable of solving complex problems. The technique can be used for complex combinatorial problems (the traveling salesman problem, the knapsack problem), for the design of integrated circuits and antennas, and in fields such as biomedicine, robotics, artificial intelligence, and finance. Although the PSO algorithm is very efficient, the time required to find appropriate solutions to real problems often makes the task intractable. The goal of this work is to reduce the execution time of the algorithm by using graphics processors (GPUs), which offer higher computing potential at a favorable price and size. The boolean satisfiability problem (SAT) was chosen to verify and benchmark the implementation. As the SAT problem belongs to the class of NP-complete problems, any reduction of the solution time may broaden the class of tractable problems and yield interesting new knowledge.
183	Akcelerace částicových rojů PSO pomocí GPU / Particle Swarm Optimization on GPUs Záň, Drahoslav January 2013 This thesis deals with the population-based stochastic optimization technique PSO (Particle Swarm Optimization) and its acceleration. This simple but very effective technique is designed for solving difficult multidimensional problems in a wide range of applications. The aim of this work is to develop a parallel implementation of the algorithm, with an emphasis on finding a solution faster. A graphics card (GPU), which provides massive performance, was chosen for this purpose. To evaluate the benefits of the proposed implementation, CPU and GPU implementations were created for solving a problem derived from the well-known NP-hard knapsack problem. The GPU application shows an average speedup of 5 times and a maximum speedup of almost 10 times over the optimized CPU application on which it is based.
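For readers unfamiliar with the technique named in the two entries above, here is a minimal CPU-only sketch of a textbook global-best PSO loop. It is not taken from either thesis: the function names (`pso_minimize`, `sphere`), the coefficient values, and the toy objective are all illustrative assumptions. The per-particle update inside the inner loop is independent across particles within an iteration, which is the kind of work a GPU implementation would typically assign to separate threads.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Textbook global-best PSO (illustrative parameters, not from the theses)."""
    rng = random.Random(seed)
    lo, hi = bounds
    positions = [[rng.uniform(lo, hi) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers the best position it has visited so far.
    personal_best = [p[:] for p in positions]
    personal_best_val = [objective(p) for p in positions]

    # The swarm shares the best position found by any particle.
    best_idx = min(range(n_particles), key=personal_best_val.__getitem__)
    global_best = personal_best[best_idx][:]
    global_best_val = personal_best_val[best_idx]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward personal best + pull toward global best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                # Move and clamp to the search bounds.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            val = objective(positions[i])
            if val < personal_best_val[i]:
                personal_best[i] = positions[i][:]
                personal_best_val[i] = val
                if val < global_best_val:
                    global_best = positions[i][:]
                    global_best_val = val
    return global_best, global_best_val


def sphere(x):
    """Toy objective (an assumption for the demo): minimum 0 at the origin."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
print(best, best_val)
```

The theses apply PSO to SAT and to a knapsack-derived problem, which need a discrete encoding that this continuous sketch does not attempt.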
184	Detekce obličejů ve videu na GPU / Face Detection in Video on GPU Tesař, Martin January 2012 This work deals with the task of face detection on a graphics card. The first part introduces face detection methods, focusing on the detector proposed by Viola and Jones. The work then studies the possibilities of mapping the detector's key parts onto a graphics card. The next part describes implementation details of the designed application. The work closes with results and a comparison with the CPU approach. The last chapter summarizes the whole work and proposes directions for future development.
185	Fyzikální simulace na GPU / Physics Simulation on GPU Janošík, Ondřej January 2016 This thesis addresses rigid body simulation and the possibilities of parallelizing it on the GPU. It describes the basics necessary for implementing a basic physics engine for blocks, and the technologies that can be used for acceleration. I describe the approach that allowed me to gradually accelerate the physics simulation using OpenCL. Each significant change is described in its own section and includes measurement results with a short summary.
186	Konstrukce kD stromu na GPU / Building kD Tree on GPU Bajza, Jakub January 2016 This term project addresses the construction of kD tree acceleration structures and the parallelization of this construction on the GPU. It begins by introducing the reader to the CUDA platform for parallel programming, describing both general principles and the specific features used in this thesis. The reader is then introduced to acceleration structures for ray tracing. These structures are described, and the kD tree acceleration structure and its variants are covered in detail. Next, the chosen kD tree variant is analyzed and the problems and issues of its parallel implementation are addressed. The implementation description includes a short description of the CPU variant and detailed specifications of the CUDA kernels. The testing section presents the results of the implementation as a CPU vs GPU comparison, and evaluates how well the metric set in the design was fulfilled. The work closes with a summary of the achieved goals and results, followed by possible future improvements to the implementation.
187	Synthesizing Software from a ForSyDe Model Targeting GPGPUs Hjort Blindell, Gabriel January 2012 Today, a plethora of parallel execution platforms are available. One platform in particular is the GPGPU, a massively parallel architecture designed for exploiting data parallelism. However, GPGPUs are notoriously difficult to program due to the way data is accessed and processed, and many interconnected factors affect the performance. This makes it an exceptionally challenging task to write correct and high-performing applications for GPGPUs. This thesis project aims to address this problem by investigating how models in ForSyDe, a software engineering methodology where applications are modeled at a very high level of abstraction, can be synthesized into CUDA C code for execution on NVIDIA CUDA-enabled graphics cards. The report proposes a software synthesis process which discovers one type of potential data parallelism in a model and generates either pure C or CUDA C code. A prototype of the software synthesis component has also been implemented and tested on models derived from two applications: a Mandelbrot generator and an industrial-scale image processor. The synthesized CUDA code produced in the tests was shown to be both correct and efficient, provided there was enough computational complexity in the processes to amortize the overhead cost of using the GPGPU. ForSyDe abstract program models software synthesis gpgpu cuda C Engineering and Technology Teknik och teknologier
188	Compiler-Based Tools to Aid in Data Transfer Optimization and On-Chip Debug of Heterogeneous Compute Systems Ashcraft, Matthew B. 07 July 2020 First, we present techniques to efficiently schedule data transfers through compiler analyses. Compared to transferring data immediately before and after the kernel executes, our scheduling results in orders of magnitude improvements in execution time, number of data transfers, and number of bytes transferred. Second, we demonstrate techniques to provide on-chip debugging for heterogeneous systems by recording execution in the software in addition to the debugging circuitry in the hardware, and we provide a temporal correlation between the hardware and software traces through synchronization. This allows us to follow debug data between the hardware and software trace buffers. Due to the added cost of synchronizing the trace buffers, we explore synchronization schemes which can reduce the impact of synchronization depending on the code structure. We demonstrate the quantitative impact of these techniques on execution time and on hardware and software resources; in most cases execution time increases by less than 2x. Third, we demonstrate how source-code debugging techniques for on-chip debugging can be applied to OpenCL FPGA kernels in heterogeneous systems. We developed techniques and a tool-flow that allow users to select variables to record, automatically insert recording instructions into the kernel source code, synthesize the changes directly into the hardware design using commercial HLS tools, retrieve the trace data through kernel arguments, and present it to the user. Overall, quantitative measurements showed our techniques resulted in modest increases to execution time and hardware resources. compilers accelerators GPGPU data transfers HLS high-level Synthesis FPGA Engineering
189	GPU-Assisted Collision Avoidance for Trajectory Optimization : Parallelization of Lookup Table Computations for Robotic Motion Planners Based on Optimal Control Bishnoi, Abhiraj January 2021 One of the biggest challenges associated with optimization based methods for robotic motion planning is their extreme sensitivity to a good initial guess, especially in the presence of local minima in the cost function landscape. Additional challenges may also arise due to operational constraints: robot controllers sometimes have very little time to plan a trajectory to perform a desired function. To work around these limitations, a common solution is to split the motion planner into an offline phase and an online phase. The offline phase entails computing reference trajectories for varying parameterizations of the task space in the form of a lookup table. During the online phase, a stripped down version of the optimizer is supplied with a suitable initial guess from the lookup table using the current state estimate of the robot and its surrounding bodies. This method helps in alleviating problems related to both local minima and operational time constraints, by seeding the optimizer with a suitable initial guess that allows it to converge to the global minimum much faster. The problem, however, shifts to the computational complexity of computing a lookup table of reference trajectories for a fine enough discretization of the input state space. For many robotic scenarios of interest, it is often impractical and sometimes computationally infeasible to compute a lookup table using a serial, single core implementation of the offline phase of a motion planner. The main contribution of this work is to develop and evaluate a method for reducing the time spent on computing a lookup table of reference trajectories during the offline phase of motion planners based on optimal control. 
We implement a method to offload the computation of collision avoidance constraints during trajectory optimization on a Graphics Processing Unit (GPU), while simultaneously benefiting from a task based approach to distribute lookup table computations for independent subsets of the input state space across multiple processes on a cluster of machines. We demonstrate the efficacy of the proposed method in a practical setting by implementing and evaluating it within a representative motion planner based on optimal control. We observe that the implemented method is 115x faster than the original serial version of the planner, using 86 processes on 5 machines with standard server grade hardware and 5 Graphics Processing Units in total. Additionally, we observe that the implemented method results in solutions identical to the original serial version in 96.6% of cases, lending credibility for its use in robotic motion planning. / One of the biggest challenges with optimization-based methods for motion planning in robotics is their extreme sensitivity to a good initial guess, especially in the presence of local minima in the cost function landscape. Further challenges can also arise from operational constraints. Robot controllers sometimes have very little time to plan a path to perform a desired function. To work around these limitations, a common solution is to split the motion planner into an offline phase and an online phase. The offline phase includes computing reference paths for different points of the input state space in the form of a lookup table. During the online phase, a stripped-down version of the optimizer is supplied with a suitable initial guess from the lookup table using the current estimate of the robot and its surrounding bodies. 
This method helps alleviate problems related to both local minima and operating-time constraints by seeding the optimizer with a suitable initial guess that lets it converge to the global minimum much faster. The problem, however, now shifts to the computational complexity of computing a lookup table of reference paths for a sufficiently fine discretization of the input state space. For many robot scenarios of interest, it is often impractical and sometimes computationally impossible to compute a lookup table using a serial, single-core implementation of the offline phase of a motion planner. The main contribution of this work is to develop and evaluate a method for reducing the time spent computing a lookup table of reference paths during the offline phase of motion planners based on optimal control. We implement a method to perform collision avoidance on a graphics processing unit (GPU), while using a task-based approach to distribute lookup computations for independent subsets of the input space across multiple processes in a cluster of machines. We demonstrate the effectiveness of the proposed method in a practical setting by implementing and evaluating it within a representative motion planner based on optimal control. We note that the implemented method is 115 times faster than the original serial version of the planner, with 86 processes on 5 machines with standard hardware and a total of 5 GPUs. In addition, we observe that the implemented method yields solutions identical to the original serial version in more than 96.6% of cases, which lends credibility to its use in robotic motion planning. Motion Planning Robotics Trajectory Optimization GPGPU Parallel Programming Computer and Information Sciences Data- och informationsvetenskap
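The offline/online split described in this abstract can be illustrated with a deliberately tiny sketch. Nothing below comes from the thesis: the one-dimensional quadratic "planner", the function names (`refine`, `build_lookup_table`, `warm_start`), the grid, and the step counts are all assumptions chosen so the example runs in plain Python. Because the toy objective is convex, it only shows the time-budget benefit of a warm start, not the local-minima benefit.

```python
def refine(target, guess, steps, rate=0.1):
    """Toy stand-in for an optimizer: gradient descent on (x - target)**2."""
    x = guess
    for _ in range(steps):
        x -= rate * 2.0 * (x - target)
    return x


def build_lookup_table(grid, steps=200):
    """Offline phase: solve every grid point thoroughly from a cold start."""
    return {g: refine(g, 0.0, steps) for g in grid}


def warm_start(table, target):
    """Online phase: reuse the stored solution of the nearest grid point."""
    nearest = min(table, key=lambda g: abs(g - target))
    return table[nearest]


# Offline: a coarse discretization of the (hypothetical) task parameter.
grid = [float(g) for g in range(11)]
table = build_lookup_table(grid)

# Online: only a few optimizer steps are affordable.
target = 7.3
online_steps = 5
warm = refine(target, warm_start(table, target), online_steps)
cold = refine(target, 0.0, online_steps)

warm_error = abs(warm - target)
cold_error = abs(cold - target)
print(warm_error, cold_error)
```

Each table entry is computed independently of the others, which is why the offline phase lends itself to being distributed across processes as the abstract describes.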
190	Testing and Validation of a Prototype Gpgpu Design for FPGAs Merchant, Murtaza 01 January 2013 Due to their suitability for highly parallel and pipelined computation, field programmable gate arrays (FPGAs) and general-purpose graphics processing units (GPGPUs) have emerged as top contenders for hardware acceleration of high-performance computing applications. FPGAs are highly specialized devices that can be customized to a specific application, whereas GPGPUs are made of a fixed array of multiprocessors with a rigid architectural model. To alleviate this rigidity as well as to combine some other benefits of the two platforms, it is desirable to explore the implementation of a flexible GPGPU (soft GPGPU) using the reconfigurable fabric found in an FPGA. This thesis describes an aggressive effort to test and validate a prototype GPGPU design targeted to a Virtex-6 FPGA. Individual design stages are tested and integrated together using manually-generated RTL testbenches and logic simulation tools. The soft GPGPU design is validated by benchmarking the platform against five standard CUDA benchmarks. The platform is fully CUDA-compatible and supports direct execution of CUDA compiled binaries. Platform scalability is validated by varying the number of processing cores as well as multiprocessors, and evaluating their effects on area and performance. Experimental results show an average speedup of 25x for a 32 core soft GPGPU configuration over a fully optimized MicroBlaze soft microprocessor, accentuating the benefits of the thread-based execution model of GPUs and their ability to perform complex control flow operations in hardware. The testing and validation of the designed soft GPGPU system serves as a prerequisite for rapid design exploration of the platform in the future. GPGPU FPGA hardware acceleration CUDA compatible scalable flexible

Search results