  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
191

GPU-Assisted Collision Avoidance for Trajectory Optimization : Parallelization of Lookup Table Computations for Robotic Motion Planners Based on Optimal Control

Bishnoi, Abhiraj January 2021
One of the biggest challenges associated with optimization-based methods for robotic motion planning is their extreme sensitivity to a good initial guess, especially in the presence of local minima in the cost function landscape. Additional challenges may also arise due to operational constraints: robot controllers sometimes have very little time to plan a trajectory to perform a desired function. To work around these limitations, a common solution is to split the motion planner into an offline phase and an online phase. The offline phase entails computing reference trajectories for varying parameterizations of the task space in the form of a lookup table. During the online phase, a stripped-down version of the optimizer is supplied with a suitable initial guess from the lookup table using the current state estimate of the robot and its surrounding bodies. This method helps alleviate problems related to both local minima and operational time constraints by seeding the optimizer with a suitable initial guess that allows it to converge to the global minimum much faster.
The problem, however, shifts to the computational complexity of computing a lookup table of reference trajectories for a fine enough discretization of the input state space. For many robotic scenarios of interest, it is often impractical and sometimes computationally infeasible to compute a lookup table using a serial, single-core implementation of the offline phase of a motion planner. The main contribution of this work is to develop and evaluate a method for reducing the time spent on computing a lookup table of reference trajectories during the offline phase of motion planners based on optimal control.
We implement a method to offload the computation of collision avoidance constraints during trajectory optimization to a Graphics Processing Unit (GPU), while simultaneously benefiting from a task-based approach to distribute lookup table computations for independent subsets of the input state space across multiple processes on a cluster of machines. We demonstrate the efficacy of the proposed method in a practical setting by implementing and evaluating it within a representative motion planner based on optimal control. We observe that the implemented method is 115x faster than the original serial version of the planner, using 86 processes on 5 machines with standard server-grade hardware and 5 Graphics Processing Units in total. Additionally, we observe that the implemented method results in solutions identical to the original serial version in 96.6% of cases, lending credibility to its use in robotic motion planning.
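The offline/online split described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's planner: `build_lookup_table`, `nearest_initial_guess`, and `toy_offline_solver` are hypothetical names, the "solver" is a placeholder that draws a straight line to the origin, and nearest-neighbour lookup is an assumed way of picking the initial guess.

```python
import math


def build_lookup_table(grid_points, solve_offline):
    """Offline phase: compute one reference trajectory per grid point."""
    return {point: solve_offline(point) for point in grid_points}


def nearest_initial_guess(table, state):
    """Online phase: return the stored trajectory whose key is closest to `state`."""
    key = min(table, key=lambda p: math.dist(p, state))
    return table[key]


def toy_offline_solver(start, steps=4):
    """Hypothetical stand-in for the expensive optimizer: straight line to the origin."""
    return [tuple(c * (1 - k / steps) for c in start) for k in range(steps + 1)]


# Coarse 3x3 discretization of a 2D input state space (illustrative values only).
grid = [(x, y) for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0, 2.0)]
table = build_lookup_table(grid, toy_offline_solver)

# At run time the current state estimate selects a warm start for the optimizer.
guess = nearest_initial_guess(table, (1.2, 0.1))
```

Each table entry depends only on its own grid point, which is why the entries can in principle be computed independently, for example by separate processes as the abstract describes.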
192

Testing and Validation of a Prototype Gpgpu Design for FPGAs

Merchant, Murtaza 01 January 2013
Due to their suitability for highly parallel and pipelined computation, field programmable gate arrays (FPGAs) and general-purpose graphics processing units (GPGPUs) have emerged as top contenders for hardware acceleration of high-performance computing applications. FPGAs are highly specialized devices that can be customized to a specific application, whereas GPGPUs are made of a fixed array of multiprocessors with a rigid architectural model. To alleviate this rigidity as well as to combine some other benefits of the two platforms, it is desirable to explore the implementation of a flexible GPGPU (soft GPGPU) using the reconfigurable fabric found in an FPGA. This thesis describes an aggressive effort to test and validate a prototype GPGPU design targeted to a Virtex-6 FPGA. Individual design stages are tested and integrated together using manually generated RTL testbenches and logic simulation tools. The soft GPGPU design is validated by benchmarking the platform against five standard CUDA benchmarks. The platform is fully CUDA-compatible and supports direct execution of CUDA compiled binaries. Platform scalability is validated by varying the number of processing cores as well as multiprocessors, and evaluating their effects on area and performance. Experimental results show an average speedup of 25x for a 32-core soft GPGPU configuration over a fully optimized MicroBlaze soft microprocessor, accentuating the benefits of the thread-based execution model of GPUs and their ability to perform complex control flow operations in hardware. The testing and validation of the designed soft GPGPU system serves as a prerequisite for rapid design exploration of the platform in the future.
193

Parallel Mesh Adaptation and Graph Analysis Using Graphics Processing Units

Mcguiness, Timothy P 01 January 2011
In the field of Computational Fluid Dynamics, several types of mesh adaptation strategies are used to enhance a mesh’s quality, thereby improving simulation speed and accuracy. Mesh smoothing (r-refinement) is a simple and effective technique, where nodes are repositioned to increase or decrease local mesh resolution. Mesh partitioning divides a mesh into sections, for use on distributed-memory parallel machines. As a more abstract form of modeling, graph theory can be used to simulate many real-world problems, and has applications in the fields of computer science, sociology, engineering and transportation, to name a few. One of the more important graph analysis tasks involves moving through the graph to evaluate and calculate nodal connectivity. The basic structures of meshes and graphs are the same, as both rely heavily on connectivity information, representing the relationships between constituent nodes and edges. This research examines the parallelization of these algorithms using commodity graphics hardware, a low-cost tool readily available to the computing community. This research looks not only at the benefits of the fine-grained parallelism of an individual graphics processor, but also at the use of the Message Passing Interface (MPI) on large-scale GPU-based supercomputers.
194

Optimizing Harris Corner Detection on GPGPUs Using CUDA

Loundagin, Justin 01 March 2015
The objective of this thesis is to optimize the Harris corner detection algorithm implementation on NVIDIA GPGPUs using the CUDA software platform and measure the performance benefit. The Harris corner detection algorithm, developed by C. Harris and M. Stephens, discovers well-defined corner points within an image. The corner detection implementation has been proven to be computationally intensive, thus real-time performance is difficult with a sequential software implementation. This thesis decomposes the Harris corner detection algorithm into a set of parallel stages, each of which is implemented and optimized on the CUDA platform. The performance results show that by applying strategic CUDA optimizations to the Harris corner detection implementation, real-time performance is feasible. The optimized CUDA implementation of the Harris corner detection algorithm showed significant speedup over several platforms: standard C, MATLAB, and OpenCV. The optimized CUDA implementation of the Harris corner detection algorithm was then applied to a feature matching computer vision system, which showed significant speedup over the other platforms.
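For readers unfamiliar with the algorithm named in this abstract, the sketch below shows the usual stages of a Harris corner response (image gradients, windowed gradient products, and the score det(M) - k * trace(M)^2) in plain sequential Python. It is an illustrative assumption of how the stages fit together, not the thesis's CUDA implementation; the function name `harris_response`, the box window, and the value k = 0.04 are choices made here.

```python
def harris_response(img, k=0.04, radius=1):
    """Return the Harris corner response for a 2D list of grayscale values."""
    h, w = len(img), len(img[0])

    # Stage 1: central-difference gradients (border pixels are left at zero).
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0

    # Stages 2 and 3: sum gradient products over a box window, then score.
    resp = [[0.0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            sxx = syy = sxy = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    gx = ix[y + dy][x + dx]
                    gy = iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            resp[y][x] = det - k * trace * trace
    return resp


# Synthetic 8x8 image: a bright square occupying the lower-right quadrant.
image = [[1.0 if (y >= 4 and x >= 4) else 0.0 for x in range(8)] for y in range(8)]
response = harris_response(image)
```

In this test image the response is positive at the square's corner, negative along its straight edge, and zero in flat regions. Every output pixel depends only on a small neighbourhood of the input, which is what makes each stage a natural candidate for per-pixel parallel execution.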
195

CUDA Accelerated 3D Non-rigid Diffeomorphic Registration / CUDA-accelererad icke-rigid diffeomorf registrering i 3D

Qu, An January 2017
Advances in magnetic resonance imaging (MRI) techniques enable visual guidance to identify the anatomical target of interest during image-guided intervention (IGI). Non-rigid image registration is one of the crucial techniques, aligning the target tissue with the preoperative MRI image volumes. With the growing demand for real-time interaction in IGI, the time used for intraoperative registration is increasingly important. This work implements the 3D diffeomorphic demons algorithm on an Nvidia GeForce GTX 1070 GPU in C++, based on the CUDA 8.0.61 programming environment, with which the average registration time is reduced to 5 s. We have also extensively evaluated the GPU-accelerated 3D diffeomorphic registration against both a CPU implementation and Matlab code, and the results show that the GPU implementation achieves much better algorithmic efficiency.
196

Visual Inspection Of Railroad Tracks

Babenko, Pavel 01 January 2009
In this dissertation, we have developed computer vision methods for measurement of rail gauge, and reliable identification and localization of structural defects in railroad tracks. The rail gauge is the distance between the innermost sides of the two parallel steel rails. We have developed two methods for evaluation of rail gauge. These methods were designed for different hardware setups: the first method works with two pairs of unaligned video cameras while the second method works with depth maps generated by paired laser range scanners. We have also developed a method for detection of rail defects such as damaged or missing rail fasteners, tie clips, and bolts, based on correlation and MACH filters. Lastly, to make our algorithms perform in real-time, we have developed a GPU-based library for parallel computation of the above algorithms. Rail gauge is the most important measurement for track maintenance, because deviations in gauge indicate where potential defects may exist. We have developed a vision-based method for rail gauge estimation from a pair of industrial laser range scanners. In this approach, we start with building a 3D panorama of the rail out of a stack of input scans. After the panorama is built, we apply FIR circular filtering and Gaussian smoothing to the panorama buffer to suppress the noise component. In the next step we attempt to segment the rail heads in the panorama buffer. We employ a method which detects railroad crossings or forks in the panorama buffer. If they are not present, we find the rail edge using a robust line fit. If they are present, we use an alternative: we predict the rail edge positions using a Kalman filter. In the next step, common to both fork/crossing conditions, we find the adjusted positions of rail edges using additional clustering in the vicinity of the edge. We approximate the rail head surface by a third-degree polynomial and then fit two plane surfaces to find the exact position of the rail edge.
Lastly, using rail edge information, we calculate the rail gauge and smooth it with a 1D Gaussian filter. We have also developed a vision-based method to estimate the rail gauge from a pair of unaligned high shutter speed calibrated cameras. In this approach, the first step is to accurately detect the rail in each of the two non-overlapping synchronous images from the two cameras installed on the data collection cart by building an edge map, fitting lines into the edge map using the Hough transform, and detecting persistent edge lines using a history buffer. After railroad track parts are detected, we segment rails out to find rail edges and calculate the rail gauge. We have demonstrated how to apply Computer Vision methods (the correlation filters and MACH filters in particular) to find different types of railroad elements with fixed or similar appearance, like railroad clips, bolts, and rail plates, in real-time. Template-based approaches for object detection (correlation filters) directly compare gray scale image data to a predefined model or template. The drawback of the correlation filters has always been that they are neither scale nor rotation invariant, thus many different filters are needed if either scale or rotation changes. The application of many filters cannot be done in real-time. We have succeeded in overcoming this difficulty by using the parallel computation technology which is widely available in the GPUs of most advanced graphics cards. We have developed a library, MinGPU, which facilitates the use of GPUs for Computer Vision, and have also developed a MinGPU-based library of several Computer Vision methods, which includes, among others, an implementation of correlation filters on the GPU. We have achieved a true positive rate of 0.98 for fastener detection using an implementation of MACH filters on the GPU.
Besides correlation filters, MinGPU includes implementations of Lucas-Kanade Optical Flow, image homographies, edge detectors and discrete filters, image pyramids, morphology operations, and some graphics primitives. We have shown that the MinGPU implementation of homographies speeds up execution time approximately 600 times versus a C implementation and 8000 times versus a Matlab implementation. MinGPU is built upon a reusable core and thus is an easily expandable library. With the help of MinGPU, we have succeeded in making our algorithms work in real-time.
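The final step of the laser-scanner pipeline described in this abstract, smoothing the computed gauge with a 1D Gaussian filter, can be sketched as follows. This is only an assumed illustration: the function names, the kernel width, the replicate-border handling, and the gauge readings (in millimetres) are all made up here and do not come from the dissertation or from MinGPU.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a normalized 1D Gaussian kernel of length 2 * radius + 1."""
    raw = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [v / total for v in raw]


def smooth(signal, sigma=1.0, radius=3):
    """Convolve `signal` with a Gaussian kernel, replicating values at the borders."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)  # clamp to a valid index
            acc += weight * signal[idx]
        out.append(acc)
    return out


# Hypothetical gauge readings with one outlier in the middle.
gauge_mm = [1435.0] * 5 + [1445.0] + [1435.0] * 5
smoothed = smooth(gauge_mm)
```

The outlier is pulled toward its neighbours while constant stretches are left unchanged, which is the behaviour one would want before flagging gauge deviations.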
197

CUDA Enhanced Filtering In a Pipelined Video Processing Framework

Dworaczyk Wiltshire, Austin Aaron 01 June 2013
The processing of digital video has long been a significant computational task for modern x86 processors. With every video frame composed of one to three planes, each consisting of a two-dimensional array of pixel data, and a video clip comprising thousands of such frames, the sheer volume of data is significant. With the introduction of new high definition video formats such as 4K or stereoscopic 3D, the volume of uncompressed frame data is growing ever larger. Modern CPUs offer performance enhancements for processing digital video through SIMD instructions such as SSE2 or AVX. However, even with these instruction sets, CPUs are limited by their inherently sequential design, and can only operate on a handful of bytes in parallel. Even processors with a multitude of cores only execute on an elementary level of parallelism. GPUs provide an alternative, massively parallel architecture. GPUs differ from CPUs by providing thousands of throughput-oriented cores, instead of a maximum of tens of generalized “good enough at everything” x86 cores. The GPU’s throughput-oriented cores are far more adept at handling large arrays of pixel data, as many video filtering operations can be performed independently. This computational independence allows for pixel processing to scale across hundreds or even thousands of device cores. This thesis explores the utilization of GPUs for video processing, and evaluates the advantages and caveats of porting the modern video filtering framework, Vapoursynth, over to running entirely on the GPU. Compute-heavy GPU-enabled video processing results in up to a 108% speedup over an SSE2-optimized, multithreaded CPU implementation.
198

GPU High-Performance Framework for PIC-Like Simulation Methods Using the Vulkan® Explicit API

Yager, Kolton Jacob 01 March 2021
Within computational continuum mechanics there exists a large category of simulation methods which operate by tracking Lagrangian particles over an Eulerian background grid. These Lagrangian/Eulerian hybrid methods, descendants of the Particle-In-Cell method (PIC), have proven highly effective at simulating a broad range of materials and mechanics including fluids, solids, granular materials, and plasma. These methods remain an area of active research after several decades, and their applications can be found across scientific, engineering, and entertainment disciplines. This thesis presents a GPU driven PIC-like simulation framework created using the Vulkan® API. Vulkan is a cross-platform and open-standard explicit API for graphics and GPU compute programming. Compared to its predecessors, Vulkan offers lower overhead, support for host parallelism, and finer grain control over both device resources and scheduling. This thesis harnesses those advantages to create a programmable GPU compute pipeline backed by a Vulkan adaptation of the SPgrid data-structure and multi-buffered particle arrays. The CPU host system works asynchronously with the GPU to maximize utilization of both the host and device. The framework is demonstrated to be capable of supporting Particle-in-Cell like simulation methods, making it viable for GPU acceleration of many Lagrangian particle on Eulerian grid hybrid methods. This novel framework is the first of its kind to be created using Vulkan® and to take advantage of GPU sparse memory features for grid sparsity.
199

Hardware Accelerated Particle Filter for Lane Detection and Tracking in OpenCL

Madduri, Nikhil January 2014
A road lane detection and tracking algorithm is developed, especially tailored to run on high-performance heterogeneous hardware like GPUs and FPGAs in autonomous road vehicles. The algorithm was initially developed in C/C++ and was ported to OpenCL, which supports computation on heterogeneous hardware. A novel road lane detection algorithm is proposed using random sampling of particles modeled as straight lines. Weights are assigned to these particles based on their location in the gradient image. To improve the computational efficiency of the lane detection algorithm, lane tracking is introduced in the form of a Particle Filter. Creation of the particles in the lane detection step and the prediction and measurement updates in the lane tracking step are computed in parallel on the GPU/FPGA using OpenCL code, while the rest of the code runs on a host CPU. The software was tested on two GPUs, an NVIDIA GeForce GTX 660 Ti and an NVIDIA GeForce GTX 285, and on an FPGA, an Altera Stratix-V, which gave computational frame rates of up to 104 Hz, 79 Hz and 27 Hz respectively. The code was tested on video streams from five different datasets with different scenarios of varying lighting conditions on the road, strong shadows and the presence of light to moderate traffic, and was found to be robust in all the situations for detecting a single lane.
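The predict, weight, and resample cycle of a particle filter, which this abstract names as the tracking mechanism, can be sketched in a few lines of Python. This is a deliberately simplified assumption: each particle here is a single lateral lane offset rather than a straight line, the weight is a Gaussian likelihood of a scalar measurement rather than a score from the gradient image, and all names and noise values are hypothetical.

```python
import math
import random


def particle_filter_step(particles, measurement, motion_std=0.1, meas_std=0.5, rng=random):
    """Run one predict / weight / resample cycle over scalar particles."""
    # Predict: diffuse every particle with Gaussian motion noise.
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]

    # Weight: Gaussian likelihood of the measurement given each particle.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in predicted]
    if sum(weights) == 0.0:
        weights = [1.0] * len(predicted)  # fall back to uniform weights

    # Resample: draw a new population in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


rng = random.Random(0)  # fixed seed so the run is repeatable
particles = [rng.uniform(-5.0, 5.0) for _ in range(500)]
true_offset = 2.0  # hypothetical lateral lane offset

for _ in range(10):
    particles = particle_filter_step(particles, true_offset, rng=rng)

estimate = sum(particles) / len(particles)
```

The prediction and weighting steps treat every particle independently, which is consistent with the abstract's choice to run those steps on the GPU or FPGA while the remaining code stays on the host CPU.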
200

GPGPU microbenchmarking for irregular application optimization

Winans-Pruitt, Dalton R. 09 August 2022
Irregular applications, such as unstructured mesh operations, do not easily map onto the typical GPU programming paradigms endorsed by GPU manufacturers, which mostly focus on maximizing concurrency for latency hiding. In this work, we show how alternative techniques focused on latency amortization can be used to control overall latency while requiring less concurrency. We used a custom-built microbenchmarking framework to test several GPU kernels and show how the GPU behaves under relevant workloads. We demonstrate that coalescing is not required for efficacious performance; an uncoalesced access pattern can achieve high bandwidth, even over 80% of the theoretical global memory bandwidth in certain circumstances. We also make further observations on specific relevant behaviors of GPUs. We hope that this study opens the door for further investigation into techniques that can exploit latency amortization when latency hiding does not achieve sufficient performance.
