1.
Feature detection in an indoor environment using Hardware Accelerators for time-efficient Monocular SLAM
Vyas, Shivang, 03 August 2015
In the field of robotics, Monocular Simultaneous Localization and Mapping (Monocular SLAM) has gained immense popularity, as it replaces large and costly sensors such as laser range finders with a single cheap camera. Additionally, the well-developed area of computer vision provides robust image processing algorithms that aid in developing feature detection techniques for the implementation of Monocular SLAM. Similarly, in the field of digital electronics and embedded systems, hardware acceleration using FPGAs has become quite popular. Hardware acceleration is based on the idea of offloading certain iterative algorithms from the processor and implementing them on a dedicated piece of hardware, such as an ASIC or FPGA, to speed up performance in terms of timing and possibly reduce the net power consumption of the system. Good strides have been made in developing massively pipelined and resource-efficient hardware implementations of several image processing algorithms on FPGAs, which achieve a fairly decent speed-up in processing time. In this thesis, we have developed a simple algorithm for feature detection in an indoor environment by means of a single camera, based on the Canny Edge Detection and Hough Transform algorithms using the OpenCV library, and proposed its integration with an existing feature initialization technique for a complete Monocular SLAM implementation. Following this, we have developed hardware accelerators for Canny Edge Detection and the Hough Transform, and we have compared the timing performance of the hardware implementation (using FPGAs) with a software implementation (using C++ and OpenCV).
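A minimal OpenCV sketch of the detection stage just described (Canny edges followed by a probabilistic Hough line transform); the input image name and all threshold values are illustrative assumptions, not the thesis's tuned settings:

```cpp
// Illustrative sketch: Canny edge detection followed by a probabilistic Hough transform,
// the kind of pipeline used for indoor line-feature detection. Thresholds are assumed.
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat frame = cv::imread("indoor_frame.png", cv::IMREAD_GRAYSCALE); // assumed input image
    if (frame.empty()) return 1;

    cv::Mat blurred, edges;
    cv::GaussianBlur(frame, blurred, cv::Size(5, 5), 1.5);    // suppress noise before edge detection
    cv::Canny(blurred, edges, 50, 150);                       // hysteresis thresholds: assumed values

    std::vector<cv::Vec4i> lines;                             // each line: (x1, y1, x2, y2)
    cv::HoughLinesP(edges, lines, 1, CV_PI / 180, 80, 30, 10);// 1 px, 1 deg bins; 80 votes; 30 px min length

    // The detected line segments would then feed the feature initialization step of Monocular SLAM.
    for (const auto& l : lines)
        cv::line(frame, {l[0], l[1]}, {l[2], l[3]}, cv::Scalar(255), 2);
    cv::imwrite("detected_lines.png", frame);
    return 0;
}
```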
2.
Assessing OpenGL for 2D rendering of geospatial data
Jacobson, Jared Neil, January 2014
The purpose of this study was to investigate the suitability of using the OpenGL and OpenCL application programming interfaces (APIs) to increase the speed at which 2D vector geographic information could be rendered. The research focused on rendering APIs available to the Windows operating system.
In order to determine the suitability of OpenGL for efficiently rendering geographic data, this dissertation looked at how software-based and hardware-based rendering performed. The results were then compared to those of the different rendering APIs. In order to collect the data necessary to achieve this, an in-depth study of geographic information systems (GIS), geographic coordinate systems, OpenGL and OpenCL was conducted. A simple 2D geographic rendering engine was then constructed using a number of graphics APIs, including GDI, GDI+, DirectX, OpenGL and Direct2D. The purpose of the developed rendering engine was to provide a tool on which to perform a number of rendering experiments. A large dataset was then rendered via each of the implementations. The processing times as well as image quality were recorded and analysed. This research also investigated potential issues such as supplying data to be rendered to the API as quickly as possible, which was needed to ensure saturation at the API level. Other aspects, such as difficulty of implementation and implementation differences, were examined.
Additionally, leveraging the OpenCL API in conjunction with the TopoJSON storage format as a means of data compression was investigated. Compression is beneficial in that, to get optimal rendering performance from OpenGL, the graphics data to be rendered needs to reside in the graphics processing unit (GPU) memory bank. More data in GPU memory in turn theoretically provides faster rendering times. The aim was to utilise the extra processing power of the GPU to decode the data and then pass it to the OpenGL API for rendering and display. This was achievable via OpenGL/OpenCL context sharing.
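A minimal sketch of the OpenGL/OpenCL context-sharing step just mentioned, using the cl_khr_gl_sharing extension on Windows (WGL); the platform and device handles, the GL buffer object and the decode kernel are assumed to have been created already, and error handling is omitted:

```cpp
// Hedged sketch: create an OpenCL context that shares the current OpenGL context (Windows/WGL),
// wrap an existing GL vertex buffer as a CL buffer, and decode into it before GL renders it.
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <windows.h>

cl_context createSharedContext(cl_platform_id platform, cl_device_id device) {
    cl_context_properties props[] = {
        CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(), // current GL context
        CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),      // its device context
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        0
    };
    cl_int err = CL_SUCCESS;
    return clCreateContext(props, 1, &device, nullptr, nullptr, &err);
}

// glVbo is an existing OpenGL buffer object; decodeKernel writes decoded vertices into it.
void decodeIntoGlBuffer(cl_context ctx, cl_command_queue queue,
                        cl_kernel decodeKernel, unsigned glVbo, size_t workItems) {
    cl_int err = CL_SUCCESS;
    cl_mem shared = clCreateFromGLBuffer(ctx, CL_MEM_WRITE_ONLY, glVbo, &err);
    clEnqueueAcquireGLObjects(queue, 1, &shared, 0, nullptr, nullptr);  // GL must not touch it now
    clSetKernelArg(decodeKernel, 0, sizeof(cl_mem), &shared);
    clEnqueueNDRangeKernel(queue, decodeKernel, 1, nullptr, &workItems, nullptr, 0, nullptr, nullptr);
    clEnqueueReleaseGLObjects(queue, 1, &shared, 0, nullptr, nullptr);  // hand it back to OpenGL
    clFinish(queue);
    clReleaseMemObject(shared);
}
```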
The results of the research showed that, on average, the OpenGL API provided a significant speedup of between nine and fifteen times over GDI and GDI+. This means a faster and more performant rendering engine could be built with OpenGL at its core. Additional experiments showed that the OpenGL API performed faster than GDI and GDI+ even when a dedicated graphics device was not present. A challenge early in the experiments was related to the supply of data to the graphics API. Disk access is orders of magnitude slower than the rest of the
computer system. As such, in order to saturate the different graphics APIs, data had to be loaded into main
memory.
Using the TopoJSON storage format yielded decent data compression, allowing a larger amount of data to be
stored on the GPU. However, in an initial experiment, it took longer to process the TopoJSON file into a flat
structure that could be utilised by OpenGL than to simply use the actual created objects, process them on the
central processing unit (CPU) and then upload them directly to OpenGL. It is left as future work to develop a
more efficient algorithm for converting from TopoJSON format to a format that OpenCL can utilise. / Dissertation (MSc)--University of Pretoria, 2014. / tm2015 / Geography, Geoinformatics and Meteorology / MSc / Unrestricted
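For reference, TopoJSON stores arc geometry as delta-encoded, quantized integer coordinates plus a transform (scale and translate). A small sketch of the flattening step discussed above, decoding one arc into floating-point vertex positions on the CPU; the field names mirror the TopoJSON specification, but the surrounding data structures are assumptions, not part of the dissertation's implementation:

```cpp
// Hedged sketch: decode one delta-encoded, quantized TopoJSON arc into x/y floats
// ready for upload to a vertex buffer. Structures are illustrative, not a full parser.
#include <cstdint>
#include <utility>
#include <vector>

struct TopoTransform {              // "transform" member of a quantized topology
    double scaleX, scaleY;          // "scale": [sx, sy]
    double translateX, translateY;  // "translate": [tx, ty]
};

// An arc is a sequence of integer deltas; the first pair is absolute in quantized space,
// which is equivalent to a delta from (0, 0) below.
std::vector<std::pair<float, float>>
decodeArc(const std::vector<std::pair<int32_t, int32_t>>& arc, const TopoTransform& t) {
    std::vector<std::pair<float, float>> out;
    out.reserve(arc.size());
    int64_t qx = 0, qy = 0;         // running (cumulative) quantized position
    for (const auto& d : arc) {
        qx += d.first;
        qy += d.second;
        out.emplace_back(static_cast<float>(qx * t.scaleX + t.translateX),
                         static_cast<float>(qy * t.scaleY + t.translateY));
    }
    return out;
}
```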
3.
FPGA Acceleration of Decision-Based Problems using Heterogeneous Computing
Thong, Jason, January 2014
The Boolean satisfiability (SAT) problem is central to many applications involving the verification and optimization of digital systems. These combinatorial problems are typically solved using a decision-based approach; however, the lengthy compute time of SAT can make it prohibitively impractical for some applications.
We discuss how the underlying physical characteristics of various technologies affect the practicality of SAT solvers. Power dissipation and other physical limitations are increasingly restricting the improvement in performance of conventional software on CPUs. We use heterogeneous computing to maximize the strengths of different underlying technologies as well as different computing architectures.
In this thesis, we present a custom hardware architecture for accelerating the common computation within a SAT solver. Algorithms and data structures must be fundamentally redesigned in order to maximize the strengths of customized computing. Generalizable optimizations are proposed to maximize throughput, minimize communication latencies, and aggressively compact the memory. We tightly integrate and jointly optimize the hardware accelerator and the software host.
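The common computation in modern decision-based SAT solvers is typically Boolean constraint propagation (unit propagation). The thesis's hardware architecture is not reproduced here, but a minimal software sketch of the kernel it targets, using a naive full clause scan rather than the watched-literal scheme real solvers use, looks like this:

```cpp
// Hedged sketch: naive unit propagation over a CNF formula, the kind of inner loop a
// SAT accelerator offloads. Real solvers use watched literals; this is illustrative only.
#include <cstdlib>
#include <vector>

// Literal encoding: +v means variable v is true, -v means false.
// assign[v] is 1 (true), -1 (false) or 0 (unassigned); index 0 is unused.
using Clause = std::vector<int>;

bool propagate(const std::vector<Clause>& clauses, std::vector<int>& assign) {
    bool changed = true;
    while (changed) {                                       // repeat until a fixed point or a conflict
        changed = false;
        for (const Clause& c : clauses) {
            int unassigned = 0, lastFree = 0;
            bool satisfied = false;
            for (int lit : c) {
                int v = std::abs(lit), val = assign[v];
                if (val == 0) { ++unassigned; lastFree = lit; }
                else if ((val > 0) == (lit > 0)) { satisfied = true; break; }
            }
            if (satisfied) continue;
            if (unassigned == 0) return false;              // conflict: clause falsified
            if (unassigned == 1) {                          // unit clause: forced assignment
                assign[std::abs(lastFree)] = (lastFree > 0) ? 1 : -1;
                changed = true;
            }
        }
    }
    return true;                                            // no conflict found
}
```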
Our fully implemented system is significantly faster than pure software on real-life SAT problems. Due to our insights and optimizations, we are able to benchmark SAT in uncharted territory. / Thesis / Doctor of Philosophy (PhD)
4.
Optimized hardware accelerators for data mining applications
Kanan, Awos, 19 February 2018
Data mining plays an important role in a variety of fields, including bioinformatics, multimedia, business intelligence, marketing, and medical diagnosis. Analysis of today's huge and complex data involves several data mining algorithms, including clustering and classification. The computational complexity of the machine learning and data mining algorithms frequently used in today's applications, such as embedded systems, makes the design of efficient hardware architectures for these algorithms a challenging issue in the development of such systems. The aim of this work is to optimize the performance of hardware acceleration for data mining applications in terms of speed and area. Most of the previous accelerator architectures proposed in the literature have been obtained using ad hoc techniques that do not allow for design space exploration, and some did not consider the size (number of samples) and dimensionality (number of features per sample) of the datasets. To obtain practical architectures that are amenable to hardware implementation, the size and dimensionality of input datasets are taken into consideration in this work.
For one-dimensional data, algorithm-level optimizations are investigated to design a fast and area-efficient hardware accelerator for clustering one-dimensional datasets using the well-known K-Means clustering algorithm. Experimental results show that the optimizations adopted in the proposed architecture result in faster convergence of the algorithm using fewer hardware resources while maintaining the quality of the clustering results.
The computation of similarity distance matrices is one of the computational kernels generally required by several machine learning and data mining algorithms to measure the degree of similarity between data samples. For these algorithms, distance calculation is a computationally intensive task that accounts for a significant portion of the processing time. A systematic methodology is presented to explore the design space of 2-D and 1-D processor array architectures for the similarity distance computation involved in processing datasets of different sizes and dimensions. Six 2-D and six 1-D processor array architectures are developed systematically using linear scheduling and projection operations. The obtained architectures are classified based on the size and dimensionality of the input datasets, analyzed in terms of speed and area, and compared with previous architectures in the literature. Motivated by the need to accommodate large-scale and high-dimensional data, nonlinear scheduling and projection operations are finally introduced to design a scalable processor array architecture for the computation of similarity distance matrices. Implementation results of the proposed architecture show an improved compromise between area and speed. Moreover, it scales better for large and high-dimensional datasets, since the architecture is fully parameterized and only has to deal with one data dimension in each time step. / Graduate / 2019-12-31
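As a point of reference for the kernel discussed above, a plain software version of the pairwise similarity distance matrix computation is sketched below; this is only the reference computation the processor arrays parallelize, not the proposed architecture itself, and the choice of squared Euclidean distance is an assumption:

```cpp
// Hedged sketch: reference pairwise squared-Euclidean distance matrix for N samples of
// dimension D, the O(N^2 * D) kernel that the proposed processor arrays accelerate.
#include <cstddef>
#include <vector>

std::vector<std::vector<float>>
distanceMatrix(const std::vector<std::vector<float>>& samples) {
    const std::size_t n = samples.size();
    std::vector<std::vector<float>> dist(n, std::vector<float>(n, 0.0f));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {           // symmetric: compute upper triangle only
            float acc = 0.0f;
            for (std::size_t d = 0; d < samples[i].size(); ++d) {
                const float diff = samples[i][d] - samples[j][d];
                acc += diff * diff;                          // squared Euclidean distance
            }
            dist[i][j] = dist[j][i] = acc;
        }
    }
    return dist;
}
```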
5.
Hardware Implementation Techniques for JPEG2000
Dyer, Michael Ian, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, January 2007
JPEG2000 is a recently standardized image compression system that provides substantial improvements over the existing JPEG compression scheme. This improvement in performance comes with an associated cost in increased implementation complexity, such that a purely software implementation is inefficient. This work identifies the arithmetic coder as a bottleneck in efficient hardware implementations and explores various design options to improve arithmetic coder speed and size. The designs produced improve the critical path of existing arithmetic coder designs and then extend the coder throughput to two or more symbols per clock cycle. Subsequent work examines system-level implementation issues, studying the communication between hardware blocks and utilizing certain modes of operation to add flexibility to buffering solutions. It becomes possible to significantly reduce the amount of intermediate buffering between blocks whilst maintaining loose synchronization. Full hardware implementations of the standard are necessarily limited in the number of features they can offer, in order to constrain complexity and cost. To circumvent this, a hardware/software codesign is produced using the Altera NIOS II softcore processor. By keeping the majority of the standard implemented in software and using hardware to accelerate the time-consuming functions, generality of implementation can be retained whilst implementation speed is improved. In addition, there is the opportunity to exploit parallelism by providing multiple identical hardware blocks to code multiple data units simultaneously.
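To make the bottleneck concrete, the sketch below models only the per-symbol interval update and renormalisation loop of a context-adaptive binary arithmetic coder of the kind JPEG2000 uses; it omits the code register, conditional exchange and probability-state update, so it illustrates the loop-carried dependency that limits throughput rather than being a functioning MQ-coder:

```cpp
// Hedged model (not a working MQ-coder): each symbol's interval update depends on the
// interval register left by the previous symbol, and the renormalisation loop has a
// data-dependent length. This serial chain is why multi-symbol-per-clock hardware is hard.
#include <cstdint>

struct IntervalState {
    uint32_t A  = 0x8000;   // interval size register (16-bit in the standard)
    uint32_t Qe = 0x5601;   // current LPS probability estimate (an assumed example value)
};

// Returns how many renormalisation shifts this symbol triggers.
int modelEncodeSymbol(IntervalState& s, bool symbolIsMPS) {
    if (symbolIsMPS) s.A -= s.Qe;   // MPS path: shrink the interval by the LPS estimate
    else             s.A  = s.Qe;   // LPS path (simplified): interval becomes the LPS estimate
    int shifts = 0;
    while (s.A < 0x8000) {          // renormalise until the interval register is back in range
        s.A <<= 1;
        ++shifts;                   // the code register would be shifted here as well
    }
    return shifts;
}
```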
6.
FPGA Acceleration of CNNs Using OpenCL
January 2020
Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in numerous applications such as computer vision, natural language processing, and robotics. The advancement of high-performance computing systems equipped with dedicated hardware accelerators has also paved the way towards the success of compute-intensive CNNs. Graphics Processing Units (GPUs), with their massive processing capability, have been of general interest for the acceleration of CNNs. Recently, Field Programmable Gate Arrays (FPGAs) have shown promise for CNN acceleration, since they offer high performance while also being reconfigurable to support the evolution of CNNs. This work focuses on a design methodology to accelerate CNNs on FPGAs with low inference latency and high throughput, which are crucial for scenarios such as self-driving cars and video surveillance. It also includes optimizations that reduce resource utilization by a large margin with a small degradation in performance, thus making the design suitable for low-end FPGA devices as well.
FPGA accelerators often suffer from limited main-memory bandwidth. In addition, highly parallel designs with large resource utilization often end up achieving a low operating frequency due to poor routing. This work employs data fetch and buffer mechanisms, designed specifically for the memory access pattern of CNNs, that overlap computation with memory access. It also proposes a novel arrangement of the systolic processing-element array that achieves high frequency and consumes fewer resources than existing works. Support has also been extended to more complicated CNNs for video processing. On an Intel Arria 10 GX1150, the design operates at a frequency as high as 258 MHz and performs a single inference of VGG-16 and C3D in 23.5 ms and 45.6 ms, respectively. For VGG-16 and C3D, the design offers a throughput of 66.1 and 23.98 inferences/s, respectively. This design can outperform other FPGA 2D CNN accelerators by up to 9.7 times and 3D CNN accelerators by up to 2.7 times. / Dissertation/Thesis / Masters Thesis Computer Science 2020
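The overlap of computation with memory access described above is essentially double (ping-pong) buffering; a self-contained software sketch of the idea is given below, where the "slow memory" and the per-tile computation are stand-ins for the accelerator's DRAM interface and systolic array, not the thesis's actual design:

```cpp
// Hedged sketch: ping-pong buffering that fetches the next tile of data while the current
// tile is being processed, the software analogue of the accelerator's buffer scheme.
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

static std::vector<float> g_slowMemory(1 << 20, 1.0f);      // stand-in for external DRAM
constexpr std::size_t kTileSize = 4096;

std::vector<float> fetchTile(std::size_t tileIdx) {         // stand-in for a DRAM burst read
    auto begin = g_slowMemory.begin() + tileIdx * kTileSize;
    return std::vector<float>(begin, begin + kTileSize);
}

float computeTile(const std::vector<float>& tile) {         // stand-in for the systolic array
    return std::accumulate(tile.begin(), tile.end(), 0.0f);
}

// Caller must keep numTiles * kTileSize within g_slowMemory.
float processAllTiles(std::size_t numTiles) {
    float total = 0.0f;
    if (numTiles == 0) return total;
    std::vector<float> current = fetchTile(0);
    for (std::size_t i = 0; i < numTiles; ++i) {
        std::future<std::vector<float>> next;
        if (i + 1 < numTiles)                               // start the next fetch early...
            next = std::async(std::launch::async, fetchTile, i + 1);
        total += computeTile(current);                      // ...so it overlaps this computation
        if (i + 1 < numTiles)
            current = next.get();
    }
    return total;
}
```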
7.
Performance and Perceived Realism in Rasterized 3D Sound Propagation for Interactive Virtual Environments
Hansson, Karl; Hernvall, Mikael, January 2019
Background. 3D sound propagation is important for immersion and realism in interactive and dynamic virtual environments. However, this is difficult to model in a physically accurate manner under real-time constraints. Computer graphics techniques are used in acoustics research to increase performance, yet there is little utilization of the especially efficient rasterization techniques, possibly due to concerns about physical accuracy. Fortunately, psychoacoustics has shown that perceived realism does not equate to physical accuracy. This indicates that perceptually realistic and high-performance 3D sound propagation may be achievable with rasterization techniques. Objectives. This thesis investigates whether 3D sound propagation can be modelled with high performance and perceived realism using rasterization-based techniques. Methods. A rasterization-based solution for 3D sound propagation is implemented. Its perceived realism is measured using psychoacoustic evaluations. Its performance is analyzed through computation time measurements with varying sound-source and triangle counts, and through theoretical calculations of memory consumption. The performance and perceived realism of the rasterization-based solution are compared with those of an existing solution. Results. The rasterization-based solution shows both higher performance and higher perceived realism than the existing solution. Conclusions. 3D sound propagation can be modelled with high performance and perceived realism using rasterization-based techniques. Thus, rasterized 3D sound propagation may provide efficient, low-cost, perceptually realistic 3D audio for areas where immersion and perceptual realism are important, such as video games, serious games, live entertainment events, architectural design, art production and training simulations.
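One simple way rasterization can stand in for ray-based occlusion queries (offered purely as an illustration of the general approach, not as the authors' implementation) is to rasterize occluding triangles, as seen from the listener looking toward the source, into a small coverage mask and derive an occlusion ratio from it:

```cpp
// Hedged sketch: edge-function rasterization of projected occluder triangles into a small
// coverage mask; the covered fraction can then drive a per-source attenuation or filter.
#include <array>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };                 // triangle vertices already projected to [0,W) x [0,H)

static float edge(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (p.x - a.x) * (b.y - a.y) - (p.y - a.y) * (b.x - a.x);
}

// Returns the fraction of the W x H mask covered by at least one triangle.
float occlusionRatio(const std::vector<std::array<Vec2, 3>>& tris, int W, int H) {
    std::vector<unsigned char> mask(static_cast<std::size_t>(W) * H, 0);
    for (const auto& t : tris) {
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                Vec2 p{ x + 0.5f, y + 0.5f };                  // sample at the pixel centre
                float w0 = edge(t[1], t[2], p);
                float w1 = edge(t[2], t[0], p);
                float w2 = edge(t[0], t[1], p);
                bool inside = (w0 >= 0 && w1 >= 0 && w2 >= 0) ||
                              (w0 <= 0 && w1 <= 0 && w2 <= 0); // accept either winding order
                if (inside) mask[static_cast<std::size_t>(y) * W + x] = 1;
            }
        }
    }
    std::size_t covered = 0;
    for (unsigned char m : mask) covered += m;
    return static_cast<float>(covered) / static_cast<float>(mask.size());
}
```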
8.
Real-time Rendering of Burning Objects in Video Games
Amarasinghe, Dhanyu Eshaka, 08 1900
In recent years there has been growing interest in limitless realism in computer graphics applications. My foremost concentration falls on complex physical simulation and modeling, with diverse applications in the gaming industry. Various simulations have succeeded by replicating the details of physical processes, and some are convincing enough to draw the user into believable virtual worlds. In this research, I focus on fire simulation and the deformation it causes to various virtual objects. In most game engines, model loading takes place at the beginning of the game or when the game is transitioning between levels, and game models are stored in large data structures. Changing or adjusting a large data structure while the game is running may adversely affect performance, so developers may choose to avoid procedural simulations to save resources and prevent interruptions in performance. I introduce a process for implementing real-time model deformation while maintaining performance. It is a challenging task to achieve high-quality simulation while using minimal resources to represent multiple events in a timely manner; in video games especially, the simulation must be robust enough to sustain the player's willing suspension of disbelief. I have implemented and tested my method on a relatively modest GPU using CUDA. My experiments show that this method gives a believable visual effect while using a small fraction of CPU and GPU resources.
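A hedged CPU-side sketch of the kind of per-vertex deformation a burning effect might apply; the thesis performs the equivalent update on the GPU with CUDA, and the specific rule here, pulling vertices inward along their normals as a burn value accumulates, is an illustrative assumption rather than the thesis's method:

```cpp
// Hedged sketch: deform a mesh in place by pulling each vertex inward along its normal
// in proportion to an accumulating per-vertex burn amount. Illustrative only; the thesis
// runs a comparable update as a CUDA kernel over the GPU-resident vertex data.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vertex {
    float ox, oy, oz;   // original (rest) position
    float nx, ny, nz;   // unit normal
    float burn;         // accumulated burn in [0, 1]
    float px, py, pz;   // current (deformed) position
};

void applyBurnDeformation(std::vector<Vertex>& mesh,
                          const std::vector<float>& heatThisFrame,  // per-vertex heat input
                          float dt, float maxShrink) {
    for (std::size_t i = 0; i < mesh.size(); ++i) {
        Vertex& v = mesh[i];
        v.burn = std::min(1.0f, v.burn + heatThisFrame[i] * dt);    // accumulate and clamp burn
        const float d = v.burn * maxShrink;                         // fully burnt => maxShrink offset
        v.px = v.ox - v.nx * d;                                     // deform from the rest position
        v.py = v.oy - v.ny * d;
        v.pz = v.oz - v.nz * d;
    }
}
```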
9.
Indexing Large Permutations in Hardware
Odom, Jacob Henry, 07 June 2019
Generating unbiased permutations at run time has traditionally been accomplished through application-specific optimized combinational logic and has been limited to very small permutations. For generating unbiased permutations of any larger size, variations of the memory-dependent Fisher-Yates algorithm are known to be an optimal solution in software and have been relied on as a hardware solution even to this day. However, this thesis proves that in hardware Fisher-Yates is a suboptimal solution. It shows variations of Fisher-Yates to be suboptimal by proposing an alternate method that does not rely on memory, outperforms Fisher-Yates-based permutation generators, and still scales to very large permutations. This thesis also proves that the proposed method is unbiased and requires minimal input. Lastly, it demonstrates a means to scale the proposed method to permutations of any size and to produce optimal partial permutations. / Master of Science / In computing, some applications need the ability to shuffle or rearrange items based on run-time information during their normal operations. A similar task is a partial shuffle, where only an information-dependent selection of the total items is returned in shuffled order. Initially, there may be the assumption that these are trivial tasks. However, the applications that rely on this ability are typically related to security, which requires repeatable, unbiased operations. These requirements quickly turn seemingly simple tasks into complex ones. Worse, they are often done incorrectly and only appear to meet these requirements, which has disastrous implications for security. A current and dominant method to shuffle items that meets these requirements was developed over fifty years ago and is based on an even older algorithm, referred to as Fisher-Yates after its original authors. Fisher-Yates-based methods shuffle items in memory, which is seen as advantageous in software but serves only as a disadvantage in hardware, since memory access is significantly slower than other operations. Additionally, when performing a partial shuffle, Fisher-Yates methods require the same resources as when performing a complete shuffle, because with Fisher-Yates methods each element in a shuffle is dependent on all of the other elements. Alternate methods that meet these requirements are known, but they are only able to shuffle a very small number of items before becoming too slow for practical use. To combat the disadvantages of current shuffling methods, this thesis proposes an alternate approach to performing shuffles that meets the previously stated requirements while outperforming current methods. This alternate approach can also be extended to shuffling any number of items while maintaining a usable level of performance. Further, unlike current popular shuffling methods, the proposed method has no inter-item dependency and thus offers great advantages for partial shuffles.
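For context, the memory-dependent baseline the thesis argues against is the classic Fisher-Yates shuffle; a standard software version is sketched below (this is the baseline only, not the thesis's memory-free hardware method, which is not reproduced here):

```cpp
// The classic Fisher-Yates (Knuth) shuffle: unbiased, but every step reads and writes an
// item array held in memory, which is the dependency the thesis's hardware approach avoids.
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

std::vector<std::size_t> fisherYatesPermutation(std::size_t n, std::mt19937_64& rng) {
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), 0);                 // start from the identity permutation
    for (std::size_t i = n; i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(perm[i - 1], perm[pick(rng)]);            // each swap touches the stored array
    }
    return perm;
}
```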
10.
Acceleration of a bioinformatics application using high-level synthesis / Accélération d'une application en bioinformatique utilisant une synthèse de haut niveau
Abbas, Naeem, 22 May 2012
The revolutionary advancements in the field of bioinformatics have opened new horizons in biological and pharmaceutical research. However, existing bioinformatics tools are unable to meet the computational demands caused by the recent exponential growth in biological data, so there is a dire need to build future bioinformatics platforms that incorporate modern parallel computation techniques (grids, multi-core processors, GPUs, FPGAs). In this work, we investigate FPGA-based acceleration of these applications using High-Level Synthesis (HLS). HLS tools enable automatic translation of abstract specifications into a hardware design, considerably reducing design effort. However, generating efficient hardware with these tools is often a challenge for designers. Our research effort encompasses an exploration of the techniques and practices that can lead to the generation of an efficient design from these high-level synthesis tools. We illustrate our methodology by accelerating HMMER, an application widely used in the bioinformatics community. HMMER is well known for its compute-intensive kernels and for data dependencies that force a sequential execution. We propose an original parallelization scheme based on a rewriting of its mathematical formulation, followed by an in-depth exploration of hardware mapping techniques for these kernels, and finally show on-board acceleration results.
Our work demonstrates the design of flexible hardware accelerators for bioinformatics applications, using design methodologies that are more efficient than traditional ones and producing designs scalable enough to meet future requirements, as well as being easier to retarget from one platform to another.
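HMMER's compute-intensive kernel is a profile-HMM dynamic-programming recurrence. For orientation, the standard log-space Viterbi recurrences for a profile HMM are sketched below (not the thesis's rewritten formulation), where $M$, $I$ and $D$ are the match, insert and delete scores, $e$ the emission scores and $t$ the transition scores. The $D_j(i)$ term depends on $D_{j-1}(i)$ within the same row, which is the kind of data dependency that forces sequential execution and motivates the mathematical rewriting:

```latex
\begin{aligned}
M_j(i) &= e_{M_j}(x_i) + \max\bigl\{\, M_{j-1}(i-1) + t_{M_{j-1}M_j},\;
                                       I_{j-1}(i-1) + t_{I_{j-1}M_j},\;
                                       D_{j-1}(i-1) + t_{D_{j-1}M_j} \,\bigr\} \\
I_j(i) &= e_{I_j}(x_i) + \max\bigl\{\, M_j(i-1) + t_{M_j I_j},\;
                                       I_j(i-1) + t_{I_j I_j} \,\bigr\} \\
D_j(i) &= \max\bigl\{\, M_{j-1}(i) + t_{M_{j-1}D_j},\;
                        D_{j-1}(i) + t_{D_{j-1}D_j} \,\bigr\}
\end{aligned}
```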