About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations (NDLTD). Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
1

Hardware accelerator for the JPEG encoder on the Xilinx Spartan-3 FPGA

Zheng, Feng, M.S. in Engineering, 21 February 2011
The report on the hardware accelerator for the JPEG encoder is organized into three sections. First, it reviews the Joint Photographic Experts Group (JPEG) encoding and decoding standard. Second, it reviews three different hardware implementations of the discrete cosine transform, a very computationally intensive element of the JPEG encoding process; the analysis covers the benefits and costs of the various approaches for Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC) implementations. Finally, it discusses a specific hardware accelerator design for the color space transformation of the standard JPEG encoder. An eight-by-eight matrix of red, green, blue (RGB) values is processed both on the FPGA and in software; the YCbCr results of the hardware accelerator are compared with those of the software implementation for computational accuracy, and computation times are sampled for comparison. The hardware accelerator shows a clear 38% improvement in speed.
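For reference, the RGB-to-YCbCr transform at the heart of that comparison is a fixed linear mapping. Below is a minimal software sketch using the standard JFIF coefficients; the thesis's fixed-point FPGA arithmetic is not reproduced here.

```python
import numpy as np

def rgb_to_ycbcr(block):
    """Convert an 8x8x3 uint8 RGB block to YCbCr (standard JFIF definition)."""
    rgb = block.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299  * r + 0.587  * g + 0.114  * b
    cb = -0.1687 * r - 0.3313 * g + 0.5    * b + 128.0
    cr =  0.5    * r - 0.4187 * g - 0.0813 * b + 128.0
    return np.stack([y, cb, cr], axis=-1).round().clip(0, 255).astype(np.uint8)

# One 8x8 block of random pixels, matching the block size used in the report
block = np.random.randint(0, 256, size=(8, 8, 3), dtype=np.uint8)
ycbcr = rgb_to_ycbcr(block)
```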
2

Compiler for a Trace-Based Deep Neural Network Accelerator

Andre Xian Ming Chang, 12 October 2021
Deep Neural Networks (DNNs) are the algorithm of choice for various applications that require modeling large datasets, such as image classification, object detection and natural language processing. DNNs present highly parallel workloads that lead to the need for custom hardware accelerators. Deep Learning (DL) models specialized for different tasks require programmable custom hardware, and a compiler to efficiently translate various DNNs into an efficient dataflow to be executed on the accelerator. Given a DNN-oriented custom instruction set, several compilation phases are needed to generate efficient code while maintaining the generality to support many models. Different compilation phases need different levels of hardware awareness so that they exploit the hardware's full potential while abiding by its constraints. The goal of this work is to present a compiler workflow and its hardware-aware optimization passes for a custom DNN hardware accelerator. The compiler uses model definition files created by popular frameworks to generate custom instructions. Different levels of hardware-aware code optimization are applied to improve performance and data reuse. The software also exposes an interface to run the accelerator implemented on various FPGA platforms, providing an end-to-end solution.
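To make the lowering step concrete, here is a hedged sketch of what a hardware-aware compilation pass might look like. The layer dictionary format, instruction mnemonics, and on-chip buffer size are illustrative assumptions, not the thesis's actual ISA or compiler.

```python
# Hypothetical illustration of a hardware-aware lowering pass: the layer
# schema, instruction mnemonics, and buffer size below are assumptions,
# not the accelerator's real instruction set.
ON_CHIP_BUFFER = 512 * 1024  # bytes of on-chip memory assumed available

def lower(layers):
    """Translate a framework-level layer list into accelerator instructions."""
    program = []
    for layer in layers:
        if layer["type"] == "conv":
            # Tile output channels so weights fit in on-chip memory (data reuse)
            weight_bytes = layer["kernel"] ** 2 * layer["in_channels"] * 2
            tile = min(layer["out_channels"], ON_CHIP_BUFFER // weight_bytes)
            program.append(("LOAD_W", layer["name"], tile))
            program.append(("CONV", layer["name"], tile))
        elif layer["type"] == "relu":
            # Fuse the elementwise op into the preceding compute instruction
            program[-1] = program[-1] + ("RELU",)
    return program

prog = lower([
    {"type": "conv", "name": "conv1", "in_channels": 3,
     "out_channels": 64, "kernel": 3},
    {"type": "relu"},
])
```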
3

Hardware Accelerator for MIMO Wireless Systems

Bhagawat, Pankaj, December 2011
The ever-increasing demand for higher data rates and better Quality of Service (QoS) for a growing number of users requires new transceiver algorithms and architectures to better exploit the available spectrum and to efficiently counter the impairments of the radio channel. Multiple-Input Multiple-Output (MIMO) communication systems employ multiple antennas at both the transmitter and the receiver to meet the requirements of next-generation wireless systems. MIMO is a promising technology that provides increased data rates without an equivalent increase in spectral requirements. However, the practical implementation of MIMO detectors poses a significant challenge and has been consistently identified as the major bottleneck to realizing the full potential that multiple-antenna systems promise. Furthermore, to make judicious use of the available bandwidth, the baseband units have to adapt dynamically to different modes of operation (modulation schemes, code rates, etc.). Flexibility and high-throughput requirements often place conflicting demands on the Very Large Scale Integration (VLSI) system designer. The major focus of this dissertation is to present efficient VLSI architectures for configurable MIMO detectors that can serve as accelerators, making the realization of next-generation wireless devices feasible. Both hard-output and soft-output detector architectures are considered.
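As a point of reference for the detection problem, a brute-force maximum-likelihood hard-output detector for a toy 2x2 QPSK link can be sketched in a few lines. Real detectors, including the architectures in this dissertation, avoid this exhaustive search, whose cost grows exponentially with antenna count and constellation size.

```python
import itertools
import numpy as np

# Toy hard-output maximum-likelihood detector for a 2x2 QPSK MIMO link.
QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def ml_detect(y, H):
    """Return the transmit vector s minimizing ||y - H s||^2 by full search."""
    best, best_metric = None, np.inf
    for s in itertools.product(QPSK, repeat=H.shape[1]):
        s = np.array(s)
        metric = np.linalg.norm(y - H @ s) ** 2
        if metric < best_metric:
            best, best_metric = s, metric
    return best

H = (np.random.randn(2, 2) + 1j * np.random.randn(2, 2)) / np.sqrt(2)
s_tx = QPSK[np.random.randint(0, 4, size=2)]
y = H @ s_tx + 0.05 * (np.random.randn(2) + 1j * np.random.randn(2))
print(ml_detect(y, H), s_tx)
```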
4

A Hardware Interpreter for Sparse Matrix LU Factorization

Syed, Akber, 16 September 2002
No description available.
5

FPGA-Roofline: An Insightful Model for FPGA-based Hardware Acceleration in Modern Embedded Systems

Pahlavan Yali, Moein, 17 January 2015
The quick growth of embedded systems and their increasing computing power has made them suitable for a wider range of applications. Despite the increasing performance of modern embedded processors, they are outpaced by the computational demands of the growing number of modern applications. This trend has led to the emergence of hardware accelerators in embedded systems. While the processing power of dedicated hardware modules is appealing, they require significant development and integration effort to yield a performance benefit. It is therefore prudent to investigate and estimate the integration overhead, and consequently the hardware acceleration benefit, before committing to an implementation. In this work, we present FPGA-Roofline, a visual model that offers insights to designers and developers so they can set realistic expectations for their system, and that enables them to do their design and analysis in a faster and more efficient fashion. FPGA-Roofline allows simultaneous analysis of communication and computation resources in FPGA-based hardware accelerators. To demonstrate the effectiveness of our model, we have implemented hardware accelerators on an FPGA and used our model to analyze and optimize the overall system performance. We show how the same methodology can be applied to the design process of any FPGA-based hardware accelerator to increase productivity and give insights to improve performance and resource utilization by finding the optimal operating point of the system. / Master of Science
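The underlying roofline arithmetic is compact enough to sketch: attainable throughput is the lesser of peak compute and the rate at which the communication channel can feed the accelerator. The peak figures below are placeholders for illustration, not numbers from the thesis.

```python
# Roofline-style bound under assumed peak figures.
PEAK_OPS = 50e9  # ops/s the FPGA fabric can sustain (assumed)
PEAK_BW = 2e9    # bytes/s over the host-accelerator link (assumed)

def attainable(ops_per_byte):
    """Upper bound on throughput for a given operational intensity."""
    return min(PEAK_OPS, PEAK_BW * ops_per_byte)

for intensity in (1, 8, 25, 100):  # ops per byte transferred
    print(f"{intensity:>4} ops/byte -> {attainable(intensity) / 1e9:.1f} Gops/s")
```

Below the ridge point (here 25 ops/byte), the system is communication-bound and adding compute resources cannot help; above it, the fabric itself is the limit.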
6

Architecture and Programming Model Support for Reconfigurable Accelerators in Multi-Core Embedded Systems

Das, Satyajit, 04 June 2018
Emerging trends in embedded systems and applications call for high throughput and low power consumption. Due to the increasing demand for low-power computing and the diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient hardware accelerators. The main drawback of hardware accelerators is that they are not programmable: each performs one specific function, so its utilization can be low, and increasing the number of accelerators in a system on chip (SoC) causes scalability and interconnection issues. Programmable accelerators provide flexibility and solve the scalability issues. A Coarse-Grained Reconfigurable Array (CGRA) architecture, consisting of several processing elements with word-level granularity, is a promising choice for a programmable accelerator. Inspired by these characteristics, this thesis studies the potential of CGRAs in near-threshold computing platforms and develops an end-to-end CGRA research framework. The major contributions of this framework are CGRA design, implementation, integration into a computing system, and compilation for CGRAs. First, the design and implementation of a CGRA named the Integrated Programmable Array (IPA) is presented. Next, the problem of mapping applications with control and data flow onto a CGRA is formulated. From this formulation, several efficient algorithms are developed that use the internal resources of the CGRA, with a vision for low-power acceleration. The algorithms are integrated into an automated compilation flow. Finally, the IPA accelerator is integrated into PULP, a Parallel Ultra-Low-Power Processing Platform, to explore the benefit of this kind of ultra-low-power heterogeneous platform.
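One small, standard piece of the CGRA mapping problem gives the flavor of such a compilation flow: the resource-limited lower bound on the initiation interval when modulo-scheduling a loop onto the array. The 4x4 array size is an assumption for illustration; the IPA's actual mapping algorithms are considerably more involved.

```python
import math

# Resource-constrained minimum initiation interval (ResMII): when
# modulo-scheduling a loop's dataflow graph onto a CGRA, a new iteration
# can start at best every ResMII cycles, since each processing element
# executes one operation per cycle. Array size is assumed, not the IPA's.
def res_mii(num_ops, rows=4, cols=4):
    return math.ceil(num_ops / (rows * cols))

# A loop body with 37 operations on a 4x4 array needs at least II = 3
print(res_mii(37))  # -> 3
```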
7

FPGA-Based Co-processor for Singular Value Array Reconciliation Tomography

Coyne, Jack W, 05 September 2007
"This thesis describes a co-processor system that has been designed to accelerate computations associated with Singular Value Array Reconciliation Tomography (SART), a method for locating a wide-band RF source which may be positioned within an indoor environment, where RF propagation characteristics make source localization very challenging. The co-processor system is based on field programmable gate array (FPGA) technology, which offers a low-cost alternative to customized integrated circuits, while still providing the high performance, low power, and small size associated with a custom integrated solution. The system has been developed in VHDL, and implemented on a Virtex-4 SX55 FPGA development platform. The system is easy to use, and may be accessed through a C program or MATLAB script. Compared to a Pentium 4 CPU running at 3 GHz, use of the co-processor system provides a speed-up of about 6 times for the current signal matrix size of 128-by-16. Greater speed-ups may be obtained by using multiple devices in parallel. The system is capable of computing the SART metric to an accuracy of about -145 dB with respect to its true value. This level of accuracy, which is shown to be better than that obtained using single precision floating point arithmetic, allows even relatively weak signals to make a meaningful contribution to the final SART solution."
8

Hardware Acceleration of Deep Convolutional Neural Networks on FPGA

January 2018
The rapid improvement in computation capability has made deep convolutional neural networks (CNNs) a great success in recent years on many computer vision tasks, with significantly improved accuracy. During the inference phase, many applications demand low-latency processing of one image under strict power-consumption requirements, which reduces the efficiency of GPUs and other general-purpose platforms and brings opportunities for specific acceleration hardware, e.g. FPGAs, by customizing the digital circuit specifically for deep learning inference. However, deploying CNNs on portable and embedded systems is still challenging due to large data volume, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference process of various CNN algorithms on FPGA hardware with high performance, efficiency and flexibility. As convolution contributes most of the operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply-and-accumulate (MAC) operations with four levels of loops. Without fully studying convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. An efficient dataflow and hardware architecture for CNN acceleration are proposed to minimize data communication while maximizing resource utilization to achieve high performance. Although great performance and efficiency can be achieved by customizing the FPGA hardware for each CNN model, significant effort and expertise are required, leading to long development times, which makes it difficult to keep up with the rapid development of CNN algorithms. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable fast high-level prototyping of CNNs from software to FPGA while keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of the physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule, so that even highly irregular and complex network topologies, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs, achieving state-of-the-art performance. Based on the optimized acceleration strategy, there are still many design options, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth, which impact the utilization of computation resources and data-communication efficiency, and finally affect the performance and energy consumption of the accelerator. The large design space of the accelerator makes it impractical to explore the optimal design choice during the real implementation phase. Therefore, a performance model is proposed in this work to quantitatively estimate the accelerator's performance and resource utilization. By this means, the performance bottleneck and design bounds can be identified, and the optimal design option can be explored early in the design phase.
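The convolution loop levels the dissertation analyzes can be written out directly; hardware loop optimization amounts to reordering, tiling, and unrolling this nest. A naive software reference, as a sketch:

```python
import numpy as np

# Naive convolution loop nest: output channels, input channels, output
# pixels, and the kernel window. Each innermost step is one MAC; an
# accelerator's tiling/unrolling choices partition exactly these loops.
def conv_layer(x, w):
    """x: (C_in, H, W) input; w: (C_out, C_in, K, K) weights; stride 1, no pad."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    y = np.zeros((c_out, h - k + 1, wd - k + 1))
    for oc in range(c_out):                  # output feature maps
        for ic in range(c_in):               # input feature maps
            for oy in range(h - k + 1):      # output pixels
                for ox in range(wd - k + 1):
                    for ky in range(k):      # kernel window (MACs)
                        for kx in range(k):
                            y[oc, oy, ox] += x[ic, oy + ky, ox + kx] * w[oc, ic, ky, kx]
    return y

y = conv_layer(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
```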
9

HAALO: A cloud-native hardware accelerator abstraction with low overhead

Facchetti, Jeremy, January 2019
With the upcoming 5G deployment and the exponentially increasing data transmitted over cellular networks, off-the-shelf hardware won't provide enough performance to cope with the data being transferred. To tackle that problem, hardware accelerators will be of great support thanks to their better performance and lower energy consumption. However, hardware accelerators are not a silver bullet, as their very nature prevents them from being as flexible as CPUs. Hardware accelerator integration into Kubernetes and Docker, respectively the most-used tools for orchestration and containerization, is still not as flexible as it needs to be. In this thesis, we developed a framework that allows a more flexible integration of these accelerators into a Kubernetes cluster using Docker containers, making use of an abstraction layer instead of the classic virtualization process. Our results compare the performance of an execution with and without the framework developed during this thesis. We found that the framework's overhead depends on the size of the data being processed by the accelerator but remains a small percentage of the total execution time. The framework provides an abstraction for hardware accelerators and thus an easy way to integrate hardware-accelerated applications into a heterogeneous cluster, or even across clusters with different hardware accelerator types. It also moves the hardware-specific parts of an accelerated program from the containers to the infrastructure, enabling a new kind of service: OpenCL as a service.
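To see what such an abstraction hides, consider a minimal host-side OpenCL round trip, sketched with pyopencl. The kernel and setup are generic OpenCL, not HAALO's interface; this is the device-specific boilerplate that an abstraction layer can lift out of application containers into the infrastructure.

```python
import numpy as np
import pyopencl as cl

# Generic host-side OpenCL round trip (not HAALO's API): context and
# queue creation, kernel build, buffer transfer, launch, and read-back.
ctx = cl.create_some_context()     # picks an available OpenCL device
queue = cl.CommandQueue(ctx)

src = """
__kernel void scale(__global float *buf, const float factor) {
    int i = get_global_id(0);
    buf[i] *= factor;
}
"""
prg = cl.Program(ctx, src).build()

data = np.arange(16, dtype=np.float32)
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=data)
prg.scale(queue, data.shape, None, buf, np.float32(2.0))
cl.enqueue_copy(queue, data, buf)
print(data)  # each element doubled on the accelerator
```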
10

Many-core architecture for programmable hardware accelerator

Lee, Junghee, 13 January 2014
As the further development of single-core architectures faces seemingly insurmountable physical and technological limitations, computer designers have turned their attention to alternative approaches. One such promising alternative is the use of several smaller cores working in unison as a programmable hardware accelerator. It is clear that the vast, and as yet largely untapped, potential of hardware accelerators is coming to the forefront of computer architecture. Many challenges must be addressed for the programmable hardware accelerator to be realized in practice. This thesis studies load balancing, on-chip communication, and an execution model. Imbalanced distribution of workloads across the processing elements constitutes wasteful use of resources and degrades system performance. This thesis proposes a hardware-based load-balancing technique that is demonstrated to be more scalable than state-of-the-art load-balancing techniques. To facilitate efficient communication among an ever-increasing number of cores, a scalable communication network is imperative. Packet-switched networks-on-chip (NoCs) are considered a viable candidate for a scalable communication fabric. The size of the flit, the unit of flow control in a NoC, is one of the important design parameters that determine the latency, throughput and cost of NoC routers. This thesis studies how to determine an optimal flit size and proposes a novel router architecture that overcomes a problem related to flit size. The thesis also includes a new execution model and its supporting architecture: an event-driven model, an extension of hardware description languages, is employed as the execution model, and the dynamic scheduling and module-level prefetching that support it are evaluated.
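The flit-size trade-off can be illustrated with back-of-the-envelope wormhole-switching numbers: smaller flits mean cheaper router datapaths but more flits, and hence more cycles, per packet. The packet size, hop count, and per-hop delay below are assumptions for illustration, not figures from the thesis.

```python
import math

# Serialization latency of one packet under wormhole switching: the head
# flit pays the per-hop router latency once, and the remaining flits
# stream behind it, one per cycle. All constants are assumed values.
PACKET_BITS = 512
HOPS = 4
ROUTER_DELAY = 2  # pipeline cycles per hop (assumed)

for flit_bits in (32, 64, 128, 256):
    flits = math.ceil(PACKET_BITS / flit_bits)
    latency = HOPS * ROUTER_DELAY + (flits - 1)
    print(f"flit={flit_bits:>3} bits: {flits:>2} flits, ~{latency} cycles")
```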
