1371 |
Ein Branch&Bound-Ansatz zur Verdrahtung von Field Programmable Gate-Arrays
Möhrke, Ulrich, Herrmann, Paul, Steffen, M., Spruth, Wilhelm G. 15 July 2019 (has links)
Most FPGA architectures cannot be routed with the tools that originate from ASIC design, such as channel routers. Fully automatic routing with optimal signal delays can only be achieved if, for a given placement, the wiring is adapted to the technological characteristics of the device, which differ considerably from those of ASICs. Within the joint project "FPGA Entwurfssystem" funded by the Deutsche Forschungsgemeinschaft (DFG), in which the University of Leipzig, the University of Tübingen, and the Technical University of Munich participate, methods for efficient, high-quality routing of FPGA devices were developed at the Chair of Computer Systems (Prof. W.G. Spruth) of the Institute of Computer Science at the University of Leipzig. The routing problem for FPGAs is described and a solution approach based on the branch-and-bound method is presented. Results in the form of program run times, critical-path length, and number of examined search nodes for a large number of circuit variants are presented in tables and document a significant shortening of the longest paths compared with the Xilinx place-and-route tool. Finally, open problems and further work are discussed.
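For illustration, the sketch below shows the general shape of a branch-and-bound search for a delay-optimal route through a routing-resource graph. It is a simplified, assumed model, not the algorithm developed in the project; the graph, delay values, and lower bound are all hypothetical.

```python
# Illustrative branch-and-bound search for a delay-optimal route through a
# routing-resource graph. Branches whose accumulated delay plus an optimistic
# remaining-delay bound cannot beat the best complete route are pruned.

def branch_and_bound_route(graph, delays, source, sink, lower_bound):
    """Return (delay, path) of the fastest source->sink route.

    graph       : dict node -> list of neighbouring routing resources
    delays      : dict node -> intrinsic delay of that resource
    lower_bound : dict node -> optimistic remaining delay to the sink
    """
    best_delay, best_path = float("inf"), None
    stack = [(delays[source], [source])]          # partial routes to expand

    while stack:
        cost, path = stack.pop()
        node = path[-1]
        if node == sink:                          # complete route found
            if cost < best_delay:
                best_delay, best_path = cost, path
            continue
        for nxt in graph[node]:
            if nxt in path:                       # avoid cycles
                continue
            new_cost = cost + delays[nxt]
            # bound step: prune branches that cannot improve on the incumbent
            if new_cost + lower_bound[nxt] >= best_delay:
                continue
            stack.append((new_cost, path + [nxt]))
    return best_delay, best_path


# Toy routing-resource graph: switch boxes A..D between a source and a sink
graph = {"SRC": ["A", "B"], "A": ["C"], "B": ["C", "D"], "C": ["SINK"],
         "D": ["SINK"], "SINK": []}
delays = {"SRC": 0.1, "A": 0.5, "B": 0.3, "C": 0.4, "D": 0.9, "SINK": 0.1}
bound = {n: 0.0 for n in graph}                   # trivial (always admissible) bound
print(branch_and_bound_route(graph, delays, "SRC", "SINK", bound))
```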
|
1372 |
Neuronale Netze als Modell Boolescher Funktionen
Kohut, Roman 30 May 2007 (has links)
This thesis investigates how Boolean functions can be represented by neural networks and develops a new type of Boolean Neural Network (BNN). The basic element of Boolean Neural Networks is a novel Boolean Neuron (BN) which, in contrast to the classical neuron, operates directly on Boolean signals and uses only Boolean operations. A sequential training algorithm was developed for the BNN that guarantees fast convergence and therefore requires a short training time. This training algorithm forms the basis of a newly created method for the architecture synthesis of BNNs. The developed training moreover constitutes a special decomposition procedure for Boolean functions. Neural networks can be realized both in software and in hardware. The very high cost of hardware realizations of conventional neural networks is reduced substantially by using BNs and BNNs: the number of CLBs (configurable logic blocks) required to realize a neuron is reduced by two orders of magnitude, since a Boolean Neuron is mapped directly onto a single LUT (lookup table). The BNN training algorithm was adapted for this very compact mapping of BNNs onto an FPGA structure. Specifying the BNNs with UML models and applying MDA technology for hardware/software co-design significantly reduced the synthesis effort for the hardware realization of BNNs.
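A minimal sketch of the underlying idea follows: a neuron whose behaviour over k Boolean inputs is fully described by a 2^k-entry truth table maps directly onto one k-input LUT. The class, its naive "store the target" learning rule, and the majority-function example are illustrative assumptions, not the training algorithm from the thesis.

```python
# A Boolean neuron viewed as a k-input lookup table: one forward pass is a
# single table lookup, which is exactly what an FPGA LUT implements.

class BooleanNeuron:
    def __init__(self, num_inputs):
        self.k = num_inputs
        self.table = [0] * (2 ** num_inputs)       # LUT contents, initially 0

    def _index(self, inputs):
        # pack the Boolean input vector into a LUT address
        idx = 0
        for bit in inputs:
            idx = (idx << 1) | (1 if bit else 0)
        return idx

    def forward(self, inputs):
        return self.table[self._index(inputs)]     # purely Boolean: one lookup

    def learn(self, inputs, target):
        # naive illustrative rule: store the desired output for this pattern
        self.table[self._index(inputs)] = 1 if target else 0


# Example: teach a 3-input neuron the majority function
bn = BooleanNeuron(3)
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            bn.learn((a, b, c), a + b + c >= 2)
assert bn.forward((1, 0, 1)) == 1
```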
|
1373 |
Discrete Fractional Clock Generation for Systems-on-FPGA
Preußer, Thomas B., Köhler, Steffen 14 November 2012 (has links)
This article describes an inexpensive way of clock generation for FPGA-based circuit cores, which reduces the number of external clock sources and eases synchronization problems. We introduce a modified version of the BRESENHAM line drawing algorithm and use it outside its original application domain for the rational division of clocks. An optimized hardware design for BRESENHAM-based clock division is presented and the quality of its output is evaluated. The optimal initialization conditions in terms of phase shift and jitter are identified and formally proven. Finally, the complexity characteristics of a generic synthesizable VHDL design based on this algorithm are examined and verified by synthesis examples. Special attention is paid to implementation results in conjunction with different FPGA families.
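The following behavioural sketch (in Python, for illustration only) shows how the Bresenham error accumulator can be reused for rational clock division: it emits `num` output pulses per `den` input cycles, spread as evenly as an integer accumulator allows. The parameter names and the zero initial error are assumptions; the article derives the optimal initialisation formally.

```python
# Bresenham-style rational clock division: the error term decides on which
# input cycles an output pulse is produced (requires num <= den).

def bresenham_divider(num, den, cycles, err0=0):
    err = err0                      # accumulator, analogous to Bresenham's error term
    for _ in range(cycles):
        err += num
        if err >= den:              # accumulator overflow -> emit output pulse
            err -= den
            yield 1
        else:
            yield 0


# Example: derive a 37 MHz enable from a 100 MHz clock -> 37 pulses per 100 cycles
pattern = list(bresenham_divider(37, 100, 100))
assert sum(pattern) == 37
```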
|
1374 |
High-performance communication infrastructure design on FPGA-centric clusters
Yang, Chen 29 September 2019 (has links)
FPGA-Centric Clusters (FCCs) with the FPGAs directly linked through their Multi-Gigabit Transceivers (MGTs) have a proven advantage over other commodity architectures for communication bound applications. To date, however, communication infrastructure for such clusters has generally only taken one of two simple approaches: nearest-neighbor-only, which is fast but of limited utility, and processor-based, which is general but slow. The overall problem addressed in this dissertation is the architecture, design, and implementation of communication networks for FCCs. These network designs should take advantage of the decades of design experience in networks for High-Performance Computing (HPC) clusters, but should also account for, and take advantage of, unique characteristics of FCCs, in particular, the configurability of the FPGAs themselves.
This dissertation has seven parts. We begin with in-depth implementations of two model applications: Directional Dark Matter (DM) Detection and Molecular Dynamics (MD). These implementations expose the necessary characteristics of FCC networks from the physical layer through the application layer.
The second is the systematic exploration of communication microarchitecture for FCCs, as has been done previously for HPC clusters and for Networks on Chips (NoCs) on both FPGAs and ASICs. One outcome of this part is to find the properties of FCCs that substantially influence the router design space. Another outcome is to create a selection of candidate routers and generalize it so that it is parameterized by routing algorithm, arbitration policy, number of virtual channels (VCs), and other parameters.
The third part is to use the proposed application-aware framework to evaluate the resulting design space with respect to a number of common communication patterns and packet sizes. The results from this part enable two sets of designs. One is the selection of an optimal router for a given resource budget that accounts for all the workloads. The other is to take advantage of FPGA reconfigurability to select the optimal router accounting for both resource budget and a particular workload.
The fourth part is to evaluate the advantages of this approach of adapting the router design to the application. We find that the optimality of the router design varies significantly with workloads. We observe that compared with the router configuration with the best average performance, application-aware router selection can lead to substantial improvement in performance or reduction in resources required.
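The selection step described in the third and fourth parts can be pictured with a small sketch: given a table of candidate router configurations with their resource cost and measured latency per workload, pick the best configuration that fits a resource budget, either averaged over all workloads or for one particular workload. The candidate table and numbers below are hypothetical.

```python
candidates = {
    # name: (LUT cost, {workload: average packet latency in cycles})
    "vc2_rr":  (4200, {"nearest_neighbour": 30, "all_to_all": 95, "bitreverse": 70}),
    "vc4_rr":  (6800, {"nearest_neighbour": 28, "all_to_all": 74, "bitreverse": 60}),
    "vc4_age": (7500, {"nearest_neighbour": 27, "all_to_all": 69, "bitreverse": 58}),
}

def select_router(budget, workload=None):
    feasible = {name: lat for name, (cost, lat) in candidates.items()
                if cost <= budget}
    if workload is None:                       # best average over all workloads
        score = lambda lat: sum(lat.values()) / len(lat)
    else:                                      # application-aware selection
        score = lambda lat: lat[workload]
    return min(feasible, key=lambda name: score(feasible[name]))

print(select_router(budget=7000))                         # best all-round router
print(select_router(budget=7000, workload="all_to_all"))  # workload-specific pick
```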
The fifth part covers application-specific optimizations, in which we develop several modules and functional units that provide targeted optimizations for certain types of communication workloads, depending on the application the network is going to serve.
The sixth part explores topology emulation, e.g., when a three-dimensional network is used in the computation of an application that is logically two-dimensional. We propose a generalized fold-and-cut mechanism that preserves locality in the logical mapping while also making use of the extra links provided by our 3D-torus fixture.
The seventh part is a table-based static-scheduled router for applications with a static or persistent communication pattern. The router supports various cases, including unicast, multicast, and reduction. By making routing decisions a priori, we can bring better load-balance to network links and reduce congestion.
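The essence of such a table-based, statically scheduled router can be sketched as follows: each (input port, flow) entry is computed offline and lists the output ports the packet is copied to, so unicast, multicast, and reduction traffic all reduce to a table lookup at run time. The port names, flow identifiers, and table contents below are hypothetical, not the router's actual interface.

```python
routing_table = {
    # (input_port, flow_id): [output ports]
    ("west", 0): ["east"],                  # unicast pass-through
    ("local", 1): ["north", "east", "up"],  # multicast to three neighbours
    ("south", 2): ["local"],                # final hop of a reduction tree
}

def route(input_port, flow_id):
    # no arbitration shown: the offline schedule already avoids link conflicts
    return routing_table[(input_port, flow_id)]

assert route("local", 1) == ["north", "east", "up"]
```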
|
1375 |
Convolutional Neural Networks on FPGA and GPU on the Edge: A Comparison
Pettersson, Linus January 2020 (has links)
When implementing a neural network application, the decision about which hardware platform to use may not always be easily made. This thesis studies several relevant platforms with regard to performance, power efficiency, and usability, with the purpose of providing a basis for such decisions. The hardware platforms studied are a GPU, an FPGA, and a CPU. The project implements Convolutional Neural Networks (CNNs) on the different hardware platforms using several tools and frameworks. The final implementation uses BNN-PYNQ for the FPGA and CPU, which provides ready-to-run code and overlays for quantized CNNs and fully connected neural networks. These networks are then replicated in TensorFlow and optimized to FP32, FP16, and INT8 precision using TensorRT for use on the GPU. The results indicate that, with regard to inference speed, the FPGA outperforms the GPU by a factor of 100 for the CNNs and by a factor of 1000 for the fully connected networks. For power efficiency, the FPGA again outperforms the GPU. The thesis concludes that for a neural network application, an FPGA is preferred if performance is the priority. However, the GPU proved to have greater ease of use due to the many tools and frameworks available; if easy implementation and high design flexibility are the priority, a GPU is recommended instead.
|
1376 |
Machine Learning for Space Applications on Embedded Systems
Dengel, Ric January 2021 (has links)
As space missions continue to increase in complexity, the operational capabilities and the amount of gathered data demand ever more advanced systems. Currently, mission capabilities are often constrained by the link bandwidth as well as by the onboard processing capabilities. A large number of commands and complex ground station systems are required to operate a spacecraft. Thus, methods that allow more efficient use of the bandwidth and computing capacity and that increase autonomous capabilities are of strong research interest. Artificial Intelligence (AI), with its vast range of application scenarios, allows these challenges and more to be tackled in the spacecraft design. In particular, the flexibility of Artificial Neural Networks as a Machine Learning technology provides many possibilities; for example, Artificial Neural Networks can be used for object detection and classification tasks. Unfortunately, the execution of current Machine Learning algorithms consumes a large amount of power and memory resources. Additionally, the qualification of such algorithms remains challenging, which limits their possible applications in space systems. Thus, an increase in efficiency in all aspects is required to further enable these technologies for space applications. Optimising the algorithm for System on Chip (SoC) platforms allows it to benefit from the best of a generic processor and of hardware acceleration. This increased complexity of the processing system shall allow broader and more flexible applications of these technologies with a minimal increase in power consumption. As commercial off-the-shelf embedded systems are commonly used in NewSpace applications and such SoCs are not yet available in a qualified form, the deployment of Machine Learning algorithms on such devices was evaluated. For this purpose, a Convolutional Neural Network model was optimised on a workstation. The neural network was then deployed with Xilinx's Vitis AI onto an SoC that combines a powerful generic processor with the hardware programming capabilities of a Field Programmable Gate Array (FPGA). This result was evaluated based on relevant performance and efficiency parameters, and a summary is given in this thesis. Additionally, a tool following a different approach was developed: with a high-level synthesis tool, the hardware description language of an accelerated-linear-algebra-optimised network is generated and deployed directly into FPGA logic. The implementation of this tool was started, and a proof of concept is presented. Furthermore, existing challenges with the auto-generated code are discussed, and future steps to automate and improve the entire workflow are presented. As the two workflows are very different and thus aim at different usage scenarios, both are described and their respective benefits and disadvantages are outlined.
|
1377 |
Fault-Tolerant Nostrum NoC on FPGA for the ForSyDe/NoC System Generator Tool Suite
Gkalea, Salvator January 2014 (has links)
Moore's law is the observation that transistor density increases over the years, allowing billions of transistors to be integrated on a single chip. Over the last two decades, Moore's law has enabled the implementation of complex systems on a single chip (SoCs). The challenge of the System-on-Chip (SoC) era was the demand for an efficient communication mechanism between the growing number of processing cores on the chip. The outcome was a new interconnection scheme (among others, such as crossbars, rings, and buses) based on telecommunication networks, and the Network-on-Chip (NoC) appeared on the scene. The NoC has been developed not only to support systems embedded in a single processor, but also to support a set of processors embedded on a single chip. Thus the Multi-Processor System-on-Chip (MPSoC) has arisen, which incorporates processing elements, memories, and I/O with a fixed interconnection infrastructure in a completely integrated system. In such systems, the NoC constitutes the backbone of the communication architecture that targets future SoCs composed of hundreds of processing elements. At the same time, progress in deep sub-micron technology has brought drawbacks. The communication efficiency and the reliability of these systems rely on the proper functionality of the NoC for on-chip data communication. A NoC must deal with the susceptibility of transistors to failure, which creates the demand for a fault-tolerant communication infrastructure: a mechanism that can deal with the different classes of faults (transient, intermittent, and permanent [11]) which can occur in the communication network. In this thesis, different algorithms that implement fault-tolerant techniques for permanent faults in the NoC are investigated. The outcome is a fault-tolerant mechanism for the NoC System Generator Tool [29], a Network-on-Chip research effort carried out at the Royal Institute of Technology. The fault-tolerant algorithm implemented in the switch to achieve packet rerouting around faulty communication links is described explicitly.
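As a toy illustration of the idea of rerouting around permanently faulty links, the sketch below recomputes a path on a 2D mesh while avoiding links marked faulty, using a plain breadth-first search. It is only an assumed model of the concept, not the rerouting algorithm implemented in the Nostrum switch.

```python
from collections import deque

def reroute(size, faulty_links, src, dst):
    """Shortest path on a size x size mesh avoiding faulty (a, b) links."""
    def neighbours(node):
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < size and 0 <= ny < size:
                yield (nx, ny)

    queue, prev = deque([src]), {src: None}
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:            # reconstruct the route
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nb in neighbours(node):
            link = frozenset((node, nb))
            if nb not in prev and link not in faulty_links:
                prev[nb] = node
                queue.append(nb)
    return None                                # destination unreachable


faults = {frozenset(((1, 0), (1, 1)))}         # one broken link in the mesh
print(reroute(3, faults, (0, 0), (2, 2)))
```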
|
1378 |
High-Speed Image Classification for Resource-Limited Systems Using Binary Values
Simons, Taylor Scott 16 June 2021 (has links)
Image classification is a memory- and compute-intensive task. It is difficult to implement high-speed image classification algorithms on resource-limited systems like FPGAs and embedded computers. Most image classification algorithms require many fixed- and/or floating-point operations and values. In this work, we explore the use of binary values to reduce the memory and compute requirements of image classification algorithms. Our objective was to implement these algorithms on resource-limited systems while maintaining comparable accuracy and high speeds. By implementing high-speed image classification algorithms on resource-limited systems like embedded computers, FPGAs, and ASICs, automated visual inspection can be performed on small low-powered systems. Industries like manufacturing, medicine, and agriculture can benefit from compact, high-speed, low-power visual inspection systems. Tasks like defect detection in manufactured products and quality sorting of harvested produce can be performed cheaper and more quickly. In this work, we present ECO Jet Features, an algorithm adapted to use binary values for visual inspection. The ECO Jet Features algorithm ran 3.7x faster than the original ECO Features algorithm on embedded computers. It also allowed the algorithm to be implemented on an FPGA, achieving 78x speedup over full-sized desktop systems, using a fraction of the power and space. We reviewed Binarized Neural Nets (BNNs), neural networks that use binary values for weights and activations. These networks are particularly well suited for FPGA implementation and we compared and contrasted various FPGA implementations found throughout the literature. Finally, we combined the deep learning methods used in BNNs with the efficiency of Jet Features to make Neural Jet Features. Neural Jet Features are binarized convolutional layers that are learned through deep learning and learn classic computer vision kernels like the Gaussian and Sobel kernels. These kernels are efficiently computed as a group and their outputs can be reused when forming output channels. They performed just as well as BNN convolutions on visual inspection tasks and are more stable when trained on small models.
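A small sketch of why binary values are so cheap on FPGAs: a binarized dot product over weights and activations in {-1, +1}, stored one bit per element, reduces to XNOR plus popcount. This is a generic illustration of BNN arithmetic, not the ECO Jet Features or Neural Jet Features implementation, and the bit encoding is an assumption.

```python
def binarized_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1, +1} vectors packed as integers
    (bit = 1 encodes +1, bit = 0 encodes -1)."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)    # 1 where the signs agree
    matches = bin(xnor).count("1")                # popcount
    return 2 * matches - n                        # agreements minus disagreements


# Example: two 4-element vectors agreeing in 2 of 4 positions -> dot product 0
a = 0b1011
w = 0b1110
assert binarized_dot(a, w, 4) == 0
```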
|
1379 |
FPGA-Accelerated Digital Signal Processing for UAV Traffic Control Radar
Moody, Kacen Paul 07 April 2021 (has links)
As an extension of previous work done by Luke Newmeyer in his master's thesis \cite{newmeyer2018efficient}, this report presents an improved signal processing chain for efficient, real-time processing of radar data for small-scale UAV traffic control systems. The HDL design described is for a 16-channel, 2-dimensional phased array feed processing chain and includes mean subtraction, windowing, FIR filtering, decimation, spectral estimation via FFT, cross-correlation, and averaging, as well as a significant amount of control and configuration logic. The design runs near the maximum allowable memory bus frequency at 300 MHz and, using AXI DMA engines, can achieve a throughput of 38.3 Gb/s (about 0.25% below the theoretical 38.4 Gb/s), transferring 2 MB of correlation data in about 440 us. This allows for a pulse repetition frequency of nearly 2 kHz, in contrast to 454 Hz for the previous design. The design targets the Avnet UltraZed-EV MPSoC board, which boots custom PetaLinux images. API code and post-processing algorithms run in this environment to interface with the FPGA control registers and further process frames of data. Primary configuration options include variable sample rate, window coefficients, FIR filter coefficients, chirp length, pulse repetition interval, decimation factor, number of averaged frames, error monitoring, three DMA sampling points, and DMA ring buffer transfers. The result is a dynamic, high-speed, small-scale design which can process 16 parallel channels of data in real time for 3-dimensional detection of local UAV traffic at a range of 1000 m.
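A floating-point NumPy sketch of the per-channel chain listed above (mean subtraction, windowing, FIR filtering, decimation, FFT, cross-correlation, frame averaging) is shown below for orientation. Sample counts, filter taps, and the Hann window are placeholders; the actual design implements this in fixed-point FPGA logic across 16 channels.

```python
import numpy as np

def process_channel(samples, fir_taps, decim, window=None):
    x = samples - samples.mean()                     # mean subtraction
    if window is None:
        window = np.hanning(len(x))
    x = x * window                                   # windowing
    x = np.convolve(x, fir_taps, mode="same")        # FIR filtering
    x = x[::decim]                                   # decimation
    return np.fft.fft(x)                             # spectral estimation

def correlation_frame(spectra):
    # cross-correlate every channel pair in the frequency domain
    s = np.array(spectra)                            # (channels, bins)
    return s[:, None, :] * np.conj(s[None, :, :])    # (channels, channels, bins)

rng = np.random.default_rng(0)
taps = np.ones(8) / 8                                # placeholder low-pass FIR
frames = []
for _ in range(4):                                   # average several frames
    chans = [rng.standard_normal(1024) for _ in range(16)]
    spectra = [process_channel(c, taps, decim=4) for c in chans]
    frames.append(correlation_frame(spectra))
avg = np.mean(frames, axis=0)
print(avg.shape)                                     # (16, 16, 256)
```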
|
1380 |
Statistical Method for Extracting Radiation-Induced Multi-Cell Upsets and Anomalies in SRAM-Based FPGAs
Perez Celis, Juan Andres 23 November 2021 (has links)
FPGAs are susceptible to radiation-induced effects that change the data in the configuration memory. These effects can cause the system to malfunction. Triple modular redundancy (TMR) has been used extensively to improve a circuit's cross-section. However, TMR has been shown to be particularly susceptible to radiation effects that affect more than one memory cell, such as Multiple Cell Upsets (MCUs) or micro-Single Event Functional Interrupts (micro-SEFIs). This work describes a statistical technique to extract Multi-Cell Upset (MCU) and micro-SEFI events from raw radiation upset data. The technique uses Poisson statistics to identify patterns in the data and to select the most common ones; the selected patterns are then used to reconstruct MCU events. The results show the distribution of MCUs, micro-SEFIs, and single-bit upsets for several radiation tests, as well as the MCU distribution by the number of bits affected by each event. This work details the process of reconstructing MCU data and of using these data during a fault injection campaign. The results show that MCU fault injection can replicate the failures seen in the radiation test and even induce more failures than were seen in the radiation test. This demonstrates the importance of extracting MCUs from radiation data and using them to evaluate TMR-protected designs.
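A hedged sketch of the kind of Poisson test described here: count how often each relative offset between pairs of upset bits occurs, compare against how often it would occur by chance if all upsets were independent single-bit events, and keep only offsets whose observed count is improbably high. The threshold, memory geometry, and rate estimate below are illustrative assumptions, not the thesis' exact procedure.

```python
from collections import Counter
from math import exp

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), k >= 1."""
    term, cdf = exp(-lam), exp(-lam)
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return 1.0 - cdf

def likely_mcu_offsets(upsets, rows, cols, alpha=1e-6):
    """upsets: list of (row, col) positions of all upset bits from one test."""
    offsets = Counter()
    for i, (r1, c1) in enumerate(upsets):
        for r2, c2 in upsets[i + 1:]:
            offsets[(r2 - r1, c2 - c1)] += 1
    # rough expected number of chance pairs at any one offset, assuming the
    # upsets were independent and uniformly distributed over the memory
    n = len(upsets)
    lam = (n * (n - 1) / 2) / (rows * cols)
    return {off: cnt for off, cnt in offsets.items()
            if poisson_sf(cnt, lam) < alpha}
```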
|