Global ETD Search

11	Dynamically Reconfigurable Systolic Array Accelerators: A Case Study with Extended Kalman Filter and Discrete Wavelet Transform Algorithms Barnes, Robert C 01 May 2009 (has links) Field programmable gate arrays (FPGAs) are increasingly being adopted as the primary on-board computing system for autonomous deep space vehicles. There is a need to support several complex applications for navigation and image processing in a rapidly responsive on-board FPGA-based computer. This requires exploring and combining several design concepts such as systolic arrays, hardware-software partitioning, and partial dynamic reconfiguration. A microprocessor/co-processor design that can accelerate two single-precision floating-point algorithms, an extended Kalman filter and a discrete wavelet transform, is presented. This research makes three key contributions. (i) A polymorphic systolic array framework (PolySAF) comprising reconfigurable partial region-based sockets to accelerate algorithms amenable to being mapped onto linear systolic arrays. When implemented on a low-end Xilinx Virtex4 SX35 FPGA, the design provides a speedup of at least 4.18x and 6.61x over a state-of-the-art microprocessor used in spacecraft systems for the extended Kalman filter and discrete wavelet transform algorithms, respectively. (ii) Switchboxes to enable communication between static and partial reconfigurable regions, and a simple protocol to enable schedule changes when a socket's contents are dynamically reconfigured to alter the concurrency of the participating systolic arrays. (iii) A hybrid partial dynamic reconfiguration method that combines Xilinx early access partial reconfiguration, on-chip bitstream decompression, and bitstream relocation to enable fast scaling of systolic arrays on the PolySAF.
This technique provided a 2.7x improvement in reconfiguration time compared to an off-chip partial reconfiguration technique that used a Flash card on the FPGA board, and a 44% improvement in BRAM usage compared to not using compression. Case Study Extended Kalman Filter Discrete Wavelet Transform Algorithms Computer Engineering
12	Fully efficient pipelined VLSI arrays for solving Toeplitz matrices Lee, Louis Wai-Fung 11 October 1991 (has links) Fully efficient systolic arrays for the solution of Toeplitz matrices using the Schur algorithm [1] have been obtained. By applying the clustering mapping method [2], the complexity of the algorithm is O(n) and it requires n/2 processing elements, as opposed to the n processing elements of designs developed elsewhere [1]. The motivation of this thesis is to obtain efficient pipeline arrays by using the synthesis procedure to implement the Toeplitz matrix solution. Furthermore, we will examine pipeline structures for Toeplitz system factorization and back-substitution by obtaining clustering and Multi-Rate Array structures. These methods reduce the number of processing elements and enhance the computational speed. A comparison with other methods, and the advantages of these methods over them, will be presented. / Graduation date: 1992 Toeplitz matrices Systolic array circuits Array processors
13	FPGA-Based Implementation of QR Decomposition January 2014 (has links) abstract: This thesis report introduces the background of QR decomposition and its application. QR decomposition using Givens rotations is an efficient way to avoid direct matrix inversion when solving the least-squares minimization problem, which is a typical approach for weight calculation in adaptive beamforming. Furthermore, this thesis introduces the Givens rotations algorithm and two general VLSI (very large scale integration) architectures, namely the triangular systolic array and the linear systolic array, for numerical QR decomposition. To fulfill the goal, a 4-input-channel triangular systolic array with a 16-bit fixed-point format and a 5-input-channel linear systolic array are implemented on an FPGA (field-programmable gate array). The final result shows that estimated clock frequencies of 65 MHz and 135 MHz in the post-place-and-route static timing report can be achieved using the Xilinx Virtex 6 xc6vlx240t chip. Meanwhile, this report proposes a new method to test the dynamic range of QR-D. The dynamic range of both architectures is around 110 dB. / Dissertation/Thesis / M.S. Electrical Engineering 2014 Electrical engineering Beamforming FPGA Matrix Triangularization QR Decomposition Systolic Array VLSI
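As a rough illustration of the idea in the abstract above (solving a least-squares problem by Givens rotations instead of inverting a matrix), here is a minimal software sketch. It is not the thesis's hardware design: it uses floating point rather than the 16-bit fixed-point format, and the function name `givens_qr_solve` and the list-of-lists matrix layout are assumptions made for this example only.

```python
import math


def givens_qr_solve(A, b):
    """Least-squares solve of A x ~= b via Givens rotations (illustrative sketch).

    A is an m x n list of lists with m >= n and full column rank (assumed),
    b is a list of length m. No matrix inverse is formed.
    """
    m, n = len(A), len(A[0])
    # Augment A with b so every rotation is applied to the right-hand side too.
    R = [row[:] + [bi] for row, bi in zip(A, b)]
    for j in range(n):
        for i in range(j + 1, m):
            a, z = R[j][j], R[i][j]
            r = math.hypot(a, z)
            if r == 0.0:
                continue  # nothing to eliminate in this position
            c, s = a / r, z / r
            # Rotate rows j and i so that R[i][j] becomes zero.
            for k in range(j, n + 1):
                top, bot = R[j][k], R[i][k]
                R[j][k] = c * top + s * bot
                R[i][k] = -s * top + c * bot
    # Back-substitution on the upper-triangular n x n block.
    x = [0.0] * n
    for j in range(n - 1, -1, -1):
        acc = R[j][n] - sum(R[j][k] * x[k] for k in range(j + 1, n))
        x[j] = acc / R[j][j]
    return x


if __name__ == "__main__":
    # Fit y = x0 + x1 * t through the points (0, 1), (1, 2), (2, 3).
    print(givens_qr_solve([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 2.0, 3.0]))
```

Each inner rotation touches only two rows, which is the property that lets a systolic array assign one rotation to each processing cell.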
14	Uma plataforma híbrida baseada em FPGA para a aceleração de um algoritmo de alinhamento de sequências biológicas [A hybrid FPGA-based platform for accelerating a biological sequence alignment algorithm] FIGUEIRÔA, Luiz Henrique Alves 17 August 2015 (has links) The discovery of the double-helix structure of deoxyribonucleic acid (DNA) by James D. Watson and Francis H. C. Crick in 1953 opened the way to understanding the mechanisms that encode the instructions for building and developing the cells of living beings. DNA sequencing is one of the first steps in this process. Next-generation sequencers (NGS) have produced massive volumes of data in biological databases, and compiling that information can demand intense computational activity. However, the performance of the tools employed in computational biology has not grown at the same rate as these databases, which may restrict advances in this research field. One of the primary techniques used is sequence alignment, which, by identifying similarities, enables the analysis of conserved regions in homologous sequences and serves as a starting point for studying protein secondary structures and constructing phylogenetic trees, among other uses. Because exact alignment algorithms have quadratic complexity in time and space, the computational cost can be high, demanding acceleration strategies. In this context, high-performance computing (HPC), structured around supercomputers and clusters, has been employed. However, the initial investment and the requirements for maintenance, floor space, and cooling, in addition to energy consumption, may represent significant costs. Hybrid parallel architectures based on the joint action of PCs and accelerator devices such as VLSI chips, GPGPUs, and FPGAs have emerged as more affordable alternatives, with promising results. The project described in this dissertation aims to accelerate the optimal global alignment algorithm, known as Needleman-Wunsch, on a hybrid platform based on a PC, which acts as host, and an FPGA. The acceleration comes from exploiting the parallelism opportunities offered by the algorithm and implementing it in hardware. The architecture developed is based on a linear systolic array and offers high performance and good scalability. DNA HPC FPGA GPGPU Systolic Array
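To make the quadratic cost mentioned in the abstract above concrete, the following is a minimal software sketch of Needleman-Wunsch global alignment scoring. It is not the dissertation's FPGA architecture; the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Fills a (len(a)+1) x (len(b)+1) table, so time and space are both
    quadratic in the sequence lengths. Scoring values are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        H[i][0] = i * gap
    for j in range(1, cols):
        H[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            H[i][j] = max(diag, up, left)
    return H[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on the same anti-diagonal can be computed at the same time. That independence is the kind of parallelism a linear systolic array can exploit.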
15	Techniques for algorithm design on the instruction systolic array Schmidt, Bertil January 1999 (has links) Instruction systolic arrays (ISAs) provide programmable high-performance hardware for specific computationally intensive applications. Typically, such an array is connected to a sequential host, thus operating like a coprocessor which solves only the computationally intensive tasks within a global application. The ISA model is a mesh-connected processor grid, which combines the advantages of special-purpose systolic arrays with the flexible programmability of general-purpose machines. The subject of this thesis is the analysis, design, and implementation of several special-purpose algorithms and subroutines on the ISA that take advantage of the special features of the systolic information flow. The ability of ISAs to perform parallel prefix computations in an extremely efficient way is exploited as a key-operation to derive efficiency as well as local operations within each processor. Therefore, given sequential algorithms have to be decomposed into simple building blocks of parallel prefix computations and parallel local operations. To modify sequential algorithms for parallelisation, several techniques are introduced in this thesis, e.g. swapping of loops in the sequential algorithm, shearing of data, and appropriate mapping of input data onto the processor array. It is demonstrated how these techniques can be exploited to derive efficient ISA algorithms for several computationally intensive applications. These include cryptographic applications (e.g. arithmetic operations on long operands, RSA encryption, RSA key generation) and image processing applications (e.g. convolution, Wavelet Transform, morphological operators, median filter, Fourier Transform, Hough Transform, Morphological Hough Transform, and tomographic image reconstruction).
Their implementation on Systola 1024, the first commercial parallel computer with the ISA architecture, shows that the concept of the ISA is very suitable for these applications and results in significant run time savings. The results of this thesis emphasise the suitability of the ISA concept as an accelerator for computationally intensive applications in the areas of cryptography and image processing. This might lead research towards further high-speed, low-cost systems based on ISA hardware. 621.39
16	Hardware Acceleration of Video analytics on FPGA using OpenCL January 2019 (has links) abstract: With the exponential growth in video content over the last few years, analysis of videos is becoming more crucial for many applications such as self-driving cars, healthcare, and traffic management. Most of these video analysis applications use deep learning algorithms such as convolutional neural networks (CNNs) because of their high accuracy in object detection. Thus, enhancing the performance of CNN models becomes crucial for video analysis. CNN models are computationally expensive and often require high-end graphics processing units (GPUs) for acceleration. However, for real-time applications in an energy- and thermal-constrained environment such as traffic management, GPUs are less preferred because of their high power consumption and limited energy efficiency, and because they are challenging to fit in a small space. To enable real-time video analytics in emerging large-scale Internet of Things (IoT) applications, the computation must happen at the network edge (near the cameras) in a distributed fashion. Thus, edge computing must be adopted. Recent studies have shown that field-programmable gate arrays (FPGAs) are highly suitable for edge computing due to their architectural adaptiveness, high computational throughput for streaming processing, and high energy efficiency. This thesis presents a generic OpenCL-defined CNN accelerator architecture optimized for FPGA-based real-time video analytics on the edge. The proposed CNN OpenCL kernel adopts a highly pipelined and parallelized 1-D systolic array architecture, which exploits both spatial and temporal parallelism for energy-efficient CNN acceleration on FPGAs. The large fan-in and fan-out of computational units to the memory interface are identified as the limiting factor that causes scalability issues in existing designs, and solutions are proposed to resolve the issue with compiler automation.
The proposed CNN kernel is highly scalable and parameterized by three architecture parameters, namely pe_num, reuse_fac, and vec_fac, which can be adapted to achieve 100% utilization of the coarse-grained computation resources (e.g., DSP blocks) for a given FPGA. The proposed CNN kernel is generic and can be used to accelerate a wide range of CNN models without recompiling the FPGA kernel hardware. The performance of Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet has been measured with the proposed CNN kernel on an Intel Arria 10 GX1150 FPGA. The measurements show that the proposed CNN kernel, when mapped with 100% utilization of computation resources, can achieve a latency of 11ms, 84ms, 1614.9ms, and 990.34ms for Alexnet, Resnet-50, Retinanet, and Light-weight Retinanet, respectively, when the input feature maps and weights are represented using the 32-bit floating-point data type. / Dissertation/Thesis / Masters Thesis Electrical Engineering 2019 Computer engineering Convolution Neural Network Deep Learning Field-Programmable Gate Array Hardware acceleration Systolic Array
17	Advanced techniques for improving radar performance Shoukry, Mohammed Adel 03 December 2019 (has links) Wideband beamforming has been widely used in modern radar systems. One of the powerful wideband beamforming techniques capable of achieving high selectivity over a wide bandwidth is the nested array (NA) beamformer. Such a beamformer consists of nested antenna arrays, 2-D spatio-temporal filters, and multirate filterbanks. Speed of operation is bounded by the speed of the hardware implementation. This dissertation presents the use of a systematic methodology for design space exploration of the NA beamformer's basic building blocks. For each block, the most efficient systolic array design in terms of the highest possible clock speed was selected for hardware implementation. The proposed systolic array designs and the conventional designs were implemented in FPGA hardware to verify their functionality and compare their performance. The implementation results confirm that the proposed systolic array implementations are faster and require fewer hardware resources than the published designs. The overall beamformer FPGA implementation is constructed based on the analysis of efficient systolic array designs of the beamformer building blocks. The implemented overall structure is then validated to ensure its proper operation. Further, the implementation performance is evaluated in terms of accuracy and error analysis in comparison to the MATLAB simulations. A new methodology, based on the systematic methodology, is proposed to close the gap between modern wideband radar I/O rates and the silicon operating speed. This new methodology is applied to the interpolator block as an example. The proposed methodology is simulated and tested using MATLAB object oriented programming (OOP) to ensure proper operation. / Graduate / 2020-11-17 Wideband beamforming nested array (NA) systolic array MATLAB object oriented programming (OOP)
18	Linear Algebra for Array Signal Processing on a Massively Parallel Dataflow Architecture Savaş, Süleyman January 2009 (has links) This thesis discusses the implementation of the Gentleman-Kung systolic array for QR decomposition using Givens Rotations within the context of radar signal processing. The systolic array of Givens Rotations is implemented and analysed using a massively parallel processor array (MPPA), the Ambric Am2045. The tools dedicated to the MPPA are tested in terms of engineering efficiency. aDesigner, which is built on the Eclipse environment, is used for programming, simulation, and performance analysis. aDesigner was produced for the Ambric chip family. Two parallel matrix multiplications have been implemented to become familiar with the architecture and tools. Moreover, systolic arrays of different sizes are implemented and compared with each other. For programming, the ajava and astruct languages are provided. However, floating-point numbers are not supported by the provided languages. Thus, fixed-point arithmetic is used in the systolic array implementation of Givens Rotations. Stable and precise numerical results are obtained as outputs of the algorithms. However, the analysis results are not reliable because of the performance analysis tools. Am2000 family Ambric register aDesigner radar antenna arrays beamforming QRD Gentleman-Kung systolic array Givens Rotations MPPA massively parallel processor array fixed point
19	Linear Algebra for Array Signal Processing on a Massively Parallel Dataflow Architecture Savaş, Süleyman January 2008 (has links) This thesis discusses the implementation of the Gentleman-Kung systolic array for QR decomposition using Givens Rotations within the context of radar signal processing. The systolic array of Givens Rotations is implemented and analysed using a massively parallel processor array (MPPA), the Ambric Am2045. The tools dedicated to the MPPA are tested in terms of engineering efficiency. aDesigner, which is built on the Eclipse environment, is used for programming, simulation, and performance analysis. aDesigner was produced for the Ambric chip family. Two parallel matrix multiplications have been implemented to become familiar with the architecture and tools. Moreover, systolic arrays of different sizes are implemented and compared with each other. For programming, the ajava and astruct languages are provided. However, floating-point numbers are not supported by the provided languages. Thus, fixed-point arithmetic is used in the systolic array implementation of Givens Rotations. Stable and precise numerical results are obtained as outputs of the algorithms. However, the analysis results are not reliable because of the performance analysis tools.
20	Linear Algebra for Array Signal Processing on a Massively Parallel Dataflow Architecture Savaş, Süleyman January 2009 (has links) This thesis discusses the implementation of the Gentleman-Kung systolic array for QR decomposition using Givens Rotations within the context of radar signal processing. The systolic array of Givens Rotations is implemented and analysed using a massively parallel processor array (MPPA), the Ambric Am2045. The tools dedicated to the MPPA are tested in terms of engineering efficiency. aDesigner, which is built on the Eclipse environment, is used for programming, simulation, and performance analysis. aDesigner was produced for the Ambric chip family. Two parallel matrix multiplications have been implemented to become familiar with the architecture and tools. Moreover, systolic arrays of different sizes are implemented and compared with each other. For programming, the ajava and astruct languages are provided. However, floating-point numbers are not supported by the provided languages. Thus, fixed-point arithmetic is used in the systolic array implementation of Givens Rotations. Stable and precise numerical results are obtained as outputs of the algorithms. However, the analysis results are not reliable because of the performance analysis tools.
