  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1071

Enabling Hardware/Software Co-design in High-level Synthesis

Choi, Jongsok 21 November 2012 (has links)
A hardware implementation can bring orders-of-magnitude improvements in performance and energy consumption over a software implementation. Hardware design, however, can be extremely difficult. High-level synthesis, the process of compiling software to hardware, promises to make hardware design easier. However, compiling an entire software program to hardware can be inefficient. This thesis proposes hardware/software co-design, where computationally intensive functions are accelerated by hardware while the remaining program segments execute in software. The work in this thesis builds a framework in which user-designated software functions are automatically compiled to hardware accelerators, which can execute serially or in parallel and work in tandem with a processor. To support multiple parallel accelerators, new multi-ported cache designs are presented. These caches provide low-latency, high-bandwidth data to further improve the performance of accelerators. An extensive range of cache architectures is explored, and results show that certain cache architectures significantly outperform others in a processor/accelerator system.
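As an illustration of the co-design idea (not code from the thesis), the sketch below shows the restricted C style that HLS flows typically require of a user-designated function: fixed loop bounds, no recursion, no dynamic allocation. The function name and constants are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#define N 256

/* User-designated function: a candidate for compilation to a hardware
 * accelerator. The fixed-bound loop and simple multiply-accumulate body
 * are easy for an HLS tool to pipeline. */
uint32_t checksum_kernel(const uint32_t data[N])
{
    uint32_t acc = 0;
    for (int i = 0; i < N; i++)
        acc += data[i] * 2654435761u;   /* maps naturally to DSP blocks */
    return acc;
}

int main(void)
{
    uint32_t buf[N];
    for (int i = 0; i < N; i++)
        buf[i] = (uint32_t)i;

    /* In a co-design flow this call site would be replaced by a driver
     * that starts the accelerator and collects its result; here the
     * function simply executes in software. */
    printf("checksum = %u\n", checksum_kernel(buf));
    return 0;
}
```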
1072

Efficient Molecular Dynamics Simulation on Reconfigurable Models with MultiGrid Method

Cho, Eunjung 22 April 2008 (has links)
Molecular dynamics (MD) simulations are widely used to investigate biological systems. An MD system is defined by the positions and momenta of its particles and their interactions. The dynamics of a system can be evaluated as an N-body problem, and the simulation is continued until the energy reaches equilibrium. Solving the dynamics numerically and evaluating the interactions is therefore computationally expensive even for a small number of particles. We focus on long-range interactions, since their calculation time is O(N^2) for an N-particle system. This dissertation proposes two research directions for MD simulation. First, we design a new variation of the multigrid (MG) algorithm, called multi-level charge assignment (MCA), that requires O(N) time for accurate and efficient calculation of the electrostatic forces. We apply MCA and back interpolation based on the structure of molecules to enhance the accuracy of the simulation. Our second direction exploits reconfigurable models to achieve fast calculation times. We design an FPGA-based MD simulator implementing the MCA method on a Xilinx Virtex-IV; it performs about 10 to 100 times faster than a software implementation, depending on the simulation accuracy desired. We also design fast and scalable reconfigurable mesh (R-Mesh) algorithms for MD simulations, demonstrating that large-scale biological studies can be simulated in close to real time and highlighting the feasibility of these models for evaluating potentials with faster calculation times. Specifically, we develop R-Mesh algorithms for both the direct method and the multigrid method. The direct method evaluates exact potentials and forces but requires O(N^2) calculation time for the electrostatic forces on a general-purpose processor. The MG method adopts an interpolation technique to reduce the calculation time to O(N) for a given accuracy. Our R-Mesh algorithms require only O(N) or O(log N) time for the direct method on a linear R-Mesh of N processors and an N×N R-Mesh, respectively, and O(r) + O(log M) time for the multigrid method on an X×Y×Z R-Mesh, where r = N/M and M = X×Y×Z is the number of finest-grid points.
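For context, here is a minimal sketch (not from the dissertation) of the direct method's O(N^2) pairwise force evaluation that the MCA/multigrid approach reduces to O(N); the Coulomb constant and units are placeholders.

```c
#include <math.h>
#include <stdio.h>

typedef struct { double x, y, z, q; } Particle;

/* Direct method: accumulate Coulomb forces over all N(N-1)/2 pairs. */
void direct_coulomb(const Particle *p, double (*f)[3], int n)
{
    const double k = 1.0;               /* Coulomb constant, placeholder units */
    for (int i = 0; i < n; i++)
        f[i][0] = f[i][1] = f[i][2] = 0.0;

    for (int i = 0; i < n; i++) {       /* every pair: the O(N^2) cost */
        for (int j = i + 1; j < n; j++) {
            double dx = p[i].x - p[j].x;
            double dy = p[i].y - p[j].y;
            double dz = p[i].z - p[j].z;
            double r2 = dx*dx + dy*dy + dz*dz;
            double inv_r = 1.0 / sqrt(r2);
            double s = k * p[i].q * p[j].q * inv_r * inv_r * inv_r;
            f[i][0] += s * dx; f[i][1] += s * dy; f[i][2] += s * dz;
            f[j][0] -= s * dx; f[j][1] -= s * dy; f[j][2] -= s * dz;
        }
    }
}

int main(void)
{
    Particle p[3] = {{0, 0, 0, +1}, {1, 0, 0, -1}, {0, 1, 0, +1}};
    double f[3][3];
    direct_coulomb(p, f, 3);
    printf("force on particle 0: (%g, %g, %g)\n", f[0][0], f[0][1], f[0][2]);
    return 0;
}
```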
1073

An equalization technique for high rate OFDM systems

Yuan, Naihua 05 December 2003
In a typical orthogonal frequency division multiplexing (OFDM) broadband wireless communication system, a guard interval based on a cyclic prefix is inserted to avoid inter-symbol interference and inter-carrier interference. The guard interval must be at least as long as the maximum channel delay spread. This method is very simple, but it reduces transmission efficiency. The efficiency is especially low in systems that exhibit a long channel delay spread combined with a small number of sub-carriers, such as the IEEE 802.11a wireless LAN (WLAN). To increase transmission efficiency, a time-domain equalizer (TEQ) is commonly included in an OFDM system to shorten the effective channel impulse response to within the guard interval. Many TEQ algorithms have been developed for low-rate OFDM applications such as asymmetric digital subscriber line (ADSL); their drawback is a high computational load, and most of the popular TEQ algorithms are not suitable for the IEEE 802.11a system, a high-data-rate wireless LAN based on the OFDM technique. In this thesis, a TEQ algorithm based on the minimum mean square error criterion is investigated for the high-rate IEEE 802.11a system. This algorithm has a comparatively low computational complexity, making it practical for high-data-rate OFDM systems. In forming the model to design the TEQ, a reduced convolution matrix is exploited to lower the computational complexity. Mathematical analysis and simulation results are provided to show the validity and the advantages of the algorithm. In particular, it is shown that a high performance gain at a data rate of 54 Mbps can be obtained with a moderate-order TEQ finite impulse response (FIR) filter. The algorithm is implemented in a field-programmable gate array (FPGA). The characteristics and regularities of the matrix elements are further exploited to reduce the hardware complexity of the matrix multiplication implementation. The optimum TEQ coefficients can be found in less than 4 µs for a 7th-order TEQ FIR filter, which is the interval of one OFDM symbol in the IEEE 802.11a system. To compensate for the effective channel impulse response, a 64-point radix-4 pipelined fast Fourier transform block is implemented in the FPGA to perform zero-forcing equalization in the frequency domain. The offsets between the hardware implementations and the mathematical calculations are provided and analyzed, and the system performance loss introduced by the hardware implementation is tested. Hardware output and simulation results verify that the chips function properly and satisfy the requirements of a system running at a data rate of 54 Mbps.
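A minimal sketch of the cyclic-prefix insertion described above, using the IEEE 802.11a numerology of 64 samples per symbol and a 16-sample guard interval; the function name is illustrative.

```c
#include <string.h>

#define NFFT 64   /* samples per OFDM symbol (802.11a) */
#define G    16   /* guard interval length */

/* in:  NFFT time-domain samples (one I or Q component)
 * out: G + NFFT samples with the cyclic prefix prepended; G must cover
 *      the (possibly TEQ-shortened) channel impulse response */
void add_cyclic_prefix(const float in[NFFT], float out[G + NFFT])
{
    memcpy(out, in + NFFT - G, G * sizeof(float));   /* prefix = symbol tail */
    memcpy(out + G, in, NFFT * sizeof(float));       /* then the full symbol */
}

int main(void)
{
    float sym[NFFT], tx[G + NFFT];
    for (int i = 0; i < NFFT; i++) sym[i] = (float)i;
    add_cyclic_prefix(sym, tx);
    return tx[0] == sym[NFFT - G] ? 0 : 1;   /* prefix matches symbol tail */
}
```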
1075

Overlay Architectures for FPGA-Based Software Packet Processing

Labrecque, Martin 16 June 2011 (has links)
Packet processing is the enabling technology of networked information systems such as the Internet and is usually performed with fixed-function, custom-made ASIC chips. As communication protocols evolve rapidly, there is increasing interest in adapting the processing over time, and since software is the preferred way of expressing complex computation, we are interested in finding a platform to execute packet-processing software with the best possible throughput. Because FPGAs are widely used in network equipment and can implement processors, we are motivated to investigate executing software directly on FPGAs. Off-the-shelf soft processors on FPGA fabric are currently geared towards embedded sequential tasks; in contrast, network processing is most often inherently parallel between packet flows, if not between individual packets. Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput than commercially available shared-memory soft multiprocessors via improvements to the underlying soft processor architecture. We study a number of processor pipeline organizations to identify which ones can scale to a larger number of execution threads, and find that tuning multithreaded pipelines can provide compact cores with high throughput. We then perform a design-space exploration of multicore soft systems, compare single-threaded and multithreaded designs to identify scalability limits, and develop processor architectures that allow threads to execute with as few architectural stalls as possible, in particular through instruction replay and static hazard detection mechanisms. To further reduce wait times, we allow threads to execute speculatively by leveraging transactional memory. Our multithreaded multiprocessor, along with our compilation and simulation framework, makes the FPGA easy to use for an average programmer, who can write an application as a single thread of computation with coarse-grained synchronization around shared data structures. Compared with multithreaded processors using lock-based synchronization, we measure up to 57% additional throughput with transactional-memory-based synchronization. Given our applications, gigabit interfaces, and 125 MHz system clock rate, our results suggest that soft processors can process packets in software at high throughput and low latency while capitalizing on the FPGAs already available in network equipment.
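A hedged sketch of the programming model described above: the handler is written as a single sequential function with coarse-grained synchronization around a shared flow table; on the proposed soft multiprocessor the critical section could instead run speculatively as a transaction. All names are illustrative.

```c
#include <pthread.h>
#include <stdint.h>

#define TABLE_SIZE 1024

static uint32_t flow_bytes[TABLE_SIZE];               /* shared data structure */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by each hardware thread for each received packet. */
void handle_packet(uint32_t flow_hash, uint32_t length)
{
    /* Coarse-grained critical section; a transactional-memory system
     * would execute it speculatively and only serialize on conflict. */
    pthread_mutex_lock(&table_lock);
    flow_bytes[flow_hash % TABLE_SIZE] += length;
    pthread_mutex_unlock(&table_lock);
}

int main(void)
{
    handle_packet(42u, 1500u);                        /* single-thread smoke test */
    return (int)(flow_bytes[42] != 1500u);
}
```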
1077

Contribution to methods for three-dimensional tracking of objects at high operating speed by means of stereovision

Aranda, Joan 16 October 1997 (has links)
No description available.
1078

Hardware accelerators for embedded fingerprint-based personal recognition systems

Fons Lluís, Mariano 29 May 2012 (has links)
The development of automatic biometrics-based personal recognition systems is a reality in the current technological age. Not only operations demanding stringent security levels but also many everyday consumer applications call for computational platforms that can recognize an individual's identity from the analysis of his or her physiological and/or behavioural characteristics. The state of the art points out two main open problems in the implementation of such applications: on the one hand, the need to improve reliability in terms of recognition accuracy, overall security, and real-time performance; and on the other hand, the need to reduce the cost of the physical platforms in charge of the processing. This work aims at finding a system architecture able to address these limitations of current personal recognition applications. Embedded solutions based on hardware/software co-design techniques and programmable (and run-time reconfigurable) logic devices such as FPGAs or SOPCs are shown to be an efficient alternative to existing multiprocessor systems based on HPCs, GPUs, or PC platforms for developing this kind of high-performance application at low cost.
1079

High Performance Elliptic Curve Cryptographic Co-processor

Lutz, Jonathan January 2003 (has links)
In FIPS 186-2, NIST recommends several finite fields to be used in the elliptic curve digital signature algorithm (ECDSA). Of the ten recommended finite fields, five are binary extension fields with degrees ranging from 163 to 571. The fundamental building block of the ECDSA, like any ECC-based protocol, is elliptic curve scalar multiplication. This operation is also the most computationally intensive. In many situations it may be desirable to accelerate the elliptic curve scalar multiplication with specialized hardware. In this thesis a high-performance elliptic curve processor is developed which is optimized for the NIST binary fields. The architecture is built from the bottom up starting with the field arithmetic units. The architecture uses a field multiplier capable of performing a field multiplication over the extension field with degree 163 in 0.060 microseconds. Architectures for squaring and inversion are also presented. The co-processor uses Lopez and Dahab's projective coordinate system and is optimized specifically for Koblitz curves. A prototype of the processor has been implemented for the binary extension field with degree 163 on a Xilinx XCV2000E FPGA. The prototype runs at 66 MHz and performs an elliptic curve scalar multiplication in 0.233 msec on a generic curve and 0.075 msec on a Koblitz curve.
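As a sketch of the operation being accelerated, the toy program below implements left-to-right double-and-add scalar multiplication over a small prime-field curve; the thesis targets GF(2^163) Koblitz curves with Lopez–Dahab projective coordinates, but the control structure is the same. The curve and point here are made up for the example.

```c
#include <stdint.h>
#include <stdio.h>

#define P 97            /* toy field prime */
#define A 2             /* curve y^2 = x^3 + A*x + 3 over GF(P) */

typedef struct { int64_t x, y; int inf; } Point;

static int64_t mod(int64_t a) { int64_t r = a % P; return r < 0 ? r + P : r; }

/* modular inverse via Fermat's little theorem: a^(P-2) mod P */
static int64_t inv(int64_t a)
{
    int64_t r = 1, b = mod(a), e = P - 2;
    while (e) { if (e & 1) r = r * b % P; b = b * b % P; e >>= 1; }
    return r;
}

static Point padd(Point p, Point q)
{
    if (p.inf) return q;
    if (q.inf) return p;
    if (p.x == q.x && mod(p.y + q.y) == 0)
        return (Point){0, 0, 1};                          /* P + (-P) = infinity */
    int64_t l;
    if (p.x == q.x && p.y == q.y)
        l = mod((3 * p.x % P * p.x + A) * inv(2 * p.y));  /* tangent slope */
    else
        l = mod((q.y - p.y) * inv(q.x - p.x));            /* chord slope  */
    int64_t x = mod(l * l - p.x - q.x);
    int64_t y = mod(l * (p.x - x) - p.y);
    return (Point){x, y, 0};
}

/* left-to-right double-and-add: one doubling per bit, one add per set bit */
static Point scalar_mul(uint64_t k, Point g)
{
    Point r = {0, 0, 1};                  /* start at the point at infinity */
    for (int i = 63; i >= 0; i--) {
        r = padd(r, r);                   /* double */
        if ((k >> i) & 1) r = padd(r, g); /* add    */
    }
    return r;
}

int main(void)
{
    Point g = {3, 6, 0};                  /* on y^2 = x^3 + 2x + 3 mod 97 */
    Point r = scalar_mul(20, g);
    printf("20*G = (%lld, %lld)\n", (long long)r.x, (long long)r.y);
    return 0;
}
```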
1080

Parallel Multiplier Designs for the Galois/Counter Mode of Operation

Patel, Pujan January 2008 (has links)
The Galois/Counter Mode of Operation (GCM), recently standardized by NIST, simultaneously authenticates and encrypts data at speeds not previously possible for both software and hardware implementations. In GCM, data integrity is achieved by chaining Galois field multiplication operations, while a symmetric-key block cipher such as the Advanced Encryption Standard (AES) is used to meet goals of confidentiality. Area optimization in a number of proposed high-throughput GCM designs has been approached through efficient composite S-boxes for AES. Less work has been done on reducing the area requirements of the Galois multiplication operation in GCM, which accounts for up to 30% of the overall area when a brute-force approach is used. Current pipelined implementations of GCM also have large key-change latencies, which can reduce the average throughput expected under traditional Internet traffic conditions. This thesis aims to address these issues by presenting area-efficient parallel multiplier designs for GCM and an approach for achieving low-latency key changes. The widely known Karatsuba parallel multiplier (KA) and the recently proposed Fan-Hasan multiplier (FH) were designed for GCM and implemented on ASIC and FPGA architectures. This is the first time these multipliers have been compared in a practical implementation, and the FH multiplier showed noteworthy improvements over the KA multiplier in terms of delay, with similar area requirements. Using the composite S-box, ASIC designs of GCM implemented with subquadratic multipliers are shown to have an area savings of up to 18%, without affecting the throughput, against designs using the brute-force Mastrovito multiplier. For low-delay LUT S-box designs in GCM, although the subquadratic multipliers are part of the critical path, implementations with the FH multiplier showed the highest efficiency in terms of area resources and throughput over all other designs. FPGA results similarly showed a significant reduction in the number of slices using subquadratic multipliers, and the highest throughput to date for FPGA implementations of GCM was also achieved. The proposed reduced-latency key-change design, which supports all AES key sizes, showed a 20% improvement in average throughput over other GCM designs that do not use the same techniques. The GCM implementations provided in this thesis are among the most area-efficient, yet high-throughput, designs to date.
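To illustrate the subquadratic idea behind the KA multiplier, the sketch below applies one Karatsuba step to carry-less (GF(2)[x]) multiplication, the core operation of GCM's GHASH, building a 64×64-bit product from three 32×32-bit products instead of four; hardware designs apply the same recursion to 128-bit operands. The code is expository, not from the thesis.

```c
#include <stdint.h>
#include <stdio.h>

/* schoolbook carry-less multiply: 32x32 -> 64 bits */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= (uint64_t)a << i;      /* XOR replaces addition in GF(2) */
    return r;
}

/* one Karatsuba level: 64x64 -> 128 bits using three 32-bit multiplies */
static void clmul64_karatsuba(uint64_t a, uint64_t b,
                              uint64_t *hi, uint64_t *lo)
{
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    uint64_t z0 = clmul32(a0, b0);                      /* low  half product */
    uint64_t z2 = clmul32(a1, b1);                      /* high half product */
    uint64_t z1 = clmul32(a0 ^ a1, b0 ^ b1) ^ z0 ^ z2;  /* middle term       */

    *lo = z0 ^ (z1 << 32);
    *hi = z2 ^ (z1 >> 32);
}

int main(void)
{
    uint64_t hi, lo;
    clmul64_karatsuba(0x87, 0x131, &hi, &lo);           /* small test polynomials */
    printf("product = %016llx%016llx\n",
           (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}
```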
