1 |
Fault tolerance of evolvable hardware through diversification. Hollingworth, Gordon S. January 2001
No description available.
|
2 |
Hardware Architecture for Semantic Comparison. Mohan, Suneil 2012 May 1900
Semantic Routed Networks provide a superior infrastructure for complex search engines. In a Semantic Routed Network (SRN), the routers are the critical component, and they perform semantic comparison as their key computation. As the amount of information available on the Internet grows, the speed and efficiency with which information can be delivered to the user become important. Most current search engines scale to meet growing demand by deploying large data centers of general-purpose computers that consume many megawatts of power. Reducing the power consumption of these data centers while providing better performance would significantly reduce their operating costs.
Performing operations in parallel is a key optimization for better performance on general-purpose CPUs. Current parallelization techniques rely on multi-core architectures with multithreading capabilities. These coarse-grained approaches have considerable resource-management overhead and provide only sub-linear speedup.
This dissertation proposes techniques for a highly parallel, power-efficient architecture that performs semantic comparison as its core activity. Hardware-centric parallel algorithms have been developed to populate the required data structures and then compute semantic similarity. The performance of the proposed design is further enhanced with a pipelined architecture. For performance comparison, the proposed algorithms were also implemented on two contemporary platforms, Nvidia CUDA and an FPGA. To validate the designs, a semantic benchmark was also created. The results show that a dedicated semantic comparator delivers significantly better performance than the other platforms.
Results show that the proposed hardware semantic comparison architecture delivers a speedup of up to 10^5 while reducing power consumption by 80% compared to traditional computing platforms. Future research directions, including better power optimization, architecting the complete semantic router, and using the semantic benchmark for SRN research, are also discussed.
|
3 |
Constraint Programming Techniques for Generating Efficient Hardware Architectures For Field Programmable Gate Arrays. Shah, Atul Kumar 01 May 2010
This thesis presents an approach for modeling and generating efficient hardware architectures for field programmable gate arrays (FPGAs) using constraint programming techniques. The focus of this thesis is the derivation of optimal or near-optimal schedules for streaming applications from data flow graphs (DFGs). The resulting schedules are then used to facilitate the architecture generation process. Most streaming applications, such as digital signal processing (DSP) algorithms, are repetitive in nature: the same computation is performed on different data items. This repetitive nature can be used to expose additional parallelism across iterations by creating multiple instances of the same computation. In high level synthesis (HLS), replicating the computation improves the performance of the design but requires additional area. The extra area required for a replicated graph can be reduced by using pipelined functional units and by adding a few clock cycles beyond the critical path of the DFG. This thesis discusses the use of a constraint programming (CP)-based scheduler to generate optimal schedules based on a designer-provided replication level and critical path relaxation. The scheduler is an integrated part of a design tool called CHARGER, which analyzes the resulting schedules to allocate memory for storing intermediate data, creates the infrastructure necessary to execute the application efficiently, and finally generates synthesizable Verilog/VHDL code for the controller. The performance of the architectures derived using the CP-based scheduler is compared with that of architectures generated using a force-directed scheduling (FDS)-based scheduler for algorithms selected from embedded/multimedia applications.
The results show that our CP-based scheduler outperforms the FDS-based scheduler in terms of both area and efficiency of the generated architectures, with an average area saving of 39% and an average performance improvement of 41%.
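Critical path relaxation is easiest to see through each operation's scheduling window. The sketch below computes ASAP and ALAP start steps for a small DFG; the gap between them (mobility) is the freedom a scheduler has, and in CP formulations such bounds typically become the domains of the start-time variables. This is not CHARGER's scheduler: the example graph, the delays, and the function names are hypothetical, and no resource constraints are modeled.

```python
from collections import defaultdict

def asap(dfg, delay):
    """Earliest start step of each node; dfg maps node -> list of predecessors (acyclic)."""
    start = {}
    def visit(n):
        if n not in start:
            start[n] = max((visit(p) + delay[p] for p in dfg[n]), default=0)
        return start[n]
    for n in dfg:
        visit(n)
    return start

def alap(dfg, delay, latency):
    """Latest start step of each node such that every node finishes by `latency`."""
    succs = defaultdict(list)
    for n, preds in dfg.items():
        for p in preds:
            succs[p].append(n)
    start = {}
    def visit(n):
        if n not in start:
            start[n] = min((visit(s) for s in succs[n]), default=latency) - delay[n]
        return start[n]
    for n in dfg:
        visit(n)
    return start

def mobility(dfg, delay, relaxation=0):
    """Return the critical path length and each node's (ASAP, ALAP) window
    when the schedule may exceed the critical path by `relaxation` steps."""
    early = asap(dfg, delay)
    critical_path = max(early[n] + delay[n] for n in dfg)
    late = alap(dfg, delay, critical_path + relaxation)
    return critical_path, {n: (early[n], late[n]) for n in dfg}

# Hypothetical DFG: m3 = (m1 + m2) * a2, with 2-cycle multipliers and 1-cycle adders.
EXAMPLE_DFG = {"m1": [], "m2": [], "a1": ["m1", "m2"], "a2": [], "m3": ["a1", "a2"]}
EXAMPLE_DELAY = {"m1": 2, "m2": 2, "a1": 1, "a2": 1, "m3": 2}
```

With no relaxation, `m1` and `m2` both have the window (0, 0), so two multipliers are required; allowing one extra cycle widens both windows to (0, 1), which lets them share a single pipelined multiplier (`m1` at step 0, `m2` at step 1) at the cost of one cycle of latency.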
|
4 |
Shared Memory Abstractions for Heterogeneous Multicore Processors. Schneider, Scott 21 January 2011
We are now seeing diminishing returns from classic single-core processor designs, yet the number of transistors available for a processor is still increasing. Processor architects are therefore experimenting with a variety of multicore designs. Heterogeneous multicore processors with Explicitly Managed Memory (EMM) hierarchies are one such experimental design; they have the potential for high performance, but at the cost of great programmer effort. EMM processors have cores that are divorced from the normal memory hierarchy, so the onus is on the programmer to manage locality and parallelism. This dissertation presents the Cellgen source-to-source compiler, which moves some of this complexity back into the compiler. Cellgen offers a directive-based programming model with semantics similar to OpenMP for the Cell Broadband Engine, a general-purpose processor with EMM. The compiler implicitly handles locality and parallelism, schedules memory transfers for data-parallel regions of code, and provides performance predictions that can be leveraged to make scheduling decisions. We compare this approach to using a software cache, to a task-based programming model with explicit data transfers, and to programming the Cell directly using the native SDK. We also present a case study that uses the Cellgen compiler to compare multiple kinds of multicore architectures: heterogeneous, homogeneous, and radically data-parallel graphics processors. / Ph. D.
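To make concrete what scheduling memory transfers involves on an EMM core, here is a minimal double-buffering sketch of the copy-in/compute/copy-out loop a programmer would otherwise write by hand. It is an illustration under stated assumptions, not Cellgen output: Python list slices stand in for asynchronous DMA transfers into a core's small local store, `process_in_chunks` and its parameters are hypothetical, and the overlap of transfer and computation appears only as the order of entries in the operation log.

```python
def process_in_chunks(main_memory, kernel, local_store_words):
    """Apply `kernel` elementwise to `main_memory` through a small local store.

    Two buffers of local_store_words // 2 elements alternate: while chunk i is
    being computed, the transfer of chunk i + 1 has already been issued.
    Returns (output list, log of ("get", i) / ("compute+put", i) operations).
    """
    chunk = local_store_words // 2  # two equal buffers share the local store
    if chunk == 0:
        raise ValueError("local store too small for two buffers")
    starts = list(range(0, len(main_memory), chunk))
    out = [None] * len(main_memory)
    buffers = [None, None]
    log = []
    if starts:
        buffers[0] = main_memory[0:chunk]  # prefetch the first chunk
        log.append(("get", 0))
    for i, s in enumerate(starts):
        cur = i % 2
        if i + 1 < len(starts):
            nxt = starts[i + 1]
            buffers[1 - cur] = main_memory[nxt:nxt + chunk]  # issue next transfer
            log.append(("get", i + 1))
        result = [kernel(x) for x in buffers[cur]]  # compute on local data only
        out[s:s + len(result)] = result             # write results back
        log.append(("compute+put", i))
    return out, log
```

Every "get" for chunk i + 1 is issued before chunk i is computed, which is the ordering that lets real hardware hide transfer latency behind computation.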
|
5 |
Flexible encoder and decoder designs for low-density parity-check codes. Kopparthi, Sunitha January 1900
Doctor of Philosophy / Department of Electrical and Computer Engineering / Don M. Gruenbacher / Future technologies such as cognitive radio require flexible and reliable hardware architectures that can be easily configured and adapted to varying coding parameters. The objective of this work is to develop a flexible hardware encoder and decoder for low-density parity-check (LDPC) codes. The design methodologies used to implement the LDPC encoder and decoder are flexible in terms of parity-check matrix, code rate, and code length. All of these designs are implemented on a programmable chip and tested.
Because of their high complexity, LDPC encoder implementations are usually optimized for area, and such designs typically have relatively low data rates. Two new encoder designs are developed that achieve much higher data rates, up to 844 Mbps, at the cost of more area. Using structured LDPC codes decreases the encoding complexity and provides design flexibility. An encoder architecture is presented that adheres to the structured LDPC codes defined in the IEEE 802.16e standard.
A single encoder design is also developed that accommodates different code lengths and code rates without requiring re-synthesis to change the encoding parameters. This flexible encoder for structured LDPC codes is also implemented on a custom chip. The structured encoder reaches a maximum coded data rate of 844 Mbps, and for a given code rate this value is independent of the code length.
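Structured codes such as those in IEEE 802.16e are quasi-cyclic: the parity-check matrix is defined by a small base matrix whose entries are expanded into z × z blocks, which helps one design serve several code lengths. A minimal expansion sketch follows, assuming the common convention that -1 denotes an all-zero block and a non-negative entry p denotes the identity matrix cyclically shifted right by p columns. The base matrix in the test is a made-up toy, not one from the standard, and the standard's rules for scaling shift values across expansion factors are omitted.

```python
def expand_base_matrix(base, z):
    """Expand a quasi-cyclic base matrix into a binary parity-check matrix H.

    base: list of rows of ints; -1 -> z x z zero block,
          p >= 0 -> z x z identity cyclically shifted right by p (assumed convention).
    Returns H as a list of (len(base) * z) rows of 0/1 ints.
    """
    H = [[0] * (len(base[0]) * z) for _ in range(len(base) * z)]
    for i, row in enumerate(base):
        for j, p in enumerate(row):
            if p < 0:
                continue  # all-zero block
            for r in range(z):
                # row r of the shifted identity has its single 1 in column (r + p) mod z
                H[i * z + r][j * z + (r + p) % z] = 1
    return H
```

Each row of H has weight equal to the number of non-negative entries in its base-matrix row, so the hardware only needs the small base matrix and z to reconstruct the connections.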
An LDPC decoder is also designed using a generic methodology that applies both to structured and to randomly generated LDPC codes. The coded data rate of the decoder increases with the code length. The number of decoding iterations plays an important role in determining decoder performance and latency. This design validates the estimated codeword after every iteration and stops decoding as soon as the estimate is a valid codeword, which reduces power consumption. For a given parity-check matrix and signal-to-noise ratio, a procedure is presented for finding the optimum maximum number of decoding iterations, taking into account the effects on power, delay, and error performance.
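Per-iteration validation usually amounts to a syndrome check, H c^T = 0 (mod 2). The sketch below shows that check driving early termination, paired with a simple hard-decision bit-flipping decoder chosen only because it is short; the abstract does not say which decoding algorithm the thesis uses (soft-decision message passing is common in hardware), and the function names are hypothetical.

```python
def syndrome(H, c):
    """H . c^T over GF(2): one parity result per row of H."""
    return [sum(h & b for h, b in zip(row, c)) % 2 for row in H]

def bit_flip_decode(H, received, max_iters):
    """Hard-decision bit flipping with syndrome-based early termination.

    Returns (estimate, iterations_used, is_valid_codeword).
    """
    c = list(received)
    for it in range(max_iters + 1):
        if not any(syndrome(H, c)):
            return c, it, True  # all checks satisfied: stop early, save power
        if it == max_iters:
            break  # iteration cap reached without a valid codeword
        s = syndrome(H, c)
        # number of unsatisfied checks that involve each bit
        fails = [sum(s[i] for i, row in enumerate(H) if row[j]) for j in range(len(c))]
        worst = max(fails)
        c = [b ^ 1 if f == worst else b for b, f in zip(c, fails)]
    return c, max_iters, False
```

The tests use the (7,4) Hamming parity-check matrix: a single error in the last bit is corrected after one iteration, while an error in the first bit makes this naive flipping rule oscillate until the iteration cap is hit, which is why the choice of maximum iteration count affects power and latency.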
|
6 |
DESCRIPTION AND ANALYSIS OF A FLEXIBLE HARDWARE ARCHITECTURE FOR EVENT-DRIVEN DISTRIBUTED SENSOR NETWORK NODES. Davis, Jesse, Kyker, Ron, Berry, Nina 10 1900
International Telemetering Conference Proceedings / October 20-23, 2003 / Riviera Hotel and Convention Center, Las Vegas, Nevada / One engineering aspect of distributed sensor networks that has not received adequate attention is the system-level hardware architecture of the individual network nodes. A novel hardware architecture based on the idea of task-specific modular computing is proposed to provide both the high flexibility and the low power consumption that distributed sensing solutions require. The power consumption of the architecture is analyzed mathematically against a traditional approach, and guidelines are developed for identifying application scenarios that would benefit from the new design.
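The paper's exact power model is not reproduced in the abstract. As a hedged first-order illustration of why task-specific modules can save power, the sketch computes duty-cycle-weighted average power, P_avg = Σ_k [d_k P_active,k + (1 - d_k) P_sleep,k], for a single general-purpose processor versus separately power-gated modules. All figures are hypothetical, the tasks are assumed not to overlap in time, and wake-up energy is ignored.

```python
def average_power(modules):
    """Duty-cycle-weighted average power (mW) of independently gated modules.

    modules: list of (duty_cycle, active_mW, sleep_mW) tuples.
    """
    return sum(d * p_on + (1 - d) * p_off for d, p_on, p_off in modules)

# Hypothetical node with two non-overlapping tasks: sensing (5% of the time)
# and communication (1% of the time).
# Traditional: one processor (30 mW active, 0.1 mW asleep) wakes for both tasks.
traditional = [(0.05 + 0.01, 30.0, 0.1)]
# Modular: a small sensing module (5 mW) and a communication module (20 mW),
# each powered only while its own task runs.
modular = [(0.05, 5.0, 0.01), (0.01, 20.0, 0.01)]
```

Under these made-up numbers the traditional node averages about 1.894 mW and the modular node about 0.469 mW; the benefit shrinks as duty cycles grow or as wake-up costs, which this model omits, become significant.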
|
7 |
Totalitarian Architecture: A Monograph of the Civic Centre of Bucharest / L'architecture totalitaire. Un monographie du Centre civique de Bucarest. Racolta, Radu Petru 30 June 2010
The Civic Centre of Bucharest is the project studied in depth, and as this thesis progresses it gradually becomes the reference point that allows us to draw parallels and comparisons with other projects built under totalitarian regimes. Directly confronting different architectural responses has the merit of highlighting the common traits of the act of building and its consequences for the urban atmosphere; in a word, of identifying totalitarian architectural production. It also makes it possible to trace the intellectual path that dictators follow in order to imagine and give material form to a world of their own. Architecture is an unavoidable form of expression, an essential dimension for understanding the totalitarian mind.
|
8 |
Hardware architectures for morphological filters with large structuring elements / Architectures matérielles pour filtres morphologiques avec des grandes éléments structurants. Bartovsky, Jan 14 November 2012
This thesis focuses on implementing fundamental morphological filters in dedicated hardware. Its main objective is to provide a programmable and efficient implementation of basic morphological operators using efficient dataflow algorithms, while considering the requirements of the application as a whole. In the first part, we study existing algorithms for fundamental morphological operators and their implementation on different computational platforms. We are especially interested in queue-based algorithms because they provide sequential data access with minimal latency, properties that are very beneficial for dedicated hardware. We then propose a new queue-based algorithm for arbitrary-oriented opening that allows direct granulometric measurements. Performance benchmarks of these two algorithms are also discussed. The second part presents hardware implementations of these algorithms as stream processing units. We begin with a 1-D dilation unit; then, using the separability of dilation, we build 2-D rectangular and polygonal dilation units. The processing unit for arbitrary-oriented opening and pattern spectrum is described as well. We also introduce a method of parallel computation that runs several copies of a processing unit in parallel to speed up the computation.
All proposed processing units are experimentally assessed in hardware using FPGA prototypes, and their performance and FPGA occupation results are discussed. In the third part, the proposed units are employed in two diverse applications, illustrating their capability to address performance-demanding, low-power embedded applications. The main contributions of this thesis are: 1) a new algorithm for arbitrary-oriented opening and pattern spectrum, 2) a programmable hardware implementation of fundamental morphological operators with large structuring elements and arbitrary orientation, and 3) a performance increase obtained through multi-level parallelism. Results suggest that real-time performance, previously unachievable for these traditionally costly operators, can be attained even for long concatenations of operators and high-resolution images.
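A queue-based running maximum illustrates why these algorithms suit streaming hardware: each sample enters the queue once and leaves at most once, and the output lags the input by only half the structuring-element length. The sketch below uses the familiar monotonic-deque formulation for a flat, centered, odd-length segment and builds a rectangular 2-D dilation from it by separability. It is a software analogue under those assumptions, not the thesis's hardware unit, and the function names are hypothetical. Erosion follows by replacing maximum with minimum, and an opening is an erosion followed by a dilation.

```python
from collections import deque

def dilate_1d(signal, k):
    """Flat dilation by a centered segment of odd length k (running maximum).

    Borders are handled by clipping the window to the signal.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    r = k // 2
    n = len(signal)
    out = []
    q = deque()  # indices whose values are strictly decreasing from front to back
    for i in range(n + r):  # r extra steps flush the last windows
        if i < n:
            while q and signal[q[-1]] <= signal[i]:
                q.pop()  # dominated samples can never be a maximum again
            q.append(i)
        center = i - r  # output is produced r samples after the input
        if center >= 0:
            while q[0] < center - r:
                q.popleft()  # drop samples that left the window
            out.append(signal[q[0]])
    return out

def dilate_2d(image, kh, kw):
    """Rectangular kh x kw dilation via separability: rows first, then columns."""
    rows = [dilate_1d(row, kw) for row in image]
    cols = [dilate_1d(list(col), kh) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Because the 2-D unit is just two 1-D passes, the cost per pixel stays roughly constant as the structuring element grows, which is what makes large structuring elements practical.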
|
9 |
Analysis and acceleration of high quality isosurface contouring / Análise e aceleração da extração de isosuperfícies com alta qualidade. Schmitz, Leonardo Augusto January 2009
This work presents an analysis of the main GPU algorithms for isosurface polygonization. The results of this analysis show both how the GPU could be modified to support this type of algorithm and how the algorithms can be modified to fit the characteristics of current GPUs. The techniques used in GPU versions of Marching Cubes are extended, producing a polygonization with fewer artifacts. Parallel versions of Dual Contouring and Macet are proposed; these algorithms improve the approximation and the shape of triangle meshes, respectively. Both techniques extract isosurfaces from large data volumes in under one second, outperforming CPU versions by up to two orders of magnitude. The contributions of this work include a table-driven version of Dual Contouring (DC) for structured grids. The table specifies the topology of the quadrilaterals, which simplifies the implementation and improves cache efficiency in parallel settings. The table is well suited to GPU stream expansion using both the geometry shader and Histogram Pyramids. Furthermore, our approximation of isosurface features is simpler than both Singular Value Decomposition and QR Decomposition: vertex placement does not require matrix diagonalization and instead uses simple trilinear interpolation. To evaluate the efficiency of the techniques presented in this work, we compare them with state-of-the-art GPU versions of Marching Cubes. We also include a detailed analysis of the GPU architecture for isosurface extraction, using industry performance-evaluation tools. This analysis identifies the bottlenecks of graphics cards in isosurface extraction and helps evaluate possible solutions for next-generation GPUs.
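Table-driven extraction starts from a per-cell configuration index: each of the 8 cell corners contributes one bit depending on which side of the isovalue it lies, and the resulting 8-bit value selects an entry in a precomputed table. The sketch computes that index and a trilinear interpolation inside a unit cell, the operation the abstract cites for vertex placement. The corner ordering is an assumption (bit k corresponds to corner (k & 1, (k >> 1) & 1, (k >> 2) & 1)) and must match whatever table is used; the thesis's actual DC tables and placement rule are not reproduced here.

```python
def cube_index(corner_values, iso):
    """8-bit cell configuration: bit k is set when corner k lies below the isovalue.

    Index 0 or 255 means the isosurface does not cross the cell.
    """
    index = 0
    for k, value in enumerate(corner_values):
        if value < iso:
            index |= 1 << k
    return index

def trilinear(corner_values, x, y, z):
    """Trilinear interpolation at (x, y, z) in [0, 1]^3 inside one cell.

    corner_values[k] is the sample at corner (k & 1, (k >> 1) & 1, (k >> 2) & 1).
    """
    c = corner_values
    # interpolate along x on the four cell edges parallel to x
    c00 = c[0] * (1 - x) + c[1] * x  # y = 0, z = 0
    c10 = c[2] * (1 - x) + c[3] * x  # y = 1, z = 0
    c01 = c[4] * (1 - x) + c[5] * x  # y = 0, z = 1
    c11 = c[6] * (1 - x) + c[7] * x  # y = 1, z = 1
    # then along y, then along z
    c0 = c00 * (1 - y) + c10 * y
    c1 = c01 * (1 - y) + c11 * y
    return c0 * (1 - z) + c1 * z
```

Because every cell computes its index independently, this step maps naturally onto one GPU thread per cell before the table lookup and stream expansion.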
|
10 |
A VLSI Architecture for Rijndael, the Advanced Encryption Standard. Kosaraju, Naga M 13 November 2003
The increasing use of cryptographic algorithms to secure communications across virtual networks has led to an ever-growing demand for high-performance hardware implementations of encryption/decryption methods. The inclusion of cryptographic algorithms in network communications has led to the development of several encryption standards, one of the most prominent of which is Rijndael, the Advanced Encryption Standard. Rijndael was chosen as the Advanced Encryption Standard (AES) by the National Institute of Standards and Technology (NIST) in October 2000, as a replacement for the Data Encryption Standard (DES). This thesis presents an architecture for the VLSI implementation of the Rijndael algorithm.
Rijndael is an iterated, symmetric block cipher with variable key length and block length. The AES standard [4] fixes the block length at 128 bits, and the key length can be 128, 192, or 256 bits. The VLSI implementation presented in this thesis is based on feedback logic and supports a 128-bit key length. The architecture is implemented in the Electronic Codebook (ECB) mode of operation and is further optimized for area through resource sharing between the encryption and decryption modules. It includes a Key-Scheduler module for forward-key and reverse-key scheduling during encryption and decryption, respectively. The subkeys required for each round of the Rijndael algorithm are generated in real time by the Key-Scheduler module by expanding the initial secret key.
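The Key-Scheduler's job can be illustrated in software with the AES-128 key expansion from FIPS-197, which turns the 16-byte key into 11 round keys (44 words) using RotWord, SubWord, and round constants. This sketch expands all round keys up front, whereas the hardware described above generates them round by round (and in reverse order for decryption); the S-box is computed from its GF(2^8) definition instead of being stored as a table. It is for illustration only and is not constant-time; the function names are hypothetical.

```python
def xtime(a):
    """Multiply a byte by x in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a

def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def rotl8(v, n):
    """Rotate an 8-bit value left by n bits."""
    return ((v << n) | (v >> (8 - n))) & 0xFF

def sbox_entry(a):
    """AES S-box: multiplicative inverse a^254 (0 maps to 0), then the affine map."""
    inv, base, e = 1, a, 254
    while e:  # square-and-multiply exponentiation
        if e & 1:
            inv = gf_mul(inv, base)
        base = gf_mul(base, base)
        e >>= 1
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

SBOX = [sbox_entry(a) for a in range(256)]

def expand_key_128(key):
    """Return the 11 AES-128 round keys (16 bytes each) derived from a 16-byte key."""
    if len(key) != 16:
        raise ValueError("AES-128 needs a 16-byte key")
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]      # RotWord
            temp = [SBOX[b] for b in temp]  # SubWord
            temp[0] ^= rcon                 # add round constant
            rcon = xtime(rcon)              # 01, 02, 04, ..., 80, 1b, 36
        words.append([x ^ y for x, y in zip(words[i - 4], temp)])
    return [bytes(b for w in words[4 * r:4 * r + 4] for b in w) for r in range(11)]
```

The tests check the S-box and the first and last round keys against the key-expansion example in FIPS-197 (key 2b7e1516 28aed2a6 abf71588 09cf4f3c).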
The proposed architecture is designed using the custom-design layout methodology with the Cadence Virtuoso tools and tested using the Avanti Hspice and Nanosim CAD tools. An iterative implementation of the algorithm achieved a throughput of 232 Mbits/sec in 0.35 µm CMOS technology, and a pipelined implementation in the same 0.35 µm CMOS technology achieved a throughput of 1.83 Gbits/sec. The performance of these implementations is compared with similar architectures reported in the literature.
|