101 |
Arquitetura de NoC programável baseada em múltiplos clusters de cores para suporte a padrões de comunicação coletiva / Programmable multi-cluster noc architecture to support collective communication patternsFreitas, Henrique Cota de January 2009 (has links)
As próximas gerações de processadores many-core exigem que novas abordagens no projeto de arquitetura de processadores sejam propostas. Neste novo contexto, as redes de comunicação intra-chip são importantes para garantir o desempenho dos programas. Soluções tradicionais de interconexão possuem limites físicos que comprometem a escalabilidade e o desempenho no processamento de aplicações paralelas de diversos tipos. A alternativa apontada pelo estado da arte é a Network-on-Chip (NoC) composta por roteadores e outros elementos de rede capazes de prover comunicação escalável e de alto desempenho. No entanto, as cargas de trabalho geram padrões de comunicação diferentes que podem influenciar no desempenho da rede. Existem pesquisas que abordam metodologias de projeto dedicado de NoCs em função de domínios de aplicações específicos. Apesar de uma NoC dedicada possuir um alto desempenho, cargas de trabalho paralelas geram padrões de comunicação coletiva que mudam dinamicamente. Com o objetivo de aumentar a flexibilidade de redes-em-chip, trabalhos correlatos utilizam conceitos de computação reconfigurável para aumentar a capacidade da arquitetura da NoC se adaptar em função de padrões de comunicação. Alguns trabalhos focam na programação de FPGAs e outros em ASICs polimórficos. O objetivo desta tese é propor uma arquitetura de Network-on-Chip que suporte múltiplos clusters de núcleos de processamento através de roteadores programáveis e de topologias reconfiguráveis. Cada roteador é composto por uma chave crossbar reconfigurável capaz de implementar topologias dinamicamente através do uso de um segundo nível de reconfiguração. Os roteadores possuem processadores de rede que aumentam a flexibilidade e a capacidade da NoC se adaptar ao padrão de comunicação através de programas que monitoram e gerenciam a rede. Portanto, a contribuição da tese é a Arquitetura de NoC Programável Baseada em Múltiplos Clusters de Cores. Os resultados baseados em modelos analíticos e de simulação, e cargas de trabalho artificiais e naturais, mostram que a arquitetura da NoC possui um alto desempenho e vazão de pacotes, proporcionados pela adaptação de topologias e redução da influência da rede na comunicação. A ocupação em FPGA mostra que os roteadores programáveis possuem tamanho similares a NoCs com arquiteturas tradicionais para gerenciamento de mesma quantidade de núcleos. A menor utilização de buffers de entrada resulta em uma melhor eficiência no consumo de potência e energia. Portanto, através dos modelos de projeto e avaliação foi possível verificar através dos resultados que a arquitetura da MCNoC é uma alternativa para suportar padrões de comunicações coletivas. / For the next generation of many-core processors, new design methodologies must be proposed. In this context, on-chip interconnections are important to assure the program performance. Traditional approaches of interconnections have physical constraints that reduce the scalability and performance to process parallel applications. The state-of-theart points out to the Network-on-Chip (NoC), which consists of routers and other network devices capable of increasing the communication scalability and performance. However, workloads produce different types of communication patterns, which can influence the network performance. There are research works that explore applicationspecific NoC design to response the demand on specific workloads. Although a dedicated NoC has a high performance, parallel workloads have different collective communication patterns. In order to increase the flexibility of NoCs, related works use concepts of reconfigurable computing to add architecture adaptability to support dynamic communication patterns. Some works focus on FPGA-based reconfiguration and others on polymorphic ASICs. The goal of this thesis is to propose an alternative Programmable Multi-Cluster NoC architecture. Each router consists of a reconfigurable crossbar switch capable of implementing dynamic topologies through a second reconfiguration level. The routers have network processors that increase the flexibility and the NoC adaptability through management programs in order to support different workloads. Therefore, the contribution of this thesis is the following: A Programmable Multi-Cluster NoC (MCNoC) architecture. Based on analytical and simulation models, and artificial and natural workloads, results show the high performance and throughput for the proposed NoC architecture, due to the adaptable topologies and low network latency impact. Results based on FPGA shows a similar component utilization considering the proposed programmable NoC relative to conventional NoC architectures for the same number of processing cores. The low utilization of input buffers improves the efficiency of power and energy consumption. Therefore, through design and evaluation models, the NoC proposal was verified and the results point out the MCNoC as an alternative architecture to support collective communication patterns.
|
102 |
Proposta e desenvolvimento de um algoritmo de associatividade reconfigurável em memórias cache. / Proposal and development of a reconfigurable associativity algorithm in cache memories.Roberto Borges Kerr Junior 25 June 2008 (has links)
A evolução constante dos processadores está aumentando cada vez o overhead dos acessos à memória. Tentando evitar este problema, os desenvolvedores de processadores utilizam diversas técnicas, entre elas, o emprego de memórias cache na hierarquia de memórias dos computadores. As memórias cache, por outro lado, não conseguem suprir totalmente as suas necessidades, sendo interessante alguma técnica que tornasse possível aproveitar melhor a memória cache. Para resolver este problema, autores propõem a utilização de técnicas de computação reconfigurável. Este trabalho analisa um trabalho na área de reconfiguração na associatividade de memórias cache, e propõe melhorias nele para uma melhor utilização de seus recursos, apresentando resultados práticos de simulações realizadas com diversas organizações de cache. / With the constant evolution of processors architecture, its getting even bigger the overhead generated with memory access. Trying to avoid this problem, some processors developers are using several techniques to improve the performance, as the use of cache memories. By the otherside, cache memories cannot supply all their needs, thats why its important some new technique that could use better the cache memory. Working on this problem, some authors are using reconfigurable computing to improve the cache memorys performance. This work analyses the reconfiguration of the cache memory associativity algorithm, and propose some improvements on this algorithm to better use its resources, showing some practical results from simulations with several cache organizations.
|
103 |
Projeto de um processador open source em Bluespec baseado no processador soft-core Nios II da Altera / Design of an open source processor in Bluespec based on Altera Nios II soft-core processorErinaldo da Silva Pereira 09 June 2014 (has links)
Este trabalho apresenta o desenvolvimento de um processador open source baseado no processador Nios II da Altera. O processador desenvolvido permite a customização de instruções, a inclusão de componentes que possibilitem um estudo detalhado da memória cache, tal como um monitor de cache, definir o tamanho da cache, dentre outras características. Além disso, o processador é baseado na arquitetura do Nios II e implementa 90% do ISA do Nios II, o mesmo está integrado aos ambientes Qsys e SOPC Builder da ferramenta Quartus II da Altera, sendo possível utilizar todo o conjunto de IP (Propriedade Intelectual) e ferramentas disponíveis pela Altera. Assim, este trabalho tem como propósito colaborar com o desenvolvimento de arquiteturas de hardware com uma unidade de processamento configurável e customizável facilmente pelo usuário, uma vez que o seu código fonte em Bluespec SystemVerilog está aberto a todos os usuários, diferente do Nios II da Altera, que tem o código encriptado, inviabilizando fornecer qualquer mudança no processador a nível RTL (Register Transfer Level ). Para o desenvolvimento do processador foi utilizada a Linguagem de Descrição de Hardware Bluespec SystemVerilog, pelo fato de ser uma ESL (Electronic System Level ) que acelera o processo de desenvolvimento de hardware / This work presents the development of an open source based Nios II processor from Altera. The developed processor allows custom instructions, use of components that allows a detailed study of the cache memory, among other features. In addition, the processor is based on the Nios II architecture, which can be integrated into the Qsys and SOPC Builder of the Altera Quartus II environment tool as well as use the entire set of IP (Intellectual Property) and tools available from Altera. This work contributes to the development of hardware architectures with a processing unit configurable and easily customizable by the user, since its source code in Bluespec SystemVerilog is open to all users, other than the Nios II from Altera which has encrypted code, making it impossible to do any changes in the processor at RTL (Register Transfer level) level. For the development of the processor hardware the description language Bluespec SystemVerilog was used, which is an ESL (Electronic System Level) that speeds up the development of the hardware
|
104 |
Otimização de código fonte C para o processador embarcado Nios II / Optimizing C source-code for the Nios II embedded processorRafael de Vasconcellos Peron 20 December 2007 (has links)
Este projeto apresenta uma metodologia aplicada à análise da viabilidade de se otimizar código fonte C para o processador embarcado Nios II. Esta metodologia utiliza ferramentas de análise de código que traçam o perfil da aplicação, identificando suas partes críticas em relação ao tempo de execução, as quais são o gprof e o performance counter. Para otimizar o código para o processador Nios II, são utilizadas tanto instruções customizadas quanto uma ferramenta automática de aceleração de código, o compilador C2H. Como casos de estudo, foram escolhidos três algoritmos devido à sua importância no campo da robótica móvel, sendo eles o gaxpy, o EKF e o SIFT. A partir da aplicação da metodologia para se otimizar cada um dos casos, foi comparada a eficiência tanto das ferramentas de análise de código, quanto das ferramentas de otimização, bem como a validade da metodologia proposta / This project presents a methodology applied to analyze the viability of C source code optimization for the Nios II embedded processor. This methodology utilizes the gprof and performance counter source code analysis tools to profile the source code of an application, and identify its critical time consuming parts. The optimization of C source code for the Nios II processor was performed using custom instructions and an automatic source code acceleration tool, the C2H compiler. Three algorithms were chosen as study cases, based on their importance to mobile robotics. Those were the gaxpy, EKF and SIFT algorithms. After applying the presented methodology to optimize each study case, efficiency comparisons were made between the source code analysis tools, as well between the optimization tools, in order to validate the presented methodology
|
105 |
ChipCflow - em hardware dinamicamente reconfigurável / ChipCflow - in dynamically reconfigurable hardwareVitor Fiorotto Astolfi 04 December 2009 (has links)
Nos últimos anos, houve um grande avanço na computação reconfigurável, em particular em hardware que emprega Field-Programmable Gate Arrays. Porém, esse aumento de capacidade e desempenho aumentou a distância entre a capacidade de projeto e a disponibilidade de tecnologia para o desenvolvimento do projeto. As linguagens de programação imperativas de alto nível, como C, são mais apropriadas para o desenvolvimento de aplicativos complexos que as linguagens de descrição de hardware. Por isso, surgiram diversas ferramentas para o desenvolvimento de hardware a partir de código em C. A ferramenta ChipCflow, da qual faz parte este projeto, é uma delas. A execução dos programas por meio dessa ferramenta será completamente baseada em seu fluxo de dados, seguindo o modelo dinâmico encontrado nas arquiteturas de computadores a fluxo de dados, aproveitando ao máximo o paralelismo considerado natural desse modelo e as características do hardware parcialmente reconfigurável. Neste projeto em particular, o objetivo é a prova de conceito (proof of concept) para a criação de instâncias, em forma de operadores, de um algoritmo ChipCflow em hardware parcialmente reconfigurável, tendo como base a plataforma Virtex da Xilinx / In recent years, reconfigurable computing has become increasingly more advanced, especially in hardware that uses Field-Programmable Gate Arrays. However, the increase of performance in FPGAs accumulated the gap between design capacity and technology for the development of the design. Imperative high-level programming languages such as C are more appropriate for the development of complex algorithms than hardware description languages (HDL). For this reason, many ANSI C-like programming tools for the development of hardware came to existence. The ChipCflow project, of which this project is part, is one of these tools. The execution of algorithms through this tool will be completely directed by data flow, according to the dynamic model found on Dataflow Architectures, taking advantage of its natural high levels of parallelism and the characteristics of the partially reconfigurable hardware. In this project, the objective is a proof of concept for the creation of instances, in the form of operators, of a ChipCflow algorithm on a partially reconfigurable hardware, taking as reference the Xilinx Virtex boards
|
106 |
A Self-Configurable Architecture on an Irregular Reconfigurable FabricAmarnath, Avinash 01 January 2011 (has links)
Reconfigurable computing architectures combine the flexibility of software with the performance of custom hardware. Such architectures are of particular interest at the nanoscale level. We argue that a bottom-up self-assembled fabric of nodes will be easier and cheaper to manufacture, however, one has to make compromises with regards to the device regularity, homogeneity, and reliability. The goal of this thesis is to evaluate the performance and cost of a self-configurable computing architecture composed of simple reconfigurable nodes for unstructured and unknown fabrics. We built a software and hardware framework for this purpose. The framework enables creating an irregular network of compute nodes where each node can be configured as a simple 2-input, 4-bit logic gate. The compute nodes are organized hierarchically by sending a packet through a top anchor node that recruits compute nodes with a chemically-inspired algorithm. The nodes are then self-configured by means of a gate-level netlist describing any digital logic circuit. A topology-agnostic optimization algorithm inspired by simulated annealing is then initiated to self-optimize the circuit for latency. Latency comparisons between non-optimized, brute-force optimized and our optimization algorithm are made. We further implement the architecture in VHDL and evaluate hardware cost, area, and energy consumption. The simple on-chip topology-agnostic optimization algorithm we propose results in a significant (up to 50\%) performance improvement compared to the non-optimized circuits. Our findings are of particular interest for emerging nano and molecular-scale circuits.
|
107 |
Reconfigurable Technologies for Next Generation Internet and Cluster ComputingUnnikrishnan, Deepak C. 01 September 2013 (has links)
Modern web applications are marked by distinct networking and computing characteristics. As applications evolve, they continue to operate over a large monolithic framework of networking and computing equipment built from general-purpose microprocessors and Application Specific Integrated Circuits (ASICs) that offers few architectural choices. This dissertation presents techniques to diversify the next-generation Internet infrastructure by integrating Field-programmable Gate Arrays (FPGAs), a class of reconfigurable integrated circuits, with general-purpose microprocessor-based techniques. Specifically, our solutions are demonstrated in the context of two applications - network virtualization and distributed cluster computing.
Network virtualization enables the physical network infrastructure to be shared among several logical networks to run diverse protocols and differentiated services. The design of a good network virtualization platform is challenging because the physical networking substrate must scale to support several isolated virtual networks with high packet forwarding rates and offer sufficient flexibility to customize networking features. The first major contribution of this dissertation is a novel high performance heterogeneous network virtualization system that integrates FPGAs and general-purpose CPUs. Salient features of this architecture include the ability to scale the number of virtual networks in an FPGA using existing software-based network virtualization techniques, the ability to map virtual networks to a combination of hardware and software resources on demand, and the ability to use off-chip memory resources to scale virtual router features. Partial-reconfiguration has been exploited to dynamically customize virtual networking parameters. An open software framework to describe virtual networking features using a hardware-agnostic language has been developed. Evaluation of our system using a NetFPGA card demonstrates one to two orders of improved throughput over state-of-the-art network virtualization techniques.
The demand for greater computing capacity grows as web applications scale. In state-of-the-art systems, an application is scaled by parallelizing the computation on a pool of commodity hardware machines using distributed computing frameworks.
Although this technique is useful, it is inefficient because the sequential nature of execution in general-purpose processors does not suit all workloads equally well. Iterative algorithms form a pervasive class of web and data mining algorithms that are poorly executed on general purpose processors due to the presence of strict synchronization barriers in distributed cluster frameworks. This dissertation presents Maestro, a heterogeneous distributed computing framework that demonstrates how FPGAs can break down such synchronization barriers using asynchronous accumulative updates. These updates allow for the accumulation of intermediate results for numerous data points without the need for iteration-based barriers. The benefits of a heterogeneous cluster are illustrated by executing a general-class of iterative algorithms on a cluster of commodity CPUs and FPGAs. Computation is dynamically prioritized to accelerate algorithm convergence. We implement a general-class of three iterative algorithms on a cluster of four FPGAs. A speedup of 7× is achieved over an implementation of asynchronous accumulative updates on a general-purpose CPU. The system offers 154× speedup versus a standard Hadoop-based CPU-workstation cluster. Improved performance is achieved by clusters of FPGAs.
|
108 |
Reconfigurable Computing For Video CodingHuang, Jian 01 January 2010 (has links)
Video coding is widely used in our daily life. Due to its high computational complexity, hardware implementation is usually preferred. In this research, we investigate both ASIC hardware design approach and reconfigurable hardware design approach for video coding applications. First, we present a unified architecture that can perform Discrete Cosine Transform (DCT), Inverse Discrete Cosine Transform (IDCT), DCT domain motion estimation and compensation (DCT-ME/MC). Our proposed architecture is a Wavefront Array-based Processor with a highly modular structure consisting of 8*8 Processing Elements (PEs). By utilizing statistical properties and arithmetic operations, it can be used as a high performance hardware accelerator for video transcoding applications. We show how different core algorithms can be mapped onto the same hardware fabric and can be executed through the pre-defined PEs. In addition to the simplified design process of the proposed architecture and savings of the hardware resources, we also demonstrate that high throughput rate can be achieved for IDCT and DCT-MC by fully utilizing the sparseness property of DCT coefficient matrix. Compared to fixed hardware architecture using ASIC design approach, reconfigurable hardware design approach has higher flexibility, lower cost, and faster time-to-market. We propose a self-reconfigurable platform which can reconfigure the architecture of DCT computations during run-time using dynamic partial reconfiguration. The scalable architecture for DCT computations can compute different number of DCT coefficients in the zig-zag scan order to adapt to different requirements, such as power consumption, hardware resource, and performance. We propose a configuration manager which is implemented in the embedded processor in order to adaptively control the reconfiguration of scalable DCT architecture during run-time. In addition, we use LZSS algorithm for compression of the partial bitstreams and on-chip BlockRAM as a cache to reduce latency overhead for loading the partial bitstreams from the off-chip memory for run-time reconfiguration. A hardware module is designed for parallel reconfiguration of the partial bitstreams. The experimental results show that our approach can reduce the external memory accesses by 69% and can achieve 400 MBytes/s reconfiguration rate. Detailed trade-offs of power, throughput, and quality are investigated, and used as a criterion for self-reconfiguration. Prediction algorithm of zero quantized DCT (ZQDCT) to control the run-time reconfiguration of the proposed scalable architecture has been used, and 12 different modes of DCT computations including zonal coding, multi-block processing, and parallel-sequential stage modes are supported to reduce power consumptions, required hardware resources, and computation time with a small quality degradation. Detailed trade-offs of power, throughput, and quality are investigated, and used as a criterion for self-reconfiguration to meet the requirements set by the users.
|
109 |
A Design Methodology for Creating Programmable Logic-based Real-time Image Processing HardwareDrayer, Thomas Hudson 24 January 1997 (has links)
A new design methodology that produces hardware solutions for performing real-time image processing is presented here. This design methodology provides significant advantages over traditional hardware design approaches by translating real-time image processing tasks into the gate-level resources of programmable logic-based hardware architectures.
The use of programmable logic allows high-performance solutions to be realized with very efficient utilization of available logic and interconnection resources. These implementations provide comparable performance at a lower cost than other available programmable logic-based hardware architectures. This new design methodology is based on two components: a programmable logic-based destination hardware architecture and a suite of development system software. The destination hardware architecture is a Custom Computing Machine (CCM) that contains multiple Field Programmable Gate Array (FPGA) chips. FPGA chips provide gate-level programmability for the hardware architecture.
Sophisticated software development tools, called the TRAVERSE development system software, are created to overcome the significant amount of time and expertise required to manually utilize this gate-level programmability. The new hardware architecture and development system software combine to establish a unique design methodology.
There are several distinct contributions provided by this dissertation. The new flexible MORRPH hardware architecture provides a more efficient solution for creating real-time image processing computing machines than current commercial hardware architectures. The TRAVERSE development system software is the first integrated development system specifically for creating real-time image processing designs with multiple FPGA-based CCMs. New standards and design conventions are defined specifically for creating solutions to low-level image processing tasks, using the MORRPH architecture for verification. The circuit partitioning and global routing programs of the TRAVERSE development system software enable automated translation of image processing designs into the resources of multiple FPGA chips in the hardware architecture. In a broad sense, the individual contributions of this dissertation combine to create a new design methodology that will change the way hardware solutions are created for real-time image processing in the future. / Ph. D.
|
110 |
Towards the development of a reliable reconfigurable real-time operating system on FPGAsHong, Chuan January 2013 (has links)
In the last two decades, Field Programmable Gate Arrays (FPGAs) have been rapidly developed from simple “glue-logic” to a powerful platform capable of implementing a System on Chip (SoC). Modern FPGAs achieve not only the high performance compared with General Purpose Processors (GPPs), thanks to hardware parallelism and dedication, but also better programming flexibility, in comparison to Application Specific Integrated Circuits (ASICs). Moreover, the hardware programming flexibility of FPGAs is further harnessed for both performance and manipulability, which makes Dynamic Partial Reconfiguration (DPR) possible. DPR allows a part or parts of a circuit to be reconfigured at run-time, without interrupting the rest of the chip’s operation. As a result, hardware resources can be more efficiently exploited since the chip resources can be reused by swapping in or out hardware tasks to or from the chip in a time-multiplexed fashion. In addition, DPR improves fault tolerance against transient errors and permanent damage, such as Single Event Upsets (SEUs) can be mitigated by reconfiguring the FPGA to avoid error accumulation. Furthermore, power and heat can be reduced by removing finished or idle tasks from the chip. For all these reasons above, DPR has significantly promoted Reconfigurable Computing (RC) and has become a very hot topic. However, since hardware integration is increasing at an exponential rate, and applications are becoming more complex with the growth of user demands, highlevel application design and low-level hardware implementation are increasingly separated and layered. As a consequence, users can obtain little advantage from DPR without the support of system-level middleware. To bridge the gap between the high-level application and the low-level hardware implementation, this thesis presents the important contributions towards a Reliable, Reconfigurable and Real-Time Operating System (R3TOS), which facilitates the user exploitation of DPR from the application level, by managing the complex hardware in the background. In R3TOS, hardware tasks behave just like software tasks, which can be created, scheduled, and mapped to different computing resources on the fly. The novel contributions of this work are: 1) a novel implementation of an efficient task scheduler and allocator; 2) implementation of a novel real-time scheduling algorithm (FAEDF) and two efficacious allocating algorithms (EAC and EVC), which schedule tasks in real-time and circumvent emerging faults while maintaining more compact empty areas. 3) Design and implementation of a faulttolerant microprocessor by harnessing the existing FPGA resources, such as Error Correction Code (ECC) and configuration primitives. 4) A novel symmetric multiprocessing (SMP)-based architectures that supports shared memory programing interface. 5) Two demonstrations of the integrated system, including a) the K-Nearest Neighbour classifier, which is a non-parametric classification algorithm widely used in various fields of data mining; and b) pairwise sequence alignment, namely the Smith Waterman algorithm, used for identifying similarities between two biological sequences. R3TOS gives considerably higher flexibility to support scalable multi-user, multitasking applications, whereby resources can be dynamically managed in respect of user requirements and hardware availability. Benefiting from this, not only the hardware resources can be more efficiently used, but also the system performance can be significantly increased. Results show that the scheduling and allocating efficiencies have been improved up to 2x, and the overall system performance is further improved by ~2.5x. Future work includes the development of Network on Chip (NoC), which is expected to further increase the communication throughput; as well as the standardization and automation of our system design, which will be carried out in line with the enablement of other high-level synthesis tools, to allow application developers to benefit from the system in a more efficient manner.
|
Page generated in 0.0834 seconds