71 |
Performance and Energy Efficient Building Blocks for Network-on-Chip ArchitecturesVangal, Sriram R. January 2006 (has links)
<p>The ever shrinking size of the MOS transistors brings the promise of scalable Network-on-Chip (NoC) architectures containing hundreds of processing elements with on-chip communication, all integrated into a single die. Such a computational fabric will provide high levels of performance in an energy efficient manner. To mitigate emerging wire-delay problem and to address the need for substantial interconnect bandwidth, packet switched routers are fast replacing shared buses and dedicated wires as the interconnect fabric of choice. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as 3D graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. Therefore, this work focuses on two key building blocks critical to the success of NoC design: high performance, area and energy efficient router and floating-point processor architectures.</p><p>This thesis first presents a six-port four-lane 57 GB/s non-blocking router core based on wormhole switching. The router features double-pumped crossbar channels and destinationaware channel drivers that dynamically configure based on the current packet destination. This enables 45% reduction in crossbar channel area, 23% overall router area, up to 3.8X reduction in peak channel power, and 7.2% improvement in average channel power, with no performance penalty over a published design. In a 150nm six-metal CMOS process, the 12.2mm2 router contains 1.9 million transistors and operates at 1GHz at 1.2V. We next present a new pipelined single-precision floating-point multiply accumulator core (FPMAC) featuring a single-cycle accumulate loop using base 32 and internal carry-save arithmetic, with delayed addition techniques. Combined algorithmic, logic and circuit techniques enable multiply-accumulates at speeds exceeding 3GHz, with single-cycle throughput. Unlike existing FPMAC architectures, the design eliminates scheduling restrictions between consecutive FPMAC instructions. The optimizations allow removal of the costly normalization step from the critical accumulate loop and conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In addition, an improved leading zero anticipator (LZA) and overflow detection logic applicable to carry-save format is presented. In a 90nm seven-metal dual-VT CMOS process, the 2mm2 custom design contains 230K transistors. The fully functional first silicon achieves 6.2 GFLOPS of performance while dissipating 1.2W at 3.1GHz, 1.3V supply.</p><p>It is clear that realization of successful NoC designs require well balanced decisions at all levels: architecture, logic, circuit and physical design. Our results from key building blocks demonstrate the feasibility of pushing the performance limits of compute cores and communication routers, while keeping active and leakage power, and area under control.</p> / Report code: LiU-TEK-LIC-2006:36.
|
72 |
Performance and Energy Efficient Network-on-Chip ArchitecturesVangal, Sriram January 2007 (has links)
The scaling of MOS transistors into the nanometer regime opens the possibility for creating large Network-on-Chip (NoC) architectures containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks are emerging as a scalable and modular solution to global communications within large systems-on-chip. NoCs mitigate the emerging wire-delay problem and addresses the need for substantial interconnect bandwidth by replacing today’s shared buses with packet-switched router networks. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as three-dimensional (3D) graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. This work demonstrates that a computational fabric built using optimized building blocks can provide high levels of performance in an energy efficient manner. The thesis details an integrated 80- Tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0TFLOPS of performance while dissipating less than 100W. This thesis first presents a six-port four-lane 57 GB/s non-blocking router core based on wormhole switching. The router features double-pumped crossbar channels and destinationaware channel drivers that dynamically configure based on the current packet destination. This enables 45% reduction in crossbar channel area, 23% overall router area, up to 3.8X reduction in peak channel power, and 7.2% improvement in average channel power. In a 150-nm sixmetal CMOS process, the 12.2 mm2 router contains 1.9-million transistors and operates at 1 GHz at 1.2 V supply. We next describe a new pipelined single-precision floating-point multiply accumulator core (FPMAC) featuring a single-cycle accumulation loop using base 32 and internal carry-save arithmetic, with delayed addition techniques. A combination of algorithmic, logic and circuit techniques enable multiply-accumulate operations at speeds exceeding 3GHz, with singlecycle throughput. This approach reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2FLOPS) every cycle. The optimizations allow removal of the costly normalization step from the critical accumulation loop and conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In a 90-nm seven-metal dual-VT CMOS process, the 2 mm2 custom design contains 230-K transistors. Silicon achieves 6.2-GFLOPS of performance while dissipating 1.2 W at 3.1 GHz, 1.3 V supply. We finally present the industry's first single-chip programmable teraFLOPS processor. The NoC architecture contains 80 tiles arranged as an 8×10 2D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined singleprecision FPMAC units which feature a single-cycle accumulation loop for high throughput. The five-port router combines 100 GB/s of raw bandwidth with low fall-through latency under 1ns. The on-chip 2D mesh network provides a bisection bandwidth of 2 Tera-bits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm2 custom design contains 100-M transistors. The fully functional first silicon achieves over 1.0TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07-V supply. It is clear that realization of successful NoC designs require well balanced decisions at all levels: architecture, logic, circuit and physical design. Our results demonstrate that the NoC architecture successfully delivers on its promise of greater integration, high performance, good scalability and high energy efficiency.
|
73 |
Architecture Support and Scalability Analysis of Memory Consistency Models in Network-on-Chip based SystemsNaeem, Abdul January 2013 (has links)
The shared memory systems should support parallelization at the computation (multi-core), communication (Network-on-Chip, NoC) and memory architecture levels to exploit the potential performance benefits. These parallel systems supporting shared memory abstraction both in the general purpose and application specific domains are confronting the critical issue of memory consistency. The memory consistency issue arises due to the unconstrained memory operations which leads to the unexpected behavior of shared memory systems. The memory consistency models enforce ordering constraints on the memory operations for the expected behavior of the shared memory systems. The intuitive Sequential Consistency (SC) model enforces strict ordering constraints on the memory operations and does not take advantage of the system optimizations both in the hardware and software. Alternatively, the relaxed memory consistency models relax the ordering constraints on the memory operations and exploit these optimizations to enhance the system performance at the reasonable cost. The purpose of this thesis is twofold. First, the novel architecture supports are provided for the different memory consistency models like: SC, Total Store Ordering (TSO), Partial Store Ordering (PSO), Weak Consistency (WC), Release Consistency (RC) and Protected Release Consistency (PRC) in the NoC-based multi-core (McNoC) systems. The PRC model is proposed as an extension of the RC model which provides additional reordering and relaxation in the memory operations. Second, the scalability analysis of these memory consistency models is performed in the McNoC systems. The architecture supports for these different memory consistency models are provided in the McNoC platforms. Each configurable McNoC platform uses a packet-switched 2-D mesh NoC with deflection routing policy, distributed shared memory (DSM), distributed locks and customized processor interface. The memory consistency models/protocols are implemented in the customized processor interfaces which are developed to integrate the processors with the rest of the system. The realization schemes for the memory consistency models are based on a transaction counter and an an an address ddress ddress ddress ddress ddress ddress stack tacktack-based based based based based based novel approaches.approaches.approaches.approaches. approaches.approaches.approaches.approaches.approaches.approaches. The transaction counter is used in each node of the network to keep track of the outstanding memory operations issued by a processor in the system. The address stack is used in each node of the network to keep track of the addresses of the outstanding memory operations issued by a processor in the system. These hardware structures are used in the processor interface to enforce the required global orders under these different memory consistency models. The realization scheme of the PRC model in addition also uses acquire counter for further classification of the data operations as unprotected and protected operations. The scalability analysis of these different memory consistency models is performed on the basis of different workloads which are developed and mapped on the various sized networks. The scalability study is conducted in the McNoC systems with 1 to 64-cores with various applications using different problem sizes and traffic patterns. The performance metrics like execution time, performance, speedup, overhead and efficiency are evaluated as a function of the network size. The experiments are conducted both with the synthetic and application workloads. The experimental results under different application workloads show that the average execution time under the relaxed memory consistency models decreases relative to the SC model. The specific numbers are highly sensitive to the application and depend on how well it matches to the architectures. This study shows the performance improvement under the relaxed memory consistency models over the SC model that is dependent on the computation-to-communication ratio, traffic patterns, data-to-synchronization ratio and the problem size. The performance improvement of the PRC and RC models over the SC model tends to be higher than 50% as observed in the experiments, when the system is further scaled up. / <p>QC 20130204</p>
|
74 |
Network-on-chip architectures for scalability and service guaranteesGrot, Boris 13 July 2012 (has links)
Rapidly increasing transistor densities have led to the emergence of richly-integrated substrates in the form of chip multiprocessors and systems-on-a-chip. These devices integrate a variety of discrete resources, such as processing cores and cache memories, on a single die with the degree of integration growing in accordance with Moore's law. In this dissertation, we address challenges of scalability and quality-of-service (QOS) in network architectures of highly-integrated chips. The proposed techniques address the principal sources of inefficiency in networks-on-chip (NOCs) in the form of performance, area, and energy overheads. We also present a comprehensive network architecture capable of interconnecting over a thousand discrete resources with high efficiency and strong guarantees.
We first show that mesh networks, commonly employed in existing chips, fall significantly short of achieving their performance potential due to transient congestion effects that diminish network performance. Adaptive routing has the potential to improve performance through better load distribution. However, we find that existing approaches are myopic in that they only consider local congestion indicators and fail to take global network state into account. Our approach, called Regional Congestion Awareness (RCA), improves network visibility in adaptive routers via a light-weight mechanism for propagating and integrating congestion information. By leveraging both local and non-local congestion indicators, RCA improves network load balance and boosts throughput. Under a set of parallel workloads running on a 49-node substrate, RCA reduces on-chip network latency by 16%, on average, compared to a locally-adaptive router.
Next, we target NOC latency and energy efficiency through a novel point-to-multipoint topology. Ring and mesh networks, favored in existing on-chip interconnects, often require packets to go through a number of intermediate routers between source and destination nodes, resulting in significant latency and energy overheads. Topologies that improve connectivity, such as fat tree and flattened butterfly, eliminate much of the router overhead, but require non-minimal channel lengths or large channel count, reducing energy-efficiency and/or performance as a result. We propose a new topology, called Multidrop Express Channels (MECS), that augments minimally-routed express channels with multi-drop capability. The resulting richly-connected NOC enjoys a low hop count with favorable delay and energy characteristics, while improving wire utilization over prior proposals.
Applications such as virtualized servers-on-a-chip and real-time systems require chip-level quality-of-service (QOS) support to provide fairness, service differentiation, and guarantees. Existing network QOS approaches suffer from considerable performance and area overheads that limit their usefulness in a resource-limited on-die network. In this dissertation, we propose a new QOS scheme called Preemptive Virtual Clock (PVC). PVC uses a preemptive approach to provide hard guarantees and strong performance isolation while dramatically reducing queuing requirements that burden prior proposals.
Finally, we introduce a comprehensive network architecture that overcomes the bottlenecks of earlier designs with respect to area, energy, and QOS in future highly-integrated chips. The proposed NOC uses a topology-centric QOS approach that restricts the extent of hardware QOS support to a fraction of the network without compromising guarantees. In doing so, network area and energy efficiency are significantly improved. Further improvements are derived through a novel flow-control mechanism, along with switch- and link-level optimizations. In concert, these techniques yield a network capable of interconnecting over a thousand terminals on a die while consuming 47% less area and 26% less power than a state-of-the-art QOS-enabled NOC.
The mechanisms proposed in this dissertation are synergistic and enable efficient, high-performance interconnects for future chips integrating hundreds or thousands of on-die resources. They address deficiencies in routing, topologies, and flow control of existing architectures with respect to area, energy, and performance scalability. They also serve as a building block for cost-effective advanced services, such as QOS guarantees at the die level. / text
|
75 |
Vienlusčių tinklo architektūrų tyrimas / Research in Network on Chip architecturesČepaitis, Modestas 16 August 2007 (has links)
“International Technology Roadmap for Semiconductors” (ITRS) teigia, kad baigiantis dešimtmečiui bus pagamintas 50 – 100 nm lustas susidedantis iš maždaug 4 bilijonų tranzistorių operuojančių 10Ghz. Taigi tokių parametrų sistemas pagaminti yra nelengva, tranzistorių skaičiaus didėjimas taip pat didina lusto dydį, lusto sud��tingumą bei sunkina lusto integruojamumą. Be to vienlustės sistemos naudoja bendras magistrales bendrauti su kitais lustiniais resursais. Šios magistralės yra nedalomos, todėl ateityje toks komunikavimo metodas taps vis didesne problema gaminant lustus susidedančius iš bilijonų tranzistorių. Šiems tikslams spręsti ir buvo pasiūlyta vienlusčių tinklo paradigma. Vienlusčių tinklų paradigma siūlo galimus komunikavimo infrastruktūrų sprendimus, susiduriant su vis sudėtingesnėmis sistemomis, bei padeda trumpinanti lusto pagaminimo laiko periodą. Šiame darbe naudojama atkartojimo technologijos, kuriant bendrini komponentą vienlusčių tinklams. Metaspecifikacija, sukurta naudojant preprocesorinį atkartojimo metodą. / According to the International Technology Roadmap for Semiconductors (ITRS), before the end of this decade we will be entering the era of a billion transistors on a single chip. It is being stated that soon we will have a chip of 50- 100 nm comprising around 4 billion transistors operating at a frequency of 10 Ghz. However, it has been observed that as the system grows, so does the complexity of integrating various components on a chip. The major threat toward the achievement of a billion transistor chip is poor scalability of current interconnect structure of today’s SoC. In order to cope with growing interconnect infrastructure, the “Network on chip (NoC)” concept was introduced. NoCs present a possible communication infrastructure solution to deal with increased design complexity and shrinking time-to-market. It is clear that NoCs can potentially become the preferred interconnection approach for SoCs being developed in a near future. This paper discusses the impact of the reuse of NoC components methods, for parameterize and test these systems. There is preprocessing type reusing being used, to create a router metaspecification of router for the 2d torus interconnect network.
|
76 |
Adaptive NoC for reconfigurable SoCPratomo, Istas 08 November 2013 (has links) (PDF)
Chips will be designed with billions of transistors and heterogeneous components integrated to provide full functionality of a current application for embedded system. These applications also require highly parallel and flexible communicating architecture through a regular interconnection network. The emerging solution that can fulfill this requirement is Network-on-Chips (NoCs). Designing an ideal NoC with high throughput, low latency, minimum using resources, minimum power consumption and small area size are very time consuming. Each application required different levels of QoS such as minimum level throughput delay and jitter. In this thesis, firstly, we proposed an evaluation of the impact of design parameters on performance of NoC. We evaluate the impact of NoC design parameters on the performances of an adaptive NoCs. The objective is to evaluate how big the impact of upgrading the value on performances. The result shows the accuracy of choosing and adjusting the network parameters can avoid performance degradation. It can be considered as the control mechanism in an adaptive NoC to avoid the degradation of QoS NoC. The use of deep sub-micron technology in embedded system and its variability process cause Single Event Upsets (SEU) and ''aging'' the circuit. SEU and aging of circuit is the major problem that cause the failure on transmitting the packet in a NoC. Implementing fault-tolerant routing techniques in NoC switching instead of adding virtual channel is the best solution to avoid the fault in NoC. Communication performance of a NoC is depends heavily on the routing algorithm. An adaptive routing algorithm such as fault-tolerant has been proposed for deadlock avoidance and load balancing. This thesis proposed a novel adaptive fault-tolerant routing algorithm for 2D mesh called Gradient and for 3D mesh called Diagonal. Both algorithms consider sequences of alternative paths for packets when the main path fails. The proposed algorithm tolerates faults in worst condition traffic in NoCs. The number of hops, the number of alternative paths, latency and throughput in faulty network are determined and compared with other 2D mesh routing algorithms. Finally, we implemented Gradient routing algorithm into FPGA. All these work were validated and characterized through simulation and implemented into FPGA. The results provide the comparison performance between proposed method with existing related method using some scenarios.
|
77 |
Représentation dynamique de la liste des copies pour le passage à l'échelle des protocoles de cohérence de cache / Dynamic sharing set for scalable cache coherence protocolsDumas, Julie 13 December 2017 (has links)
Le problème du passage à l’échelle des protocoles de cohérence de cache qui se pose pour les machines parallèles se pose maintenant aussi sur puce, suite à l’émergence des architectures manycores. Il existe fondamentalement deux classes de protocoles : ceux basés sur l’espionnage et ceux utilisant un répertoire. Les protocoles basés sur l’espionnage, qui doivent transmettre à tous les caches les informations de cohérence, engendrent un nombre important de messages dont peu sont effectivement utiles. En revanche, les protocoles avec répertoires visent à n’envoyer des messages qu’aux caches qui en ont besoin. L’implémentation la plus évidente utilise un champ de bits complet dont la taille dépend uniquement du nombre de cœurs. Ce champ de bits représente la liste des copies. Pour passer à l’échelle, un protocole de cohérence doit émettre un nombre raisonnable de messages de cohérence et limiter le matériel utilisé pour la cohérence et en particulier pour la liste des copies. Afin d’évaluer et de comparer les différents protocoles et leurs représentations de la liste des copies, nous proposons tout d’abord une méthode de simulation basée sur l’injection de traces dans un modèle de cache à haut niveau. Cette méthode permet d’effectuer rapidement l’exploration architecturale des protocoles de cohérence de cache. Dans un second temps, nous proposons une nouvelle représentation dynamique de la liste des copies pour le passage à l’échelle des protocoles de cohérence de cache. Pour une architecture à 64 cœurs, 93% des lignes de cache sont partagées par au maximum 8 cœurs, sachant par ailleurs que le système d’exploitation chercher à placer les tâches communicantes proches les unes des autres. Notre représentation dynamique de la liste des copies tire parti de ces deux observations en utilisant un champ de bits pour un sous-ensemble des copies et une liste chaînée. Le champ de bits correspond à un rectangle à l’intérieur duquel la représentation de la liste des copies est exacte. La position et la forme de ce rectangle évoluent au cours de la durée de vie des applications. Plusieurs algorithmes pour le placement du rectangle cohérent sont proposés et évalués. Pour finir, nous effectuons une comparaison avec les représentations de la liste des copies de l’état de l’art. / Cache coherence protocol scalability problem for parallel architecture is also a problem for on chip architecture, following the emergence of manycores architectures. There are two protocol classes : snooping and directory-based.Protocols based on snooping, which send coherence information to all caches, generate a lot of messages whose few are useful.On the other hand, directory-based protocols send messages only to caches which need them. The most obvious implementation uses a full bit vector whose size depends only on the number of cores. This bit vector represents the sharing set. To scale, a coherence protocol must produce a reasonable number of messages and limit hardware ressources used by the coherence and in particular for the sharing set.To evaluate and compare protocols and their sharing set, we first propose a method based on trace injection in a high-level cache model. This method enables a very fast architectural exploration of cache coherence protocols.We also propose a new dynamic sharing set for cache coherence protocols, which is scalable. With 64 cores, 93% of cache blocks are shared by up to 8 cores.Futhermore, knowing that the operating system looks to place communicating tasks close to each other. Our dynamic sharing set takes advantage from these two observations by using a bit vector for a subset of copies and a linked list. The bit vector corresponds to a rectangle which stores the exact sharing set. The position and shape of this rectangle evolve over application's lifetime. Several algorithms for coherent rectangle placement are proposed and evaluated. Finally, we make a comparison with sharing sets from the state of the art.
|
78 |
Memória transacional em hardware para sistemas embarcados multiprocessados conectados por redes-em-chip / Hardware transactional memory for noc-based multi-core embedded systemsKunz, Leonardo January 2010 (has links)
A Memória Transacional (TM) surgiu nos últimos anos como uma nova solução para sincronização em sistemas multiprocessados de memória compartilhada, permitindo explorar melhor o paralelismo das aplicações ao evitar limitações inerentes ao mecanismo de locks. Neste modelo, o programador define regiões de código que devem executar de forma atômica. O sistema tenta executá-las de forma concorrente, e, em caso de conflito nos acessos à memória, toma as medidas necessárias para preservar a atomicidade e isolamento das transações, na maioria das vezes abortando e reexecutando uma das transações. Um dos modelos mais aceitos de memória transacional em hardware é o LogTM, implementado neste trabalho em um MPSoC embarcado que utiliza uma NoC para interconexão. Os experimentos fazem uma comparação desta implementação com locks, levando-se em consideração performance e energia do sistema. Além disso, este trabalho mostra que o tempo que uma transação espera para reiniciar sua execução após ter abortado (chamado de backoff delay on abort) tem impactos significativos na performance e energia. Uma análise deste impacto é feita utilizando-se de três políticas de backoff. Um mecanismo baseado em um handshake entre transações, chamado Abort handshake, é proposto como solução para o problema. Os resultados dos experimentos são dependentes da aplicação e configuração do sistema e indicam ganhos da TM na maioria dos casos em relação ao mecanismo de locks. Houve redução de até 30% no tempo de execução e de até 32% na energia de aplicações de baixa demanda de sincronização. Em um segundo momento, é feita uma análise do backoff delay on abort na performance e energia de aplicações utilizando três políticas de backoff em comparação com o mecanismo Abort handshake. Os resultados mostram que o mecanismo proposto apresenta redução de até 20% no tempo de execução e de até 53% na energia comparado à melhor política de backoff dentre as analisadas. Para aplicações com alta demanda de sincronização, a TM mostra redução no tempo de execução de até 63% e redução de energia de até 71% em comparação com o mecanismo de locks. / Transactional Memory (TM) has emerged in the last years as a new solution for synchronization on shared memory multiprocessor systems, allowing a better exploration of the parallelism of the applications by avoiding inherent limitations of the lock mechanism. In this model, the programmer defines regions of code, called transactions, to execute atomically. The system tries to execute transactions concurrently, but in case of conflict on memory accesses, it takes the appropriate measures to preserve the atomicity and isolation, usually aborting and re-executing one of the transactions. One of the most accepted hardware transactional memory model is LogTM, implemented in this work in an embedded MPSoC that uses an NoC as interconnection mechanism. The experiments compare this implementation with locks, considering performance and energy. Furthermore, this work shows that the time a transaction waits to restart after abort (called backoff delay on abort) has significant impact on performance and energy. An analysis of this impact is done using three backoff policies. A novel mechanism based on handshake of transactions, called Abort handshake, is proposed as a solution to this issue. The results of the experiments depends on application and system configuration and show TM benefits in most cases in comparison to the locks mechanism, reaching reduction on the execution time up to 30% and reduction on the energy consumption up to 32% on low contention workloads. After that, an analysis of the backoff delay on abort on the performance and energy is presented, comparing to the Abort handshake mechanism. The proposed mechanism shows reduction of up to 20% on the execution time and up to 53% on the energy, when compared to the best backoff policy. For applications with a high degree of synchronization, TM shows reduction on the execution time up to 63% and energy savings up to 71% compared to locks.
|
79 |
Estudo sobre o impacto da hierarquia de memória em MPSoCs baseados em NoCSilva, Gustavo Girão Barreto da January 2009 (has links)
Ao longo dos últimos anos, os sistemas embarcados vêm se tornando cada vez mais complexos tanto em termos de hardware quanto de software. Ultimamente têm-se adotado como solução o uso de MPSoCs (sistemas multiprocessados integrados em chip) para uma maior eficiência energética e computacional nestes sistemas. Com o uso de diversos elementos de processamento, redes-em-chip (NoC - networks-on-chip) aparecem como soluções de melhor desempenho do que barramentos. Nestes ambientes cujo desempenho depende da eficiência do modelo de comunicação, a hierarquia de memória se torna um elemento chave. Baseando-se neste cenário, este trabalho realiza uma investigação sobre o impacto da hierarquia de memória em MPSoCs baseados em NoC. Dentro deste escopo foi desenvolvida uma nova organização de memória fisicamente centralizada com diferentes espaços de endereçamentos denominada nDMA. Este trabalho também apresenta uma comparação entre a nova organização e outras três organizações bastante difundidas tais como memória distribuída, memória compartilhada e memória compartilhada distribuída. Estas duas ultimas adotam um modelo de coerência de cache baseado em diretório completamente desenvolvido em hardware. Os modelos de memória foram implementados na plataforma virtual SIMPLE (SIMPLE Multiprocessor Platform Environment). Resultados experimentais mostram uma forte dependência com relação à carga de comunicação gerada pelas aplicações. O modelo de memória distribuída apresenta melhores resultados conforme a carga de comunicação das aplicações é baixa. Por outro lado, o novo modelo de memória fisicamente compartilhado com diferentes espaços de endereçamento apresenta melhores resultados conforme a carga de comunicação das aplicações é alta. Também foram realizados experimentos objetivando analisar o desempenho dos modelos de memória em situações de alta latência de comunicação na rede. Resultados mostram melhores resultados do modelo de memória distribuída quando a carga de comunicação das aplicações é alta e, caso contrário, o modelo nDMA apresenta melhores resultados. Por fim, foram analisados os desempenhos dos modelos de memória durante o processo de migração de tarefas. Neste caso, os modelos de memória compartilhada e compartilhada distribuída apresentaram melhores resultados devido ao fato de que não se faz necessária o envio dos dados da aplicação nestes modelos e também devido ao menor tamanho de código se comparado com os outros modelos. / In the past few the years, embedded systems have become even more complex both on terms of hardware and software. Lately, the use of MPSoCs (Multi-Processor Systems-on-Chip) has been adopted on these systems for a better energetic and computational efficiency. Due to the use of several processing elements, Networks-on-Chip arise as better performance solutions than buses. Considering this scenario, this work performs an investigation on the impact of memory hierarchy in NoC-based MPSoCs. In this context, a new physically centralized and shared memory organization with different address spaces named nDMA was developed. This work also presents a comparison between the new memory organization and three different well-known memory hierarchy models such as distributed memory and shared and distributed shared memories that make use of a fully hardware cache coherence solution. The memory models were implemented in the SIMPLE (SIMPLE Multiprocessor Platform Environment) virtual platform. Experimental results shows a strong dependency on the application communication workload. The distributed memory model presents better results as the application communication workload is low. On the other hand, the new memory model (physically shared with different address spaces) presents better results as the application communication workload is high. There were also experiments aiming at observing the performance of the memory models in situations where the communication latency on the network is high. Results show better results of the distributed memory model when the application communication workload is high, and the nDMA model presents better results otherwise. Finally, the performance of the memory models during a task migration process were evaluated. In this case, the shared memory and distributed shared memory models presented better results due to the fact that in this case the data memory does not need to be transferred from one point to another and also due to the low size of the memory code in these cases if compared to other memory models.
|
80 |
Desenvolvimento e avaliação de redes-em-chip hierárquicas e reconfiguráveis para MPSoCs / Development and evaluation of hierarchical and reconfigurable networks-on-chip for MPSoCsReinbrecht, Cezar Rodolfo Wedig January 2012 (has links)
Com o advento dos processos submicrônicos, a capacidade de integração de transistores numa mesma pastilha de silício atingiu níveis que possibilitaram a construção dos sistemas com múltiplos processadores num chip (MPSoCs, do inglês MultiProcessor System-on-Chip). Essa possibilidade de integração permite inserir dezenas de Elementos de Processamento (EPs) nos circuitos integrados atuais, e já se projeta centenas de EPs para os sistemas da próxima década (ITRS, 2011). Nesse cenário, um dos principais desafios se refere ao serviço de interconexão dos EPs, que deve apresentar um desempenho de comunicação necessário para as aplicações em execução sem comprometer as limitações de consumo de área e energia do circuito. Nos primeiros sistemas multiprocessados, com poucos nodos, arquiteturas baseadas em barramento foram suficientes para cumprir esses requisitos. Porém, o número de elementos nos sistemas recentes aumentou rapidamente, tornando as redes-em-chip a solução mais apropriada, por aliar escalabilidade e reuso na mesma estrutura. Contudo, diante da previsão de que essa tendência de aumento se manterá retorna a discussão se as redes-em-chip atuais continuarão adequadas para os futuros sistemas. De fato, o custo das redes-em-chip convencionais pode se tornar proibitivo para as escalas dos circuitos em um futuro próximo. Novas propostas têm sido apresentadas na literatura científica onde se podem destacar duas principais estratégias de projeto às redes de interconexão: reconfiguração arquitetural e organização hierárquica da topologia. A reconfiguração arquitetural permite obter uma grande eficiência, independente do tipo de aplicação em execução, pois uma das alternativas é projetar o circuito para que ele se auto adapte conforme os requisitos de desempenho para cada aplicação. Por outro lado, arquiteturas organizadas em topologias hierárquicas são desenvolvidas para uma estrutura computacional definida em tempo de projeto, sendo mais eficazes para uma classe de aplicações. O presente trabalho explora a sinergia da combinação das potencialidades das duas soluções e propõe uma nova estrutura que oferece melhor desempenho para uma classe maior de aplicações apropriada para os futuros sistemas. Como resultado foi implementada uma arquitetura adaptativa chamada MINoC (Multiple Interconnections Networks-on-Chip), uma arquitetura organizada em hierarquia, chamada HiCIT (Hierarchical Crossbar-based Interconnection Topology) e uma simbiose de ambas culminando na arquitetura hierárquica adaptativa HASIN (Hierarchical Adaptive Switching Interconnection Network). São apresentados resultados que mostram a eficiência desses conceitos validando a proposta hierárquica adaptativa. / With the advent of submicron processes, the number of transistors integrated on a single chip has reached levels that allowed the design of Multiprocessor Systems-on-Chip (MPSoCs). This capability allows the integration of several processing elements (PEs) in integrated circuits designed nowadays. In the next decade it is expected that hundreds of PEs will be integrated on a single chip. In this scenario, a key challenge is the interconnection network between PEs, which must provide the communication service required to run applications without compromising the limitations of area and energy consumption. In the first multiprocessor systems, with few nodes, bus-based approaches have been sufficient to meet these requirements. However, current systems increased quickly the number of elements, making the Networks-on-Chip (NoCs) the most appropriate solution, because it handles scalability and reusability in the same structure. Nevertheless, ITRS roadmap predicts that this increase will continue (ITRS, 2011), which resumes the discussion if present NoC architectures will be the most adequate for future systems, since its costs could be prohibitive. Therefore, new proposals have been presented in the literature with two main design strategies: architectural reconfiguration and hierarchical organization of the topology. With the architectural reconfiguration it is possible to obtain an application independent high efficiency structure, because the circuit is designed to adapt itself to satisfy performance requirements. On the other hand, architectural organizations in hierarchical topologies are defined at design time to have the most appropriate features for a class of applications, being very effective. The current work identified the synergy of both approaches and proposes a new symbiotic structure suitable for a broader class of applications. As a result, it was implemented an adaptive architecture called MINoC (Multiple Interconexions Networks-on-chip), an architecture organized in hierarchy called HiCIT (Hierarchical Crossbar-based Interconnection Topology) and a mix of both ending up with the hierarchical adaptive architecture HASIN (Hierarchical Interconnection Network Adaptive Switching). Results show the efficiency of these concepts validating the proposed hierarchical adaptive architecture.
|
Page generated in 0.0575 seconds