1 |
Multipath Router Architectures to Reduce Latency in Network-on-Chips. Deshpande, Hrishikesh (May 2012)
Low latency is a prime concern for large Network-on-Chips (NoCs), which are typically used in chip-multiprocessors (CMPs) and multiprocessor system-on-chips (MPSoCs). A significant component of overall latency is the serialization delay for applications with long packets, such as typical video stream traffic. To address serialization latency, we propose to exploit the inherent path diversity available in a typical 2-D mesh with our two novel router architectures, the Dual-path router and the Dandelion router. We observe that, in a 2-D mesh, for any source-destination pair there are two minimal paths along the edges of the bounding box; we call these the XY Dimension Order Routing (DOR) path and the YX DOR path. There are also two non-coinciding, non-minimal paths outside the bounding box created by the XY and YX DOR paths. The Dual-path router implements two injection and two ejection ports for parallel packet injection through the two minimal paths: packets are split into two halves and injected simultaneously into the network. The Dandelion router implements four injection and ejection ports for parallel packet injection: packets are split into smaller sub-packets and injected simultaneously in all possible directions, which typically include the two minimal paths and the two non-minimal paths. When all the sub-packets reach the destination, they are recombined. We find that our technique significantly increases throughput and reduces the serialization latency, and hence the overall latency, of long packets. We explore the impact of Dual-path and Dandelion on various packet lengths to demonstrate the advantage of our routers over the baseline. We further implement different deadlock-free disjoint-path models for Dandelion and develop a switching mechanism between Dual-path and Dandelion based on traffic congestion.
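As a rough illustration of the packet-splitting idea above, the sketch below divides a packet into sub-packets and maps them onto four routes: the two minimal XY/YX DOR paths and two out-of-box detours. It is a simplified assumption of how the routes could be enumerated, not the thesis's router implementation; the detour shapes, coordinates and parameters are illustrative only.

```python
# Hedged sketch: split a packet into sub-packets and map them onto four routes,
# in the spirit of the Dandelion router described above. The detour shapes below are
# simplified assumptions (they assume dst is north-east of src and that the bounding
# box does not touch the mesh boundary); the thesis's disjoint-path models differ.

def split_packet(payload_flits, parts):
    """Divide a list of flits into `parts` roughly equal sub-packets."""
    size = -(-len(payload_flits) // parts)  # ceiling division
    return [payload_flits[i:i + size] for i in range(0, len(payload_flits), size)]

def dandelion_routes(src, dst):
    """Four routes from src to dst as waypoint lists: the two minimal paths
    (XY DOR, YX DOR) plus two non-minimal detours outside the bounding box."""
    (sx, sy), (dx, dy) = src, dst
    xy_minimal = [src, (dx, sy), dst]                                        # east, then north
    yx_minimal = [src, (sx, dy), dst]                                        # north, then east
    detour_south = [src, (sx, sy - 1), (dx + 1, sy - 1), (dx + 1, dy), dst]  # below, around the east side
    detour_west = [src, (sx - 1, sy), (sx - 1, dy + 1), (dx, dy + 1), dst]   # west, around the north side
    return [xy_minimal, yx_minimal, detour_south, detour_west]

flits = list(range(16))
for sub, route in zip(split_packet(flits, parts=4), dandelion_routes((1, 1), (4, 3))):
    print(f"inject {len(sub)} flits along {route}")
```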
|
2 |
A Pure STT-MRAM Design for High-bandwidth Low-power On-chip Interconnects. Kansal, Rohan (16 December 2013)
Network-on-Chip (NoC) is the de facto inter-core communication infrastructure for future Chip Multiprocessors (CMPs). A NoC should be designed to provide both low latency and high bandwidth within limited on-chip power and area budgets. The use of a high-density, low-leakage memory, Spin-Torque Transfer Magnetic RAM (STT-MRAM), in NoC routers has been proposed, as it increases network throughput by providing more buffer capacity within the same die footprint. However, the inevitable use of SRAM to hide the long write latencies of STT-MRAM sacrifices buffer area and also wastes significant leakage and dynamic power in migrating flits between the two disparate memories. In this thesis, the first NoC router design that uses only STT-MRAM is proposed, which allows a much larger buffer space with the least power consumption. To overcome the multi-cycle writes, a multi-banked STT-MRAM buffer is employed: a logically divided virtual channel in which every incoming flit is seamlessly pipelined to each bank alternately every clock cycle using simple latches inside the router links. Our STT-MRAM has an aggressively reduced retention time, resulting in a significant reduction in the latency and power overheads of write operations. We observe flit losses in our STT-MRAM buffer and propose cost-efficient dynamic buffer refresh schemes that minimize unnecessary refreshes with minimal hardware overhead. Simulation results show that our STT-MRAM NoC router enhances throughput by 21.6% and achieves 61% savings in dynamic power and 18% savings in total router power, respectively, compared to a conventional SRAM-based NoC router of the same area.
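A minimal sketch of the bank-interleaving idea (an assumption about the mechanics, not the thesis's router RTL): with W-cycle STT-MRAM writes, W banks written round-robin can accept one flit per cycle, because each bank is busy for W cycles but is selected only once every W cycles.

```python
# Hedged sketch of round-robin bank selection hiding multi-cycle STT-MRAM writes.
# Bank count equals the write latency so the rotation returns to a bank only after
# its previous write has completed.

class MultiBankBuffer:
    def __init__(self, write_cycles):
        self.write_cycles = write_cycles          # W: cycles one STT-MRAM write takes
        self.busy_until = [0] * write_cycles      # cycle at which each bank frees up
        self.next_bank = 0

    def accept(self, cycle, flit):
        bank = self.next_bank
        assert self.busy_until[bank] <= cycle, "bank still writing: flit would be lost"
        self.busy_until[bank] = cycle + self.write_cycles
        self.next_bank = (bank + 1) % self.write_cycles
        return bank

buf = MultiBankBuffer(write_cycles=3)
for t, flit in enumerate(["f0", "f1", "f2", "f3", "f4"]):
    print(f"cycle {t}: flit {flit} -> bank {buf.accept(t, flit)}")
```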
|
3 |
High Performance Shared Memory Networking in Future Many-core Architectures Using Optical Interconnects. Neel, Brian (11 June 2014)
No description available.
|
4 |
High-Performance Crossbar Designs for Network-on-Chips (NoCs). Zhang, Yixuan (23 September 2010)
No description available.
|
5 |
Real-Time Communication over Wormhole-Switched On-Chip Networks. Liu, Meng (January 2017)
In modern industrial systems, the requirement on computational capacity has increased dramatically in order to support more functionality, process larger amounts of data, and make faster and safer run-time decisions. Instead of the traditional single-core processor, where threads can only be executed sequentially, multi-core and many-core processors are attracting more and more attention. In a multi-core processor, software can be executed in parallel, which boosts computational performance. Many-core processors are specialized multi-core processors with a larger number of cores, designed to achieve a higher degree of parallel processing. An on-chip communication bus is the central intersection used for data exchange between cores, memory and I/O in most multi-core processors. As the number of cores increases, more contention occurs on the communication bus, which becomes a bottleneck for overall performance. Therefore, to reduce this contention, a many-core processor typically employs a Network-on-Chip (NoC) for data exchange. Real-time embedded systems have been widely used for decades. In addition to functional correctness, timeliness is an important property of such systems: violating timing requirements can result in performance degradation or even fatal failures. When real-time applications execute on many-core processors, the timeliness of the NoC, as the communication subsystem, is essential as well. Unfortunately, many real-time system designs over-provision resources to guarantee that timing requirements are met, which can lead to significant resource waste. For example, analysis of a NoC design may conclude that the network is already saturated (i.e., accepting more traffic could violate requirements), while in reality the network has the capacity to admit more traffic. In this thesis, we target such resource-wasting problems in the design and analysis of NoCs used in real-time systems. We propose a number of solutions to improve the schedulability of real-time traffic over wormhole-switched NoCs and thereby further improve the resource utilization of the whole system. The solutions focus mainly on two aspects: (1) providing more accurate and efficient timing analyses; (2) proposing more cost-effective scheduling methods.
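For context only, the sketch below shows the classic style of fixed-priority response-time iteration often applied to wormhole-switched flows, considering only direct interference from higher-priority flows that share links and ignoring jitter and indirect interference. It is a textbook-style illustration under invented parameters, not one of the specific analyses proposed in this thesis.

```python
# Hedged sketch: iterate R = C + sum over interfering higher-priority flows of
# ceil(R / T_j) * C_j until it converges or exceeds the deadline. All flow
# parameters below are invented for illustration.
import math

def basic_latency(flow):
    # no-contention traversal: per-hop routing delay plus packet serialization
    return flow["hops"] * flow["hop_delay"] + flow["packet_flits"] * flow["flit_delay"]

def response_time(flow, higher_prio, deadline):
    C = basic_latency(flow)
    R = C
    while True:
        interference = sum(math.ceil(R / h["period"]) * basic_latency(h)
                           for h in higher_prio if set(h["links"]) & set(flow["links"]))
        R_new = C + interference
        if R_new == R:
            return R            # converged: worst-case traversal latency
        if R_new > deadline:
            return None         # deemed unschedulable by this simple test
        R = R_new

f1 = {"hops": 3, "hop_delay": 2, "flit_delay": 1, "packet_flits": 8,
      "period": 100, "links": ["A-B", "B-C", "C-D"]}
f2 = {"hops": 2, "hop_delay": 2, "flit_delay": 1, "packet_flits": 16,
      "period": 200, "links": ["B-C", "C-E"]}
print(response_time(f2, higher_prio=[f1], deadline=200))
```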
|
6 |
Aging-Aware Routing Algorithms for Network-on-Chips. Bhardwaj, Kshitij (1 August 2012)
Network-on-Chip (NoC) architectures have emerged as a replacement for traditional bus-based communication in the many-core era. However, continuous technology scaling has made aging mechanisms, such as Negative Bias Temperature Instability (NBTI) and electromigration, primary concerns in NoC design. In this work, a novel system-level aging model is proposed to capture the effects of aging in NoCs caused by (a) asymmetric communication patterns between network nodes, and (b) runtime traffic variations due to routing policies. This work identifies a critical need for holistic aging analysis, which, when combined with power-performance optimization, poses a multi-objective design challenge. To solve this problem, two different aging-aware routing algorithms are proposed: (a) a congestion-oblivious Mixed Integer Linear Programming (MILP)-based routing algorithm, and (b) a congestion-aware adaptive routing algorithm with a corresponding router micro-architecture. Extensive experimental evaluations show that the proposed routing algorithms reduce aging-induced power-performance overheads while also improving system robustness.
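A hedged sketch of what congestion-aware, aging-aware output-port selection could look like: among the candidate minimal output ports, pick the one minimizing a weighted cost of congestion and accumulated stress. The metric names, weights and port statistics are illustrative assumptions, not the dissertation's actual formulation.

```python
# Hedged sketch of an adaptive route-selection cost that balances congestion
# (buffer occupancy) against aging stress (a proxy for NBTI/electromigration wear).

def select_output_port(candidate_ports, w_congestion=0.6, w_aging=0.4):
    """candidate_ports: dicts with normalized 'occupancy' and 'stress' in [0, 1]."""
    def cost(port):
        return w_congestion * port["occupancy"] + w_aging * port["stress"]
    return min(candidate_ports, key=cost)

ports = [
    {"name": "east",  "occupancy": 0.8, "stress": 0.2},
    {"name": "north", "occupancy": 0.3, "stress": 0.6},
]
print(select_output_port(ports)["name"])   # steers traffic away from congested, aged ports
```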
|
7 |
Understanding Security Threats of Emerging Computing Architectures and Mitigating Performance Bottlenecks of On-Chip Interconnects in Manycore NTC System. Rajamanikkam, Chidhambaranathan (1 May 2019)
Emerging computing architectures, such as neuromorphic computing and third-party intellectual property (3PIP) cores, have attracted significant attention in the recent past. Neuromorphic computing introduces an unorthodox non-von Neumann architecture that mimics the abstract behavior of neuron activity in the human brain. It can execute complex applications, such as image processing and object recognition, more efficiently in terms of performance and energy than traditional microprocessors. However, the hardware security aspects of neuromorphic computing have received little focus at this nascent stage. 3PIP cores, on the other hand, may carry covertly inserted malicious functional behavior that can inflict a range of harms at the system/application level. This dissertation examines the impact of various threat models that emerge from neuromorphic architectures and 3PIP cores.
Near-Threshold Computing (NTC) is an energy-efficient paradigm that aggressively operates all computing resources at a supply voltage close to the threshold voltage, at the cost of performance. Therefore, an STC system is scaled to a many-core NTC system to reclaim the lost performance. However, interconnect performance in a many-core NTC system poses a significant bottleneck that hinders overall system performance. This dissertation analyzes the interconnect performance and, further, proposes a novel technique to boost the interconnect performance of many-core NTC systems.
|
8 |
Analysis of high performance interconnect in SoC with distributed switches and multiple issue bus protocols. Narayanasetty, Bhargavi (26 July 2011)
In a System on a Chip (SoC), the interconnect is the factor limiting Performance, Power, Area and Schedule (PPAS). Distributed crossbar switches, also called Switching Central Resources (SCRs), are often used to implement the high performance interconnect in an SoC as a Network on a Chip (NoC). Multiple issue bus protocols like AXI (from ARM) and VBUSM (from TI) are used in paths critical to the performance of the whole chip. Experimental analysis of the effects on PPAS of architectural modifications to the SCRs is carried out using synthesis tools and Texas Instruments (TI) in-house power estimation tools. The effects of scaling SCR sizes are discussed in this report. These results provide a quick means of estimation for architectural changes in the early design phase. Apart from SCR design, the other major domain of concern is deadlocks. Deadlocks are situations where network resources are suspended waiting for each other. In this report, various kinds of deadlocks are classified and their respective mitigations in such networks are provided. These analyses are necessary to qualify a distributed SCR interconnect that uses multiple issue protocols across all transaction scenarios. The entire analysis in this report is carried out using a flagship product of Texas Instruments: a complex ASIC SoC for a wireless base station, developed in 2010-2011 and containing 20 major cores. Since crossbar switches with multiple issue bus protocols and similar parameters are commonly used in SoCs across the semiconductor industry, this report provides a strong basis for architectural/design selection and validation of all such high performance device interconnects.

This report can be used as a seed for the development of an interface tool for architects. For a given architecture, the tool suggests architectural modifications and reports deadlock situations. This new tool will aid architects in closing design problems and providing a competitive specification very early in the design cycle. A working algorithm for the tool development is included in this report.
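To illustrate the general principle behind such deadlock analysis, namely resources suspended waiting on each other in a cycle, the sketch below detects a cycle in a hypothetical waits-for graph between interconnect resources. It is not the report's tool, classification or algorithm, just the underlying idea.

```python
# Hedged sketch: a deadlock exists when a cycle forms in the waits-for relation
# between resources. Resource names below are invented for illustration.

def has_deadlock(waits_for):
    """waits_for: dict mapping each resource to the resources it is waiting on."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in waits_for}

    def visit(node):
        color[node] = GREY
        for nxt in waits_for.get(node, []):
            if color.get(nxt, WHITE) == GREY:      # back edge: a wait cycle
                return True
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in waits_for)

# Two SCR ports each blocked on the other's pending response:
print(has_deadlock({"scr_port_A": ["scr_port_B"], "scr_port_B": ["scr_port_A"]}))  # True
```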
|
9 |
Design of the SiLago GNOC. Tang, Weiyao (January 2022)
The synchoros VLSI design style can be an alternative approach for handling the increasing complexity of embedded multi-processor architectures. The SiLago block is a synchoros block that can effectively reduce the cost of logic and physical synthesis, since it is hardened and the details of each metal layer are highly centralized within it. Global NoCs play an essential part in system-level design, and it is necessary to benchmark the SiLago global NoC against other existing NoC libraries. In this degree project, the structure of the NoC is established based on the SiLago models, including the wires and the switches. The whole structure is a nine-by-nine grid, with sixteen switches placed symmetrically inside it; the connection between two adjacent switches is built up by wires. The routing algorithm inside the switches supports the most common routing situations based on destinations, routing states, and routing history. Beyond the routing algorithm, this thesis also presents some deadlock situations and some ways to resolve them. The scripts produced by the NoC generator can be used to perform logical and physical synthesis for the SiLago models. The synthesis results can then be compared against other methods with respect to the ability to estimate cost metrics from a high level of abstraction and the quality of results. The concept of partition is introduced to accomplish physical synthesis, and through this the design comes closer to the core idea of synchoros VLSI design.
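As a simplified illustration of a destination-based hop decision on such a switch grid: the actual SiLago GNOC algorithm also consults routing state and routing history, so the plain XY ordering below is an assumption made only for illustration.

```python
# Hedged sketch: one-hop routing decision on a grid of switches, resolving the
# X coordinate first and then Y. Coordinates and grid size are illustrative.

def route_decision(current, destination):
    """Return the output direction for one hop in XY order."""
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"   # arrived: eject to the local resource

# Walk a packet from switch (0, 0) to switch (2, 3):
pos = (0, 0)
while (step := route_decision(pos, (2, 3))) != "LOCAL":
    print(pos, "->", step)
    pos = (pos[0] + (step == "EAST") - (step == "WEST"),
           pos[1] + (step == "NORTH") - (step == "SOUTH"))
```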
|
10 |
NoC Design & Optimization of Multicore Media Processors. Basavaraj, T (January 2013)
Networks on Chip [1][2][3][4] are critical elements of modern System on Chip (SoC) as well as Chip Multiprocessor (CMP) designs. Networks on Chip (NoCs) help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities like programmable processing elements, hardware acceleration engines, memory blocks as well as off-chip interfaces. With power having become a serious design constraint [5], there is a great need to design a NoC which meets the target communication requirements while minimizing power, using all the tricks available at the architecture, microarchitecture and circuit levels of the design. This thesis presents a holistic, QoS-based, power-optimal design solution for a NoC inside a CMP, taking into account link microarchitecture and processor tile configurations.
Guaranteeing QoS in NoCs involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. The Label Switching based Network-on-Chip (LS-NoC) uses a centralized LS-NoC management framework that engineers traffic into QoS-guaranteed routes. LS-NoC uses label switching, enables bandwidth reservation, allows physical link sharing and leverages the advantages of both packet and circuit switching techniques. A flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs.
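A hedged sketch of bandwidth-aware flow admission in the spirit of the flow identification step described above: a candidate route is accepted only if every link on it still has enough residual bandwidth, and the reservation is then recorded. The link names, capacities and candidate routes are invented for illustration and are not LS-NoC's actual data structures.

```python
# Hedged sketch: admit a flow onto the first candidate route with sufficient
# residual bandwidth on all its links, then commit the reservation.

link_capacity_gbps = {"R0-R1": 80.0, "R1-R2": 80.0, "R0-R3": 80.0, "R3-R2": 80.0}
reserved_gbps = {link: 0.0 for link in link_capacity_gbps}

def try_admit(flow_gbps, candidate_routes):
    for route in candidate_routes:
        residual = [link_capacity_gbps[l] - reserved_gbps[l] for l in route]
        if all(r >= flow_gbps for r in residual):
            for l in route:
                reserved_gbps[l] += flow_gbps       # commit the bandwidth reservation
            return route
    return None                                     # no QoS-guaranteed route available

routes = [["R0-R1", "R1-R2"], ["R0-R3", "R3-R2"]]
print(try_admit(30.0, routes))   # takes the first route
print(try_admit(60.0, routes))   # first route lacks headroom now, falls back to the second
```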
A multicast- and broadcast-capable label switched router for the LS-NoC has been designed, verified, synthesized, placed and routed, and timing analyzed. A 5-port, 256-bit data bus, 4-bit label router occupies 0.431 mm² in 130 nm and delivers a peak bandwidth of 80 Gbits/s per link at 312.5 MHz. The LS router is estimated to consume 43.08 mW. Bandwidth and latency guarantees of LS-NoC have been demonstrated on streaming applications like HiperLAN/2 and an Object Recognition Processor, Constant Bit Rate traffic patterns, and video decoder traffic representing Variable Bit Rate traffic. LS-NoC was found to have a competitive figure of merit with state-of-the-art NoCs providing QoS. We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements.
Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply and threshold voltages, and activity and coupling factors. An optimal link configuration, in terms of the number of pipeline stages for a given link length and desired operating frequency, is arrived at. Optimal configurations of all links in the NoC are identified, and a power-performance optimal NoC is presented. We present a latency, power and performance trade-off study of NoCs using link microarchitecture exploration, along with the design and implementation of a framework for such a design space exploration study. The trade-off study varies microarchitectural (e.g. pipelining) and circuit-level (e.g. frequency and voltage) parameters.
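As a toy version of this exploration, the sketch below sweeps the pipeline depth of a single link and reports packet latency, energy and energy-delay product (EDP). All delay and energy constants are invented for illustration; the thesis derives such numbers from detailed wire, repeater and latch models instead.

```python
# Hedged sketch: sweep link pipeline depth. A deeper pipeline shortens the cycle
# time (higher link frequency) but adds per-stage overhead and latch energy.

def packet_latency_ns(depth, hops=4, flits=8, wire_delay_ns=2.4, stage_overhead_ns=0.3):
    cycle = wire_delay_ns / depth + stage_overhead_ns      # deeper pipeline -> faster clock
    return (hops * depth + flits) * cycle                  # head latency + serialization

def packet_energy_pj(depth, flits=8, flit_energy_pj=1.0, latch_energy_pj=0.08):
    return flits * (flit_energy_pj + depth * latch_energy_pj)   # extra latches cost energy

for depth in range(1, 7):
    lat, en = packet_latency_ns(depth), packet_energy_pj(depth)
    print(f"depth {depth}: latency {lat:5.1f} ns, energy {en:5.2f} pJ, EDP {lat * en:6.1f}")
```

In this toy model, latency is minimized at an intermediate depth because a faster clock shrinks the serialization component while each extra stage adds fixed overhead; the same tension produces an interior EDP optimum, mirroring the qualitative finding reported below.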
A SystemC based NoC exploration framework is used to explore the impact of various architectural and microarchitectural parameters of NoC elements on the power and performance of the NoC. The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., and allows experimentation with various microarchitectural options for the individual links, like length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results from using this framework to study a 4x4 CMP are presented. The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks [6].
One of the key findings is that the average latency of a link can be reduced by increasing pipeline depth up to a certain extent, as deeper pipelining enables link operation at higher frequencies. There exists an optimum degree of pipelining which minimizes the energy-delay product of the link. In a 2D Torus, the least latency (1.56 times the minimum) is achieved when the longest link is pipelined by 4 stages, at which point power (40% of max) and throughput (64% of max) are nominal. Frequency scaling experiments show power variations of up to 40%, 26.6% and 24% in the 2D Torus, Reduced 2D Torus and Tree-based NoC, respectively, between pipeline configurations achieving the same frequency at constant voltage. In some cases, switching to a deeper pipelining configuration can actually help reduce power, as the links can then be designed with smaller repeaters. We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the mesh seems to perform the best among the three topologies (Mesh, Torus and Folded Torus) considered in the case studies.
The effects of communication overheads on the performance, power and energy of a multiprocessor chip are studied using L1 and L2 cache sizes as the primary exploration parameters, with accurate interconnect, processor, on-chip and off-chip memory modelling. On-chip and off-chip communication times have a significant impact on the execution time and energy efficiency of CMPs. Larger caches imply a larger tile area, which results in longer inter-tile communication link lengths and latencies, adversely impacting communication time. Smaller caches potentially suffer more misses and more frequent off-tile communication. Energy-efficient tile design is therefore a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal configuration for the CMP.
Trade-offs are explored using a detailed, cycle-accurate multicore simulation framework which includes superscalar processor cores, cache-coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark suite (64K-point FFT). Link latencies are estimated for a 16-core CMP simulation on Sapphire. Each tile has a single processor, L1 and L2 caches and a router. Different sizes of L1 and L2 lead to different tile clock speeds, tile miss rates and tile areas, and hence different interconnect latencies.
Simulations across various L1 and L2 sizes indicate that the tile configuration that maximizes energy efficiency is related to minimizing communication time. Experiments also indicate different optimal tile configurations for performance, energy and energy efficiency. Clustered interconnection networks, communication-aware cache bank mapping and thread mapping to physical cores are also explored as potential energy saving solutions. Results indicate that ignoring link latencies can lead to large errors in estimates of program completion times, of up to 17%. Performance-optimal configurations are achieved at smaller L1 caches and moderate L2 cache sizes, due to higher operating frequencies, smaller link lengths and comparatively less communication. Using the minimal L1 cache size to operate at the highest frequency may not always be the performance-power optimal choice. Larger L1 sizes, despite a drop in frequency, offer an energy advantage due to less miss-induced communication.
Clustered tile placement experiments for FFT show a considerable performance per watt improvement (1.2%). Remapping the most accessed L2 banks of a process to the same core or neighbouring cores after communication traffic analysis offers power and performance advantages. Remapped processes and banks in the clustered tile placement show a performance per watt improvement of 5.25% and an energy reduction of 2.53%. This suggests that processors could execute a program in multiple modes, for example minimum energy or maximum performance.
|