21 |
A Link-Level Communication Analysis for Real-Time NoCs
Gholamian, Sina January 2012
This thesis presents a link-level latency analysis for real-time network-on-chip interconnects that use priority-based wormhole switching. The analysis incorporates both direct and indirect interference from other traffic flows, and it leverages pipelining and parallel transmission of data across the links. The resulting link-level analysis provides a tighter worst-case upper bound than existing techniques, which we verify with our analysis and simulation experiments. Our experiments show that, on average, link-level analysis reduces the worst-case latency by 28.8% and increases the number of schedulable flows by 13.2% compared with previous work.
|
22 |
A verilog-hdl implementation of virtual channels in a network-on-chip router
Park, Sungho 15 May 2009
As feature sizes continue to shrink and integration density increases, interconnections have become a dominating factor in determining the overall quality of a chip. Because of its limited scalability, the system bus can support only a limited number of functional units and therefore cannot meet the requirements of current System-on-Chip (SoC) implementations. Long global wires also cause many design problems, such as routing congestion, noise coupling, and difficult timing closure. Network-on-Chip (NoC) architectures have been proposed as an alternative that solves these problems by using a packet-based communication network. The processing elements (PEs) communicate with each other by exchanging messages over the network, and these messages pass through buffers in each router. Buffers are one of the major resources used by routers in virtual channel flow control.
In this thesis, we analyze two buffer allocation approaches: static and dynamic. Both aim to increase throughput and minimize latency by means of virtual channel flow control. In a statically allocated buffer architecture, size and organization are design-time decisions and thus do not perform optimally under all traffic conditions. In addition, statically allocated virtual channels waste area and consume significant leakage power. Dynamic buffer allocation schemes, in contrast, claim that buffer utilization can be increased using dynamic virtual channels. The dynamic virtual channel regulator (ViChaR) has been proposed; it uses a centralized buffer architecture that dynamically allocates virtual channels and buffer slots in real time depending on traffic conditions. ViChaR's dynamic buffer management scheme increases buffer utilization, but it also increases design complexity. In this research, we reexamine the performance, power consumption, and area of ViChaR's buffer architecture through implementation. We implement a generic router and a ViChaR architecture in Verilog-HDL. The RTL code is verified by dynamic simulation and synthesized with Design Compiler to obtain area and power consumption. In addition, we obtain latency through static timing analysis. The results show that ViChaR's dynamic buffer management scheme increases latency and power consumption significantly, even though it can increase buffer utilization. Therefore, a novel design is needed to achieve high buffer utilization without these penalties.
|
23 |
HW/SW Codesign and Design, Evaluation of Software Framework for AcENoCs : An FPGA-Accelerated NoC Emulation Platform
Pai, Vinayak December 2010
The majority of modern compute-intensive applications are heterogeneous in nature. To support their ever-increasing computational requirements, present-day System-on-Chip (SoC) architectures have adopted a multicore style of modeling, incorporating multiple heterogeneous processing cores on a single chip. The emerging Network-on-Chip (NoC) interconnect paradigm provides a scalable and power-efficient solution for communication among multiple cores, serving as a powerful replacement for traditional bus-based architectures. A fast, robust, and flexible emulation platform is the key to the successful realization and validation of such architectures within a very short span of time.
This research focuses on various aspects of Hardware/Software (HW/SW) codesign for AcENoCs (Accelerated Emulation Platform for NoCs), a Field Programmable Gate Array (FPGA)-accelerated, configurable, cycle-accurate platform for emulation and validation of NoC architectures. This work also details the design, implementation, and evaluation of AcENoCs' software framework, along with the design optimizations carried out and the tradeoffs considered in AcENoCs' HW/SW codesign to achieve an optimum balance between emulated network dimensions and emulation performance. The AcENoCs emulation platform is realized on a Xilinx Virtex-5 FPGA. AcENoCs' hardware framework consists of the NoC built using configurable hardware library components, while the software framework consists of Traffic Generators (TGs) and their associated source queues, Traffic Receptors (TRs), a statistics analysis module, and a dynamically controlled emulation clock generator. The software framework is implemented on an on-chip Xilinx MicroBlaze processor. This report also describes the interaction between various HW/SW events in an emulation cycle and assesses AcENoCs' performance speedup and tradeoffs over existing FPGA emulators and software simulators.
FPGA synthesis results showed that networks with dimensions up to 5x5 could be accommodated inside the device. Varying synthetic traffic workloads, generated by the TGs, were used to evaluate the network. Real application-based traces were also run on the AcENoCs platform to evaluate the performance improvement achieved in comparison to software simulators. To improve emulator performance, software profiling was carried out to identify and optimize the software components consuming the highest number of processor cycles in an emulation cycle. Emulation test cases were run and latency values recorded for varying traffic patterns in order to evaluate the AcENoCs platform. Experimental results showed emulation speedups on the order of 10000-12000X over HDL (Hardware Description Language) simulators and 14-47X over software simulators, without sacrificing cycle accuracy.
|
24 |
CoNoC: Fast Full Chip Topology Generation for Application-Specific Network on Chip
Chen, Shu-yu 08 January 2010
We propose a synthesis methodology for Networks-on-Chip (NoCs) or NoC-based multiprocessor systems-on-chip (MPSoCs) that generates application-specific or irregular topologies. We first propose synthesizing the processor and communication architectures simultaneously, in order to estimate area and routing more accurately during the floorplanning stage; this differs from the traditional approach of inserting routers and links after floorplanning.
Our NoC topology generation is simultaneously optimized for speed, low power, and wirelength. Compared with the state of the art, our results are better on average by 445.45X in CPU time, 33.20% in power consumption, and 96.86% in wirelength. These gains come at the cost of a 2.26% larger NoC size, because our method considers router shape; 20.63% more routers, because our method limits each router to 5 ports; and 3.93% more links, because our method allows different link lengths.
Our method is also scalable: in experiments scaled by 2X, 4X, 8X, and 16X, it is better on average by 355,089.11X in CPU time, 1.21X in the number of hops, and 78.33% in power consumption. Our experimental results show that our synthesis method is effective, efficient, and scalable.
|
26 |
Architectural Support for High-Performance, Power-Efficient and Secure Multiprocessor Systems
An, Baik Song August 2012
High-performance systems have been widely adopted in many fields, and the demand for better performance is constantly increasing. The need for powerful yet flexible systems is also growing, to meet varying application requirements from diverse domains. Power efficiency in high-performance computing is another major issue to be resolved: the power density of core components has become significantly higher, and power supply accounts for a dominant fraction of total management cost. Providing dependability is also a main concern in large-scale systems, since more hardware resources can be abused by attackers. Therefore, designing high-performance, power-efficient, and secure systems is crucial to provide adequate performance as well as reliability to users.
Traditional design methodologies for large-scale computing systems are limited in their ability to meet this demand under restricted resource budgets. Interconnecting a large number of uniprocessor chips to build parallel processing systems is not an efficient solution in terms of performance and power. A chip multiprocessor (CMP) integrates multiple processing cores and caches on a chip and is regarded as a good alternative to previous design trends.
In this dissertation, we deal with various design issues of CMP-based high-performance multiprocessor systems to achieve both performance and power efficiency while maintaining security. First, we propose fast and secure off-chip interconnects that minimize network overheads and provide an efficient security mechanism. Second, we propose architectural support for fast and efficient memory protection in CMP systems, making the best use of the characteristics of CMP environments and multi-threaded workloads. Third, we propose a new router design for network-on-chip (NoC) based on a new memory technique. We introduce hybrid input buffers that use both SRAM and STT-MRAM for better performance as well as power efficiency.
Simulation results show that the proposed schemes improve the performance of off-chip networks by reducing the message size by 54% on average. The schemes also diminish the overheads of bounds-checking operations, enhancing overall performance by 11% on average. Adopting hybrid buffers in NoC routers increases network throughput by up to 21%.
|
27 |
Design and Analysis of Location Cache in a Network-on-Chip Based Multiprocessor System
Ramakrishnan, Divya 20 April 2009
No description available.
|
28 |
Scalable Hybrid Neuromorphic Accelerator & Hybrid Neural Networks
Nardone, Joshua 01 June 2024
With machine learning workloads currently at very large scales, models are distributed across large compute systems. On distributed systems, the performance of these models is limited by the bandwidth of chip-to-chip communication. To relieve this bottleneck, spiking neural networks (SNNs) can be used to reduce inter-chip communication traffic by exploiting inherent network sparsity. However, in comparison to traditional artificial neural networks (ANNs), SNNs can suffer significant degradation in performance as network scale and complexity increase.
This research proposes a hybrid neural network accelerator that uses the best of both spiking and non-spiking layers by allocating a majority of resources to non-spiking layers on the interior of the chip, while bandwidth-limited areas (e.g., I/O pads or chip separation boundaries) employ spike-based data traffic. By limiting the overall use of spiking layers within the network, we realize the energy savings of SNNs without the degradation in accuracy that comes with large spike-based models.
We present a scalable chiplet architecture and show how hybrid data is managed with both spike and non-spiking data communication. We also demonstrate how the asynchronous spike-based model is integrated efficiently with synchronous ANN-based deep learning workloads. We demonstrate that our hybrid architecture offers significant improvements in performance, accuracy, and energy consumption in comparison to SNNs and ANNs. It achieves up to a 1.34× increase in energy efficiency and a 1.56× decrease in single-inference latency, and its versatility is demonstrated by validation across multiple datasets, encompassing both language processing and computer vision tasks.
|
29 |
Algoritmo de prefetching de dados temporizado para sistemas multiprocessadores baseados em NoC [Timed data prefetching algorithm for NoC-based multiprocessor systems]
SILVEIRA, Maria Cireno Ribeiro 09 March 2015
Prefetching is an effective technique for mitigating a well-known problem in multi-core processors: the gap between computing performance and data access performance. The goal of prefetching is to bring data closer to the CPU by retrieving it from memory and loading it into the local cache. When the CPU requests the data, it is already available in the cache, which reduces the miss rate and the system penalty. In NoC-based multiprocessor systems, prefetching efficiency is even more critical to system performance, since the access time depends on the distance between the requesting processor and the memory and on the network traffic.
This work proposes a timed data prefetching algorithm that aims to minimize the penalty of the cores through a prefetching solution based on time prediction for NoC-based multiprocessor systems. The algorithm uses a proactive process, initiated by the server, to issue prefetch requests based on cache miss history and NoC information. In experiments with 16 cores, the proposed algorithm reduced processor penalty by 53.6% compared with event-based prefetching (triggered by cache misses), and the best case was a penalty reduction of 29%.
|
30 |
Estratégia para redução de congestionamento em sistemas multiprocessadores baseados em NoC [Strategy for congestion reduction in NoC-based multiprocessor systems]
KAMEI, Camila Ascendina Nunes 07 August 2015
Funding agency: CNPq.
Two issues are critical in NoC-based MPSoC systems with memory parallelism: message delivery order and network congestion. Congestion is frequent in NoCs when packet demand exceeds the capacity of the network resources, and message order must be maintained so that cache coherence information remains meaningful to the memories. Congestion control methods are therefore needed for these systems, and they must handle network congestion while preserving the order of transactions.
This work proposes a routing technique based on the Odd-Even routing algorithm, combined with the concept of local and global network congestion, to choose the best forwarding path for communication packets. The aim is to reduce network communication bottlenecks in NoC-based MPSoC systems. In experiments with 16 cores, the proposed technique reduced energy consumption by 13.35%, and it reduced packet delivery latency by 25% compared with the XY algorithm and by 23% compared with the unmodified Odd-Even algorithm.
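For context, the XY baseline named in this abstract is the standard dimension-ordered routing scheme for 2D meshes: a packet first travels along the X dimension until it reaches the destination column, then along the Y dimension. The Python sketch below illustrates only that baseline. The coordinate convention and the function name `xy_route` are illustrative assumptions, not taken from the thesis, and the thesis's congestion-aware Odd-Even technique is not reproduced here.

```python
def xy_route(src, dst):
    """Return the (x, y) nodes visited from src to dst on a 2D mesh.

    Hypothetical helper: routes fully along X first, then along Y,
    which is the usual definition of deterministic XY routing.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: move along the X dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: move along the Y dimension until the row matches.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet between a given source and destination takes the same path, XY routing cannot steer around a congested link, which is the kind of limitation an adaptive scheme such as Odd-Even is meant to address.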
|