31

NoC Design & Optimization of Multicore Media Processors

Basavaraj, T January 2013 (has links) (PDF)
Networks on Chip [1][2][3][4] are critical elements of modern System on Chip (SoC) as well as Chip Multiprocessor (CMP) designs. Networks on Chip (NoCs) help manage the high complexity of designing large chips by decoupling computation from communication. SoCs and CMPs have a multiplicity of communicating entities such as programmable processing elements, hardware acceleration engines, memory blocks and off-chip interfaces. With power having become a serious design constraint [5], there is a great need for designing a NoC that meets the target communication requirements while minimizing power, using all the techniques available at the architecture, microarchitecture and circuit levels of the design. This thesis presents a holistic, QoS-based, power-optimal design solution for a NoC inside a CMP, taking into account link microarchitecture and processor tile configurations. Guaranteeing QoS in a NoC involves guaranteeing bandwidth and throughput for connections and deterministic latencies in communication paths. The Label Switching based Network-on-Chip (LS-NoC) uses a centralized LS-NoC management framework that engineers traffic into QoS-guaranteed routes. LS-NoC uses label switching, enables bandwidth reservation, allows physical link sharing and leverages the advantages of both packet and circuit switching techniques. A flow identification algorithm takes into account the bandwidth available in individual links to establish QoS-guaranteed routes. LS-NoC caters to the requirements of streaming applications where communication channels are fixed over the lifetime of the application. The proposed NoC framework inherently supports heterogeneous and ad-hoc SoC designs. A multicast- and broadcast-capable label switched router for the LS-NoC has been designed, verified, synthesized, placed and routed, and timing analyzed. A 5-port, 256-bit data bus, 4-bit label router occupies 0.431 mm2 in 130 nm and delivers a peak bandwidth of 80 Gbits/s per link at 312.5 MHz. The LS router is estimated to consume 43.08 mW. Bandwidth and latency guarantees of LS-NoC have been demonstrated on streaming applications like HiperLAN/2 and an Object Recognition Processor, Constant Bit Rate traffic patterns, and video decoder traffic representing Variable Bit Rate traffic. LS-NoC was found to have a competitive figure of merit with state-of-the-art NoCs providing QoS. We envision the use of LS-NoC in general purpose CMPs where applications demand deterministic latencies and hard bandwidth requirements. Design variables for interconnect exploration include wire width, wire spacing, repeater size and spacing, degree of pipelining, supply voltage, threshold voltage, activity and coupling factors. An optimal link configuration, in terms of the number of pipeline stages for a given link length and desired operating frequency, is arrived at. Optimal configurations of all links in the NoC are identified and a power-performance optimal NoC is presented. We present a latency, power and performance trade-off study of NoCs using link microarchitecture exploration. The design and implementation of a framework for such a design space exploration study is also presented. We present the trade-off study on NoCs by varying microarchitectural (e.g. pipelining) and circuit-level (e.g. frequency and voltage) parameters. A SystemC based NoC exploration framework is used to explore the impact of various architectural and microarchitectural parameters of NoC elements on the power and performance of the NoC.
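As a rough illustration of the bandwidth-aware flow identification idea described above, the sketch below admits a flow onto the first candidate route whose every link still has enough residual bandwidth. The link identifiers, capacity figure and candidate routes are illustrative placeholders, not the LS-NoC algorithm itself.

```python
# Minimal sketch of bandwidth-aware flow admission: a flow is mapped onto the
# first candidate route whose every link still has enough residual bandwidth.
# Link names, capacities and routes are illustrative placeholders only.

LINK_CAPACITY_GBPS = 80.0          # per-link peak bandwidth quoted above
residual = {}                      # link -> remaining bandwidth in Gbits/s

def admit_flow(flow_bw, candidate_routes):
    """candidate_routes: list of routes, each a list of link identifiers."""
    for route in candidate_routes:
        # make sure every link on the route can carry the extra flow
        if all(residual.get(link, LINK_CAPACITY_GBPS) >= flow_bw for link in route):
            for link in route:
                residual[link] = residual.get(link, LINK_CAPACITY_GBPS) - flow_bw
            return route           # QoS-guaranteed route established
    return None                    # no route can guarantee the requested bandwidth

# Example: two candidate routes between a producer and a consumer tile.
routes = [[("R0", "R1"), ("R1", "R2")], [("R0", "R3"), ("R3", "R2")]]
print(admit_flow(20.0, routes))
```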
The framework enables the designer to choose from a variety of architectural options like topology, routing policy, etc., and allows experimentation with various microarchitectural options for the individual links, such as length, wire width, pitch, pipelining, supply voltage and frequency. The framework also supports a flexible traffic generation and communication model. Latency, power and throughput results from using this framework to study a 4x4 CMP are presented. The framework is used to study NoC designs of a CMP using different classes of parallel computing benchmarks [6]. One of the key findings is that the average latency of a link can be reduced by increasing pipeline depth up to a certain extent, as it enables link operation at higher frequencies. There exists an optimum degree of pipelining which minimizes the energy-delay product of the link. In a 2D Torus, the least latency (1.56 times the minimum) is achieved when the longest link is pipelined by 4 stages, at which point power (40% of max) and throughput (64% of max) are nominal. Using frequency scaling experiments, power variations of up to 40%, 26.6% and 24% can be seen in the 2D Torus, Reduced 2D Torus and Tree based NoC between various pipeline configurations achieving the same frequency at constant voltages. In some cases, we find that switching to a higher pipelining configuration can actually help reduce power, as the links can be designed with smaller repeaters. We also find that the overall performance of the ICNs is determined by the lengths of the links needed to support the communication patterns. Thus the mesh seems to perform the best amongst the three topologies (Mesh, Torus and Folded Torus) considered in the case studies. The effects of communication overheads on performance, power and energy of a multiprocessor chip, using L1 and L2 cache sizes as the primary exploration parameters and accurate interconnect, processor, on-chip and off-chip memory modelling, are presented. On-chip and off-chip communication times have a significant impact on execution time and the energy efficiency of CMPs. Larger caches imply larger tile areas, which result in longer inter-tile communication link lengths and latencies, thus adversely impacting communication time. Smaller caches potentially have a higher number of misses and more frequent off-tile communication. Energy-efficient tile design is a configuration exploration and trade-off study using different cache sizes and tile areas to identify a power-performance optimal configuration for the CMP. Trade-offs are explored using a detailed, cycle-accurate, multicore simulation framework which includes superscalar processor cores, cache-coherent memory hierarchies, on-chip point-to-point communication networks and a detailed interconnect model including pipelining and latency. Sapphire, a detailed multiprocessor execution environment integrating SESC, Ruby and DRAMSim, was used to run applications from the Splash2 benchmark (64K-point FFT). Link latencies are estimated for a 16-core CMP simulation on Sapphire. Each tile has a single processor, L1 and L2 caches and a router. Different sizes of L1 and L2 lead to different tile clock speeds, tile miss rates and tile areas, and hence interconnect latency. Simulations across various L1 and L2 sizes indicate that the tile configuration that maximizes energy efficiency is related to minimizing communication time. Experiments also indicate different optimal tile configurations for performance, energy and energy efficiency.
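The existence of an optimum pipeline depth minimizing the link's energy-delay product can be illustrated with a small sweep; the per-stage energy and delay figures below are invented placeholders, not the thesis' interconnect models.

```python
# Illustrative sweep over pipeline depths to find the one minimizing the
# energy-delay product of a link.  The energy/delay numbers are placeholders;
# a real study would use extracted interconnect and repeater models.

def link_cycle_time_ns(depth, wire_delay_ns=2.0, flop_overhead_ns=0.15):
    # Pipelining splits the wire delay across stages but adds flop overhead.
    return wire_delay_ns / depth + flop_overhead_ns

def link_energy_pj(depth, wire_energy_pj=10.0, flop_energy_pj=0.8):
    # Each extra stage adds register energy on top of the wire energy.
    return wire_energy_pj + depth * flop_energy_pj

def best_pipeline_depth(max_depth=8):
    # Latency of a traversal is depth cycles; EDP = energy * latency.
    def edp(d):
        return link_energy_pj(d) * (d * link_cycle_time_ns(d))
    return min(range(1, max_depth + 1), key=edp)

print(best_pipeline_depth())
```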
Clustered interconnection networks, communication-aware cache bank mapping and thread mapping to physical cores are also explored as potential energy saving solutions. Results indicate that ignoring link latencies can lead to large errors in estimates of program completion times, of up to 17%. Performance-optimal configurations are achieved at smaller L1 cache sizes and moderate L2 cache sizes, due to higher operating frequencies, shorter link lengths and comparatively less communication. Using the minimal L1 cache size to operate at the highest frequency may not always be the performance-power optimal choice. Larger L1 sizes, despite a drop in frequency, offer an energy advantage due to less miss-induced communication. Clustered tile placement experiments for FFT show a considerable performance per watt improvement (1.2%). Remapping the most accessed L2 banks of a process to the same core or neighbouring cores after communication traffic analysis offers power and performance advantages. Remapped processes and banks in clustered tile placement show a performance per watt improvement of 5.25% and an energy reduction of 2.53%. This suggests that processors could execute a program in multiple modes, for example, minimum energy or maximum performance.
32

EVALUATION OF SOURCE ROUTING FOR MESH TOPOLOGY NETWORK ON CHIP PLATFORMS

MUBEEN, SAAD January 2009 (has links)
Network on Chip (NoC) is a scalable and flexible communication infrastructure for the design of core-based System on Chip (SoC). The communication performance of a NoC depends heavily on the routing algorithm. Deterministic and adaptive distributed routing algorithms have been advocated in all the current NoC architectural proposals. In this thesis we make a case for the use of source routing for NoCs, especially for regular topologies like the mesh. The advantages of source routing include in-order packet delivery, faster and simpler router design, and the possibility of mixing non-minimal paths into a mainly minimal routing. We propose a method to compute paths for various communications in such a way that traffic congestion is avoided while ensuring deadlock-free routing. We also propose an efficient scheme to encode the paths. We developed a tool in Matlab that computes paths for source routing for both general and application-specific communications. Depending upon the type of traffic, this tool computes paths for source routing by selecting the best routing algorithm out of many routing algorithms. The tool uses a constructive path improvement algorithm to compute paths that give a more uniform link load distribution. It also generates different types of traffic. We also developed a simulator capable of simulating source routing for mesh topology NoCs. The experiments and simulations we performed were successful, and the results show that the advantages of source routing, especially lower packet latency, more than compensate for its disadvantages. The results also demonstrate that source routing can be a good routing candidate for practical core-based SoC designs using a network on chip communication infrastructure.
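As a loose illustration of computing source routes for a more uniform link load distribution, the sketch below picks, for each source-destination pair on a mesh, whichever of the two dimension-ordered minimal routes (XY or YX) currently has the lighter most-loaded link. This greedy heuristic is only a stand-in for the constructive path improvement algorithm described above.

```python
# Illustrative greedy path selection on a mesh: for each flow, choose the XY or
# YX minimal route whose most-loaded link is currently lighter.  This is a
# simplified stand-in for the thesis' constructive path improvement method.
from collections import defaultdict

def _straight(a, b, fixed, horizontal):
    """Links of a straight segment between coordinates a and b."""
    step = 1 if b >= a else -1
    links = []
    for c in range(a, b, step):
        if horizontal:
            links.append(((c, fixed), (c + step, fixed)))
        else:
            links.append(((fixed, c), (fixed, c + step)))
    return links

def xy_route(src, dst):
    (sx, sy), (dx, dy) = src, dst
    return _straight(sx, dx, sy, True) + _straight(sy, dy, dx, False)

def yx_route(src, dst):
    (sx, sy), (dx, dy) = src, dst
    return _straight(sy, dy, sx, False) + _straight(sx, dx, dy, True)

def assign_paths(flows):
    """flows: list of (src, dst) router coordinates; returns one route per flow."""
    load = defaultdict(int)
    chosen = []
    for src, dst in flows:
        options = [xy_route(src, dst), yx_route(src, dst)]
        best = min(options, key=lambda r: max((load[l] for l in r), default=0))
        for link in best:
            load[link] += 1
        chosen.append(best)
    return chosen

print(assign_paths([((0, 0), (2, 2)), ((0, 2), (2, 0))]))
```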
33

Power Issues in SoCs : Power Aware DFT Architecture and Power Estimation

Tudu, Jaynarayan Thakurdas January 2016 (has links) (PDF)
Test power, data volume, and test time have been long-standing problems for sequential scan based testing of system-on-chip (SoC) designs. Modern SoCs fabricated at lower technology nodes are complex in nature; the transistor count is as large as billions of gates for some microprocessors. The design complexity is projected to increase further in the coming years in accordance with Moore's law. The larger gate count and the integration of multiple functionalities are the causes of higher test power dissipation, test time and data volume. The dynamic power dissipated during scan testing, i.e. during scan shift, launch and response capture, is a major concern for reliable as well as cost-effective testing. Excessive average power dissipation leads to a thermal problem which can cause burn-out of the chip during testing. Peak power, on the other hand, causes test failure due to power-induced additional delay. Test failure has a direct impact on yield. The test power problem in modern 3D stacked ICs is an even more serious issue. Estimating the worst-case functional power dissipation is yet another great challenge. Worst-case functional power estimation is necessary because it gives an upper bound on the functional power dissipation, which can further be used to determine the safe power zone for test. Several solutions have been proposed in the past to address these issues. In this thesis we have three major contributions: 1) sequential scan chain reordering and 2) JScan, an alternative Joint-scan DFT architecture, to address primarily the test power issues along with test time and data volume, and 3) an integer linear programming methodology to address the power estimation problem. In order to reduce test power during shift, we have proposed a graph theoretic formulation for scan chain reordering and for optimum scan shift operation; for each formulation a set of algorithms is proposed. The experimental results on ISCAS-89 benchmark circuits show a reduction of around 25% and 15% in peak power and scan shift time respectively. In order to have a holistic DFT architecture which can solve the test power, test time and data volume problems, a new DFT architecture called Joint-scan (JScan) has been developed. In JScan we have integrated the serial and random access scan architectures in a systematic way by which JScan can harness the respective advantages of each architecture. The serial scan architecture suffers from test power, test time and data volume problems; however, it is simple in terms of functionality and cost-effective in terms of DFT circuitry. The random access scan architecture is the opposite: it is power efficient and it takes less time and data volume compared to serial scan, but it occupies a larger DFT area and introduces routing congestion. Therefore, we have proposed a methodology to realize the JScan architecture as an efficient alternative to standard serial and random access scan. Further, the JScan architecture is optimized, resulting in a 2-Mode (2M-JScan) Joint-scan architecture. The proposed architectures are experimentally verified on larger benchmark circuits and compared with existing state-of-the-art DFT architectures. The results show a reduction of 50% to 80% in test power and 30% to 50% in test time and data volume. The proposed architectures are also evaluated for routing area minimization, and we obtain a saving of around 7% to 15% of chip area.
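To give a flavour of scan chain reordering as a graph problem, the sketch below treats scan cells as vertices, approximates the shift-switching cost of placing two cells adjacently by the Hamming distance between their test-data columns, and builds an ordering with a greedy nearest-neighbour tour. The cost model and heuristic are illustrative, not the formulation proposed in the thesis.

```python
# Rough sketch of scan chain reordering: greedily chain together cells whose
# test-data columns differ in few bits, approximating lower shift switching.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def reorder_scan_chain(columns):
    """columns: dict cell_name -> tuple of that cell's bits across test patterns."""
    remaining = set(columns)
    order = [remaining.pop()]                  # arbitrary start cell
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda c: hamming(columns[last], columns[c]))
        remaining.remove(nxt)
        order.append(nxt)
    return order

cells = {"ff0": (0, 1, 1, 0), "ff1": (0, 1, 0, 0), "ff2": (1, 0, 1, 1)}
print(reorder_scan_chain(cells))
```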
Estimating the worst-case functional power being a challenging problem, we have proposed a binary integer linear programming (BILP) based methodology. Two different formulations have been proposed, considering two delay models, namely zero-delay and unit-delay. The proposed methodology generates a pair of input vectors which toggle the circuit so as to dissipate the worst-case power. The BILP problems are solved using the CPLEX solver for ISCAS-85 combinational benchmark circuits. For some of the circuits, the proposed methodology provided the worst possible power dissipation, i.e. 80 to 100% toggling of the nets.
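A zero-delay BILP of this kind can be sketched as follows; the variables, the capacitance weights and the standard XOR/AND linearizations are generic illustrations, not the exact constraints used in the thesis.

```latex
% Sketch of a zero-delay BILP for worst-case power: x_i^{(1)}, x_i^{(2)} give
% each net i's value under the two input vectors, t_i marks a toggle on net i.
\begin{align*}
\max \quad & \sum_{i \in \text{nets}} c_i \, t_i
  && \text{($c_i$: switched-capacitance weight of net $i$)} \\
\text{s.t.} \quad
 & t_i \ge x_i^{(1)} - x_i^{(2)}, \quad t_i \ge x_i^{(2)} - x_i^{(1)}, \\
 & t_i \le x_i^{(1)} + x_i^{(2)}, \quad t_i \le 2 - x_i^{(1)} - x_i^{(2)}
  && \text{(linearized } t_i = x_i^{(1)} \oplus x_i^{(2)}\text{)} \\
 & z^{(k)} \le x^{(k)},\quad z^{(k)} \le y^{(k)},\quad z^{(k)} \ge x^{(k)} + y^{(k)} - 1
  && \text{(each 2-input AND gate } z = x \wedge y,\ k = 1,2\text{)} \\
 & x_i^{(k)},\, t_i,\, z^{(k)} \in \{0,1\}.
\end{align*}
```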
34

Efficient Minimum Cycle Mean Algorithms And Their Applications

Supriyo Maji (9158723) 23 July 2020 (has links)
Minimum cycle mean (MCM) is an important concept in directed graphs. From clock period optimization and timing analysis to layout optimization, minimum cycle mean algorithms have found widespread use in VLSI system design optimization. With transistor sizes scaling to 10nm and below, the complexity and size of these systems have grown rapidly over the last decade. Scalability of the algorithms, both in terms of runtime and memory usage, is therefore important.

Among the few classical MCM algorithms, the algorithm by Young, Tarjan, and Orlin (YTO) has been particularly popular. When implemented with a binary heap, the YTO algorithm has the best runtime performance, although it has a higher asymptotic time complexity than Karp's algorithm. However, as an efficient implementation of YTO relies on data redundancy, its memory usage is higher and can be a prohibitive factor in large problems. On the other hand, a typical implementation of Karp's algorithm can also be memory hungry. An early termination technique from Hartmann and Orlin (HO) can be applied directly to Karp's algorithm to improve its runtime performance and memory usage. Although not as efficient as YTO in runtime, the HO algorithm has much lower memory usage than YTO. We propose several improvements to the HO algorithm. The proposed algorithm has runtime performance comparable to YTO for circuit graphs and dense random graphs while having better memory usage than the HO algorithm.

Minimum balancing of a directed graph is an application of the minimum cycle mean algorithm. Minimum balance algorithms have been used to optimally distribute slack for mitigating process variation induced timing violations in clock networks. In a conventional minimum balance algorithm, the principal subroutine is finding the MCM of a graph. In particular, the minimum balance algorithm iteratively finds the minimum cycle mean and the corresponding minimum-mean cycle, and uses the mean and cycle to update the graph by changing edge weights and reducing the graph size. The iterations terminate when the updated graph is a single node. Studies have shown that the bottleneck of the iterative process is the graph update operation, as previous approaches updated the entire graph. We propose an improvement to the minimum balance algorithm that performs fewer changes to the edge weights in each iteration, resulting in better efficiency.

We also apply the minimum cycle mean algorithm to latency insensitive system design. Timing violations can occur in high performance communication links in system-on-chips (SoCs) in the late stages of the physical design process. To address these issues, latency insensitive systems (LISs) employ pipelining in the communication channels through insertion of relay stations. Although the functionality of a LIS is robust with respect to communication latencies, such insertion can degrade system throughput. Earlier studies have shown that proper sizing of the buffer queues after relay station insertion can eliminate this performance loss. However, solving the maximum-performance buffer queue sizing problem requires mixed integer linear programming (MILP), whose runtime is not scalable. We formulate the problem as a parameterized graph optimization problem where every communication channel has a parameterized edge with the buffer count as its edge weight. We then use the minimum cycle mean algorithm to determine from which edges buffers can be removed safely without creating negative cycles. This is done iteratively, in a similar style to the minimum balance algorithm. Experimental results suggest that the proposed approach is scalable. Moreover, the quality of the solution is observed to be as good as that of the MILP-based approach.
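For reference, the classical Karp characterization of the minimum cycle mean, which the HO early-termination technique builds on, can be sketched in a few lines; this is a textbook version assuming a strongly connected graph, not the improved algorithm proposed in the thesis.

```python
# Textbook sketch of Karp's minimum cycle mean algorithm.  For a strongly
# connected graph, MCM = min_v max_k (D[n][v] - D[k][v]) / (n - k), where
# D[k][v] is the minimum weight of any k-edge walk from a source vertex to v.

def karp_mcm(n, edges):
    """n: number of vertices 0..n-1; edges: list of (u, v, weight)."""
    INF = float("inf")
    D = [[INF] * n for _ in range(n + 1)]
    D[0][0] = 0.0                              # vertex 0 as the source
    for k in range(1, n + 1):
        for u, v, w in edges:
            if D[k - 1][u] < INF and D[k - 1][u] + w < D[k][v]:
                D[k][v] = D[k - 1][u] + w
    best = INF
    for v in range(n):
        if D[n][v] == INF:
            continue
        ratios = [(D[n][v] - D[k][v]) / (n - k)
                  for k in range(n) if D[k][v] < INF]
        if ratios:
            best = min(best, max(ratios))
    return best

# Example: a 3-cycle of mean 2 and a 2-cycle of mean 1.5 -> MCM is 1.5.
print(karp_mcm(3, [(0, 1, 1), (1, 0, 2), (1, 2, 3), (2, 0, 2)]))
```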
35

Deep Learning Model Deployment for Spaceborne Reconfigurable Hardware : A flexible acceleration approach

Ferre Martin, Javier January 2023 (has links)
Space debris and space situational awareness (SSA) have become growing concerns for national security and the sustainability of space operations, where timely detection and tracking of space objects is critical in preventing collision events. Traditional computer-vision algorithms have been used extensively to solve detection and tracking problems in flight, but recently deep learning approaches have seen widespread adoption in non-space related applications for their high accuracy. The performance-per-watt and flexibility of reconfigurable Field-Programmable Gate Arrays (FPGAs) make them a good candidate for deep learning model deployment in space, supporting in-flight updates and maintenance. However, the FPGA design cost of custom accelerators for complex algorithms remains high. The research focus of the thesis lies on novel high-level synthesis (HLS) workflows that allow the developer to raise the level of abstraction and lower design costs for deep learning accelerators, particularly for space-representative applications. To this end, four different hardware accelerators of convolutional neural network models for space-based debris detection are implemented (ResNet, SqueezeNet, DenseNet, TinyCNN), using the open-source HLS tool NNgen. The obtained hardware accelerators are deployed to a reconfigurable module of the Zynq Ultrascale+ MPSoC programmable logic, and compared in terms of inference performance, resource utilization and latency. The tests on the target hardware show a detection accuracy over 95% for ResNet, DenseNet and SqueezeNet, and a localization intersection-over-union over 0.5 for the deep models and over 0.7 for TinyCNN, for space debris objects at a range between 1 km and 100 km for a diameter of 1 cm, or between 100 km and 1000 km for a diameter of 10 cm. The obtained speed-ups with respect to software-only implementations lie between 3x and 32x for the different hardware accelerators.
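The localization metric quoted above, intersection-over-union (IoU), can be computed for axis-aligned bounding boxes as in the short sketch below; the box format is an assumption for illustration.

```python
# Intersection-over-union for axis-aligned boxes given as (x_min, y_min, x_max, y_max).
# A detection is typically counted as correct when IoU exceeds a threshold (e.g. 0.5).

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes do not intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> 25 / 175 ≈ 0.143
```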
