  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Architectural Support for Efficient Communication in Future Microprocessors

Jin, Yu Ho 16 January 2010
Traditionally, microprocessor design has focused on the computational aspects of the problem at hand. However, as the number of components on a single chip continues to increase, the design of the communication architecture has become a crucial and dominating factor in defining the performance of the overall system. On-chip networks, also known as Networks-on-Chip (NoC), emerged recently as a promising architecture to coordinate chip-wide communication. Although there are numerous interconnection network studies in an inter-chip environment, an intra-chip network design poses a number of substantial challenges to this well-established interconnection network field. This research investigates designs and applications of on-chip interconnection networks in next-generation microprocessors for optimizing performance, power consumption, and area cost. First, we present domain-specific NoC designs targeted to large-scale and wire-delay dominated L2 cache systems. The domain-specifically designed interconnect shows 38% performance improvement and uses only 12% of the mesh-based interconnect. Then, we present a methodology of communication characterization in parallel programs and the application of characterization results to long-channel reconfiguration. Reconfigured long channels suited to communication patterns improve the latency of the mesh network by 16% and 14% in 16-core and 64-core systems, respectively. Finally, we discuss an adaptive data compression technique that builds a network-wide frequent value pattern map and reduces the packet size. In two examined multi-core systems, cache traffic has 69% compressibility and shows high value sharing among flows. Compression-enabled NoC improves the latency by up to 63% and reduces energy consumption by up to 12%.
2

The Rearrangeability of Banyan-type Networks

Huang, Yi-Ming 21 July 2005
In the thesis, we study the rearrangeability of the Banyan-type network with crosstalk constraint. Let $x$, $p$ and $c$ be nonnegative integers with $0 \leq x, c \leq n$ and $n, p \geq 1$. $B_{n}(x,p,c)$ is the Banyan-type network with $2^{n+1}$ inputs, $2^{n+1}$ outputs, $x$ extra-stages, and each connection containing at most $c$ crosstalk switch elements. We give the necessary and sufficient conditions for rearrangeable Banyan-type networks $B_{n}(x,p,c)$.
3

Automorphisms generating disjoint Hamilton cycles in star graphs

Derakhshan, Parisa January 2015
In the first part of the thesis we define an automorphism φn for each star graph Stn of degree n-1, which yields permutations of labels for the edges of Stn taken from the set of integers {1, ..., ⌊n/2⌋}. By decomposing these permutations into permutation cycles, we are able to identify edge-disjoint Hamilton cycles that are automorphic images of a known two-labelled Hamilton cycle H1 2(n) in Stn. The search for edge-disjoint Hamilton cycles in star graphs is important for the design of interconnection network topologies in computer science. All our results improve on the known bounds for numbers of any kind of edge-disjoint Hamilton cycles in star graphs.
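To make the objects in this abstract concrete, the following minimal Python sketch builds a star graph and checks a Hamilton cycle. It assumes the standard textbook definition of the star graph (vertices are the permutations of 1..n, and two vertices are adjacent when one is obtained from the other by swapping the first symbol with the symbol in another position); the abstract itself does not spell this out. The function names `star_graph`, `is_hamilton_cycle`, and `walk_cycle` are illustrative and not taken from the thesis, and the automorphism-based construction of the thesis is not reproduced here.

```python
from itertools import permutations


def star_graph(n):
    """Adjacency map of the star graph on permutations of 1..n.

    Assumed (standard) definition: a neighbor is obtained by swapping
    the first symbol with the symbol at position i, for i = 2..n.
    """
    adj = {}
    for v in permutations(range(1, n + 1)):
        nbrs = []
        for i in range(1, n):
            w = list(v)
            w[0], w[i] = w[i], w[0]
            nbrs.append(tuple(w))
        adj[v] = nbrs
    return adj


def is_hamilton_cycle(adj, cycle):
    """True if `cycle` visits every vertex exactly once and closes up."""
    if len(cycle) != len(adj) or set(cycle) != set(adj):
        return False
    return all(
        cycle[(k + 1) % len(cycle)] in adj[cycle[k]]
        for k in range(len(cycle))
    )


def walk_cycle(adj, start):
    """Trace the cycle through `start` in a 2-regular graph."""
    cycle, prev, cur = [start], None, start
    while True:
        nxt = next(w for w in adj[cur] if w != prev)
        if nxt == start:
            return cycle
        cycle.append(nxt)
        prev, cur = cur, nxt


if __name__ == "__main__":
    st3 = star_graph(3)
    print(is_hamilton_cycle(st3, walk_cycle(st3, (1, 2, 3))))
```

For n = 3 the star graph is itself a single cycle on six vertices, so the walk returns a Hamilton cycle; for larger n, a checker like `is_hamilton_cycle` can validate candidate cycles, and edge-disjointness can be tested by comparing their edge sets.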
4

Network-on-chip architectures for scalability and service guarantees

Grot, Boris 13 July 2012
Rapidly increasing transistor densities have led to the emergence of richly-integrated substrates in the form of chip multiprocessors and systems-on-a-chip. These devices integrate a variety of discrete resources, such as processing cores and cache memories, on a single die with the degree of integration growing in accordance with Moore's law. In this dissertation, we address challenges of scalability and quality-of-service (QOS) in network architectures of highly-integrated chips. The proposed techniques address the principal sources of inefficiency in networks-on-chip (NOCs) in the form of performance, area, and energy overheads. We also present a comprehensive network architecture capable of interconnecting over a thousand discrete resources with high efficiency and strong guarantees. We first show that mesh networks, commonly employed in existing chips, fall significantly short of achieving their performance potential due to transient congestion effects that diminish network performance. Adaptive routing has the potential to improve performance through better load distribution. However, we find that existing approaches are myopic in that they only consider local congestion indicators and fail to take global network state into account. Our approach, called Regional Congestion Awareness (RCA), improves network visibility in adaptive routers via a light-weight mechanism for propagating and integrating congestion information. By leveraging both local and non-local congestion indicators, RCA improves network load balance and boosts throughput. Under a set of parallel workloads running on a 49-node substrate, RCA reduces on-chip network latency by 16%, on average, compared to a locally-adaptive router. Next, we target NOC latency and energy efficiency through a novel point-to-multipoint topology. 
Ring and mesh networks, favored in existing on-chip interconnects, often require packets to go through a number of intermediate routers between source and destination nodes, resulting in significant latency and energy overheads. Topologies that improve connectivity, such as fat tree and flattened butterfly, eliminate much of the router overhead, but require non-minimal channel lengths or large channel count, reducing energy-efficiency and/or performance as a result. We propose a new topology, called Multidrop Express Channels (MECS), that augments minimally-routed express channels with multi-drop capability. The resulting richly-connected NOC enjoys a low hop count with favorable delay and energy characteristics, while improving wire utilization over prior proposals. Applications such as virtualized servers-on-a-chip and real-time systems require chip-level quality-of-service (QOS) support to provide fairness, service differentiation, and guarantees. Existing network QOS approaches suffer from considerable performance and area overheads that limit their usefulness in a resource-limited on-die network. In this dissertation, we propose a new QOS scheme called Preemptive Virtual Clock (PVC). PVC uses a preemptive approach to provide hard guarantees and strong performance isolation while dramatically reducing queuing requirements that burden prior proposals. Finally, we introduce a comprehensive network architecture that overcomes the bottlenecks of earlier designs with respect to area, energy, and QOS in future highly-integrated chips. The proposed NOC uses a topology-centric QOS approach that restricts the extent of hardware QOS support to a fraction of the network without compromising guarantees. In doing so, network area and energy efficiency are significantly improved. Further improvements are derived through a novel flow-control mechanism, along with switch- and link-level optimizations. 
In concert, these techniques yield a network capable of interconnecting over a thousand terminals on a die while consuming 47% less area and 26% less power than a state-of-the-art QOS-enabled NOC. The mechanisms proposed in this dissertation are synergistic and enable efficient, high-performance interconnects for future chips integrating hundreds or thousands of on-die resources. They address deficiencies in routing, topologies, and flow control of existing architectures with respect to area, energy, and performance scalability. They also serve as a building block for cost-effective advanced services, such as QOS guarantees at the die level.
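The hop-count argument in this abstract can be illustrated with a small Python sketch. It compares a 2D mesh, where a packet takes one router hop per unit of Manhattan distance, with an idealized express-channel model in which any node in the same row or column is reachable in a single hop. That second model is an assumption made here for illustration only; it is not a description of the MECS design, and the function names are hypothetical.

```python
from itertools import product


def mesh_hops(src, dst):
    """Hops in a 2D mesh with minimal routing: Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def express_hops(src, dst):
    """Hops in the assumed express model: one hop per dimension
    in which the source and destination coordinates differ."""
    return int(src[0] != dst[0]) + int(src[1] != dst[1])


def average_hops(k, hop_fn):
    """Average hop count over all ordered pairs of distinct nodes
    in a k x k grid."""
    nodes = list(product(range(k), repeat=2))
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    return sum(hop_fn(s, d) for s, d in pairs) / len(pairs)


if __name__ == "__main__":
    for k in (4, 8):
        print(k, average_hops(k, mesh_hops), average_hops(k, express_hops))
```

Under this model the express-channel hop count never exceeds two regardless of grid size, while the mesh average grows with k, which is the kind of gap the abstract refers to when it mentions intermediate routers between source and destination.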
5

FPGA interconnection networks with capacitive boosting in strong and weak inversion

Eslami, Fatemeh 22 August 2012
Designers of Field-Programmable Gate Arrays (FPGAs) are always striving to improve the speed of their designs. The propagation delay of FPGA interconnection networks is a major challenge and continues to grow with newer technologies. FPGA interconnection networks are implemented using NMOS pass-transistor-based multiplexers followed by buffers. The threshold voltage drop across an NMOS device degrades the high logic value, and results in unbalanced rising and falling edges, static power consumption due to crowbar currents, and reduced noise margins. In this work, circuit design techniques to construct interconnection circuits with capacitive boosting are proposed. By using capacitive boosting in FPGA interconnection networks, the signal transitions are accelerated and the crowbar currents of downstream buffers are reduced. In addition, buffers can be non-skewed or slightly skewed to improve the noise immunity of the interconnection network. Results indicate that by using the presented circuit design technique, the propagation delay can be reduced by at least 10% versus prior art at the expense of a slight increase in silicon area. In addition, in a bid to reduce power consumption in reconfigurable arrays, operation in the weak inversion region has been suggested. Current programmable interconnections cannot be directly used in this region due to a very poor propagation delay and sensitivity to Process-Voltage-Temperature (PVT) variations. This work also focuses on designing a common structure for FPGA interconnection networks that can operate in both strong and weak inversion. We propose to use capacitive boosting together with a new circuit design technique, called Twins transmission gates, in implementing FPGA interconnect multiplexers. We also propose to use capacitive boosting in designing buffers. 
This way, the operation region of the interconnection circuitry is shifted away from weak inversion toward strong inversion, resulting in improved speed and enhanced tolerance to PVT variations. Simulation results indicate that using capacitive boosting to implement the interconnection network can have a significant influence on delay and tolerance to variations. The interconnection network with capacitive boosting is at least 34% faster than prior art in weak inversion.
6

On designing coarse grain reconfigurable arrays to operate in weak inversion

Ross, Dian Marie 17 December 2012
Field Programmable Gate Arrays (FPGAs) support the reconfigurable computing paradigm by providing an integrated circuit hardware platform that facilitates software-like reconfigurability. The addition of an embedded microprocessor and peripherals to traditional FPGA Combinational Logic Blocks (CLBs) interleaved with interconnections has effectively resulted in a programmable system on-chip. FPGAs are used to support flexible implementations of Application Specific Integrated Circuit (ASIC) functions. Because FPGAs are reconfigurable, they often are used in place of ASICs during the circuit design process. FPGAs are also used when only a small number of ICs are required: ASICs necessitate large manufacturing runs to be economically viable; for smaller runs the use of FPGAs is an economic alternative. Application domains of interest, such as intelligent guidance systems, medical devices, and sensors, often require low-power, inexpensive calculation of transcendental functions. COordinate Rotation DIgital Computer (CORDIC) is an iterative algorithm used to emulate hardware-expensive multipliers, such as Multiply/ACcumulate (MAC) units, with only shift and add operations. However, because CORDIC is a sequential algorithm, characterized as having the latency of a serial multiplier, techniques that speed up computational performance have many applications. To this end, three implementations of standard CORDIC, (i) unrolled hardwired, (ii) unrolled programmable, and (iii) rolled programmable, were implemented on four Xilinx FPGA families: Virtex-4, -5, and -6, and Spartan-6. Although hardwired unrolled was found to have the greatest speed at the expense of no runtime flexibility, and rolled programmable was found to have the greatest flexibility and lowest silicon area consumption at the expense of the longest propagation delay, improvements to CORDIC implementations were still sought. 
Three parallelized CORDIC techniques, P-CORDIC, Flat-CORDIC, and Para-CORDIC, were implemented on the same four FPGA families. P-CORDIC and Flat-CORDIC were shown to have the lowest latency under various conditions; Para-CORDIC was found to perform well in deeply pipelined, high-throughput circuits. Design rules for when to use standard versus precomputation CORDIC techniques are presented. To address the low power requirements of many applications of interest, the Unfolded Multiplexor-LRB (UMUX-LRB), patent held by Sima et al., was analyzed in weak inversion across four transistor technology nodes (180nm, 130nm, 90nm, and 65nm). Previous work was also expanded from strong inversion across 180nm, 130nm, and 90nm technology nodes to also include 65nm. The UMUX-LRB interconnection network is based upon the Xilinx commercial interconnection network. Therefore, this network (MUX-LRB), and another static circuit technique, CMOS-Transmission Gates (CMOS-TG), were profiled across all four technology nodes to provide a baseline of comparison. This analysis found the UMUX-LRB to have the smallest and most balanced rising and falling edge propagation delay, in addition to having the greatest reliability for temperature and process variation.
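As a rough illustration of the shift-and-add iteration that the abstract attributes to standard CORDIC, here is a minimal Python sketch of rotation-mode CORDIC computing sine and cosine. It is a floating-point emulation written for this listing, not any of the FPGA implementations described in the thesis: multiplication by `2.0 ** -i` stands in for a hardware right shift by i bits, the arctangent table and gain constant would be precomputed in hardware, and the name `cordic_sin_cos` and the default of 32 iterations are assumptions.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC: returns (sin(theta), cos(theta)).

    Valid for |theta| up to roughly 1.74 rad, the convergence range
    of the basic iteration without argument reduction.
    """
    # Table of elementary rotation angles atan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Accumulated gain of the micro-rotations; starting from 1/gain
    # removes the need for a final multiplication.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward z = 0
        shift = 2.0 ** -i                    # emulates ">> i"
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    return y, x


if __name__ == "__main__":
    s, c = cordic_sin_cos(math.pi / 6)
    print(s, c)
```

The single loop corresponds loosely to a "rolled" structure in which one stage is reused every iteration; an "unrolled" design would instantiate each iteration as its own stage, trading area for speed as the abstract describes.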
7

Energy Demand Response for High-Performance Computing Systems

Ahmed, Kishwar 22 March 2018
The growing computational demand of scientific applications has greatly motivated the development of large-scale high-performance computing (HPC) systems in the past decade. To accommodate the increasing demand of applications, HPC systems have been going through dramatic architectural changes (e.g., introduction of many-core and multi-core systems, rapid growth of complex interconnection network for efficient communication between thousands of nodes), as well as significant increase in size (e.g., modern supercomputers consist of hundreds of thousands of nodes). With such changes in architecture and size, the energy consumption by these systems has increased significantly. With the advent of exascale supercomputers in the next few years, power consumption of the HPC systems will surely increase; some systems may even consume hundreds of megawatts of electricity. Demand response programs are designed to help the energy service providers to stabilize the power system by reducing the energy consumption of participating systems during the time periods of high demand power usage or temporary shortage in power supply. This dissertation focuses on developing energy-efficient demand-response models and algorithms to enable HPC system's demand response participation. In the first part, we present interconnection network models for performance prediction of large-scale HPC applications. They are based on interconnected topologies widely used in HPC systems: dragonfly, torus, and fat-tree. Our interconnect models are fully integrated with an implementation of message-passing interface (MPI) that can mimic most of its functions with packet-level accuracy. Extensive experiments show that our integrated models provide good accuracy for predicting the network behavior, while at the same time allowing for good parallel scaling performance. 
In the second part, we present an energy-efficient demand-response model to reduce HPC systems' energy consumption during demand response periods. We propose HPC job scheduling and resource provisioning schemes to enable HPC system's emergency demand response participation. In the final part, we propose an economic demand-response model to allow both HPC operator and HPC users to jointly reduce HPC system's energy cost. Our proposed model allows the participation of HPC systems in economic demand-response programs through a contract-based rewarding scheme that can incentivize HPC users to participate in demand response.
8

An Interconnection Network for a Cache Coherent System on FPGAs

Mirian, Vincent 12 January 2011
Field-Programmable Gate Array (FPGA) systems now comprise many processing elements, both processors running software and hardware engines used to accelerate specific functions. To make the programming of such a system simpler, it is easiest to think of a shared-memory environment, much like in current multi-core processor systems. This thesis introduces a novel, shared-memory, cache-coherent infrastructure for heterogeneous systems implemented on FPGAs that can then form the basis of a shared-memory programming model for heterogeneous systems. With simulation results, it is shown that the cache-coherent infrastructure outperforms the infrastructure of Woods [1] with a speedup of 1.10. The thesis explores the various configurations of the cache interconnection network and the benefit of the cache-to-cache cache line data transfer with its impact on main memory access. Finally, the thesis shows the cache-coherent infrastructure has very little overhead when using its cache coherence implementation.
10

Analyse et optimisation des performances électriques des réseaux d'interconnexions et des composants passifs dans les empilements 3D de circuits intégrés / Analysis and optimization of electrical performance of interconnections networks and passives components used in 3D integrated circuits

Roullard, Julie 15 December 2011
This PhD work deals with the characterization, modeling, and optimization of the electrical performance of interconnection networks in 3D stacks of integrated circuits. First, characterization tools were developed for the basic interconnect elements specific to 3D integration: ReDistribution Layer (RDL) interconnects, Back End Of Line (BEOL) interconnects, Through Silicon Vias (TSV), and copper pillars (Cu-Pillar). Equivalent electrical models are proposed and validated over a very wide frequency band (MHz-GHz) by electromagnetic modeling. An analysis of the electrical performance of complete interconnection chains in 3D chip stacks is then carried out. Face to Face, Face to Back, and Interposer stacking are compared in order to establish their respective performance in terms of transmission speed. A study is also carried out on 2D inductors integrated in the BEOL, whose electrical performance is strongly affected by the stacking of silicon substrates. The last part is dedicated to optimization strategies for 3D circuit performance in order to maximize operating frequency, minimize propagation delays, and ensure signal integrity (eye diagram). Answers are given to 3D circuit designers regarding the best choices of chip orientation, routing, and integration density. These results are applied to a concrete 3D "memory on processor" application (Wide I/O), for which meeting the required data rate specifications (Gbps) remains a real challenge.
