1. High Performance Interconnect System Design for Future Chip Multiprocessors
Wang, Lei. 03 October 2013.
Chip Multi-Processor (CMP) architectures have become mainstream for designing processors. With a large number of cores, a Network-On-Chip (NOC) provides a scalable communication method for CMP architectures. The NOC must be carefully designed to meet power and area constraints while providing ultra-low latency and high throughput. In this research, we explore different techniques to design high performance NOCs.
First, existing NOCs mostly use Dimension Order Routing (DOR) to determine the route taken by a packet in unicast traffic. However, with the development of diverse applications on CMPs, one-to-many (multicast) and one-to-all (broadcast) traffic are becoming more common, and current unicast routing cannot support multicast and broadcast traffic efficiently. We propose Recursive Partitioning Multicast (RPM) routing and a detailed multicast wormhole router design for NOCs. RPM allows routers to select intermediate replication nodes based on the global distribution of destination nodes. This provides more path diversity, improves bandwidth efficiency, and ultimately improves the performance of the whole network.
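To make the replication idea concrete, here is a minimal sketch of recursive-partitioning multicast on a 2D mesh, written in Python for illustration. The four-way partition and the port names are simplifying assumptions; the dissertation's actual RPM rules partition the network more finely and apply priority rules when assigning parts to output ports.

```python
# Hypothetical sketch of recursive-partitioning multicast on a 2D mesh.
# Node coordinates are (x, y); ports are 'N', 'S', 'E', 'W', 'LOCAL'.
# The real RPM part-merging rules are more involved; this only shows
# replication driven by the global destination distribution.

def partition(cur, dests):
    """Group destinations by their position relative to the current node."""
    cx, cy = cur
    parts = {'N': [], 'S': [], 'E': [], 'W': [], 'LOCAL': []}
    for (x, y) in dests:
        if (x, y) == cur:
            parts['LOCAL'].append((x, y))
        elif y > cy:
            parts['N'].append((x, y))
        elif y < cy:
            parts['S'].append((x, y))
        elif x > cx:
            parts['E'].append((x, y))
        else:
            parts['W'].append((x, y))
    return parts

def rpm_replicate(cur, dests):
    """Return {output_port: destination subset} replication decisions."""
    return {port: subset
            for port, subset in partition(cur, dests).items() if subset}

# One replica per populated part; each downstream router repeats the same
# decision, so the multicast tree is built hop by hop.
print(rpm_replicate((2, 2), [(0, 3), (4, 2), (2, 0), (2, 2)]))
```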
Second, as feature size shrinks, wires are becoming an abundant resource in NOCs. Since an NOC can benefit from high wire density, having no pin-count limits and fast signaling rates, it is critical that the NOC router design fully utilize these wire resources to provide high performance. We propose an Adaptive Physical Channel Regulator (APCR) for NOC routers to exploit the huge wiring resources. The flit size in an APCR router is smaller than the physical channel width (phit size) to provide finer-granularity flow control, and an APCR router allows flits from different packets or flows to share the same physical channel in a single cycle. The three regulation schemes (Monopolizing, Fair-sharing, and Channel-stealing) intelligently allocate the output channel resources considering not only the availability of physical channels but also the occupancy of input buffers. In an APCR router, each Virtual Channel (VC) can forward a dynamic number of flits every cycle depending on the run-time network status.
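The following sketch illustrates the channel-stealing flavor of regulation: with a phit several flits wide, the switch winner takes the slots it can fill and other VCs claim the leftovers. This is a hypothetical simplification; the actual APCR allocation also weighs input-buffer occupancy, as described above.

```python
# Hypothetical per-cycle slot allocation at one APCR-style router output.
# A phit holds slots_per_phit flits, so several VCs may share the
# physical channel in the same cycle under channel-stealing.

def allocate_slots(vcs, slots_per_phit):
    """vcs: list of (vc_id, flits_ready); returns {vc_id: flits granted}."""
    grant = {}
    remaining = slots_per_phit
    # The switch-allocation winner (here simply the most-occupied VC)
    # gets first claim on the physical channel.
    for vc_id, ready in sorted(vcs, key=lambda v: -v[1]):
        if remaining == 0:
            break
        take = min(ready, remaining)   # steal whatever slots are left
        if take:
            grant[vc_id] = take
            remaining -= take
    return grant

# Fair-sharing would instead cap each VC at slots_per_phit // len(vcs);
# monopolizing would grant every slot to the winner alone.
print(allocate_slots([(0, 3), (1, 2), (2, 1)], slots_per_phit=4))
```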
Third, nanophotonics has been proposed to design low latency and high bandwidth NOCs for future CMPs. Recent nanophotonic NOC designs adopt token-based arbitration coupled with credit-based flow control, which leads to low bandwidth utilization. We propose two handshake schemes for nanophotonic interconnects in CMPs, Global Handshake (GHS) and Distributed Handshake (DHS), which eliminate the traditional credit-based flow control, reduce the average token waiting time, and ultimately improve network throughput. Furthermore, we enhance the basic handshake schemes with set-aside buffer and circulation techniques to overcome Head-Of-Line (HOL) blocking.
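As a rough illustration of why a handshake can replace credits, the sketch below has the sender transmit optimistically and retry on a NACK instead of stalling for credit returns. The buffer depth, drain rate, and retry policy are all invented for the example, not taken from the GHS/DHS designs.

```python
# Hypothetical handshake-style flow control: the sender keeps no credit
# state; the receiver ACKs a flit it can buffer and NACKs when full,
# and a NACKed flit is simply retransmitted later.
import collections

class Receiver:
    def __init__(self, depth):
        self.buf = collections.deque(maxlen=depth)

    def deliver(self, flit):
        if len(self.buf) == self.buf.maxlen:
            return 'NACK'            # full: sender will retry
        self.buf.append(flit)
        return 'ACK'

    def drain(self):
        if self.buf:
            self.buf.popleft()       # downstream consumes one flit

def send_all(rx, flits):
    pending = collections.deque(flits)
    cycles = 0
    while pending:
        cycles += 1
        if rx.deliver(pending[0]) == 'ACK':
            pending.popleft()        # optimistic send succeeded
        if cycles % 2 == 0:
            rx.drain()               # assumed drain rate: every 2nd cycle
    return cycles

print(send_all(Receiver(depth=2), range(8)))
```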
2. Accelerating Communication in On-Chip Interconnection Networks
Ahn, Minseon. May 2012.
Due to the ever-shrinking feature size in CMOS process technology, it is expected that future chip multiprocessors (CMPs) will have hundreds or thousands of processing cores. To support a massively large number of cores, packet-switched on-chip interconnection networks have become a de facto communication paradigm in CMPs. However, the on-chip networks have several drawbacks, such as limited on-chip resources, increasing communication latency, and insufficient communication bandwidth.
In this dissertation, several schemes are proposed to accelerate communication in on-chip interconnection networks and overcome these problems within area and cost budgets. First, an early transition scheme for fully adaptive routing algorithms is proposed to improve network throughput. With a limited number of resources, previously proposed fully adaptive routing algorithms leave escape channels underutilized; to increase their utilization, the scheme transfers packets to escape channels before the normal channels become full. Second, a pseudo-circuit scheme is proposed to reduce network latency by exploiting communication temporal locality. Reducing per-hop router delay becomes more important for communication latency reduction in larger on-chip interconnection networks, so the previous arbitration information is reused to bypass switch arbitration. For further acceleration, we also propose two aggressive schemes, pseudo-circuit speculation and buffer bypassing. Third, two handshake schemes are proposed to improve network throughput for nanophotonic interconnects, which have been proposed to replace metal wires with optical links in on-chip interconnection networks for low latency, low power consumption, and high bandwidth. To minimize the average token waiting time of the nanophotonic interconnects, the traditional credit-based flow control is removed; the handshake schemes thereby increase link utilization and enhance network throughput.
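A tiny sketch of the pseudo-circuit idea, with invented port names: each input port remembers its last switch-arbitration grant, and a packet requesting the same output reuses the still-configured crossbar connection instead of re-arbitrating.

```python
# Hypothetical pseudo-circuit check at one router: reuse the previous
# arbitration result when consecutive packets at an input port request
# the same output port, skipping the switch-arbitration stage.

class PseudoCircuitRouter:
    def __init__(self):
        self.last_grant = {}   # input port -> output port still connected

    def traverse(self, in_port, out_port):
        if self.last_grant.get(in_port) == out_port:
            return 'bypass'    # pseudo-circuit hit: no switch arbitration
        # ... full switch arbitration would run here ...
        self.last_grant[in_port] = out_port   # leave the crossbar set up
        return 'arbitrate'

r = PseudoCircuitRouter()
# Communication temporal locality: the second packet reuses the circuit.
print(r.traverse('west', 'east'))   # arbitrate
print(r.traverse('west', 'east'))   # bypass
```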
3. Hardware/software deadlock avoidance for multiprocessor multiresource system-on-a-chip
Lee, Jaehwan. January 2004. (PDF)
Thesis (Ph.D.), Electrical and Computer Engineering, Georgia Institute of Technology, 2005. Committee: Vincent John Mooney III (Chair); Panagiotis Manolios; Douglas M. Blough; William D. Hunt; Sung Kyu Lim. Includes vita and bibliographical references.
4. Hardware/software deadlock avoidance for multiprocessor multiresource system-on-a-chip
Lee, Jaehwan. January 2004. (PDF)
Thesis (Ph.D.), Georgia Institute of Technology, 2004. Department of Electrical and Computer Engineering. Includes vita and bibliographical references (p. 142-146).
5. Fair and high performance shared memory resource management
Ebrahimi, Eiman. 31 January 2012.
Chip multiprocessors (CMPs) commonly share a large portion of memory system resources among different cores. Since memory requests from different threads executing on different cores significantly interfere with one another in these shared resources, the design of the shared memory subsystem is crucial for achieving high performance and fairness.
Inter-thread memory system interference has different implications based on the type of workload running on a CMP. In multi-programmed workloads, different applications can experience significantly different slowdowns. If left uncontrolled, large disparities in slowdowns result in low system performance and make system software's priority-based thread scheduling policies ineffective. In a single multi-threaded application, memory system interference between threads of the same application can slow each thread down significantly. Most importantly, the critical path of execution can also be significantly slowed down, resulting in increased application execution time.
This dissertation proposes three mechanisms that address different shortcomings of current shared resource management techniques targeted at multi-programmed workloads, and one mechanism which speeds up a single multi-threaded application by managing main-memory related interference between its different threads.
With multi-programmed workloads, the key idea is that both demand- and prefetch-caused inter-application interference should be taken into account in shared resource management techniques across the entire shared memory system. Our evaluations demonstrate that doing so significantly improves both system performance and fairness compared to the state-of-the-art. When executing a single multi-threaded application on a CMP, the key idea is to take into account the inter-dependence of threads in memory scheduling decisions. Our evaluation shows that doing so significantly reduces the execution time of the multi-threaded application compared to using state-of-the-art memory schedulers designed for multi-programmed workloads.
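As a hypothetical illustration of the inter-dependence idea (not the dissertation's exact mechanism), a memory scheduler might favor requests from the thread currently bounding the critical path, approximated here by an assumed per-thread progress counter:

```python
# Hypothetical single scheduling decision: among queued memory requests,
# favor the thread that is farthest behind (e.g. most stalled before a
# barrier), since it likely bounds the multi-threaded application's
# critical path. The progress metric is an invented stand-in.

def pick_request(queue, progress):
    """queue: list of (thread_id, address); progress: thread_id -> work done.
    Returns the request of the most lagging (likely critical) thread."""
    return min(queue, key=lambda req: progress[req[0]])

queue = [(0, 0x100), (1, 0x200), (2, 0x300)]
progress = {0: 90, 1: 40, 2: 75}        # thread 1 is the laggard
print(pick_request(queue, progress))    # -> (1, 512)
```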
This dissertation concludes that the performance and fairness of CMPs can be significantly improved by better management of inter-thread interference in the shared memory resources, both for multi-programmed workloads and multi-threaded applications.
6. Data Processing Techniques on Modern Hardware Architectures
Tsirogiannis, Dimitrios. 31 August 2011.
The last decade has been characterized by radical changes in the computing landscape. We have witnessed the advent of multi-core processors, flash-based storage systems, and the proliferation of scale-out architectures, such as map-reduce-based systems and massively parallel databases. Although data management systems have embraced modern hardware technologies to some extent, they have not realized their full potential.
The goal of this thesis is two-fold. Primarily, it demonstrates the staggering potential for performance improvement offered by modern hardware architectures and then proposes how data management systems must change in order to realize this potential. Additionally, this thesis demonstrates that utilizing modern hardware architectures is important both for performance and for energy efficiency. Towards this goal, we propose query processing and indexing techniques for chip multiprocessors and analyze the trade-offs of executing complex database queries on modern processor technologies. Subsequently, we propose query processing methods tailored to flash-based storage systems. Finally, we analyze the power consumption of database systems and reveal opportunities for improving their energy efficiency.
7. Technology Impacts of CMOS Scaling on Microprocessor Core Design for Hard-Fault Tolerance in Single-Core Applications and Optimized Throughput in Throughput-Oriented Chip Multiprocessors
Bower, Fred. January 2010.
The continued march of technological progress, epitomized by Moore's Law, provides the microarchitect with increasing numbers of transistors to employ as we continue to shrink feature geometries. Physical limitations impose new constraints upon designers in the areas of overall power and localized power density. Techniques to scale threshold and supply voltages to lower values in order to reduce power consumption of the part have also run into physical limitations, exacerbating power and cooling problems in deep sub-micron CMOS process generations. Smaller device geometries are also subject to increased sensitivity to common failure modes as well as manufacturing process variability.
In the face of these added challenges, we observe a shift in the focus of the industry, away from building ever-larger single-core chips focused on reducing single-threaded latency, toward a design approach that employs multiple cores on a single chip to improve throughput. While the early multicore era utilized the existing single-core designs of the previous generation in small numbers, subsequent generations have introduced cores tailored to multicore use. These cores seek to achieve power-efficient throughput and have led to a new emphasis on throughput-oriented computing, particularly for Internet workloads, where the end-to-end computational task is dominated by long-latency network operations. The ubiquity of these workloads makes a compelling argument for throughput-oriented designs, but does not fully free the microarchitect from the latency demands of common workloads in enterprise and desktop application spaces.
We believe that a continued need for both throughput-oriented and latency-sensitive processors will exist in coming generations of technology. We further opine that making effective use of the additional transistors that will be available may require different techniques for latency-sensitive designs than for throughput-oriented ones, since we may trade latency or throughput for the desired attribute of a core in each of the respective paradigms.
We make three major contributions with this thesis. Our first contribution is a fine-grained fault diagnosis and deconfiguration technique for array structures, such as the ROB, within the microprocessor core. We present and evaluate two variants of this technique. The first variant uses an existing fault detection and correction technique, whose scope is the processor core execution pipeline, to ensure correct processor operation. The second variant integrates fault detection and correction into the array structure itself to provide a self-contained, fine-grained fault detection, diagnosis, and repair technique.
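A minimal sketch of the deconfiguration side of this technique, assuming a simple circular ROB and an already-populated fault map; the diagnosis logic that sets the map is the substance of the contribution and is not modeled here.

```python
# Hypothetical fault map over ROB entries: allocation simply skips
# entries diagnosed as faulty, trading a little capacity for continued
# correct operation after a hard fault.

class DeconfigurableROB:
    def __init__(self, size):
        self.faulty = [False] * size     # fine-grained fault map

    def mark_faulty(self, idx):
        self.faulty[idx] = True          # entry diagnosed and deconfigured

    def allocate(self, head):
        """Return the next usable entry at/after head, wrapping around."""
        n = len(self.faulty)
        for off in range(n):
            idx = (head + off) % n
            if not self.faulty[idx]:
                return idx
        raise RuntimeError('all entries deconfigured')

rob = DeconfigurableROB(8)
rob.mark_faulty(3)
print(rob.allocate(3))   # -> 4: the faulty entry is skipped
```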
In our second contribution, we develop a lightweight, fine-grained fault diagnosis mechanism for the processor core. In this work, we leverage the first contribution's methods to provide deconfiguration of faulty array elements. We additionally extend the scope of that work to include all pipeline circuitry from instruction issue to retirement.
In our third and final contribution, we focus on throughput-oriented core data cache design. In this work, we study the demands of the throughput-oriented core running a representative workload and then propose and evaluate an alternative data cache implementation that more closely matches the demands of the core. We then show that a better-matched cache design can be exploited to provide improved throughput under a fixed power budget.
Our results show that typical latency-sensitive cores have sufficient redundancy to make fine-grained hard-fault tolerance an affordable alternative for hardening complex designs. Our designs suffer little or no performance loss when no faults are present and retain nearly the same performance characteristics in the presence of small numbers of hard faults in protected structures. In our study of the latency-sensitive core, we have shown that SRAM-based designs have low latencies that end up providing less benefit to a throughput-oriented core and workload than a better-fitted data cache composed of DRAM. The move from a high-power, fast technology to a lower-power, slower technology allows us to increase L1 data cache capacity, which is a net benefit for the throughput-oriented core.
8. Study of the Hyperscalar Multi-core Architecture
Chou, Yu-Liang. 07 September 2011.
Current trends in processor design have migrated toward chip multiprocessors (CMPs). CMPs are designed to exploit both instruction-level parallelism (ILP) within processors and thread-level parallelism (TLP) within and across processors. However, the conventional design of current CMPs forces a choice between high single-thread performance and high peak throughput; this inability to adjust to varying levels of ILP and TLP results in processor inefficiency. To cope with this dilemma confronting CMP designers, this dissertation proposes the hyperscalar concept for multi-core designs. The hyperscalar concept enables a multi-core architecture to dynamically group many scalar in-order cores into a superscalar processor to accelerate a sequential thread. This reconfiguration capability gives the hyperscalar architecture high flexibility in adapting to different types of applications, providing high single-thread performance when TLP is low and high throughput when TLP is high.
Based on the hyperscalar concept, this dissertation first proposes a hyperscalar dual-core architecture. It can play three different roles: a 2-issue statically scheduled superscalar processor, a homogeneous dual-core processor, or a standalone single-core processor. An Instruction-dependency Analyzer (IA) that connects the two scalar in-order cores is designed to handle the role switching, making it possible for the two cores to work together like a 2-issue statically scheduled superscalar processor. The IA dispatches instructions with data dependencies to the same core so that the dependencies can be resolved by the core's existing forwarding paths. Simulation results show that when the proposed architecture works in the statically scheduled superscalar manner, it achieves 30.3% higher instructions per cycle (IPC) than a traditional five-stage pipelined core across 35 benchmarks from the MiBench suite. The increases in area and power for extending a homogeneous dual-core processor to a hyperscalar dual-core processor are only 1.8% and 1.75%, respectively, in a 90nm CMOS technology.
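A sketch of the IA's steering rule, using invented data structures: instructions whose source registers were produced on one core follow the producer to that core, so forwarding resolves the dependence, while independent instructions are spread across both cores.

```python
# Hypothetical dependency-aware steering for a hyperscalar dual-core:
# dependent instructions are sent to the producer's core; independent
# pairs dual-issue across the two cores.

def steer(pair, owner):
    """pair: two instrs as (dest_reg, src_regs); owner: reg -> core id.
    Returns the core chosen for each instruction."""
    cores = []
    for i, (dest, srcs) in enumerate(pair):
        home = {owner[r] for r in srcs if r in owner}
        core = home.pop() if len(home) == 1 else i % 2  # follow producer
        owner[dest] = core
        cores.append(core)
    return cores

owner = {}
print(steer([('r1', []), ('r2', ['r1'])], owner))  # -> [0, 0]: same core
print(steer([('r3', []), ('r4', [])], owner))      # -> [0, 1]: dual-issue
```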
This dissertation then extends the hyperscalar dual-core architecture to a hyperscalar multi-core architecture capable of flexibly providing high throughput for uniform parallel applications as well as high performance for more general workloads. It can dynamically unite many scalar cores into a larger out-of-order (OOO) superscalar processor to accelerate a thread. To accomplish this, the Virtual Shared Register File (VSRF) concept is proposed so that the instructions of a thread running on different cores logically see a uniform register file. Simulation results show that the 2-, 4-, 8-, 16-, and 32-core-united configurations of the hyperscalar multi-core architecture achieve 95%, 84%, 82%, 85%, and 90% of the performance of monolithic 2-, 4-, 8-, 16-, and 32-issue OOO superscalar processors on the SPEC2000 benchmarks.
Finally, this dissertation proposes a new technique, called multi-streaming SIMD, applicable to the hyperscalar architecture for efficiently exploiting data-level parallelism (DLP). The multi-streaming SIMD technique enables current multimedia extensions to manipulate multiple data streams simultaneously. Simulation results show that when a multi-streaming SIMD computing engine has four 4-register multimedia operation storage units, it provides a 3.3x to 5.5x performance improvement over traditional MMX extensions on twelve multimedia kernels. Together, these techniques realize a promising architecture for future multi-core designs.
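The contrast with conventional SIMD can be sketched in a few lines, with NumPy arrays standing in for vector registers: several independent streams are packed into one wide vector so that a single operation advances all of them, which is the gist of multi-streaming rather than its hardware realization.

```python
# Hypothetical illustration: three independent 4-element streams packed
# into one 12-lane vector, so one SIMD add advances all three streams
# in a single instruction. NumPy is a stand-in for vector registers.
import numpy as np

lanes = np.concatenate([np.array([1, 2, 3, 4]),      # stream 0
                        np.array([10, 20, 30, 40]),  # stream 1
                        np.array([5, 6, 7, 8])])     # stream 2
result = lanes + 1          # one vector op, three streams advanced
print(result.reshape(3, 4)) # unpack back into per-stream views
```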
9. Memory-subsystem resource management for the many-core era
Kaseridis, Dimitrios. 11 July 2012.
As semiconductor technology continues to scale lower in the nanometer era, the communication between processor and main memory has been particularly challenged. The well-studied frequency, memory, and power "walls" have redirected architects toward the Chip Multiprocessor (CMP) as an attractive architecture for leveraging technology scaling. In order to achieve high efficiency and throughput, CMPs rely heavily on sharing resources among multiple cores, especially in the case of the memory hierarchy. Unfortunately, such sharing introduces resource contention and interference between the multiple executing threads.
The ever-increasing access latency gap between processor and memory, the gradually increasing bandwidth demands on main memory, and the decreasing cache capacity available to each core as more cores are integrated have made the need for efficient memory subsystem resource management more critical than ever before.
This dissertation focuses on managing the sharing of Last-level Cache (LLC) capacity and main memory bandwidth, the two resources that most significantly affect system performance and energy consumption. The presented schemes include efficient solutions to all three basic requirements for implementing a resource management scheme: a) profiling mechanisms to capture applications' resource requirements, b) microarchitecture mechanisms to enforce a resource allocation scheme, and c) resource allocation algorithms/policies to manage the available memory resources throughout the whole memory hierarchy of a CMP system.
To achieve these targets, the dissertation first describes a set of low-overhead, non-invasive profiling mechanisms that project applications' memory resource requirements and memory sharing behavior. Two memory resource partitioning schemes are then presented. The first, the Bank-aware dynamic partitioning scheme, provides a low-overhead solution for partitioning cache resources of large CMP architectures based on a Dynamic Non-Uniform Cache Architecture (DNUCA) last-level cache design, consistent with current industry trends. The second, the Bandwidth-aware dynamic scheme, presents a system-wide optimization of memory-subsystem resource allocation and job scheduling for large, multi-chip CMP systems, seeking optimizations both within and outside single CMP chips and aiming at overall system throughput and efficiency improvements.
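As a hypothetical illustration of how profiled requirements can drive an allocation policy (a greedy utility heuristic, not the dissertation's Bank-aware algorithm), consider way-partitioning from per-application miss curves:

```python
# Hypothetical greedy way-partitioning driven by profiled miss curves,
# illustrating the profiling -> allocation-policy split described above;
# the Bank-aware scheme additionally accounts for DNUCA bank placement.

def partition_ways(miss_curves, total_ways):
    """miss_curves[app][w] = misses if app gets w ways (w = 0..total)."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # Give the next way to whoever saves the most misses with it.
        gains = [curve[a + 1] - curve[a] if a + 1 < len(curve) else 0
                 for curve, a in zip(miss_curves, alloc)]
        winner = gains.index(min(gains))   # most negative = biggest saving
        alloc[winner] += 1
    return alloc

curves = [
    [100, 60, 45, 40, 38, 37, 37, 37, 37],  # cache-friendly application
    [100, 95, 92, 90, 89, 88, 88, 88, 88],  # streaming application
]
print(partition_ways(curves, total_ways=8))  # -> [5, 3]
```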
Because cache partitioning schemes with isolated partitions impose restrictions on the use of the last-level cache that can severely affect the performance of large CMP designs, this dissertation also presents a Quasi-partitioning scheme that breaks such restrictions while providing most of the benefits of cache partitioning. The presented solution efficiently scales to a significantly larger number of cores than previously described schemes based on isolated partitions can achieve.
Finally, as the memory controller is one of the fundamental components of the memory subsystem, a well-designed memory-subsystem resource management scheme needs to carefully utilize the memory controller resources and coordinate its functionality with the operation of the main memory and the last-level cache. To improve execution fairness and system throughput, this dissertation presents a criticality-based memory controller request priority scheme. The scheme ranks demand read and prefetch operations based on their latency sensitivity, while coordinating its operation with the DRAM page-mode policy and the memory data prefetcher.
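A sketch of what such a ranking might look like, with illustrative field names; the actual scheme's latency-sensitivity estimation and page-mode coordination are more involved than this ordering.

```python
# Hypothetical ranking for a criticality-based memory controller: demand
# reads from latency-sensitive threads outrank prefetches, and row-buffer
# hits are favored within a rank to stay coordinated with the page-mode
# policy. Field names are illustrative, not the dissertation's.

def request_priority(req):
    """Smaller tuple = served first."""
    return (
        0 if req['type'] == 'demand' else 1,   # demands before prefetches
        -req['latency_sensitivity'],           # more sensitive first
        0 if req['row_hit'] else 1,            # then exploit open rows
        req['age'],                            # finally, oldest first
    )

queue = [
    {'type': 'prefetch', 'latency_sensitivity': 1, 'row_hit': True,  'age': 0},
    {'type': 'demand',   'latency_sensitivity': 3, 'row_hit': False, 'age': 2},
    {'type': 'demand',   'latency_sensitivity': 3, 'row_hit': True,  'age': 5},
]
for r in sorted(queue, key=request_priority):
    print(r['type'], r['age'])
# -> demand 5, demand 2, prefetch 0
```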