Global ETD Search

161	Study of the Hyperscalar Multi-core Architecture Chou, Yu-Liang 07 September 2011 Current trends in processor design have migrated toward chip multiprocessors (CMPs). CMPs are designed to exploit both instruction-level parallelism (ILP) within processors and thread-level parallelism (TLP) within and across processors. However, the conventional design of current CMPs forces a choice between high single-thread performance and high peak throughput. This inability to adjust to varying levels of ILP and TLP results in processor inefficiency. To resolve this dilemma facing processor designers, this dissertation proposed the hyperscalar concept for current multi-core designs. The hyperscalar concept enables multi-core architectures to dynamically group many scalar in-order cores into a superscalar processor to accelerate a sequential thread. The reconfigurability of the hyperscalar architecture gives it high flexibility in adapting to different types of applications, providing high single-thread performance when TLP is low and high throughput when TLP is high. Based on the hyperscalar concept, this dissertation first proposed a hyperscalar dual-core architecture. It can play three different roles: a 2-issue statically scheduled superscalar processor, a homogeneous dual-core processor, or a standalone single-core processor. An Instruction-dependency Analyzer (IA) that connects two scalar in-order cores is designed to handle the role switching. The design of the IA makes it possible for the two cores to work together like a 2-issue statically scheduled superscalar processor. The IA dispatches instructions with data dependencies to the same core so that the data dependencies can be resolved by existing forwarding paths in the core. 
Simulation results show that when the proposed architecture works in a statically scheduled superscalar manner, it achieves 30.3% higher instructions per cycle (IPC) than the traditional five-stage pipelined core on 35 benchmarks from the MiBench suite. The increases in area and power for extending a homogeneous dual-core processor to a hyperscalar dual-core processor are only 1.8% and 1.75%, respectively, using 90nm CMOS technology. On top of that, this dissertation further extended the hyperscalar dual-core architecture to a hyperscalar multi-core architecture capable of flexibly providing high throughput for uniform parallel applications as well as high performance for more general workloads. It can dynamically unite many scalar cores into a larger out-of-order (OOO) superscalar processor to accelerate a thread. To accomplish this, the Virtual Shared Register File (VSRF) concept was proposed so that the instructions of a thread running on different cores logically see a single uniform register file. Simulation results show that the 2, 4, 8, 16, and 32-core-united configurations of the hyperscalar multi-core architecture achieve 95%, 84%, 82%, 85%, and 90%, respectively, of the performance of monolithic 2, 4, 8, 16, and 32-issue OOO superscalar processors on the SPEC2000 benchmarks. Finally, this dissertation proposed a new technology, called multi-streaming SIMD, applicable to the hyperscalar architecture to efficiently exploit data-level parallelism (DLP). The multi-streaming SIMD technology enables current multimedia extensions to simultaneously manipulate multiple data streams. Simulation results show that when a multi-streaming SIMD computing engine has four 4-register multimedia operation storage units, it provides a 3.3x to 5.5x performance improvement over traditional MMX extensions on twelve multimedia kernels. Having explored the research topics above, this dissertation arrives at a promising architecture for future multi-core designs. 
SIMD chip multiprocessors superscalar dynamic multi-core reconfigurable hardware multimedia processing hyperscalar
162	Designing heterogeneous many-core processors to provide high performance under limited chip power budget Woo, Dong Hyuk 04 October 2010 This thesis describes the efficient design of a future many-core processor that can provide higher performance under a limited chip power budget. To achieve this goal, this thesis first develops an analytical framework within which computer architects can estimate the achievable performance improvement of different many-core architectures given the same power budget. From this study, this thesis found that a future many-core processor needs (1) energy-efficient parallel cores and (2) a high-performance sequential core. Based on these observations, this thesis proposes an energy-efficient broad-purpose acceleration layer that can be snapped on top of a conventional general-purpose processor. In addition to such energy-efficient parallel cores, this thesis also proposes different architectural techniques to further boost the performance of sequential computation while those parallel cores are idle. In particular, this thesis develops low-cost architectural techniques to enhance the memory performance of a host core by utilizing those idle parallel cores. This idea is evaluated in two different system architectures: one with the aforementioned acceleration layer and the other with an emerging integrated CPU and GPU chip. Heterogeneous many-core architecture Heterogeneous computing Multiprocessors Microprocessors High performance processors
163	On the design of architecture-aware algorithms for emerging applications Kang, Seunghwa 30 January 2011 This dissertation maps various kernels and applications to a spectrum of programming models and architectures and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics. For example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology. We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in the problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also identify several limitations of current system software and architectures and directions for improving them. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation contributes to these efforts by providing benchmarks and suggestions to improve system software and architectures. MapReduce Nested parallelism Parallel algorithm Algorithm engineering Performance tuning GPU Transactional memory Algorithms Parallel algorithms Multiprocessors
164	Micro-scheduling and its interaction with cache partitioning Choudhary, Dhruv 05 July 2011 The thesis explores the sources of energy inefficiency in asymmetric multi-core architectures where energy efficiency is measured by the energy-delay squared product. The insights gathered from this study drive the development of optimized thread scheduling and coordinated cache management strategies in an important class of asymmetric shared memory architectures. The proposed techniques are founded on well-known mathematical optimization techniques yet are lightweight enough to be implemented in practical systems. Cache partitioning Computer architecture Thread scheduling Cache memory Multiprocessors High performance computing
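The energy-delay squared product named in entry 164 is conventionally defined as energy multiplied by the square of execution time, so that delay is weighted more heavily than energy. The sketch below only illustrates that definition; the function name and the two sets of measurements are hypothetical and are not taken from the thesis.

```python
def energy_delay_squared(energy_joules: float, delay_seconds: float) -> float:
    """Return the energy-delay squared product E * D^2 (lower is better)."""
    return energy_joules * delay_seconds ** 2


# Hypothetical measurements of one workload on two kinds of core.
big_core = energy_delay_squared(energy_joules=8.0, delay_seconds=1.0)
little_core = energy_delay_squared(energy_joules=3.0, delay_seconds=2.0)

print(big_core, little_core)
```

With these made-up numbers the small core uses less energy, yet its metric is worse because the delay term is squared. This is the kind of trade-off a scheduler optimizing this metric has to weigh.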
165	On algorithm design and programming model for multi-threaded computing He, Zhengyu 27 March 2012 The objective of this work is to investigate the algorithm design and the programming model of multi-threaded computing. Designing multi-threaded algorithms is very challenging: when multiple threads need to communicate or coordinate with each other, efficient synchronization support is needed. However, synchronization is known to be expensive on emerging multi-/many-core processors, especially as the number of threads increases. To fully unleash the power of such processors, careful investigations are needed in both algorithm design and programming models for multi-threaded systems. In this dissertation, we first present an asynchronous multi-threaded algorithm for the maximum network flow problem. This algorithm is based on the classical push-relabel algorithm and completely removes the use of locks and barriers from its original parallel version. While this algorithmic method is effective, it is challenging to generalize its success to other multi-threaded problems. We next focus on improving transactional memory, a promising mechanism for constructing multi-threaded programs. A queuing-theory-based model is developed to analyze the performance of different transactional memory systems. Based on the results of the model, we focus on the contention management mechanism of transactional memory systems. Finally, a profiling-based adaptive contention management scheme is proposed to address the problem that no static contention management scheme maintains good performance on all platforms for all types of workload. From this research, we show that it is necessary and worthwhile to explore both the algorithm design aspect and the programming model aspect of multi-threaded computing. Programming model Parallel computing Transactional memory Maximum flow Computer science Computer algorithms Algorithms Multiprocessors
166	Architecture, Performance and Applications of a Hierarchial Network of Hypercubes Kumar, Mohan J 02 1900 This thesis presents a multiprocessor topology, the hierarchical network of hypercubes, which has a low diameter and a low degree of connectivity and yet exhibits versatile hypercube-like characteristics. The hierarchical network of hypercubes consists of k-cubes interconnected in two or more hierarchical levels. The network has a hierarchical, expansive, recursive structure with a constant pre-defined building block. The basic building block of the hierarchical network of hypercubes comprises a k-cube of processor elements and a network controller. The hierarchical network of hypercubes retains the positive features of the k-cube at different levels of hierarchy and has been found to perform better than the binary hypercube in executing a variety of application problems. The ASCEND/DESCEND class of algorithms can be executed in O(log2 N) parallel steps (N is the number of data elements) on a hierarchical network of hypercubes with N processor elements. A description of the topology of the hierarchical network of hypercubes is presented, and its architectural potential in terms of fault-tolerant message routing, executing a class of highly parallel algorithms, and simulating artificial neural networks is analyzed. Further, the proposed topology is found to be very efficient in executing multinode broadcast and total exchange algorithms. We subsequently propose an improvement of the network to counter faults, and explore the implementation of artificial neural networks to demonstrate efficient implementation of application problems on the network. The fault-tolerant capabilities of the hierarchical network of hypercubes with two network controllers per k-cube of processor elements are comparable to those of the hypercube and the folded hypercube. 
We also discuss various issues related to the suitability of multiprocessor architectures for simulating neural networks. A performance analysis of the ring, hypercube, mesh and hierarchical network of hypercubes for simulating artificial neural networks is presented. Our studies reveal that the performance of the hierarchical network of hypercubes is better than that of the ring, mesh, hypernet and hypercube topologies in implementing artificial neural networks. Design and implementation aspects of the hierarchical network of hypercubes based on two schemes, viz., dual-ported RAM communication and transputers, are also presented. Results of simulation studies for robotic applications using neural network paradigms on the transputer-based hierarchical network of hypercubes reveal that the proposed network can produce fast response times on the order of a hundred microseconds. Computer and Information Science Computer network architectures Computer architectures Multiprocessors Parallel processing Hierarchical Network of Hypercubes
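To illustrate why ASCEND-class algorithms take a logarithmic number of parallel steps, the sketch below simulates an all-reduce (sum) on a plain binary hypercube, where step d pairs each node with the neighbour whose index differs in bit d. It is a sequential simulation on an ordinary hypercube, not the hierarchical network described in entry 166, and the function name `ascend_allreduce` is a hypothetical choice for this example.

```python
def ascend_allreduce(values):
    """Simulate an ASCEND-style sum all-reduce on a hypercube.

    Node i holds values[i]. Returns (final per-node values, parallel steps).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of nodes must be a power of two")
    dims = n.bit_length() - 1  # log2(n) hypercube dimensions
    current = list(values)
    for d in range(dims):  # one parallel step per dimension, lowest bit first
        # Every node combines its value with its neighbour across dimension d.
        current = [current[i] + current[i ^ (1 << d)] for i in range(n)]
    return current, dims


totals, steps = ascend_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
print(totals, steps)
```

After the last step every node holds the global sum, and the step count equals log2 of the number of nodes (3 steps for 8 nodes here).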
167	Memory-subsystem resource management for the many-core era Kaseridis, Dimitrios 11 July 2012 As semiconductor technology continues to scale down in the nanometer era, the communication between processor and main memory has been particularly challenged. The well-studied frequency, memory and power "walls" have redirected architects towards utilizing Chip Multiprocessors (CMP) as an attractive architecture for leveraging technology scaling. In order to achieve high efficiency and throughput, CMPs rely heavily on sharing resources among multiple cores, especially in the case of the memory hierarchy. Unfortunately, such sharing introduces resource contention and interference between the multiple executing threads. The ever-increasing access latency difference between processor and memory, the gradually increasing bandwidth demands on main memory, and the decreasing cache capacity available to each core due to multiple core integration have made the need for efficient memory-subsystem resource management more critical than ever before. This dissertation focuses on managing the sharing of the Last-level Cache (LLC) capacity and the main memory bandwidth, as the two most important resources that significantly affect system performance and energy consumption. The presented schemes include efficient solutions to all three basic requirements for implementing a resource management scheme, that is: a) profiling mechanisms to capture applications' resource requirements, b) microarchitecture mechanisms to enforce a resource allocation scheme, and c) resource allocation algorithms/policies to manage the available memory resources throughout the whole memory hierarchy of a CMP system. To achieve these targets, the dissertation first describes a set of low-overhead, non-invasive profiling mechanisms that are able to project applications' memory resource requirements and memory sharing behavior. 
Two memory resource partitioning schemes are presented. The first one, the Bank-aware dynamic partitioning scheme, provides a low-overhead solution for partitioning the cache resources of large CMP architectures that are based on a Dynamic Non-Uniform Cache Architecture (DNUCA) last-level cache design, consistent with current industry trends. The second scheme, the Bandwidth-aware dynamic scheme, presents a system-wide optimization of memory-subsystem resource allocation and job scheduling for large, multi-chip CMP systems. The scheme seeks optimizations both within and outside single CMP chips, aiming at overall system throughput and efficiency improvements. As cache partitioning schemes with isolated partitions impose a set of restrictions on the use of the last-level cache, which can severely affect the performance of large CMP designs, this dissertation presents a Quasi-partitioning scheme that breaks such restrictions while providing most of the benefits of cache partitioning schemes. The presented solution scales efficiently to a significantly larger number of cores than previously described schemes based on isolated partitions can. Finally, as the memory controller is one of the fundamental components of the memory subsystem, well-designed memory-subsystem resource management needs to carefully utilize the memory controller resources and coordinate its functionality with the operation of the main memory and the last-level cache. To improve execution fairness and system throughput, this dissertation presents a criticality-based memory controller request priority scheme. The scheme ranks demand read and prefetch operations based on their latency sensitivity, while it coordinates its operation with the DRAM page-mode policy and the memory data prefetcher. Chip-multiprocessors Many-core Cache Memory Resource-management Processor Computer architecture Memory controllers
168	Compiling for a multithreaded dataflow architecture : algorithms, tools, and experience Li, Feng 20 May 2014 Across the wide range of multiprocessor architectures, all seem to share one common problem: they are hard to program. It is a general belief that parallelism is a software problem, and that perhaps we need more sophisticated compilation techniques to partition the application into concurrent threads. Many experts also make the point that the underlying architecture plays an equally important role before one may expect significant progress in the programmability of multiprocessors. Our approach favors a convergence of these viewpoints. The convergence of dataflow and von Neumann architectures promises latency tolerance, the exploitation of a high degree of parallelism, and low thread switching cost. Multithreaded dataflow architectures require a high degree of parallelism to tolerate latency. On the other hand, it is error-prone for programmers to partition a program into a large number of fine-grain threads. To reconcile these facts, we aim to advance the state of the art in automatic thread partitioning, in combination with programming language support for coarse-grain, functionally deterministic concurrency. This thesis presents a general thread partitioning algorithm for transforming sequential code into a parallel dataflow program targeting a multithreaded dataflow architecture. Our algorithm operates on the program dependence graph and on the static single assignment form, extracting task, pipeline, and data parallelism from arbitrary control flow, and coarsening its granularity using a generalized form of typed fusion. We design a new intermediate representation to ease code generation for an explicit token match dataflow execution model. We also implement a GCC-based prototype. 
We also evaluate coarse-grain dataflow extensions of OpenMP in the context of a large-scale, 1024-core simulated multithreaded dataflow architecture. These extensions and the simulated architecture allow the exploration of innovative memory models for dataflow computing. We evaluate these tools and models on realistic applications. Dataflow Multiprocessors
169	Energy and performance improvement relying on trivial instructions and speculative snooping in high-performance processors Atoofian, Ehsan 12 April 2010 This thesis introduces energy and performance optimization techniques for high-performance processors. Our optimization techniques target both single processors and chip multiprocessors (CMPs). In single processors, we exploit trivial instructions to improve energy and performance. Trivial instructions are those instructions whose output can be determined without performing the actual computations. We show that bypassing such unnecessary computations reduces energy while improving performance. The performance improvement achieved by skipping the execution of trivial instructions depends on how early such instructions are identified. We use value prediction to detect trivial instructions with high accuracy and as soon as possible. Consequently, we improve performance over a processor that bypasses trivial instructions without using speculation. In CMPs, we introduce two techniques to improve the energy of the interconnect and caches. Conventional snoopy-based chip multiprocessors take an aggressive approach, broadcasting snoop requests to all nodes. In addition, each node checks all received requests. This approach reduces the latency of cache-to-cache transfer misses at the expense of increasing energy. We exploit this design inefficiency and introduce two optimization techniques in CMPs. First, at the requester end, we introduce speculative selective request (SSR) to reduce energy consumption in the binary tree interconnect. In SSR, we send the request only to the node more likely to have the missing data. We reduce energy as we limit accesses to the interconnect components between the requester and the supplier node. Second, at the receiving end, we propose speculative tag lookup (STL) to reduce energy consumption in data caches. We filter out those accesses more likely to miss in the L1 cache. 
Using shared memory applications, we show that SSR and STL improve energy of interconnect and caches significantly with negligible performance loss and hardware overhead. High-performance processors Multiprocessors
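The definition of trivial instructions in entry 169 (instructions whose output can be determined without performing the computation) can be illustrated with a small operand check. The sketch below is an assumption-laden illustration: the set of opcodes and trivial cases (adding zero, multiplying by zero or one, and so on) is chosen here for the example and is not the detection logic of the thesis, which relies on value prediction in hardware.

```python
def trivial_result(op, a, b):
    """Return the result if (op, a, b) is trivial under the rules below, else None."""
    if op == "add":
        if a == 0:
            return b
        if b == 0:
            return a
    elif op == "sub":
        if b == 0:
            return a
        if a == b:
            return 0
    elif op == "mul":
        if a == 0 or b == 0:
            return 0
        if a == 1:
            return b
        if b == 1:
            return a
    return None  # not trivial: the functional unit would have to execute it


def count_trivial(trace):
    """Count instructions in a trace of (op, a, b) tuples that can be bypassed."""
    return sum(trivial_result(op, a, b) is not None for op, a, b in trace)


# Hypothetical instruction trace with concrete operand values.
trace = [("mul", 0, 7), ("add", 5, 0), ("mul", 3, 4), ("sub", 9, 9)]
print(count_trivial(trace))
```

Note that `None` rather than a falsy check marks the non-trivial case, since zero is itself a valid trivial result.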
170	Multiprocessor scheduling in the presence of link contention delays Macey, Benjamin January 2004 [Truncated abstract] Parallel computing is recognised today as an important tool in the solution of a wide variety of computationally intensive problems, problems which were previously considered intractable. While it offers the promise of vastly increased performance, parallel computing introduces additional complexities which are not encountered with sequential processing. One of these is the scheduling problem, in which the individual tasks comprising a parallel program are scheduled onto the processors comprising the parallel architecture. The objective is to minimise execution time while still preserving the precedence relations between the tasks. Scheduling is of vital importance since a poor task schedule can undo any potential gains from the parallelism present in the application. Inappropriate scheduling can result in the hardware being used inefficiently, or worse, the program could run slower in parallel than on a single processor. The scheduling problem is one of the more difficult problems facing the parallel programmer. In fact, it is NP-complete in the general case. As a result, a large number of heuristic methods with sub-optimal performance but polynomial, rather than exponential, time complexity have been proposed. In order to simplify their algorithms, researchers have restricted the problem by making assumptions concerning the parallel architecture or imposing limitations on the task graph representing the parallel program. The evolution of the task scheduling problem has involved the gradual relaxation of these restrictions. A major change occurred when the assumption of zero inter-processor communication costs was removed. This was driven by the increasing popularity of distributed-memory message-passing multiprocessors. 
Multiprocessors Multiprocessor scheduling List scheduling Parallel computing Interprocessor communication Contention delay
