Over the past two decades, microprocessor manufacturers have typically relied on wider issue widths and deeper pipelines to obtain performance improvements for single threaded applications. However, in the recent years, with power dissipation and wire delays becoming primary design constraints, this approach can no longer be effectively used to yield performance improvements. Thus process designers and vendors are universally moving towards multi-core designs. Examples for these are the commodity general purpose multi-core processors, the CellBE accelerator from IBM and the Graphics Processing Units from NVIDIA and ATI. Although these many and multi-core architectures can provide enormous performance benefits, it is difficult to program for them due to the complexity of writing explicitly parallel code. The ubiquity of computationally intensive media processing applications makes it imperative to consider new programming frameworks and languages that can express parallelism in an easy, portable manner. The StreamIt programming language has been proposed to efficiently exploit parallelism at various levels on general purpose multi-core architectures and stream processors and allow media processing and DSP application to be developed in an easy and portable fashion. The StreamIt model allows programmers to specify a program as a set of filters connected by FIFO communication channels. The graphs thus specified by the StreamIt programs describe task, data and pipeline parallelism which can be potentially exploited on modern Graphics Processing Units (GPUs), which have emerged as powerful, commodity stream processors, which support abundant parallelism in hardware. The first part of this thesis deals with the challenges in mapping StreamIt programs to GPUs and proposes an efficient technique to software pipeline the execution of stream Programs on GPUs. We formulate this problem—both scheduling and assignment of filters to processors—as an efficient Integer Linear Program(ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipeline parallelism. We have evaluated our approach on a platform equipped with an NVIDIA GeForce 8800 GTS 512 GPU and our approach yields a (geometric) mean speedup of 5.02X, with a maximum speedup of 36.83X across a set of StreamIt benchmarks, with the speedup measured relative to an optimized single threaded CPU execution. While the approach of software pipelining the execution of stream programs on GPUs is efficient and performs well, it does not utilize the CPU cores to perform useful computation. Further, it does not support programs with stateful filters, which are essentially filters that are not data parallel owing to a dependence between each successive firing that is carried through the implicit state of the filter. The second part of the thesis aims at addressing these issues and describes a novel method to orchestrate the execution of a StreamIt program on the multiple cores of a system and GPUs in a synergistic manner. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers, the limited DMA bandwidth available and the required buffer layout transformations associated with the partitioning, as an integrated Integer Linear Program(ILP) which can then be solved by an ILP solver. Since solving an ILP is NP-Hard in the general case and may thus require a large amount of time, we also propose an efficient heuristic algorithm for the work partitioning between the CPU and the GPU, which provides solutions which are within 9.05% of the optimal solutions to the ILP formulation on an average across the benchmark suite, while requiring 2–3 orders of magnitude less time than the ILP approach. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with eight CPU cores, out of which four were used, and a GeForce 8800 GTS512 GPU show a(geometric) mean speed up of 6.84X with a maximum of 51.96X over a single threaded CPU execution across a set of StreamIt benchmarks.
Nagarakatte, Santosh G
Instruction scheduling and Software pipelining are important compilation techniques which reorder instructions in a program to exploit instruction level parallelism. They are essential for enhancing instruction level parallelism in architectures such as very Long Instruction Word and tiled processors. This thesis addresses two important problems in the context of these instruction reordering techniques. The first problem is for general purpose applications and architectures, while the second is for media and graphics applications for tiled and multi-core architectures. The first problem deals with software pipelining which is an instruction scheduling technique that overlaps instructions from multiple iterations. Software pipelining increases the register pressure and hence it may be required to introduce spill instructions. In this thesis, we model the problem of register allocation with optimal spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. By minimizing the amount of spill code produced, the formulation ensures that the initiation interval (II) between successive iterations of the loop is not increased unnecessarily. Experimental results show that our formulation performs better than the existing heuristics by preventing an increase in the II and also generating less spill code on average among loops extracted from Perfect Club and SPEC benchmarks. The second major contribution of the thesis deals with the code size aware scheduling of stream programs. Large scale synchronous dataflow graphs (SDF’s) and StreamIt have emerged as powerful programming models for high performance streaming applications. In these models, a program is represented as a dataflow graph where each node represents an autonomous filter and the edges represent the channels through which the nodes communicate. In constructing static schedules for programs in these models, it is important to optimize the execution time buffer requirements of the data channel and the space required to store the encoded schedule. Earlier approaches have either given priority to one of the requirements or proposed ad-hoc methods for generating schedules with good trade-offs. In this thesis, we propose a genetic algorithm framework based on non-dominated sorting for generating serial schedules which have good trade-off between code size and buffer requirement. We extend the framework to generate software pipelined schedules for tiled architectures. From our experiments, we observe that the genetic algorithm framework generates schedules with good trade-off and performs better than the earlier approaches.
(has links) (PDF)
High power dissipation and on-chip temperature limit performance and affect reliability in modern microprocessors. For servers and data centers, they determine the cooling cost, whereas for handheld and mobile systems, they limit the continuous usage of these systems. For mobile systems, energy consumption affects the battery life. It can not be ignored for desktop and server systems as well, as the contribution of energy continues to go up in organizations’ budgets, influencing strategic decisions, and its implications on the environment are getting appreciated. Intelligent trade-offs involving these quantities are critical to meet the performance demands of many modern applications. Dynamic Voltage and Frequency Scaling (DVFS) offers a huge potential for designing trade-offs involving energy, power, temperature and performance of computing systems. In our work, we propose and evaluate DVFS schemes that aim at minimizing energy consumption while meeting a performance constraint, for both sequential and parallel applications. We propose a Petri net based program performance model, parameterized by application properties, microarchitectural settings and system resource configuration, and use this model to find energy efficient DVFS settings. We first propose a DVFS scheme using this model for sequential programs running on single core multiple clock domain (MCD) processors, and evaluate this on a MCD processor simulator. We then extend this scheme for data parallel (Single Program Multiple Data style) applications, and then generalize it for stream applications as well, and evaluate these two schemes on a full system CMP simulator. Our experimental evaluation shows that the proposed schemes achieve significant energy savings for a small performance degradation.
Page generated in 0.1058 seconds