Global ETD Search

231	Efficient Compilation Of Stream Programs Onto Multi-cores With Accelerators Udupa, Abhishek 07 1900 (has links) Over the past two decades, microprocessor manufacturers have typically relied on wider issue widths and deeper pipelines to obtain performance improvements for single threaded applications. However, in the recent years, with power dissipation and wire delays becoming primary design constraints, this approach can no longer be effectively used to yield performance improvements. Thus process designers and vendors are universally moving towards multi-core designs. Examples for these are the commodity general purpose multi-core processors, the CellBE accelerator from IBM and the Graphics Processing Units from NVIDIA and ATI. Although these many and multi-core architectures can provide enormous performance benefits, it is difficult to program for them due to the complexity of writing explicitly parallel code. The ubiquity of computationally intensive media processing applications makes it imperative to consider new programming frameworks and languages that can express parallelism in an easy, portable manner. The StreamIt programming language has been proposed to efficiently exploit parallelism at various levels on general purpose multi-core architectures and stream processors and allow media processing and DSP application to be developed in an easy and portable fashion. The StreamIt model allows programmers to specify a program as a set of filters connected by FIFO communication channels. The graphs thus specified by the StreamIt programs describe task, data and pipeline parallelism which can be potentially exploited on modern Graphics Processing Units (GPUs), which have emerged as powerful, commodity stream processors, which support abundant parallelism in hardware. The first part of this thesis deals with the challenges in mapping StreamIt programs to GPUs and proposes an efficient technique to software pipeline the execution of stream Programs on GPUs. We formulate this problem—both scheduling and assignment of filters to processors—as an efficient Integer Linear Program(ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipeline parallelism. We have evaluated our approach on a platform equipped with an NVIDIA GeForce 8800 GTS 512 GPU and our approach yields a (geometric) mean speedup of 5.02X, with a maximum speedup of 36.83X across a set of StreamIt benchmarks, with the speedup measured relative to an optimized single threaded CPU execution. While the approach of software pipelining the execution of stream programs on GPUs is efficient and performs well, it does not utilize the CPU cores to perform useful computation. Further, it does not support programs with stateful filters, which are essentially filters that are not data parallel owing to a dependence between each successive firing that is carried through the implicit state of the filter. The second part of the thesis aims at addressing these issues and describes a novel method to orchestrate the execution of a StreamIt program on the multiple cores of a system and GPUs in a synergistic manner. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies for data transfers, the limited DMA bandwidth available and the required buffer layout transformations associated with the partitioning, as an integrated Integer Linear Program(ILP) which can then be solved by an ILP solver. Since solving an ILP is NP-Hard in the general case and may thus require a large amount of time, we also propose an efficient heuristic algorithm for the work partitioning between the CPU and the GPU, which provides solutions which are within 9.05% of the optimal solutions to the ILP formulation on an average across the benchmark suite, while requiring 2–3 orders of magnitude less time than the ILP approach. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with eight CPU cores, out of which four were used, and a GeForce 8800 GTS512 GPU show a(geometric) mean speed up of 6.84X with a maximum of 51.96X over a single threaded CPU execution across a set of StreamIt benchmarks. Compilers Stream Programs Partitioning Algorithm Stream Programs - Execution Stream Programs - Compilation Graphics Processing Units (GPUs) Streaming Multiprocessors (SMs) Accelerators Computer Science
232	Reducing Communication Through Buffers on a SIMD Architecture Choi, Jee W. 13 May 2004 (has links) Advances in wireless technology and the growing popularity of multimedia applications have brought about a need for energy efficient and cost effective portable supercomputers capable of delivering performance beyond the capabilities of current microprocessors and DSP chips. The SIMPil architecture currently being developed at Georgia Institute of Technology is a promising candidate for this task. In order to develop applications for SIMPil, a high level language and an optimizing compiler for the language are essential. However, with the recent trend of interconnect latency becoming a major bottleneck on computer systems, optimizations focusing on reducing latency are becoming more important, especially with SIMPil, as it is highly scalable. The compiler tracks the path of data through the network and buffers data in each processor to eliminate redundant communication. With a buffer size of 5, the compiler was able to eliminate 96 percent of the redundant communication for a 9x9 convolution and 8x8 DCT algorithms. With 5x5 convolution, only 89 percent elimination was observed. In terms of performance, 106 percent speedup was observed with 9x9 convolution at buffer size of 5 while 5x5 convolution and 8x8 DCT which have a much lower number of communication showed only 101 percent speedup. Communication opimization using buffers Compiler optimization Parallel Architecture Compilers SIMD Communication Networks Optimization Buffer DCT Convolution Image SEP Multimedia SIMPil
233	Compiler Optimizations for Multithreaded Multicore Network Processors Zhuang, Xiaotong 07 July 2006 (has links) Network processors are new types of multithreaded multicore processors geared towards achieving both fast processing speed and flexibility of programming. The architecture of network processors considers many special properties for packet processing, including multiple threads, multiple processor cores on the same chip, special functional units, simplified ISA and simplified pipeline, etc. The architectural peculiarities of network processors raise new challenges for compiler design and optimization. Due to very high clocking speeds, the CPU memory gap on such processors is huge, making registers extremely precious. Moreover, the register file is split into two banks, and for any ALU instruction, the two source operands must come from different banks. We present and compare three different approaches to do register allocation and bank assignment. We also address the problem of sharing registers across threads in order to maximize the utilization of hardware resources. The context switches on the IXP network processor only happen when long latency operations are encountered. As a result, context switches are highly frequent. Therefore, the designer of the IXP network processor decided to make context switches extremely lightweight, i.e. only the program counter(PC) is stored together with the context. Since registers are not saved and restored during context switches, it becomes difficult to share registers across threads. For a conventional processor, each thread can assume that it can use the entire register file, because registers are always part of the context. However, with lightweight context switch, each thread must take a separate piece of the register file, making register usage inefficient. Programs executing on network processors typically have runtime constraints. Scheduling of multiple programs sharing a CPU must be orchestrated by the OS and the hardware using certain sharing policies. Real time applications demand a real time aware OS kernel to meet their specified deadlines. However, due to stringent performance requirements on network processors, neither OS nor hardware mechanisms is typically feasible. In this work, we demonstrate that a compiler approach could achieve some of the OS scheduling and real time scheduling functionalities without introducing a hefty overhead. Runtime constraints Register allocation Compiler optimization Multitheading Multicore Network processors Registers (Computers) Compilers (Computer programs) High performance processors Packet switching (Data transmission)
234	Power-Aware Compilation Techniques For Embedded Systems Shyam, K 07 1900 (has links) The demand for devices like Personal Digital Assistants (PDA’s), Laptops, Smart Mobile Phones, are at an all time high. As the demand for these devices increases, so is the push to provide sophisticated functionalities in these devices. However energy consumption has become a major constraint in providing increased functionality for these devices. A majority of the applications meant for these devices are rich with multimedia content. In this thesis, we propose two approaches for compiler directed energy reduction, one targeting the memory subsystem and another the processor. The ﬁrst technique is a compiler directed optimization technique that reduces the energy consumption of the memory subsystem, for an oﬀ-chip partitioned memory archi- tecture, having multiple memory banks, and various low-power operating modes for each of these banks. We propose an eﬃcient layout of the data segment to reduce the number of simultaneously active memory banks, so that the other memory banks that are inactive can be put to low power modes to reduce the energy. We model this problem as a graph partitioning problem, and use well known heuristics to solve the same. We also propose a simple Integer Linear Programming (ILP) formulation for the above problem. Perfor- mance results indicate that our approach achieves an energy reduction of 20% compared to the base scheme, and a reduction of 8%-10% over a previously suggested method. Also, our results are well within the optimal results obtained by using ILP method. The second approach proposed in this thesis reduces the dynamic energy consumed by the processor using dynamic voltage and frequency scaling technique. Earlier works on dynamic voltage scaling focused mainly on performing voltage scaling when the CPU is waiting for memory subsystem or concentrated chieﬂy on loop nests and/or subroutine calls having suﬃcient number of dynamic instructions. We concentrate on coarser pro- gram regions and for the ﬁrst time uses program phase behavior for performing dynamic voltage scaling. We relate the Dynamic Voltage Scaling Problem to the Multiple Choice Knapsack Problem, and use well known heuristics to solve it eﬃciently. Also, we develop a simple Integer Linear Programming (ILP) problem formulation for this problem. Experi-mental evaluation on a set of media applications reveal that our heuristic method obtains 35-40% reduction in energy consumption on an average, with a negligible performance degradation. Further the energy consumed by our heuristic solution is within 1% the optimal solution obtained by the ILP approach. Electric Power Control Compilers Embedded Systems Computer Memory Architecture Voltages Dynamic Voltage Scaling Memory Architectures Integer Linear Programming (ILP) Computer Science
235	Spill Code Minimization And Buffer And Code Size Aware Instruction Scheduling Techniques Nagarakatte, Santosh G 08 1900 (has links) Instruction scheduling and Software pipelining are important compilation techniques which reorder instructions in a program to exploit instruction level parallelism. They are essential for enhancing instruction level parallelism in architectures such as very Long Instruction Word and tiled processors. This thesis addresses two important problems in the context of these instruction reordering techniques. The first problem is for general purpose applications and architectures, while the second is for media and graphics applications for tiled and multi-core architectures. The first problem deals with software pipelining which is an instruction scheduling technique that overlaps instructions from multiple iterations. Software pipelining increases the register pressure and hence it may be required to introduce spill instructions. In this thesis, we model the problem of register allocation with optimal spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. By minimizing the amount of spill code produced, the formulation ensures that the initiation interval (II) between successive iterations of the loop is not increased unnecessarily. Experimental results show that our formulation performs better than the existing heuristics by preventing an increase in the II and also generating less spill code on average among loops extracted from Perfect Club and SPEC benchmarks. The second major contribution of the thesis deals with the code size aware scheduling of stream programs. Large scale synchronous dataflow graphs (SDF’s) and StreamIt have emerged as powerful programming models for high performance streaming applications. In these models, a program is represented as a dataflow graph where each node represents an autonomous filter and the edges represent the channels through which the nodes communicate. In constructing static schedules for programs in these models, it is important to optimize the execution time buffer requirements of the data channel and the space required to store the encoded schedule. Earlier approaches have either given priority to one of the requirements or proposed ad-hoc methods for generating schedules with good trade-offs. In this thesis, we propose a genetic algorithm framework based on non-dominated sorting for generating serial schedules which have good trade-off between code size and buffer requirement. We extend the framework to generate software pipelined schedules for tiled architectures. From our experiments, we observe that the genetic algorithm framework generates schedules with good trade-off and performs better than the earlier approaches. Compilers Parallel Processing Instruction Scheduling Software Pipelining Spill Code Scheduling Stream Programs - Scheduling Genetic Algorithms Integer Linear Programming Software Pipelined Loops Register Allocation Computer Science
236	Sub-Polyhedral Compilation using (Unit-)Two-Variables-Per-Inequality Polyhedra Upadrasta, Ramakrishna 13 March 2013 (has links) (PDF) The goal of this thesis is to design algorithms that run with better complexity when compiling or parallelizing loop programs. The framework within which our algorithms operate is the polyhedral model of compilation which has been successful in the design and implementation of complex loop nest optimizers and parallelizing compilers. The algorithmic complexity and scalability limitations of the above framework remain one important weakness. We address it by introducing sub-polyhedral compilation by using (Unit-)Two-Variable-Per-Inequality or (U)TVPI Polyhedra, namely polyhedrawith restricted constraints of the type ax_{i}+bx_{j}\le c (\pm x_{i}\pm x_{j}\le c). A major focus of our sub-polyhedral compilation is the introduction of sub-polyhedral scheduling, where we propose a technique for scheduling using (U)TVPI polyhedra. As part of this, we introduce algorithms that can be used to construct under-aproximations of the systems of constraints resulting from affine scheduling problems. This technique relies on simple polynomial time algorithms to under approximate a general polyhedron into (U)TVPI polyhedra. The above under-approximation algorithms are generic enough that they can be used for many kinds of loop parallelization scheduling problems, reducing each of their complexities to asymptotically polynomial time. We also introduce sub-polyhedral code-generation where we propose algorithms to use the improved complexities of (U)TVPI sub-polyhedra in polyhedral code generation. In this problem, we show that the exponentialities associated with the widely used polyhedral code generators could be reduced to polynomial time using the improved complexities of (U)TVPI sub-polyhedra. The above presented sub-polyhedral scheduling techniques are evaluated in an experimental framework. For this, we modify the state-of-the-art PLuTo compiler which can parallelize for multi-core architectures using permutation and tiling transformations. We show that using our scheduling technique, the above under-approximations yield polyhedra that are non-empty for 10 out of 16 benchmarks from the Polybench (2.0) kernels. Solving the under-approximated system leads to asymptotic gains in complexity, and shows practically significant improvements when compared to a traditional LP solver. We also verify that code generated by our sub-polyhedral parallelization prototype matches the performance of PLuTo-optimized code when the under-approximation preserves feasibility. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Affine scheduling Approximation algorithms Compiler optimizations Compilers Loop transformations Optimization Parallelism Asymptotic complexity Code generation
237	Evaluating Impulse C and multiple parallelism partitions for a low-cost reconfigurable computing system Li Shen, Carmen C. Duren, Russell Walker. January 2008 (has links) Thesis (M.S.E.C.E.)--Baylor University, 2008. / Includes bibliographical references (p. 77-79).
238	Component-based language implementation with object-oriented syntax and aspect-oriented semantics Wu, Xiaoqing. January 2007 (has links) (PDF) Thesis (Ph. D.)--University of Alabama at Birmingham, 2007. / Additional advisors: Jeff Gray, Marjan Mernik, Alan Sprague, Murat Tanik. Description based on contents viewed June 25, 2007; title from title screen. Includes bibliographical references (p. 132-138).
239	Improving scalability of exploratory model checking Boulgakov, Alexandre January 2016 (has links) As software and hardware systems grow more complex and we begin to rely more on their correctness and reliability, it becomes exceedingly important to formally verify certain properties of these systems. If done naïvely, verifying a system can easily require exponentially more work than running it, in order to account for all possible executions. However, there are often symmetries or other properties of a system that can be exploited to reduce the amount of necessary work. In this thesis, we present a number of approaches that do this in the context of the CSP model checker FDR. CSP is named for Communicating Sequential Processes, or parallel combinations of state machines with synchronised communications. In the FDR model, the component processes are typically converted to explicit state machines while their parallel combination is evaluated lazily during model checking. Our contributions are motivated by this model but applicable to other models as well. We first address the scalability of the component machines by proposing a lazy compiler for a subset of CSP<sub>M</sub> selected to model parameterised state machines. This is a typical case where the state space explosion can make model checking impractical, since the size of the state space is exponential in the number and size of the parameters. A lazy approach to evaluating these systems allows only the reachable subset of the state space to be explored. As an example, in studying security protocols, it is common to model an intruder parameterised by knowledge of each of a list of facts; even a relatively small 100 facts results in an intractable 2<sup>100</sup> states, but the rest of the system can ensure that only a small number of these states are reachable. Next, we address the scalability of the overall combination by presenting novel algorithms for bisimulation reduction with respect to strong bisimulation, divergence- respecting delay bisimulation, and divergence-respecting weak bisimulation. Since a parallel composition is related to the Cartesian product of its components, performing a relatively time-consuming bisimulation reduction on the components can reduce its size significantly; an efficient bisimulation algorithm is therefore very desirable. This thesis is motivated by practical implementations, and we discuss an implementation of each of the proposed algorithms in FDR. We thoroughly evaluate their performance and demonstrate their effectiveness.
240	Profile guided hybrid compilation / Compilation hybride guidée pour profilage Nunes Sampaio, Diogo 14 December 2016 (has links) L'auteur n'a pas fourni de résumé en français / The end of chip frequency scaling capacity, due heat dissipation limitations, made manufacturers search for an alternative to sustain the processing capacity growth. The chosen solution was to increase the hardware parallelism, by packing multiple independent processors in a single chip, in a Multiple-Instruction Multiple-Data (MIMD) fashion, each with special instructions to operate over a vector of data, in a Single-Instruction Multiple-Data (SIMD) manner. Such paradigm change, brought to software developer the convoluted task of producing efficient and scalable applications. Programming languages and associated tools evolved to aid such task for new developed applications. But automated optimizations capable of coping with such a new complex hardware, from legacy, single threaded applications, is still lacking.To apply code transformations, either developers or compilers, require to assert that, by doing so, they are not changing the expected comportment of the application producing unexpected results. But syntactically poor codes, such as use of pointer parameters with multiple possible indirections, complex loop structures, or incomplete codes, make very hard to extract application behavior solely from the source code in what is called a static analyses. To cope with the lack of information extracted from the source code, many tools and research has been done in, how to use dynamic analyses, that does application profiling based on run-time information, to fill the missing information. The combination of static and dynamic information to characterize an application are called hybrid analyses. This works advocates for the use of hybrid analyses to be able to optimizations on loops, regions where most of computations are done. It proposes a framework capable of statically applying some complex loop transformations, that previously would be considered unsafe, by assuring their safe use during run-time with a lightweight test.The proposed framework uses application execution profiling to help the static loop optimizer to: 1) identify and classify program hot-spots, so as to focus only on regions vital for the execution time; 2) guide the optimizer in understanding the overall loop behavior, so as to reduce the valid loop transformations search space; 3) using instruction's memory access functions, it statically builds a lightweight run-time test that determine, based on the program parameters values, if a given optimization is safe to be used or not. It's applicability is shown by performing complex loop transformations into a variety of loops, obtained from applications of different fields, and demonstrating that the run-time overhead is insignificant compared to the loop execution time or gained performance, in the vast majority of cases. Optimisation de boucles Compilation hybride Profilage Accès mémoire non-Linéaire Élimination quantificateurs Compilateur Compilers Hybrid compilation Loop optimization Non-affine memory access functions Profiling Quantifier elimination 004

Search results