171

Directive-based General-purpose GPU Programming

Han, Tian Yi David 19 January 2010 (has links)
Graphics Processing Units (GPUs) have become a competitive accelerator for non-graphics applications, mainly driven by the improvements in GPU programmability. Although the Compute Unified Device Architecture (CUDA) is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge for average programmers. In particular, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of the GPU memory. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks more simply, by applying directives directly to the sequential code. We have also prototyped a compiler that translates a hiCUDA program to a CUDA program and can handle real-world applications. Experiments using seven standard CUDA benchmarks show that the simplicity hiCUDA provides comes at no cost in performance.
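
To give a feel for the directive-based style the abstract describes, the sketch below annotates an ordinary sequential loop with GPU directives. The pragma names are approximations chosen for illustration, not verified hiCUDA syntax; they mark the three tasks the abstract mentions: allocating and transferring GPU memory, forming the kernel, and partitioning loop iterations.

    /* Illustrative sketch of directive-based GPU programming in the spirit of
     * hiCUDA. The pragma names are approximations, not verified hiCUDA syntax. */
    void vec_add(float *a, float *b, float *c, int n)
    {
        /* Hypothetical directives: allocate GPU memory and copy inputs in. */
        #pragma gpu global alloc a[0:n] copyin
        #pragma gpu global alloc b[0:n] copyin
        #pragma gpu global alloc c[0:n]

        /* Mark the loop as a GPU kernel and partition iterations over threads. */
        #pragma gpu kernel vec_add_kernel tblock(256) thread(128)
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
        #pragma gpu kernel_end

        /* Copy the result back and release the GPU memory. */
        #pragma gpu global copyout c[0:n]
        #pragma gpu global free a b c
    }
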
173

Automatic Task Formation Techniques for the Multi-level Computing Architecture

Stewart, Kirk 30 July 2008 (has links)
The Multi-Level Computing Architecture (MLCA) is a multiprocessor system-on-chip architecture designed for multimedia applications. It provides a programming model that simplifies the process of writing parallel applications by eliminating the need for explicit synchronization. However, developers must still invest effort to design applications that fully exploit the MLCA’s multiprocessing capabilities. We present a set of compiler techniques to streamline the process of developing applications for the MLCA. We present an algorithm to automatically partition a sequential application into tasks that can be executed in parallel. We also present code generation algorithms to translate annotated, sequential C code to the MLCA’s programming model. We provide an experimental evaluation of these techniques, performed with a prototype compiler based upon the open-source ORC compiler and integrated with the MLCA Optimizing Compiler. This evaluation shows that the performance of automatically generated code compares favourably to that of manually written code.
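
As a rough, hypothetical illustration of task formation (not the MLCA's actual programming model or the thesis's algorithm), a sequential media-style loop can be split into tasks that communicate only through their arguments, which is what lets a runtime dispatch them to different processors without explicit synchronization:

    /* Hypothetical sketch: a sequential pipeline split into argument-coupled
     * tasks. Names and structure are illustrative, not the MLCA task model. */
    typedef struct { unsigned char *data; int len; } frame_t;

    /* Stub task bodies; in a real application each would do substantial work.
     * Each task touches only its arguments, so tasks on different frames have
     * no hidden dependences and a runtime can run them in parallel. */
    static void task_decode(const frame_t *in, frame_t *out) { out->len = in->len; }
    static void task_filter(frame_t *f)                      { (void)f; }
    static void task_encode(const frame_t *in, frame_t *out) { out->len = in->len; }

    void process(frame_t *frames, frame_t *scratch, frame_t *results, int n)
    {
        for (int i = 0; i < n; i++) {
            task_decode(&frames[i], &scratch[i]);   /* candidate task 1 */
            task_filter(&scratch[i]);               /* candidate task 2 */
            task_encode(&scratch[i], &results[i]);  /* candidate task 3 */
        }
    }
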
174

Automatic Parallelization for Graphics Processing Units in JikesRVM

Leung, Alan Chun Wai January 2008 (has links)
Accelerated graphics cards, or Graphics Processing Units (GPUs), have become ubiquitous in recent years. On the right kinds of problems, GPUs greatly surpass CPUs in terms of raw performance. However, GPUs are currently used only for a narrow class of special-purpose applications; the raw processing power available in a typical desktop PC is unused most of the time. The goal of this work is to present an extension to JikesRVM that automatically executes suitable code on the GPU instead of the CPU. Both static and dynamic features are used to decide whether it is feasible and beneficial to off-load a piece of code onto the GPU. Feasible code is discovered by an implementation of data dependence analysis. A cost model that balances the speedup available from the GPU against the cost of transferring input and output data between main memory and GPU memory is used to determine whether a feasible parallelization is indeed beneficial. The cost model is parameterized so that it can be applied to different hardware combinations. We also present ways to overcome several obstacles to parallelization inherent in the design of the Java bytecode language: unstructured control flow, the lack of multi-dimensional arrays, precise exception semantics, and the proliferation of indirect references.
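
The role of the cost model can be captured in one inequality: off-loading pays off only when the predicted GPU execution time plus the time to move inputs and outputs between main memory and GPU memory is smaller than the predicted CPU time. A minimal sketch in C, with hypothetical parameter names rather than the thesis's actual model:

    /* Minimal sketch of a GPU off-load cost model; the parameters and their
     * meanings are hypothetical, not taken from the thesis. */
    typedef struct {
        double cpu_time_per_elem;   /* estimated CPU cost per element (s)   */
        double gpu_time_per_elem;   /* estimated GPU cost per element (s)   */
        double transfer_bandwidth;  /* host <-> GPU bandwidth (bytes/s)     */
        double transfer_latency;    /* fixed cost per transfer (s)          */
    } cost_params_t;

    /* Return nonzero if off-loading n elements (bytes_in copied to the GPU,
     * bytes_out copied back) is predicted to beat staying on the CPU. */
    int should_offload(const cost_params_t *p, long n,
                       long bytes_in, long bytes_out)
    {
        double cpu = p->cpu_time_per_elem * n;
        double gpu = p->gpu_time_per_elem * n
                   + 2.0 * p->transfer_latency
                   + (bytes_in + bytes_out) / p->transfer_bandwidth;
        return gpu < cpu;
    }
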
175

Suitability of Java for Solving Large Sparse Positive Definite Systems of Equations Using Direct Methods

Armstrong, Shea January 2004 (has links)
The purpose of the thesis is to determine whether Java, a programming language that evolved out of a research project by Sun Microsystems in 1990, is suitable for solving large sparse linear systems using direct methods. That is, can a Java implementation achieve performance comparable to that of Fortran, the language traditionally used for sparse matrix computation? Performance evaluation criteria include execution speed and memory requirements; a secondary criterion is ease of development. Many attractive features, unique to the Java programming language, make it desirable for use in sparse matrix computation and provide the motivation for the thesis. The 'write once, run anywhere' proposition, coupled with nearly-ubiquitous Java support, alleviates the need to re-write programs in the event of a hardware change. Features such as garbage collection (automatic recycling of memory) and array-index bounds checking make Java programs more robust than those written in Fortran. Java has garnered a poor reputation as a high-performance computing platform, largely attributable to its poor performance relative to Fortran in its early years. There is now a consensus among researchers that the Java language itself is not the problem, but rather its implementation. As such, improving compiler technology for numerical codes is critical to achieving high performance in numerical Java applications. Preliminary work involved converting SPARSPAK, a collection of Fortran 90 subroutines for solving large sparse systems of linear equations and least squares problems developed by Dr. Alan George, into Java (J-SPARSPAK). It is well known that the majority of the solution process is spent in the numeric factorization phase. Initial benchmarks showed Java performing, on average, 3.6 times slower than Fortran for this critical phase. We detail how we improved Java performance to within a factor of two of Fortran.
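
The numeric factorization phase is dominated by short, indirectly indexed inner loops over compressed sparse columns. The sketch below, a standard sparse lower-triangular solve written in C purely for illustration (it is not SPARSPAK code), shows the kind of kernel where array-index bounds checking and compiler quality determine whether Java can keep pace with Fortran:

    /* Sparse lower-triangular solve L*x = b, with L stored in compressed
     * sparse column (CSC) form; x holds b on entry and the solution on exit.
     * Assumes each column's diagonal entry is stored first. Illustrative of
     * factorization-style kernels only, not SPARSPAK code. */
    void sparse_lower_solve(int n, const int *colptr, const int *rowind,
                            const double *val, double *x)
    {
        for (int j = 0; j < n; j++) {
            x[j] /= val[colptr[j]];              /* diagonal entry of column j */
            for (int p = colptr[j] + 1; p < colptr[j + 1]; p++)
                x[rowind[p]] -= val[p] * x[j];   /* scatter update below diag  */
        }
    }
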
177

Development of a Multi-Port Memory Generator and Its Application in the Design of Register Files

Wang, Chen-Yu 06 September 2011 (has links)
The memory unit is one of the fundamental hardware components in system-on-chip (SoC) design, and it accounts for a significant portion of the total area cost. Although commercial memory compilers exist, they usually generate memory units with only a single port or dual ports. However, many SoC designs require memory units that support simultaneous multiple reads and writes, and these cannot be efficiently generated using the existing memory compilers in the standard cell library. In this thesis, we develop a memory generator that automatically produces the circuits of a multi-port SRAM together with all the necessary models required in the standard cell-based design flow. Compared to designs based on dual-port SRAM from memory compilers, which usually consist of duplicated copies of SRAM units to support multiple writes at the same time, the proposed design has a smaller area cost. Furthermore, we employ various low-power design concepts, including power gating and adaptive body bias, to reduce the dynamic and static power of the generated SRAM circuits. Experimental results show that the proposed multi-port SRAM generator can be used to synthesize low-power and low-area register file circuits that support multiple reads and writes at the same time.
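
Behaviourally, a multi-port register file simply accepts several reads and writes in the same cycle. The C model below is a purely illustrative sketch of that behaviour, with hypothetical port counts and read-before-write semantics; it says nothing about the SRAM circuits the generator actually produces:

    /* Behavioural sketch of a 32-entry register file with 2 read ports and
     * 2 write ports; purely illustrative, unrelated to the generated circuits. */
    #include <stdint.h>

    #define RF_ENTRIES 32

    typedef struct { uint32_t regs[RF_ENTRIES]; } regfile_t;

    /* All reads observe the state at the start of the cycle; writes commit at
     * the end (read-before-write), which is one common port-ordering choice. */
    void rf_cycle(regfile_t *rf,
                  int raddr0, int raddr1, uint32_t *rdata0, uint32_t *rdata1,
                  int wen0, int waddr0, uint32_t wdata0,
                  int wen1, int waddr1, uint32_t wdata1)
    {
        *rdata0 = rf->regs[raddr0];
        *rdata1 = rf->regs[raddr1];
        if (wen0) rf->regs[waddr0] = wdata0;
        if (wen1) rf->regs[waddr1] = wdata1;   /* port 1 wins on a write conflict */
    }
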
178

A High Performance Register Allocator for Vector Architectures with a Unified Register-Set

Su, Yu-Dan 29 June 2012 (has links)
This thesis describes a compiler optimization targeted at machines with unified, vector-based register sets. This optimization combines register allocation and instruction scheduling. It examines places where the code performs computations on scalar variables, with the goal of identifying instances where the same operation is performed. For example, a program might calculate "base+offset" and then calculate "i+j". Although these computations are unrelated, they use the same operator; if "base" and "i" are packed into one vector register, while "offset" and "j" are packed into another, then these two computations can be performed simultaneously through the vectors' parallel addition operation. This would reduce the execution time of the compiled code. Although other researchers have considered similar packing methods, their work has been limited by the hardware that they were studying: such hardware usually imposed high costs for moving data between scalar and vector register banks. The present thesis, however, considers a novel hardware architecture that imposes no such costs. As a consequence, we are able to obtain significant speedups. The architecture that we consider is a Graphics Processing Unit (GPU) for embedded systems that is under development at this university. This GPU has a single register set for integers, floats, and vectors.
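
The packing idea can be made concrete with a four-lane integer vector: if base and i occupy the lanes of one register and offset and j the lanes of another, a single vector addition yields base+offset and i+j at once. The C sketch below is a conceptual illustration only, not the thesis's allocator or its target instruction set:

    /* Conceptual sketch of scalar packing: two unrelated scalar additions are
     * carried out by one vector addition. Illustrative only; not the thesis's
     * register allocator or its GPU instruction set. */
    typedef struct { int lane[4]; } ivec4;

    static ivec4 vec_add(ivec4 a, ivec4 b)
    {
        ivec4 r;
        for (int k = 0; k < 4; k++)
            r.lane[k] = a.lane[k] + b.lane[k];
        return r;
    }

    int demo(int base, int offset, int i, int j)
    {
        /* The allocator packs base and i into one vector register, offset and
         * j into another; the two scalar adds become a single vector add. */
        ivec4 va = { { base,   i, 0, 0 } };
        ivec4 vb = { { offset, j, 0, 0 } };
        ivec4 r  = vec_add(va, vb);
        return r.lane[0] + r.lane[1];   /* uses (base+offset) and (i+j) */
    }
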
179

Design and remote control of a Gantry mechanism for the SCARA robot

Surinder Pal, 15 May 2009 (has links)
Remote experimentation and control have led researchers to develop new technologies as well as implement existing techniques. The multidisciplinary nature of research in electromechanical systems has led to a synergy of mechanical engineering, electrical engineering, and computer science. This work describes the design of a model of a Gantry Mechanism that maneuvers a web-cam. The user virtually controls the position of the end-effector of the Gantry Mechanism through a Graphical User Interface (GUI), which is accessed over the Internet. To reduce the unbalanced vibrations of the Gantry Mechanism, we investigate the development of an input-shaping algorithm. A model of the Gantry Mechanism is built and controlled over the Internet to view experiments on the SCARA Robot. The system performance is studied by comparing inputs such as distances and angles with the corresponding outputs, and methods to improve the performance are suggested.
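
Input shaping conventionally convolves the motion command with a short impulse sequence tuned to the mechanism's natural frequency and damping so that the residual vibrations excited by the impulses cancel. The sketch below implements the textbook two-impulse zero-vibration (ZV) shaper as an example of the technique; it is not necessarily the algorithm developed in the thesis:

    /* Textbook two-impulse zero-vibration (ZV) input shaper; not necessarily
     * the shaping algorithm investigated in the thesis. */
    #include <math.h>

    typedef struct {
        double a1, a2;   /* impulse amplitudes (sum to 1)         */
        double t2;       /* time of the second impulse in seconds */
    } zv_shaper_t;

    /* wn: undamped natural frequency (rad/s), zeta: damping ratio in [0, 1). */
    zv_shaper_t zv_design(double wn, double zeta)
    {
        const double pi = 3.14159265358979323846;
        double wd = wn * sqrt(1.0 - zeta * zeta);          /* damped frequency */
        double k  = exp(-zeta * pi / sqrt(1.0 - zeta * zeta));
        zv_shaper_t s;
        s.a1 = 1.0 / (1.0 + k);
        s.a2 = k / (1.0 + k);
        s.t2 = pi / wd;        /* half the damped vibration period */
        return s;
    }

    /* Shaped command at time t: the raw command convolved with the impulses. */
    double zv_apply(const zv_shaper_t *s, double (*cmd)(double), double t)
    {
        double delayed = (t >= s->t2) ? cmd(t - s->t2) : 0.0;
        return s->a1 * cmd(t) + s->a2 * delayed;
    }
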
180

SAGE: An Automatic Analyzing and Parallelizing System to Improve Performance and Reduce Energy on a New High-Performance SoC Architecture – Processor-in-Memory

Chu, Slo-Li 04 October 2002 (has links)
Continuous improvements in semiconductor fabrication density are enabling new classes of System-on-a-Chip (SoC) architectures that combine extensive processing logic with high-density memory. Such architectures are generally called Processor-in-Memory or Intelligent Memory, and they can support high-performance computing by reducing the performance gap between the processor and the memory. This architecture combines various processors in a single system; the processors differ in their computational and memory-access capabilities, and hence in their performance and energy consumption. The two main problems addressed here are how to improve the performance and how to reduce the energy consumption of applications running on Processor-in-Memory architectures. A strategy is therefore needed to identify the capabilities of the different processors and dispatch the most appropriate jobs to them so that they are fully exploited. Accordingly, this study proposes a novel automatic source-to-source parallelizing system, called SAGE, to exploit the advantages of Processor-in-Memory architectures. Unlike conventional iteration-based parallelizing systems, SAGE adopts statement-based analytical approaches. The strategy of the SAGE system, which decomposes the original program into blocks and produces a feasible execution schedule for the host and memory processors, is also investigated. Several techniques, including statement splitting, weight evaluation, performance scheduling, and energy-reduction scheduling, are designed and integrated into the SAGE system to automatically transform Fortran source programs so as to improve their performance or reduce the energy they consume when executed on a Processor-in-Memory architecture. This thesis provides detailed techniques and discusses experimental results for real benchmarks transformed by the SAGE system and targeted at the Processor-in-Memory architecture.
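
The dispatch decision such a system must make can be illustrated with a simple weight comparison: blocks dominated by memory traffic are steered to the memory processor, which sits next to the DRAM, while compute-heavy blocks go to the host. The sketch below uses hypothetical weight fields and a threshold, not SAGE's actual weight-evaluation technique:

    /* Conceptual sketch of weight-based block dispatch for a Processor-in-Memory
     * system; the fields and threshold are hypothetical, not SAGE's analysis. */
    typedef enum { HOST_PROCESSOR, MEMORY_PROCESSOR } target_t;

    typedef struct {
        double compute_weight;   /* estimated arithmetic work in the block    */
        double memory_weight;    /* estimated memory-access cost of the block */
    } block_weight_t;

    /* Steer memory-bound blocks to the in-memory processor, which has cheap
     * access to DRAM, and compute-bound blocks to the faster host processor. */
    target_t dispatch_block(const block_weight_t *w, double ratio_threshold)
    {
        double ratio = w->memory_weight / (w->compute_weight + 1e-9);
        return (ratio > ratio_threshold) ? MEMORY_PROCESSOR : HOST_PROCESSOR;
    }
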
