Spelling suggestions: "subject:"compiler"" "subject:"compiled""
121 |
Techniques for Managing Irregular Control Flow on GPUsJad Hbeika (5929730) 25 June 2020 (has links)
<p>GPGPU is a highly multithreaded throughput architecture that can deliver high speed-up for regular applications while remaining energy efficient. In recent years, there has been much focus on tuning irregular applications and/or the GPU architecture to achieve similar benefits for irregular applications as well as efforts to extract data parallelism from task parallel applications. In this work we tackle both problems.</p><p>The first part of this work tackles the problem of Control divergence in GPUs. GPGPUs’ SIMT execution model is ineffective for workloads with irregular control-flow because GPGPUs serialize the execution of divergent paths which lead to thread-level parallelism (TLP) loss. Previous works focused on creating new warps based on the control path threads follow, or created different warps for the different paths, or ran multiple narrower warps in parallel. While all previous solutions showed speedup for irregular workloads, they imposed some performance loss on regular workloads. In this work we propose a more fine-grained approach to exploit <i>intra-warp</i>convergence: rather than threads executing the same code path, <i>opcode-convergent threads</i>execute the same instruction, but with potentially different operands. Based on this new definition we find that divergent control blocks within a warp exhibit substantial opcode convergence. We build a compiler that analyzes divergent blocks and identifies the common streams of opcodes. We modify the GPU architecture so that these common instructions are executed as convergent instructions. Using software simulation, we achieve a 17% speedup over baseline GPGPU for irregular workloads and do not incur any performance loss on regular workloads.</p><p>In the second part we suggest techniques for extracting data parallelism from irregular, task parallel applications in order to take advantage of the massive parallelism provided by the GPU. Our technique involves dividing each task into multiple sub-tasks each performing less work and touching a smaller memory footprint. Our framework performs a locality-aware scheduling that works on minimizing the memory footprint of each warp (a set of threads performing in lock-step). We evaluate our framework with 3 task-parallel benchmarks and show that we can achieve significant speedups over optimized GPU code.</p>
|
122 |
The Design and Implementation of the Tako Language and CompilerVasudeo, Jyotindra 14 December 2006 (has links)
Aliasing complicates both formal and informal reasoning and is a particular problem in object-oriented languages, where variables denote references to objects rather than object values. Researchers have proposed various approaches to the aliasing problem in object-oriented languages, but all use reference semantics to reason about programs. This thesis describes the design and implementation of Tako—a Java-like language that facilitates value semantics by incorporating alias-avoidance. The thesis describes a non-trivial application developed in the Tako language and discusses some of the object-oriented programming paradigm shifts involved in translating that application from Java to Tako. It introduces a proof rule for procedure calls that uses value semantics and accounts for both repeated arguments and subtyping. / Master of Science
|
123 |
Automatic Scheduling of Compute Kernels Across Heterogeneous ArchitecturesLyerly, Robert Frantz 24 June 2014 (has links)
The world of high-performance computing has shifted from increasing single-core performance to extracting performance from heterogeneous multi- and many-core processors due to the power, memory and instruction-level parallelism walls. All trends point towards increased processor heterogeneity as a means for increasing application performance, from smartphones to servers. These various architectures are designed for different types of applications — traditional "big" CPUs (like the Intel Xeon) are optimized for low latency while other architectures (such as the NVidia Tesla K20x) are optimized for high-throughput. These architectures have different tradeoffs and different performance profiles, meaning fantastic performance gains for the right types of applications. However applications that are ill-suited for a given architecture may experience significant slowdown; therefore, it is imperative that applications are scheduled onto the correct processor.
In order to perform this scheduling, applications must be analyzed to determine their execution characteristics. Traditionally this application-to-hardware mapping was determined statically by the programmer. However, this requires intimate knowledge of the application and underlying architecture, and precludes load-balancing by the system. We demonstrate and empirically evaluate a system for automatically scheduling compute kernels by extracting program characteristics and applying machine learning techniques. We develop a machine learning process that is system-agnostic, and works for a variety of contexts (e.g. embedded, desktop/workstation, server). Finally, we perform scheduling in a workload-aware and workload-adaptive manner for these compute kernels. / Master of Science
|
124 |
A Compiler Framework to Support and Exploit Heterogeneous Overlapping-ISA Multiprocessor PlatformsJelesnianski, Christopher Stanisław 15 December 2015 (has links)
As the demand for ever increasingly powerful machines continues, new architectures are sought to be the next route of breaking past the brick wall that currently stagnates the performance growth of modern multi-core CPUs. Due to physical limitations, scaling single-core performance any further is no longer possible, giving rise to modern multi-cores. However, the brick wall is now limiting the scaling of general-purpose multi-cores. Heterogeneous-core CPUs have the potential to continue scaling by reducing power consumption through exploitation of specialized and simple cores within the same chip.
Heterogeneous-core CPUs join fundamentally different processors each which their own peculiar features, i.e., fast execution time, improved power efficiency, etc; enabling the building of versatile computing systems. To make heterogeneous platforms permeate the computer market, the next hurdle to overcome is the ability to provide a familiar programming model and environment such that developers do not have to focus on platform details. Nevertheless, heterogeneous platforms integrate processors with diverse characteristics and potentially a different Instruction Set Architecture (ISA), which exacerbate the complexity of the software. A brave few have begun to tread down the heterogeneous-ISA path, hoping to prove that this avenue will yield the next generation of super computers. However, many unforeseen obstacles have yet to be discovered.
With this new challenge comes the clear need for efficient, developer-friendly, adaptable system software to support the efforts of making heterogeneous-ISA the golden standard for future high-performance and general-purpose computing. To foster rapid development of this technology, it is imperative to put the proper tools into the hands of developers, such as application and architecture profiling engines, in order to realize the best heterogeneous-ISA platform possible with available technology. In addition, it would be in the best interest to create tools to be as "timeless" as possible to expose fundamental concepts industry could benefit from and adopt in future designs.
We demonstrate the feasibility of a compiler framework and runtime for an existing heterogeneous-ISA operating system (Popcorn Linux) for automatically scheduling compute blocks within an application on a given heterogeneous-ISA high-performance platform (in our case a platform built with Intel Xeon - Xeon Phi). With the introduced Profiler, Partitioner, and Runtime support, we prove to be able to automatically exploit the heterogeneity in an overlapping-ISA platform, being faster than native execution and other parallelism programming models.
Empirically evaluating our compiler framework, we show that application execution on Popcorn Linux can be up to 52% faster than the most performant native execution for Xeon or Xeon Phi. Using our compiler framework relieves the developer from manual scheduling and porting of applications, requiring only a single profiling run per application. / Master of Science
|
125 |
On the Complexity of Robust Source-to-Source Translation from CUDA to OpenCLSathre, Paul Daniel 12 June 2013 (has links)
The use of hardware accelerators in high-performance computing has grown increasingly prevalent, particularly due to the growth of graphics processing units (GPUs) as general-purpose (GPGPU) accelerators. Much of this growth has been driven by NVIDIA's CUDA ecosystem for developing GPGPU applications on NVIDIA hardware. However, with the increasing diversity of GPUs (including those from AMD, ARM, and Qualcomm), OpenCL has emerged as an open and vendor-agnostic environment for programming GPUs as well as other parallel computing devices such as the CPU (central processing unit), APU (accelerated processing unit), FPGA (field programmable gate array), and DSP (digital signal processor).
The above, coupled with the broader array of devices supporting OpenCL and the significant conceptual and syntactic overlap between CUDA and OpenCL, motivated the creation of a CUDA-to-OpenCL source-to-source translator. However, there exist sufficient differences that make the translation non-trivial, providing practical limitations to both manual and automatic translation efforts. In this thesis, the performance, coverage, and reliability of a prototype CUDA-to-OpenCL source translator are addressed via extensive profiling of a large body of sample CUDA applications. An analysis of the sample body of applications is provided, which identifies and characterizes general CUDA source constructs and programming practices that obstruct our translation efforts. This characterization then led to more robust support for the translator, followed by an evaluation that demonstrated the performance of our automatically-translated OpenCL is on par with the original CUDA for a subset of sample applications when executed on the same NVIDIA device. / Master of Science
|
126 |
EXTRACT: Extensible Transformation and Compiler TechnologyCalnan, III, Paul W. 29 April 2003 (has links)
Code transformation is widely used in programming. Most developers are familiar with using a preprocessor to perform syntactic transformations (symbol substitution and macro expansion). However, it is often necessary to perform more complex transformations using semantic information contained in the source code. In this thesis, we developed EXTRACT; a general-purpose code transformation language. Using EXTRACT, it is possible to specify, in a modular and extensible manner, a variety of transformations on Java code such as insertion, removal, and restructuring. In support of this, we also developed JPath, a path language for identifying portions of Java source code. Combined, these two technologies make it possible to identify source code that is to be transformed and then specify how that code is to be transformed. We evaluate our technology using three case studies: a type name qualifier which transforms Java class names into fully-qualified class names; a contract checker which enforces pre- and post-conditions across behavioral subtypes; and a code obfuscator which mangles the names of a class's methods and fields such that they cannot be understood by a human, without breaking the semantic content of the class.
|
127 |
Compiling ACE for Distributed-Memory MachinesSong, Jun 05 November 1992 (has links)
Distributed-memory machines offer a very high level of performance, flexibility and scalability. But the memory organization of this kind of machine determines that processes on different processors must communicate explicitly by sending and receiving messages. As a result, the programmer faces the enormously difficult task of detailed planning of algorithm-irrelevant, low-level communication issues. This level of programming resembles writing assembly programs for a sequential machine. ACE is a message-passing language with abstract communication statements. It was defined by Dr. Jingke Li at Portland State University. The communication in ACE is still explicit, but it is abstracted to a higher level. The abstraction can help balance the needs of ease of programming and high performance. This thesis discusses how those high-level communication abstractions can be transformed into low-level communication routines. It presents the design and implementation of a compiler that transforms an ACE program into a C program with low-level communication routines. The compiler is implemented for the Intel iPSC/2 hypercube multiprocessor machine. Compared to their low-level counterparts, ACE programs are easier to write and are more understandable. Compared to their high level counterparts, more efficient code can be generated since the communication information is expressed explicitly in ACE and the compiler itself is much less complex. ACE also enables the users to fine tune some critical communication segments. Some well known parallel algorithms written in ACE are compiled by the compiler as examples, and experimental results of their performance are included.
|
128 |
Validation DSL for client-server applicationsFedorenko, Vitalii M. 10 1900 (has links)
<p>Given the nature of client-server applications, most use some freeform interface, like web forms, to collect user input. The main difficulty with this approach is that all parameters obtained in this fashion need to be validated and normalized to protect the application from invalid entries. This is the problem addressed here: how to take client input and preprocess it before passing the data to a back-end, which concentrates on business logic. The method of implementation is a rule engine that uses Groovy internal domain-specific language (DSL) for specifying input requirements. We will justify why the DSL is a good fit for a validation rule engine, describe existing techniques used in this area and comprehensively address the related issues of accidental complexity, security, and user experience.</p> / Master of Science (MSc)
|
129 |
Design Space Exploration and Optimization of Embedded Memory SystemsRabbah, Rodric Michel 11 July 2006 (has links)
Recent years have witnessed the emergence of microprocessors that are
embedded within a plethora of devices used in everyday life. Embedded
architectures are customized through a meticulous and time consuming
design process to satisfy stringent constraints with respect to
performance, area, power, and cost. In embedded systems, the cost of
the memory hierarchy limits its ability to play as central a
role. This is due to stringent constraints that fundamentally limit
the physical size and complexity of the memory system. Ultimately,
application developers and system engineers are charged with the heavy
burden of reducing the memory requirements of an application.
This thesis offers the intriguing possibility that compilers can play
a significant role in the automatic design space exploration and
optimization of embedded memory systems. This insight is founded upon
a new analytical model and novel compiler optimizations that are
specifically designed to increase the synergy between the processor
and the memory system. The analytical models serve to characterize
intrinsic program properties, quantify the impact of compiler
optimizations on the memory systems, and provide deep insight into the
trade-offs that affect memory system design.
|
130 |
A TRANSLATION OF OCAML GADTS INTO COQPedro da Costa Abreu Junior (18422613) 23 April 2024 (has links)
<p dir="ltr">Proof assistants based on dependent types are powerful tools for building certified software. In order to verify programs written in a different language, however, a representation of those programs in the proof assistant is required. When that language is sufficiently similar to that of the proof assistant, one solution is to use a <i>shallow embedding</i> to directly encode source programs as programs in the proof assistant. One challenge with this approach is ensuring that any semantic gaps between the two languages are accounted for. In this thesis, we present <i>GSet</i>, a mixed embedding that bridges the gap between OCaml GADTs and inductive datatypes in Coq. This embedding retains the rich typing information of GADTs while also allowing pattern matching with impossible branches to be translated without additional axioms. We formalize this with GADTml, a minimal calculus that captures GADTs in OCaml, and gCIC, an impredicative variant of the Calculus of Inductive Constructions. Furthermore, we present the translation algorithm between GADTml and gCIC, together with a proof of the soundness of this translation. We have integrated this technique into coq-of-ocaml, a tool for automatically translating OCaml programs into Coq. Finally, we demonstrate the feasibility of our approach by using our enhanced version of coq-of-ocaml, to translate a portion of the Tezos code base into Coq.</p>
|
Page generated in 0.0781 seconds