181 |
Run-time optimization of adaptive irregular applications / Yu, Hao, 15 November 2004 (has links)
Compared to traditional compile-time optimization, run-time optimization can offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimization during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of the input data, the program's dynamic behavior, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on the program's input data and its dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of a program's dynamic memory access patterns with the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework that automatically and adaptively selects at run-time the best performing, functionally equivalent algorithm for each execution instance. The selection process is based on automatically generated off-line prediction models and on characteristics of the algorithm's input data, collected and analyzed dynamically. In this dissertation, we specialize this framework for the automatic selection of reduction algorithms. We identified a small set of machine-independent, high-level characterization parameters and deployed an off-line, systematic experimental process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors.
Specifically, for reduction algorithm selection, the performance of the selected algorithm is within 2% of optimal and is on average 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems.
|
182 |
Design and Implementation of Java Virtual Machine / Mandal, Abhijit, 06 1900 (has links)
Interpretation of Java bytecode results in slow program execution. The first Java Virtual Machine (JVM) implementations relied on interpretation techniques. Performance can be improved by translating the Java bytecode into machine code with a Just-In-Time (JIT) compiler, a technique now integrated into most JVM implementations.
Java is an automatically garbage-collected language, freeing the programmer from explicit memory management. Garbage collection "pause" time can be reduced by using a generational garbage collector.
This thesis describes an implementation of a JVM. The specific contributions made in this thesis include the development of a Just-In-Time (JIT) compiler using a DAG construction technique, a bytecode interpreter, and a generational garbage collector. Our implementation can execute Java bytecode either with the interpreter, or by translating the bytecode into machine code with the JIT compiler so that the translated code is executed directly by the processor. We have also implemented the Java Native Interface (JNI) to enable the use of C and assembly language programs with Java.
|
183 |
Porting the GCC-Backend to a VLIW-Architecture / Portierung des GCC-Backends auf eine VLIW-Architektur / Parthey, Jan, 26 July 2004 (has links) (PDF)
This diploma thesis discusses the implementation of a GCC target for the Texas Instruments TMS320C6000 DSP platform. To this end, it makes use of mechanisms offered by GCC for porting to new target architectures. GCC internals such as the handling of conditional jumps and the layout of stack frames are investigated and applied to the new architecture. / Diese Diplomarbeit behandelt die Implementierung eines GCC-Targets für die DSP-Plattform TMS320C6000 von Texas Instruments. Dazu werden Mechanismen genutzt, die GCC für die Portierung auf neue Zielplattformen anbietet. GCC-Interna, wie die Behandlung bedingter Sprünge und das Layout von Stack-Frames, werden untersucht und auf die neue Architektur angewendet.
|
184 |
Optimizing the GCC Suite for a VLIW Architecture / Optimierung der GCC Suite für eine VLIW Architektur / Strätling, Adrian, 16 December 2004 (has links) (PDF)
This diploma thesis discusses the applicability of GCC optimization algorithms to the TI TMS320C6x processor family. Conditional and parallel execution are used to speed up the resulting code. It describes the optimization framework of GCC version 4.0 and the implementation details. / Diese Diplomarbeit behandelt die Anwendbarkeit der verschiedenen GCC Optimierungsalgorithmen für die TI TMS320C6x Prozessorfamilie. Bedingte und parallele Ausführbarkeit werden zur Beschleunigung eingesetzt. Sie beschreibt den Rahmen in dem die Optimierungen in Version 4.0 des GCC stattfinden und Details zur Implementierung.
|
185 |
Characterization and optimization of JavaScript programs for mobile systems / Srikanth, Aditya, 09 October 2013 (has links)
JavaScript has permeated every aspect of today's web experience, making it crucial to process it as quickly as possible. With the proliferation of HTML5 and its associated mobile web applications, the world is slowly but surely moving into an age where the majority of webpages will involve complex computations and manipulations within the JavaScript engine. Techniques such as Just-in-Time (JIT) compilation have become commonplace in popular browsers like Chrome and Firefox, and there is an ongoing effort to further optimize them in the context of mobile systems.
In order to fully take advantage of JavaScript-heavy webpages, it is important to first characterize the interaction of these webpages (both existing pages and modern HTML5 pages) with the different components of the JavaScript engine, viz. the interpreter, the method JIT, the optimizing compiler, and the garbage collector. In this thesis, this characterization work was leveraged to identify the limits of JavaScript optimizations. Subsequently, a particular optimization, register allocation heuristics, was explored in detail on different types of JavaScript programs, primarily because the majority of the time spent in the optimizing compiler (an average of 52.81%) goes to the register allocation stage alone. By varying the heuristics for register assignment, interval priority, and spill selection, a clear idea is obtained of how they impact certain types of programs more than others. This thesis also gives a preliminary insight into JavaScript applications and benchmarks, showing that these applications tend to be register-intensive, with large live intervals and sparse uses, and sensitive to array and string manipulations. A statically selected optimal register allocation scheme outperforms the default register allocation scheme, resulting in a 9.1% performance improvement and an 11.23% reduction in execution time on a representative mobile system.
|
186 |
Improving prediction accuracy of hard-to-predict branches using data value correlation / Farooq, Muhammad Umar (active 2013), 17 February 2014 (has links)
The performance of a modern pipelined processor depends on a steady flow of useful instructions for processing. A branch instruction disrupts the sequential flow of instructions by presenting multiple paths through which a program can proceed. By predicting the branch outcome early, branch predictors allow the processor to continue fetching instructions from the predicted path. History-based dynamic branch predictors have been shown to reach high prediction accuracy, yet certain branch types continue to mispredict: multi-target indirect branches and data-dependent direct and indirect branches. These are hard-to-predict branches because their outcomes do not always exhibit repeatable patterns.
This thesis describes branch prediction schemes for improving the prediction accuracy of hard-to-predict branches using data value correlation. In these schemes, instead of relying on branch history information, the compiler identifies program instructions whose output value strongly correlates with the branch outcome. These correlated instructions are tracked at run-time, and their output is used to make branch predictions. Specifically, this thesis proposes the following two branch prediction schemes:
(i) Value-based BTB indexing (VBBI) is a low-cost, compiler-guided scheme for predicting multi-target indirect branches. For every indirect branch, the compiler identifies an instruction whose output strongly correlates with the targets taken by the indirect branch. At run-time, multiple branch targets are stored in, and subsequently accessed from, the BTB using an index formed by hashing the indirect branch PC with the output of the correlated instruction.
(ii) The Store-Load-Branch (SLB) predictor is a compiler-assisted branch prediction scheme for data-dependent branches. Typically, data-dependent branches are associated with program data structures such as arrays and linked lists, and follow a store-load-branch execution sequence: a set of memory locations is written at an earlier point in the program; later, these locations are read and used to evaluate the branch condition. The branch outcome depends on the values stored in the data structure, which normally do not have repeatable patterns. In the SLB scheme, the compiler identifies all program points where the data structure associated with a data-dependent branch is modified. These marked store instructions are tracked at run-time, and the stored values are used to compute branch flags ahead of time. Later, when the branch instruction is fetched, the pre-computed flags are read and used to make predictions.
This thesis introduces new branch prediction schemes, describes the hardware structures and compiler analysis needed to implement them, evaluates their performance impact, and estimates their area, power, and timing overhead.
|
187 |
Υλοποίηση μεταφραστή πηγαίου προς πηγαίο κώδικα για μοντέλο προγραμματισμού OpenMP σε γλώσσα προγραμματισμού C / Implementation of a source-to-source OpenMP compiler for the C programming language / Γιασλάς, Γεώργιος, 16 May 2007 (has links)
Η εργασία αυτή ασχολείται με την υλοποίηση ενός μεταφραστή για το μοντέλο παράλληλου προγραμματισμού OpenMP. Το μοντέλο αυτό χρησιμοποιείται για την υλοποίηση παράλληλων εφαρμογών για την αρχιτεκτονική κοινής μνήμης, και δίνει έμφαση στην ευκολία του προγραμματισμού και την μεταφερσιμότητα των παράλληλων εφαρμογών που αναπτύσσονται μ’ αυτό. Στην εργασία αυτή παρουσιάζεται η σχεδίαση και η υλοποίηση ενός μεταγλωττιστή παράλληλων εφαρμογών OpenMP για τη γλώσσα προγραμματισμού C σύμφωνα με την έκδοση 2.0 του προτύπου OpenMP. Ο μεταγλωττιστής που υλοποιήθηκε ανήκει στην κατηγορία των μεταφραστών, δηλαδή μεταφράζει τον πηγαίο κώδικα σε γλώσσα προγραμματισμού C με τις επεκτάσεις OpenMP, σε ισοδύναμο πηγαίο κώδικα σε γλώσσα προγραμματισμού C, στον οποίο οι επεκτάσεις OpenMP έχουν αντικατασταθεί από ισοδύναμα τμήματα κώδικα τα οποία χρησιμοποιούν κλήσεις βιβλιοθήκης νημάτων για την υλοποίηση του παραλληλισμού. Ο μεταφραστής έχει σχεδιαστεί με τέτοιο τρόπο ώστε να είναι μεταφέρσιμος, υποστηρίζοντας πολλαπλές πλατφόρμες εκτέλεσης, και επεκτάσιμος, υποστηρίζοντας πολλαπλές βιβλιοθήκες νημάτων κατά τη μετάφραση των οδηγιών OpenMP. Η υλοποίηση του μεταφραστή έγινε στη γλώσσα προγραμματισμού Java, χρησιμοποιώντας το γλωσσικό εργαλείο ANTLR για την υλοποίηση του λεκτικού και συντακτικού αναλυτή. Ο μεταφραστής συνοδεύεται, επίσης, από μια βιβλιοθήκη χρόνου εκτέλεσης, η οποία περιέχει τις συναρτήσεις που ορίζονται από το πρότυπο OpenMP v2.0, καθώς και άλλες συναρτήσεις που είναι απαραίτητες για την εκτέλεση των μεταφρασμένων παράλληλων εφαρμογών. Επίσης, στη βιβλιοθήκη υλοποιούνται οι πολιτικές χρονοδρομολόγησης, επιτρέπωντας εύκολα την υλοποίηση κάποιας νέας πολιτικής. Η βιβλιοθήκη έχει υλοποιηθεί σε γλώσσα προγραμματισμού C. Στα πλαίσια της εργασίας αυτής, έχει υλοποιηθεί η υποστήριξη των βιβλιοθηκών νημάτων POSIX και NANOS σε πολυεπεξεργαστικά συστήματα κοινής μνήμης, καθώς και οι πολιτικές χρονοδρομολόγησης που ορίζονται στο πρότυπο OpenMP v2.0. 
/ The subject of this thesis is the implementation of a translator for the OpenMP parallel programming model. This model is used to develop parallel applications for the shared-memory model and emphasizes ease of programming and portability of the applications developed with it. This thesis describes the design and implementation of a compiler for OpenMP parallel applications in the C programming language according to version 2.0 of the OpenMP standard. The compiler belongs to the translator class: it translates source code in the C programming language with OpenMP extensions into equivalent C source code in which the OpenMP extensions have been replaced by equivalent code segments that use thread library calls to implement the parallelism. The translator has been designed to be portable, supporting multiple execution platforms, and extensible, supporting multiple thread libraries during the translation of the OpenMP directives. It has been implemented in the Java language, using the language tool ANTLR for the lexer and the parser. The translator comes with a run-time library, written in C, which contains all of the functions defined by OpenMP v2.0 as well as other functions needed for the execution of the translated parallel applications. The library also implements the scheduling policies, allowing a new policy to be added easily. The current implementation supports the POSIX and NANOS thread libraries on shared-memory SMPs, as well as all the scheduling policies defined in OpenMP v2.0.
|
188 |
Efficient Handling of Narrow Width and Streaming Data in Embedded Applications / Li, Bengu, January 2006 (links)
The embedded environment imposes severe constraints on the system resources available to embedded applications. Performance, memory footprint, and power consumption are critical factors for embedded applications. Meanwhile, the data in embedded applications demonstrate unique properties. More specifically, narrow width data are data representable in considerably fewer bits than one word, which nevertheless occupy an entire register or memory word; streaming data are input data processed sequentially by an application, which stay in the system for a short duration and thus exhibit little data locality. Narrow width and streaming data affect the efficiency of registers, caches, and memory, and must be taken into account when optimizing for performance, memory footprint, and power consumption. This dissertation proposes methods to handle narrow width and streaming data efficiently in embedded applications. Quantitative measurements of narrow width and streaming data are performed to guide the optimizations, and novel architectural features and associated compiler algorithms are developed. To handle narrow width data in registers efficiently, two register allocation schemes are proposed for the ARM processor that allocate two narrow width variables to one register: a static scheme exploits the maximum bitwidth, and a speculative scheme further exploits the dynamic bitwidth. Both result in reduced spill cost and improved performance. To handle narrow width data in memory efficiently, a memory layout method is proposed that coalesces multiple narrow width data into one memory location in a DSP processor, leading to fewer explicit address calculations; this method improves performance and shrinks the memory footprint. To handle streaming data in a network processor efficiently, two cache mechanisms are proposed that enable the reuse of data and computation. The slack created is further transformed into a reduction in energy consumption through a fetch gating mechanism.
|
189 |
Optimizing System Performance and Dependability Using Compiler Techniques / Rajagopalan, Mohan, January 2006 (links)
As systems become more complex, there are increasing demands for improvement with respect to attributes such as performance, dependability, and security. Optimization is defined as the process of making the most effective use of a set of resources with respect to some attribute. Existing optimization techniques, however, have two fundamental limitations: they target individual parts of a system without considering the potentially significant global picture, and they are designed to improve a single attribute at a time. These limitations impose significant restrictions on the kinds of optimization possible, the effectiveness of the techniques, and the ability to improve the optimization process itself. This dissertation presents holistic system optimization, a new approach to optimization based on taking a broad view of a system. Unlike current approaches, holistic optimizations consider different kinds of interactions at multiple levels in a system and target a variety of metrics uniformly. A key component of this research has been the use of proven compiler techniques to ensure transparency, automation, and correctness. These techniques have been implemented in Cassyopia, a software prototype of a framework for performing holistic optimization. The core of this work is three new holistic optimizations, which are also presented. The first comprises profile-directed static optimizations designed to improve the performance of event-based programs by spanning the boundaries that separate code that raises events from the handlers that field them. The second, system call clustering, improves the system call behavior of an entire program by grouping together calls that can be executed in a single boundary crossing; in this case, the optimization spans kernel and user address spaces. Finally, authenticated system calls optimize system security through a novel implementation of an efficient system call monitor.
This example demonstrates how the new approach can be used to create new optimizations that not only span address space boundaries but also target attributes such as dependability. All of these optimizations involve the application of standard compiler techniques in non-traditional contexts and demonstrate how systems can be improved beyond what is possible using existing techniques.
|
190 |
Compiler Support for Fine-grain Software-only Checkpointing / Zhao, Cheng Yan, 13 August 2013 (links)
Checkpointing support allows program execution to roll back to an earlier program point, discarding
any modifications made since that point. Existing software-based checkpointing methods are mainly
libraries that snapshot all of working memory, and hence have prohibitive overhead for many potential
applications. In this thesis we present a light-weight, fine-grain checkpointing framework implemented
entirely in software through compiler transformations and optimizations. A programmer can specify
arbitrary checkpoint regions via a simple API, and the compiler automatically transforms the code to
implement the checkpoint at the granularity of individual stores, optimizing to remove redundancy.
We explore three application areas for this support. First, we investigate its application to debugging,
in particular by providing the ability to rewind to an arbitrarily-placed point in a buggy program’s
execution. A study using BugBench applications shows that our compiler-based approach incurs more
than 100X less overhead than full-process checkpointing. Second, we demonstrate that compiler-based checkpointing support can be leveraged to free the programmer from manually implementing
and maintaining software rollback mechanisms when coding a backtracking algorithm, with runtime
overhead of only 15% compared to the manual implementation. Finally, we utilize the efficient
software-only checkpointing to support overlapping execution with delinquent loads, demonstrating
both control-speculation and data-speculation transformations. We further propose
a theoretical speculative timing model and confirm its predictive effectiveness with real-machine
workloads.
|