Global ETD Search

1	Efficient Execution Paradigms for Parallel Heterogeneous Architectures Koukos, Konstantinos January 2016 (has links) This thesis proposes novel, efficient execution-paradigms for parallel heterogeneous architectures. The end of Dennard scaling is threatening the effectiveness of DVFS in future nodes; therefore, new execution paradigms are required to exploit the non-linear relationship between performance and energy efficiency of memory-bound application-regions. To attack this problem, we propose the decoupled access-execute (DAE) paradigm. DAE transforms regions of interest (at program-level) in two coarse-grain phases: the access-phase and the execute-phase, which we can independently DVFS. The access-phase is intended to prefetch the data in the cache, and is therefore expected to be predominantly memory-bound, while the execute-phase runs immediately after the access-phase (that has warmed-up the cache) and is therefore expected to be compute-bound. DAE, achieves good energy savings (on average 25% lower EDP) without performance degradation, as opposed to other DVFS techniques. Furthermore, DAE increases the memory level parallelism (MLP) of memory-bound regions, which results in performance improvements of memory-bound applications. To automatically transform application-regions to DAE, we propose compiler techniques to automatically generate and incorporate the access-phase(s) in the application. Our work targets affine, non-affine, and even complex, general-purpose codes. Furthermore, we explore the benefits of software multi-versioning to optimize DAE in dynamic environments, and handle codes with statically unknown access-phase overheads. In general, applications automatically-transformed to DAE by our compiler, maintain (or even exceed in some cases) the good performance and energy efficiency of manually-optimized DAE codes. Finally, to ease the programming environment of heterogeneous systems (with integrated GPUs), we propose a novel system-architecture that provides unified virtual memory with low overhead. The underlying insight behind our work is that existing data-parallel programming models are a good fit for relaxed memory consistency models (e.g., the heterogeneous race-free model). This allows us to simplify the coherency protocol between the CPU – GPU, as well as the GPU memory management unit. On average, we achieve 45% speedup and 45% lower EDP over the corresponding SC implementation. Decoupled Execution Performance Energy DVFS Compiler Optimizations Heterogeneous Coherence
2	A Just in Time Register Allocation and Code Optimization Framework for Embedded Systems Thammanur, Sathyanarayan 11 October 2001 (has links) No description available. compiler optimizations dynamic compilation code generation register allocation
3	Effective Automatic Parallelization and Locality Optimization Using The Polyhedral Model Bondhugula, Uday Kumar 11 September 2008 (has links) No description available. Computer Science Automatic Parallelization Polyhedral Model Compiler Optimizations
4	Optimal Loop Unrolling for GPGPU Programs Sreenivasa Murthy, Giridhar 30 September 2009 (has links) No description available. Computer Science GPGPU Compiler Optimizations Loop Transformations Loop Unrolling
5	(Re-)Creating sharing in Agda's GHC backend Perna, Natalie January 2017 (has links) Agda is a dependently-typed programming language and theorem prover, supporting proof construction in a functional programming style. Due to its incredibly flexible concrete syntax and support for Unicode identifiers, Agda can be used to construct elegant and expressive proofs in a format that is understandable even to those unfamiliar with the tool. However, the semantics of Agda is lacking resource guarantees of the kind that Haskell programmers are used to with lazy evaluation, where multiple uses of function arguments and let-bound variables still result in the corresponding expressions to be evaluated at most once. With the current compiler backends of Agda, a mathematically-natural way to structure programs therefore frequently results in inefficient compiled programs, where the run-time complexity can be exponentional in cases where corresponding Haskell code executes in linear time. This makes a highly-optimised compiler backend a particularly essential tool for practical development with Agda. The main contributions of this thesis are a series of compiler optimisations that inlines simple projections, removes some expressions with trivial evaluations that can be statically inferred, and reduces the need for repeated evaluations of the same expressions by increasing sharing. We developed transformations that focus on the inherent “loss” of sharing that is frequently the result of compiling Agda programs. Where an Agda developer may imagine that value sharing should exist in the generated Haskell code, it often does not. We present several optimising transformations that re-introduce some of this “lost” sharing without affecting the type-theoretic semantics, then apply these optimisations to several typical Agda applications to examine the memory allocation and execution time effects. In measuring the effects of these optimisations on Agda code we show that overall improvements in runtime on the order of 10-20% are possible. We hope that the development and discussion of these optimisations is useful to the Agda developer community, and may be helpful for future contributors interested in implementing new optimisations for Agda. / Thesis / Master of Science (MSc) Compiler optimizations Functional programming Dependent types Computer science
6	Run-time optimization of adaptive irregular applications Yu, Hao 15 November 2004 (has links) Compared to traditional compile-time optimization, run-time optimization could oﬀer signiﬁcant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underneath execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data, In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identiﬁed a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Speciﬁcally, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm speciﬁed by OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses effcient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems. compiler optimizations adaptive optimization performance modeling run-time parallelization run-time optimization reduction parallelization
7	Run-time optimization of adaptive irregular applications Yu, Hao 15 November 2004 (has links) Compared to traditional compile-time optimization, run-time optimization could oﬀer signiﬁcant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underneath execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data, In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identiﬁed a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Speciﬁcally, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm speciﬁed by OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses effcient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems. compiler optimizations adaptive optimization performance modeling run-time parallelization run-time optimization reduction parallelization
8	Characterization and optimization of JavaScript programs for mobile systems Srikanth, Aditya 09 October 2013 (has links) JavaScript has permeated into every aspect of the web experience in today's world, making it highly crucial to process it as quickly as possible. With the proliferation of HTML5 and its associated mobile web applications, the world is slowly but surely moving into an age where majority of the webpages will involve complex computations and manipulations within the JavaScript engine. Recent techniques like Just-in-Time (JIT) compilation have become commonplace in popular browsers like Chrome and Firefox, and there is an ongoing effort to further optimize them in the context of mobile systems. In order to fully take advantage of JavaScript-heavy webpages, it is important to first characterize the interaction of these webpages (both existing pages and modern HTML5 pages) with the different components of the JavaScript engine, viz. the interpreter, the method JIT, the optimizing compiler and the garbage collector. In this thesis, the aforementioned characterization work was leveraged to identify the limits of JavaScript optimizations. Subsequently, a particular optimization, i.e. Register Allocation heuristics was explored in detail on different types of JavaScript programs. This was primarily because the majority of the time (an average of 52.81%) spent in the optimizing compiler is for the register allocation stage alone. By varying the heuristics for register assignment, interval priority and spill selection, a clear idea is obtained about how it impacts certain types of programs more than others. This thesis also gives a preliminary insight into JavaScript applications and benchmarks, showing that these applications tend to be register-intensive, with large live intervals and sparse uses, and sensitive to array and string manipulations. A statically-selected optimal register allocation scheme outperforms the default register allocation scheme resulting in 9.1% performance improvement and 11.23% reduction in execution time on a representative mobile system. / text JavaScript Mobile systems Workload characterization Compiler optimizations Register allocation Firefox SpiderMonkey
9	<b>Compiler and Architecture Co-design for Reliable Computing</b> Jianping Zeng (19199323) 24 July 2024 (has links) <p dir="ltr">Reliability against errors, such as soft errors—transient bit flips in transistors caused by energetic particle strikes—and crash inconsistency arising from power failure, is as crucial as performance and power efficiency for a wide range of computing devices, from embedded systems to datacenters. If not properly addressed, these errors can lead to massive financial losses and even endanger human lives. Furthermore, the dynamic nature of modern computing workloads complicates the implementation of reliable systems as the likelihood and impact of these errors increase. Consequently, system designers often face a dilemma: sacrificing reliability for performance and cost-effectiveness or incurring high manufacturing and/or run-time costs to maintain high system dependability. This trade-off can result in reduced availability and increased vulnerability to errors when reliability is not prioritized or escalated costs when it is.</p><p dir="ltr">In light of this, this dissertation, for the first time, demonstrates that with a synergistic compiler and architecture co-design, it is possible to achieve reliability while maintaining high performance and low hardware cost. We begin by illustrating how compiler/architecture co-design achieves near-zero-overhead soft error resilience for embedded cores (Chapter 2). Next, we introduce ReplayCache (Chapter 3), a software-only approach that ensures crash consistency for energy harvesting systems (backed with embedded cores) and outperforms the state-of-the-art by 9x. Apart from embedded cores, reliability for server-class cores is more vital due to their widespread adoption in performance-critical environments. With that in mind, we then propose VeriPipe (Chapter 4), which showcases how a straightforward microarchitectural technique can achieve near-zero-overhead soft error resilience for server-class cores with a storage overhead of just three registers and one countdown timer. Finally, we present two approaches to achieving performant crash consistency for server-class cores by leveraging existing dynamic register renaming in out-of-order cores (Chapter 5) and repurposing Intel’s non-temporal path (Chapter 6), respectively. Through these innovations, this dissertation paves the way for more reliable and efficient computing systems, ensuring that reliability does not come at the cost of performance degradation or hardware complexity.</p> Compiler optimizations Computer architecture. Reliability and Crash Consistency
10	Τεχνικές μεταγλωττιστών για βελτιστοποίηση ειδικών πυρήνων λογισμικού Σιουρούνης, Κωνσταντίνος 16 June 2011 (has links) Με την ολοένα και αυξανόμενη τάση για ενσωματωμένα (embedded) και φορητά υπολογιστικά συστήματα της σύγχρονης εποχής, έχειδημιουργηθεί ένας ολόκληρος επιστημονικός κλάδος γύρω από τεχνικές βελτιστοποίησης μεταγλωττιστών για ειδικούς πυρήνες λογισμικού που εκτελούνται στα συστήματα αυτά. Κάνοντας χρήση τεχνικών βελτιστοποίησης τα κέρδη είναι πολλαπλά. Καταρχήν οι πυρήνες μπορούν να ολοκληρώσουν το χρόνο που απαιτείται για να ολοκληρωθεί η εκτέλεση τους σε πολύ μικρότερο διάστημα, έχοντας πολύ μικρότερες απαιτήσεις μνήμης. Επίσης μειώνονται οι ανάγκες τους σε επεξεργαστική ισχύ κάτι το οποίο άμεσα οδηγεί στη μείωση κατανάλωσης ενέργειας, στην αύξηση αυτονομίας τους σε περίπτωση που μιλάμε για φορητά συστήματα και στις ανάγκες για ψύξη των συστημάτων αυτών καθώς εκλύονται πολύ μικρότερα ποσά ενέργειας. Έτσι λοιπόν επιτυγχάνονται κέρδη σε πολλούς τομείς (χρόνος εκτέλεσης, ανάγκες μνήμης, αυτονομία, έκλυση θερμότητας) καθιστώντας τον κλάδο των βελτιστοποιήσεων ένα από τους πιο ταχέως αναπτυσσόμενους κλάδους. Εκτός όμως από την σκοπιά της αύξησης επιδόσεων, στην περίπτωση των ενσωματωμένων συστημάτων πραγματικού χρόνου (real time operations) που όταν ξεπερνιούνται οι διορίες χρόνου εκτέλεσης οδηγούνται σε υποβαθμισμένες επιδόσεις (soft real time) και ειδικότερα στην περίπτωση αυτών που οδηγούνται σε αποτυχία όταν ξεπερνιούνται οι διορίες αυτές (hard real time operations), οι τεχνικές αυτές αποτελούν ουσιαστικά μονόδρομο για την υλοποίηση των συστημάτων αυτών σε λογικά επίπεδα κόστους. Η διαδικασία όμως της ανάπτυξης βελτιστοποιήσεων δεν είναι αρκετή καθώς είναι εξίσου σημαντικό το κατά πόσο οι βελτιστοποιήσεις αυτές ταιριάζουν στην εκάστοτε αρχιτεκτονική του συστήματος. Εάν δε ληφθεί υπόψη η αρχιτεκτονική του συστήματος που θα εφαρμοστούν, τότε οι βελτιστοποιήσεις μπορούν να οδηγήσουν σε αντίθετα αποτελέσματα υποβαθμίζοντας την απόδοση του συστήματος. Στην παρούσα διπλωματική εργασία βελτιστοποιείται η διαδικασία πολλαπλασιασμού διανύσματος με πίνακα toeplitz. Κατά την εκπόνηση της αναπτύχθηκε πληθώρα χρονοπρογραμματισμών που στοχεύουν στην βελτιστοποίηση της διαδικασίας αυτής. Μετά από μια εις βάθους μελέτη της ιεραρχίας μνήμης και των τεχνικών βελτιστοποίησης που προσφέρονται για αποδοτικότερη εκμετάλλευσή της, αλλά και των κυριότερων τεχνικών βελτιστοποίησης μεταγλωττιστών, παρουσιάζονται οι κυριότεροι χρονοπρογραμματισμοί, από όσους αναπτύχθηκαν, με τον κάθε ένα να προσφέρει κέρδος σε διαφορετικές αρχιτεκτονικές συστημάτων. Κατά αυτό τον τρόπο αναπτύσσεται ένα εργαλείο που δέχεται σαν είσοδο την αρχιτεκτονική του συστήματος πάνω στο οποίο πρόκειται να γίνει βελτιστοποίηση του εν λόγω πυρήνα, αποκλείονται αρχικά οι χρονοπρογραμματισμοί που δεν είναι κατάλληλοι για την συγκεκριμένη αρχιτεκτονική, ενώ για τους υποψήφιους πιο αποδοτικούς γίνεται εξερεύνηση ούτως ώστε να επιλεγεί ο αποδοτικότερος. / -- Μεταγλωττιστές Διαχείριση μνήμης 005.453 Compilers Embedded systems Compiler optimizations Memory management Computer architecture

Search results