91 |
Constraint Programming Techniques for Optimal Instruction SchedulingMalik, Abid 03 1900 (has links)
Modern processors have multiple pipelined functional units
and can issue more than one instruction per clock cycle. This
puts great pressure on the instruction scheduling phase in
a compiler to expose maximum instruction level parallelism.
Basic blocks and superblocks are commonly used regions of code
in a program for instruction scheduling. Instruction scheduling
coupled with register allocation is also a well studied problem
to produce better machine code. Scheduling basic blocks and
superblocks optimally with or with out register allocation is
NP-complete, and is done sub-optimally in production compilers
using heuristic approaches. In this thesis, I present a
constraint programming approach to the superblock and basic
block instruction scheduling problems for both idealized and
realistic architectures. Basic block scheduling with register
allocation with no spilling allowed is also considered. My
models for both basic block and superblock scheduling are
optimal and fast enough to be incorporated into production
compilers. I experimentally evaluated my optimal schedulers on
the SPEC 2000 integer and floating point benchmarks. On this
benchmark suite, the optimal schedulers were very robust and
scaled to the largest basic blocks and superblocks. Depending
on the architectural model, between 99.991\% to 99.999\% of all
basic blocks and superblocks were solved to optimality. The
schedulers were able to routinely solve the largest blocks,
including blocks with up to 2600 instructions. My results compare
favorably to the best previous optimal approaches, which are based
on integer programming and enumeration.
My approach for basic block scheduling without allowing spilling was good
enough to solve 97.496\% of all basic blocks in the SPEC 2000
benchmark. The approach was able to solve basic blocks as
large as 50 instructions for both idealized and realistic
architectures within reasonable time limits.
Again, my results compare favorably to recent work
on optimal integrated code generation, which is
based on integer programming.
|
92 |
A Run-Time Loop Parallelization Technique on Shared-Memory Multiprocessor SystemsWu, Chi-Fan 06 July 2000 (has links)
High performance computing power is important for the current advanced calculations of scientific applications. A multiprocessor system obtains its high performance from the fact that some computations can proceed in parallel. A parallelizing compiler can take a sequential program as input and automatically translate it into parallel form for the target multiprocessor system. But when loops with arrays of irregular, nonlinear or dynamic access patterns, no any current parallelizing compiler can determine whether data dependences exist at compile-time. Thus a run-time parallel algorithm will be utilized to determine dependences and extract the potential parallelism of loops. In this thesis, we propose an efficient run-time parallelization technique to compute a proper parallel execution schedule in those loops. This new method first detects immediate predecessor iterations of each loop iteration and constructs an immediate predecessor table, then efficiently schedules the whole loop iterations into wavefronts for parallel execution. According to either theoretical analysis or experimental results, our new run-time parallelization technique reveals high speedup and low processing overhead. Furthermore, this new technique is appropriate to implement on multiprocessor systems due to the characteristics of high scalability.
|
93 |
An Environment for Automatic Generation of Code OptimizersPaleri, Vineeth Kumar 07 1900 (has links)
Code optimization or code transformation is a complex function of a compiler involving analyses and modifications with the entire program as its scope. In spite of its complexity, hardly any tools exist to support this function of the
compiler. This thesis presents the development of a code transformation system, specifically for scalar transformations, which can be used either as a tool to assist the generation of code transformers or as an environment for experimentation with code transformations.
The development of the code transformation system involves the formal specification of code transformations using dependence relations. We have written formal specifications for the whole class of traditional scalar transformations, including induction variable elimination - a complex transformation - for which no formal specifications are available in the literature. All transformations considered in this thesis are global.
Most of the specifications given here, for
which specifications are already available in the literature, are improved versions, in terms of conservativeness.The study of algorithms for code transformations, in the context of their
formal specification, lead us to the development of a new algorithm for partial redundancy elimination. The basic idea behind the algorithm is the new concepts of safe partial availability and safe partial anticipability. Our algorithm is computationally and lifetime optimal. It
works on flow graphs whose nodes are basic blocks, which makes it practical.In comparison with existing algorithms the new algorithm also requires four unidirectional analyses, but saves some preprocessing time. The main advantage of the algorithm is its conceptual simplicity.
The code transformation system provides an environment in which one can specify a transformation using dependence relations (in
the specification language we have designed), generate code for a transformer from its specification,and experiment with the generated transformers on real-world programs.
The system takes a program to be transformed, in C or FORTRAN, as input,translates it into intermediate code, interacts with the user to decide the transformation to be performed, computes the necessary dependence relations
using the dependence analyzer, applies the specified transformer on the intermediate code, and converts the transformed intermediate code back to high-level. The system is unique of its kind,providing a complete environment for the generation of code transformers, and allowing experimentations with them using real-world programs.
|
94 |
A framework for rapid development of dynamic binary translatorsHolm, David January 2004 (has links)
<p>Binary recompilation and translation play an important role in computer systems today. It is used by systems such as Java and .NET, and system emulators like VMWare and VirtualPC. A dynamic binary translator have several things in common with a regular compiler but as they usually have to translate code in real-time several constraints have to be made, especially when it comes to making code optimisations.</p><p>Designing a dynamic recompiler is a complex process that involves repetitive tasks. Translation tables have to be constructed for the source architecture which contains the data necessary to translate each instruction into binary code that can be executed on the target architecture. This report presents a method that allows a developer to specify how the source and target architectures work using a set of scripting languages. The purpose of these languages is to relocate the repetitive tasks to computer software, so that they do not have to be performed manually by programmers. At the end of the report a simple benchmark is used to evaluate the performance of a basic IA32 emulator running on a PowerPC target that have been implemented using the system described here. The results of the benchmark is compared to the results of running the same benchmark on other, existing, emulators in order to show that the system presented here can compete with the existing methods used today.</p><p>Several ongoing research projects are looking into ways of designing binary translators. Most of these projects focus on ways of optimising code in real-time and how to solve the problems related to binary translation, such as handling self-modifying code.</p>
|
95 |
Incremental Compilation and Dynamic Loading of Functions in OpenModelicaKlinghed, Joel, Jansson, Kim January 2008 (has links)
<p>Advanced development environments are essential for efficient realization of complex industrial products. Powerful equation-based object-oriented (EOO) languages such as Modelica are successfully used for modeling and virtual prototyping complex physical systems and components. The Modelica language enables engineers to build large, sophisticated and complex models. Modelica environments should scale up and be able to handle these large models. This thesis addresses the scalability of Modelica tools by employing incremental compilation and dynamic loading. The design, implementation and evaluation of this approach is presented. OpenModelica is an open-source Modelica environment developed at PELAB in which we have implemented our strategy for incremental compilation and dynamic loading of functions. We have tested the performance of these strategies in a number of different scenarios in order to see how much of an impact they have on the compilation and execution time.</p><p>Our solution contains an overhead of one or two hash calls during runtime as it uses dynamic hashes instead of static arrays.</p>
|
96 |
Performance oriented scheduling with power constraintsHayes, Brian C 01 June 2005 (has links)
Current technology trends continue to increase the power density of modern processors at an exponential rate. The increasing transistor density has significantly impacted cooling and power requirements and if left unchecked, the power barrier will adverselyaffect performance gains in the near future. In this work, we investigate the problem ofinstruction reordering for improving both performance and power requirements. Recently,a new scheduling technique, called Forced Directed Instruction Scheduling, or FDIS, hasbeen proposed in the literature for use in high level synthesis as well as instruction reordering [15, 16, 6]. This thesis extends the FDIS algorithm by adding several features suchas control instruction handling, register renaming in order to obtain better performanceand power reduction. Experimental results indicate that performance improvements up to24.62% and power reduction up to 23.98% are obtained on a selected set of benchmark programs.
|
97 |
Generating RTL for microprocessors from architectural and microarchitectural descriptionBansal, Ankit Sajjan Kumar 17 June 2011 (has links)
Designing a modern processor is a very complex task. Writing the entire design using a hardware description language (like Verilog) is time consuming and difficult to verify. There exists a split architecture/microarchitecture description technique, in which, the description of any hardware can be divided into two orthogonal descriptions: (a) an architectural contract between the user and the implementation, and (b) a microarchitecture which describes the implementation of the architecture. The main aim of this thesis is to build realistic processors using this technique. We have designed an in-order and an
out-of-order superscalar processor using the split-description compiler. The backend of this compiler is another contribution of this thesis. / text
|
98 |
RRQR-MEX - Linux and Windows 32bit MATLAB MEX-Files for the rank revealing QR factorizationSaak, Jens, Schlömer, Stephan 05 January 2010 (has links) (PDF)
The rank revealing QR decomposition is a special form of the well known QR decomposition of a matrix. It uses specialized pivoting strategies and allows for an easy and efficient numerical rank decision for arbitrary matrices. It is especially valuable when column compression of rectangular matrices needs to be performed. Here we provide documentation and compilation instructions for a MATLAB MEX implementation of the RRQR allowing the easy usage of this decomposition inside the MATLAB environment.
|
99 |
Μελέτη σύνθετων εφαρμογών και ανάλυση δυνατοτήτων παραλληλοποίησης τους σε αρχιτεκτονικές κοινής και κοινής/κατανεμημένης μνήμηςΜουτσουρούφης, Γεώργιος 26 September 2007 (has links)
Η χρήση μετροπρογραμμάτων για την μέτρηση της επίδοσης και της απόδοσης συστημάτων αρχιτεκτονικής πολλαπλών επεξεργαστικών στοιχείων και συστημάτων λογισμικού για την υποστήριξη εκτέλεσης παράλληλων εφαρμογών πάνω σε πολυεπεξεργαστικές πλατφόρμες, είναι μία έγκυρη και ευρέως διαδεδομένη μέθοδος. Μάλιστα για την τυποποίηση τέτοιων προγραμμάτων, μεγάλες και παγκοσμίως αναγνωρισμένες ερευνητικές ομάδες, έχουν προτείνει και έχουν παραλληλοποιήσει συλλογή μετροπρογραμμάτων, ως αποτέλεσμα πολυετούς εμπειρίας και έρευνας. Οι πιο γνωστές συλλογές μετροπρογραμμάτων είναι τα SPEC, NAS και SPLASH. Όμως αν και οι κώδικες αυτών των συλλογών ετροπρογραμμάτων είναι διαθέσιμοι στην επιστημονική κοινότητα, δεν είναι δυνατόν να καλύπτουν πλήρως τις ανάγκες των ερευνητικών ομάδων παγκοσμίως που ασχολούνται με την έρευνα και την ανάπτυξη συστημάτων λογισμικού για την υποστήριξη παράλληλων εφαρμογών πάνω σε πολυεπεξεργαστικές πλατφόρμες.
Η παρούσα διπλωματική, επιχειρεί τη μελέτη, την ανάλυση και την παραλληλοποίηση δύο σύνθετων ακολουθιακών εφαρμογών που θα χρησιμοποιηθούν ως μετροπρογράμματα για την αξιολόγηση συστημάτων υποστήριξης παράλληλων εφαρμογών. Οι πλατφόρμες υποστήριξης παράλληλων εφαρμογών αναπτύσσονται στο Εργαστήριο Πληροφοριακών Συστημάτων Υψηλών Επιδόσεων (ΕΠΣΥΕ).
Η διπλωματική αποτελείται από την παρουσίαση και την ανάλυση μεθόδων βελτιστοποίησης εφαρμογών. Αναφέρεται σε βελτιώσεις μονοεπεξεργαστικών συστημάτων, που συνήθως συντελούνται με την αλλαγή κώδικα για την καλύτερη εκμετάλλευση των πόρων του συστήματος, και από την προσαρμογή ή και την αλλαγή των ακολουθιακών αλγορίθμων για παράλληλες αρχιτεκτονικές συστημάτων SMP. Στη συνέχεια επιχειρούμε την βελτιστοποίηση και την παραλληλοποίηση δύο εφαρμογών τις οποίες θα χρησιμοποιεί η ομάδα παρλλαλήλων συστημάτων του εργαστηρίου ΕΠΣΥΕ για την αξιολόγηση του λογισμικού που αναπτύσσει στα πλαίσια των ερευνητικών της δραστηριοτήτων. Η μία εφαρμογή είναι από το χώρο της Ιατρικής και αφορά τον υπολογισμό του απαιτούμενου ποσοστού ακτινοβολίας για την ακτινοθεραπεία όγκων. Η δεύτερη εφαρμογή είναι από το χώρο της μοριακής χημείας και αναφέρεται στον υπολογισμό της κίνησης των μορίων, αερίων εγκλωβισμένων σε μάζες στερεών σωμάτων. Τέλος παραθέτουμε τις μετρήσεις των βελτιστοποιημένων και παραλληλοποιημένων εφαρμογών και τις βελτιώσεις που επιτυγχάνουν.
Η προσπάθεια βελτιστοποίησης και προσαρμογής των συγκεκριμένων αλλά και επιπλέον εφαρμογών σε υπάρχουσες αλλά και σε νέες αρχιτεκτονικές θα συνεχιστεί, με στόχο την απόκτηση της απαιτούμενης τεχνογνωσίας για την βελτιστοποίηση και παραλληλοποίηση δικών μας εφαρμογών/μετροπρογραμμάτων, που θα χρησιμοποιούμε για την αξιολόγηση των πλατφορμών που αναπτύσσουμε για την υποστήριξη παράλληλης και ταυτόχρονης επεξεργασίας. / The use of benchmarks for the measurement of the effectiveness and performance of computer systems with multiple processors and of software systems that support parallel execution of applications on those platforms is a valid and widespread method.
In fact, towards the standardization of such programs large and worldwide acknowledged research teams have proposed and parallelized a collection of benchmarks which are the result of many years of experience and research. The most known collections of such benchmarks are these of SPEC, NASH and SPLASH. Although the codes of this collection of benchmarks are in the disposal of the scientific community, they could not possibly fully cover the needs of the scientific teams all over the world that work in the field of research and development of software systems that support parallel applications the multiprocessor platforms.
This thesis is an effort to study, analyze and parallelize two complex sequential applications that will be used as benchmarks for the evaluation of parallel application support systems. The platforms supporting parallel applications are developed in the High Performance Computer Laboratory in Patra (HPClab).
This thesis includes a presentation and analyses of the optimization methods for parallel applications. It refers to improvements which are usually the result of changes in the programming code, in order to achieve optimal utilization of the given system resources. These improvements may also derive from the adjustment or even the change of the sequential algorithms for parallel SMP system architectures.
What will follow, will be an attempt to optimize and parallelize two applications that will be used by the HPClab team in order to evaluate the software developed in the content of its research activity. The first application comes from the scientific field of medicine and regards the computation of the required amount of radiation applied for the cure of cancer.
The second one comes from the scientific field of engineer chemistry and regards the computation of molecule movement of gases enclosed in solid objects. Finally, the measurements for the optimized and parallelized applications and the improvements achieved will be presented.
The attempt to optimize and adjust these applications as well as others, will continue to be developed in the framework of existent and platforms, with the goal to attain the necessary know how for the optimization and parallelization of our own applications/benchmarks which will be used for the evaluation of the platforms developed and for the support of parallel applications.
|
100 |
Multilevel tiling for non-rectangular interation spacesJiménez Castells, Marta 28 May 1999 (has links)
La motivación principal de esta tesis es el desarrollo de nuevas técnicas de compilación dirigidas a conseguir mayor rendimiento encódigos numéricos complejos que definen es pacios de iteraciones no rectangulares. En particular, nos centramos en la trasformación de "loop tiling" (también conocida como "blocking") y nuestro propósito es mejorar la transformación de loop tiling cuando se aplica a códigos numéricos complejos. Nuestro objetivo es conseguir, a través de la transformación de loop tiling, los mismos o mejores rendimientos que las librerías numéricas proporcionadas por el fabricante que están optimizadas manualmente.En la tesis se muestra que la razón principal por la que los compiladores comerciales actuales consiguen bajos rendimiento en este tipo de aplicaciones es que no son capaces de aplicar loop tiling a nivel de registros. En su lugar, para mejorar la localidad de los datos y el ILP, los compiladores actuales usan y combinan otras transformaciones que no explotan el nivel de registros tan bien como loop tiling. Previamente no se ha considerado aplicar loop tiling a nivel de registro porque en códigos numéricos complejos no es trivial debido a la naturaleza irregular de los espacios de iteraciones. La primera contribución de esta tesis es un algoritmo general de loop tiling a nivel de registros que es aplicable a cualquier tipo de espacio de iteraciones y no sólo a los espacios rectangulares. Nuestro método incluye una heurística muy sencilla para decidir los parámetros de los cortes a nivel de registros. A primera vista parece que loop tiling a nivel de registros (a partir de ahora, register tiling) se tiene que aplicar de tal manera que el bucle que ofrece más reuso temporal de los datos no debe de ser partido. De esta manera maximizamos la reutilización de los registros y minimizamos el número total de load/stores ejecutados. Sin embargo, mostraremos que en espacios de iteraciones no rectangulares, si solamente tenemos en cuenta las direcciones del reuso y no la forma del espacio de iteraciones, los códigos pueden sufrir una degradación en rendimiento. Nuestra segunda contribución es la propuesta de una heurística muy sencilla que determina los parámetros del tiling a nivel de registros considerando no sólo el reuso temporal sino también la forma del espacio de iteraciones. Además, la heurística es suficientemente sencilla para que pueda ser implementada en un compilador comercial.Sin embargo, para conseguir rendimientos similares que códigos optimizados a mano, no es suficiente con aplicar loop tiling a nivel de registros. Con las arquitecturas de hoy en día que disponen de jerarquías de memoria complejas y múltiples procesadores, es necesario que el compilador aplique loop tiling en cuatro o más niveles (paralelismo, cache L2, cache L1 y registros) para conseguir altos rendimientos. Por lo tanto, en las arquitecturas actuales es crucial tener un algoritmo eficiente para aplicar loop tiling en varios niveles de la jerarquía de memoria (tiling multinivel). Además, como mostramos en esta tesis, la transformación de tiling multinivel siempre tendrá que incluir el nivel de registro porque este es el nivel de la jerarquía de memoria que ofrece mayor rendimiento cuando es tratado correctamente.Cuando tiling multinivel incluye el nivel de registros, es necesario que los límites de los bucles sean exactos y que no haya límites redundantes. La razón es que la complejidad y la cantidad de código que se genera con nuestra técnica de register tiling depende polinómicamente del número de límites de los bucles.Sin embargo, hasta ahora, el problema de calcular límites exactos y eliminar límites redundantes es que todas las técnicas conocidas son muy caras en términos de tiempo de compilación y, por ello, difícil de integrar en un compilador comercial. La tercera contribución de esta tesis es una nueva implementación de tiling multinivel que calcula límites exactos y es mucho menos costosa que técnicas tradicionales. Mostraremos que la complejidad de nuestra implementación es proporcional a la complejidad de aplicar una permutación de bucles en el código original (antes de aplicar loop tiling), mientras que las técnicas tradicionales tienen complejidades más altas. Además, nuestra implementación genera menos límites redundantes y permite eliminar los límites redundantes que quedan a menor coste. En conjunto, la eficiencia de nuestra implementación hace posible que pueda ser implementada dentro de un compilador comercial sin tener que preocuparse por los tiempos de compilación.La última parte de esta tesis está dedicada al estudio del rendimiento de tiling multinivel. Se muestran los efectos de tiling en los diferentes niveles de memoria y presentamos datos que comparan los beneficios de tiling a nivel de registros, tiling a nivel de cache y tiling a los dos niveles, cache y registros, simultáneamente. Finalmente, comparamos el rendimiento de códigos optimizados automáticamente con códigos optimizados manualmente (librerías numéricas que ofrecen los fabricantes) sobre dos arquitecturas diferentes (ALPHA 21164 and MIPS R10000) para concluir que actualmente la tecnología de los compiladores hace posible que códigos numéricos complejos consigan el mismo rendimiento que códigos optimizados manualmente. / The main motivation of this thesis is to develop new compilation techniques that address the lack of performance of complex numerical codes consisting of loop nests defining non-rectangular iteration spaces. Specifically, we focus on the loop tiling transformation (also known as blocking) and our purpose is the improvement of the loop tiling transformation when dealing with complex numerical codes. Our goal is to achieve via the loop tiling transformation the same or better performance as hand-optimized vendor-supplied numerical libraries. We will observe that the main reason why current commercial compilers perform poorly when dealing with this type of codes is that they do not apply tiling for the register level. Instead, to enhance locality at this level and to improve ILP, they use/combine other transformations that do not exploit the register level as well as loop tiling. Tiling for the register level has not generally been considered because, in complex numerical codes, it is far from being trivial due to the irregular nature of the iteration space. Our first contribution in this thesis will be a general compiler algorithm to perform tiling at the register level that handles arbitrary iteration space shapes and not only simple rectangular shapes.Our method includes a very simple heuristic to make the tile decisions for the register level. At first sight, register tiling should be performed so that whichever loop carries the most temporal reuse is not tiled. This way, register reuse is maximized and the number of load/store instructions executed is minimized. However, we will show that, for complex loop nests, if we only consider reuse directions and do not take into account the iteration space shape, the tiled loop nest can suffer performance degradation. Our second contribution will be a proposal of a very simple heuristic to determine the tiling parameters for the register level, that considers not only temporal reuse, but also the iteration space shape. Moreover, the heuristic is simple enough to be suitable for automatic implementation by compilers.However, to be able to achieve similar performance to hand-optimized codes, it is not enough by tiling only for the register level. With today's architectures having complex memory hierarchies and multiple processors, it is quite common that the compiler has to perform tiling at four or more levels (parallelism, L2-cache, L1-cache and registers) in order to achieve high performance. Therefore, in today's architectures it is crucial to have an efficient algorithm that can perform multilevel tiling at multiple levels of the memory hierarchy. Moreover, as we will see in this thesis, multilevel tiling should always include the register level, as this is the memory hierarchy level that yields most performance when properly tiled.When multilevel tiling includes the register level, it is critical to compute exact loop bounds and to avoid the generation of redundant bounds. The reason is that the complexity and the amount of code generated by our register tiling technique both depend polynomially on the number of loop bounds. However, to date, the drawback of generating exact loop bounds and eliminating redundant bounds has been that all techniques known were extremely expensive in terms of compilation time and, thus, difficult to integrate in a production compiler. Our third contribution in this thesis will be a new implementation of multilevel tiling that computes exact loop bounds at a much lower complexity than traditional techniques. In fact, we will show that the complexity of our implementation is proportional to the complexity of performing a loop permutation in the original loop nest (before tiling), while traditional techniques have much larger complexities. Moreover, our implementation generates less redundant bounds in the multilevel tiled code and allows removing the remaining redundant bounds at a lower cost. Overall, the efficiency of our implementation makes it possible to integrate multilevel tiling including the register level in a production compiler without having to worry about compilation time.The last part of this thesis is dedicated to studying the performance of multilevel tiling. We will discuss the effects of tiling for different memory levels and present quantitative data comparing the benefits of tiling only for the register level, tiling only for the cache level and tiling for both levels simultaneously. Finally, we will compare automatically-optimized codes against hand-optimized vendor-supplied numerical libraries, on two different architectures (ALPHA 21164 and MIPS R10000), to conclude that compiler technology can make it possible for complex numerical codes to achieve the same performance as hand-optimized codes on modern microprocessors.
|
Page generated in 1.0061 seconds