Global ETD Search

11	Contributions on approximate computing techniques and how to measure them / Contributions sur les techniques de computation approximée et comment les mesurer Rodriguez Cancio, Marcelino 19 December 2017 (has links) La Computation Approximée est basée dans l'idée que des améliorations significatives de l'utilisation du processeur, de l'énergie et de la mémoire peuvent être réalisées, lorsque de faibles niveaux d'imprécision peuvent être tolérés. C'est un concept intéressant, car le manque de ressources est un problème constant dans presque tous les domaines de l'informatique. Des grands superordinateurs qui traitent les big data d'aujourd'hui sur les réseaux sociaux, aux petits systèmes embarqués à contrainte énergétique, il y a toujours le besoin d'optimiser la consommation de ressources. La Computation Approximée propose une alternative à cette rareté, introduisant la précision comme une autre ressource qui peut à son tour être échangée par la performance, la consommation d'énergie ou l'espace de stockage. La première partie de cette thèse propose deux contributions au domaine de l'informatique approximative: Aproximate Loop Unrolling : optimisation du compilateur qui exploite la nature approximative des données de séries chronologiques et de signaux pour réduire les temps d'exécution et la consommation d'énergie des boucles qui le traitent. Nos expériences ont montré que l'optimisation augmente considérablement les performances et l'efficacité énergétique des boucles optimisées (150% - 200%) tout en préservant la précision à des niveaux acceptables. Primer: le premier algorithme de compression avec perte pour les instructions de l'assembleur, qui profite des zones de pardon des programmes pour obtenir un taux de compression qui surpasse techniques utilisées actuellement jusqu'à 10%. L'objectif principal de la Computation Approximée est d'améliorer l'utilisation de ressources, telles que la performance ou l'énergie. Par conséquent, beaucoup d'efforts sont consacrés à l'observation du bénéfice réel obtenu en exploitant une technique donnée à l'étude. L'une des ressources qui a toujours été difficile à mesurer avec précision, est le temps d'exécution. Ainsi, la deuxième partie de cette thèse propose l'outil suivant : AutoJMH : un outil pour créer automatiquement des microbenchmarks de performance en Java. Microbenchmarks fournissent l'évaluation la plus précis de la performance. Cependant, nécessitant beaucoup d'expertise, il subsiste un métier de quelques ingénieurs de performance. L'outil permet (grâce à l'automatisation) l'adoption de microbenchmark par des non-experts. Nos résultats montrent que les microbencharks générés, correspondent à la qualité des manuscrites par des experts en performance. Aussi ils surpassent ceux écrits par des développeurs professionnels dans Java sans expérience en microbenchmarking. / Approximate Computing is based on the idea that significant improvements in CPU, energy and memory usage can be achieved when small levels of inaccuracy can be tolerated. This is an attractive concept, since the lack of resources is a constant problem in almost all computer science domains. From large super-computers processing today’s social media big data, to small, energy-constraint embedded systems, there is always the need to optimize the consumption of some scarce resource. Approximate Computing proposes an alternative to this scarcity, introducing accuracy as yet another resource that can be in turn traded by performance, energy consumption or storage space. The first part of this thesis proposes the following two contributions to the field of Approximate Computing :Approximate Loop Unrolling: a compiler optimization that exploits the approximative nature of signal and time series data to decrease execution times and energy consumption of loops processing it. Our experiments showed that the optimization increases considerably the performance and energy efficiency of the optimized loops (150% - 200%) while preserving accuracy to acceptable levels. Primer: the first ever lossy compression algorithm for assembler instructions, which profits from programs’ forgiving zones to obtain a compression ratio that outperforms the current state-of-the-art up to a 10%. The main goal of Approximate Computing is to improve the usage of resources such as performance or energy. Therefore, a fair deal of effort is dedicated to observe the actual benefit obtained by exploiting a given technique under study. One of the resources that have been historically challenging to accurately measure is execution time. Hence, the second part of this thesis proposes the following tool : AutoJMH: a tool to automatically create performance microbenchmarks in Java. Microbenchmarks provide the finest grain performance assessment. Yet, requiring a great deal of expertise, they remain a craft of a few performance engineers. The tool allows (thanks to automation) the adoption of microbenchmark by non-experts. Our results shows that the generated microbencharks match the quality of payloads handwritten by performance experts and outperforms those written by professional Java developers without experience in microbenchmarking. Computation Approximée Optimisation des compilateurs Diversité logicielle Compression avec perte Microbenchmark Approximate computing Compiler optimizations Software diversity Lossy compression Microbenchmark
12	Sub-Polyhedral Compilation using (Unit-)Two-Variables-Per-Inequality Polyhedra Upadrasta, Ramakrishna 13 March 2013 (has links) (PDF) The goal of this thesis is to design algorithms that run with better complexity when compiling or parallelizing loop programs. The framework within which our algorithms operate is the polyhedral model of compilation which has been successful in the design and implementation of complex loop nest optimizers and parallelizing compilers. The algorithmic complexity and scalability limitations of the above framework remain one important weakness. We address it by introducing sub-polyhedral compilation by using (Unit-)Two-Variable-Per-Inequality or (U)TVPI Polyhedra, namely polyhedrawith restricted constraints of the type ax_{i}+bx_{j}\le c (\pm x_{i}\pm x_{j}\le c). A major focus of our sub-polyhedral compilation is the introduction of sub-polyhedral scheduling, where we propose a technique for scheduling using (U)TVPI polyhedra. As part of this, we introduce algorithms that can be used to construct under-aproximations of the systems of constraints resulting from affine scheduling problems. This technique relies on simple polynomial time algorithms to under approximate a general polyhedron into (U)TVPI polyhedra. The above under-approximation algorithms are generic enough that they can be used for many kinds of loop parallelization scheduling problems, reducing each of their complexities to asymptotically polynomial time. We also introduce sub-polyhedral code-generation where we propose algorithms to use the improved complexities of (U)TVPI sub-polyhedra in polyhedral code generation. In this problem, we show that the exponentialities associated with the widely used polyhedral code generators could be reduced to polynomial time using the improved complexities of (U)TVPI sub-polyhedra. The above presented sub-polyhedral scheduling techniques are evaluated in an experimental framework. For this, we modify the state-of-the-art PLuTo compiler which can parallelize for multi-core architectures using permutation and tiling transformations. We show that using our scheduling technique, the above under-approximations yield polyhedra that are non-empty for 10 out of 16 benchmarks from the Polybench (2.0) kernels. Solving the under-approximated system leads to asymptotic gains in complexity, and shows practically significant improvements when compared to a traditional LP solver. We also verify that code generated by our sub-polyhedral parallelization prototype matches the performance of PLuTo-optimized code when the under-approximation preserves feasibility. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre Affine scheduling Approximation algorithms Compiler optimizations Compilers Loop transformations Optimization Parallelism Asymptotic complexity Code generation
13	Automatic Optimization of Geometric Multigrid Methods using a DSL Approach Vasista, Vinay V January 2017 (has links) (PDF) Geometric Multigrid (GMG) methods are widely used in numerical analysis to accelerate the convergence of partial differential equations solvers using a hierarchy of grid discretizations. These solvers find plenty of applications in various fields in engineering and scientific domains, where solving PDEs is of fundamental importance. Using multigrid methods, the pace at which the solvers arrive at the solution can be improved at an algorithmic level. With the advance in modern computer architecture, solving problems with higher complexity and sizes is feasible - this is also the case with multigrid methods. However, since hardware support alone cannot achieve high performance in execution time, there is a need for good software that help programmers in doing so. Multiple grid sizes and recursive expression of multigrid cycles make the task of manual program optimization tedious and error-prone. A high-level language that aids domain experts to quickly express complex algorithms in a compact way using dedicated constructs for multigrid methods and with good optimization support is thus valuable. Typical computation patterns in a GMG algorithm includes stencils, point-wise accesses, restriction and interpolation of a grid. These computations can be optimized for performance on modern architectures using standard parallelization and locality enhancement techniques. Several past works have addressed the problem of automatic optimizations of computations in various scientific domains using a domain-specific language (DSL) approach. A DSL is a language with features to express domain-specific computations and compiler support to enable optimizations specific to these computations. Halide and PolyMage are two of the recent works in this direction, that aim to optimize image processing pipelines. Many computations like upsampling and downsampling an image are similar to interpolation and restriction in geometric multigrid methods. In this thesis, we demonstrate how high performance can be achieved on GMG algorithms written in the PolyMage domain-specific language with new optimizations we added to the compiler. We also discuss the implementation of non-trivial optimizations, on PolyMage compiler, necessary to achieve high parallel performance for multigrid methods on modern architectures. We realize these goals by: • introducing multigrid domain-specific constructs to minimize the verbosity of the algorithm specification; • storage remapping to reduce the memory footprint of the program and improve cache locality exploitation; • mitigating execution time spent in data handling operations like memory allocation and freeing, using a pool of memory, across multiple multigrid cycles; and • incorporating other well-known techniques to leverage performance, like exploiting multi-dimensional parallelism and minimizing the lifetime of storage buffers. We evaluate our optimizations on a modern multicore system using five different benchmarks varying in multigrid cycle structure, complexity and size, for two-and three-dimensional data grids. Experimental results show that our optimizations: • improve performance of existing PolyMage optimizer by 1.31x; • are better than straight-forward parallel and vector implementations by 3.2x; • are better than hand-optimized versions in conjunction with optimizations by Pluto, a state-of-the-art polyhedral source-to-source optimizer, by 1.23x; and • achieve up to 1.5$\times$ speedup over NAS MG benchmark from the NAS Parallel Benchmarks. (The speedup numbers are Geometric means over all benchmarks) Geometric Multigrid Methods Automatic Optimization Geometric Multigrid Method Compiler Optimizations PolyMage Compiler DSL Approach Automatic Optimization Domain-specific Language Approach GMG Algorithm Polyhedral Optimization Storage Optimization Computer Science
14	Τεχνικές μεταγλωττιστών και αρχιτεκτονικές επεξεργαστών για στατιστικές και δυναμικές εφαρμογές Αλαχιώτης, Νικόλαος 19 July 2010 (has links) Οι σημερινές εφαρμογές έχουν ολοένα και μεγαλύτερες ανάγκες επεξεργαστικής ισχύος προκειμένου να εκτελεστούν σε συντομότερο χρονικό διάστημα. Για να την ικανοποίηση αυτών των χρονικών περιορισμών απαιτείται η ανάπτυξη βελτιστοποιημένων τεχνικών σχεδιασμού. Το αντικείμενο της παρούσας διατριβής σχετίζεται με την ανάπτυξη αρχιτεκτονικών και τεχνικών μεταφραστών με σκοπό την γρηγορότερη τροφοδότηση του επεξεργαστή με δεδομένα από την ιεραρχία μνήμης. α) Μεθοδολογία επιτάχυνσης εκτέλεσης εφαρμογής πολλαπλασιασμού πινάκων Παρουσιάζεται μία μεθοδολογία που βασίζεται στην τοπικότητα των δεδομένων με σκοπό την επιτάχυνση εκτέλεσης του πολλαπλασιασμού πινάκων. Μετά από διερεύνηση, παράγεται ο βέλτιστος τρόπος χρονοπρογραμματισμού των προσπελάσεων στη μνήμη λαμβάνοντας υπόψη την τοπικότητα των δεδομένων και τα μεγέθη των επιπέδων ιεραρχίας μνήμης. Ο χρόνος διερεύνησης είναι σύντομος καθώς απορρίπτονται όλες οι μη-βέλτιστες λύσεις. Η προτεινόμενη μεθοδολογία συγκρίνεται με άλλες υπάρχουσες και παρατηρείται αύξηση της απόδοσης μέχρι 55%. β)Mεθοδολογία αποδοτικής υλοποίησης του Fast Fourier Transform (FFT) Παρουσιάζεται μια νέα μεθοδολογία, που επιτυγχάνει βελτιωμένη απόδοση στην υλοποίηση του FFT, έχοντας ως γνώμονα την ελαχιστοποίηση των προσπελάσεων που πραγματοποιούνται στα δεδομένα. Η προτεινόμενη μεθοδολογία έχει σημαντικά πλεονεκτήματα. Πρώτον, την πλήρη αξιοποίηση της παραγωγής και της κατανάλωσης των αποτελεσμάτων των πεταλούδων του FFT αλγορίθμου, της επαναχρησιμοποίησης δεδομένων και της συμμετρίας των twiddle συντελεστών του FFT αλγορίθμου. Δεύτερον, η βέλτιστη λύση χρονοπρογραμματισμού βρίσκεται λαμβάνοντας υπόψη τόσο τον αριθμό των καταχωρητών, όσο και το μέγεθος της κρυφής μνήμης κάθε επιπέδου, αναζητώντας μόνο τον αριθμό του επιπέδου του tiling του FFT. Τρίτον, ο χρόνος μετάφρασης και το μέγεθος του πηγαίου κώδικα είναι πολύ μικροί συγκρινόμενοι με την SOA βιβλιοθήκη υλοποίησης του FFT αλγορίθμου, την FFTW. Η προτεινόμενη μεθοδολογία επιτυγχάνει αύξηση της απόδοσης μέχρι και 63% σε σχέση με την βιβλιοθήκη FFTW. γ)Ανάπτυξη Αρχιτεκτονικών για Διαχείριση Μνήμης Παρουσιάζεται μια αποσυζευγμένη αρχιτεκτονική επεξεργαστών με μια ιεραρχία μνήμης που αποτελείται μόνο από μνήμες scratch-pad, και μια κύρια μνήμη. Η αρχιτεκτονική αυτή εκμεταλλεύεται τα οφέλη των scratch-pad μνημών και τον παραλληλισμό μεταξύ της επεξεργασίας δεδομένων και υπολογισμού διευθύνσεων. Η αρχιτεκτονική συγκρίνεται στην απόδοση με την αρχιτεκτονική MIPS με cache και με scratch-pad ιεραρχίες μνήμης και παρουσιάζεται η υψηλότερη απόδοσή της. Τα πειραματικά αποτελέσματα δείχνουν ότι η απόδοση αυξάνεται μέχρι 3,7 φορές. Στη συνέχεια γίνεται περαιτέρω έρευνα σε αρχιτεκτονικές με Scratch-pad μνήμες. Παρουσιάζεται μια αρχιτεκτονική που είναι σε θέση να παρέχει πληροφορίες για το ακριβές περιεχόμενο δεδομένων της scratch-pad, κατά τη διάρκεια της εκτέλεσης και μπορεί επίσης να εκτελέσει όλες τις απαραίτητες ενέργειες για την τοποθέτηση των νέων δεδομένων στη scratch-pad. Με αυτόν τον τρόπο, αξιοποιείται η επαναχρησιμοποίηση δεδομένων που εμφανίζεται τυχαία και δεν μπορεί να προσδιοριστεί από το μεταγλωττιστή. Συγκρίνεται με αρχιτεκτονική MIPS που περιέχει cache και με scratch-pad μνήμες και αναδεικνύεται η μεγαλύτερη απόδοσή της. Τα πειραματικά αποτελέσματα δείχνουν ότι η απόδοση αυξάνεται μέχρι 5 φορές έναντι των αρχιτεκτονικών με scratch-pad και 2.5 φορές έναντι των αρχιτεκτονικών με cache. / Modern applications have indence needs in processing power in order to be executed in short time. For satisfying the time limits, there have to be generated new techniques for optimizing the designs. The object of the present thesis is about developing new compiler techniques and hardware architectures which aim to transfer data faster, from the memory hierarchy to the CPU. a) Methdology for accelerating the execution of matrix multiplications A new methodology using the standard MMM algorithm is presented, achieving improved performance by focusing on data locality (both temporal and spatial). This methodology finds the scheduling which conforms with the optimum memory management. The scheduling used for the tile level is different from the element level’s one, having better data locality, suited to the sizes of memory hierarchy. Its exploration time is short, because it searches only for the number of the level of tiling used for finding the best tile size for each cache level. Compared with the best existing related work, which we implemented, better performance up to 55% β)Methodology for increasing performance on Fast Fourier Transform (FFT) A new methodology is presented based on minimizing the memory accesses for FFT. It exploits, the production and comsumption of the FFT batterfly results and the reuse of data. The optimum scheduling solution is found taking into account the number of registers and the cache memory size. The compile time and source code size are short comparing to SOA library. The methodology performance gains are up to 63% comparing to FFTW library. γ)Ανάπτυξη Αρχιτεκτονικών για Διαχείριση Μνήμης A decoupled processors architecture with a memory hierarchy is presented consisting only of scratch–pad memories, and a main memory. This architecture exploits both the benefits of scratch-pad memories and the parallelism between address computation and application data processing. The architecture is compared in performance with the MIPS architecture with cache and with scratch-pad memory hierarchies and with the existing decoupled architectures showing its higher normalized performance. Experimental results show that the performance is increased up to 3.7 times. Continuing, more research is done on Scratch-pad memories. We present an architecture that is able to provide information about the exact data contents of scratch-pad during execution and can also do all the necessary operations for placing the new data blocks in scratch-pad. Thereby, the temporal locality which occurs randomly and can not be identified by the compiler is exploited. It is compared with the MIPS architecture with cache and with scratch-pad memories showing its higher normalized performance. Experimental results show that the performance is increased up to 5 times compared to cache architectures and 2,5 times compared to existing scratch-pad architectures. Υψηλή απόδοση Διαχείριση μνήμης 005.453 Compiler optimizations Computer architecture Embedded systems Low power High performance Memory management
15	Sub-Polyhedral Compilation using (Unit-)Two-Variables-Per-Inequality Polyhedra / Compilation sous-polyédrique reposant sur des systèmes à deux variables par inégalité Upadrasta, Ramakrishna 13 March 2013 (has links) Notre étude de la compilation sous-polyédrique est dominée par l’introduction de la notion l’ordonnancement affine sous-polyédrique, pour laquelle nous proposons une technique utilisant des sous-polyèdres (U)TVPI. Dans ce cadre, nous introduisons des algorithmes capables de construire des sous-approximations de systèmes de contraintes résultant de problèmes d’ordonnancement affine. Cette technique repose sur des algorithmes polynomiaux simples pour approcher un polyèdre quelconque par un polyèdre (U)TVPI. Nos algorithmes sont suffisamment génériques pour s’appliquer à de nombreux problèmes d’ordonnancement, de parallélisation, et d’optimisation de boucles, réduisant leur complexité temporelle à des fonctions polynomiales. Nous introduisons également une méthode pour la génération de code utilisant des algorithmes sous-polyédriques, tirant parti de la faible complexité des sous-polyèdres (U)TVPI. Dans ce cadre, nous montrons comment réduire la complexité associée aux générateurs de code les plus populaires, ramenant la complexité de plusieurs facteurs exponentiels à des fonctions polynomiales. Nombre de ces techniques sont évaluées expérimentalement. Pour cela, nous avons réalisé une version modifiée du compilateur PLuTo, capable de paralléliser et d’optimiser des nids de boucles pour des architectures multi-cœurs à l’aide de transformations affines, et notamment de partitionnement (tiling). Nous montrons qu’une majorité des noyaux de calcul de la suite Polybench (2.0) peut être manipulée à l’aide de notre technique d’ordonnancement, en préservant la faisabilité des polyèdres lors des sous-approximations. L’utilisation des systèmes approchés par des sous-polyèdres conduit à des gains asymptotiques en complexité, qui se traduit par des réductions significatives en temps de compilation, par rapport à un solveur de programmation linéaire de référence. Nous vérifions également que le code généré par notre prototype de parallélisation sous-polyédrique est compétitif par rapport à la performance du code généré par Pluto. / The goal of this thesis is to design algorithms that run with better complexity when compiling or parallelizing loop programs. The framework within which our algorithms operate is the polyhedral model of compilation which has been successful in the design and implementation of complex loop nest optimizers and parallelizing compilers. The algorithmic complexity and scalability limitations of the above framework remain one important weakness. We address it by introducing sub-polyhedral compilation by using (Unit-)Two-Variable-Per-Inequality or (U)TVPI Polyhedra, namely polyhedrawith restricted constraints of the type ax_{i}+bx_{j}\le c (\pm x_{i}\pm x_{j}\le c). A major focus of our sub-polyhedral compilation is the introduction of sub-polyhedral scheduling, where we propose a technique for scheduling using (U)TVPI polyhedra. As part of this, we introduce algorithms that can be used to construct under-aproximations of the systems of constraints resulting from affine scheduling problems. This technique relies on simple polynomial time algorithms to under approximate a general polyhedron into (U)TVPI polyhedra. The above under-approximation algorithms are generic enough that they can be used for many kinds of loop parallelization scheduling problems, reducing each of their complexities to asymptotically polynomial time. We also introduce sub-polyhedral code-generation where we propose algorithms to use the improved complexities of (U)TVPI sub-polyhedra in polyhedral code generation. In this problem, we show that the exponentialities associated with the widely used polyhedral code generators could be reduced to polynomial time using the improved complexities of (U)TVPI sub-polyhedra. The above presented sub-polyhedral scheduling techniques are evaluated in an experimental framework. For this, we modify the state-of-the-art PLuTo compiler which can parallelize for multi-core architectures using permutation and tiling transformations. We show that using our scheduling technique, the above under-approximations yield polyhedra that are non-empty for 10 out of 16 benchmarks from the Polybench (2.0) kernels. Solving the under-approximated system leads to asymptotic gains in complexity, and shows practically significant improvements when compared to a traditional LP solver. We also verify that code generated by our sub-polyhedral parallelization prototype matches the performance of PLuTo-optimized code when the under-approximation preserves feasibility. Ordonnancement affine Algorithmes d'approximation Optimisations du compilateur Compilateurs Transformations de boucles Optimisation Parallélisme Complexité asymptotique Génération de code Affine scheduling Approximation algorithms Compiler optimizations Compilers Loop transformations Optimization Parallelism Asymptotic complexity Code generation

Page generated in 0.1143 seconds