261
Software-level analysis and optimization to mitigate the cost of write operations on non-volatile memories. Bouziane, Rabab, 07 December 2018.
Traditional memories such as SRAM, DRAM, and Flash have in recent years faced critical challenges in meeting what modern computing systems require: high performance, high storage density, and low power. As the number of CMOS transistors grows, leakage power becomes a critical issue for energy-efficient systems. SRAM and DRAM consume too much energy and offer low density, while Flash memories have limited write endurance, so these technologies can no longer satisfy the needs of either the embedded or the high-performance computing domain. Emerging Non-Volatile Memories (NVMs) exhibit near-zero static power consumption and attractive properties in storage density, scalability, leakage power, access performance, and write endurance, which makes them a potential replacement for the conventional memories used on-chip and off-chip. Their main drawback is the cost of write operations in terms of latency and energy. As a first contribution, we propose a compiler-level optimization that reduces the number of write operations when STT-RAM is integrated into the cache hierarchy, by eliminating the execution of redundant stores, called silent stores. A store is silent if it writes to a memory location a value that is already stored there. The LLVM-based optimization identifies silent stores in a program and prevents their execution.
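At the source level, the effect of silent-store elimination can be pictured as guarding a store with a load and a comparison. The C sketch below is only an illustration written for this summary (the kernel and function names are invented, and the thesis applies the transformation at the LLVM IR level rather than in source code):

    #include <stddef.h>

    /* Unoptimized kernel: every iteration writes out[i], even when the value
     * already stored there is unchanged. On an NVM cache such as STT-RAM,
     * each of those writes costs significant latency and energy. */
    void scale_clamped(float *out, const float *in, size_t n, float limit) {
        for (size_t i = 0; i < n; i++) {
            float v = in[i] * 2.0f;
            out[i] = (v > limit) ? limit : v;   /* may rewrite the same value */
        }
    }

    /* After silent-store elimination: the store executes only if the new value
     * differs from the one already in memory. The extra load is worthwhile
     * because NVM reads are far cheaper than NVM writes. */
    void scale_clamped_nosilent(float *out, const float *in, size_t n, float limit) {
        for (size_t i = 0; i < n; i++) {
            float v = in[i] * 2.0f;
            float clamped = (v > limit) ? limit : v;
            if (out[i] != clamped)              /* skip the write when it would be silent */
                out[i] = clamped;
        }
    }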
Furthermore, the cost of a write operation depends strongly on the targeted NVM technology and on its retention time, i.e., how long it keeps data without power: the longer the retention time, the higher the latency and energy cost of a write, and vice versa. Building on this, we propose an approach for a multi-bank NVM in which each bank is designed with a specific retention time. We analyze the program to estimate partial worst-case execution times, denoted δ-WCETs: the δ-WCET analysis computes the WCET between any two points in a program, such as two basic blocks or two instructions. From these δ-WCETs, the worst-case lifetimes of variables are derived and used to safely assign each variable to the most appropriate NVM bank.
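As a rough sketch of the allocation step, the C fragment below picks, for each variable, the cheapest bank whose retention time still covers the variable's worst-case lifetime. The data structures, the energy field, and the cost criterion are assumptions made for illustration; they stand in for the δ-WCET-driven decision described above rather than reproduce it.

    #include <stddef.h>
    #include <stdint.h>

    /* One NVM bank, characterized by its retention time (in cycles) and the
     * relative cost of a write to it: shorter retention usually means cheaper,
     * faster writes. */
    typedef struct {
        uint64_t retention_cycles;
        double   write_energy_nj;
    } nvm_bank_t;

    /* Assign a variable to the cheapest bank whose retention time still covers
     * its worst-case lifetime (a delta-WCET between its definition and its last
     * use). Returns the bank index, or -1 if no bank is safe. */
    int assign_bank(const nvm_bank_t *banks, size_t nbanks,
                    uint64_t worst_case_lifetime_cycles) {
        int best = -1;
        for (size_t b = 0; b < nbanks; b++) {
            if (banks[b].retention_cycles < worst_case_lifetime_cycles)
                continue;                /* the data would decay before its last use */
            if (best < 0 || banks[b].write_energy_nj < banks[best].write_energy_nj)
                best = (int)b;
        }
        return best;
    }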
262
Harmony: an execution model for heterogeneous systems. Diamos, Gregory Frederick, 10 November 2011.
The emergence of heterogeneous and many-core architectures presents a
unique opportunity to deliver order-of-magnitude performance
increases to high-performance applications by matching certain classes
of algorithms to specifically tailored architectures. However, their
ubiquitous adoption has been limited by a lack of
programming models and management frameworks designed to reduce the
high degree of complexity of software development inherent to
heterogeneous architectures. This dissertation introduces Harmony, an execution
model for heterogeneous systems that draws heavily from concepts and
optimizations used in processor micro-architecture to provide:
(1) semantics for simplifying heterogeneity management, (2) dynamic scheduling
of compute-intensive kernels to heterogeneous processor resources, and
(3) online, monitoring-driven performance optimization for heterogeneous many-core
systems. This work focuses on simplifying development and ensuring binary
portability and scalability across system configurations and sizes.
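A toy C model of the kind of decision such a runtime makes is sketched below; the device list, the cost fields, and the smoothing factor are invented for the illustration and do not reflect Harmony's actual interfaces.

    /* Each kernel carries per-device runtime estimates that online monitoring
     * keeps refreshing; ready kernels are dispatched to the device currently
     * predicted to be fastest. Purely illustrative. */
    enum { DEV_CPU = 0, DEV_GPU = 1, NUM_DEVS = 2 };

    typedef struct {
        const char *name;
        double est_ms[NUM_DEVS];   /* current runtime estimate per device */
    } kernel_info;

    /* Scheduling decision: pick the device with the lowest current estimate. */
    static int pick_device(const kernel_info *k) {
        return (k->est_ms[DEV_GPU] < k->est_ms[DEV_CPU]) ? DEV_GPU : DEV_CPU;
    }

    /* Online monitoring: fold an observed runtime back into the estimate. */
    static void observe(kernel_info *k, int dev, double measured_ms) {
        k->est_ms[dev] = 0.75 * k->est_ms[dev] + 0.25 * measured_ms;
    }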
263
Comprehensive Path-sensitive Data-flow Analysis. Thakur, Aditya, 07 1900.
Data-flow analysis is an integral part of any aggressive optimizing compiler. We propose a framework for improving the precision of data-flow analysis in the presence of complex control flow. We first perform data-flow analysis to determine which control-flow merges cause the loss of precision. The control-flow graph of the program is then restructured such that performing data-flow analysis on the restructured graph gives more precise results. The proposed framework is both simple, involving the familiar notion of product automata, and general, since it is applicable to any forward or backward data-flow analysis. Apart from proving that our restructuring process is correct, we also show that restructuring is effective in that it necessarily leads to more optimization opportunities.
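A small C example may help picture the precision loss at a control-flow merge and the restructuring that recovers it; it illustrates the general idea only and is not drawn from the proposed framework or its implementation.

    /* Before restructuring: at the merge after the if/else, x is either 1 or 2,
     * so a conventional constant propagator must treat x as "not a constant"
     * and cannot fold the multiplication below. */
    int before(int cond, int y) {
        int x;
        if (cond) x = 1;
        else      x = 2;
        return x * y;            /* x is unknown at this point */
    }

    /* After restructuring: the code following the merge is duplicated into both
     * arms, so x is a known constant on each path and the multiply folds. The
     * price is the code growth that the framework's cost model must weigh. */
    int after(int cond, int y) {
        if (cond) return 1 * y;  /* folds to y     */
        else      return 2 * y;  /* folds to y + y */
    }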
Furthermore, the framework handles the trade-off between the increase in data-flow precision and the code size increase inherent in the restructuring. We show that determining an optimal restructuring is NP-hard, and propose and evaluate a greedy heuristic.
The framework has been implemented in the Scale research compiler and instantiated for the specific problems of Constant Propagation and Liveness analysis. On the SPECINT 2000 benchmark suite, we observe an average speedup of 4% in running times over the Wegman-Zadeck conditional constant propagation algorithm and of 2% over a purely path-profile-guided approach for Constant Propagation. For Liveness analysis, we see an average speedup of 0.8% in running times over the baseline implementation.
264
Isothermality: making speculative optimizations affordable. Pereira, David John, 22 December 2007.
Partial Redundancy Elimination (PRE) is a ubiquitous optimization used by compilers to remove
repeated computations from programs. Speculative PRE (SPRE), which uses program profiles
(statistics obtained from running a program), is more cognizant of trends in run time behaviour
and therefore produces better-optimized programs. Unfortunately, the optimal version of SPRE is
a very expensive algorithm of high-order polynomial time complexity, unlike most compiler
optimizations, which effectively run in time linear in the size of the program they
are optimizing.
This dissertation uses the concept of “isothermality”—the division of a program into a hot region
and a cold region—to create the Isothermal SPRE (ISPRE) optimization, an approximation to
optimal SPRE. Unlike SPRE, which creates and solves a flow network for each program expression
being optimized—a very expensive operation—ISPRE uses two simple bit-vector analyses, optimizing
all expressions simultaneously. We show experimentally that the ISPRE algorithm runs, on
average, nine times faster than the SPRE algorithm, while producing programs that are optimized
competitively.
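The effect can be sketched in C on a loop-invariant expression, a simple special case of the partial redundancies that SPRE and ISPRE remove; the example is illustrative only and is not taken from the dissertation.

    /* Before: a + b is recomputed on every iteration of the loop, which profiles
     * identify as the hot region; the surrounding code is cold. */
    int dot_scaled_before(const int *v, int n, int a, int b) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += v[i] * (a + b);     /* redundant work inside the hot region */
        return s;
    }

    /* After ISPRE-style speculation: the expression is computed once in the cold
     * region ahead of the loop, so the hot region only reuses the result. */
    int dot_scaled_after(const int *v, int n, int a, int b) {
        int t = a + b;               /* hoisted into the cold region */
        int s = 0;
        for (int i = 0; i < n; i++)
            s += v[i] * t;
        return s;
    }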
This dissertation also harnesses isothermality to improve another ubiquitous
compiler optimization, Partial Dead Code Elimination (PDCE), which removes computations
whose values are not used. Isothermal Speculative PDCE (ISPDCE) is a new, simple, and efficient
optimization which requires only three bit-vector analyses. We show experimentally that ISPDCE
produces better optimization than PDCE while keeping a competitive running time.
On account of their small analysis costs, ISPRE and ISPDCE are especially appropriate for use in
Just-In-Time (JIT) compilers.
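For illustration, a minimal C example of the partially dead code that a PDCE-style optimization targets (not taken from the dissertation):

    /* Before: t is computed on every call, but its value is dead whenever the
     * else branch, assumed hot in the profile, is taken. */
    int pdce_before(int x, int flag) {
        int t = x * x + 3 * x;       /* partially dead computation */
        if (flag)
            return t;
        return 0;                    /* t is unused on this path */
    }

    /* After speculative PDCE: the computation is sunk into the branch that
     * actually uses it, so the hot path no longer pays for it. */
    int pdce_after(int x, int flag) {
        if (flag)
            return x * x + 3 * x;
        return 0;
    }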
265
Compiler techniques for optimizing specialized software kernels. Σιουρούνης, Κωνσταντίνος, 16 June 2011.
With the ever-growing trend toward embedded and portable computing systems, an entire research field has grown up around compiler optimization techniques for the specialized software kernels that run on such systems. Applying optimization techniques brings multiple benefits. First, the kernels can complete their execution in far less time and with much smaller memory requirements. Their demands on processing power also drop, which directly reduces energy consumption, increases autonomy in the case of portable systems, and lowers cooling requirements, since much less heat is dissipated. Gains are thus achieved on many fronts (execution time, memory requirements, autonomy, heat dissipation), making optimization one of the most rapidly growing fields.
Beyond the performance perspective, for real-time embedded systems whose performance degrades when execution deadlines are missed (soft real time), and especially for those that fail outright when the deadlines are missed (hard real time), these techniques are essentially the only way to implement such systems at a reasonable cost. Developing optimizations is not enough on its own, however: it is equally important that the optimizations match the architecture of the target system. If the architecture on which they will be applied is not taken into account, optimizations can have the opposite effect and degrade system performance.
This thesis optimizes the multiplication of a vector by a Toeplitz matrix. In the course of the work, a variety of schedules targeting this kernel were developed. After an in-depth study of the memory hierarchy and of the optimization techniques available for exploiting it more efficiently, as well as of the main compiler optimization techniques, the most important of the developed schedules are presented, each offering gains on different system architectures. On this basis, a tool is developed that takes as input the architecture of the system on which the kernel is to be optimized, first rules out the schedules that are unsuitable for that architecture, and then explores the most promising candidates in order to select the most efficient one.
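For reference, a plain C version of the kernel being tuned, a Toeplitz matrix-vector product, might look like the sketch below; the compact row/column representation is an assumption made for the example, and this is only the naive schedule that the thesis's schedules reorder and block for specific memory hierarchies.

    #include <stddef.h>

    /* y = T * x for an n x n Toeplitz matrix T given compactly by col[0..n-1]
     * (its first column) and row[0..n-1] (its first row, with row[0] == col[0]):
     * T[i][j] = col[i - j] when i >= j, and row[j - i] otherwise. Blocked and
     * reordered variants of this loop nest trade loop order for better reuse
     * in the cache hierarchy. */
    void toeplitz_matvec(double *y, const double *col, const double *row,
                         const double *x, size_t n) {
        for (size_t i = 0; i < n; i++) {
            double acc = 0.0;
            for (size_t j = 0; j < n; j++) {
                double t = (i >= j) ? col[i - j] : row[j - i];
                acc += t * x[j];
            }
            y[i] = acc;
        }
    }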
266
Optimized code generation for exploiting parallelism on the IPNoSys architecture. Couto, Juliene Vieira do, 09 September 2016.
Parallel architectures require optimized code that exploits their features. Some architectures follow the Von Neumann paradigm, while others depart from that model, such as the IPNoSys processor. This processor is based on a network-on-chip and features a packet-driven computation model, which is reflected in its programming model. Initially, the architecture had only an assembler and a simulator and lacked a compiler. Compilers for IPNoSys were developed in later work, but none fully exploited the features of this architecture. The objective of this work is therefore to define a code optimization stage in the IPNoSys compiler that takes advantage of previously unexploited characteristics such as parallelism and improves the generated code. The optimization module offers three optimization levels. To evaluate the module, we compared the execution time and the size of the code generated at the three levels. One optimization level achieved better execution times but produced larger applications, while another produced smaller code. Overall, the generated code was improved.
267
Adapting the polytope model for dynamic and speculative parallelization. Jimborean, Alexandra, 14 September 2012.
In this thesis, we present VMAD ("Virtual Machine for Advanced Dynamic analysis and transformation"), a Thread-Level Speculation (TLS) framework whose main feature is its ability to speculatively parallelize a sequential loop nest in various ways, by reordering its iterations; the transformation to apply is selected at run time with the goals of minimizing the number of rollbacks and maximizing performance. Code transformations are performed by applying the polyhedral model, which we adapted to speculative and runtime parallelization. For this purpose, we designed a parallel code pattern that is patched by our runtime system according to profiling information collected on execution samples. Adaptability is ensured by considering chunks of code of different sizes, executed successively, each parallelized differently or run sequentially, depending on the observed memory access behavior. We show on several benchmarks that our framework yields good performance on codes that could not be handled efficiently by previously proposed TLS systems.
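The shape of the chunked, speculative execution that such a runtime drives can be caricatured in C as below; the checkpointing scheme, the function-pointer interface, and the misspeculation signal are simplified placeholders for what VMAD actually implements.

    #include <stddef.h>
    #include <string.h>

    /* A code version processes iterations [lo, hi) of a loop nest over `data`
     * and returns 0 if its speculation (e.g., predicted affine accesses) was
     * invalidated. Simplified sketch. */
    typedef int (*chunk_fn)(double *data, size_t lo, size_t hi);

    /* Run the iteration space chunk by chunk: each chunk is checkpointed, run
     * with the version chosen from profiling, and rolled back and re-executed
     * sequentially if the speculation fails. */
    void run_chunked(double *data, double *backup, size_t n, size_t chunk,
                     chunk_fn speculative_parallel, chunk_fn sequential) {
        for (size_t lo = 0; lo < n; lo += chunk) {
            size_t hi = (lo + chunk < n) ? lo + chunk : n;
            memcpy(backup + lo, data + lo, (hi - lo) * sizeof(double)); /* checkpoint */
            if (!speculative_parallel(data, lo, hi)) {                  /* misspeculation */
                memcpy(data + lo, backup + lo, (hi - lo) * sizeof(double));
                sequential(data, lo, hi);                               /* safe re-execution */
            }
        }
    }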
268
Scalable Register File Architecture for CGRA Accelerators. January 2016.
Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable
of accelerating even non-parallel loops and loops with low trip-counts. One challenge
in compiling for CGRAs is to manage both recurring and nonrecurring variables in
the register file (RF) of the CGRA. Although prior works have managed recurring
variables via a rotating RF, they access the nonrecurring variables through either a
global RF or a constant memory. The former does not scale well, and the latter
degrades the mapping quality. This work proposes a hardware-software codesign
approach to manage all the variables in a local nonrotating RF. The hardware
provides a modulo-addition-based indexing mechanism to enable correct addressing
of recurring variables in a nonrotating RF. The compiler determines the number of
registers required for each recurring variable and configures the boundary between the
registers used for recurring and nonrecurring variables. The compiler also pre-loads
the read-only variables and constants into the local registers in the prologue of the
schedule. Synthesis and place-and-route results for the previous and the proposed RF
designs show that the proposed solution achieves a 17% better cycle time. Experiments
mapping several important, performance-critical loops collected from MiBench
show that the proposed approach improves performance (through better mapping) by 18%
compared to using constant memory.
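The modulo-addition indexing can be modeled with a short C fragment; the field names and the register-window layout are invented for the illustration and are not the actual hardware interface.

    #include <stdint.h>

    /* A recurring variable occupies a window of `count` registers starting at
     * `base`; iteration k of the modulo-scheduled loop accesses the slot
     * base + ((offset + k) mod count). Registers beyond the configured boundary
     * hold nonrecurring variables and the constants preloaded in the prologue. */
    typedef struct {
        uint8_t base;    /* first register of the variable's window            */
        uint8_t count;   /* registers the compiler reserved for the variable   */
        uint8_t offset;  /* slot within the window where this reference starts */
    } recurring_ref;

    static inline uint8_t rf_index(recurring_ref r, uint32_t iteration) {
        return (uint8_t)(r.base + (r.offset + iteration) % r.count);
    }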
269
Contributions on approximate computing techniques and how to measure them. Rodriguez Cancio, Marcelino, 19 December 2017.
Approximate Computing is based on the idea that significant improvements in CPU, energy, and memory usage can be achieved when small levels of inaccuracy can be tolerated. This is an attractive concept, since the lack of resources is a constant problem in almost all computer science domains. From the large supercomputers processing today's social-media big data to small, energy-constrained embedded systems, there is always a need to optimize the consumption of some scarce resource. Approximate Computing proposes an alternative to this scarcity, introducing accuracy as yet another resource that can in turn be traded for performance, energy consumption, or storage space.
The first part of this thesis proposes two contributions to the field of Approximate Computing. Approximate Loop Unrolling is a compiler optimization that exploits the approximable nature of signal and time-series data to decrease the execution time and energy consumption of the loops processing it. Our experiments showed that the optimization considerably increases the performance and energy efficiency of the optimized loops (150% - 200%) while preserving accuracy at acceptable levels. Primer is the first lossy compression algorithm for assembler instructions, which exploits programs' forgiving zones to obtain a compression ratio that outperforms the current state of the art by up to 10%. The main goal of Approximate Computing is to improve the usage of resources such as performance or energy, so a fair deal of effort is dedicated to observing the actual benefit obtained by exploiting a given technique under study. One resource that has historically been challenging to measure accurately is execution time. The second part of this thesis therefore proposes AutoJMH, a tool to automatically create performance microbenchmarks in Java. Microbenchmarks provide the finest-grained performance assessment, yet, because they require a great deal of expertise, they remain the craft of a few performance engineers; the tool makes microbenchmarking accessible to non-experts through automation. Our results show that the generated microbenchmarks match the quality of payloads handwritten by performance experts and outperform those written by professional Java developers without experience in microbenchmarking.
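A hand-written C sketch can convey the flavor of Approximate Loop Unrolling: the loop is unrolled, every other sample of a smooth signal is computed exactly, and the skipped sample is interpolated. The kernel and the reconstruction choice are assumptions made for this example; the actual optimization selects what to skip and how to reconstruct it automatically.

    #include <math.h>
    #include <stddef.h>

    /* Exact version: one expensive operation per sample. */
    void smooth_exact(float *out, const float *in, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = sqrtf(in[i]) * 0.5f;
    }

    /* Approximate, unrolled-by-two version: even samples are computed exactly
     * and odd samples are linearly interpolated from their neighbours, roughly
     * halving the expensive operations at the cost of bounded error on smooth
     * signal data. */
    void smooth_approx(float *out, const float *in, size_t n) {
        if (n == 0) return;
        out[0] = sqrtf(in[0]) * 0.5f;
        size_t i;
        for (i = 2; i < n; i += 2) {
            out[i]     = sqrtf(in[i]) * 0.5f;            /* exact sample        */
            out[i - 1] = 0.5f * (out[i - 2] + out[i]);   /* interpolated sample */
        }
        if (i - 1 < n)                                   /* odd-length tail     */
            out[i - 1] = sqrtf(in[i - 1]) * 0.5f;
    }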
270
A source-to-source compiler for the PRAM language Fork to the REPLICA many-core architecture. Zhou, Cheng, January 2012.
This thesis describes the implementation of a source-to-source compiler that translates the Fork language to the REPLICA baseline language. Fork is a high-level programming language designed for the PRAM (Parallel Random Access Machine) model. The baseline language is a low-level parallel programming language for the REPLICA architecture, which implements the PRAM computing model. To support Fork on REPLICA, a compiler that translates Fork to baseline was built, in compatibility with the existing Fork implementation for the SB-PRAM. Moreover, the libraries that support Fork's features were implemented in the baseline language. The evaluation verifies that the features of the Fork language are supported by the implementation, shows the scalability of the implementation, and shows that the overhead introduced by the Fork-to-baseline translation is small.