Global ETD Search

1	High-Performance Matrix Multiplication: Hierarchical Data Structures, Optimized Kernel Routines, and Qualitative Performance Modeling Wu, Wenhao 02 August 2003 (has links) The optimal implementation of matrix multiplication on modern computer architectures is of great importance for scientific and engineering applications. However, achieving the optimal performance for matrix multiplication has been continuously challenged both by the ever-widening performance gap between the processor and memory hierarchy and the introduction of new architectural features in modern architectures. The conventional way of dealing with these challenges benefits significantly from the blocking algorithm, which improves the data locality in the cache memory, and from the highly tuned inner kernel routines, which in turn exploit the architectural aspects on the specific processor to deliver near peak performance. A state-of-art improvement of the blocking algorithm is the self-tuning approach that utilizes "heroic" combinatorial optimization of parameters spaces. Other recent research approaches include the approach that explicitly blocks for the TLB (Translation Lookaside Buffer) and the hierarchical formulation that employs memoryriendly Morton Ordering (a spaceilling curve methodology). This thesis compares and contrasts the TLB-blocking-based and Morton-Order-based methods for dense matrix multiplication, and offers a qualitative model to explain the performance behavior. Comparisons to the performance of self-tuning library and the "vendor" library are also offered for the Alpha architecture. The practical benchmark experiments demonstrate that neither conventional blocking-based implementations nor the self-tuning libraries are optimal to achieve consistent high performance in dense matrix multiplication of relatively large square matrix size. Instead, architectural constraints and issues evidently restrict the critical path and options available for optimal performance, so that the relatively simple strategy and framework presented in this study offers higher and flatter overall performance. Interestingly, maximal inner kernel efficiency is not a guarantee of global minimal multiplication time. Also, efficient and flat performance is possible at all problem sizes that fit in main memory, rather than "jagged" performance curves often observed in blocking and self-tuned blocking libraries. performance tuning matrix multiplication hierarchical matrix storage cache model
2	Native simulation of MPSoC : instrumentation and modeling of non-functional aspects / Simulation native des MPSoC : instrumentation et modélisation des aspects non fonctionnels Matoussi, Oumaima 30 November 2017 (has links) Les systèmes embarqués modernes intègrent des dizaines, voire des centaines, de cœurs sur une même puce communiquant à travers des réseaux sur puce, afin de répondre aux exigences de performances édictées par le marché. On parle de systèmes massivement multi-cœurs ou systèmes many-cœurs. La complexité de ces systèmes fait de l’exploration de l’espace de conception architecturale, de la co-vérification du matériel et du logiciel, ainsi que de l’estimation de performance, un vrai défi. Cette complexité est généralement com-pensée par la flexibilité du logiciel embarqué. La dominance du logiciel dans ces architectures nécessite de commencer le développement et la vérification du matériel et du logiciel dès les premières étapes du flot de conception, bien avant d’avoir accès à un prototype matériel.Ainsi, il faut disposer d’un modèle abstrait qui reproduit le comportement de la puce cible en un temps raisonnable. Un tel modèle est connu sous le nom de plateforme virtuelle ou de simulation. L’exécution du logiciel sur une telle plateforme est couramment effectuée au moyen d’un simulateur de jeu d’instruction (ISS). Ce type de simulateur, basé sur l’interprétation des instructions une à une, est malheureusement caractérisé par une vitesse de simulation très lente, qui ne fait qu’empirer par l’augmentation du nombre de cœurs.La simulation native est considérée comme une candidate adéquate pour réduire le temps de simulation des systèmes many-cœurs. Le principe de la simulation native est de compiler puis exécuter la quasi totalité de la pile logicielle directement sur la machine hôte tout en communiquant avec des modèles réalistes des composants matériels de l’architecture cible, permettant ainsi de raccourcir les temps de simulation. La simulation native est beau-coup plus rapide qu’un ISS mais elle ne prend pas en compte les aspects non-fonctionnels,tel que le temps d’exécution, dépendant de l’architecture matérielle réelle, ce qui empêche de faire des estimations de performance du logiciel.Ceci dresse le contexte des travaux menés dans cette thèse qui se focalisent sur la simulation native et s’articulent autour de deux contributions majeures. La première s’attaque à l’introduction d’informations non-fonctionnelles dans la représentation intermédiaire (IR)du compilateur. L’insertion précise de telles informations dans le modèle fonctionnel est réalisée grâce à un algorithme dont l’objectif est de trouver des correspondances entre le code binaire cible et le code IR tout en tenant compte des optimisations faites par le compilateur. La deuxième contribution s’intéresse à la modélisation d’un cache d’instruction et d’un tampon d’instruction d’une architecture VLIW pour générer des estimations de performance précises.Ainsi, la plateforme de simulation native associée à des modèles de performance précis et à une technique d’annotation efficace permet, malgré son haut niveau d’abstraction, non seulement de vérifier le bon fonctionnement du logiciel mais aussi de fournir des estimations de performances précises en des temps de simulation raisonnables. / Modern embedded systems are endowed with a high level of parallelism and significantprocessing capabilities as they integrate hundreds of cores on a single chip communicatingthrough network on chip. The complexity of these systems and their dedicated softwareshould not be an excuse for long design cycles, even though the design space is enormousand the underlying design decisions are critical. Thus, design space exploration, hard-ware/software co-verification and performance estimation need to be conducted within areasonable amount of time and early enough in the design process to avoid any tardy de-tection of functional or performance deficiencies.Co-simulation platforms are becoming an increasingly important part in design and ver-ification steps. With instruction interpretation-based software simulation platforms beingtoo slow as they model low-level details of the target system, an alternative software sim-ulation approach known as native simulation or host-compiled simulation has gained mo-mentum this past decade. Native simulation consists of compiling the embedded softwareto the host binary format and executing it directly on the host machine. However, this tech-nique fails to reflect the performance of the embedded software and its actual interactionwith the target hardware. So, the speedup gained by native simulation comes at a price,which is the absence of non-functional information (such as time and energy) needed for es-timating the performance of the entire system and ensuring its proper functioning. Withoutsuch information, native simulation approaches are limited to functional validation.Yielding accurate estimates entails the integration of high-level abstract models thatmimic the behavior of target-specific micro-architectural components in the simulation plat-form and the accurate placement of the obtained non-functional information in the high-level code. Back-annotating non-functional information at the right place requires a map-ping between the binary instructions and the high-level code statements, which can be chal-lenging particularly when compiler optimizations are enabled.In this thesis, we propose an annotation framework working at the compiler interme-diate representation level to accurately annotate performance metrics extracted from thebinary code, thanks to a dedicated mapping algorithm. This mapping algorithm is furtherenhanced to deal with aggressive compiler optimizations, such as loop unrolling, that radi-cally alter the structure of the code. Our target architecture being a VLIW processor, we alsomodel at a high level its instruction buffer to faithfully reproduce its timing behavior.The experiments we conducted to validate our mapping algorithm and component mod-els yielded accurate results and high simulation speed compared to a cycle accurate ISS ofthe target platform. Simulation Vliw Aspects non fonctionnels Estimation de performance Représentation intermédiaire (IR) Modèle de cache d'instruction Simulation Vliw Non-Functional aspects Performance estimation Intermediate representation (IR) Instruction cache model 004
3	Towards Low-Complexity Scalable Shared-Memory Architectures Zeffer, Håkan January 2006 (has links) <p>Plentiful research has addressed low-complexity software-based shared-memory systems since the idea was first introduced more than two decades ago. However, software-coherent systems have not been very successful in the commercial marketplace. We believe there are two main reasons for this: lack of performance and/or lack of binary compatibility.</p><p>This thesis studies multiple aspects of how to design future binary-compatible high-performance scalable shared-memory servers while keeping the hardware complexity at a minimum. It starts with a software-based distributed shared-memory system relying on no specific hardware support and gradually moves towards architectures with simple hardware support.</p><p>The evaluation is made in a modern chip-multiprocessor environment with both high-performance compute workloads and commercial applications. It shows that implementing the coherence-violation detection in hardware while solving the interchip coherence in software allows for high-performing binary-compatible systems with very low hardware complexity. Our second-generation hardware-software hybrid performs on par with, and often better than, traditional hardware-only designs.</p><p>Based on our results, we conclude that it is not only possible to design simple systems while maintaining performance and the binary-compatibility envelope, it is often possible to get better performance than in traditional and more complex designs.</p><p>We also explore two new techniques for evaluating a new shared-memory design throughout this work: adjustable simulation fidelity and statistical multiprocessor cache modeling.</p> shared memory distributed shared memory hardware-software trade-off software coherence coherence profiling remote access cache chip multiprocessor simultaneous multi threading simulation workload characterization statistical cache model Computer engineering Datorteknik
4	Towards Low-Complexity Scalable Shared-Memory Architectures Zeffer, Håkan January 2006 (has links) Plentiful research has addressed low-complexity software-based shared-memory systems since the idea was first introduced more than two decades ago. However, software-coherent systems have not been very successful in the commercial marketplace. We believe there are two main reasons for this: lack of performance and/or lack of binary compatibility. This thesis studies multiple aspects of how to design future binary-compatible high-performance scalable shared-memory servers while keeping the hardware complexity at a minimum. It starts with a software-based distributed shared-memory system relying on no specific hardware support and gradually moves towards architectures with simple hardware support. The evaluation is made in a modern chip-multiprocessor environment with both high-performance compute workloads and commercial applications. It shows that implementing the coherence-violation detection in hardware while solving the interchip coherence in software allows for high-performing binary-compatible systems with very low hardware complexity. Our second-generation hardware-software hybrid performs on par with, and often better than, traditional hardware-only designs. Based on our results, we conclude that it is not only possible to design simple systems while maintaining performance and the binary-compatibility envelope, it is often possible to get better performance than in traditional and more complex designs. We also explore two new techniques for evaluating a new shared-memory design throughout this work: adjustable simulation fidelity and statistical multiprocessor cache modeling. shared memory distributed shared memory hardware-software trade-off software coherence coherence profiling remote access cache chip multiprocessor simultaneous multi threading simulation workload characterization statistical cache model Computer engineering Datorteknik

1

Page generated in 0.067 seconds