1.
Improving Instruction Fetch Rate with Code Pattern Cache for Superscalar Architecture. Beg, Azam Muhammad, 06 August 2005.
In the past, instruction fetch speeds have been improved by using cache schemes that capture the actual program flow. In this proposal, we present the architecture of a new instruction cache, the code pattern cache (CPC), intended for use with superscalar processors. CPC's operation rests on two fundamental principles: common programs tend to repeat their execution patterns, and efficient storage of a program's flow can enhance the performance of the instruction fetch mechanism. CPC saves basic blocks (sets of instructions separated by control instructions) and their boundary addresses while the code is running; the boundary addresses are stored in the block pointer cache (BPC) and the basic blocks themselves in the basic block cache (BBC). Later, if the same basic block sequence is expected to execute, it is fetched from the CPC instead of the instruction cache; this mechanism raises the likelihood of delivering a larger number of instructions in every clock cycle. We developed single- and multi-threaded simulators for TC, BC, and CPC, and used them with 10 SPECint2000 benchmarks. The simulation results demonstrated CPC's advantage over TC and BC in terms of trace miss rate and average trace length. Additionally, we used cache models to quantify the timing, area, and power of the three cache schemes. Using an aggregate performance index that combined the simulation and modeling results, CPC was shown to perform better than both TC and BC. During our research, each of the TC, BC, and CPC configurations took 4-6 hours to simulate, so performance comparison of these caches proved to be a very time-consuming process. Neural network models (NNMs) can be time-efficient alternatives to simulation, so we studied their feasibility for representing the cache behavior. We developed two NNMs, one to predict the trace miss rate and the other to predict the average trace length for the three caches. The NNMs modeled the caches with reasonable accuracy and produced results in a fraction of a second.
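
To make the mechanism concrete, the following sketch (hypothetical names and structure, not the thesis implementation) keeps boundary addresses in a small block-pointer table and the instructions of each basic block in a separate block store, so that a repeating block can be delivered as a whole rather than word by word from the I-cache.

```cpp
// Hypothetical sketch of the code-pattern-cache idea: a pointer table keyed by
// block start address plus a block store holding the block's instructions.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;
using Instr = std::uint32_t;

struct CodePatternCache {
    std::unordered_map<Addr, std::size_t> blockPointers; // start addr -> block id
    std::vector<std::vector<Instr>> blockStore;          // instructions per block

    // Record a basic block (terminated by a control instruction) as it retires.
    void record(Addr start, const std::vector<Instr>& instrs) {
        if (blockPointers.find(start) == blockPointers.end()) {
            blockPointers.emplace(start, blockStore.size());
            blockStore.push_back(instrs);
        }
    }

    // On a fetch, a hit delivers the whole block in one shot.
    const std::vector<Instr>* fetch(Addr start) const {
        auto it = blockPointers.find(start);
        return it == blockPointers.end() ? nullptr : &blockStore[it->second];
    }
};

int main() {
    CodePatternCache cpc;
    cpc.record(0x400000, {0x11, 0x22, 0x33});   // a previously executed block
    if (const auto* blk = cpc.fetch(0x400000))  // the pattern repeats
        std::cout << "hit: " << blk->size() << " instructions this cycle\n";
}
```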
2.
Hardware techniques to improve cache efficiency. Liu, Haiming, 19 October 2009.
Modern microprocessors devote a large portion of their chip area to caches in order
to bridge the speed and bandwidth gap between the core and main memory. One
known problem with caches is that they are usually used with low efficiency; only a
small fraction of the cache stores data that will be used before getting evicted. As
the focus of microprocessor design shifts towards achieving higher performance-per-watt,
cache efficiency is becoming increasingly important. This dissertation proposes
techniques to improve both data cache efficiency in general and instruction cache
efficiency for Explicit Data Graph Execution (EDGE) architectures.
To improve the efficiency of data caches and L2 caches, dead blocks (blocks
that will not be referenced again before their eviction from the cache) should be
identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed, based on the individual reference history of the block. Such
schemes achieve only limited prediction accuracy and coverage. We instead delay the prediction
to achieve better prediction accuracy and coverage. For the L1 cache, we propose
a new class of dead-block prediction schemes that predict dead blocks based on
cache bursts. A cache burst begins when a block moves into the MRU position
and ends when it moves out of the MRU position. Cache burst history is more
predictable than individual reference history and results in better dead-block prediction
accuracy and coverage. Experimental results show that predicting the death
of a block at the end of a burst gives the best tradeoff between timeliness and prediction
accuracy/coverage. We also propose mechanisms to improve counting-based
dead-block predictors, which work best at the L2 cache. These mechanisms handle
reference-count variations, which cause problems for existing counting-based dead-block
predictors. The new schemes can identify the majority of the dead blocks with
approximately 90% or higher accuracy. For a 64KB, two-way L1 D-cache, 96% of
the dead blocks can be identified with 96% accuracy, halfway into a block's dead
time. For a 64KB, four-way L1 cache, the prediction accuracy and coverage are 92%
and 91%, respectively. At any moment, the average fraction of the dead blocks that
has been correctly detected for a two-way or four-way L1 cache is approximately
49% or 67%, respectively. For a 1MB, 16-way set-associative L2 cache, 66% of the
dead blocks can be identified with 89% accuracy, one-sixteenth of the way into a
block's dead time. At any moment, 63% of the dead blocks in such an L2 cache, on
average, have been correctly identified by the dead-block predictor. The ability to accurately
identify the majority of the dead blocks in the cache long before their eviction can
lead to not only higher cache efficiency, but also reduced power consumption or
higher reliability.
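
A minimal sketch of the burst-based idea is shown below, under the simplifying assumption that a block's burst count repeats from one cache lifetime to the next (a real predictor would use PC- or signature-indexed history tables rather than per-block learning).

```cpp
// Burst-based dead-block prediction for one set: a burst starts when a block
// becomes MRU and ends when it leaves MRU; predict dead when the current
// lifetime has seen as many bursts as the previous one. Illustrative only.
#include <cstdint>
#include <iostream>
#include <unordered_map>

using Addr = std::uint64_t;

struct BurstPredictor {
    std::unordered_map<Addr, int> learnedBursts; // bursts seen in previous lifetime
    std::unordered_map<Addr, int> liveBursts;    // bursts so far in current lifetime

    // Called when a block moves into the MRU position: a new burst begins.
    void onBurstStart(Addr blk) { ++liveBursts[blk]; }

    // Called when the block leaves the MRU position: the burst ends, so predict.
    bool predictDeadAtBurstEnd(Addr blk) {
        auto it = learnedBursts.find(blk);
        return it != learnedBursts.end() && liveBursts[blk] >= it->second;
    }

    // Called on eviction: remember how many bursts this lifetime had.
    void onEvict(Addr blk) {
        learnedBursts[blk] = liveBursts[blk];
        liveBursts[blk] = 0;
    }
};

int main() {
    BurstPredictor p;
    // First lifetime of block 0xA0: two bursts, then eviction.
    p.onBurstStart(0xA0); p.onBurstStart(0xA0); p.onEvict(0xA0);
    // Second lifetime: only after the second burst ends is the block predicted dead.
    p.onBurstStart(0xA0);
    std::cout << p.predictDeadAtBurstEnd(0xA0) << "\n"; // 0: still live
    p.onBurstStart(0xA0);
    std::cout << p.predictDeadAtBurstEnd(0xA0) << "\n"; // 1: predicted dead
}
```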
In this dissertation, we use the dead-block information to improve cache
efficiency and performance by three techniques: replacement optimization, cache bypassing, and prefetching into dead blocks. Replacement optimization evicts blocks
that become dead after several reuses, before they reach the LRU position. Cache
bypassing identifies blocks that cause cache misses but would not be reused even if they
were written into the cache, and avoids storing such blocks in the cache. Prefetching
into dead blocks replaces dead blocks with prefetched blocks that are likely to be
referenced in the future. Simulation results show that replacement optimization or
bypassing improves performance by 5% and prefetching into dead blocks improves
performance by 12% over the baseline prefetching scheme for the L1 cache and by
13% over the baseline prefetching scheme for the L2 cache. Each of these three
techniques can turn part of the identified dead blocks into live blocks. As new
techniques that can better utilize the space of the dead blocks are found, the dead-block
information is likely to become more valuable.
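
How dead-block information can drive victim selection and bypassing can be sketched for a single cache set; the predictor callbacks below are placeholders for whichever dead-block predictor is attached, not the exact mechanisms of this dissertation.

```cpp
// Rough sketch: prefer evicting a predicted-dead block over the LRU block
// (replacement optimization), and skip allocation for blocks predicted to be
// dead on arrival (cache bypassing).
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

struct Way { std::uint64_t tag = 0; bool valid = false; int lruAge = 0; };

struct Set {
    std::vector<Way> ways{4};

    // Evict a predicted-dead block before it drifts to the LRU position;
    // otherwise fall back to plain LRU.
    int pickVictim(const std::function<bool(std::uint64_t)>& predictDead) const {
        int lru = 0;
        for (int i = 0; i < (int)ways.size(); ++i) {
            if (!ways[i].valid) return i;
            if (predictDead(ways[i].tag)) return i;
            if (ways[i].lruAge > ways[lru].lruAge) lru = i;
        }
        return lru;
    }

    // A missing block predicted dead-on-arrival is served but never allocated.
    void onMiss(std::uint64_t tag,
                const std::function<bool(std::uint64_t)>& predictDead,
                const std::function<bool(std::uint64_t)>& predictBypass) {
        if (predictBypass(tag)) return;            // bypass: do not cache it
        int v = pickVictim(predictDead);
        ways[v] = Way{tag, true, 0};
        for (int i = 0; i < (int)ways.size(); ++i)
            if (i != v) ++ways[i].lruAge;          // age the other ways
    }
};

int main() {
    Set s;
    auto dead   = [](std::uint64_t t) { return t == 0x20; };
    auto bypass = [](std::uint64_t t) { return t == 0x30; };
    s.onMiss(0x10, dead, bypass);
    s.onMiss(0x20, dead, bypass);
    s.onMiss(0x30, dead, bypass);                  // bypassed, never allocated
    s.onMiss(0x40, dead, bypass);                  // evicts the predicted-dead 0x20
    std::cout << std::hex << s.ways[1].tag << "\n"; // 40
}
```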
Compared to RISC architectures, the instruction cache in EDGE architectures
faces challenges such as a higher miss rate, because of the increase in code size,
and a longer miss penalty, because of the large block size and the distributed microarchitecture.
To improve the instruction cache efficiency in EDGE architectures,
we decouple the next-block prediction from the instruction fetch so that the next-block
prediction can run ahead of instruction fetch and the predicted blocks can be
prefetched into the instruction cache before they cause any I-cache misses. In particular,
we discuss how to decouple the next-block prediction from the instruction
fetch and how to control the run-ahead distance of the next-block predictor in a
fully distributed microarchitecture. The performance benefit of such a look-ahead
instruction prefetching scheme is then evaluated and the run-ahead distance that
gives the best performance improvement is identified. In addition to prefetching,
we also estimate the performance benefit of storing variable-sized blocks in the instruction
cache. Such schemes reduce the inefficiency caused by storing NOPs in the
I-cache and enable the I-cache to store more blocks with the same capacity. Simulation results show that look-ahead instruction prefetching and storing variable-sized
blocks can improve the performance of the benchmarks that have high I-cache miss
rates by 17% and 18% respectively, out of an ideal 30% performance improvement
only achievable by a perfect I-cache. Such techniques will close the gap in I-cache
hit rates between EDGE architectures and RISC architectures, although the latter
will still have higher I-cache hit rates because of the smaller code size.
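
A rough sketch of the decoupling idea follows, with the run-ahead distance modeled as the maximum number of predicted blocks allowed in flight (illustrative only; misprediction recovery and the distributed microarchitecture are ignored).

```cpp
// The next-block predictor runs ahead of fetch, pushing predicted block
// addresses into a bounded queue; addresses entering the queue are prefetched
// into the I-cache before the fetch unit needs them.
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_set>

using Addr = std::uint64_t;

struct LookaheadPrefetcher {
    std::size_t runAheadDistance;      // max predictions in flight
    std::deque<Addr> predicted;        // blocks predicted but not yet fetched
    std::unordered_set<Addr> icache;   // crude I-cache: set of resident blocks

    explicit LookaheadPrefetcher(std::size_t d) : runAheadDistance(d) {}

    // Predictor output, throttled so it cannot run arbitrarily far ahead
    // down a possibly wrong path.
    bool predict(Addr block) {
        if (predicted.size() >= runAheadDistance) return false; // stall predictor
        predicted.push_back(block);
        icache.insert(block);          // issue the prefetch immediately
        return true;
    }

    // Fetch consumes predictions in order; a hit means no I-cache miss.
    bool fetch(Addr block) {
        if (!predicted.empty() && predicted.front() == block) predicted.pop_front();
        return icache.count(block) != 0;
    }
};

int main() {
    LookaheadPrefetcher pf(2);         // run-ahead distance of 2 blocks
    pf.predict(0x100);
    pf.predict(0x180);
    std::cout << pf.predict(0x200) << "\n"; // 0: predictor stalled at distance 2
    std::cout << pf.fetch(0x100) << "\n";   // 1: block already resident
    std::cout << pf.predict(0x200) << "\n"; // 1: room freed, prediction accepted
}
```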
3.
Profilem řízené optimalizace pro instrukční vyrovnávací paměti / Profile-Guided Optimizations for Instruction Caches. Bobek, Jiří, January 2015.
Instruction cache performance is very important for the overall performance of a computer. The placement of code blocks in memory can significantly affect the cache miss rate, which means that a compiler can improve the performance of a program by placing parts of the code at the right addresses in memory. This work discusses several methods for collecting profile information and describes an algorithm that uses profile information to guide code block placement. The algorithm is then integrated into the optimizer of the LLVM compiler, and the resulting improvements in cache performance are evaluated.
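
As a rough illustration of profile-guided block placement, the sketch below uses a greedy chain-forming heuristic in the spirit of Pettis and Hansen: edges are visited in decreasing profile count and blocks are chained so that hot fall-through paths stay contiguous. This is an assumed example, not necessarily the algorithm developed in this work.

```cpp
// Greedy layout of basic blocks from edge profile counts. Cycle handling and
// many practical details are omitted for brevity.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Edge { std::string from, to; long count; };

// Lay out blocks so that the hottest fall-through edges stay adjacent.
std::vector<std::string> layout(const std::vector<std::string>& blocks,
                                std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.count > b.count; });
    std::map<std::string, std::string> next;   // chosen successor per block
    std::map<std::string, bool> hasPred;       // block already has a predecessor
    for (const auto& e : edges)
        if (!next.count(e.from) && !hasPred[e.to] && e.from != e.to) {
            next[e.from] = e.to;
            hasPred[e.to] = true;
        }
    std::vector<std::string> order;
    for (const auto& b : blocks) {             // emit each chain from its head
        if (hasPred[b]) continue;
        for (std::string cur = b; ; cur = next[cur]) {
            order.push_back(cur);
            if (!next.count(cur)) break;
        }
    }
    return order;
}

int main() {
    std::vector<std::string> blocks = {"entry", "cold", "hot", "exit"};
    std::vector<Edge> profile = {{"entry", "hot", 900}, {"entry", "cold", 100},
                                 {"hot", "exit", 900}, {"cold", "exit", 100}};
    for (const auto& b : layout(blocks, profile)) std::cout << b << " ";
    std::cout << "\n";                         // entry hot exit cold
}
```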
4.
Native simulation of MPSoC: instrumentation and modeling of non-functional aspects / Simulation native des MPSoC : instrumentation et modélisation des aspects non fonctionnels. Matoussi, Oumaima, 30 November 2017.
Modern embedded systems integrate tens or even hundreds of cores on a single chip, communicating through networks on chip, in order to meet the performance requirements dictated by the market; such systems are called massively multi-core or many-core systems. Their complexity makes architectural design space exploration, hardware/software co-verification, and performance estimation a real challenge, and this complexity is generally compensated by the flexibility of the embedded software. The dominance of software in these architectures requires hardware and software development and verification to start in the earliest stages of the design flow, well before a hardware prototype is available, so that functional or performance deficiencies are not detected too late.

This calls for an abstract model that reproduces the behavior of the target chip in a reasonable time, known as a virtual platform or simulation platform. Software execution on such a platform is commonly carried out by an instruction set simulator (ISS). Because an ISS interprets instructions one by one and models low-level details of the target system, it is very slow, and it only gets slower as the number of cores grows. Native simulation (also known as host-compiled simulation) is considered a suitable candidate for reducing the simulation time of many-core systems: almost the entire software stack is compiled for and executed directly on the host machine, while communicating with realistic models of the hardware components of the target architecture. Native simulation is much faster than an ISS, but it does not capture non-functional aspects such as execution time or energy, which depend on the actual target hardware; without this information, native simulation is limited to functional validation and cannot provide software performance estimates. Producing accurate estimates requires integrating high-level models that mimic the behavior of target-specific micro-architectural components into the simulation platform, and placing the resulting non-functional information accurately in the high-level code. Back-annotating this information at the right place requires a mapping between the binary instructions and the high-level code statements, which is challenging, particularly when compiler optimizations are enabled.

This is the context of the work carried out in this thesis, which focuses on native simulation and makes two major contributions. The first is an annotation framework working at the compiler intermediate representation (IR) level: performance metrics extracted from the target binary code are accurately inserted into the functional model thanks to a dedicated mapping algorithm that finds correspondences between the target binary code and the IR code while taking compiler optimizations into account; the algorithm is further enhanced to deal with aggressive optimizations, such as loop unrolling, that radically alter the structure of the code. The second contribution is the modeling of the instruction cache and the instruction buffer of a VLIW architecture at a high level of abstraction, so that their timing behavior is faithfully reproduced and precise performance estimates can be generated. The experiments conducted to validate the mapping algorithm and the component models yielded accurate results and high simulation speed compared to a cycle-accurate ISS of the target platform. The native simulation platform, combined with precise performance models and an efficient annotation technique, can therefore, despite its high level of abstraction, both verify the correct functioning of the software and provide accurate performance estimates within reasonable simulation times.
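
The back-annotation step can be pictured with a toy example. The block names, the binary-to-IR mapping, and the cycle costs below are invented for illustration; in the actual flow the mapping is produced by the thesis's algorithm and the costs by the micro-architectural models.

```cpp
// Toy back-annotation: cycle costs measured on the target binary's basic
// blocks are attached, via a binary-to-IR mapping, to the corresponding IR
// blocks, so that native execution can accumulate an estimated target time.
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Cycle cost of each basic block, as measured/modeled on the target binary.
    std::map<std::string, long> binaryBlockCycles = {
        {"bin_bb0", 12}, {"bin_bb1", 48}, {"bin_bb2", 7}};

    // Mapping from target-binary blocks to compiler-IR blocks (assumed known).
    std::map<std::string, std::string> binToIR = {
        {"bin_bb0", "ir_entry"}, {"bin_bb1", "ir_loop"}, {"bin_bb2", "ir_exit"}};

    // Back-annotate: each IR block carries the cycles of its binary counterpart.
    std::map<std::string, long> irAnnotation;
    for (const auto& [binBB, cycles] : binaryBlockCycles)
        irAnnotation[binToIR.at(binBB)] += cycles;

    // Native execution trace at IR level (the loop body runs 10 times here);
    // summing the annotations yields the target-time estimate.
    std::vector<std::string> trace = {"ir_entry"};
    for (int i = 0; i < 10; ++i) trace.push_back("ir_loop");
    trace.push_back("ir_exit");

    long estimatedCycles = 0;
    for (const auto& bb : trace) estimatedCycles += irAnnotation[bb];
    std::cout << "estimated target cycles: " << estimatedCycles << "\n"; // 499
}
```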