Spelling suggestions: "subject:"deadblock prediction"" "subject:"diblock prediction""
1 |
Hardware techniques to improve cache efficiencyLiu, Haiming 19 October 2009 (has links)
Modern microprocessors devote a large portion of their chip area to caches in order
to bridge the speed and bandwidth gap between the core and main memory. One
known problem with caches is that they are usually used with low efficiency; only a
small fraction of the cache stores data that will be used before getting evicted. As
the focus of microprocessor design shifts towards achieving higher performance-perwatt,
cache efficiency is becoming increasingly important. This dissertation proposes
techniques to improve both data cache efficiency in general and instruction cache
efficiency for Explicit Data Graph Execution (EDGE) architectures.
To improve the efficiency of data caches and L2 caches, dead blocks (blocks
that will not be referenced again before their eviction from the cache) should be
identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed, based on the individual reference history of the block. Such
schemes result in lower prediction accuracy and coverage. We delay the prediction
to achieve better prediction accuracy and coverage. For the L1 cache, we propose
a new class of dead-block prediction schemes that predict dead blocks based on
cache bursts. A cache burst begins when a block moves into the MRU position
and ends when it moves out of the MRU position. Cache burst history is more
predictable than individual reference history and results in better dead-block prediction
accuracy and coverage. Experiment results show that predicting the death
of a block at the end of a burst gives the best tradeoff between timeliness and prediction
accuracy/coverage. We also propose mechanisms to improve counting-based
dead-block predictors, which work best at the L2 cache. These mechanisms handle
reference-count variations, which cause problems for existing counting-based deadblock
predictors. The new schemes can identify the majority of the dead blocks with
approximately 90% or higher accuracy. For a 64KB, two-way L1 D-cache, 96% of
the dead blocks can be identified with a 96% accuracy, half way into a block’s dead
time. For a 64KB, four-way L1 cache, the prediction accuracy and coverage are 92%
and 91% respectively. At any moment, the average fraction of the dead blocks that
has been correctly detected for a two-way or four-way L1 cache is approximately
49% or 67% respectively. For a 1MB, 16-way set-associative L2 cache, 66% of the
dead blocks can be identified with a 89% accuracy, 1/16th way into a block’s dead
time. At any moment, 63% of the dead blocks in such an L2 cache, on average,
has been correctly identified by the dead-block predictor. The ability to accurately
identify the majority of the dead blocks in the cache long before their eviction can
lead to not only higher cache efficiency, but also reduced power consumption or
higher reliability.
In this dissertation, we use the dead-block information to improve cache
efficiency and performance by three techniques: replacement optimization, cache bypassing, and prefetching into dead blocks. Replacement optimization evicts blocks
that become dead after several reuses, before they reach the LRU position. Cache
bypassing identifies blocks that cause cache misses but will not be reused if they
are written into the cache and do not store these blocks in the cache. Prefetching
into dead blocks replaces dead blocks with prefetched blocks that are likely to be
referenced in the future. Simulation results show that replacement optimization or
bypassing improves performance by 5% and prefetching into dead blocks improves
performance by 12% over the baseline prefetching scheme for the L1 cache and by
13% over the baseline prefetching scheme for the L2 cache. Each of these three
techniques can turn part of the identified dead blocks into live blocks. As new
techniques that can better utilize the space of the dead blocks are found, the deadblock
information is likely to become more valuable.
Compared to RISC architectures, the instruction cache in EDGE architectures
faces challenges such as higher miss rate, because of the increase in code size,
and longer miss penalty, because of the large block size and the distributed microarchitecture.
To improve the instruction cache efficiency in EDGE architectures,
we decouple the next-block prediction from the instruction fetch so that the nextblock
prediction can run ahead of instruction fetch and the predicted blocks can be
prefetched into the instruction cache before they cause any I-cache misses. In particular,
we discuss how to decouple the next-block prediction from the instruction
fetch and how to control the run-ahead distance of the next-block predictor in a
fully distributed microarchitecture. The performance benefit of such a look-ahead
instruction prefetching scheme is then evaluated and the run-ahead distance that
gives the best performance improvement is identified. In addition to prefetching,
we also estimate the performance benefit of storing variable-sized blocks in the instruction
cache. Such schemes reduce the inefficiency caused by storing NOPs in the
I-cache and enable the I-cache to store more blocks with the same capacity. Simulation results show that look-ahead instruction prefetching and storing variable-sized
blocks can improve the performance of the benchmarks that have high I-cache miss
rates by 17% and 18% respectively, out of an ideal 30% performance improvement
only achievable by a perfect I-cache. Such techniques will close the gap in I-cache
hit rates between EDGE architectures and RISC architectures, although the latter
will still have higher I-cache hit rates because of the smaller code size. / text
|
Page generated in 0.1022 seconds