1 |
Performance Evaluation of A Loop-Relevance-Classified Branch Prediction MethodLuo, Shiu-Tang 28 September 2001 (has links)
Along with the advancement of chip architecture and density of processor design, there are more functional units that can execute in parallel on a chip. In order to make good use of them, it is important to obtain enough and accurate instructions ahead of time. Branch prediction provides a way to know the instruction stream ahead of time. Its prediction accuracy is thus one of the key factors of system performance.
In our research, we designed a branch prediction method based upon the loop-relevance classifications of conditional jump instructions. It divides conditional jump instructions into two classifications: loop-exit and non-loop-exit conditional jump instructions. We utilized various prediction methods to perform the branch prediction tasks for these two classes of conditional branch instructions, separately. Inside these methods, dynamic learning from actual branch results is carried out to switch to suitable prediction models such that more prediction accuracy can be obtained.
In this thesis research, in order to validate the accuracy of this prediction method vs. other prediction methods, we designed a software performance evaluation environment to do trace-driven simulation of types of branch prediction methods. It consists of an instruction trace extractor and a set of trace-driven simulators. Experiment results shows that our prediction methods performs near the same as other prediction methods on the scalar processing programs that have little or no amount of regularly behaved loops. However, on the scientific or engineering programs that exhibit certain percentage of regularly behaved loops, experiment results shows that our prediction method recognizes their loop behavior patterns and achieves better prediction accuracy.
|
2 |
Improving prediction accuracy of hard-to-predict branches using data value correlationFarooq, Muhammad Umar, active 2013 17 February 2014 (has links)
Performance of modern pipelined processor depends on steady flow of useful instructions for processing. Branch instruction disrupts sequential flow of instructions by presenting multiple paths through which a program can proceed. By predicting branch
outcome early, branch predictors allow processor to continue fetching instructions from the predicted path. History-based dynamic branch predictors have shown to reach high
prediction accuracy, yet certain branch types continue to mispredict. These are multitarget indirect branches and data-dependent direct and indirect branches. These are hard-to-predict branches since their outcome do not always exhibit repeatable patterns.
This thesis describes branch prediction schemes for improving prediction accuracy of hard-to-predict branches using data value correlation. In these schemes, instead of relying on branch history information, compiler identifies program instructions whose output value strongly correlates with branch outcome. These correlated instructions are tracked at run-time, and their output is used for making branch predictions. Specifically, this thesis proposes following two branch prediction schemes:
(i) Value-based BTB indexing (VBBI) is a low cost, compiler-guided scheme for
predicting multi-target indirect branches. For every indirect branch, compiler identifies an instruction whose output strongly correlates with targets taken by the indirect branch. At run-time, multiple branch targets are stored and subsequently accessed from BTB using index formed by hashing indirect branch PC with output of the correlated instruction.
(ii) Store-Load-Branch (SLB) predictor is a compiler-assisted branch prediction scheme for data-dependent branches. Typically, data-dependent branches are associated with program data structures such as arrays, linked list etc., and follow store-load-branch
execution sequence. A set of memory locations is written at an earlier point in a program. Later, these locations are read, and used for evaluating branch condition. Branch outcome depends on values stored in data structure, which, normally do not have repeatable patterns. In SLB scheme, compiler identifies all program points where data structure associated with a data-dependent branch is modified. These
marked store instructions are tracked at run-time, and stored values are used for computing branch flags ahead of time. Later, when branch instruction is fetched, pre-computed flags are read, and used for making predictions.
This thesis introduces new branch prediction schemes, describes hardware structures and compiler analysis for implementing these schemes, evaluates their performance
impact, and estimates their area, power and timing overhead. / text
|
3 |
Evaluation of Instruction Prefetch Methods for Coresonic DSP ProcessorLind, Tobias January 2016 (has links)
With increasing demands on mobile communication transfer rates the circuits in mobile phones must be designed for higher performance while maintaining low power consumption for increased battery life. One possible way to improve an existing architecture is to implement instruction prefetching. By predicting which instructions will be executed ahead of time the instructions can be prefetched from memory to increase performance and some instructions which will be executed again shortly can be stored temporarily to avoid fetching them from the memory multiple times. By creating a trace driven simulator the existing hardware can be simulated while running a realistic scenario. Different methods of instruction prefetch can be implemented into this simulator to measure how they perform. It is shown that the execution time can be reduced by up to five percent and the amount of memory accesses can be reduced by up to 25 percent with a simple loop buffer and return stack. The execution time can be reduced even further with the more complex methods such as branch target prediction and branch condition prediction.
|
4 |
Exploring Virtualization Techniques for Branch Outcome PredictionSadooghi-Alvandi, Maryam 20 December 2011 (has links)
Modern processors use branch prediction to predict branch outcomes, in order to fetch ahead in the instruction stream, increasing concurrency and performance. Larger predictor tables can improve prediction accuracy, but come at the cost of larger area and longer access delay.
This work introduces a new branch predictor design that increases the perceived predictor capacity without increasing its delay, by using a large virtual second-level table allocated in the second-level caches. Virtualization is applied to a state-of-the-art multi- table branch predictor. We evaluate the design using instruction count as proxy for timing on a set of commercial workloads. For a predictor whose size is determined by access delay constraints rather than area, accuracy can be improved by 8.7%. Alternatively, the design can be used to achieve the same accuracy as a non-virtualized design while using 25% less dedicated storage.
|
5 |
Exploring Virtualization Techniques for Branch Outcome PredictionSadooghi-Alvandi, Maryam 20 December 2011 (has links)
Modern processors use branch prediction to predict branch outcomes, in order to fetch ahead in the instruction stream, increasing concurrency and performance. Larger predictor tables can improve prediction accuracy, but come at the cost of larger area and longer access delay.
This work introduces a new branch predictor design that increases the perceived predictor capacity without increasing its delay, by using a large virtual second-level table allocated in the second-level caches. Virtualization is applied to a state-of-the-art multi- table branch predictor. We evaluate the design using instruction count as proxy for timing on a set of commercial workloads. For a predictor whose size is determined by access delay constraints rather than area, accuracy can be improved by 8.7%. Alternatively, the design can be used to achieve the same accuracy as a non-virtualized design while using 25% less dedicated storage.
|
6 |
YETI: a GraduallY Extensible Trace InterpreterZaleski, Mathew 01 August 2008 (has links)
The implementation of new programming languages benefits from
interpretation because it is simple, flexible and portable. The only
downside is speed of execution, as there remains a large performance
gap between even efficient interpreters and systems that include a
just-in-time (JIT) compiler. Augmenting an interpreter with a JIT,
however, is not a small task. Today, Java JITs are typically
method-based. To compile whole methods, the JIT must re-implement
much functionality already provided by the interpreter, leading to a ``big
bang'' development effort before the JIT can be deployed. Adding a
JIT to an interpreter would be easier if we could more gradually shift
from dispatching virtual instructions bodies implemented for the interpreter
to running instructions compiled into native code by the JIT.
We show that virtual instructions implemented as lightweight callable
routines can form the basis for a very efficient interpreter. Our new
technique, interpreted traces, identifies hot paths, or traces, as a
virtual program is interpreted. By exploiting the way traces predict
branch destinations our technique markedly reduces branch
mispredictions caused by dispatch. Interpreted traces are a
high-performance technique, running about 25% faster than direct
threading.
We show that interpreted traces are a good starting point for a
trace-based JIT. We extend our interpreter so traces may contain a
mixture of compiled code for some virtual instructions and calls to
virtual instruction bodies for others. By compiling about 50 integer
and object virtual instructions to machine code we improve performance
by about 30% over interpreted traces, running about twice as fast as
the direct threaded system with which we started.
|
7 |
YETI: a GraduallY Extensible Trace InterpreterZaleski, Mathew 01 August 2008 (has links)
The implementation of new programming languages benefits from
interpretation because it is simple, flexible and portable. The only
downside is speed of execution, as there remains a large performance
gap between even efficient interpreters and systems that include a
just-in-time (JIT) compiler. Augmenting an interpreter with a JIT,
however, is not a small task. Today, Java JITs are typically
method-based. To compile whole methods, the JIT must re-implement
much functionality already provided by the interpreter, leading to a ``big
bang'' development effort before the JIT can be deployed. Adding a
JIT to an interpreter would be easier if we could more gradually shift
from dispatching virtual instructions bodies implemented for the interpreter
to running instructions compiled into native code by the JIT.
We show that virtual instructions implemented as lightweight callable
routines can form the basis for a very efficient interpreter. Our new
technique, interpreted traces, identifies hot paths, or traces, as a
virtual program is interpreted. By exploiting the way traces predict
branch destinations our technique markedly reduces branch
mispredictions caused by dispatch. Interpreted traces are a
high-performance technique, running about 25% faster than direct
threading.
We show that interpreted traces are a good starting point for a
trace-based JIT. We extend our interpreter so traces may contain a
mixture of compiled code for some virtual instructions and calls to
virtual instruction bodies for others. By compiling about 50 integer
and object virtual instructions to machine code we improve performance
by about 30% over interpreted traces, running about twice as fast as
the direct threaded system with which we started.
|
8 |
Ultra low power cooperative branch predictionBielby, Matthew Iain January 2015 (has links)
Branch Prediction is a key task in the operation of a high performance processor. An inaccurate branch predictor results in increased program run-time and a rise in energy consumption. The drive towards processors with limited die-space and tighter energy requirements will continue to intensify over the coming years, as will the shift towards increasingly multicore processors. Both trends make it increasingly important and increasingly difficult to find effective and efficient branch predictor designs. This thesis presents savings in energy and die-space through the use of more efficient cooperative branch predictors achieved through novel branch prediction designs. The first contribution is a new take on the problem of a hybrid dynamic-static branch predictor allocating branches to be predicted by one of its sub-predictors. A new bias parameter is introduced as a mechanism for trading off a small amount of performance for savings in die-space and energy. This is achieved by predicting more branches with the static predictor, ensuring that only the branches that will most benefit from the dynamic predictor’s resources are predicted dynamically. This reduces pressure on the dynamic predictor’s resources allowing for a smaller predictor to achieve very high accuracy. An improvement in run-time of 7-8% over the baseline BTFN predictor is observed at a cost of a branch predictor bits budget of much less than 1KB. Next, a novel approach to branch prediction for multicore data-parallel applications is presented. The Peloton branch prediction scheme uses a pack of cyclists as an illustration of how a group of processors running similar tasks can share branch predictions to improve accuracy and reduce runtime. The results show that sharing updates for conditional branches across the existing interconnect for I-cache and D-cache updates results in a reduction of mispredictions of up to 25% and a reduction in run-time of up to 6%. McPAT is used to present an energy model that suggests the savings are achieved at little to no increase in energy required. The technique is then extended to architectures where the size of the branch predictors may differ between cores. The results show that such heterogeneity can dramatically reduce the die-space required for an accurate branch predictor while having little impact on performance and up to 9% energy savings. The approach can be combined with the Peloton branch prediction scheme for reduction in branch mispredictions of up to 5%.
|
9 |
Improving Branch Prediction Accuracy Via Effective Source Information And Prediction AlgorithmsGao, Hongliang 01 January 2008 (has links)
Modern superscalar processors rely on branch predictors to sustain a high instruction fetch throughput. Given the trend of deep pipelines and large instruction windows, a branch misprediction will incur a large performance penalty and result in a significant amount of energy wasted by the instructions along wrong paths. With their critical role in high performance processors, there has been extensive research on branch predictors to improve the prediction accuracy. Conceptually a dynamic branch prediction scheme includes three major components: a source, an information processor, and a predictor. Traditional works mainly focus on the algorithm for the predictor. In this dissertation, besides novel prediction algorithms, we investigate other components and develop untraditional ways to improve the prediction accuracy. First, we propose an adaptive information processing method to dynamically extract the most effective inputs to maximize the correlation to be exploited by the predictor. Second, we propose a new prediction algorithm, which improves the Prediction by Partial Matching (PPM) algorithm by selectively combining multiple partial matches. The PPM algorithm was previously considered optimal and has been used to derive the upper limit of branch prediction accuracy. Our proposed algorithm achieves higher prediction accuracy than PPM and can be implemented in realistic hardware budget. Third, we discover a new locality existing between the address of producer loads and the outcomes of their consumer branches. We study this address-branch correlation in detail and propose a branch predictor to explore this correlation for long-latency and hard-to-predict branches, which existing branch predictors fail to predict accurately.
|
10 |
Energy efficient branch predictionHicks, Michael Andrew January 2010 (has links)
Energy efficiency is of the utmost importance in modern high-performance embedded processor design. As the number of transistors on a chip continues to increase each year, and processor logic becomes ever more complex, the dynamic switching power cost of running such processors increases. The continual progression in fabrication processes brings a reduction in the feature size of the transistor structures on chips with each new technology generation. This reduction in size increases the significance of leakage power (a constant drain that is proportional to the number of transistors). Particularly in embedded devices, the proportion of an electronic product’s power budget accounted for by the CPU is significant (often as much as 50%). Dynamic branch prediction is a hardware mechanism used to forecast the direction, and target address, of branch instructions. This is essential to high performance pipelined and superscalar processors, where the direction and target of branches is not computed until several stages into the pipeline. Accurate branch prediction also acts to increase energy efficiency by reducing the amount of time spent executing mis-speculated instructions. ‘Stalling’ is no longer a sensible option when the significance of static power dissipation is considered. Dynamic branch prediction logic typically accounts for over 10% of a processor’s global power dissipation, making it an obvious target for energy optimisation. Previous approaches at increasing the energy efficiency of dynamic branch prediction logic has focused on either fully dynamic or fully static techniques. Dynamic techniques include the introduction of a new cache-like structure that can decide whether branch prediction logic should be accessed for a given branch, and static techniques tend to focus on scheduling around branch instructions so that a prediction is not needed (or the branch is removed completely). This dissertation explores a method of combining static techniques and profiling information with simple hardware support in order to reduce the number of accesses made to a branch predictor. The local delay region is used on unconditional absolute branches to avoid prediction, and, for most other branches, Adaptive Branch Bias Measurement (through profiling) is used to assign a static prediction that is as accurate as a dynamic prediction for that branch. This information is represented as two hint-bits in branch instructions, and then interpreted by simple hardware logic that bypasses both the lookup and update phases for appropriate branches. The global processor power saving that can be achieved by this Combined Algorithm is around 6% on the experimental architectures shown. These architectures are based upon real contemporary embedded architecture specifications. The introduction of the Combined Algorithm also significantly reduces the execution time of programs on Multiple Instruction Issue processors. This is attributed to the increase achieved in global prediction accuracy.
|
Page generated in 0.1099 seconds