1 |
Instruction scheduling optimizations for energy efficient VLIW processors. Porpodas, Vasileios (January 2013)
Very Long Instruction Word (VLIW) processors are wide-issue statically scheduled processors. Instruction scheduling for these processors is performed by the compiler and is therefore a critical factor for their operation. Some VLIWs are clustered, a design that improves scalability to higher issue widths while improving energy efficiency and frequency. Their design is based on physically partitioning the shared hardware resources (e.g., the register file). Such designs further increase the challenges of instruction scheduling, since the compiler has the additional tasks of deciding on the placement of instructions to the corresponding clusters and orchestrating the data movements across clusters. In this thesis we propose instruction scheduling optimizations for energy-efficient VLIW processors. Some of the techniques aim at improving existing state-of-the-art scheduling techniques, while others aim at using compiler techniques to close the gap between lightweight hardware designs and more complex ones. Each of the proposed techniques targets individual features of energy-efficient VLIW architectures.

Our first technique, called Aligned Scheduling, makes use of a novel scheduling heuristic for hiding memory latencies in lightweight VLIW processors without hardware load-use interlocks (Stall-On-Miss). With Aligned Scheduling, a software-only technique, a SOM processor coupled with non-blocking caches can better cope with cache latencies and perform closer to heavyweight designs. Performance is improved by up to 20% across a range of benchmarks from the Mediabench II and SPEC CINT2000 benchmark suites.

The rest of the techniques target a class of VLIW processors known as clustered VLIWs, which are more scalable, more energy efficient, and operate at higher frequencies than their monolithic counterparts. The second scheme (LUCAS) is an improved scheduler for clustered VLIW processors that addresses the susceptibility of the existing state-of-the-art schedulers to the inter-cluster communication latency. The proposed unified clustering and scheduling technique is a hybrid scheme that performs instruction-by-instruction switching between the two state-of-the-art clustering heuristics, leading to better scheduling than either of them. It generates better-performing code than the state-of-the-art for a wide range of inter-cluster latency values on the Mediabench II benchmarks.

The third technique (called CAeSaR) is a scheduler for clustered VLIW architectures that minimizes inter-cluster communication by locally caching and reusing already received data. Unlike dynamically scheduled processors, where this can be supported by the register renaming hardware, in VLIWs it has to be done by the code generator. The proposed instruction scheduler unifies cluster assignment, instruction scheduling and communication minimization in a single algorithm, solving the phase ordering issues between all three parts. The proposed scheduler shows an improvement in execution time of up to 20.3%, and 13.8% on average, across a range of benchmarks from the Mediabench II and SPEC CINT2000 benchmark suites.

The last technique applies to heterogeneous clustered VLIWs that support dynamic voltage and frequency scaling (DVFS) independently per cluster. In these processors there are no hardware interlocks between clusters to honor the data dependencies; instead, the scheduler has to be aware of the DVFS decisions to guarantee correct execution. Effectively controlling DVFS to selectively decrease the frequency of clusters with slack in their schedule can lead to significant energy savings. The proposed technique (called UCIFF) solves the phase ordering problem between frequency selection and scheduling that is present in existing algorithms. The results show that UCIFF produces better code than the state-of-the-art, very close to optimal, across the Mediabench II benchmarks.

Overall, the proposed instruction scheduling techniques either lead to better efficiency on existing designs or allow simpler lightweight designs to be competitive against ones with more complex hardware.
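As an illustration of the instruction-by-instruction switching idea behind LUCAS, here is a minimal Python sketch; the two heuristics and the cost estimator are placeholder parameters (assumptions), not the thesis's actual definitions.

```python
# Minimal sketch of instruction-by-instruction hybrid clustering in the
# spirit of LUCAS. heuristic_a/heuristic_b and est_cost are placeholder
# callables (assumptions), not the thesis's definitions.

def hybrid_cluster(instructions, heuristic_a, heuristic_b, est_cost):
    """For each instruction (in scheduling order), keep whichever of the
    two candidate cluster assignments the cost estimate prefers; the
    estimate is expected to account for inter-cluster communication
    latency."""
    assignment = {}
    for insn in instructions:
        cand_a = heuristic_a(insn, assignment)
        cand_b = heuristic_b(insn, assignment)
        cost_a = est_cost(insn, cand_a, assignment)
        cost_b = est_cost(insn, cand_b, assignment)
        assignment[insn] = cand_a if cost_a <= cost_b else cand_b
    return assignment
```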
|
2 |
On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling. Chase, Michael (January 2006)
Modern architectures allow multiple instructions to be issued at once and have other complex features. To account for this, compilers perform instruction scheduling after generating the output code. The instruction scheduling problem is to find an optimal schedule given the limitations and capabilities of the architecture. While this can be done optimally, a greedy algorithm known as list scheduling is used in practice in most production compilers.

List scheduling is generally regarded as being near-optimal in practice, provided a good choice of heuristic is used. However, previous work comparing a list scheduler against an optimal scheduler either assumes an idealized architectural model or uses too few test cases to strongly prove or disprove the assumed near-optimality of list scheduling. It remains an open question whether list scheduling performs well when scheduling for a realistic architectural model.

Using constraint programming, we developed an efficient optimal scheduler capable of scheduling even very large blocks within a popular benchmark suite in a reasonable amount of time. I improved the architectural model and optimal scheduler by allowing for an issue width not equal to the number of functional units, instructions that monopolize the processor for one cycle, and non-fully-pipelined instructions. I then evaluated the performance of list scheduling for this more realistic architectural model.

I found that when scheduling basic blocks using a realistic architectural model, only 6% or less of schedules produced by a list scheduler are non-optimal, but when scheduling superblocks, at least 40% of schedules produced by a list scheduler are non-optimal. Furthermore, when the list scheduler and optimal scheduler differed, the optimal scheduler was able to improve schedule cost by at least 5% on average, realizing maximum improvements of 82%. This suggests that list scheduling is only a viable solution in practice when scheduling basic blocks. When scheduling superblocks, the advantage of using a list scheduler is its speed, not the quality of schedules produced, and other alternatives to list scheduling should be considered.
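For reference, the greedy list-scheduling algorithm evaluated above can be sketched in a few lines. This is an illustrative Python sketch under a simple model (critical-path priority, uniform issue slots), not the thesis's architectural model.

```python
# Minimal sketch of greedy list scheduling over a dependence DAG.
# The critical-path priority and unit-issue model are assumptions.

def critical_path(node, succs, latency, memo=None):
    """Longest latency-weighted path from node to any leaf."""
    if memo is None:
        memo = {}
    if node not in memo:
        memo[node] = max(
            (latency[node] + critical_path(s, succs, latency, memo)
             for s in succs[node]),
            default=latency[node])
    return memo[node]

def list_schedule(nodes, preds, succs, latency, issue_width):
    prio = {n: critical_path(n, succs, latency) for n in nodes}
    start = {}                      # node -> issue cycle
    cycle, scheduled = 0, set()
    while len(scheduled) < len(nodes):
        # An instruction is ready once all predecessors have completed.
        ready = [n for n in nodes if n not in scheduled
                 and all(p in start and start[p] + latency[p] <= cycle
                         for p in preds[n])]
        # Greedily issue the highest-priority ready instructions.
        for n in sorted(ready, key=lambda n: -prio[n])[:issue_width]:
            start[n] = cycle
            scheduled.add(n)
        cycle += 1
    return start

# Toy diamond DAG: a -> b, a -> c, b -> d, c -> d, on a 2-wide machine.
nodes = ['a', 'b', 'c', 'd']
preds = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}
succs = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
latency = {'a': 2, 'b': 1, 'c': 1, 'd': 1}
print(list_schedule(nodes, preds, succs, latency, issue_width=2))
# -> {'a': 0, 'b': 2, 'c': 2, 'd': 3}
```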
|
3 |
Learning Instruction Scheduling Heuristics from Optimal Data. Russell, Tyrel (January 2006)
The development of modern pipelined and multiple functional unit processors has increased the available instruction level parallelism. In order to fully utilize these resources, compiler writers spend large amounts of time developing complex scheduling heuristics for each new architecture. In order to reduce the time spent on this process, automated machine learning techniques have been proposed to generate scheduling heuristics. We present two case studies using these techniques to generate instruction scheduling heuristics for basic blocks and super blocks. A basic block is a block of code with a single flow of control and a super block is a collection of basic blocks with a single entry point but multiple exit points. We improve previous techniques for automated generation of basic block scheduling heuristics by increasing the quality of the training data and increasing the number of features considered, including several novel features that have useful effects on scheduling instructions. Our case study into super block scheduling heuristics is a novel contribution as previous approaches were only applied to basic blocks. We show through experimentation that we can produce efficient heuristics that perform better than current heuristic methods for basic block and super block scheduling. We show that we can reduce the number of non-optimally scheduled blocks by up to 55% for basic blocks and 38% for super blocks. We also show that we can produce better schedules 7.8 times more often than the next best heuristic for basic blocks and 4.4 times more often for super blocks.
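As a hedged illustration of the learning setup described above (not the thesis's actual feature set or learner), one can train a pairwise preference function from optimal schedules: each training example pairs the feature vectors of two ready instructions, labelled by which one the optimal scheduler issued first.

```python
# Illustrative sketch only; the features and learner are assumptions.
# Each example is (features_a, features_b, label), with label +1 when
# the optimal schedule issued instruction a before instruction b.

def train_preference(examples, n_features, epochs=50, lr=0.1):
    """Perceptron over feature differences: learns weights w such that
    w . features(a) > w . features(b) when a should be issued first."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for fa, fb, label in examples:     # label in {+1, -1}
            diff = [x - y for x, y in zip(fa, fb)]
            score = sum(wi * d for wi, d in zip(w, diff))
            if score * label <= 0:         # mistake-driven update
                w = [wi + lr * label * d for wi, d in zip(w, diff)]
    return w

# Toy features: (critical-path length, number of successors).
examples = [((5, 2), (3, 1), +1), ((2, 1), (6, 3), -1)]
print(train_preference(examples, n_features=2))
```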
|
4 |
Constraint Programming Techniques for Optimal Instruction Scheduling. Malik, Abid (03 1900)
Modern processors have multiple pipelined functional units and can issue more than one instruction per clock cycle. This puts great pressure on the instruction scheduling phase in a compiler to expose maximum instruction level parallelism. Basic blocks and superblocks are commonly used regions of code in a program for instruction scheduling. Instruction scheduling coupled with register allocation is also a well-studied problem for producing better machine code. Scheduling basic blocks and superblocks optimally, with or without register allocation, is NP-complete, and is done sub-optimally in production compilers using heuristic approaches. In this thesis, I present a constraint programming approach to the superblock and basic block instruction scheduling problems for both idealized and realistic architectures. Basic block scheduling with register allocation and no spilling allowed is also considered. My models for both basic block and superblock scheduling are optimal and fast enough to be incorporated into production compilers. I experimentally evaluated my optimal schedulers on the SPEC 2000 integer and floating point benchmarks. On this benchmark suite, the optimal schedulers were very robust and scaled to the largest basic blocks and superblocks. Depending on the architectural model, between 99.991% and 99.999% of all basic blocks and superblocks were solved to optimality. The schedulers were able to routinely solve the largest blocks, including blocks with up to 2600 instructions. My results compare favorably to the best previous optimal approaches, which are based on integer programming and enumeration.

My approach for basic block scheduling without allowing spilling was good enough to solve 97.496% of all basic blocks in the SPEC 2000 benchmark. The approach was able to solve basic blocks as large as 50 instructions for both idealized and realistic architectures within reasonable time limits. Again, my results compare favorably to recent work on optimal integrated code generation, which is based on integer programming.
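To illustrate the general shape of such an approach, the following sketches a small constraint-programming model for optimal basic block scheduling using Google OR-Tools CP-SAT; the toy DAG, latencies, and issue-width treatment are assumptions, and the thesis's model is far more elaborate.

```python
# Minimal CP model for optimal basic block scheduling (illustrative,
# not the thesis's model). Requires: pip install ortools
from ortools.sat.python import cp_model

def schedule_block(n, edges, latency, issue_width, horizon=1000):
    m = cp_model.CpModel()
    start = [m.NewIntVar(0, horizon, f's{i}') for i in range(n)]
    # Latency constraints: a consumer may issue only after its
    # producer's result is available.
    for i, j in edges:
        m.Add(start[j] >= start[i] + latency[i])
    # Issue-width constraint: at most issue_width instructions per
    # cycle, modelled as unit-length intervals on a cumulative resource.
    intervals = [m.NewIntervalVar(start[i], 1, start[i] + 1, f'iv{i}')
                 for i in range(n)]
    m.AddCumulative(intervals, [1] * n, issue_width)
    # Objective: minimize the completion time of the last instruction.
    makespan = m.NewIntVar(0, horizon + max(latency), 'makespan')
    for i in range(n):
        m.Add(makespan >= start[i] + latency[i])
    m.Minimize(makespan)
    solver = cp_model.CpSolver()
    if solver.Solve(m) == cp_model.OPTIMAL:
        return [solver.Value(s) for s in start], solver.Value(makespan)
    return None

# Toy diamond DAG on a 2-wide machine: optimal starts (0, 2, 2, 3),
# schedule length 4.
print(schedule_block(4, [(0, 1), (0, 2), (1, 3), (2, 3)],
                     [2, 1, 1, 1], issue_width=2))
```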
|
5 |
Look-ahead instruction scheduling for dynamic execution in pipelined computers. Reddy Anam, Vijay K. (January 1990)
No description available.
|
6 |
A High Performance Register Allocator for Vector Architectures with a Unified Register-Set. Su, Yu-Dan (29 June 2012)
This thesis describes a compiler optimization targeted for machines with unified, vector-based register sets. This optimization combines register allocation and instruction scheduling. It examines places where the code performs computations on scalar variables. The goal is to identify instances where the same operation is performed. For example, a program might calculate "base+offset" and then calculate "i+j". Even though these computations are unrelated, they use the same operator; if "base" and "i" are packed into one vector register, while "offset" and "j" are packed into another, then these two computations can be performed simultaneously through the vectors' parallel addition operation. This would reduce the execution time of the compiled code.
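A tiny sketch of the packing idea, using the abstract's running example (the vector-lane emulation below is purely illustrative and is not the thesis's allocator):

```python
# Packing "base" and "i" into one vector register and "offset" and "j"
# into another lets a single lane-wise vector add replace two
# independent scalar additions. Values are arbitrary examples.
base, offset = 0x1000, 24
i, j = 5, 7

reg_v0 = [base, i]       # vector register holding [base, i]
reg_v1 = [offset, j]     # vector register holding [offset, j]

# One parallel vector addition computes both unrelated sums at once:
reg_v2 = [a + b for a, b in zip(reg_v0, reg_v1)]
assert reg_v2 == [base + offset, i + j]
```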
Although other researchers have considered similar packing methods, their work has been limited by the hardware that they were studying. Such hardware usually imposed high costs for moving data between scalar and vector register banks. This thesis, however, considers a novel hardware architecture that imposes no such costs. As a consequence, we are able to obtain significant speedups.
The architecture that we consider is a Graphics Processing Unit (GPU) for embedded systems that is under development at this university. This GPU has a single register set for integers, floats, and vectors.
|
7 |
Genetic Algorithm for Integrated Software Pipelining. Cai, Zesi (January 2012)
The purpose of the thesis was to study the feasibility of using a genetic algorithm (GA) to do integrated software pipelining (ISP). Different from phased code generation, ISP is a technique which integrates instruction selection, instruction scheduling, and register allocation when doing code generation. ISP is able to provide a larger solution space than the phased approach does, which means that ISP has the potential to generate more optimized code than phased code generation. However, integrated compiling costs more than phased compiling. GA is a stochastic beam search algorithm which can accelerate the solution search and find an optimized result. An experiment was designed to verify the feasibility of implementing GA for ISP (GASP). The implemented algorithm analyzed data dependency graphs of loop bodies, created genes for the graphs and evolved them, generated schedules, calculated and evaluated fitness, and obtained optimized code. The fitness calculation was implemented as the maximum of the smallest possible resource initiation interval and the smallest possible recurrence initiation interval. The experiment was conducted by generating code from data dependency graphs provided in FFMPEG and comparing the performance of GASP and integer linear programming (ILP). The results showed that out of the eleven cases where ILP had generated code, GASP performed close to ILP in seven. In all twelve cases where ILP did not produce a result, GASP did generate optimized code. To conclude, the study indicated that GA is a feasible approach to ISP. The code generated by GASP performed similarly to the code from ILP, and for the dependency graphs that ILP could not solve within a limited time, GASP could still generate optimized results.
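The fitness bound described above can be made concrete with a short sketch. This is a hedged illustration, not GASP's actual implementation: it computes the lower bound MII = max(ResMII, RecMII) from assumed resource-usage counts and dependence-cycle data.

```python
# Illustrative computation of the minimum initiation interval (MII)
# bound used as a fitness basis in software pipelining. The loop body,
# resource counts, and cycle data below are assumptions.
from math import ceil

def res_mii(uses, units):
    """uses[r]: ops in the loop body needing resource r;
    units[r]: copies of resource r available per cycle."""
    return max(ceil(uses[r] / units[r]) for r in uses)

def rec_mii(cycles):
    """cycles: list of (total_latency, total_dependence_distance)
    for each elementary cycle in the data dependency graph."""
    return max(ceil(lat / dist) for lat, dist in cycles)

def fitness_lower_bound(uses, units, cycles):
    return max(res_mii(uses, units), rec_mii(cycles))

# Toy loop: 6 ALU ops on 2 ALUs, 2 loads on 1 load unit; one recurrence
# with latency 4 carried over distance 1.
print(fitness_lower_bound({'alu': 6, 'mem': 2},
                          {'alu': 2, 'mem': 1},
                          [(4, 1)]))   # -> max(3, 4) = 4
```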
|