1

Instruction scheduling optimizations for energy efficient VLIW processors

Porpodas, Vasileios January 2013 (has links)
Very Long Instruction Word (VLIW) processors are wide-issue statically scheduled processors. Instruction scheduling for these processors is performed by the compiler and is therefore a critical factor for their operation. Some VLIWs are clustered, a design that improves scalability to higher issue widths while improving energy efficiency and frequency. Their design is based on physically partitioning the shared hardware resources (e.g., the register file). Such designs further increase the challenges of instruction scheduling, since the compiler has the additional tasks of deciding on the placement of instructions to the corresponding clusters and orchestrating the data movements across clusters. In this thesis we propose instruction scheduling optimizations for energy-efficient VLIW processors. Some of the techniques aim at improving the existing state-of-the-art scheduling techniques, while others aim at using compiler techniques for closing the gap between lightweight hardware designs and more complex ones. Each of the proposed techniques targets individual features of energy-efficient VLIW architectures. Our first technique, called Aligned Scheduling, makes use of a novel scheduling heuristic for hiding memory latencies in lightweight VLIW processors without hardware load-use interlocks (Stall-On-Miss, SOM). With Aligned Scheduling, a software-only technique, a SOM processor coupled with non-blocking caches can better cope with cache latencies and can perform closer to the heavyweight designs. Performance is improved by up to 20% across a range of benchmarks from the Mediabench II and SPEC CINT2000 benchmark suites. The rest of the techniques target a class of VLIW processors known as clustered VLIWs, which are more scalable, more energy efficient and operate at higher frequencies than their monolithic counterparts. The second scheme (LUCAS) is an improved scheduler for clustered VLIW processors that addresses the susceptibility of existing state-of-the-art schedulers to the inter-cluster communication latency. The proposed unified clustering and scheduling technique is a hybrid scheme that performs instruction-by-instruction switching between the two state-of-the-art clustering heuristics, leading to better schedules than either of them. It generates better-performing code compared to the state of the art for a wide range of inter-cluster latency values on the Mediabench II benchmarks. The third technique (called CAeSaR) is a scheduler for clustered VLIW architectures that minimizes inter-cluster communication by local caching and reuse of already received data. Unlike dynamically scheduled processors, where this can be supported by the register renaming hardware, in VLIWs it has to be done by the code generator. The proposed instruction scheduler unifies cluster assignment, instruction scheduling and communication minimization in a single algorithm, solving the phase ordering issues between all three parts. The proposed scheduler shows an improvement in execution time of up to 20.3%, and 13.8% on average, across a range of benchmarks from the Mediabench II and SPEC CINT2000 benchmark suites. The last technique applies to heterogeneous clustered VLIWs that support dynamic voltage and frequency scaling (DVFS) independently per cluster. In these processors there are no hardware interlocks between clusters to honor the data dependencies. Instead, the scheduler has to be aware of the DVFS decisions to guarantee correct execution. 
Effectively controlling DVFS, to selectively decrease the frequency of clusters with slack in their schedule, can lead to significant energy savings. The proposed technique (called UCIFF) solves the phase ordering problem between frequency selection and scheduling that is present in existing algorithms. The results show that UCIFF produces better code than the state-of-the-art and very close to the optimal across the Mediabench II benchmarks. Overall, the proposed instruction scheduling techniques lead to either better efficiency on existing designs or allow simpler lightweight designs to be competitive against ones with more complex hardware.
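
To make the cluster-placement problem described above concrete, the sketch below (an illustrative toy, not any of the schedulers proposed in the thesis) assigns each instruction greedily to the cluster holding most of its operands, balances load on ties, and records an explicit inter-cluster copy whenever an operand lives on another cluster. The `Insn` type and `assign_clusters` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Insn:
    name: str
    operands: list = field(default_factory=list)  # names of producing instructions

def assign_clusters(insns, num_clusters=2):
    """Greedy cluster assignment: place each instruction on the cluster
    holding most of its operands (least-loaded cluster on ties) and emit
    copies for operands that live elsewhere.  Instructions are assumed
    to arrive in topological order of the dependence graph."""
    placement = {}                    # insn name -> cluster id
    load = [0] * num_clusters         # instructions placed per cluster
    copies = []                       # (value, from_cluster, to_cluster)
    for insn in insns:
        votes = [0] * num_clusters
        for op in insn.operands:
            votes[placement[op]] += 1
        cluster = max(range(num_clusters), key=lambda c: (votes[c], -load[c]))
        placement[insn.name] = cluster
        load[cluster] += 1
        for op in insn.operands:      # operands on other clusters must be moved
            if placement[op] != cluster:
                copies.append((op, placement[op], cluster))
    return placement, copies

# Tiny made-up dependence graph: c uses a and b, d uses a and c.
prog = [Insn("a"), Insn("b"), Insn("c", ["a", "b"]), Insn("d", ["a", "c"])]
print(assign_clusters(prog))
# ({'a': 0, 'b': 1, 'c': 0, 'd': 0}, [('b', 1, 0)])
```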
2

On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling

Chase, Michael January 2006 (has links)
Modern architectures allow multiple instructions to be issued at once and have other complex features. To account for this, compilers perform instruction scheduling after generating the output code. The instruction scheduling problem is to find an optimal schedule given the limitations and capabilities of the architecture. While this can be done optimally, a greedy algorithm known as list scheduling is used in practice in most production compilers.

List scheduling is generally regarded as being near-optimal in practice, provided a good choice of heuristic is used. However, previous work comparing a list scheduler against an optimal scheduler either assumes an idealized architectural model or uses too few test cases to strongly prove or disprove the assumed near-optimality of list scheduling. It remains an open question whether list scheduling performs well when scheduling for a realistic architectural model.

Using constraint programming, we developed an efficient optimal scheduler capable of scheduling even very large blocks within a popular benchmark suite in a reasonable amount of time. I improved the architectural model and optimal scheduler by allowing for an issue width not equal to the number of functional units, instructions that monopolize the processor for one cycle, and non-fully-pipelined instructions. I then evaluated the performance of list scheduling for this more realistic architectural model.

I found that when scheduling basic blocks using a realistic architectural model, only 6% or less of the schedules produced by a list scheduler are non-optimal, but when scheduling superblocks, at least 40% of the schedules produced by a list scheduler are non-optimal. Furthermore, when the list scheduler and optimal scheduler differed, the optimal scheduler was able to improve schedule cost by at least 5% on average, realizing maximum improvements of 82%. This suggests that list scheduling is only a viable solution in practice when scheduling basic blocks. When scheduling superblocks, the advantage of using a list scheduler is its speed, not the quality of schedules produced, and other alternatives to list scheduling should be considered.
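
For context, the greedy list-scheduling algorithm discussed above can be sketched as follows: keep a ready list of instructions whose predecessors have issued, and each cycle pick the highest-priority ready instructions up to the issue width. This is a minimal generic sketch under a simplified machine model, not the scheduler evaluated in the thesis; all names are illustrative.

```python
def list_schedule(succs, latency, priority, issue_width=2):
    """Greedy cycle-by-cycle list scheduling for a basic block.
    succs:    dependence graph, insn -> list of dependent insns
    latency:  insn -> cycles until the result is available
    priority: insn -> heuristic rank (e.g. critical-path length)
    Returns insn -> issue cycle."""
    preds = {i: 0 for i in succs}
    for i, ss in succs.items():
        for s in ss:
            preds[s] += 1
    earliest = {i: 0 for i in succs}          # data-ready cycle per insn
    ready = [i for i in succs if preds[i] == 0]
    schedule, cycle = {}, 0
    while ready:
        # Highest-priority instructions whose operands are ready this cycle.
        issuable = sorted((i for i in ready if earliest[i] <= cycle),
                          key=priority, reverse=True)[:issue_width]
        for i in issuable:
            schedule[i] = cycle
            ready.remove(i)
            for s in succs[i]:
                preds[s] -= 1
                earliest[s] = max(earliest[s], cycle + latency[i])
                if preds[s] == 0:
                    ready.append(s)
        cycle += 1
    return schedule

# Made-up example: a -> b dependence plus an independent c, unit latencies.
succs = {"a": ["b"], "b": [], "c": []}
lat = {"a": 1, "b": 1, "c": 1}
print(list_schedule(succs, lat, priority=lambda i: {"a": 2, "b": 1, "c": 0}[i]))
# {'a': 0, 'c': 0, 'b': 1}
```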
3

Learning Instruction Scheduling Heuristics from Optimal Data

Russell, Tyrel January 2006 (has links)
The development of modern pipelined and multiple functional unit processors has increased the available instruction level parallelism. In order to fully utilize these resources, compiler writers spend large amounts of time developing complex scheduling heuristics for each new architecture. In order to reduce the time spent on this process, automated machine learning techniques have been proposed to generate scheduling heuristics. We present two case studies using these techniques to generate instruction scheduling heuristics for basic blocks and super blocks. A basic block is a block of code with a single flow of control and a super block is a collection of basic blocks with a single entry point but multiple exit points. We improve previous techniques for automated generation of basic block scheduling heuristics by increasing the quality of the training data and increasing the number of features considered, including several novel features that have useful effects on scheduling instructions. Our case study into super block scheduling heuristics is a novel contribution as previous approaches were only applied to basic blocks. We show through experimentation that we can produce efficient heuristics that perform better than current heuristic methods for basic block and super block scheduling. We show that we can reduce the number of non-optimally scheduled blocks by up to 55% for basic blocks and 38% for super blocks. We also show that we can produce better schedules 7.8 times more often than the next best heuristic for basic blocks and 4.4 times more often for super blocks.
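
The learning setup described above can be pictured as supervised classification: pairs of ready instructions are compared feature by feature, and the label records which of the two an optimal scheduler issued first. The sketch below illustrates that idea with scikit-learn's decision tree; the feature names and training rows are made up for illustration and are not the ones used in the thesis.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row compares two ready instructions A and B by simple features
# (critical-path length, latency, number of successors), expressed as
# differences A - B.  The label says whether the optimal scheduler
# preferred A (1) or B (0).  Values are hypothetical.
X = [
    [ 3,  1,  2],   # A lies on a longer critical path -> A preferred
    [-2,  0, -1],   # B dominates -> B preferred
    [ 0,  2,  0],   # A has higher latency -> start it earlier
    [ 1, -1,  3],
]
y = [1, 0, 1, 1]

# Train a small decision tree as the learned preference heuristic.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# At scheduling time, a list scheduler would query the model to rank
# the ready instructions pairwise.
print(model.predict([[2, 0, 1]]))   # e.g. prefer A
```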
4

Constraint Programming Techniques for Optimal Instruction Scheduling

Malik, Abid 03 1900 (has links)
Modern processors have multiple pipelined functional units and can issue more than one instruction per clock cycle. This puts great pressure on the instruction scheduling phase in a compiler to expose maximum instruction level parallelism. Basic blocks and superblocks are commonly used regions of code in a program for instruction scheduling. Instruction scheduling coupled with register allocation is also a well studied problem to produce better machine code. Scheduling basic blocks and superblocks optimally, with or without register allocation, is NP-complete, and is done sub-optimally in production compilers using heuristic approaches. In this thesis, I present a constraint programming approach to the superblock and basic block instruction scheduling problems for both idealized and realistic architectures. Basic block scheduling with register allocation with no spilling allowed is also considered. My models for both basic block and superblock scheduling are optimal and fast enough to be incorporated into production compilers. I experimentally evaluated my optimal schedulers on the SPEC 2000 integer and floating point benchmarks. On this benchmark suite, the optimal schedulers were very robust and scaled to the largest basic blocks and superblocks. Depending on the architectural model, between 99.991% and 99.999% of all basic blocks and superblocks were solved to optimality. The schedulers were able to routinely solve the largest blocks, including blocks with up to 2600 instructions. My results compare favorably to the best previous optimal approaches, which are based on integer programming and enumeration. My approach for basic block scheduling without allowing spilling was good enough to solve 97.496% of all basic blocks in the SPEC 2000 benchmark. The approach was able to solve basic blocks as large as 50 instructions for both idealized and realistic architectures within reasonable time limits. Again, my results compare favorably to recent work on optimal integrated code generation, which is based on integer programming.
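
To give a flavour of such a constraint model, here is a minimal basic-block formulation written with Google OR-Tools CP-SAT as an illustrative stand-in (the thesis' own models and propagation techniques are more sophisticated): one start-time variable per instruction, latency constraints along dependence edges, a cumulative constraint for the issue width, and makespan minimization. The toy dependence graph and machine parameters are made up.

```python
from ortools.sat.python import cp_model

# Toy dependence graph: (producer, consumer, latency), plus an issue width.
insns = ["a", "b", "c", "d"]
edges = [("a", "c", 2), ("b", "c", 1), ("c", "d", 1)]
issue_width = 2
horizon = 10                      # loose upper bound on the schedule length

model = cp_model.CpModel()
start = {i: model.NewIntVar(0, horizon, f"start_{i}") for i in insns}
end = {i: model.NewIntVar(1, horizon + 1, f"end_{i}") for i in insns}
for i in insns:                   # each instruction occupies one issue slot
    model.Add(end[i] == start[i] + 1)

# Latency constraints along dependence edges.
for p, c, lat in edges:
    model.Add(start[c] >= start[p] + lat)

# Issue-width constraint: at most issue_width instructions per cycle.
slots = [model.NewIntervalVar(start[i], 1, end[i], f"slot_{i}") for i in insns]
model.AddCumulative(slots, [1] * len(insns), issue_width)

# Minimize the schedule length (makespan).
makespan = model.NewIntVar(0, horizon + 1, "makespan")
model.AddMaxEquality(makespan, [end[i] for i in insns])
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print({i: solver.Value(start[i]) for i in insns}, solver.Value(makespan))
```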
5

Look-ahead instruction scheduling for dynamic execution in pipelined computers

Reddy Anam, Vijay K. January 1990 (has links)
No description available.
6

A High Performance Register Allocator for Vector Architectures with a Unified Register-Set

Su, Yu-Dan 29 June 2012 (has links)
This thesis describes a compiler optimization targeted at machines with unified, vector-based register sets. This optimization combines register allocation and instruction scheduling. It examines places where the code performs computations on scalar variables. The goal is to identify instances where the same operation is performed. For example, a program might calculate "base+offset" and then calculate "i+j". Although these computations are unrelated, they use the same operator; if "base" and "i" are packed into one vector register, while "offset" and "j" are packed into another, then these two computations can be performed simultaneously through the vectors' parallel addition operation. This reduces the execution time of the compiled code. Although other researchers have considered similar packing methods, their work has been limited by the hardware that they were studying. Such hardware usually imposed high costs for moving data between scalar and vector register banks. The present thesis, however, considers a novel hardware architecture that imposes no such costs. As a consequence, we are able to obtain significant speedups. The architecture that we consider is a Graphics Processing Unit (GPU) for embedded systems that is under development at this university. This GPU has a single register set for integers, floats, and vectors.
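
The packing idea can be pictured with a tiny NumPy sketch: two unrelated scalar additions that share the same operator are placed in different lanes of a pair of vectors and computed with one vector add. The values are made up, and NumPy merely stands in for the GPU's unified vector registers.

```python
import numpy as np

# Two unrelated scalar additions that happen to use the same operator.
base, offset = 0x1000, 24
i, j = 3, 4

# Scalar form: two separate add instructions.
addr = base + offset
k = i + j

# Packed form: one vector add over lanes [base, i] and [offset, j].
lhs = np.array([base, i])
rhs = np.array([offset, j])
packed = lhs + rhs              # lane 0 -> addr, lane 1 -> k
assert packed[0] == addr and packed[1] == k
```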
7

Genetic Algorithm for Integrated Software Pipelining

Cai, Zesi January 2012 (has links)
The purpose of the thesis was to study the feasibility of using a genetic algorithm (GA) for integrated software pipelining (ISP). In contrast with phased code generation, ISP is a technique that integrates instruction selection, instruction scheduling, and register allocation when doing code generation. ISP provides a larger solution space than the phased approach does, which means that ISP has the potential to generate more optimized code than phased code generation. However, integrated compiling costs more than phased compiling. A GA is a stochastic beam search algorithm that can accelerate the solution search and find an optimized result. An experiment was designed to verify the feasibility of implementing a GA for ISP (GASP). The implemented algorithm analyzed data dependency graphs of loop bodies, created genes for the graphs and evolved them, generated schedules, calculated and evaluated fitness, and obtained optimized code. The fitness was computed as the maximum of the smallest possible resource initiation interval and the smallest possible recurrence initiation interval. The experiment was conducted by generating code from data dependency graphs provided in FFMPEG and comparing the performance of GASP against integer linear programming (ILP). The results showed that, out of the eleven cases for which ILP had generated code, GASP performed close to ILP in seven. In all twelve cases where ILP did not produce a result, GASP did generate optimized code. To conclude, the study indicated that a GA is feasible to implement for ISP. The generated code from GASP performed similarly to the code from ILP, and for the dependency graphs that ILP could not solve within a limited time, GASP could still generate optimized results.
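
The fitness described here corresponds to the standard lower bound on the initiation interval used in modulo scheduling, II >= max(ResMII, RecMII): the resource-constrained bound counts how many cycles the busiest functional-unit type needs, and the recurrence-constrained bound comes from dependence cycles in the loop. The sketch below illustrates that bound with made-up resource counts and one made-up recurrence; it is not the GASP implementation itself.

```python
import math

def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound: the busiest unit type decides."""
    return max(math.ceil(op_counts[u] / unit_counts[u]) for u in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained bound: for each dependence cycle, total
    latency divided by total loop-carried distance."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

def fitness_ii(op_counts, unit_counts, cycles):
    # A candidate schedule is evaluated by the larger of the two bounds;
    # a smaller initiation interval means a better software pipeline.
    return max(res_mii(op_counts, unit_counts), rec_mii(cycles))

# Hypothetical loop body: 6 ALU ops on 2 ALUs, 3 loads on 1 load unit,
# and one recurrence with total latency 4 over loop-carried distance 2.
print(fitness_ii({"alu": 6, "mem": 3}, {"alu": 2, "mem": 1}, [(4, 2)]))  # -> 3
```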
