Global ETD Search

1	Integrated Scheduling For Clustered VLIW Processors Nagpal, Rahul 12 1900 (has links) Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Scheduling for clustered architectures involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). Various clustered VLIW configurations, connectivity types, and inter-cluster communication models present different performance trade-offs to a scheduler. The scheduler is responsible for resolving the conflicting requirements of exploiting the parallelism offered by the hardware and limiting the communication among clusters to achieve better performance. Earlier proposals for cluster scheduling fall into two main categories, viz., phase-decoupled scheduling and phase-coupled scheduling and they focus on clustered architectures which provide inter-cluster communication by an explicit inter-cluster copy operation. However, modern commercial clustered architectures provide snooping capabilities (apart from the support for inter-cluster communication using an explicit MV operation) by allowing some of the functional units to read operands from the register file of some of the other clusters without any extra delay. The phase-decoupled approach of scheduling suffers from the well known phase-ordering problem which becomes severe for such a machine model (with snooping) because communication and resource constraints are tightly coupled and thus are exposed only during scheduling. Tight integration of communication and resource constraints further requires taking into account the resource and communication requirements of other instructions ready to be scheduled in the current cycle while binding an instruction, in order to carry out effective binding. However, earlier proposals on integrated scheduling consider instructions and clusters for binding using a fixed order and thus they show different widely varying performance characteristics in terms of execution time and code size. Other shortcomings of earlier integrated algorithms (that lead to suboptimal cluster scheduling decisions) are due to non-consideration of future communication (that may arise due to a binding) and functional unit binding. In this thesis, we propose a pragmatic scheme and also a generic graph matching based framework for cluster scheduling based on a generic and realistic clustered machine model. The proposed scheme effectively utilizes the exact knowledge of available communication slots, functional units, and load on different clusters as well as future resource and communication requirements known only at schedule time to attain significant performance improvement without code size penalty over earlier algorithms. The proposed graph matching based framework for cluster scheduling resolves the phase-ordering and fixed-ordering problem associated with scheduling on clustered VLIW architectures. The framework provides a mechanism to exploit the slack of instructions by dynamically varying the freedom available in scheduling an instruction and hence the cost of scheduling an instruction using different alternatives to reduce the inter-cluster communication. An experimental evaluation of the proposed framework and some of the earlier proposals is presented in the context of a state-of-art commercial clustered architecture. Computer and Information Science Instruction Scheduling Clustered VLIW Architecture Clustered VLIW Processors Clustered VLIW Configuration Clustered Architectures Cluster Scheduling Spatial Scheduling Inter-cluster Communication Microprocessor - Scheduling
2	Compiler-Assisted Energy Optimization For Clustered VLIW Processors Nagpal, Rahul 03 1900 (has links) Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving clock speed, reducing energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long wires having high load capacitance which leads to delay in execution and significantly high energy consumption. Inter-cluster communication also introduces many short idle cycles, therby significantly increasing the overall leakage energy consumption in the functional units. The trend towards miniatrurization of devices (and associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse and limits the usability of clustered architectures in smaller technologies. In the past, study of leakage energy management at the architectural level has mostly focused on storage structures such as cache. Relatively, little work has been done on architecture level leakage energy management in functional units in the context of superscalar processors and energy efficient scheduling in the context of VLIW architectures. In the absence of any high level model for interconnect energy estimation, the primary focus of research in the context of interconnects has been to reduce the latency of communication and evaluation of various inter-cluster communication models. To the best of our knowledge, there has been no such work in the past from the point of view of enegy efficiency targeting clustered VLIW architectures specifically focusing on smaller technologies. Technological advancements now permit design of interconnects and functional units With varying performance and power modes. In thesis we people scheduling algorithms that aggregate the scheduling slack of instructions and communication slack of data values to exploit the low power modes of interconnects and functional units . We also propose a high level model for estimation of interconnect delay and energy (in contrast to low-level circuit level model proposed earlier) that makes it possible to carry out architectural and compiler optimizations specifically targeting the inter connect, Finally we present synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects to improve the usability of clustered architectures by archiving better overall energy-performance trade-offs. Our compiler assisted leakage energy management scheme for functional units reduces the energy consumption of functional units approximately by 15% and 17% in the context of a 2-clustered and a 4-clustered VLIW architecture respectively with negligible performance degradation over and above that offered by a hardware-only scheme. The interconnect energy optimization scheme improves the energy consumption of interconnects on an average by 41% and 46% for a 2-clustered and a 4-clustered machine respectively with 2% and 1.5% performance degradation. The combined scheme options slightly better energy benefit in functional units and 37% and 43% energy benefit in interconnect with slightly higher performance degradation. Even with the conservative estimates of contribution of functional unit interconnect to overall processor energy consumption the proposed combined scheme obtains on an average 8% and 10% improvement in overall energy delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine respectively. We present a detailed experimental evaluation of the proposed schemes using the Trimaran compiler infrastructure. Embedded Systems Computer Architecture Algorithms Very Long Instruction Word Architecture Energy Optimization Energy Management Interconnect Modeling Clustered VLIW Architectures Clustered VLIW Processors Power Optimization Micro-architetural Explorations Computer Science

Search results

Integrated Scheduling For Clustered VLIW Processors

Compiler-Assisted Energy Optimization For Clustered VLIW Processors