Concepts and Prototype for a Collective Offload Unit
Schneider, Timo; Eckelmann, Sven
15 December 2011
Optimized implementations of blocking and nonblocking collective operations are essential for scalable high-performance applications. Offloading such collective operations into the communication layer can improve their performance and asynchronous progression. However, such offloading schemes must remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work we propose a design for a collective offload unit.
Our hardware design is able to execute dependency-graph-based representations of collective functions. To cope with the scarcity of memory resources, we designed a new point-to-point messaging protocol that does not need to store information about unexpected messages. The offload unit proposed in this thesis could be integrated into high-performance networks such as EXTOLL. Our design achieves a clock frequency of 212 MHz on a Xilinx Virtex-6 FPGA while using less than 10% of the available logic slices and less than 30% of the available memory blocks. Due to the specialization of our design, we can accelerate important tasks of the message-passing framework, such as message matching, by a factor of two compared to a software implementation running on a CPU with a ten times higher clock speed.
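The schedules executed by such an offload unit are dependency graphs of send, receive, and local operations, in the spirit of GOAL. The C program below is a minimal, illustrative sketch of that idea for a binomial-tree broadcast; the schedule/sched_op types and the add_op and build_bcast helpers are hypothetical stand-ins, not the GOAL API or the hardware schedule format described in the thesis.

```c
/*
 * Minimal illustrative sketch: a collective operation expressed as a
 * dependency graph ("schedule") of point-to-point operations.
 * All type and function names are hypothetical; this is NOT the GOAL
 * API or the hardware schedule format used in the thesis.
 */
#include <stdio.h>
#include <stdlib.h>

typedef enum { OP_SEND, OP_RECV } op_type;

typedef struct {
    op_type type;  /* send or receive                                 */
    int     peer;  /* communication partner (rank)                    */
    int     dep;   /* index of the op this one depends on, -1 if none */
} sched_op;

typedef struct {
    sched_op *ops;
    int       nops;
} schedule;

/* Append an operation; return its index so later ops can depend on it. */
static int add_op(schedule *s, op_type type, int peer, int dep)
{
    s->ops = realloc(s->ops, (size_t)(s->nops + 1) * sizeof(sched_op));
    if (!s->ops)
        exit(EXIT_FAILURE);
    s->ops[s->nops] = (sched_op){ type, peer, dep };
    return s->nops++;
}

/* Binomial-tree broadcast schedule (root 0) for one rank: receive once
 * from the parent, then forward to the children, with every send
 * depending on that single receive. */
static schedule build_bcast(int rank, int size)
{
    schedule s = { NULL, 0 };
    int recv_idx = -1;              /* root: nothing to wait for */
    int mask = 1;

    while (mask < size) {           /* find the parent (lowest set bit) */
        if (rank & mask) {
            recv_idx = add_op(&s, OP_RECV, rank - mask, -1);
            break;
        }
        mask <<= 1;
    }
    for (mask >>= 1; mask > 0; mask >>= 1)   /* forward to the children */
        if (rank + mask < size)
            add_op(&s, OP_SEND, rank + mask, recv_idx);
    return s;
}

int main(void)
{
    const int size = 8;
    for (int rank = 0; rank < size; rank++) {
        schedule s = build_bcast(rank, size);
        printf("rank %d:\n", rank);
        for (int i = 0; i < s.nops; i++)
            printf("  op %d: %s peer %d, depends on op %d (-1 = none)\n",
                   i, s.ops[i].type == OP_SEND ? "send" : "recv",
                   s.ops[i].peer, s.ops[i].dep);
        free(s.ops);
    }
    return 0;
}
```

In this sketch each send depends only on the single receive from the parent, so independent sends can progress concurrently once that receive completes; dependency information of this kind is what a schedule interpreter needs in order to progress a collective asynchronously.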

1. Task Description
1.1. Theses
2. Introduction
2.1. Motivation
2.2. Outline of this Thesis
2.3. Related Work
2.3.1. NIC Based Packet Forwarding
2.3.2. Hardware Barrier Implementations
2.3.3. ConnectX2 CORE-Direct Collective Offload Support
2.3.4. Collective Offload Support in the Portals 4 API
2.4. Group Operation Assembly Language
2.4.1. GOAL API
2.4.2. Scratchpad Buffer
2.4.3. Schedule Execution
2.5. The EXTOLL Network
2.6. Field Programmable Gate Arrays
3. Dealing with Constrained Resources
3.1. Hardware Limitations
3.2. Common Collective Functions in GOAL
3.3. Schedule Representation for the Hardware GOAL Interpreter
3.4. Executing Large Schedules Using a Small Amount of Memory
3.4.1. Limits of Previously Suggested Approaches
3.4.2. Testing for Deadlocks in Schedules
3.4.3. Transforming Process Local Schedules into Global Schedules
3.4.4. Predetermined Buffer Locations
3.5. Queueing Active Operations in Hardware
3.6. Designing a Low-Memory-Footprint Point-to-Point Protocol
3.6.1. Arrival Times
3.6.2. Eager Protocol
3.6.3. Rendezvous Protocol
3.6.4. A Protocol without an Unexpected Queue
3.7. Protocol Verification
3.7.1. Capabilities of the Model Checker SPIN
3.7.2. Modeling the Protocol
3.7.3. Limitations of the Basic Protocol
4. The Matching Problem
4.1. Matching on the Host CPU
4.2. Implementation Methodology
4.3. Matching Unit Interface
4.4. Matching Unit Implementation
4.4.1. Slot Management Unit
4.4.2. The Input Consumer
4.4.3. The Output Generator
4.4.4. The Matching Unit
4.5. Slot Management Unit for Non-synchronous Transfers
5. The GOAL Interpreter
5.1. Schedule Interpreter Design
5.1.1. The Active Queue
5.1.2. The Dependency Resolver
5.2. Transceiver Interface
5.3. The Starter
5.3.1. Starting Operations
5.3.2. Processing Incoming Packets
5.3.3. Incoming Non-synchronous Packets
5.3.4. Presorting the Active Queue
5.3.5. Arbitration Units
5.3.6. IN-Filter
5.3.7. Outcommand Manager
5.3.8. Non-synchronous Protocol
5.3.9. Send Protocol
5.3.10. Receive Protocol
5.3.11. Local Operations on FPGA
6. Evaluation
6.1. Performance Analysis
6.2. Future Work
6.3. Conclusions
Bibliography