Optimized implementations of blocking and nonblocking collective operations are essential for scalable high-performance applications. Offloading such collective operations into the communication layer can improve performance and asynchronous progression of the operations. However, such offloading schemes must remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work we propose a design for a collective offload unit.
Our hardware design is able to execute dependency-graph-based representations of collective functions. To cope with the scarcity of memory resources, we designed a new point-to-point messaging protocol which does not need to store information about unexpected messages. The offload unit proposed in this thesis could be integrated into high-performance networks such as EXTOLL. Our design achieves a clock frequency of 212 MHz on a Xilinx Virtex-6 FPGA, while using less than 10% of the available logic slices and less than 30% of the available memory blocks. Due to the specialization of our design, we can accelerate important tasks of the message passing framework, such as message matching, by a factor of two compared to a software implementation running on a CPU with a ten times higher clock speed.
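To illustrate what executing a dependency-graph-based representation of a collective can look like, the following is a minimal software sketch, not the hardware design described in the thesis. It assumes a hypothetical schedule format in which each operation has an identifier and a list of operations that must complete first; the names `run_schedule`, `ops`, and `deps` are invented for this example. The schedule shown is one rank's view of a small tree-shaped reduction.

```python
from collections import deque


def run_schedule(ops, deps):
    """Run operations in an order that respects their dependencies.

    ops:  dict mapping an operation id to a description (hypothetical format)
    deps: dict mapping an operation id to the ids that must complete first
    Returns the list of operation ids in the order they were issued.
    """
    # Count unfinished prerequisites for every operation.
    remaining = {op: len(deps.get(op, [])) for op in ops}

    # Reverse mapping: which operations wait on a given operation.
    dependents = {op: [] for op in ops}
    for op, prerequisites in deps.items():
        for pre in prerequisites:
            dependents[pre].append(op)

    # Operations without prerequisites can be issued immediately.
    ready = deque(op for op in ops if remaining[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)  # a real unit would start the send/recv/compute here
        for waiting in dependents[op]:
            remaining[waiting] -= 1
            if remaining[waiting] == 0:
                ready.append(waiting)

    if len(order) != len(ops):
        raise ValueError("schedule contains a dependency cycle")
    return order


# Assumed example: rank 0 receives partial results from ranks 1 and 2
# and combines them one after the other.
ops = {
    "recv_1": "receive partial result from rank 1",
    "recv_2": "receive partial result from rank 2",
    "add_1": "local + data from rank 1",
    "add_2": "result of add_1 + data from rank 2",
}
deps = {
    "add_1": ["recv_1"],
    "add_2": ["add_1", "recv_2"],
}

for op in run_schedule(ops, deps):
    print(op, "-", ops[op])
```

Because both receives have no prerequisites, they can be issued up front, while each combine step waits only for the data it actually needs. This is the property that lets such a schedule progress asynchronously once it has been handed to an offload unit.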
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa.de:bsz:ch1-qucosa-89006 |
Date | 18 June 2012 |
Creators | Schneider, Timo, Eckelmann, Sven |
Contributors | TU Chemnitz, Fakultät für Informatik, Prof. Dr. Wolfgang Rehm, Prof. Dr. Torsten Hoefler, Dipl. Inf. Jochen Strunk |
Publisher | Universitätsbibliothek Chemnitz |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | doc-type:masterThesis |
Format | application/pdf, text/plain, application/zip |