Efficient Broadcast for Multicast-Capable Interconnection Networks
Siebert, Christian, 30 September 2006
The broadcast function MPI_Bcast() from the
MPI-1.1 standard is one of the most heavily
used collective operations in the message-passing
programming paradigm.
This diploma thesis makes use of a feature called
"multicast", which is supported by several
network technologies (such as Ethernet or
InfiniBand), to create an efficient MPI_Bcast()
implementation, especially for large communicators
and small messages.
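For reference, a minimal MPI_Bcast() usage sketch in C for the case targeted
here, a small message broadcast across a communicator (the payload contents
and sizes are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* A small payload, typical of the message sizes targeted here. */
        int payload[4] = {0, 0, 0, 0};
        if (rank == 0)
            payload[0] = 42;   /* the root fills the buffer before broadcasting */

        /* Every process in the communicator ends up with the root's buffer. */
        MPI_Bcast(payload, 4, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d received %d\n", rank, payload[0]);
        MPI_Finalize();
        return 0;
    }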
A preceding analysis of existing real-world
applications leads to an algorithm that not
only performs well in synthetic benchmarks
but performs even better for a wide class of
parallel applications. The resulting
broadcast has been implemented for the
open-source MPI library "Open MPI" using
IP multicast.
The results show that the new broadcast
consistently outperforms existing point-to-point
implementations as soon as the number of MPI
processes exceeds eight nodes. The performance
gain reaches a factor of 4.9 on 342 nodes,
because the new algorithm scales almost
independently of the number of participating
processes. / The broadcast function MPI_Bcast() from the
MPI-1.1 standard is one of the most heavily used
collective communication operations of the
message-passing programming paradigm.
This diploma thesis uses the multicast capability
provided by several network technologies (such as
Ethernet or InfiniBand) to create an efficient
MPI_Bcast() implementation, especially for large
communicators and small message sizes.
A preceding analysis of existing parallel
applications showed that the new algorithm not
only performs well in synthetic benchmarks but
can unfold its potential even better in real
applications. The resulting broadcast was
developed for the open-source MPI library
"Open MPI" and is based on IP multicast.
The results show that the new broadcast
consistently outperforms any point-to-point
implementation as soon as the number of MPI
processes exceeds eight nodes. The speedup
reaches a factor of 4.9 on 342 nodes, since the
new algorithm scales practically independently
of the number of nodes.
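Since the implementation is based on IP multicast, a minimal sketch of how a
process can join an IPv4 multicast group with the standard POSIX sockets API
(group address and port are illustrative; the actual Open MPI component is
more involved, and error handling is omitted):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        /* UDP socket bound to the port on which the group traffic arrives. */
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(12345);                          /* illustrative port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* Join the multicast group so datagrams sent to it are delivered here. */
        struct ip_mreq mreq;
        mreq.imr_multiaddr.s_addr = inet_addr("239.255.0.1");  /* illustrative group */
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        char buf[256];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);             /* one multicast datagram */
        if (n > 0)
            printf("received %zd bytes\n", n);
        close(fd);
        return 0;
    }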
ESPGOAL: A Dependency Driven Communication Framework
Schneider, Timo; Eckelmann, Sven, 01 June 2011
Optimized implementations of blocking and nonblocking collective operations are most important for scalable high-performance applications. Offloading such collective operations into the communication layer can improve performance and the asynchronous progression of the operations. However, such offloading schemes must remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work, we describe an operating-system-kernel-based architecture for implementing an interpreter for the flexible Group Operation Assembly Language (GOAL) framework to offload collective communications. We describe an optimized scheme to store the schedules that define the collective operations and show an extension to profile the performance of the kernel layer. Our microbenchmarks demonstrate the effectiveness of the approach, and we show performance improvements over traditional progression in user space. We also discuss complications with the design and with offloading strategies in general. (A toy dependency-driven schedule sketch follows the chapter outline below.)

1 Introduction
1.1 Related Work
2 The GOAL API
2.1 API Conventions
2.2 Basic GOAL Functionality
2.2.1 Initialization
2.2.2 Graph Creation
2.2.3 Adding Operations
2.2.4 Adding Dependencies
2.2.5 Scratchpad Buffer
2.2.6 Schedule Compilation
2.2.7 Schedule Execution
2.3 GOAL-Extensions
3 ESP Transport Layer
3.1 Receive Handling
3.2 Transfer Management
3.2.1 Known Problems
4 The Architecture of ESPGOAL
4.1 Control Flow
4.1.1 Loading the Kernel Module
4.1.2 Adding a Communicator
4.1.3 Starting a Schedule
4.1.4 Schedule Progression
4.1.5 Progression by ESP
4.1.6 Unloading the Kernel Module
4.2 Data Structures
4.2.1 Starting a Schedule
4.2.2 Transfer Management
4.2.3 Stack Overflow Avoidance
4.3 Interpreting a GOAL Schedule
5 Implementing Collectives in GOAL
5.1 Recursive Doubling
5.2 Bruck's Algorithm
5.3 Binomial Trees
5.4 MPI_Barrier
5.5 MPI_Gather
6 Benchmarks
6.1 Testbed
6.2 Interrupt coalescing parameters
6.3 Benchmarking Point to Point Latency
6.4 Benchmarking Local Operations
6.5 Benchmarking Collective Communication Latency
6.6 Benchmarking Collective Communication Host Overhead
6.7 Comparing Different Ways to use Ethernet NICs
7 Conclusions and Future Work
8 Acknowledgments
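Purely as an illustration of the dependency-driven idea behind such schedules
(the names and the trivial "execute" step below are hypothetical, not the
actual GOAL API), a self-contained toy in C that runs each operation once all
operations it depends on have completed:

    #include <stdio.h>

    enum { MAX_DEPS = 2 };

    /* One schedule entry: what to do and which earlier entries it waits for. */
    struct op {
        const char *label;      /* stand-in for a send/recv/local operation */
        int deps[MAX_DEPS];     /* indices of operations this one depends on */
        int ndeps;
        int done;
    };

    int main(void) {
        /* A non-root rank of a binomial-tree broadcast as a tiny schedule:
         * both sends depend on the receive from the parent having finished. */
        struct op sched[] = {
            { "recv chunk from rank 0", { 0 }, 0, 0 },  /* op 0: no dependencies */
            { "send chunk to rank 3",   { 0 }, 1, 0 },  /* op 1: waits for op 0  */
            { "send chunk to rank 5",   { 0 }, 1, 0 },  /* op 2: waits for op 0  */
        };
        int n = sizeof(sched) / sizeof(sched[0]);

        int remaining = n;
        while (remaining > 0) {
            for (int i = 0; i < n; i++) {
                if (sched[i].done)
                    continue;
                int ready = 1;
                for (int d = 0; d < sched[i].ndeps; d++)
                    if (!sched[sched[i].deps[d]].done)
                        ready = 0;
                if (ready) {                /* all dependencies satisfied */
                    printf("executing: %s\n", sched[i].label);
                    sched[i].done = 1;
                    remaining--;
                }
            }
        }
        return 0;
    }

In the same spirit, the collectives of Chapter 5 become dependency graphs: a
binomial-tree broadcast at a non-root rank, for example, is one receive from
the parent followed by sends to the children, each send depending only on
that receive.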