1. Fast Barrier Synchronization for InfiniBand
Hoefler, Torsten, 04 January 2006
Barrier synchronization is crucial for many parallel systems. This talk introduces different synchronization mechanisms and demonstrates new approaches that leverage special hardware properties of InfiniBand to lower barrier latency.
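The InfiniBand-specific mechanisms are not spelled out in this abstract, but the round structure that such hardware-aware barriers build on is well illustrated by the classic dissemination barrier. Below is a minimal sketch of that algorithm over plain MPI point-to-point calls; the function name and the use of MPI_Sendrecv are illustrative choices, not the talk's implementation.

    /* Classic dissemination barrier: after ceil(log2(size)) rounds every
     * process has transitively heard from every other one. Sketch only,
     * expressed over portable MPI point-to-point calls. */
    #include <mpi.h>

    void dissemination_barrier(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        char token = 0;
        for (int dist = 1; dist < size; dist <<= 1) {
            int to   = (rank + dist) % size;
            int from = (rank - dist + size) % size;
            /* Each round doubles the set of processes we have
             * (transitively) synchronized with. */
            MPI_Sendrecv(&token, 1, MPI_CHAR, to,   0,
                         &token, 1, MPI_CHAR, from, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }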

2. Analysis and Optimization of the Packet Scheduler in Open MPI
Lichei, Andre, 13 November 2006
We compare well-known measurement methods for the LogGP parameters and discuss their accuracy and the network contention they cause. Based on this, a new, theoretically exact measurement method that does not saturate the network is derived and explained in detail. The applicability of our method is shown for the low-level communication API of Open MPI across several interconnection networks.
Based on the LogGP model, we developed a low-overhead packet scheduling algorithm. It can handle different types of interconnects with different characteristics and produces schedules that are very close to the optimum for both small and large messages. The efficiency of the algorithm for small messages is shown for an Open MPI implementation. The implementation uses the LogGP benchmark to obtain the LogGP parameters of the available interconnects and can thus adapt to any given system.
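As one illustration of how measured LogGP parameters can drive a scheduling decision: under LogGP, sending s bytes costs roughly L + 2o + (s - 1)G, so for large messages the per-byte Gap G dominates and a message striped across several interconnects should be split in inverse proportion to each one's G. The sketch below is a simplified stand-in for such a scheduler, not the thesis's algorithm; all names are assumptions.

    #include <stddef.h>

    /* Measured LogGP parameters of one interconnect:
     * latency, overhead, gap between messages, Gap per byte. */
    struct loggp { double L, o, g, G; };

    /* Split `total` bytes across n interconnects proportionally to 1/G,
     * so faster links carry correspondingly more of the payload. */
    void stripe_by_gap(const struct loggp *nic, size_t n,
                       size_t total, size_t *share)
    {
        double inv_sum = 0.0;
        for (size_t i = 0; i < n; i++)
            inv_sum += 1.0 / nic[i].G;

        size_t assigned = 0;
        for (size_t i = 0; i < n; i++) {
            share[i] = (size_t)(total * (1.0 / nic[i].G) / inv_sum);
            assigned += share[i];
        }
        share[0] += total - assigned;   /* give rounding remainder to link 0 */
    }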

3. Entwicklung einer optimierten kollektiven Komponente (Development of an Optimized Collective Component)
Mosch, Marek, 24 September 2007
This diploma thesis deals with the development of a collective component for the MPI-2 implementation Open MPI. The component is to provide optimized algorithms for the Myrinet network on the basis of the low-level communication protocol GM.

4. Optimierte Implementierung ausgewählter kollektiver Operationen unter Ausnutzung der Hardwareparallelität des InfiniBand Netzwerkes (Optimized Implementation of Selected Collective Operations Exploiting the Hardware Parallelism of the InfiniBand Network)
Franke, Maik, 24 September 2007
The goal of this thesis is an optimized implementation of the reduction operations defined in the MPI-1 standard, MPI_Reduce(), MPI_Allreduce(), MPI_Scan(), and MPI_Reduce_scatter(), for the InfiniBand network, placing particular emphasis on special InfiniBand operations and on hardware parallelism.
InfiniBand makes it possible to separate communication operations cleanly from computation, which allows the two types of operation to overlap during a reduction. The potential of this method is to be evaluated both theoretically, in a model, and practically, in a prototype implementation within the Open MPI framework; the end result is to be compared with existing implementations (e.g. MVAPICH). / The performance of collective communication operations is one of the deciding factors in the overall performance of an MPI application. Current implementations of MPI use the point-to-point components to access the InfiniBand network; we therefore attempt to improve the performance of a collective component by accessing the InfiniBand network directly, which avoids overhead and makes it possible to tune the algorithms to this specific network. Various algorithms for the MPI_Reduce, MPI_Allreduce, MPI_Scan, and MPI_Reduce_scatter operations are presented. The theoretical performance of the algorithms is analyzed with the LogfP and LogGP models. Selected algorithms are implemented as part of an Open MPI collective component. Finally, the performance of different algorithms and different MPI implementations is compared.
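For orientation, the sketch below shows a generic binomial-tree reduction over MPI point-to-point calls, i.e. the kind of software baseline that a direct InfiniBand implementation with communication/computation overlap is measured against; it is not the thesis's code. A commutative sum on a single double, rooted at rank 0, keeps it short.

    #include <mpi.h>

    double binomial_reduce_sum(double value, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int dist = 1; dist < size; dist <<= 1) {
            if (rank & dist) {
                /* Hand the partial result to the parent; this rank is done. */
                MPI_Send(&value, 1, MPI_DOUBLE, rank - dist, 0, comm);
                break;
            } else if (rank + dist < size) {
                double partner;
                MPI_Recv(&partner, 1, MPI_DOUBLE, rank + dist, 0, comm,
                         MPI_STATUS_IGNORE);
                value += partner;       /* combine partial results */
            }
        }
        return value;                   /* complete sum is valid on rank 0 */
    }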

5. Evaluating and Improving the Performance of MPI-Allreduce on QLogic HTX/PCIe InfiniBand HCA
Mittenzwey, Nico, 30 June 2009
This thesis analysed the QLogic InfiniPath QLE7140 HCA with its onload architecture and compared the results to the Mellanox InfiniHost III Lx HCA, which uses an offload architecture. As expected, the QLogic InfiniPath QLE7140 HCA outperforms the Mellanox InfiniHost III Lx HCA in terms of latency and bandwidth on our test system in various test scenarios. The benchmarks showed that sending messages with multiple threads in parallel can increase the bandwidth greatly, while bi-directional sends cut the effective bandwidth for one HCA by up to 30%.
Different all-reduce algorithms were evaluated and compared with the help of the LogGP model. The comparison showed that new all-reduce algorithms can outperform the ones already implemented in Open MPI in different scenarios.
The thesis also demonstrated that multicast algorithms for InfiniBand can be implemented easily using the RDMA-CM API.
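The RDMA-CM claim refers to librdmacm's multicast support, where a group is joined with a short sequence of calls. The sketch below shows that call order only; queue pair and completion queue setup, error cleanup, and the group address are elided or placeholders.

    /* Minimal RDMA-CM multicast join sketch (link with -lrdmacm).
     * A UD queue pair must be created on `id` before the join;
     * that setup is omitted here for brevity. */
    #include <rdma/rdma_cma.h>
    #include <arpa/inet.h>
    #include <string.h>

    int join_mcast_group(const char *group_ip)
    {
        struct rdma_event_channel *ch = rdma_create_event_channel();
        struct rdma_cm_id *id = NULL;
        struct rdma_cm_event *ev;
        struct sockaddr_in mcast;

        if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_UDP))
            return -1;

        memset(&mcast, 0, sizeof mcast);
        mcast.sin_family = AF_INET;
        inet_pton(AF_INET, group_ip, &mcast.sin_addr);

        /* Resolve the group address to a local RDMA device. */
        if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&mcast, 2000) ||
            rdma_get_cm_event(ch, &ev) ||
            ev->event != RDMA_CM_EVENT_ADDR_RESOLVED)
            return -1;
        rdma_ack_cm_event(ev);

        /* Join; sends posted to the group address then fan out in hardware. */
        if (rdma_join_multicast(id, (struct sockaddr *)&mcast, NULL) ||
            rdma_get_cm_event(ch, &ev) ||
            ev->event != RDMA_CM_EVENT_MULTICAST_JOIN)
            return -1;
        rdma_ack_cm_event(ev);
        return 0;
    }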

6. Low Overhead Ethernet Communication for Open MPI on Linux Clusters
Hoefler, Torsten; Reinhardt, Mirko; Mietke, Frank; Mehlan, Torsten; Rehm, Wolfgang, 20 July 2006
This paper describes the basic concepts of our solution for improving the performance of Ethernet communication in a Linux cluster environment by introducing Reliable Low Latency Ethernet Sockets. We show that about 25% of the socket latency can be saved by using our simplified protocol. In particular, we demonstrate that this performance benefit speeds up MPI-level communication. To this end, we developed a new BTL component for Open MPI, an open-source MPI-2 implementation whose Modular Component Architecture offers a nearly ideal environment for implementing our changes. Microbenchmarks of MPI collective and point-to-point operations were performed. We see a performance improvement of 8% to 16% for the LU and SP implementations of the NAS parallel benchmark suite, which spend a significant amount of time in MPI. Practical application tests with Abinit, an electronic structure calculation program, show that the runtime can be nearly halved on a 4-node system. We thus show evidence that our new Ethernet communication protocol can increase the speedup of parallel applications considerably.
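The protocol itself is not detailed in this abstract; the sketch below only shows the Linux mechanism a protocol of this kind can sit on, an AF_PACKET socket exchanging raw Ethernet frames beneath the TCP/IP stack (root or CAP_NET_RAW required). The EtherType value and the function name are placeholders.

    #include <sys/socket.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <arpa/inet.h>
    #include <string.h>
    #include <unistd.h>

    #define MY_ETHERTYPE 0x8822     /* placeholder protocol number */

    int open_raw_eth(const char *ifname)
    {
        /* Raw socket: frames bypass IP/TCP processing entirely. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(MY_ETHERTYPE));
        if (fd < 0)
            return -1;

        /* Bind to one NIC so the kernel delivers only frames of our
         * EtherType that arrive on this interface. */
        struct sockaddr_ll sll;
        memset(&sll, 0, sizeof sll);
        sll.sll_family   = AF_PACKET;
        sll.sll_protocol = htons(MY_ETHERTYPE);
        sll.sll_ifindex  = if_nametoindex(ifname);
        if (bind(fd, (struct sockaddr *)&sll, sizeof sll) < 0) {
            close(fd);
            return -1;
        }
        return fd;   /* read()/write() now move whole Ethernet frames */
    }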

7. Improving the Performance of Selected MPI Collective Communication Operations on InfiniBand Networks
Viertel, Carsten, 23 September 2007
The performance of collective communication operations is one of the deciding factors in the overall performance of an MPI application. Open MPI's component architecture offers an easy way to implement new algorithms for collective operations, but current implementations use the point-to-point components to access the InfiniBand network. We therefore attempt to improve the performance of a collective component by accessing the InfiniBand network directly, which avoids overhead and makes it possible to tune the algorithms to this specific network.
The first part of this work gives a short overview of the InfiniBand Architecture and Open MPI. The next part analyzes several models for parallel computation. Afterwards, various algorithms for the MPI_Scatter, MPI_Gather, and MPI_Allgather operations are presented. The theoretical performance of the algorithms is analyzed with the LogfP and LogGP models. Selected algorithms are implemented as part of an Open MPI collective component. Finally, the performance of different algorithms and different MPI implementations is compared. The test results show that the performance of the operations could be improved for several message and communicator size ranges.
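As an illustration of the kind of model-driven comparison used here, the sketch below estimates LogGP costs for a linear and a binomial-tree gather. The formulas are textbook first-order approximations under assumed parameter values, not the thesis's exact models.

    #include <stdio.h>

    /* LogGP parameters: latency, overhead, gap, Gap per byte. */
    struct loggp { double L, o, g, G; };

    /* Linear gather: the root receives P-1 messages of s bytes back to
     * back; arrivals are separated by the gap g or the byte cost s*G,
     * whichever is larger. */
    double gather_linear(struct loggp p, int P, double s)
    {
        double per_msg = (p.g > s * p.G) ? p.g : s * p.G;
        return p.L + 2.0 * p.o + (P - 1) * per_msg;
    }

    /* Binomial-tree gather: ceil(log2 P) rounds, the payload doubling
     * each round as subtree results are forwarded upward. */
    double gather_binomial(struct loggp p, int P, double s)
    {
        double t = 0.0, bytes = s;
        for (int reached = 1; reached < P; reached <<= 1) {
            t += p.L + 2.0 * p.o + bytes * p.G;
            bytes *= 2.0;
        }
        return t;
    }

    int main(void)
    {
        struct loggp ib = { 4.0, 1.0, 2.0, 0.001 };  /* placeholder values, us */
        for (int P = 2; P <= 64; P *= 2)
            printf("P=%2d  linear=%8.2f us  binomial=%8.2f us\n",
                   P, gather_linear(ib, P, 1024), gather_binomial(ib, P, 1024));
        return 0;
    }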