21 |
Efektivní komunikace v multi-GPU systémech / Efficient Communication in Multi-GPU Systems
Špeťko, Matej January 2018
After Nvidia introduced CUDA, GPUs became devices capable of accelerating general-purpose computation. GPUs are designed as parallel processors that possess huge computational power, and modern supercomputers are often equipped with GPU accelerators. Sometimes the performance of a single GPU is not enough for a scientific application, which then needs to scale across multiple GPUs. During the computation, the GPUs need to exchange partial results. This communication is overhead on top of the computation, so it is important to research methods for efficient communication between GPUs: less CPU involvement, lower latency, and shared system buffers. This thesis focuses on inter-node and intra-node GPU-to-GPU communication using Nvidia's GPUDirect technologies and CUDA-Aware MPI. It then introduces the k-Wave toolbox for simulating the propagation of acoustic waves. This application is accelerated using CUDA-Aware MPI, and peer-to-peer transfer support is integrated into k-Wave using CUDA Inter-Process Communication.
|
22 |
Evaluating and Improving the Performance of MPI-Allreduce on QLogic HTX/PCIe InfiniBand HCA
Mittenzwey, Nico 31 March 2009
This thesis analysed the QLogic InfiniPath QLE7140 HCA and its onload architecture
and compared the results to the Mellanox InfiniHost III Lx HCA which uses an offload
architecture. As expected, the QLogic InfiniPath QLE7140 HCA can outperform the
Mellanox InfiniHost III Lx HCA in latency and bandwidth terms on our test system in
various test scenarios. The benchmarks showed that sending messages with multiple
threads in parallel can increase the bandwidth greatly, while bi-directional sends cut
the effective bandwidth for one HCA by up to 30%.
Different all-reduce algorithms were evaluated and compared with the help of the
LogGP model. The comparison showed that new all-reduce algorithms can outperform the ones already implemented in Open MPI in different scenarios.
The thesis also demonstrated that multicast algorithms for InfiniBand can be
implemented easily by using the RDMA-CM API.
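As a rough illustration of how the LogGP model can be used to compare all-reduce algorithms, the following sketch estimates the cost of a recursive-doubling all-reduce. It is not taken from the thesis: the function names, the choice of recursive doubling, and the simplifications (the send and receive in each round fully overlap, the reduction itself costs nothing, and the process count is a power of two) are assumptions made for illustration only. The one model fact relied on is that LogGP charges o + (k-1)G + L + o for a single k-byte message.

```python
def loggp_send_time(k, L, o, G):
    """End-to-end time of one k-byte message under LogGP.

    L: network latency, o: per-message CPU overhead on each side,
    G: gap per byte for long messages (all in the same time unit).
    """
    return o + (k - 1) * G + L + o


def recursive_doubling_allreduce_time(p, k, L, o, G):
    """Hypothetical estimate: log2(p) pairwise exchanges of k bytes each.

    Assumes p is a power of two, that the send and the receive of a
    round overlap completely, and that the reduction is free.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    rounds = p.bit_length() - 1  # log2(p) for a power of two
    return rounds * loggp_send_time(k, L, o, G)
```

With made-up parameters such as L = 5.0, o = 1.0 and G = 0.01 (microseconds), the estimate for 8 processes is three times the single-message time; plugging in measured parameters for two interconnects would allow the kind of comparison the abstract describes.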
|
23 |
Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects
Potluri, Sreeram 18 September 2014
No description available.
|
24 |
Designing High Performance and Scalable Unified Communication Runtime (UCR) for HPC and Big Data Middleware
Jose, Jithin 30 December 2014
No description available.
|
25 |
High Performance Network I/O in Virtual Machines over Modern Interconnects
Huang, Wei 12 September 2008
No description available.
|
26 |
Evaluation of publicly available Barrier-Algorithms and Improvement of the Barrier-Operation for large-scale Cluster-Systems with special Attention on InfiniBand Networks
Hoefler, Torsten 28 June 2005
The MPI_Barrier-collective operation, as a part of the MPI-1.1
standard, is extremely important for all parallel applications using it.
The latency of this operation increases the application run time and
cannot be overlapped with computation. Thus, overall MPI performance can be
degraded by unsatisfactory barrier latency. The main goals of this work are to
lower the barrier latency for InfiniBand networks by analyzing well-known
barrier algorithms with regard to their suitability within
InfiniBand networks, to enhance the barrier operation by utilizing
standard InfiniBand operations as much as possible, and to design a
constant time barrier for InfiniBand with special hardware support.
This partition into three main steps is retained throughout the whole
thesis. The first part evaluates publicly known models and proposes a
new, more accurate model (LoP) for InfiniBand. All barrier algorithms are
evaluated within the well-known LogP model and this new model. Two new
algorithms that promise better performance have been developed. A
constant-time barrier integrated into InfiniBand as well as a cheap
separate barrier network are proposed in the hardware section. All
results have been implemented inside the Open MPI framework. This work
led to three new Open MPI collective modules. The first one implements
different barrier algorithms which are dynamically benchmarked and
selected during the startup phase to maximize the performance. The
second one offers a special barrier implementation for InfiniBand with RDMA
and performs up to 40% better than the best solution that has been
published so far. The third implementation offers a constant time
barrier in a separate network, leveraging commodity components, with a
latency of only 2.5 microseconds. Each component has its own strengths and can
be used to enhance the barrier performance significantly.
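To make the idea of a software barrier algorithm concrete, here is a small simulation of the dissemination barrier, one well-known algorithm of this kind. The abstract does not say which algorithms the thesis evaluates, so this choice and the function name are assumptions for illustration. In round r, process i signals process (i + 2^r) mod p; the barrier is complete once every process has heard, directly or transitively, from all others.

```python
def dissemination_barrier_rounds(p):
    """Simulate a dissemination barrier and return the number of rounds.

    known[i] is the set of processes whose arrival process i has
    learned about so far (initially only itself).
    """
    known = [{i} for i in range(p)]
    rounds = 0
    while any(len(s) < p for s in known):
        dist = 1 << rounds  # signalling distance doubles every round
        # Snapshot so that all signals of a round carry pre-round knowledge.
        snapshot = [set(s) for s in known]
        for i in range(p):
            known[(i + dist) % p] |= snapshot[i]
        rounds += 1
    return rounds
```

The simulation needs ceil(log2(p)) rounds, for example 3 rounds for 5 or 8 processes, which is why the per-round message latency dominates the barrier time on a low-latency network.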
|
27 |
Software-defined Buffer Management and Robust Congestion Control for Modern Datacenter Networks
Danushka N Menikkumbura (12208121) 20 April 2022
Modern datacenter network applications continue to demand ultra-low latencies and very high throughputs. At the same time, network infrastructure keeps achieving higher speeds and larger bandwidths. Better network management solutions are still needed to keep this demand and supply in step. Key metrics that define network performance, such as flow completion time (the lower the better), throughput (the higher the better), and end-to-end latency (the lower the better), are mainly governed by how effectively network applications get their fair share of network resources. We observe that buffer utilization on network switches gives a very accurate indication of network performance. Therefore, network buffer management is important in modern datacenter networks, and other network management solutions can be efficiently built around buffer utilization. This dissertation presents three solutions based on buffer use on network switches.
The dissertation consists of three main sections. The first presents a specification language for buffer management in modern programmable switches. The second presents a congestion control solution for Remote Direct Memory Access (RDMA) networks. The third presents a solution to head-of-line blocking in modern datacenter networks.
|
28 |
Evaluation of publicly available Barrier-Algorithms and Improvement of the Barrier-Operation for large-scale Cluster-Systems with special Attention on InfiniBand Networks
Hoefler, Torsten 01 April 2005
The MPI_Barrier-collective operation, as a part of the MPI-1.1
standard, is extremely important for all parallel applications using it.
The latency of this operation increases the application run time and
cannot be overlapped with computation. Thus, overall MPI performance can be
degraded by unsatisfactory barrier latency. The main goals of this work are to
lower the barrier latency for InfiniBand networks by analyzing well-known
barrier algorithms with regard to their suitability within
InfiniBand networks, to enhance the barrier operation by utilizing
standard InfiniBand operations as much as possible, and to design a
constant time barrier for InfiniBand with special hardware support.
This partition into three main steps is retained throughout the whole
thesis. The first part evaluates publicly known models and proposes a
new, more accurate model (LoP) for InfiniBand. All barrier algorithms are
evaluated within the well-known LogP model and this new model. Two new
algorithms that promise better performance have been developed. A
constant-time barrier integrated into InfiniBand as well as a cheap
separate barrier network are proposed in the hardware section. All
results have been implemented inside the Open MPI framework. This work
led to three new Open MPI collective modules. The first one implements
different barrier algorithms which are dynamically benchmarked and
selected during the startup phase to maximize the performance. The
second one offers a special barrier implementation for InfiniBand with RDMA
and performs up to 40% better than the best solution that has been
published so far. The third implementation offers a constant time
barrier in a separate network, leveraging commodity components, with a
latency of only 2.5 microseconds. Each component has its own strengths and can
be used to enhance the barrier performance significantly.
|