Global ETD Search

1	Design and Evaluation of Efficient Collective Communications on Modern Interconnects and Multi-core Clusters Qian, Ying 11 January 2010 Two driving forces behind high-performance clusters are the availability of modern interconnects and the advent of multi-core systems. As multi-core clusters become commonplace, where each core will run at least one process with multiple intra-node and inter-node connections to several other processes, there will be immense pressure on the interconnection network and its communication system software. Many parallel scientific applications use Message Passing Interface (MPI) collective communications intensively. Therefore, efficient and scalable implementation of MPI collective operations is critical to the performance of applications running on clusters. In this dissertation, I propose and evaluate a number of efficient collective communication algorithms that utilize the modern features of Quadrics and InfiniBand interconnects as well as the availability of multiple cores on emerging clusters. To overcome bandwidth limitations and to enhance fault tolerance, using multiple independent networks, known as multi-rail networks, is very promising. The Quadrics multi-rail QsNetII network is constructed using multiple network interface cards (NICs) per node, where each NIC is connected to a rail. I design and evaluate a number of Remote Direct Memory Access (RDMA) based multi-port collective operations on the multi-rail QsNetII network. I also extend the gather and allgather algorithms to be shared memory aware for small to medium messages. The algorithms prove to be much more efficient than the native Quadrics MPI implementation. ConnectX is the newest generation of InfiniBand host channel adapters from Mellanox Technologies. I provide evidence that ConnectX achieves scalable performance for simultaneous communication over multiple connections. Utilizing this ability of ConnectX cards, I propose a number of RDMA-based multi-connection and multi-core aware allgather algorithms at the MPI level. My algorithms are devised to target different message sizes, and the performance results show that they outperform the native MVAPICH implementation. Recent studies show that MPI processes in real applications could arrive at an MPI collective operation at different times. This imbalanced process arrival pattern can significantly affect the performance of the collective communication operation. Therefore, the design and efficient implementation of collectives under different process arrival patterns is critical to the performance of scientific applications running on modern clusters. I propose novel RDMA-based process arrival pattern aware alltoall and allgather algorithms for different message sizes over InfiniBand clusters. I also extend the algorithms to be shared memory aware for small to medium messages under process arrival patterns. The performance results indicate that the proposed algorithms outperform the native MVAPICH implementation as well as other non-process arrival pattern aware algorithms when processes arrive at different times. / Thesis (Ph.D, Electrical & Computer Engineering) -- Queen's University, 2010-01-10 21:13:33.249 Collective communications Interconnects
2	Optimizing All-to-All and Allgather Communications on GPGPU Clusters Singh, Ashish Kumar 25 June 2012 No description available. Computer Science GPGPU HPC Infiniband Collective Communications MPI
3	Designing optimized MPI+NCCL hybrid collective communication routines for dense many-GPU clusters Senthil Kumar, Nithin 04 October 2021 No description available. Computer Science MPI NCCL NVIDIA Collective Communications Library CUDA-aware MPI MVAPICH2-GDR MVAPICH2
4	Evoluční návrh kolektivních komunikací akcelerovaný pomocí GPU / Evolutionary Design of Collective Communications Accelerated by GPUs Tyrala, Radek January 2012 This thesis provides an analysis of the application for evolutionary scheduling of collective communications. It proposes possible ways to accelerate the application using general purpose computing on graphics processing units (GPU). This work offers a theoretical overview of systems on a chip and collective communications scheduling, and a more detailed description of evolutionary algorithms. Further, the work provides a description of the GPU architecture and its memory hierarchy using the OpenCL memory model. Based on profiling, the work defines a concept for parallel execution of the fitness function. Furthermore, an estimate of the possible level of acceleration is presented. The process of implementation is described with a closer insight into the optimization process. Another important point is the comparison of the original CPU-based solution and the massively parallel GPU version. As the final point, the thesis proposes distribution of the computation among different devices supported by the OpenCL standard. The conclusion discusses further advantages, constraints, and possibilities of acceleration through distribution across heterogeneous computing systems.
