1 |
Scalable and High Performance Collective Communication for Next Generation Multicore InfiniBand Clusters. Mamidala, Amith Rajith. 24 June 2008.
No description available.
|
2 |
Verargumentierte Geschichte: Exempla romana im politischen Diskurs der späten römischen Republik [History as Argument: Roman exempla in the political discourse of the late Roman Republic]. Bücher, Frank. January 2006.
Revised text of a dissertation, Philosophische Fakultät, Universität zu Köln, 2004-2005. Bibliography p. [333]-349.
|
3 |
One To Many And Many To Many Collective Communication Operations On Grids. Gupta, Rakhi. 12 1900.
Collective communication operations are widely used in MPI applications and play an important role in their performance. Hence, many projects have focused on optimizing collective communications for different kinds of parallel computing environments, including LAN settings, heterogeneous networks and, most recently, Grid systems. What distinguishes Grids from all the other environments is the heterogeneity of hosts and network links, together with dynamically changing resource characteristics, including load and availability.
The first part of the thesis develops a solution for MPI broadcast (one-to-many) on Grids. Some current strategies consider only static information about network topology when determining an efficient broadcast tree for Grids; others take into account only transient network characteristics. We combine both strategies and cluster the network dynamically on the basis of link bandwidths. Given a set of network parameters, we use Simulated Annealing (SA) to obtain the best schedule. We can also time-tune individual SAs, adapting the solution-finding process to the estimated time available before the next broadcast invocation in the application. We also developed a software architecture for updating schedules. We compared our algorithm with earlier approaches under loaded network conditions and obtained an average performance improvement of 20%.
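To make the approach concrete, here is a minimal, self-contained sketch of simulated annealing over broadcast schedules under a simple store-and-forward cost model. The cost model, move operator, and parameters are illustrative assumptions, not the thesis's actual implementation:

```python
import math
import random

def schedule_cost(order, bw, msg=1.0):
    """Model a broadcast in which nodes are informed in 'order': each new
    node is served by the already-informed node that can deliver the
    message earliest, and a sender handles one transfer at a time."""
    root = order[0]
    ready = {root: 0.0}              # time at which a node has the message
    busy = {root: 0.0}               # time at which a node can send again
    for v in order[1:]:
        best_u, best_t = None, math.inf
        for u in ready:
            t = max(busy[u], ready[u]) + msg / bw[u][v]
            if t < best_t:
                best_u, best_t = u, t
        busy[best_u] = best_t        # sender is occupied until transfer ends
        ready[v] = busy[v] = best_t
    return max(ready.values())

def anneal(bw, n, iters=5000, t0=1.0, alpha=0.999, seed=0):
    """Search over broadcast orders with simulated annealing."""
    rng = random.Random(seed)
    order = [0] + rng.sample(range(1, n), n - 1)     # rank 0 is the root
    cost = schedule_cost(order, bw)
    best, best_cost, temp = order[:], cost, t0
    for _ in range(iters):
        i, j = rng.sample(range(1, n), 2)            # swap two non-root slots
        cand = order[:]
        cand[i], cand[j] = cand[j], cand[i]
        c = schedule_cost(cand, bw)
        if c < cost or rng.random() < math.exp((cost - c) / temp):
            order, cost = cand, c
            if c < best_cost:
                best, best_cost = cand[:], c
        temp *= alpha                                # cool down
    return best, best_cost

# Example: two 3-node clusters, fast links within a cluster, slow across.
n = 6
bw = [[10.0 if i // 3 == j // 3 else 1.0 for j in range(n)] for i in range(n)]
print(anneal(bw, n))
```

On the example matrix the annealer tends to settle on schedules that cross the slow inter-cluster link only once and then fan out over the fast intra-cluster links, which is the behavior the bandwidth-based clustering is meant to encourage.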
The second part of the thesis extends the work to the MPI allgather (many-to-many) operation. Current popular techniques use strict hierarchical schemes for this operation, wherein a representative (or coordinator) node is chosen from each cluster and inter-cluster communication is routed through these representatives. This is suboptimal, as inter-cluster communication usually travels over high-capacity links that can sustain more than one transfer at the same throughput. We developed a cluster-based, incremental heuristic algorithm for allgather on Grids.
We compared the time taken by allgather schedules determined by this algorithm with current popular implementations, as well as with a strategy in which allgather is constructed from a set of broadcast trees, and obtained an average performance improvement of 67% over existing strategies.
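For contrast, the strict hierarchical baseline described above can be sketched with mpi4py sub-communicators; the four-ranks-per-cluster mapping and all names here are assumptions for exposition, not code from the thesis. Note how every inter-cluster byte flows through the single leader exchange, which is exactly the underutilization the thesis's algorithm avoids:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

cluster_id = rank // 4                      # assumption: 4 ranks per cluster
local = comm.Split(color=cluster_id, key=rank)
is_leader = local.Get_rank() == 0
# Leaders form their own communicator; everyone else gets COMM_NULL.
leaders = comm.Split(color=0 if is_leader else MPI.UNDEFINED, key=rank)

data = rank                                 # each rank contributes one item
gathered = local.gather(data, root=0)       # 1) intra-cluster gather
if is_leader:
    chunks = leaders.allgather(gathered)    # 2) leaders-only exchange
    flat = [x for chunk in chunks for x in chunk]
else:
    flat = None
result = local.bcast(flat, root=0)          # 3) intra-cluster broadcast
```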
|
4 |
On Collective Communication and Notified Read in the Global Address Space Programming Interface (GASPI). End, Vanessa. 14 December 2016.
No description available.
|
5 |
A Survey of Barrier Algorithms for Coarse Grained Supercomputers. Hoefler, Torsten; Mehlan, Torsten; Mietke, Frank; Rehm, Wolfgang. 28 June 2005.
Several different algorithms are available to synchronize multiple processors. Some of them support only shared memory architectures or very fine grained supercomputers. This work gives an overview of all currently known algorithms suitable for distributed shared memory architectures and message-passing-based computer systems (loosely coupled or coarse grained supercomputers). No single barrier algorithm can be declared best for a given machine; several architectural aspects have to be taken into account. The overview of known barrier algorithms given in this work is aimed mainly at implementors of libraries supporting collective communication (such as MPI).
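To give one concrete instance of the algorithms such a survey covers, the dissemination barrier can be sketched in a few lines of mpi4py; it finishes in ceil(log2(p)) point-to-point rounds and needs no shared memory, which suits loosely coupled systems (an illustrative sketch, not code from the paper):

```python
from mpi4py import MPI

def dissemination_barrier(comm):
    """Each round k, rank r signals (r + 2^k) mod p and waits for
    (r - 2^k) mod p; after ceil(log2(p)) rounds all ranks have synced."""
    rank, size = comm.Get_rank(), comm.Get_size()
    dist = 1
    while dist < size:
        dest = (rank + dist) % size
        src = (rank - dist) % size
        comm.sendrecv(None, dest=dest, source=src)   # avoids deadlock
        dist *= 2

dissemination_barrier(MPI.COMM_WORLD)
```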
|
6 |
High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures. Chakraborty, Sourav. 10 October 2019.
No description available.
|
7 |
Tackling the Communication Bottlenecks of Distributed Deep Learning Training Workloads. Ho, Chen-Yu. 08 1900.
Deep Neural Networks (DNNs) find widespread applications across various domains, including computer vision, recommendation systems, and natural language processing.
Despite their versatility, training DNNs can be a time-consuming process, and accommodating large models and datasets on a single machine is often impractical.
To tackle these challenges, distributed deep learning (DDL) training workloads have gained increasing significance.
However, DDL training introduces synchronization requirements among nodes, and the mini-batch stochastic gradient descent algorithm heavily burdens network connections.
This dissertation proposes, analyzes, and evaluates three solutions addressing the communication bottleneck in DDL training workloads.
The first solution, SwitchML, introduces an in-network aggregation (INA) primitive that accelerates DDL workloads.
By aggregating model updates from multiple workers within the network, SwitchML reduces the volume of exchanged data.
This approach, which integrates switch processing with end-host protocols and deep learning frameworks, improves training speed by up to 5.5 times for real-world benchmark models.
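The aggregation primitive itself is easy to picture with a small host-side emulation; the slot count and fixed-point scale below are assumptions for exposition, and real SwitchML runs on programmable switch hardware with its own packet format:

```python
import numpy as np

SLOTS = 64          # assumed slots per aggregation "packet"
SCALE = 2 ** 16     # assumed fixed-point scaling factor

def aggregate(worker_grads):
    """Emulate slot-wise in-network summation of quantized gradients."""
    length = worker_grads[0].size
    out = np.empty(length, dtype=np.float32)
    for off in range(0, length, SLOTS):              # one packet per chunk
        acc = np.zeros(SLOTS, dtype=np.int64)        # switch accumulators
        for g in worker_grads:
            q = np.round(g[off:off + SLOTS] * SCALE).astype(np.int64)
            acc[:q.size] += q                        # integer slot-wise sum
        take = min(SLOTS, length - off)
        out[off:off + take] = acc[:take].astype(np.float32) / SCALE
    return out                                       # identical at all workers

grads = [np.random.randn(1000).astype(np.float32) for _ in range(4)]
summed = aggregate(grads)                            # sum of the 4 gradients
```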
The second solution, OmniReduce, is an efficient streaming aggregation system designed for sparse collective communication.
It optimizes performance for parallel computing applications, such as distributed training of large-scale recommendation systems and natural language processing models.
OmniReduce achieves maximum effective bandwidth utilization by transmitting only nonzero data blocks and leveraging fine-grained parallelization and pipelining.
OmniReduce outperforms state-of-the-art TCP/IP and RDMA network solutions by 3.5 to 16 times, delivering significantly better performance for network-bottlenecked DNNs even at 100 Gbps.
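The core idea can be sketched as follows; the block size and helper names are assumptions for exposition, not OmniReduce's wire protocol:

```python
import numpy as np

BLOCK = 256                                  # assumed elements per block

def nonzero_blocks(grad):
    """Yield (block_index, block) only for blocks holding any nonzero."""
    for i in range(0, grad.size, BLOCK):
        blk = grad[i:i + BLOCK]
        if np.any(blk):                      # all-zero blocks never leave
            yield i // BLOCK, blk

def sparse_allreduce(worker_grads):
    length = worker_grads[0].size
    total = np.zeros(length, dtype=np.float32)
    sent = 0
    for g in worker_grads:                   # stand-in for the network
        for idx, blk in nonzero_blocks(g):
            total[idx * BLOCK: idx * BLOCK + blk.size] += blk
            sent += blk.size                 # elements actually transmitted
    return total, sent / (length * len(worker_grads))

# Block-sparse gradients (as in embedding tables): each worker touches
# one small region, so only ~5% of the elements cross the wire.
workers = []
for w in range(4):
    g = np.zeros(10_000, dtype=np.float32)
    g[w * 512:(w + 1) * 512] = 1.0
    workers.append(g)
result, traffic_ratio = sparse_allreduce(workers)    # traffic_ratio ~ 0.05
```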
The third solution, CoInNetFlow, addresses congestion in shared data centers, where multiple DNN training jobs compete for bandwidth on the same node.
The study explores the feasibility of coflow scheduling methods in hierarchical and multi-tenant in-network aggregation communication patterns.
CoInNetFlow presents an innovative utilization of the Sincronia priority assignment algorithm.
Through packet-level DDL job simulation, the research demonstrates that appropriate weighting functions, transport-layer priority scheduling, and gradient compression on low-priority tensors can reduce the median Job Completion Time Inflation by over 70%.
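As a deliberately simplified stand-in for that priority-assignment step (a weighted shortest-remaining-first rule rather than Sincronia's actual primal-dual ordering; the level count and tuple layout are assumptions):

```python
PRIORITY_LEVELS = 8                          # assumed transport queue levels

def assign_priorities(coflows):
    """coflows: list of (job_id, remaining_bytes, weight) tuples."""
    order = sorted(coflows, key=lambda c: c[1] / c[2])   # weighted SRF
    # Earlier coflows map to higher priority (lower queue number).
    return {job: min(pos, PRIORITY_LEVELS - 1)
            for pos, (job, _, _) in enumerate(order)}

jobs = [("jobA", 4e9, 1.0), ("jobB", 1e8, 2.0), ("jobC", 2e9, 1.0)]
print(assign_priorities(jobs))   # jobB first: small and heavily weighted
```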
Collectively, this dissertation contributes to mitigating the network communication bottleneck in distributed deep learning.
The proposed solutions can enhance the efficiency and speed of distributed deep learning systems, ultimately improving the performance of DNN training across various domains.
|
8 |
Improving cluster performance through the use of programmable network interfaces. Buntinas, Darius Tomas. 14 October 2003.
No description available.
|
9 |
Topology-Aware MPI Communication and Scheduling for High Performance Computing Systems. Subramoni, Hari. 02 October 2013.
No description available.
|