  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Scalable and High Performance Collective Communication for Next Generation Multicore Infiniband Clusters

Mamidala, Amith Rajith 24 June 2008 (has links)
No description available.
2

Verargumentierte Geschichte : Exempla romana im politischen Diskurs der späten römischen Republik /

Bücher, Frank. January 2006 (has links)
Revised text of: Dissertation--Philosophische Fakultät--Universität zu Köln, 2004-2005. / Bibliography p. [333]-349.
3

One To Many And Many To Many Collective Communication Operations On Grids

Gupta, Rakhi 12 1900 (has links)
Collective communication operations are widely used in MPI applications and play an important role in their performance. Hence, various projects have focused on optimizing collective communication for different kinds of parallel computing environments, including LAN settings, heterogeneous networks, and most recently Grid systems. What distinguishes Grids from the other environments is the heterogeneity of hosts and networks, together with dynamically changing resource characteristics such as load and availability.

The first part of the thesis develops a solution for MPI broadcast (one-to-many) on Grids. Some current strategies use static information about the network topology to determine an efficient broadcast tree, while others take into account only transient network characteristics. We combine both approaches and cluster the network dynamically on the basis of link bandwidths. Given a set of network parameters, we use Simulated Annealing (SA) to obtain the best schedule. The individual SA runs can also be time-tuned, adapting the solution-finding process to the time estimated to be available before the next broadcast invocation in the application. We also developed a software architecture for updating schedules. We compared our algorithm with earlier approaches under loaded network conditions and obtained an average performance improvement of 20%.

The second part of the thesis extends this work to the MPI allgather (many-to-many) operation. Current popular techniques use strict hierarchical schemes for this operation, in which a representative (or coordinator) node is chosen from each cluster and inter-cluster communication passes through these representatives. This is non-optimal, because inter-cluster communication usually runs over high-capacity links that can sustain more than one transfer at the same throughput. We developed a cluster-based, incremental heuristic algorithm for allgather on Grids and compared the time taken by its schedules with current popular implementations, as well as with a strategy in which allgather is constructed from a set of broadcast trees. We obtained an average performance improvement of 67% over existing strategies.
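To make the schedule-search idea in the abstract above concrete, here is a minimal Python sketch of simulated annealing over broadcast orderings for hosts with known pairwise bandwidths. It is only an illustration under invented assumptions (a serialized cost model, synthetic bandwidths, a hypothetical message size) and does not reproduce the thesis's actual algorithm or software.

```python
"""Illustrative sketch: simulated annealing over broadcast schedules.

A schedule is the order in which receivers are informed; each receiver is
served by the already-informed host with the best link to it. The cost model
is deliberately simple (serialized transfers) just to show the search loop.
"""
import math
import random

MESSAGE_MB = 64.0  # hypothetical message size in megabytes


def schedule_cost(order, root, bandwidth):
    """Estimate completion time (s) of informing nodes in `order`."""
    informed = [root]
    total = 0.0
    for node in order:
        # pick the already-informed sender with the best link to `node`
        best_bw = max(bandwidth[s][node] for s in informed)
        total += MESSAGE_MB / best_bw
        informed.append(node)
    return total


def anneal(root, nodes, bandwidth, steps=5000, t0=10.0, alpha=0.999):
    order = [n for n in nodes if n != root]
    random.shuffle(order)
    best, cost = list(order), schedule_cost(order, root, bandwidth)
    best_cost, t = cost, t0
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)
        cand = list(order)
        cand[i], cand[j] = cand[j], cand[i]          # swap two receivers
        c = schedule_cost(cand, root, bandwidth)
        if c < cost or random.random() < math.exp((cost - c) / t):
            order, cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
        t *= alpha                                   # cool down
    return best, best_cost


if __name__ == "__main__":
    nodes = list(range(6))
    # synthetic bandwidths (MB/s): two "clusters" {0,1,2} and {3,4,5}
    bandwidth = [[1000.0 if (a < 3) == (b < 3) else 100.0 for b in nodes]
                 for a in nodes]
    order, t = anneal(root=0, nodes=nodes, bandwidth=bandwidth)
    print("order:", order, "estimated time: %.2f s" % t)
```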
4

On Collective Communication and Notified Read in the Global Address Space Programming Interface (GASPI)

End, Vanessa 14 December 2016 (has links)
No description available.
5

A Survey of Barrier Algorithms for Coarse Grained Supercomputers

Hoefler, Torsten, Mehlan, Torsten, Mietke, Frank, Rehm, Wolfgang 28 June 2005 (has links) (PDF)
There are several different algorithms available to synchronize multiple processors. Some of them support only shared-memory architectures or very fine-grained supercomputers. This work gives an overview of all currently known algorithms that are suitable for distributed shared-memory architectures and message-passing-based computer systems (loosely coupled or coarse-grained supercomputers). No absolute decision can be made when choosing a barrier algorithm for a machine; several architectural aspects have to be taken into account. The overview of known barrier algorithms given in this work is mostly targeted at implementers of libraries supporting collective communication (such as MPI).
6

A Survey of Barrier Algorithms for Coarse Grained Supercomputers

Hoefler, Torsten, Mehlan, Torsten, Mietke, Frank, Rehm, Wolfgang 28 June 2005 (has links)
There are several different algorithms available to synchronize multiple processors. Some of them support only shared-memory architectures or very fine-grained supercomputers. This work gives an overview of all currently known algorithms that are suitable for distributed shared-memory architectures and message-passing-based computer systems (loosely coupled or coarse-grained supercomputers). No absolute decision can be made when choosing a barrier algorithm for a machine; several architectural aspects have to be taken into account. The overview of known barrier algorithms given in this work is mostly targeted at implementers of libraries supporting collective communication (such as MPI).
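As a concrete companion to the survey described above, here is a minimal Python sketch of one classic algorithm family such a survey covers, the dissemination barrier, using threads as a stand-in for processes. The participant count and the logging check are invented for illustration; this is not code from the paper.

```python
"""Illustrative sketch: a dissemination barrier with threads.

In round r, participant i signals (i + 2^r) mod P and waits for a signal
from (i - 2^r) mod P; after ceil(log2 P) rounds everyone has synchronized.
"""
import math
import threading

P = 8                                    # participants (threads, not MPI ranks)
ROUNDS = math.ceil(math.log2(P))

# one semaphore per (receiver, round): "a signal for you in round r"
signals = [[threading.Semaphore(0) for _ in range(ROUNDS)] for _ in range(P)]


def dissemination_barrier(i):
    for r in range(ROUNDS):
        partner = (i + (1 << r)) % P
        signals[partner][r].release()    # notify this round's partner
        signals[i][r].acquire()          # wait for own notification


def worker(i, log):
    log.append(f"{i} before barrier")    # list.append is atomic under the GIL
    dissemination_barrier(i)
    log.append(f"{i} after barrier")


if __name__ == "__main__":
    log = []
    threads = [threading.Thread(target=worker, args=(i, log)) for i in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # every "before" entry must precede every "after" entry
    last_before = max(k for k, line in enumerate(log) if "before" in line)
    first_after = min(k for k, line in enumerate(log) if "after" in line)
    assert last_before < first_after, "barrier ordering violated"
    print("\n".join(log))
```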
7

High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures

Chakraborty, Sourav 10 October 2019 (has links)
No description available.
8

Tackling the Communication Bottlenecks of Distributed Deep Learning Training Workloads

Ho, Chen-Yu 08 1900 (has links)
Deep Neural Networks (DNNs) find widespread application across various domains, including computer vision, recommendation systems, and natural language processing. Despite their versatility, training DNNs can be time-consuming, and accommodating large models and datasets on a single machine is often impractical. To tackle these challenges, distributed deep learning (DDL) training workloads have gained increasing significance. However, DDL training introduces synchronization requirements among nodes, and the mini-batch stochastic gradient descent algorithm heavily burdens network connections. This dissertation proposes, analyzes, and evaluates three solutions addressing the communication bottleneck in DDL training workloads.

The first solution, SwitchML, introduces an in-network aggregation (INA) primitive that accelerates DDL workloads. By aggregating model updates from multiple workers within the network, SwitchML reduces the volume of exchanged data. This approach, which combines switch processing, end-host protocols, and deep learning frameworks, improves training speed by up to 5.5 times for real-world benchmark models.

The second solution, OmniReduce, is an efficient streaming aggregation system designed for sparse collective communication. It optimizes performance for parallel computing applications such as distributed training of large-scale recommendation systems and natural language processing models. OmniReduce achieves maximum effective bandwidth utilization by transmitting only nonzero data blocks and leveraging fine-grained parallelization and pipelining. It outperforms state-of-the-art TCP/IP and RDMA network solutions by 3.5 to 16 times, delivering significantly better performance for network-bottlenecked DNNs, even at 100 Gbps.

The third solution, CoInNetFlow, addresses congestion in shared data centers, where multiple DNN training jobs compete for bandwidth on the same node. The study explores the feasibility of coflow scheduling methods in hierarchical and multi-tenant in-network aggregation communication patterns. CoInNetFlow presents an innovative use of the Sincronia priority assignment algorithm. Through packet-level DDL job simulation, the research demonstrates that appropriate weighting functions, transport-layer priority scheduling, and gradient compression on low-priority tensors can significantly improve the median Job Completion Time Inflation by over 70%.

Collectively, this dissertation contributes to mitigating the network communication bottleneck in distributed deep learning. The proposed solutions can enhance the efficiency and speed of distributed deep learning systems, ultimately improving the performance of DNN training across various domains.
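To illustrate the sparse-aggregation idea described above, here is a minimal Python sketch that sums worker gradients while "transmitting" only nonzero blocks. The block size, gradient length, and data are invented for the example, and this is not OmniReduce's actual protocol or implementation.

```python
"""Illustrative sketch: aggregating sparse gradients block by block.

Each worker splits its gradient into fixed-size blocks and contributes
(block_index, block) pairs only for blocks containing nonzero values; the
aggregator sums per index and reconstructs the dense result.
"""
import numpy as np

BLOCK = 4          # hypothetical block size (elements)
LENGTH = 16        # hypothetical gradient length


def nonzero_blocks(grad):
    """Yield (block_index, block) for blocks with any nonzero element."""
    for idx in range(0, len(grad), BLOCK):
        block = grad[idx:idx + BLOCK]
        if np.any(block != 0):
            yield idx // BLOCK, block


def aggregate(worker_grads):
    """Sum gradients while 'sending' only the nonzero blocks."""
    total = np.zeros(LENGTH)
    sent_blocks = 0
    for grad in worker_grads:
        for b, block in nonzero_blocks(grad):
            total[b * BLOCK:(b + 1) * BLOCK] += block
            sent_blocks += 1
    return total, sent_blocks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # two workers with mostly-zero gradients
    grads = []
    for _ in range(2):
        g = np.zeros(LENGTH)
        hot = rng.choice(LENGTH, size=3, replace=False)
        g[hot] = rng.standard_normal(3)
        grads.append(g)
    total, sent = aggregate(grads)
    assert np.allclose(total, sum(grads))
    dense_blocks = len(grads) * (LENGTH // BLOCK)
    print(f"sent {sent} of {dense_blocks} blocks")
```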
9

Improving cluster performance through the use of programmable network interfaces

Buntinas, Darius Tomas 14 October 2003 (has links)
No description available.
10

Topology-Aware MPI Communication and Scheduling for High Performance Computing Systems

Subramoni, Hari 02 October 2013 (has links)
No description available.
