61. Improving Message-Passing Performance and Scalability in High-Performance Clusters
Rashti, Mohammad Javad, 26 January 2011
High Performance Computing (HPC) is key to solving many scientific, financial, and engineering problems, and computer clusters are now the dominant architecture for HPC. Cluster scale, in both processors per node and number of nodes, is increasing rapidly: systems have reached petascale and are approaching exascale. Inter-process communication plays a significant role in the overall performance of HPC applications, so as interconnection technologies and node architectures continue to advance, the Message Passing Interface (MPI) must evolve to exploit these technologies effectively for higher performance.
After providing background, I present an in-depth analysis of user-level and MPI communication libraries over modern cluster interconnects: InfiniBand, iWARP Ethernet, and Myrinet. Using novel techniques, I assess characteristics such as overlap and communication progress ability, the effect of buffer reuse on latency, and multiple-connection scalability. The results highlight several inefficiencies in existing communication libraries.
To improve communication progress and overlap in large message transfers, I propose a method that uses speculative communication to overlap communication with computation in the MPI Rendezvous protocol. The results show up to 100% communication progress and more than 80% overlap over iWARP Ethernet. An adaptation mechanism avoids imposing overhead on applications whose timing characteristics prevent them from benefiting from the method.
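The abstract gives no code, but the overlap it measures follows a standard pattern: post a non-blocking large-message transfer, compute, then wait. The sketch below is a minimal illustration of that pattern (assumed for exposition, not the dissertation's implementation); whether the Rendezvous transfer actually progresses during compute() is precisely what the proposed speculative method improves.

/* Minimal overlap sketch: run with mpirun -np 2. N is chosen large enough
 * that most MPI libraries switch to the Rendezvous protocol. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 22)

static void compute(double *a, int n)      /* stand-in for application work */
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Request req;
    double *buf  = malloc(N * sizeof *buf);
    double *work = calloc(N, sizeof *work);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        compute(work, N);                   /* ideally overlapped with the send */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* without progress support, the
                                               transfer may only advance here */
    } else if (rank == 1) {
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        compute(work, N);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    free(buf);
    free(work);
    return 0;
}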
To reduce MPI communication latency, I propose a technique that exploits application buffer-reuse characteristics for small messages, eliminating the sender-side copy in both the two-sided and one-sided MPI small-message transfer protocols. The implementation over InfiniBand improves small-message latency by up to 20%, and it adaptively falls back to the standard method when an application does not benefit from the technique.
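The abstract does not detail how buffer reuse is detected. One plausible mechanism, sketched below purely as an assumption, is a small tracker that counts consecutive sends from the same address and enables the zero-copy path once a buffer looks stable; the threshold and fallback here mirror the adaptive behavior the abstract describes, not the actual protocol.

/* Hypothetical buffer-reuse heuristic (illustration only). */
#include <stdbool.h>

#define REUSE_THRESHOLD 3   /* assumed tuning parameter */

struct reuse_tracker {
    const void *last_addr;  /* address of the most recently sent buffer */
    int         hits;       /* consecutive sends observed from that address */
};

/* Returns true when the small-message path may transmit directly from user
 * memory instead of copying into a pre-registered bounce buffer first. */
static bool send_zero_copy(struct reuse_tracker *t, const void *buf)
{
    if (buf == t->last_addr) {
        if (t->hits < REUSE_THRESHOLD)
            t->hits++;
    } else {
        t->last_addr = buf;  /* new buffer: fall back to the copy path */
        t->hits = 1;
    }
    return t->hits >= REUSE_THRESHOLD;
}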
Finally, to improve the scalability of MPI applications on ultra-scale clusters, I propose an extension to the current iWARP standard that equips Ethernet with an efficient zero-copy, connection-less datagram transport, improving both performance and memory usage at large scale. A software-level evaluation shows more than 40% performance benefit and 30% lower memory usage for MPI applications on a 64-core cluster. / Thesis (Ph.D., Electrical & Computer Engineering) -- Queen's University, 2010.
62. Virtual application appliances on clusters
Unal, Erkan, date unknown
No description available.
63. Theory and Practice of Dynamic Voltage/Frequency Scaling in the High Performance Computing Environment
Rountree, Barry, January 2009
This dissertation provides a comprehensive overview of the theory and practice of Dynamic Voltage/Frequency Scaling (DVFS) in the High Performance Computing (HPC) environment. We summarize the overall problem as follows: how can the same level of computational performance be achieved using less electrical power? Equivalently, how can computational performance be increased using the same amount of electrical power? In this dissertation we present performance and architecture models of DVFS as well as the Adagio runtime system. The performance model recasts the question as an optimization problem that we solve using linear programming, thus establishing a bound on potential energy savings. The architectural model provides a low-level explanation of how memory bus and CPU clock frequencies interact to determine execution time. Using insights provided from these models, we have designed and implemented the Adagio runtime system. This system realizes near-optimal energy savings on real-world scientific applications without the use of training runs or source code modification, and under the constraint that only negligible delay will be tolerated by the user. This work has opened up several new avenues of research, and we conclude by enumerating these.
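The abstract states that the bound is computed with linear programming but does not reproduce the model. As a rough sketch of how such a bound can be phrased (the variables and constraints below are assumptions, not Rountree's formulation), let x_{i,f} be the fraction of program phase i executed at frequency f and solve:

\begin{align*}
\min_{x_{i,f}} \quad & \sum_i \sum_{f \in F} x_{i,f} \, P(f) \, t_i(f) \\
\text{s.t.} \quad & \sum_i \sum_{f \in F} x_{i,f} \, t_i(f) \le T_{\max}, \\
& \sum_{f \in F} x_{i,f} = 1 \quad \forall i, \qquad x_{i,f} \ge 0,
\end{align*}

where t_i(f) is the execution time of phase i at frequency f, P(f) is the corresponding power draw, and T_max encodes the constraint that only negligible delay will be tolerated by the user.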
64. Creating dynamic application behavior for distributed performance analysis
Lepler, Joerg, January 1998
No description available.
65. Enabling efficient high-performance communication in multicomputer interconnection networks
May, Philip, 05 1900
No description available.
66. Overlapping Computation and Communication through Offloading in MPI over InfiniBand
Inozemtsev, Grigori, 30 May 2014
As the demands of computational science and engineering simulations increase, the size and capabilities of High Performance Computing (HPC) clusters are also expected to grow. Consequently, the software providing the application programming abstractions for the clusters must adapt to meet these demands. Specifically, the increased cost of interprocessor synchronization and communication in larger systems must be accommodated. Non-blocking operations that allow communication latency to be hidden by overlapping it with computation have been proposed to mitigate this problem.
In this work, we investigate offloading a portion of the communication processing to dedicated hardware in order to support communication/computation overlap efficiently. We work with the Message Passing Interface (MPI), the de facto standard for parallel programming in HPC environments. We investigate both point-to-point non-blocking communication and collective operations; our work with collectives focuses on the allgather operation. We develop designs for both flat and hierarchical cluster topologies and examine both eager and rendezvous communication protocols.
We also develop a generalized primitive operation with the aim of simplifying further research into non-blocking collectives. We propose a new algorithm for the non-blocking allgather collective and implement it using this primitive. The algorithm has constant resource usage even when executing multiple operations simultaneously.
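At the user level, the pattern such offloaded collectives serve is the standard MPI-3 non-blocking interface; the sketch below shows only that interface (MPI_Iallgather is standard MPI-3, while the offloaded implementation and the generalized primitive described above sit beneath it).

/* Non-blocking allgather with computation overlapped before completion. */
#include <mpi.h>

void exchange_and_work(const double *mine, double *all, int n, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Iallgather(mine, n, MPI_DOUBLE,
                   all,  n, MPI_DOUBLE, comm, &req);

    /* ... computation that does not touch `all` runs here; with hardware
     * offload the gather progresses without host CPU involvement ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
}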
We implement these designs using CORE-Direct offloading support in Mellanox InfiniBand adapters. An evaluation with microbenchmarks and an application kernel shows that offloaded non-blocking operations achieve latency comparable to their blocking counterparts while allowing most of the communication time to be overlapped with computation, and that they remain resilient to process arrival and scheduling variations. / Thesis (Master's, Electrical & Computer Engineering) -- Queen's University, 2014.
67. Towards an MPI-like Framework for Azure Cloud Platform
Karamati, Sara, 12 August 2014
The Message Passing Interface (MPI) has been widely used for implementing parallel and distributed applications. The emergence of cloud computing offers a scalable, fault-tolerant, on-demand alternative to traditional on-premise clusters. In this thesis, we investigate adopting the cloud platform as an alternative to conventional MPI-based solutions and show that it can deliver competitive performance while benefiting users through its fault-tolerant architecture and on-demand access. We identify the difficulties of designing and implementing an MPI-like framework for the Azure cloud platform, present the key components required for such a framework, and report experimental results from benchmarking several basic operations of the MPI standard implemented in the cloud, along with a practical application to well-known large-scale algorithmic problems.
68. Scheduling in STAPL
Sharma, Shishir, 03 October 2013
Writing efficient parallel programs is a difficult and error-prone process. The Standard Template Adaptive Parallel Library (STAPL) is being developed to make this task easier for programmers with little experience in parallel programming. STAPL is a C++ library for writing parallel programs using a generic programming approach similar to writing sequential programs with the C++ Standard Template Library (STL). STAPL provides a collection of parallel containers (pContainers) that store data in a distributed fashion and a collection of pViews that abstract the details of the data distribution. STAPL algorithms are written in terms of PARAGRAPHs, which are high-level descriptions of task dependence graphs.
Scheduling plays an important role in the efficient execution of parallel programs. In this thesis, we present our work to enable efficient scheduling of parallel programs written using STAPL. We abstract the scheduling activities associated with PARAGRAPHs into a customizable, extensible software module called the scheduler. We provide support for static scheduling of PARAGRAPHs and develop mechanisms based on migration of tasks and data to support dynamic scheduling strategies for PARAGRAPHs with arbitrary dependencies. We also implement several scheduling strategies that can improve the performance of applications suffering from load imbalance.
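STAPL's scheduler interfaces are not reproduced in the abstract; the sketch below is a generic illustration (with assumed task costs, not STAPL's API) of the simplest static strategy of this kind: walk a task dependence graph in dependency order and assign each ready task to the least-loaded processor. Priorities and migration-based dynamic scheduling, as in the thesis, would extend a skeleton like this.

/* Generic static list scheduling over a small hard-coded task DAG. */
#include <stdio.h>

#define NTASKS 5
#define NPROCS 2

int main(void)
{
    /* deps[i][j] != 0 means task j must finish before task i starts */
    int deps[NTASKS][NTASKS] = {
        {0,0,0,0,0},
        {1,0,0,0,0},
        {1,0,0,0,0},
        {0,1,1,0,0},
        {0,0,0,1,0},
    };
    int cost[NTASKS] = {2, 3, 1, 4, 2};  /* assumed per-task work */
    int load[NPROCS] = {0};
    int done[NTASKS] = {0};
    int scheduled = 0;

    while (scheduled < NTASKS) {
        for (int t = 0; t < NTASKS; t++) {
            if (done[t])
                continue;
            int ready = 1;
            for (int d = 0; d < NTASKS; d++)
                if (deps[t][d] && !done[d])
                    ready = 0;
            if (!ready)
                continue;

            int p = 0;                    /* pick the least-loaded processor */
            for (int q = 1; q < NPROCS; q++)
                if (load[q] < load[p])
                    p = q;

            printf("task %d -> processor %d\n", t, p);
            load[p] += cost[t];
            done[t] = 1;
            scheduled++;
        }
    }
    return 0;
}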
The scheduling infrastructure developed in this thesis is highly customizable and can be used to execute a variety of parallel computations. We demonstrate its usefulness by improving the performance of two applications: the widely used UTS synthetic benchmark and a parallel motion planning application. The experiments were conducted on an Opteron cluster and a massively parallel Cray XE6 machine, with results presented for up to 6,144 processors.
69. Scalable mining on emerging architectures
Buehrer, Gregory T., January 2007
Thesis (Ph.D.) -- Ohio State University, 2007.
70. Communication performance measurement and analysis on commodity clusters
Abdul Hamid, Nor Asilah Wati
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, March 13, 2008. Bibliography: leaves 217-235. Also available in print form.