Parallel computing offers immense potential for solving large, complex scientific problems. Load imbalance is a major impediment to obtaining high performance on a parallel system. One principal form of parallelism found in scientific applications is data parallelism: loops without dependencies among iterations are data parallel. During the execution of large parallel loops, computational requirements vary due to problem, algorithmic, and systemic characteristics. These factors lead to load imbalance, which in turn degrades the performance of an application. Over the years, a number of dynamic loop scheduling techniques have been proposed to address one or more of these factors. However, no single strategy works well across different problem domains and system characteristics. Moreover, load balancing at runtime is complicated by the need for dynamic data redistribution. Therefore, there is a distinct need to integrate the dynamic loop scheduling techniques into a single package and provide them as an application programming interface (API) to the application developer. In recent years, along this direction, a number of dynamic loop scheduling techniques have been integrated into compiler technologies for shared memory environments. However, no such integrated approach exists for distributed memory applications. The purpose of this thesis is to present the design, implementation, and effectiveness of such an integrated approach: the dynamic loop scheduling techniques are integrated into a runtime system for distributed memory architectures. For this purpose, we choose the newly developed parallel runtime environment for multicomputer architecture (PREMA) with its main components: the data movement and control substrate (DMCS) and the mobile object layer (MOL). This runtime system has been demonstrated to be one of the most competitive runtime systems for distributed memory architectures. The significance of this work is that the proposed API will enhance the performance of parallel applications by reducing the load imbalance among processors caused by a wide range of factors, and will reduce the software development cost of load balancing. With the integration of the scheduling capabilities into the runtime system, its applicability has been expanded. The performance of the API has been evaluated qualitatively and quantitatively. The overhead of the API has been studied analytically and measured experimentally. Three parallel benchmarks representing scientific applications of general interest (N-body simulation, an automatic quadrature routine, and an unstructured grid heat solver) were used for experimentation. Based on the experiments conducted, a cost improvement of up to 76% over the straightforward parallel benchmark was obtained. For certain application characteristics, the overhead of the runtime system was found to be within 10% of that of the underlying messaging layer. These results demonstrate that, in large scientific applications, it is possible and desirable to combine the rich functionality of a runtime system with the advantages of scheduling techniques to achieve high performance.
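To make the notion of dynamic loop scheduling on distributed memory concrete, the sketch below shows a minimal master-worker chunk scheduler over MPI for a data-parallel loop. It is an illustrative assumption only, not the PREMA/DMCS/MOL API described in the thesis: the function names, the fixed chunk size, and the loop body are invented for the example, and real techniques such as factoring adapt the chunk size at runtime instead of fixing it.

/* Illustrative sketch only: a minimal master-worker dynamic loop
 * scheduler over MPI for a loop whose iterations are independent.
 * This is NOT the thesis's PREMA/DMCS/MOL API; names and the fixed
 * chunk size are assumptions chosen for illustration. */
#include <mpi.h>

#define N_ITERS  100000   /* total loop iterations (assumed)            */
#define CHUNK    256      /* fixed chunk size; adaptive schemes vary it  */
#define TAG_WORK 1
#define TAG_DONE 2

/* user-supplied body of the independent (data-parallel) loop */
extern void loop_body(long i);

void dynamic_loop(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {                      /* master: hands out iteration chunks */
        long next = 0;
        int active = size - 1;
        while (active > 0) {
            long req[2];
            MPI_Status st;
            MPI_Recv(req, 2, MPI_LONG, MPI_ANY_SOURCE, TAG_WORK, comm, &st);
            if (next < N_ITERS) {         /* work remains: send the next range  */
                long range[2] = { next,
                                  next + CHUNK < N_ITERS ? next + CHUNK : N_ITERS };
                next = range[1];
                MPI_Send(range, 2, MPI_LONG, st.MPI_SOURCE, TAG_WORK, comm);
            } else {                      /* no work left: tell worker to stop  */
                long stop[2] = { -1, -1 };
                MPI_Send(stop, 2, MPI_LONG, st.MPI_SOURCE, TAG_DONE, comm);
                active--;
            }
        }
    } else {                              /* worker: request, execute, repeat   */
        for (;;) {
            long req[2] = { 0, 0 }, range[2];
            MPI_Status st;
            MPI_Send(req, 2, MPI_LONG, 0, TAG_WORK, comm);
            MPI_Recv(range, 2, MPI_LONG, 0, MPI_ANY_TAG, comm, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            for (long i = range[0]; i < range[1]; i++)
                loop_body(i);             /* iterations are independent         */
        }
    }
}

Because fast workers simply request chunks more often than slow ones, the load rebalances itself at runtime; an integrated API of the kind the thesis proposes hides this request-and-dispatch machinery behind a scheduling call so the application developer does not have to write it.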
Identifier: oai:union.ndltd.org:MSSTATE/oai:scholarsjunction.msstate.edu:td-4494
Date: 10 May 2003
Creators: Balasubramaniam, Mahadevan
Publisher: Scholars Junction
Source Sets: Mississippi State University
Detected Language: English
Type: text
Format: application/pdf
Source: Theses and Dissertations