Automatic Global Data Partitioning for Distributed Memory Machines (DMMs)
is a difficult problem. Distributed memory machines are scalable,
but since the memory is distributed across processors, the scheme
of placement of
data (arrays) onto local memories of different processors become
crucial since any communication between processors for non-local
data access is an order of magnitude costlier than access to local
memory. Researchers have given varied solutions
to this problem, most of which work for uniform dependences in loops
and they suggest HPF-like distributions only. For non-uniform
dependences the loop was made to run sequentially.
In this work, we present a partitioning strategy
called Hyperplane Partitioning which works well with
loops with non-uniform dependences also. In this method of partitioning,
the iteration space
is partitioned into as many number of partitions as there are
number of logical processors, in such a way that the overall
inter-processor communication will be minimum. The idea is to
localize as many as dependences as possible so that overall
communication both beacuse of non-local data as well as
inter-processor synchronizations are reduced.
These partitions are
then induced into data spaces of the arrays referenced in the loop.
Each processor then runs its part of iteration space keeping the data
partition that it owns locally. Any non-local data access is
implemented by inter-processor communication at run-time.The Hyperplane Partitioning is also extended to
a sequence of loops. This is done by first finding
Best Local Distribution (BLD) for every loop first and
then finding the best way of grouping different adjacent loops
(just for finding the data partition)
which gives best global data partition. This sequence of
distributions/redistributions is found by constructing a
data structure called Data Distribution Tree (DDT) and finding
the least cost path from the source to any of the leaf nodes
in the DDT. The costs for the edges come from the communication
cost incurred while running a loop with a particular distribution
and redistribution to suit the requirement at the next loop.
For this a communication cost estimator is developed which
works well for fewer dimensions. To handle complete programs
we use some heuristic to find the best global distribution
for the entire program.Some optimizations like message optimization to reduce the number
of messages sent across processors, time optimization
which is done by uniform scheduling across processors, and
space optimization to keep only the part of array space
that any processor owns onto its local memory, are studied.
Hyperplane Partitioning is also implemented using an algorithm for
synchronization to handle non-local memory access as well
as obeying data dependence constraints. The algorithm is also
proved to be correct. The target machine is IBM-SP2 using
PVM for the message passing library. The performance of the tool
on some standard benchmarks (ADI and RHS) and also on some
programs designed by us to show the specific merits of the tool.
The results show that the loops which have non-uniform dependences
also can be run on DMM with good speed-ups.
Identifer | oai:union.ndltd.org:IISc/oai:etd.ncsi.iisc.ernet.in:2005/175 |
Date | 07 1900 |
Creators | Prakash, S R |
Contributors | Srikant, Y N |
Publisher | Indian Institute of Science |
Source Sets | India Institute of Science |
Language | English |
Detected Language | English |
Type | Electronic Thesis and Dissertation |
Format | 937496 bytes, application/postscript |
Rights | I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. |
Page generated in 0.0019 seconds