<p>As
parallel and distributed systems scale to hundreds of thousands of
cores and beyond, fault tolerance becomes increasingly important --
particularly on systems with limited I/O capacity and bandwidth.
Error correcting codes (ECCs) are used in communication systems where
errors arise when bits are corrupted silently in a message. Error
correcting codes can detect and correct erroneous bits. Erasure
codes, an instance of error correcting codes that deal with data
erasures, are widely used in storage systems. An erasure code
addsredundancy to the data to tolerate erasures.
</p>
<p><br>
</p>
<p>In
this thesis, erasure coded computations are proposed as a novel
approach to dealing with processor faults in parallel and distributed
systems. We first give a brief review of traditional fault tolerance
methods, error correcting codes, and erasure coded storage. The
benefits and challenges of erasure coded computations with respect to
coding scheme, fault models and system support are also presented.</p>
<p><br>
</p>
<p>In
the first part of my thesis, I demonstrate the novel concept of
erasure coded computations for linear system solvers. Erasure coding
augments a given problem instance with redundant data. This augmented
problem is executed in a fault oblivious manner in a faulty parallel
environment. In the event of faults, we show how we can compute the
true solution from potentially fault-prone solutions using a
computationally inexpensive procedure. The results on diverse linear
systems show that our technique has several important advantages: (i)
as the hardware platform scales in size and in number of faults, our
scheme yields increasing improvement in resource utilization,
compared to traditional schemes; (ii) the proposed scheme is easy to
code as the core algorithm remains the same; (iii) the general scheme
is flexible to accommodate a range of computation and communication
trade-offs.
</p>
<p><br>
</p>
<p>We
propose a new coding scheme for augmenting the input matrix that
satisfies the recovery equations of erasure coding with high
probability in the event of random failures. This coding scheme also
minimizes fill (non-zero elements introduced by the coding block),
while being amenable to efficient partitioning across processing
nodes. Our experimental results show that the scheme adds minimal
overhead for fault tolerance, yields excellent parallel efficiency
and scalability, and is robust to different fault arrival models and
fault rates.</p>
<p><br>
</p>
<p>Building
on these results, we show how we can minimize, to optimality, the
overhead associated with our problem augmentation techniques for
linear system solvers. Specifically, we present a technique that
adaptively augments the problem only when faults are detected. At any
point during execution, we only solve a system with the same size as
the original input system. This has several advantages in terms of
maintaining the size and conditioning of the system, as well as in
only adding the minimal amount of computation needed to tolerate the
observed faults. We present, in details, the augmentation process,
the parallel formulation, and the performance of our method.
Specifically, we show that the proposed adaptive fault tolerance
mechanism has minimal overhead in terms of FLOP counts with respect
to the original solver executing in a non-faulty environment, has
good convergence properties, and yields excellent parallel
performance.</p>
<p><br>
</p>
<p>Based
on the promising results for linear system solvers, we apply the
concept of erasure coded computation to eigenvalue problems, which
arise in many applications including machine learning and scientific
simulations. Erasure coded computation is used to design a fault
tolerant eigenvalue solver. The original eigenvalue problem is
reformulated into a generalized eigenvalue problem defined on
appropriate augmented matrices. We present the augmentation scheme,
the necessary conditions for augmentation blocks, and the proofs of
equivalence of the original eigenvalue problem and the reformulated
generalized eigenvalue problem. Finally, we show how the eigenvalues
can be derived from the augmented system in the event of faults.
</p>
<p><br>
</p>
<p>We
present detailed experiments, which demonstrate the excellent
convergence properties of our fault tolerant TraceMin eigensolver in
the average case. In the worst case where the row-column pairs that
have the most impact on eigenvalues are erased, we present a novel
scheme that computes the augmentation blocks as the computation
proceeds, using the estimates of leverage scores of row-column pairs
as they are computed by the iterative process. We demonstrate low
overhead, excellent scalability in terms of the number of faults, and
the robustness to different fault arrival models and fault rates for
our method.</p>
<p><br>
</p>
<p>In
summary, this thesis presents a novel approach to fault tolerance
based on erasure coded computations, demonstrates it in the
context of important linear algebra kernels, and validates its
performance on a diverse set of problems on scalable parallel
computing platforms. As parallel systems scale to hundreds of
thousands of processing cores and beyond, these techniques present
the most scalable fault tolerant mechanisms currently available.</p><br>
Identifer | oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/7429934 |
Date | 16 January 2019 |
Creators | Xuejiao Kang (5929862) |
Source Sets | Purdue University |
Detected Language | English |
Type | Text, Thesis |
Rights | CC BY 4.0 |
Relation | https://figshare.com/articles/Fault_Tolerance_in_Linear_Algebraic_Methods_using_Erasure_Coded_Computations/7429934 |
Page generated in 0.0025 seconds