
Hierarchical Matrix Operations on GPUs

Large dense matrices are ubiquitous in scientific computing, arising from the discretization of integral operators associated with elliptic PDEs, Schur complement methods, covariances in spatial statistics, kernel-based machine learning, and numerical optimization problems. Hierarchical matrices are an efficient way to store the very large dense matrices that appear in these and related settings. They exploit the fact that the underlying matrices, while formally dense, are data sparse: their structure consists of blocks, many of which can be well approximated by low-rank factorizations. A hierarchical organization of the blocks avoids the superlinear growth in memory that storing an n × n matrix densely would require, needing only O(n) units of storage with a constant that depends on a representative rank k of the low-rank blocks. This asymptotically optimal storage requirement is a critical advantage, particularly in extreme-scale computing environments characterized by low memory per processing core. The challenge then becomes to develop parallel linear algebra operations that can be performed directly on this compressed representation.
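As a minimal sketch of the storage saving (illustrative only, not the dissertation's actual data structure), an m × n block approximated by rank-k factors A ≈ U V^T costs k(m + n) entries instead of mn, and can be applied to a vector in O(k(m + n)) operations:

    #include <vector>

    // Illustrative low-rank block: A ~= U * V^T with U (m x k) and V (n x k),
    // both column-major, so storage is k*(m+n) doubles instead of m*n.
    struct LowRankBlock {
        int m, n, k;
        std::vector<double> U, V;
    };

    // y += (U * V^T) * x, evaluated as U * (V^T * x):
    // O(k*(m+n)) work instead of O(m*n) for a dense block.
    void apply(const LowRankBlock& A, const double* x, double* y) {
        std::vector<double> t(A.k, 0.0);
        for (int j = 0; j < A.k; ++j)            // t = V^T * x
            for (int i = 0; i < A.n; ++i)
                t[j] += A.V[i + j * A.n] * x[i];
        for (int j = 0; j < A.k; ++j)            // y += U * t
            for (int i = 0; i < A.m; ++i)
                y[i] += A.U[i + j * A.m] * t[j];
    }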
In this dissertation, I implement a set of hierarchical basic linear algebra subroutines (HBLAS) optimized for GPUs, including hierarchical matrix-vector multiplication, orthogonalization, compression, low-rank updates, and matrix-matrix multiplication. I develop a library of open-source batched kernel operations, previously missing on GPUs, for the high-performance implementation of the H^2 operations, while relying wherever possible on existing open-source and vendor kernels to benefit from future improvements in the technology.
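The pattern underlying such kernels is to group many small, independent block operations into a single batched call. A hedged sketch using an existing vendor kernel, cuBLAS's cublasDgemmBatched, to show the idea (the dissertation's own batched kernels cover operations that vendor libraries did not provide):

    #include <cublas_v2.h>

    // One batched call multiplies a whole batch of small blocks,
    // C[i] = A[i] * B[i], amortizing kernel-launch overhead over the batch.
    // The pointer arrays dA, dB, dC reside in device memory; all blocks are
    // taken to be k x k here for simplicity.
    void batched_gemm(cublasHandle_t handle,
                      const double* const* dA, const double* const* dB,
                      double* const* dC, int k, int batch) {
        const double one = 1.0, zero = 0.0;
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           k, k, k,
                           &one, dA, k, dB, k,
                           &zero, dC, k, batch);
    }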

Fast marshaling routines extract the batch operation data from an efficient representation of the trees that compose the hierarchical matrices. The methods developed for GPUs extend to CPUs using the same code base, with simple abstractions around the batched routine execution.
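A hypothetical sketch of the marshaling idea (the names and layout are illustrative, not the library's API): if the blocks of one tree level are stored contiguously, a trivially parallel kernel can assemble the device pointer arrays that a batched kernel such as the one above consumes:

    // Build the pointer array for one tree level: node i's block starts at
    // offset node_offset[i] within the level's contiguous storage, so the
    // batched kernels never traverse the tree themselves.
    __global__ void marshal_level(double* level_data, const int* node_offset,
                                  double** block_ptrs, int num_nodes) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_nodes)
            block_ptrs[i] = level_data + node_offset[i];
    }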
To demonstrate the scalability of the hierarchical operations, I implement a distributed-memory, multi-GPU hierarchical matrix-vector product that focuses on reducing communication volume and on hiding communication overhead and periods of low GPU utilization using low-priority streams.
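One way to realize such overlap, sketched here as an assumption rather than taken from the dissertation's code, is CUDA stream priorities: bulk local computation runs on a low-priority stream so that it fills otherwise idle GPU time without delaying kernels on the communication-critical path:

    #include <cuda_runtime.h>

    int main() {
        // In CUDA, numerically lower priority values mean higher priority.
        int least, greatest;
        cudaDeviceGetStreamPriorityRange(&least, &greatest);

        cudaStream_t low_prio, high_prio;
        cudaStreamCreateWithPriority(&low_prio,  cudaStreamNonBlocking, least);
        cudaStreamCreateWithPriority(&high_prio, cudaStreamNonBlocking, greatest);

        // Bulk local computation goes on low_prio; kernels feeding
        // communication (e.g., packing send buffers) go on high_prio and are
        // scheduled preferentially as GPU resources free up.

        cudaStreamDestroy(low_prio);
        cudaStreamDestroy(high_prio);
        return 0;
    }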
Two demonstrations, involving Hessians of inverse problems governed by PDEs and space-fractional diffusion equations, show the effectiveness of the hierarchical operations in realistic applications.

Identifier: oai:union.ndltd.org:kaust.edu.sa/oai:repository.kaust.edu.sa:10754/662664
Date: 26 April 2020
Creators: Boukaram, Wagih Halim
Contributors: Keyes, David E., Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Ketcheson, David I., Hadwiger, Markus, Turkiyyah, George, Darve, Eric F.
Source Sets: King Abdullah University of Science and Technology
Language: English
Type: Dissertation
