Global ETD Search

1	Formalized parallel dense linear algebra and its application to the generalized eigenvalue problem Poulson, Jack Lesly 03 September 2009 (has links) This thesis demonstrates an efficient parallel method of solving the generalized eigenvalue problem, KΦ = M ΦΛ, where K is symmetric and M is symmetric positive-definite, by first converting it to a standard eigenvalue problem, solving the standard eigenvalue problem, and back-transforming the results. An abstraction for parallel dense linear algebra is introduced along with a new algorithm for forming A := U⁻ᵀ K U⁻¹ , where U is the Cholesky factor of M , that is up to twice as fast as the ScaLAPACK implementation. Additionally, large improvements over the PBLAS implementations of general matrix-matrix multiplication and triangular solves with many right-hand sides are shown. Significant performance gains are also demonstrated for Cholesky factorizations, and a case is made for using 2D-cyclic distributions with a distribution blocksize of one. / text generalized eigenvalue problem parallel dense linear algebra
2	High Productivity Programming of Dense Linear Algebra on Heterogeneous NUMA Architectures Alomairy, Rabab M. 07 1900 (has links) High-end multicore systems with GPU-based accelerators are now ubiquitous in the hardware landscape. Besides dealing with the nontrivial heterogeneous environ- ment, end users should often take into consideration the underlying memory architec- ture to decrease the overhead of data motion, especially when running on non-uniform memory access (NUMA) platforms. We propose the OmpSs parallel programming model approach using its Nanos++ dynamic runtime system to solve the two challeng- ing problems aforementioned, through 1) an innovative NUMA node-aware scheduling policy to reduce data movement between NUMA nodes and 2) a nested parallelism feature to concurrently exploit the resources available from the GPU devices as well as the CPU host, without compromising the overall performance. Our approach fea- tures separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. The Cholesky factorization is used as a benchmark representative of dense numerical linear algebra algorithms. Superior performance is also demonstrated on the symmetric matrix inversion based on Cholesky factorization, commonly used in co-variance computations in statistics. Performance on a NUMA system with Kepler-based GPUs exceeds that of existing implementations, while the OmpSs-enabled code remains very similar to its original sequential version. HIgh Productivity Dense Linear Algebra NUMA
3	Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach Nath, Rajib Kumar 01 August 2010 (has links) Dense linear algebra(DLA) is one of the most seven important kernels in high performance computing. The introduction of new machines from vendors provides us opportunities to optimize DLA libraries for the new machines and thus exploit their power. Unfortunately the optimization phase is not straightforward. The optimum code of a certain Basic Linear Algebra Subprogram (BLAS) kernel, which is the core of DLA algorithms, in two different machines with different semiconductor process can be different even if they share the same features in terms of instruction set architecture, memory hierarchy and clock speed. It has become a tradition to optimize BLAS for new machines. Vendors maintain highly optimized BLAS libraries targeting their CPUs. Unfortunately the existing BLAS for GPUs is not highly optimized for DLA algorithms. In my research, I have provided new algorithms for several important BLAS kernels for different generation of GPUs and introduced a pointer redirecting approach to make BLAS run faster in generic problem size. I have also presented an auto-tuning approach to parameterize the developed BLAS algorithms and select the best set of parameters for a given card. The hardware trends have also brought up the need for updates on existing legacy DLA software packages, such as the sequential LAPACK. To take advantage of the new computational environment, successors of LAPACK must incorporate algorithms of three main characteristics: high parallelism, reduced communication, and heterogeneity-awareness. On multicore architectures, Parallel Linear Algebra Software for Multicore Architectures (PLASMA) has been developed to meet the challenges in multicore. On the other extreme, Matrix Algebra on GPU and Multicore Architectures (MAGMA) library demonstrated a hybridization approach that indeed streamlined the development of high performance DLA for multicores with GPU accelerators. The performance of these two libraries depend upon right choice of parameters for a given problem size and given number of cores and/or GPUs. In this work, the issue of automatically tuning these two libraries is presented. A prune based empirical auto-tuning method has been proposed for tuning PLASMA. Part of the tuning method for PLASMA was considered to tune hybrid MAGMA library. Optimization GPUs Multicore Dense Linear Algebra BLAS Hybrid Architecture
4	Accelerating Dense Linear Algebra for GPUs, Multicores and Hybrid Architectures: an Autotuned and Algorithmic Approach Nath, Rajib Kumar 01 August 2010 (has links) Dense linear algebra(DLA) is one of the most seven important kernels inhigh performance computing. The introduction of new machines from vendorsprovides us opportunities to optimize DLA libraries for the new machinesand thus exploit their power. Unfortunately the optimization phase is notstraightforward. The optimum code of a certain Basic Linear AlgebraSubprogram (BLAS) kernel, which is the core of DLA algorithms, in twodifferent machines with different semiconductor process can be differenteven if they share the same features in terms of instruction setarchitecture, memory hierarchy and clock speed. It has become a traditionto optimize BLAS for new machines. Vendors maintain highly optimized BLASlibraries targeting their CPUs. Unfortunately the existing BLAS for GPUsis not highly optimized for DLA algorithms. In my research, I haveprovided new algorithms for several important BLAS kernels for differentgeneration of GPUs and introduced a pointer redirecting approach to makeBLAS run faster in generic problem size. I have also presented anauto-tuning approach to parameterize the developed BLAS algorithms andselect the best set of parameters for a given card.The hardware trends have also brought up the need for updates on existinglegacy DLA software packages, such as the sequential LAPACK. To takeadvantage of the new computational environment, successors of LAPACK mustincorporate algorithms of three main characteristics: high parallelism,reduced communication, and heterogeneity-awareness. On multicorearchitectures, Parallel Linear Algebra Software for MulticoreArchitectures (PLASMA) has been developed to meet the challenges inmulticore. On the other extreme, Matrix Algebra on GPU and MulticoreArchitectures (MAGMA) library demonstrated a hybridization approach thatindeed streamlined the development of high performance DLA for multicoreswith GPU accelerators. The performance of these two libraries depend uponright choice of parameters for a given problem size and given number ofcores and/or GPUs. In this work, the issue of automatically tuning thesetwo libraries is presented. A prune based empirical auto-tuning method hasbeen proposed for tuning PLASMA. Part of the tuning method for PLASMA wasconsidered to tune hybrid MAGMA library. Optimization GPUs Multicore Dense Linear Algebra BLAS Hybrid Architecture
5	A calculus of loop invariants for dense linear algebra optimization Low, Tze Meng 29 January 2014 (has links) Loop invariants have traditionally been used in proofs of correctness (e.g. program verification) and program derivation. Given that a loop invariant is all that is required to derive a provably correct program, the loop invariant can be thought of as being the essence of a loop. Being the essence of a loop, we ask the question “What other information is embedded within a loop invariant?” This dissertation provides evidence that in the domain of dense linear algebra, loop invariants can be used to determine the behavior of the loops. This dissertation demonstrates that by understanding how the loop invariant describes the behavior of the loop, a goal-oriented approach can be used to derive loops that are not only provably correct, but also have the desired performance behavior. / text Loop Invariants Loop optimization Dense linear algebra Goal-oriented programming
6	Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems Charara, Ali 24 May 2018 (has links) Covariance matrices are ubiquitous in computational sciences, typically describing the correlation of elements of large multivariate spatial data sets. For example, covari- ance matrices are employed in climate/weather modeling for the maximum likelihood estimation to improve prediction, as well as in computational ground-based astronomy to enhance the observed image quality by filtering out noise produced by the adap- tive optics instruments and atmospheric turbulence. The structure of these covariance matrices is dense, symmetric, positive-definite, and often data-sparse, therefore, hier- archically of low-rank. This thesis investigates the performance limit of dense matrix computations (e.g., Cholesky factorization) on covariance matrix problems as the number of unknowns grows, and in the context of the aforementioned applications. We employ recursive formulations of some of the basic linear algebra subroutines (BLAS) to accelerate the covariance matrix computation further, while reducing data traffic across the memory subsystems layers. However, dealing with large data sets (i.e., covariance matrices of billions in size) can rapidly become prohibitive in memory footprint and algorithmic complexity. Most importantly, this thesis investigates the tile low-rank data format (TLR), a new compressed data structure and layout, which is valuable in exploiting data sparsity by approximating the operator. The TLR com- pressed data structure allows approximating the original problem up to user-defined numerical accuracy. This comes at the expense of dealing with tasks with much lower arithmetic intensities than traditional dense computations. In fact, this thesis con- solidates the two trends of dense and data-sparse linear algebra for HPC. Not only does the thesis leverage recursive formulations for dense Cholesky-based matrix al- gorithms, but it also implements a novel TLR-Cholesky factorization using batched linear algebra operations to increase hardware occupancy and reduce the overhead of the API. Performance reported of the dense and TLR-Cholesky shows many-fold speedups against state-of-the-art implementations on various systems equipped with GPUs. Additionally, the TLR implementation gives the user flexibility to select the desired accuracy. This trade-off between performance and accuracy is, currently, a well-established leading trend in the convergence of the third and fourth paradigm, i.e., HPC and Big Data, when moving forward with exascale software roadmap. data sparse Hierarchical covariance matrix GPU tile low-rank Dense Linear Algebra
7	A Systematic Approach for Obtaining Performance on Matrix-Like Operations Veras, Richard Michael 01 August 2017 (has links) Scientific Computation provides a critical role in the scientific process because it allows us ask complex queries and test predictions that would otherwise be unfeasible to perform experimentally. Because of its power, Scientific Computing has helped drive advances in many fields ranging from Engineering and Physics to Biology and Sociology to Economics and Drug Development and even to Machine Learning and Artificial Intelligence. Common among these domains is the desire for timely computational results, thus a considerable amount of human expert effort is spent towards obtaining performance for these scientific codes. However, this is no easy task because each of these domains present their own unique set of challenges to software developers, such as domain specific operations, structurally complex data and ever-growing datasets. Compounding these problems are the myriads of constantly changing, complex and unique hardware platforms that an expert must target. Unfortunately, an expert is typically forced to reproduce their effort across multiple problem domains and hardware platforms. In this thesis, we demonstrate the automatic generation of expert level high-performance scientific codes for Dense Linear Algebra (DLA), Structured Mesh (Stencil), Sparse Linear Algebra and Graph Analytic. In particular, this thesis seeks to address the issue of obtaining performance on many complex platforms for a certain class of matrix-like operations that span across many scientific, engineering and social fields. We do this by automating a method used for obtaining high performance in DLA and extending it to structured, sparse and scale-free domains. We argue that it is through the use of the underlying structure found in the data from these domains that enables this process. Thus, obtaining performance for most operations does not occur in isolation of the data being operated on, but instead depends significantly on the structure of the data. Code Generation Computational Science Dense Linear Algebra Graph Analytics High Performance Computing Sparse Linear Algebra
8	Design by transformation : from domain knowledge to optimized program generation Marker, Bryan Andrew 20 June 2014 (has links) Expert design knowledge is essential to develop a library of high-performance software. This includes how to implement and parallelize domain operations, how to optimize implementations, and estimates of which implementation choices are best. An expert repeatedly applies his knowledge, often in a rote and tedious way, to develop all of the related functionality expected from a domain-specific library. Expert knowledge is hard to gain and is easily lost over time when an expert forgets or when a new engineer starts developing code. The domain of dense linear algebra (DLA) is a prime example with software that is so well designed that much of experts' important work has become tediously rote in many ways. In this dissertation, we demonstrate how one can encode design knowledge for DLA so it can be automatically applied to generate code as an expert would or to generate better code. Further, the knowledge is encoded for perpetuity, so it can be reused to make implementing functionality on new hardware easier or it can be used to teach how software is designed to a non-expert. We call this approach to software engineering (encoding expert knowledge and automatically applying it) Design by Transformation (DxT). We present our vision, the methodology, a prototype code generation system, and possibilities when applying DxT to the domain of dense linear algebra. / text Automatic program generation Software engineering Code transformation High-performance computing Distributed-memory computing Shared-memory computing Dense linear algebra Design by transformation
9	Scheduling of Dense Linear Algebra Kernels on Heterogeneous Resources / Ordonnancement de noyaux d'algèbre linéaire dense sur ressources hétérogènes Kumar, Suraj 12 April 2017 (has links) Du fait des énormes capacités de calculs des accélérateurs tels que les GPUs et les Xeon Phi, l’utilisation de machines multicoques pourvues d’accélérateurs est devenue commune dans le domaine du calcul haute performance (HPC). La complexité induite par ces accélérateurs a suscité le développement de systèmes d’exécution à base de tâches, dans lesquels les dépendances entre les applications sont exprimées sous la forme de graphe de tâches et où les tâches sont ordonnancées dynamiquement sur les ressources de calcul. La difficulté est alors de concevoir des stratégies d’ordonnancement qui font une utilisation efficace des ressources de calculs et le développement de telles stratégies, même pour un unique noeud hybride, est un enjeu essentiel de la performance des systèmes HPC. Nous considérons dans cette thèse l’ordonnancement de noyaux d’algèbre linéaire dense sur des noeuds complètement hétérogènes et constitués de CPUs et de GPUs. Les performances relatives des accélérateurs par rapport aux coeurs classique dépend très fortement du noyau considéré. Par exemple, les accélérateurs sont beaucoup plus efficaces pour les produits de matrices, par exemple, que pour les factorisations. Dans cette thèse, nous analysons les performances de stratégies statiques et dynamiques d’ordonnancement et nous proposons un ensemble de stratégies intermédiaires, en ajoutant des composantes statiques (respectivement dynamiques) à des stratégies d’ordonnancements dynamique (respectivement statiques). Récemment, une stratégie appelée HeteroPrio a été proposée, qui s’appuie sur les affinités entre les tâches et les ressources pour un petit ensemble de tâches différentes s’exécutant sur deux types de ressources. Nous avons étendu cette stratégie d’ordonnancement pour des graphes de tâches généraux pour deux types de ressources puis pour plus de deux types. De manière complémentaire, nous avons également démontré des facteurs d’approximation et des pires cas pour HeteroPrio dans le cas d’un ensemble de tâches indépendantes sur différents types de plates-formes. / Due to massive computation power of accelerators such as GPU, Xeon phi, multicore machines equipped with accelerators are becoming popular in High Performance Computing (HPC). The added complexity led to the development of different task-based runtime systems, which allow computations to be expressed as graphs of tasks and rely on runtime systems to schedule those tasks among all resources of the platform. The real challenge is to design efficient schedulers for such runtimes to make effective utilization of all resources. Developing good schedulers, even for a single hybrid node, and analyzing them can thus have a strong impact on the performance of current HPC systems. We consider the problem of scheduling dense linear algebra applications on fully hybrid platforms made of CPUs and GPUs. The relative performance of CPU and GPU highly depends on the sub-routine. For instance, GPUs are much more efficient to process matrix-matrix multiplications than matrix factorizations. In this thesis, we analyze the performance of static and dynamic scheduling strategies and we propose a set of intermediate strategies, by adding static (resp. dynamic) features into dynamic (resp. static) strategies. A resource centric dynamic scheduler, HeteroPrio, which is based on affinity between tasks and resources, has been proposed recently for a set of small independent tasks on two types of resources. We extend and analyze this scheduler for general task graphs first on two types of resources and then on more than two types of resources. Additionally, we provide approximation ratios and worst case examples of HeteroPrio for a set of independent tasks on different platform sizes. Algèbre linéaire dense Ordonnancement dynamique Plates-formes hétérogènes Systèmes d’ordonnancement dynamiques Dense Linear Algebra Dynamic Schedulers Task-based Scheduling Heterogeneous Platforms Runtime Systems
10	Algorithm-Architecture Co-Design for Dense Linear Algebra Computations Merchant, Farhad January 2015 (has links) (PDF) Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing ap- plication, which is used, for example, in LU and QR factorizations. Unfortunately, mod- ern off-the-shelf microprocessors fall significantly short of achieving theoretical lower bound in CPI for high performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelisms and propose suitable algorithmic and architectural variation to significantly improve the computation efficiency. There are two standard approaches for improving the computation effficiency, first, to perform application-specific architecture customization and second, to do algorithmic tuning. In the same manner, we first perform a graph-based analysis of selected DLA kernels. From the various forms of parallelism, thus identified, we design a custom processing element for improving the CPI. The processing elements are used as building blocks for a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By per- forming detailed experiments on a synthesized CGRA implementation, we demonstrate that our proposed algorithmic and architectural variations are able to achieve lower CPI compared to off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations to report higher energy-performance-area product. DLA computations are encountered in many engineering and scientific computing ap- plications ranging from Computational Fluid Dynamics (CFD) to Eigenvalue problem. Traditionally, these applications are written in highly tuned High Performance Comput- ing (HPC) software packages like Linear Algebra Package (LAPACK), and/or Scalable Linear Algebra Package (ScaLAPACK). The basic building block for these packages is Ba- sic Linear Algebra Subprograms (BLAS). Algorithms pertaining LAPACK/ScaLAPACK are written in-terms of BLAS to achieve high throughput. Despite extensive intellectual efforts in development and tuning of these packages, there still exists a scope for fur- ther tuning in this packages. In this thesis, we revisit most prominent and widely used compute bound algorithms like GMM for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and exhibit higher ILP in these algorithms. We first accelerate sequential performance of the algorithms in BLAS and LAPACK and then focus on the parallel realization of these algorithms. Major contributions in the algorithmic tuning in this thesis are as follows: Algorithms: We present graph based analysis of General Matrix Multiplication (GMM) and discuss different types of parallelisms available in GMM We present analysis of Givens Rotation based QR factorization where we improve GR and derive Column-wise GR (CGR) that can annihilate multiple elements of a column of a matrix simultaneously. We show that the multiplications in CGR are lower than GR We generalize CGR further and derive Generalized GR (GGR) that can annihilate multiple elements of the columns of a matrix simultaneously. We show that the parallelism exhibited by GGR is much higher than GR and Householder Transform (HT) We extend generalizations to Square root Free GR (also knows as Fast Givens Rotation) and Square root and Division Free GR (SDFG) and derive Column-wise Fast Givens, and Column-wise SDFG . We also extend generalization for complex matrices and derive Complex Column-wise Givens Rotation Coarse-grained Recon gurable Architectures (CGRAs) have gained popularity in the last decade due to their power and area efficiency. Furthermore, CGRAs like REDEFINE also exhibit support for domain customizations. REDEFINE is an array of Tiles where each Tile consists of a Compute Element and a Router. The Routers are responsible for on-chip communication, while Compute Elements in the REDEFINE can be domain customized to accelerate the applications pertaining to the domain of interest. In this thesis, we consider REDEFINE base architecture as a starting point and we design Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently. We perform several architectural enhancements in the PE to approach lower bound of the CPI. For parallel realization of BLAS and LAPACK, we attach this PE to the Router of REDEFINE. We achieve better area and power performance compared to the yesteryear customized architecture for DLA. Major contributions in architecture in this thesis are as follows: Architecture: We present design of a PE for acceleration of GMM which is a Level-3 BLAS operation We methodically enhance the PE with different features for improvement in the performance of GMM For efficient realization of Linear Algebra Package (LAPACK), we use PE that can efficiently execute GMM and show better performance For further acceleration of LU and QR factorizations in LAPACK, we identify macro operations encountered in LU and QR factorizations, and realize them on a reconfigurable data-path resulting in 25-30% lower run-time Dense Linear Algebra (DLA) Algorithms-Applications Parallesing Algorithmic Structural Variations Dense Linear Algebra General Matrix Multiplication (GMM) Column-wise Givens Rotation (CGR) QR Factorization Basic Linear Algebra Subprograms (BLAS) Linear Algebra Package (LAPACK) Electronic Systems Engineering

Search results