Global ETD Search

1	Implementing a New Dense Symmetric Eigensolver on Multicore Systems Sukkari, Dalal E. 07 1900 (has links) We present original advanced architecture implementations of the QDWHeig algo- rithm for solving dense symmetric eigenproblems. The algorithm (Y. Nakatsukasa and N. J. Higham, 2012) performs a spectral divide-and-conquer, which recursively divides the matrix into smaller submatrices by finding an invariant subspace for a subset of the spectrum. The main contribution of this thesis is to enhance the per- formance of QDWHeig algorithm by relying on a high performance kernels from PLASMA [1] and LAPACK [2]. We demonstrate the quality of the eigenpairs that are computed with the QDWHeig algorithm for many matrix types with different eigenvalue clustering. We then implement QDWHeig using kernels from LAPACK and PLASMA, and compare its performance against other divide-and-conquer sym- metric eigensolvers. The main part of QDWHeig is finding a polar decomposition. We introduce mixed precision to enhance the performance in finding the polar decom- position. Our evaluation considers speed and accuracy of the computed eigenvalues. Some applications require finding only a subspectrum of the eigenvalues; therefore we modify the algorithm to find the eigenpairs in a given interval of interest. An ex- perimental study shows significant improvement on the performance of our algorithm using mixed precision and PLASMA routines. Read more accuracy performance implementation plasma lapack eigensolver
2	FLENS - A Flexible Library for Efficient Numerical Solutions Lehn, Michael Christian, January 2008 (has links) Ulm, Univ., Diss., 2008.
3	DGRSVX and DMSRIC: Fortran 77 subroutines for solving continuous-time matrix algebraic Riccati equations with condition and accuracy estimates Petkov, P. Hr., Konstantinov, M. M., Mehrmann, V. 12 September 2005 (has links) (PDF) We present new Fortran 77 subroutines which implement the Schur method and the matrix sign function method for the solution of the continuoustime matrix algebraic Riccati equation on the basis of LAPACK subroutines. In order to avoid some of the wellknown difficulties with these methods due to a loss of accuracy, we combine the implementations with block scalings as well as condition estimates and forward error estimates. Results of numerical experiments comparing the performance of both methods for more than one hundred well and illconditioned Riccati equations of order up to 150 are given. It is demonstrated that there exist several classes of examples for which the matrix sign function approach performs more reliably and more accurately than the Schur method. In all cases the forward error estimates allow to obtain a reliable bound on the accuracy of the computed solution. Schur method matrix sign function method ddc:510 Algebraische Riccati-Gleichung FORTRAN-Programm LAPACK
4	DGRSVX and DMSRIC: Fortran 77 subroutines for solving continuous-time matrix algebraic Riccati equations with condition and accuracy estimates Petkov, P. Hr., Konstantinov, M. M., Mehrmann, V. 12 September 2005 (has links) We present new Fortran 77 subroutines which implement the Schur method and the matrix sign function method for the solution of the continuoustime matrix algebraic Riccati equation on the basis of LAPACK subroutines. In order to avoid some of the wellknown difficulties with these methods due to a loss of accuracy, we combine the implementations with block scalings as well as condition estimates and forward error estimates. Results of numerical experiments comparing the performance of both methods for more than one hundred well and illconditioned Riccati equations of order up to 150 are given. It is demonstrated that there exist several classes of examples for which the matrix sign function approach performs more reliably and more accurately than the Schur method. In all cases the forward error estimates allow to obtain a reliable bound on the accuracy of the computed solution. info:eu-repo/classification/ddc/510 ddc:510 Algebraische Riccati-Gleichung FORTRAN-Programm LAPACK Schur method matrix sign function method
5	Robust Explicit Construction of 3D Configuration Spaces Using Controlled Linear Perturbation Trac, Steven Cy 19 December 2008 (has links) We present robust explicit construction of 3D configuration spaces using controlled linear perturbation. The input is two planar parts: a fixed set and a moving set, where each set is bounded by circle segments. The configuration space is the three-dimensional space of Euclidean transformation (translations plus rotations) of the moving set relative to the fixed set. The goal of constructing the 3D configuration space is to determine the boundary representation of the free space where the intersection of the moving set and fixed set is empty. To construct the configuration space, we use the controlled linear perturbation algorithm. The controlled linear perturbation algorithm assigns function signs that are correct for a nearly minimal input perturbation. The output of the algorithm is a consistent set of function signs. This approach is algorithm-independent, and the overhead over traditional floating point methods is reasonable. If the fixed and moving sets are computer representations of physical objects, then computing the configuration space greatly aids in many computational geometry problems. The main focus of computing the configuration space is for the path planning problem. We must find if a path exists from the start to the goal, where the fixed set is the obstacle, and the moving set is the object trying to reach the goal. Read more
6	Résolutions rapides et fiables pour les solveurs d'algèbre linéaire numérique en calcul haute performance. Baboulin, Marc 05 December 2012 (has links) (PDF) Dans cette Habilitation à Diriger des Recherches (HDR), nous présentons notre recherche effectuée au cours de ces dernières années dans le domaine du calcul haute-performance. Notre travail a porté essentiellement sur les algorithmes parallèles pour les solveurs d'algèbre linéaire numérique et leur implémentation parallèle dans les bibliothèques logicielles du domaine public. Nous illustrons dans ce manuscrit comment ces calculs peuvent être accélérées en utilisant des algorithmes innovants et être rendus fiables en utilisant des quantités spécifiques de l'analyse d'erreur. Nous expliquons tout d'abord comment les solveurs d'algèbre linéaire numérique peuvent être conçus de façon à exploiter les capacités des calculateurs hétérogènes actuels comprenant des processeurs multicœurs et des GPUs. Nous considérons des algorithmes de factorisation dense pour lesquels nous décrivons la répartition des tâches entre les différentes unités de calcul et son influence en terme de coût des communications. Ces cal- culs peuvent être également rendus plus performants grâce à des algorithmes en précision mixte qui utilisent une précision moindre pour les tâches les plus coûteuses tout en calculant la solution en précision supérieure. Puis nous décrivons notre travail de recherche dans le développement de solveurs d'algèbre linéaire rapides qui utilisent des algorithmes randomisés. La randomisation représente une approche innovante pour accélérer les calculs d'algèbre linéaire et la classe d'algorithmes que nous proposons a l'avantage de réduire la volume de communications dans les factorisations en supprimant complètement la phase de pivotage dans les systèmes linéaires. Les logiciels correspondants on été développés pour architectures multicœurs éventuellement accélérées par des GPUs. Enfin nous proposons des outils qui nous permettent de garantir la qualité de la solution calculée pour les problèmes de moindres carrés sur-déterminés, incluant les moindres carrés totaux. Notre méthode repose sur la dérivation de formules exactes ou d'estimateurs pour le conditionnement de ces problèmes. Nous décrivons les algorithmes et les logiciels qui permettent de calculer ces quantités avec les bibliothèques logicielles parallèles standards. Des pistes de recherche pour les années à venir sont données dans un chapître de conclusion. Read more Calcul haute-performance solveurs d'algèbre linéaire dense systèmes linéaires moindres carrés linéaires processeurs multicœurs Graphics Processing Units (GPU) algorithmes randomisés analyse d'erreur inverse estimation de conditionnement LAPACK ScaLAPACK PLASMA MAGMA
7	Algorithm-Architecture Co-Design for Dense Linear Algebra Computations Merchant, Farhad January 2015 (has links) (PDF) Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing ap- plication, which is used, for example, in LU and QR factorizations. Unfortunately, mod- ern off-the-shelf microprocessors fall significantly short of achieving theoretical lower bound in CPI for high performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelisms and propose suitable algorithmic and architectural variation to significantly improve the computation efficiency. There are two standard approaches for improving the computation effficiency, first, to perform application-specific architecture customization and second, to do algorithmic tuning. In the same manner, we first perform a graph-based analysis of selected DLA kernels. From the various forms of parallelism, thus identified, we design a custom processing element for improving the CPI. The processing elements are used as building blocks for a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By per- forming detailed experiments on a synthesized CGRA implementation, we demonstrate that our proposed algorithmic and architectural variations are able to achieve lower CPI compared to off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations to report higher energy-performance-area product. DLA computations are encountered in many engineering and scientific computing ap- plications ranging from Computational Fluid Dynamics (CFD) to Eigenvalue problem. Traditionally, these applications are written in highly tuned High Performance Comput- ing (HPC) software packages like Linear Algebra Package (LAPACK), and/or Scalable Linear Algebra Package (ScaLAPACK). The basic building block for these packages is Ba- sic Linear Algebra Subprograms (BLAS). Algorithms pertaining LAPACK/ScaLAPACK are written in-terms of BLAS to achieve high throughput. Despite extensive intellectual efforts in development and tuning of these packages, there still exists a scope for fur- ther tuning in this packages. In this thesis, we revisit most prominent and widely used compute bound algorithms like GMM for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and exhibit higher ILP in these algorithms. We first accelerate sequential performance of the algorithms in BLAS and LAPACK and then focus on the parallel realization of these algorithms. Major contributions in the algorithmic tuning in this thesis are as follows: Algorithms: We present graph based analysis of General Matrix Multiplication (GMM) and discuss different types of parallelisms available in GMM We present analysis of Givens Rotation based QR factorization where we improve GR and derive Column-wise GR (CGR) that can annihilate multiple elements of a column of a matrix simultaneously. We show that the multiplications in CGR are lower than GR We generalize CGR further and derive Generalized GR (GGR) that can annihilate multiple elements of the columns of a matrix simultaneously. We show that the parallelism exhibited by GGR is much higher than GR and Householder Transform (HT) We extend generalizations to Square root Free GR (also knows as Fast Givens Rotation) and Square root and Division Free GR (SDFG) and derive Column-wise Fast Givens, and Column-wise SDFG . We also extend generalization for complex matrices and derive Complex Column-wise Givens Rotation Coarse-grained Recon gurable Architectures (CGRAs) have gained popularity in the last decade due to their power and area efficiency. Furthermore, CGRAs like REDEFINE also exhibit support for domain customizations. REDEFINE is an array of Tiles where each Tile consists of a Compute Element and a Router. The Routers are responsible for on-chip communication, while Compute Elements in the REDEFINE can be domain customized to accelerate the applications pertaining to the domain of interest. In this thesis, we consider REDEFINE base architecture as a starting point and we design Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently. We perform several architectural enhancements in the PE to approach lower bound of the CPI. For parallel realization of BLAS and LAPACK, we attach this PE to the Router of REDEFINE. We achieve better area and power performance compared to the yesteryear customized architecture for DLA. Major contributions in architecture in this thesis are as follows: Architecture: We present design of a PE for acceleration of GMM which is a Level-3 BLAS operation We methodically enhance the PE with different features for improvement in the performance of GMM For efficient realization of Linear Algebra Package (LAPACK), we use PE that can efficiently execute GMM and show better performance For further acceleration of LU and QR factorizations in LAPACK, we identify macro operations encountered in LU and QR factorizations, and realize them on a reconfigurable data-path resulting in 25-30% lower run-time Read more Dense Linear Algebra (DLA) Algorithms-Applications Parallesing Algorithmic Structural Variations Dense Linear Algebra General Matrix Multiplication (GMM) Column-wise Givens Rotation (CGR) QR Factorization Basic Linear Algebra Subprograms (BLAS) Linear Algebra Package (LAPACK) Electronic Systems Engineering

1

Page generated in 0.0274 seconds