41

Numerical Quality and High Performance In Interval Linear Algebra on Multi-Core Processors / Algèbre linéaire d'intervalles - Qualité Numérique et Hautes Performances sur Processeurs Multi-Cœurs

Theveny, Philippe 31 October 2014 (has links)
This work aims at determining suitable scopes for several algorithms of interval matrix multiplication. First, we quantify the numerical quality. Former error analyses of interval matrix products establish bounds on the radius overestimation while neglecting the roundoff error. We discuss here several possible measures for interval approximations. We then bound the roundoff error and compare this bound experimentally with the global error distribution on several random data sets. This approach highlights the relative importance of the roundoff and arithmetic errors depending on the value and homogeneity of the relative accuracies of the inputs, on the matrix dimension, and on the working precision. It also leads to a new algorithm that is cheaper yet as accurate as previous ones under well-identified conditions. Second, we exploit the parallelism of linear algebra. Previous implementations use calls to BLAS routines on numerical matrices. We show that this may lead to wrong interval results and also restricts the scalability of the performance as the core count increases. To overcome these problems, we implement a blocked version with OpenMP threads executing block kernels with vector instructions. The timings on a 4-octo-core machine show that this implementation is more scalable than the BLAS-based one and that the costs of numerical and interval matrix products are comparable.
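As background for what an interval matrix product looks like at the floating-point level, here is a minimal midpoint-radius sketch using directed rounding, in the spirit of the Rump-style algorithms this line of work builds on; it is not the author's blocked OpenMP/vectorized implementation, and the function names, row-major layout, and naive loop ordering are assumptions introduced purely for illustration.

```c
/* Minimal sketch (not the thesis's code) of an enclosure for an interval
 * matrix product in midpoint-radius form, <mC, rC> = <mA, rA> * <mB, rB>,
 * using directed rounding in the style of Rump's algorithms.
 * Row-major n x n layout and names are assumptions. */
#include <fenv.h>
#include <math.h>
#include <stdlib.h>

#pragma STDC FENV_ACCESS ON

static void gemm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += A[i*n + k] * B[k*n + j];   /* rounded in the current FP mode */
            C[i*n + j] = s;
        }
}

/* mA,rA / mB,rB: midpoints and (nonnegative) radii; <mC, rC> encloses the product. */
void interval_gemm(int n, const double *mA, const double *rA,
                   const double *mB, const double *rB,
                   double *mC, double *rC)
{
    size_t sz = (size_t)n * n;
    double *lo  = malloc(sz * sizeof *lo);   /* lower bound of mA*mB */
    double *hi  = malloc(sz * sizeof *hi);   /* upper bound of mA*mB */
    double *aA  = malloc(sz * sizeof *aA);   /* |mA| + rA            */
    double *aB  = malloc(sz * sizeof *aB);   /* |mB|                 */
    double *tmp = malloc(sz * sizeof *tmp);  /* rA * |mB|            */

    fesetround(FE_DOWNWARD);
    gemm_naive(n, mA, mB, lo);

    fesetround(FE_UPWARD);                   /* all remaining steps round up: safe */
    gemm_naive(n, mA, mB, hi);
    for (size_t t = 0; t < sz; ++t) { aA[t] = fabs(mA[t]) + rA[t]; aB[t] = fabs(mB[t]); }
    gemm_naive(n, aA, rB, rC);               /* (|mA| + rA) * rB */
    gemm_naive(n, rA, aB, tmp);              /* rA * |mB|        */
    for (size_t t = 0; t < sz; ++t) {
        mC[t] = lo[t] + 0.5 * (hi[t] - lo[t]);    /* >= true midpoint of [lo, hi] */
        rC[t] = (mC[t] - lo[t]) + rC[t] + tmp[t]; /* rounding + method radius     */
    }

    fesetround(FE_TONEAREST);
    free(lo); free(hi); free(aA); free(aB); free(tmp);
}
```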
42

Implementace neuronové sítě bez operace násobení / Neural Network Implementation without Multiplication

Slouka, Lukáš January 2018 (has links)
The subject of this thesis is neural network acceleration with the goal of reducing the number of floating-point multiplications. The theoretical part of the thesis surveys current trends and methods used in the field of neural network acceleration, with a focus on binarization techniques that allow replacing multiplications with logical operations. The theoretical base is put into practice in two ways. The first is a GPU implementation of the crucial binary operators in the Tensorflow framework with a performance benchmark. The second is an application of these operators in a simple image classifier. The results are certainly encouraging: the implemented operators achieve a speed-up factor of 2.5 compared to highly optimized cuBLAS operators. The last chapter compares the accuracies achieved by binarized models and their full-precision counterparts on various architectures.
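As a concrete illustration of how binarization removes multiplications, the sketch below computes a dot product of two sign vectors packed one bit per element using XOR and popcount; it is a generic CPU sketch, not the thesis's Tensorflow/GPU operators, and the packing convention and function name are assumptions.

```c
/* Dot product of two binarized vectors (values in {-1,+1}) packed one bit per
 * element (bit = 1 encodes +1, bit = 0 encodes -1). For packed words a and b,
 * differing sign bits are found with XOR, so
 *   dot = (#matching bits) - (#differing bits) = n - 2*popcount(a XOR b).
 * Assumes unused high bits of the last word are zero in both operands. */
#include <stdint.h>

int binary_dot(const uint64_t *a, const uint64_t *b, int n_bits)
{
    int n_words = (n_bits + 63) / 64;
    int diff = 0;                                   /* number of differing sign bits */
    for (int w = 0; w < n_words; ++w)
        diff += __builtin_popcountll(a[w] ^ b[w]);  /* GCC/Clang builtin */
    return n_bits - 2 * diff;                       /* no floating-point multiplies */
}
```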
43

Code Optimization on GPUs

Hong, Changwan 30 October 2019 (has links)
No description available.
44

Average case analysis of algorithms for the maximum subarray problem

Bashar, Mohammad Ehsanul January 2007 (has links)
The Maximum Subarray Problem (MSP) is to find the consecutive array portion that maximizes the sum of the array elements in it. The goal is to locate the most useful and informative array segment that associates two parameters involved in the data in a 2D array. It is an efficient data-mining method which gives an accurate pattern or trend of the data with respect to some associated parameters. Distance Matrix Multiplication (DMM) is at the core of MSP, and DMM and MSP have worst-case complexities of the same order, so improving the algorithm for DMM also triggers an improvement of MSP. The complexity of conventional DMM is O(n³). In the average case, the All Pairs Shortest Path (APSP) problem can be adapted as a fast engine for DMM and solved in O(n² log n) expected time. Using this result, MSP can be solved in O(n² log² n) expected time. MSP can be extended to K-MSP. To incorporate DMM into K-MSP, DMM needs to be extended to K-DMM as well. In this research we show how DMM can be extended to K-DMM using a K-Tuple Approach to solve K-MSP in O(Kn² log² n log K) time when K ≤ n/log n. We also present a Tournament Approach which solves K-MSP in O(n² log² n + Kn²) time and outperforms the K-Tuple Approach.
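For readers unfamiliar with the underlying problem, the sketch below is the classic O(n) Kadane algorithm for the one-dimensional maximum subarray; it only illustrates the problem definition and is not the thesis's DMM-based method for the 2D and K-maximum variants.

```c
/* Classic 1D maximum subarray (Kadane, O(n)); background only for the
 * 2D/K-maximum variants discussed above. Returns the maximum sum of a
 * non-empty contiguous segment. */
#include <stdio.h>

long long max_subarray(const long long *a, int n)
{
    long long best = a[0];      /* best sum seen so far               */
    long long cur  = a[0];      /* best sum of a segment ending at i  */
    for (int i = 1; i < n; ++i) {
        cur  = (cur > 0) ? cur + a[i] : a[i];
        best = (cur > best) ? cur : best;
    }
    return best;
}

int main(void)
{
    long long a[] = { 3, -4, 5, 1, -2, 4, -6 };
    printf("%lld\n", max_subarray(a, 7));   /* prints 8 (segment 5, 1, -2, 4) */
    return 0;
}
```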
45

Algorithm-Architecture Co-Design for Dense Linear Algebra Computations

Merchant, Farhad January 2015 (has links) (PDF)
Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing application, used, for example, in LU and QR factorizations. Unfortunately, modern off-the-shelf microprocessors fall significantly short of the theoretical lower bound in CPI for high-performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelism and propose suitable algorithmic and architectural variations to significantly improve computation efficiency. There are two standard approaches to improving computation efficiency: first, application-specific architecture customization, and second, algorithmic tuning. Accordingly, we first perform a graph-based analysis of selected DLA kernels. From the various forms of parallelism thus identified, we design a custom processing element for improving the CPI. The processing elements are used as building blocks for a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By performing detailed experiments on a synthesized CGRA implementation, we demonstrate that our proposed algorithmic and architectural variations achieve lower CPI than off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations and report a higher energy-performance-area product.

DLA computations are encountered in many engineering and scientific computing applications, ranging from Computational Fluid Dynamics (CFD) to eigenvalue problems. Traditionally, these applications are written using highly tuned High Performance Computing (HPC) software packages such as the Linear Algebra Package (LAPACK) and the Scalable Linear Algebra Package (ScaLAPACK). The basic building block for these packages is the Basic Linear Algebra Subprograms (BLAS). Algorithms in LAPACK/ScaLAPACK are written in terms of BLAS to achieve high throughput. Despite extensive intellectual effort in the development and tuning of these packages, there is still scope for further tuning. In this thesis, we revisit prominent and widely used compute-bound algorithms such as General Matrix Multiplication (GMM) for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and exhibit higher ILP in these algorithms. We first accelerate the sequential performance of the algorithms in BLAS and LAPACK and then focus on their parallel realization. The major algorithmic contributions of this thesis are as follows:

- We present a graph-based analysis of General Matrix Multiplication (GMM) and discuss the different types of parallelism available in GMM.
- We analyze Givens Rotation (GR) based QR factorization, improve GR, and derive Column-wise GR (CGR), which can annihilate multiple elements of a column of a matrix simultaneously. We show that CGR requires fewer multiplications than GR.
- We generalize CGR further and derive Generalized GR (GGR), which can annihilate multiple elements of several columns of a matrix simultaneously. We show that GGR exhibits much higher parallelism than GR and the Householder Transform (HT).
- We extend the generalizations to Square-root-Free GR (also known as Fast Givens Rotation) and Square-root- and Division-Free GR (SDFG), deriving Column-wise Fast Givens and Column-wise SDFG.
- We also extend the generalization to complex matrices and derive Complex Column-wise Givens Rotation.

Coarse-Grained Reconfigurable Architectures (CGRAs) have gained popularity in the last decade due to their power and area efficiency. Furthermore, CGRAs such as REDEFINE also support domain customization. REDEFINE is an array of Tiles, where each Tile consists of a Compute Element and a Router. The Routers are responsible for on-chip communication, while the Compute Elements in REDEFINE can be domain-customized to accelerate applications in the domain of interest. In this thesis, we take the REDEFINE base architecture as a starting point and design a Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently. We perform several architectural enhancements in the PE to approach the lower bound of the CPI. For the parallel realization of BLAS and LAPACK, we attach this PE to the Router of REDEFINE. We achieve better area and power performance than the customized architectures of yesteryear for DLA. The major architectural contributions of this thesis are as follows:

- We present the design of a PE for the acceleration of GMM, a Level-3 BLAS operation.
- We methodically enhance the PE with different features to improve GMM performance.
- For the efficient realization of LAPACK, we use the PE that efficiently executes GMM and show better performance.
- For further acceleration of LU and QR factorizations in LAPACK, we identify macro operations encountered in these factorizations and realize them on a reconfigurable data-path, resulting in 25-30% lower run-time.
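As a reference point for the Givens-rotation generalizations listed above, the following sketch applies one textbook Givens rotation to annihilate a single element during QR factorization; it is the baseline GR, not the thesis's CGR/GGR/SDFG variants, and the routine names and row-major layout are assumptions.

```c
/* Textbook Givens rotation: choose c, s so that [c s; -s c]^T * [a; b] = [r; 0],
 * then apply the rotation to rows i and k of A (n columns, row-major).
 * This is the baseline GR that column-wise variants like CGR/GGR generalize. */
#include <math.h>

static void givens(double a, double b, double *c, double *s)
{
    if (b == 0.0) { *c = 1.0; *s = 0.0; return; }
    double r = hypot(a, b);      /* r = sqrt(a^2 + b^2) without overflow */
    *c = a / r;
    *s = b / r;
}

/* Zero A[k][j] using row i as the pivot row. */
void apply_givens(double *A, int n, int i, int k, int j)
{
    double c, s;
    givens(A[i*n + j], A[k*n + j], &c, &s);
    for (int col = j; col < n; ++col) {           /* update both rows */
        double ti = A[i*n + col], tk = A[k*n + col];
        A[i*n + col] =  c * ti + s * tk;
        A[k*n + col] = -s * ti + c * tk;          /* becomes 0 at col == j */
    }
}
```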
46

Power and Energy Efficiency Evaluation for HW and SW Implementation of nxn Matrix Multiplication on Altera FPGAs

Renbi, Abdelghani January 2009 (has links)
In addition to performance, low-power design has become an important issue in the design process of mobile embedded systems. Mobile electronics with rich features most often involve complex computation and intensive processing, which result in short battery lifetime, particularly when low-power design is not taken into consideration. Beyond mobile computers, thermal design also calls for low-power techniques to avoid component overheating, especially with VLSI technology. Low-power design has thus opened a new era. In this thesis we examined several techniques to achieve low-power design for FPGAs, ASICs and processors, where ASICs were the most flexible for exploiting hardware-oriented techniques for low power consumption. We surveyed several power estimation methodologies, all of which were prone to at least one disadvantage. We also compared and analyzed the power and energy consumption of three different designs that perform matrix multiplication on an Altera platform using a state-of-the-art FPGA device. We concluded that NIOS II/e is not an energy-efficient alternative for multiplying nxn matrices compared to hardware matrix multipliers on FPGAs, and that configware has enormous potential to reduce energy consumption costs.
47

Deep Learning Inference on Low-Power Commodity Processors and the AMD Versal AI Engine

Lei, Jie 18 November 2024 (has links)
This thesis presents a comprehensive study on implementing an efficient realization of GEMM on low-power commodity processors and a heterogeneous platform from AMD. This research is inspired by the increasing demand for low-power, low-latency, high-performance inference with complex Deep Learning (DL) models arising, for instance, in Natural Language Processing (NLP) and Convolutional Neural Networks (CNN). This led to the opportunity to explore the applicability of hardware and software acceleration for GEMM on ARM, RISC-V, and AMD Versal AI Engine (AIE) platforms. We set the objectives of our research as follows: first, to develop efficient mixed-precision kernels for GEMM on ARM and RISC-V architectures, exploiting the Single-Instruction, Multiple-Data (SIMD) units in these architectures; second, to explore the applicability of the conventional GEMM algorithm to non-conventional hardware platforms such as the AIE in the AMD Versal system; and lastly, to investigate the scalability of the parallel GEMM design to multiple AIEs on AMD Versal systems. In greater detail, the research starts by implementing GEMM on ARM and RISC-V architectures, for which we propose a template-based micro-kernel code generation tool targeting ARM Neon, the RISC-V vector (RVV) extension 0.7.1, and RVV 1.0. The code generation tool also allows configuring the micro-kernel dimensions, a critical parameter from a performance point of view. This work indicates that kernel code generation drastically improves the productivity and portability of intrinsic-based GEMM designs. We also incorporate mixed-precision INT8|INT32 arithmetic, showing the speedup over FP32 approaches. Building upon the success of the GEMM implementation on conventional commodity systems, we extended our interest to non-conventional heterogeneous platforms, in particular the AMD Versal AIE architecture. For this platform, we designed architecture-specific 8x8 micro-kernels using the flexible low-level intrinsics, implementing mixed-precision arithmetic and data-packing routines, all aimed at high-performance DL inference. More importantly, we proposed a customized memory hierarchy design for this architecture, which is crucial for low-latency GEMM operations. The results show that the proposed micro-kernels achieve 86.7% of the peak performance of a single-AIE implementation. We went a step further by benchmarking the GEMM design on the DL model ResNet-50 v1.5+ImageNet, where we converted the convolution operators to GEMM kernels. Following the successful implementation of GEMM on a single AIE tile, we extended our research to multiple AIE tiles, introducing parallelization into the algorithm. We redesigned the architecture-specific GEMM to accommodate up to 32 AIE tiles. To achieve this, we optimized the customized memory hierarchy design and proposed a new topology for higher communication throughput. The results show great scalability of the parallel GEMM design, drastically reducing computation time by 31.5x compared to the single-AIE-tile design. / I would like to express my sincere appreciation to Horizon 2020 of the European Union for their generous funding. This project has been supported by the European Union's Horizon 2020 (H2020) Marie Sklodowska-Curie Innovative Training Networks H2020-MSCA-ITN-2020 call, under Grant Agreement no. 956090. This funding has been crucial in enabling the success of this research. / Lei, J. (2024). Deep Learning Inference on Low-Power Commodity Processors and the AMD Versal AI Engine [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/212297
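To make the micro-kernel-based GEMM structure described above concrete, the sketch below shows a generic GotoBLAS/BLIS-style cache-blocked loop nest around a small micro-kernel in plain C; it is not the thesis's generated ARM Neon / RVV / AIE kernels, and the block sizes, names, and the assumption that the dimensions divide the block sizes are illustrative choices only. In practice the micro-kernel body is where SIMD intrinsics or AIE vector instructions would replace the scalar loops.

```c
/* Minimal blocked GEMM sketch (C += A*B, row-major, single precision).
 * Generic GotoBLAS/BLIS-style structure: cache blocking around a small
 * MR x NR micro-kernel. Block sizes and names are illustrative only. */
#include <stddef.h>

enum { MR = 4, NR = 4, MC = 128, NC = 256, KC = 256 };

/* MR x NR micro-kernel: accumulate an A panel times a B panel into C. */
static void micro_kernel(int kc, const float *A, int lda,
                         const float *B, int ldb, float *C, int ldc)
{
    float acc[MR][NR] = {{0}};
    for (int k = 0; k < kc; ++k)                 /* sequence of rank-1 updates */
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += A[i*lda + k] * B[k*ldb + j];
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            C[i*ldc + j] += acc[i][j];
}

/* Assumes m, n, k are multiples of the block sizes to keep the sketch short. */
void gemm_blocked(int m, int n, int k,
                  const float *A, const float *B, float *C)
{
    for (int jc = 0; jc < n; jc += NC)                 /* B panel blocking */
        for (int pc = 0; pc < k; pc += KC)             /* K blocking       */
            for (int ic = 0; ic < m; ic += MC)         /* A block blocking */
                for (int jr = jc; jr < jc + NC; jr += NR)
                    for (int ir = ic; ir < ic + MC; ir += MR)
                        micro_kernel(KC,
                                     &A[ir*(size_t)k + pc], k,
                                     &B[pc*(size_t)n + jr], n,
                                     &C[ir*(size_t)n + jr], n);
}
```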
48

ACCELERATING SPARSE MACHINE LEARNING INFERENCE

Ashish Gondimalla (14214179) 17 May 2024 (has links)
Convolutional neural networks (CNNs) have become important workloads due to their impressive accuracy in tasks like image classification and recognition. Convolution operations are compute intensive, and this cost increases profoundly with newer and better CNN models. However, convolutions come with characteristics such as sparsity which can be exploited. In this dissertation, we propose three different works to capture sparsity for faster performance and reduced energy.

The first work is an accelerator design called SparTen for improving two-sided sparsity (i.e., sparsity in both filters and feature maps) convolutions with fine-grained sparsity. SparTen identifies the efficient inner join as the key primitive for hardware acceleration of sparse convolution. In addition, SparTen proposes load-balancing schemes for higher compute-unit utilization. SparTen performs 4.7x, 1.8x and 3x better than a dense architecture, a one-sided architecture and SCNN, the previous state-of-the-art accelerator. The second work, BARISTA, scales up SparTen (and SparTen-like proposals) to a large-scale implementation with as many compute units as recent dense accelerators (e.g., Google's Tensor Processing Unit) to achieve the full speedups afforded by sparsity. However, at such large scales, buffering, on-chip bandwidth, and compute utilization are highly intertwined, where optimizing for one factor strains another and may invalidate some optimizations proposed in small-scale implementations. BARISTA proposes novel techniques to balance the three factors in large-scale accelerators. BARISTA performs 5.4x, 2.2x, 1.7x and 2.5x better than dense, one-sided, naively scaled two-sided and iso-area two-sided architectures, respectively. The last work, EUREKA, builds an efficient tensor core to execute dense, structured and unstructured sparsity without losing efficiency. EUREKA achieves this by proposing novel techniques to improve compute utilization by slightly tweaking operand stationarity. EUREKA achieves speedups of 5x and 2.5x, along with 3.2x and 1.7x energy reductions, over dense and structured-sparse execution respectively. EUREKA incurs area and power overheads of only 6% and 11.5%, respectively, over Ampere.
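To illustrate the inner-join primitive that SparTen is described as accelerating, here is a scalar sketch of a sparse-sparse dot product over two sorted (index, value) lists; it is a conceptual software analogue of the hardware intersection, not the accelerator's datapath, and the data layout is an assumption.

```c
/* Sparse-sparse dot product as an index "inner join": both operands are stored
 * as sorted (index, value) pairs, and only positions present in both lists
 * contribute a multiply-accumulate. Conceptual analogue of the intersection
 * primitive described above, not the SparTen hardware itself. */
#include <stdint.h>

float sparse_dot(const uint32_t *idx_a, const float *val_a, int nnz_a,
                 const uint32_t *idx_b, const float *val_b, int nnz_b)
{
    float acc = 0.0f;
    int i = 0, j = 0;
    while (i < nnz_a && j < nnz_b) {
        if (idx_a[i] == idx_b[j]) {          /* match: do useful work */
            acc += val_a[i] * val_b[j];
            ++i; ++j;
        } else if (idx_a[i] < idx_b[j]) {    /* advance the smaller index */
            ++i;
        } else {
            ++j;
        }
    }
    return acc;
}
```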
