41 |
Implementace neuronové sítě bez operace násobení / Neural Network Implementation without Multiplication. Slouka, Lukáš. January 2018 (has links)
The subject of this thesis is neural network acceleration with the goal of reducing the number of floating-point multiplications. The theoretical part of the thesis surveys current trends and methods used in the field of neural network acceleration, with a focus on binarization techniques, which allow multiplications to be replaced with logical operations. The theoretical base is put into practice in two ways. The first is a GPU implementation of the crucial binary operators in the Tensorflow framework, together with a performance benchmark. The second is an application of these operators in a simple image classifier. The results are encouraging: the implemented operators achieve a speed-up by a factor of 2.5 compared to highly optimized cuBLAS operators. The last chapter compares the accuracies achieved by binarized models and their full-precision counterparts on various architectures.
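In binarized networks, a dot product over {-1, +1} values can be computed with XNOR and popcount instead of floating-point multiply-accumulate, which is the substitution the abstract refers to. Below is a minimal NumPy sketch of the idea; the bit encoding and function names are illustrative assumptions, not the thesis's Tensorflow/GPU implementation.

```python
import numpy as np

def binarize(x):
    # Map real values to {-1, +1}, stored as bits: 1 -> +1, 0 -> -1 (illustrative encoding).
    return (x >= 0).astype(np.uint8)

def binary_dot(a_bits, b_bits):
    # XNOR marks positions where the signs agree; popcount sums them.
    agree = np.count_nonzero(~(a_bits ^ b_bits) & 1)
    n = a_bits.size
    # Dot product over {-1, +1}: (#agreements) - (#disagreements) = 2*agree - n.
    return 2 * agree - n

# Usage: matches the floating-point dot product of the sign vectors.
x = np.random.randn(1024)
w = np.random.randn(1024)
ref = np.dot(np.sign(x), np.sign(w))
assert binary_dot(binarize(x), binarize(w)) == ref
```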
|
42 |
Code Optimization on GPUs. Hong, Changwan. 30 October 2019 (has links)
No description available.
|
43 |
Average case analysis of algorithms for the maximum subarray problem. Bashar, Mohammad Ehsanul. January 2007 (has links)
The Maximum Subarray Problem (MSP) is to find the consecutive portion of an array that maximizes the sum of the elements in it. The goal is to locate the most useful and informative array segment that associates two parameters involved in the data of a 2D array. It is an efficient data mining method that gives an accurate pattern or trend of the data with respect to some associated parameters. Distance Matrix Multiplication (DMM) is at the core of MSP, and DMM and MSP have worst-case complexities of the same order, so an improvement to the DMM algorithm also yields an improvement for MSP. The complexity of conventional DMM is O(n³). In the average case, the All Pairs Shortest Path (APSP) problem can be adapted as a fast engine for DMM and solved in O(n² log n) expected time. Using this result, MSP can be solved in O(n² log² n) expected time. MSP can be extended to K-MSP; to incorporate DMM into K-MSP, DMM must likewise be extended to K-DMM. In this research we show how DMM can be extended to K-DMM using a K-Tuple approach to solve K-MSP in O(Kn² log² n log K) time when K ≤ n/log n. We also present a Tournament approach which solves K-MSP in O(n² log² n + Kn²) time and outperforms the K-Tuple approach.
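For reference, a straightforward baseline (not the thesis's O(n² log² n) expected-time method) solves the 2D MSP in O(n³): collapse each band of rows into column sums and run Kadane's 1D algorithm on them. A minimal sketch, with illustrative names:

```python
import numpy as np

def max_subarray_1d(a):
    # Kadane's algorithm: best sum of a consecutive run of elements.
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def max_subarray_2d(m):
    # O(n^3) baseline: for every pair of top/bottom rows, collapse the band
    # into column sums and solve the 1D problem on those sums.
    m = np.asarray(m, dtype=float)
    best = -np.inf
    for top in range(m.shape[0]):
        col_sums = np.zeros(m.shape[1])
        for bottom in range(top, m.shape[0]):
            col_sums += m[bottom]          # running sum over the row band
            best = max(best, max_subarray_1d(col_sums))
    return best

# Usage: maximal-sum rectangle in a small example.
grid = [[1, -2, 3], [-4, 5, -6], [7, -8, 9]]
print(max_subarray_2d(grid))
```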
|
44 |
Algorithm-Architecture Co-Design for Dense Linear Algebra Computations. Merchant, Farhad. January 2015 (has links) (PDF)
Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing application, used for example in LU and QR factorizations. Unfortunately, modern off-the-shelf microprocessors fall significantly short of the theoretical lower bound on CPI for high-performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelism and propose suitable algorithmic and architectural variations to significantly improve computation efficiency. There are two standard approaches for improving computation efficiency: first, application-specific architecture customization, and second, algorithmic tuning. Accordingly, we first perform a graph-based analysis of selected DLA kernels. From the various forms of parallelism thus identified, we design a custom processing element for improving CPI. The processing elements are used as building blocks for a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By performing detailed experiments on a synthesized CGRA implementation, we demonstrate that our proposed algorithmic and architectural variations achieve lower CPI than off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations and report a higher energy-performance-area product.
DLA computations are encountered in many engineering and scientific computing applications, ranging from Computational Fluid Dynamics (CFD) to eigenvalue problems. Traditionally, these applications rely on highly tuned High Performance Computing (HPC) software packages like the Linear Algebra Package (LAPACK) and/or the Scalable Linear Algebra Package (ScaLAPACK). The basic building block for these packages is the Basic Linear Algebra Subprograms (BLAS). Algorithms in LAPACK/ScaLAPACK are written in terms of BLAS to achieve high throughput. Despite extensive intellectual effort in the development and tuning of these packages, there remains scope for further tuning. In this thesis, we revisit the most prominent and widely used compute-bound algorithms, such as General Matrix Multiplication (GMM), for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and exhibit higher ILP in these algorithms. We first accelerate the sequential performance of the algorithms in BLAS and LAPACK and then focus on their parallel realization. The major algorithmic contributions of this thesis are as follows:
Algorithms:
- We present a graph-based analysis of GMM and discuss the different types of parallelism available in GMM.
- We present an analysis of Givens Rotation (GR) based QR factorization, improve GR, and derive Column-wise GR (CGR), which can annihilate multiple elements of a column of a matrix simultaneously. We show that CGR requires fewer multiplications than GR (a minimal sketch of the classical Givens rotation that CGR generalizes follows this list).
- We generalize CGR further and derive Generalized GR (GGR), which can annihilate multiple elements of the columns of a matrix simultaneously. We show that the parallelism exhibited by GGR is much higher than that of GR and the Householder Transform (HT).
- We extend the generalization to Square-root-Free GR (also known as Fast Givens Rotation) and Square-root-and-Division-Free GR (SDFG) and derive Column-wise Fast Givens and Column-wise SDFG. We also extend the generalization to complex matrices and derive Complex Column-wise Givens Rotation.
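For context, the classical Givens rotation zeroes one sub-diagonal element at a time by rotating a pair of rows. A minimal NumPy sketch of standard one-element-at-a-time Givens QR, for illustration only (not the thesis's CGR/GGR formulation):

```python
import numpy as np

def givens(a, b):
    # Rotation (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0].
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

def givens_qr(A):
    # Standard Givens-rotation QR: annihilate one sub-diagonal entry per rotation.
    m, n = A.shape
    R = A.astype(float)
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]        # zeroes R[i, j]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T      # accumulate Q
    return Q, R

# Usage: Q is orthogonal, R upper triangular, and Q @ R reconstructs A.
A = np.random.randn(4, 3)
Q, R = givens_qr(A)
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(4))
```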
Coarse-Grained Reconfigurable Architectures (CGRAs) have gained popularity in the last decade due to their power and area efficiency. Furthermore, CGRAs like REDEFINE also support domain customization. REDEFINE is an array of Tiles, where each Tile consists of a Compute Element and a Router. The Routers are responsible for on-chip communication, while the Compute Elements in REDEFINE can be domain-customized to accelerate applications from the domain of interest. In this thesis, we take the REDEFINE base architecture as a starting point and design a Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently. We perform several architectural enhancements in the PE to approach the lower bound on CPI. For parallel realization of BLAS and LAPACK, we attach this PE to the Router of REDEFINE. We achieve better area and power performance compared to earlier customized architectures for DLA. The major architectural contributions of this thesis are as follows:
Architecture:
- We present the design of a PE for acceleration of GMM, which is a Level-3 BLAS operation (a minimal blocked GMM sketch follows this list).
- We methodically enhance the PE with different features that improve GMM performance.
- For efficient realization of LAPACK, we use the PE that efficiently executes GMM and show better performance.
- For further acceleration of LU and QR factorizations in LAPACK, we identify macro operations encountered in these factorizations and realize them on a reconfigurable data-path, resulting in 25-30% lower run-time.
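For reference, GMM (the Level-3 BLAS GEMM operation, C <- alpha*A*B + beta*C) is typically realized in blocked form so that small tiles stay close to the compute units; a custom PE such as the one above would handle the innermost tile update. A minimal NumPy sketch of the blocking, with an illustrative tile size (not the REDEFINE PE's data-path):

```python
import numpy as np

def blocked_gemm(A, B, C, alpha=1.0, beta=1.0, tile=32):
    # C <- alpha * A @ B + beta * C, computed tile by tile.
    m, k = A.shape
    _, n = B.shape
    C *= beta
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Innermost tile update: the part a custom PE would accelerate.
                C[i:i+tile, j:j+tile] += alpha * A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

# Usage: agrees with the direct matrix product.
A, B = np.random.randn(96, 64), np.random.randn(64, 80)
C = np.zeros((96, 80))
assert np.allclose(blocked_gemm(A, B, C, beta=0.0), A @ B)
```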
|
45 |
Power and Energy Efficiency Evaluation for HW and SW Implementation of nxn Matrix Multiplication on Altera FPGAs. Renbi, Abdelghani. January 2009 (has links)
In addition to performance, low-power design has become an important issue in the design process of mobile embedded systems. Mobile electronics with rich features most often involve complex computation and intensive processing, which result in short battery lifetime, particularly when low-power design is not taken into consideration. Beyond mobile computers, thermal design also calls for low-power techniques to avoid component overheating, especially with VLSI technology. Low-power design has ushered in a new era. In this thesis we examined several techniques to achieve low-power design for FPGAs, ASICs and processors, where ASICs were the most flexible in exploiting the HW-oriented techniques for low power consumption. We surveyed several power estimation methodologies, all of which were prone to at least one disadvantage. We also compared and analyzed the power and energy consumption of three different designs that perform matrix multiplication on the Altera platform using a state-of-the-art FPGA device. We concluded that NIOS II/e is not an energy-efficient alternative for multiplying nxn matrices compared to HW matrix multipliers on FPGAs, and that configware has enormous potential to reduce energy consumption costs.
|
46 |
ACCELERATING SPARSE MACHINE LEARNING INFERENCE. Ashish Gondimalla (14214179). 17 May 2024 (has links)
Convolutional neural networks (CNNs) have become important workloads due to their impressive accuracy in tasks like image classification and recognition. Convolution operations are compute-intensive, and this cost grows profoundly with newer and better CNN models. However, convolutions come with characteristics, such as sparsity, that can be exploited. In this dissertation, we propose three different works to capture sparsity for faster performance and reduced energy.

The first work is an accelerator design called SparTen for convolutions with fine-grained, two-sided sparsity (i.e., sparsity in both filters and feature maps). SparTen identifies the efficient inner join as the key primitive for hardware acceleration of sparse convolution. In addition, SparTen proposes load-balancing schemes for higher compute-unit utilization. SparTen performs 4.7x, 1.8x, and 3x better than a dense architecture, a one-sided architecture, and SCNN, the previous state-of-the-art accelerator, respectively. The second work, BARISTA, scales up SparTen (and SparTen-like proposals) to a large-scale implementation with as many compute units as recent dense accelerators (e.g., Google's Tensor Processing Unit) to achieve the full speedups afforded by sparsity. However, at such large scales, buffering, on-chip bandwidth, and compute utilization are highly intertwined: optimizing for one factor strains another and may invalidate some optimizations proposed in small-scale implementations. BARISTA proposes novel techniques to balance the three factors in large-scale accelerators. BARISTA performs 5.4x, 2.2x, 1.7x, and 2.5x better than dense, one-sided, naively scaled two-sided, and iso-area two-sided architectures, respectively. The last work, EUREKA, builds an efficient tensor core that executes dense, structured, and unstructured sparsity without losing efficiency. EUREKA achieves this by proposing novel techniques to improve compute utilization by slightly tweaking operand stationarity. EUREKA achieves speedups of 5x and 2.5x, along with energy reductions of 3.2x and 1.7x, over dense and structured-sparse execution, respectively. EUREKA incurs area and power overheads of only 6% and 11.5%, respectively, over Ampere.
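The inner-join primitive that SparTen identifies amounts to intersecting the nonzero coordinates of a filter and a feature-map segment and multiply-accumulating only the matching pairs. A minimal sketch of that idea on compressed (index, value) lists; the layout and names are illustrative, not SparTen's hardware format:

```python
def compress(dense):
    # Keep only nonzeros as (index, value) pairs, sorted by index.
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def sparse_inner_join(a, b):
    # Two-pointer intersection of nonzero indices; multiply only where both are nonzero.
    acc, i, j = 0.0, 0, 0
    while i < len(a) and j < len(b):
        ia, ib = a[i][0], b[j][0]
        if ia == ib:
            acc += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif ia < ib:
            i += 1
        else:
            j += 1
    return acc

# Usage: the dot product of two mostly-zero vectors touches only matching nonzeros.
x = [0, 2, 0, 0, 3, 0, 1, 0]
w = [1, 0, 0, 0, 4, 0, 2, 0]
print(sparse_inner_join(compress(x), compress(w)))   # 3*4 + 1*2 = 14
```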
|