1 |
Fast Matrix Multiplication via Group ActionsOrem, Hendrik 01 May 2009 (has links)
Recent work has shown that fast matrix multiplication algorithms can be constructed by embedding the two input matrices into a group algebra, applying a generalized discrete Fourier transform, and performing the multiplication in the Fourier basis. Developing an embedding that yields a matrix multiplication algorithm with running time faster than naive matrix multiplication leads to interesting combinatorial problems in group theory. The crux of such an embedding, after a group G has been chosen, lies in finding a triple of subsets of G that satisfy a certain algebraic relation. I show how the process of finding such subsets can in some cases be greatly simplified by considering the action of the group G on an appropriate set X. In particular, I focus on groups acting on regularly branching trees.
|
2 |
Matrix Multiplications on Apache Spark through GPUs / Matrismultiplikationer på Apache Spark med GPUSafari, Arash January 2017 (has links)
In this report, we consider the distribution of large scale matrix multiplications across a group of systems through Apache Spark, where each individual system utilizes Graphical Processor Units (GPUs) in order to perform the matrix multiplication. The purpose of this thesis is to research whether the GPU's advantage in performing parallel work can be applied to a distributed environment, and whether it scales noticeably better than a CPU implementation in a distributed environment. This question was resolved by benchmarking the different implementations at their peak. Based on these benchmarks, it was concluded that GPUs indeed do perform better as long as single precision support is available in the distributed environment. When single precision operations are not supported, GPUs perform much worse due to the low double precision performance of most GPU devices. / I denna rapport betraktar vi fördelningen av storskaliga matrismultiplikationeröver ett Apache Spark kluster, där varje system i klustret delegerar beräkningarnatill grafiska processorenheter (GPU). Syftet med denna avhandling är attundersöka huruvida GPU:s fördel vid parallellt arbete kan tillämpas på en distribuerad miljö, och om det skalar märkbart bättre än en CPU-implementationi en distribuerad miljö. Detta gjordes genom att testa de olika implementationerna i en miljö däroptimal prestanda kunde förväntas. Baserat på resultat ifrån dessa tester drogsslutsatsen att GPU-enheter preseterar bättre än CPU-enheter så länge ramverkethar stöd för single precision beräkningar. När detta inte är fallet så presterar deflesta GPU-enheterna betydligt sämre på grund av deras låga double-precisionprestanda.
|
3 |
Optimizing Sparse Matrix-Matrix Multiplication on a Heterogeneous CPU-GPU PlatformWu, Xiaolong 16 December 2015 (has links)
Sparse Matrix-Matrix multiplication (SpMM) is a fundamental operation over irregular data, which is widely used in graph algorithms, such as finding minimum spanning trees and shortest paths. In this work, we present a hybrid CPU and GPU-based parallel SpMM algorithm to improve the performance of SpMM. First, we improve data locality by element-wise multiplication. Second, we utilize the ordered property of row indices for partial sorting instead of full sorting of all triples according to row and column indices. Finally, through a hybrid CPU-GPU approach using two level pipelining technique, our algorithm is able to better exploit a heterogeneous system. Compared with the state-of-the-art SpMM methods in cuSPARSE and CUSP libraries, our approach achieves an average of 1.6x and 2.9x speedup separately on the nine representative matrices from University of Florida sparse matrix collection.
|
4 |
Analysis-Driven Design of Parallel Floating-Point Matrix Multiplication for Implementation in Reconfigurable LogicKhayyat, Ahmad 06 August 2013 (has links)
The objective of this research is to design an efficient and flexible implementation of parallel matrix multiplication for FPGA devices by
analyzing the computation and studying its design space. In order to adapt to the FPGA platform, the design employs blocking
and parallelization. Blocked matrix multiplication enables processing arbitrarily large matrices using limited memory capacity, and reduces the bandwidth requirements across the device boundaries by reusing available elements. Exploiting the inherent parallelism in the matrix multiplication computation improves the performance and utilizes the available reconfigurable FPGA resources.
The design is constructed by identifying the main design decisions and evaluating the alternatives for each one. The considered design decisions include the scheduling of block transfers, the scheduling of arithmetic operations in a block multiplication, the extent to which the parallelism is exploited, determining the block sizes and shapes, and the use of double buffers for storing matrix blocks. The choices offered by each decision are evaluated analytically in terms of their performance and utilization of FPGA resources. Based on this analysis, a detailed, flexible design that accommodates various alternative design choices is described. The design is optimized for matrices of floating-point elements, and for the FPGA target platform. Prior work is analyzed based on the considered design choices in order
to identify the similarities and the differences.
The proposed design is implemented using the VHDL hardware description language. The implementation is used to verify the correctness of the design and to confirm the analysis of the design decisions. Correctness is verified both by simulation using the ModelSim logic simulator, and in hardware through compiling the implementation using the Altera Quartus II CAD software and testing it on the Altera DE4 board, featuring a Stratix IV EP4SGX530C2 FPGA device. The implementation supports a range of parameters to facilitate the experimental evaluation of design choices.
Experimental results show that the design scales linearly with respect to the consumed resources. Although increasing the system size reduces the maximum operating frequency, it also increases the parallelism, resulting in a higher performance. For instance, with 8 floating-point arithmetic units, the system runs at 320 MHz, which corresponds to a performance of 4 GFLOPS, whereas with 64 arithmetic units, it runs at 160 MHz, which corresponds to a performance of 16 GFLOPS. It is also shown that using a transfer schedule based on inner products reduces the transfer time by up to 50% compared to other schedules. Although using square blocks minimizes the number of required block multiplications, other non-square blocks minimize the transfer time, resulting in better total times. / Thesis (Ph.D, Electrical & Computer Engineering) -- Queen's University, 2013-08-03 12:46:13.484
|
5 |
Analysis-Driven Design of Parallel Floating-Point Matrix Multiplication for Implementation in Reconfigurable LogicKhayyat, Ahmad 06 August 2013 (has links)
The objective of this research is to design an efficient and flexible implementation of parallel matrix multiplication for FPGA devices by
analyzing the computation and studying its design space. In order to adapt to the FPGA platform, the design employs blocking
and parallelization. Blocked matrix multiplication enables processing arbitrarily large matrices using limited memory capacity, and reduces the bandwidth requirements across the device boundaries by reusing available elements. Exploiting the inherent parallelism in the matrix multiplication computation improves the performance and utilizes the available reconfigurable FPGA resources.
The design is constructed by identifying the main design decisions and evaluating the alternatives for each one. The considered design decisions include the scheduling of block transfers, the scheduling of arithmetic operations in a block multiplication, the extent to which the parallelism is exploited, determining the block sizes and shapes, and the use of double buffers for storing matrix blocks. The choices offered by each decision are evaluated analytically in terms of their performance and utilization of FPGA resources. Based on this analysis, a detailed, flexible design that accommodates various alternative design choices is described. The design is optimized for matrices of floating-point elements, and for the FPGA target platform. Prior work is analyzed based on the considered design choices in order
to identify the similarities and the differences.
The proposed design is implemented using the VHDL hardware description language. The implementation is used to verify the correctness of the design and to confirm the analysis of the design decisions. Correctness is verified both by simulation using the ModelSim logic simulator, and in hardware through compiling the implementation using the Altera Quartus II CAD software and testing it on the Altera DE4 board, featuring a Stratix IV EP4SGX530C2 FPGA device. The implementation supports a range of parameters to facilitate the experimental evaluation of design choices.
Experimental results show that the design scales linearly with respect to the consumed resources. Although increasing the system size reduces the maximum operating frequency, it also increases the parallelism, resulting in a higher performance. For instance, with 8 floating-point arithmetic units, the system runs at 320 MHz, which corresponds to a performance of 4 GFLOPS, whereas with 64 arithmetic units, it runs at 160 MHz, which corresponds to a performance of 16 GFLOPS. It is also shown that using a transfer schedule based on inner products reduces the transfer time by up to 50% compared to other schedules. Although using square blocks minimizes the number of required block multiplications, other non-square blocks minimize the transfer time, resulting in better total times. / Thesis (Ph.D, Electrical & Computer Engineering) -- Queen's University, 2013-08-03 12:46:13.484
|
6 |
Využití GPU pro náročné výpočty / Using GPU for HPCMáček, Branislav Unknown Date (has links)
Recently there was a significant grow in building HPC systems. Nowadays they are building from mainstream computer components. One of them is graphics accelerators with GPU. This thesis deals with description of graphics accelerators. It examines possibilities usage. GPU chip has hundreds simple processors. This thesis examine possibilities how to benefit from these parallel processors. It contains description of several testing applications, discuss results from experiments and compares them with another components used for HPC.
|
7 |
Aplicações de matrizes no ensino médio / Applications of matrices in the secondary schoolFerreira, Silvia da Rocha Izidoro 23 April 2013 (has links)
Esta dissertação tem como objetivo salientar a utilidade e importância de cálculos matriciais no ensino médio. Para tanto, foram estudados alguns tópicos que descrevem situações que necessitam de recursos gerados por operações matriciais. Foi observado que esses tópicos apresentam situações que evidenciam a utilidade da multiplicação de matrizes não somente no desenvolvimento teórico, mas também nas aplicações de matrizes, e têm potencial para serem abordados no ensino médio / The aim is this work is to stress on the use of algebraic operations with matrices in the mathematics teaching for secondary school students. For this purpose, we studied some topics that require algebraic operations with matrices. It was observed that these topics reveal circumstances in which the matrix multiplication is not only useful in the theoretical development but also in the applications. In addition, the studied showed that these themes have potential to be considered in the secondary school
|
8 |
Aplicações de matrizes no ensino médio / Applications of matrices in the secondary schoolSilvia da Rocha Izidoro Ferreira 23 April 2013 (has links)
Esta dissertação tem como objetivo salientar a utilidade e importância de cálculos matriciais no ensino médio. Para tanto, foram estudados alguns tópicos que descrevem situações que necessitam de recursos gerados por operações matriciais. Foi observado que esses tópicos apresentam situações que evidenciam a utilidade da multiplicação de matrizes não somente no desenvolvimento teórico, mas também nas aplicações de matrizes, e têm potencial para serem abordados no ensino médio / The aim is this work is to stress on the use of algebraic operations with matrices in the mathematics teaching for secondary school students. For this purpose, we studied some topics that require algebraic operations with matrices. It was observed that these topics reveal circumstances in which the matrix multiplication is not only useful in the theoretical development but also in the applications. In addition, the studied showed that these themes have potential to be considered in the secondary school
|
9 |
High-Performance Matrix Multiplication: Hierarchical Data Structures, Optimized Kernel Routines, and Qualitative Performance ModelingWu, Wenhao 02 August 2003 (has links)
The optimal implementation of matrix multiplication on modern computer architectures is of great importance for scientific and engineering applications. However, achieving the optimal performance for matrix multiplication has been continuously challenged both by the ever-widening performance gap between the processor and memory hierarchy and the introduction of new architectural features in modern architectures. The conventional way of dealing with these challenges benefits significantly from the blocking algorithm, which improves the data locality in the cache memory, and from the highly tuned inner kernel routines, which in turn exploit the architectural aspects on the specific processor to deliver near peak performance. A state-of-art improvement of the blocking algorithm is the self-tuning approach that utilizes "heroic" combinatorial optimization of parameters spaces. Other recent research approaches include the approach that explicitly blocks for the TLB (Translation Lookaside Buffer) and the hierarchical formulation that employs memoryriendly Morton Ordering (a spaceilling curve methodology). This thesis compares and contrasts the TLB-blocking-based and Morton-Order-based methods for dense matrix multiplication, and offers a qualitative model to explain the performance behavior. Comparisons to the performance of self-tuning library and the "vendor" library are also offered for the Alpha architecture. The practical benchmark experiments demonstrate that neither conventional blocking-based implementations nor the self-tuning libraries are optimal to achieve consistent high performance in dense matrix multiplication of relatively large square matrix size. Instead, architectural constraints and issues evidently restrict the critical path and options available for optimal performance, so that the relatively simple strategy and framework presented in this study offers higher and flatter overall performance. Interestingly, maximal inner kernel efficiency is not a guarantee of global minimal multiplication time. Also, efficient and flat performance is possible at all problem sizes that fit in main memory, rather than "jagged" performance curves often observed in blocking and self-tuned blocking libraries.
|
10 |
Nonlinear Prediction in Credit Forecasting and Cloud Computing Deployment OptimizationJarrett, Nicholas Walton Daniel January 2015 (has links)
<p>This thesis presents data analysis and methodology for two prediction problems. The first problem is forecasting midlife credit ratings from personality information collected during early adulthood. The second problem is analysis of matrix multiplication in cloud computing.</p><p>The goal of the credit forecasting problem is to determine if there is a link between personality assessments of young adults with their propensity to develop credit in middle age. The data we use is from a long term longitudinal study of over 40 years. We do find an association between credit risk and personality in this cohort Such a link has obvious implications for lenders but also can be used to improve social utility via more efficient resource allocation</p><p>We analyze matrix multiplication in the cloud and model I/O and local computation for individual tasks. We established conditions for which the distribution of job completion times can be explicitly obtained. We further generalize these results to cases where analytic derivations are intractable.</p><p>We develop models that emulate the multiplication procedure, allowing job times for different deployment parameter settings to be emulated after only witnessing a subset of tasks, or subsets of tasks for nearby deployment parameter settings. </p><p>The modeling framework developed sheds new light on the problem of determining expected job completion time for sparse matrix multiplication.</p> / Dissertation
|
Page generated in 0.1428 seconds