Spelling suggestions: "subject:"curas"" "subject:"cgas""
1 |
Compiling For Coarse-Grained Reconfigurable Architectures Based On Dataflow Execution ParadigmAlle, Mythri 12 1900 (has links) (PDF)
Coarse-Grained Reconfigurable Architectures(CGRAs) can be employed for accelerating computational workloads that demand both flexibility and performance. CGRAs comprise a set of computation elements interconnected using a network and this interconnection of computation elements is referred to as a reconfigurable fabric. The size of application that can be accommodated on the reconfigurable fabric is limited by the size of instruction buffers associated with each Compute element. When an application cannot be accommodated entirely, application is partitioned such that each of these partitions can be executed on the reconfigurable fabric. These partitions are scheduled by an orchestrator. The orchestrator employs dynamic dataflow execution paradigm. Dynamic dataflow execution paradigm has inherent support for synchronization and helps in exploitation of parallelism that exists across application partitions. In this thesis, we present a compiler that targets such CGRAs.
The compiler presented in this thesis is capable of accepting applications specified in C89 standard. To enable architectural design space exploration, the compiler is designed such that it can be customized for several instances of CGRAs employing dataflow execution paradigm at the orchestrator. This can be achieved by specifying the appropriate configuration parameters to the compiler. The focus of this thesis is to provide efficient support for various kinds of parallelism while ensuring correctness. The compiler is designed to support fine-grained task level parallelism that exists across iterations of loops and function calls. Additionally, compiler can also support pipeline parallelism, where a loop is split into multiple stages that execute in a pipelined manner.
The prototype compiler, which targets multiple instances of a CGRA, is demonstrated in this thesis. We used this compiler to target multiple variants of CGRAs employing dataflow execution paradigm. We varied the reconfigur-able fabric, orchestration mechanism employed, size of instruction buffers. We also choose applications from two different domains viz. cryptography and linear algebra. The execution time of the CGRA (the best among all instances) is compared against an Intel Quad core processor. Cryptography applications show a performance improvement ranging from more than one order of magnitude to close to two orders of magnitude. These applications have large amounts of ILP and our compiler could successfully expose the ILP available in these applications. Further, the domain customization also played an important role in achieving good performance. We employed two custom functional units for accelerating Cryptography applications and compiler could efficiently use them. In linear algebra kernels we observe multiple iterations of the loop executing in parallel, effectively exploiting loop-level parallelism at runtime. Inspite of this we notice close to an order of magnitude performance degradation. The reason for this degradation can be attributed to the use of non-pipelined floating point units, and the delays involved in accessing memory. Pipeline parallelism was demonstrated using this compiler for FFT and QR factorization. Thus, the compiler is capable of efficiently supporting different kinds of parallelism and can support complete C89 standard. Further, the compiler can also support different instances of CGRAs employing dataflow execution paradigm.
|
2 |
Case Studies to Learn Human Mapping Strategies in a Variety of Coarse-Grained Reconfigurable ArchitecturesMalla, Tika K. 05 1900 (has links)
Computer hardware and algorithm design have seen significant progress over the years. It is also seen that there are several domains in which humans are more efficient than computers. For example in image recognition, image tagging, natural language understanding and processing, humans often find complicated algorithms quite easy to grasp. This thesis presents the different case studies to learn human mapping strategy to solve the mapping problem in the area of coarse-grained reconfigurable architectures (CGRAs). To achieve optimum level performance and consume less energy in CGRAs, place and route problem has always been a major concern. Making use of human characteristics can be helpful in problems as such, through pattern recognition and experience. Therefore to conduct the case studies a computer mapping game called UNTANGLED was analyzed as a medium to convey insights of human mapping strategies in a variety of architectures. The purpose of this research was to learn from humans so that we can come up with better algorithms to outperform the existing algorithms. We observed how human strategies vary as we present them with different architectures, different architectures with constraints, different visualization as well as how the quality of solution changes with experience. In this work all the case studies obtained from exploiting human strategies provide useful feedback that can improve upon existing algorithms. These insights can be adapted to find the best architectural solution for a particular domain and for future research directions for mapping onto mesh-and- stripe based CGRAs.
|
3 |
Algorithm-Architecture Co-Design for Dense Linear Algebra ComputationsMerchant, Farhad January 2015 (has links) (PDF)
Achieving high computation efficiency, in terms of Cycles per Instruction (CPI), for high-performance computing kernels is an interesting and challenging research area. Dense Linear Algebra (DLA) computation is a representative high-performance computing ap-
plication, which is used, for example, in LU and QR factorizations. Unfortunately, mod-
ern off-the-shelf microprocessors fall significantly short of achieving theoretical lower bound in CPI for high performance computing applications. In this thesis, we perform an in-depth analysis of the available parallelisms and propose suitable algorithmic
and architectural variation to significantly improve the computation efficiency. There
are two standard approaches for improving the computation effficiency, first, to perform
application-specific architecture customization and second, to do algorithmic tuning.
In the same manner, we first perform a graph-based analysis of selected DLA kernels.
From the various forms of parallelism, thus identified, we design a custom processing
element for improving the CPI. The processing elements are used as building blocks for
a commercially available Coarse-Grained Reconfigurable Architecture (CGRA). By per-
forming detailed experiments on a synthesized CGRA implementation, we demonstrate
that our proposed algorithmic and architectural variations are able to achieve lower CPI compared to off-the-shelf microprocessors. We also benchmark against state-of-the-art custom implementations to report higher energy-performance-area product.
DLA computations are encountered in many engineering and scientific computing ap-
plications ranging from Computational Fluid Dynamics (CFD) to Eigenvalue problem.
Traditionally, these applications are written in highly tuned High Performance Comput-
ing (HPC) software packages like Linear Algebra Package (LAPACK), and/or Scalable
Linear Algebra Package (ScaLAPACK). The basic building block for these packages is Ba-
sic Linear Algebra Subprograms (BLAS). Algorithms pertaining LAPACK/ScaLAPACK
are written in-terms of BLAS to achieve high throughput. Despite extensive intellectual
efforts in development and tuning of these packages, there still exists a scope for fur-
ther tuning in this packages. In this thesis, we revisit most prominent and widely used
compute bound algorithms like GMM for further exploitation of Instruction Level Parallelism (ILP). We further look into LU and QR factorizations for generalizations and
exhibit higher ILP in these algorithms. We first accelerate sequential performance of the algorithms in BLAS and LAPACK and then focus on the parallel realization of these
algorithms. Major contributions in the algorithmic tuning in this thesis are as follows:
Algorithms:
We present graph based analysis of General Matrix Multiplication (GMM) and
discuss different types of parallelisms available in GMM
We present analysis of Givens Rotation based QR factorization where we improve
GR and derive Column-wise GR (CGR) that can annihilate multiple elements of a
column of a matrix simultaneously. We show that the multiplications in CGR are
lower than GR
We generalize CGR further and derive Generalized GR (GGR) that can annihilate
multiple elements of the columns of a matrix simultaneously. We show that the
parallelism exhibited by GGR is much higher than GR and Householder Transform
(HT)
We extend generalizations to Square root Free GR (also knows as Fast Givens
Rotation) and Square root and Division Free GR (SDFG) and derive Column-wise
Fast Givens, and Column-wise SDFG . We also extend generalization for complex
matrices and derive Complex Column-wise Givens Rotation
Coarse-grained Recon gurable Architectures (CGRAs) have gained popularity in the
last decade due to their power and area efficiency. Furthermore, CGRAs like REDEFINE also exhibit support for domain customizations. REDEFINE is an array of Tiles where each Tile consists of a Compute Element and a Router. The Routers are responsible
for on-chip communication, while Compute Elements in the REDEFINE can be domain
customized to accelerate the applications pertaining to the domain of interest. In this
thesis, we consider REDEFINE base architecture as a starting point and we design Processing Element (PE) that can execute algorithms in BLAS and LAPACK efficiently.
We perform several architectural enhancements in the PE to approach lower bound of the
CPI. For parallel realization of BLAS and LAPACK, we attach this PE to the Router of
REDEFINE. We achieve better area and power performance compared to the yesteryear
customized architecture for DLA. Major contributions in architecture in this thesis are as follows:
Architecture:
We present design of a PE for acceleration of GMM which is a Level-3 BLAS
operation
We methodically enhance the PE with different features for improvement in the
performance of GMM
For efficient realization of Linear Algebra Package (LAPACK), we use PE that can
efficiently execute GMM and show better performance
For further acceleration of LU and QR factorizations in LAPACK, we identify
macro operations encountered in LU and QR factorizations, and realize them on a
reconfigurable data-path resulting in 25-30% lower run-time
|
Page generated in 0.298 seconds