1 |
SUPPORTING APPROXIMATE COMPUTING ON COARSE GRAINED RE-CONFIGURABLE ARRAY ACCELERATORS
Dickerson, Jonathan 01 December 2019
Recent research has shown that approximate computing and Coarse-Grained Reconfigurable Arrays (CGRAs) are promising computing paradigms for reducing energy consumption in compute-intensive environments. CGRAs provide a middle ground between energy-inefficient yet flexible Field-Programmable Gate Arrays (FPGAs) and energy-efficient yet inflexible Application-Specific Integrated Circuits (ASICs). Integrating approximate computing into CGRAs yields substantial gains in energy efficiency at the cost of arithmetic precision. However, some applications require a certain level of accuracy in their calculations to perform their tasks effectively. The ability to control the accuracy of approximate computing at run time is an emerging topic.
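
As an illustration of trading arithmetic precision for cheaper computation with a run-time accuracy knob, the following minimal Python sketch truncates the low-order bits of both operands before adding them. This is a generic textbook-style approximation, not the method of the thesis; the function names (`approx_add`, `pick_dropped_bits`) and the 16-bit default width are assumptions made for the example.

```python
def approx_add(a: int, b: int, dropped_bits: int) -> int:
    """Add two non-negative integers while ignoring their lowest `dropped_bits` bits."""
    mask = ~((1 << dropped_bits) - 1)  # clears the low-order bits of each operand
    return (a & mask) + (b & mask)


def relative_error(exact: int, approx: int) -> float:
    """Relative deviation of the approximate result from the exact one."""
    return abs(exact - approx) / exact if exact else 0.0


def pick_dropped_bits(a: int, b: int, max_error: float, width: int = 16) -> int:
    """Hypothetical run-time knob: the largest truncation that stays within max_error."""
    exact = a + b
    for k in range(width, -1, -1):  # try the most aggressive setting first
        if relative_error(exact, approx_add(a, b, k)) <= max_error:
            return k
    return 0


if __name__ == "__main__":
    k = pick_dropped_bits(1000, 2000, max_error=0.01)
    print(k, approx_add(1000, 2000, k))
```

In hardware, dropping bits would correspond to gating part of the adder; the loop here only shows how an accuracy bound could select the setting.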
|
2 |
Register File Organization for Coarse-Grained Reconfigurable Architectures: Compiler-Microarchitecture Perspective
January 2014
abstract: Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising fabric for improving the performance and power-efficiency of computing devices. CGRAs are composed of components that are well optimized for executing loops; the rotating register file is one such component. Because register indexes rotate in a rotating register file, it is very challenging, if possible at all, to hold and properly index memory addresses (pointers) and static values. In this thesis, different structures for CGRA register files are investigated. These structures are experimentally compared in terms of the performance of mapped applications, design frequency, and area. It is shown that a register file that can be logically partitioned into rotating and non-rotating regions is an excellent choice because it imposes the minimum restriction on the underlying CGRA mapping algorithm while resulting in efficient resource utilization. / Dissertation/Thesis / Masters Thesis Computer Science 2014
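
The indexing problem described above can be pictured with a small Python model. It is only a sketch under assumptions: the class name `PartitionedRegisterFile`, the decrementing rotation base, and the layout (rotating registers first, static registers after) are chosen for illustration and are not taken from the thesis.

```python
class PartitionedRegisterFile:
    """Register file whose first `rotating_size` entries rotate; the rest are static."""

    def __init__(self, size: int, rotating_size: int) -> None:
        if not 0 <= rotating_size <= size:
            raise ValueError("rotating region must fit inside the register file")
        self.regs = [0] * size
        self.rotating_size = rotating_size
        self.base = 0  # rotation offset, changed once per loop iteration

    def _physical(self, index: int) -> int:
        # Rotating region: the logical index is offset by the current base.
        if index < self.rotating_size:
            return (index + self.base) % self.rotating_size
        # Non-rotating region: logical and physical indexes coincide.
        return index

    def read(self, index: int) -> int:
        return self.regs[self._physical(index)]

    def write(self, index: int, value: int) -> None:
        self.regs[self._physical(index)] = value

    def rotate(self) -> None:
        """Advance one iteration: a value written to r[i] is then visible as r[i+1]."""
        if self.rotating_size:
            self.base = (self.base - 1) % self.rotating_size


if __name__ == "__main__":
    rf = PartitionedRegisterFile(size=8, rotating_size=4)
    rf.write(0, 11)   # loop-carried value in the rotating region
    rf.write(6, 99)   # pointer or constant in the non-rotating region
    rf.rotate()
    print(rf.read(1), rf.read(6))
```

After `rotate()`, the loop-carried value has moved to a new logical index while the value in the static region is still found at index 6, which is why pointers and constants are easier to keep in a non-rotating partition.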
|
3 |
Performance Improvement of Adaptive Processors
Döbrich, Stefan 03 August 2017
Improving a computer's performance has been of major interest to all users around the world, from computing centers to private persons, ever since computer science entered the stage and then the spotlight in the 1940s. Most often, this is achieved either by exchanging parts of the computer for better-performing parts, called an upgrade, or by simply buying a newer and better computer.
Another approach, which originates from the scientific community, is the optimization of the source code of an application. Here the application programmer capitalizes on their knowledge of the underlying platform and its tool-chain in order to obtain tweaked binary code, which results in better performance. Clearly, this technique will never be an option for consumer electronics or for people outside the area of programming and software development. Traditionally, these users stick with the upgrade/buy-new method.
In recent years, consumer electronics have evolved into multi-tool devices that are capable of almost any functionality, owing to their internet connection and their ability to dynamically download and install new software. Inevitably, an application may be too demanding for a given underlying hardware revision. As these new devices are built in a monolithic way, a hardware upgrade is not an option. Nonetheless, most users do not want to buy a new device every time this happens. Thus, it is necessary to provide a mechanism that allows the processor to adapt to a given application at runtime, thereby improving its own performance. This thesis presents three major approaches to such runtime dynamic application acceleration.
|
4 |
Design Space Exploration of Domain Specific CGRAs Using Crowd-sourcing
Sistla, Anil Kumar 08 1900
CGRAs (coarse-grained reconfigurable array architectures) try to fill the gap between FPGAs and ASICs. Over three decades, research on CGRA design has produced a number of architectures. Each of these designs lies at a different point on a line drawn between FPGAs and ASICs, depending on the tradeoffs and design choices made during the design of the architecture. Design space exploration (DSE) therefore plays a very important role in the circuit design process. In this work I propose that the design space exploration of CGRAs can be done quickly and efficiently through crowd-sourcing and a game-driven approach based on an interactive mapping game, UNTANGLED, and a design environment called SmartBricks. Both UNTANGLED and SmartBricks have been developed by our research team at the Reconfigurable Computing Lab, UNT. I present the results of design space exploration of domain-specific reconfigurable architectures, comparing stripe versus mesh styles and heterogeneous versus homogeneous designs. I also compare the results obtained from different interconnection topologies in the mesh style. These results show that this approach offers quick DSE for designers and also provides low-power architectures for a suite of benchmarks. All results were obtained using standard cell ASICs in a 90 nm process.
|
5 |
SmartCell: An Energy Efficient Reconfigurable Architecture for Stream Processing
Liang, Cao 04 May 2009
Data streaming applications, such as signal processing and multimedia applications, often require high computing capacity yet also have stringent power constraints, especially in portable devices. General-purpose processors can no longer meet these requirements due to their sequential software execution. Although fixed-logic ASICs are usually able to achieve the best performance and energy efficiency, ASIC solutions are expensive to design, and their lack of flexibility makes them unable to accommodate functional changes or new system requirements. Reconfigurable systems have long been proposed to bridge the gap between the flexibility of software processors and the performance of hardware circuits. Unfortunately, mainstream reconfigurable FPGA designs suffer from high costs in area, power consumption and speed due to the routing area overhead and timing penalty of their bit-level fine granularity. In this dissertation, we present the architecture design, application mapping and performance evaluation of a novel coarse-grained reconfigurable architecture, named SmartCell, for data streaming applications. The system tiles a large number of computing cell units in a 2D mesh structure, with four coarse-grained processing elements inside each cell forming a quad structure. Based on this structure, a hierarchical reconfigurable network is developed to provide flexible on-chip communication among computing resources, including a fully connected crossbar, nearest-neighbor connections and a clustered mesh network. SmartCell can be configured to operate in various computing modes, including SIMD, MIMD and systolic array styles, to suit different application requirements. The coarse-grained SmartCell has the potential to improve power and energy efficiency compared with fine-grained FPGAs. It is also able to provide high performance comparable to fixed-function ASICs through deep pipelining and a large amount of computing parallelism.
Dynamic reconfiguration is also addressed in this dissertation. To evaluate its performance, a set of benchmark applications has been successfully mapped onto the SmartCell system, ranging from signal processing and multimedia applications to scientific computing and data encryption. A 4 by 4 SmartCell prototype system was initially designed as a CMOS standard cell ASIC in a 130 nm process. The chip occupies 8.2 mm square and dissipates 1.6 mW/MHz under full operation. The results show that SmartCell can bridge the performance and flexibility gap between logic-specific ASICs and reconfigurable FPGAs. SmartCell is also about 8% and 69% more energy efficient and achieves 4x and 2x throughput gains compared with Montium and RaPiD CGRAs. Based on our experience with the first SmartCell prototype, an improved SmartCell-II architecture was developed, which includes distributed data memory, a segmented instruction format and improved dynamic configuration schemes. A novel parallel FFT algorithm with balanced workloads and optimized data flow was also proposed and successfully mapped onto SmartCell-II for performance evaluation. A 4 by 4 SmartCell-II prototype was then synthesized into standard cell ASICs in a 90 nm process. The results show that SmartCell-II consists of 2.0 million gates and is fully functional at up to 295 MHz with 3.1 mW/MHz power consumption. SmartCell-II is about 3.6 and 28.9 times more energy efficient than Xilinx FPGA and TI's high performance DSPs, respectively. It is concluded that SmartCell is able to provide a promising solution for achieving high performance and energy efficiency in future data streaming applications.
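
The mW/MHz figures quoted above can be turned into more familiar quantities with a few lines of Python. The sketch assumes that power scales linearly with clock frequency, which the mW/MHz unit implies but the abstract does not state for the whole operating range; the helper names are hypothetical.

```python
def power_mw(mw_per_mhz: float, freq_mhz: float) -> float:
    """Dissipated power in mW at a given clock, assuming linear scaling with frequency."""
    return mw_per_mhz * freq_mhz


def energy_per_cycle_nj(mw_per_mhz: float) -> float:
    """1 mW/MHz = 1e-3 W / 1e6 Hz = 1e-9 J, so the figure equals nJ per clock cycle."""
    return mw_per_mhz


if __name__ == "__main__":
    # SmartCell-II figures from the abstract: 3.1 mW/MHz, up to 295 MHz.
    print(power_mw(3.1, 295.0), "mW at the maximum reported clock")
    print(energy_per_cycle_nj(3.1), "nJ per cycle")
```

Under that assumption, the SmartCell-II prototype would dissipate about 0.91 W at its maximum reported clock of 295 MHz.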
|
6 |
Improving CGRA Utilization by Enabling Multi-threading for Power-efficient Embedded Systems
January 2011
abstract: Performance improvements have largely followed Moore's Law thanks to technology scaling. In order to continue improving performance, power-efficiency must be improved. Better technology has improved power-efficiency, but this has a limit. Multi-core architectures have been shown to be an additional aid in the pursuit of increased power-efficiency. Accelerators are growing in popularity as the next means of achieving power-efficient performance. Accelerators such as Intel SSE are ideal, but prove difficult to program. FPGAs, on the other hand, are less efficient due to their fine-grained reconfigurability. A middle ground is found in CGRAs, which are highly power-efficient yet largely programmable accelerators. Power-efficiencies of 100s of GOPs/W have been estimated, more than 2 orders of magnitude greater than current processors. Currently, CGRAs are limited in their applicability because they can accelerate only a single thread at a time. This limitation becomes especially apparent as multi-core/multi-threaded processors have moved into the mainstream. This limitation is removed by enabling multi-threading on CGRAs through a software-oriented approach. The key capability in this solution is enabling quick run-time transformation of schedules to execute on targeted portions of the CGRA. This allows the CGRA to be shared among multiple threads simultaneously. Analysis shows that enabling multi-threading has very small costs but provides very large benefits (less than 1% single-threaded performance loss but nearly 300% CGRA throughput increase). By increasing the dynamism of CGRA scheduling, overall performance of an optimized system is shown to increase by almost 350% over that of a single-threaded CGRA, and the system is nearly 20x faster than the same system with no CGRA in a highly threaded environment. / Dissertation/Thesis / M.S. Computer Science 2011
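
One simple way to picture retargeting a schedule at a portion of the array is to translate its processing-element coordinates into a free column range. The Python sketch below does only that. The abstract does not describe the actual transformation, so the `Slot` record, the `relocate` function, and the column-wise partitioning are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Slot(NamedTuple):
    """One scheduled operation: which PE (row, col) executes `op` in which cycle."""
    cycle: int
    row: int
    col: int
    op: str


def relocate(schedule: List[Slot], col_offset: int,
             region_width: int, array_width: int) -> List[Slot]:
    """Shift a schedule compiled for columns [0, region_width) to start at col_offset."""
    if col_offset < 0 or col_offset + region_width > array_width:
        raise ValueError("target region does not fit on the array")
    moved = []
    for slot in schedule:
        if not 0 <= slot.col < region_width:
            raise ValueError("schedule uses a column outside its region")
        moved.append(slot._replace(col=slot.col + col_offset))
    return moved


if __name__ == "__main__":
    kernel = [Slot(0, 0, 0, "load"), Slot(1, 0, 1, "mul"), Slot(2, 1, 1, "store")]
    thread_a = relocate(kernel, col_offset=0, region_width=2, array_width=4)
    thread_b = relocate(kernel, col_offset=2, region_width=2, array_width=4)
    # The two threads now occupy disjoint columns of a 4-column array.
    print({s.col for s in thread_a}, {s.col for s in thread_b})
```

Because the translation is a single pass over the schedule, it is cheap enough to imagine doing at run time, which is the property the abstract emphasizes.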
|
7 |
ENERGY EFFICIENCY EXPLORATION OF COARSE-GRAIN RECONFIGURABLE ARCHITECTURE WITH EMERGING NONVOLATILE MEMORY
Liu, Xiaobin 18 March 2015
With the rapid growth in consumer electronics, people expect thin, smart and powerful devices, e.g. Google Glass and other wearable devices. However, as portable electronic products become smaller, energy consumption becomes an issue that limits the development of portable systems due to battery lifetime. In general, simply reducing device size cannot fully address the energy issue.
To tackle this problem, we propose an on-chip interconnect infrastructure and program storage structure for a coarse-grained reconfigurable architecture (CGRA) with emerging non-volatile embedded memory (MRAM). The interconnect is composed of a matrix of time-multiplexed switchboxes which can be dynamically reconfigured with the goal of energy reduction. The number of processors performing computation can also be adapted. The use of MRAM provides access to high-density storage and lower memory energy consumption compared with more standard SRAM technologies. The combination of CGRA, MRAM, and flexible on-chip interconnection is considered for signal processing. This application domain is of interest because of its time-varying computing demands.
To evaluate CGRA architectural features, prototype architectures have been implemented in a field-programmable gate array (FPGA). Measurements of energy, power, instruction count, and execution time performance are considered for a scalable number of processors. Applications such as adaptive Viterbi decoding and Reed-Solomon coding are used for evaluation. To complete this thesis, a time-scheduled switchbox was integrated into our CGRA model. This model was prototyped on an FPGA. It is shown that energy consumption can be reduced by about 30% if dynamic design reconfiguration is performed.
|
8 |
Performance Improvement of Adaptive Processors: Hardware Synthesis, Instruction Folding and Microcode Assembly
Döbrich, Stefan 28 January 2013
Improving a computer's performance has been of major interest to all users around the world, from computing centers to private persons, ever since computer science entered the stage and then the spotlight in the 1940s. Most often, this is achieved either by exchanging parts of the computer for better-performing parts, called an upgrade, or by simply buying a newer and better computer.
Another approach, which originates from the scientific community, is the optimization of the source code of an application. Here the application programmer capitalizes on their knowledge of the underlying platform and its tool-chain in order to obtain tweaked binary code, which results in better performance. Clearly, this technique will never be an option for consumer electronics or for people outside the area of programming and software development. Traditionally, these users stick with the upgrade/buy-new method.
In recent years, consumer electronics have evolved into multi-tool devices that are capable of almost any functionality, owing to their internet connection and their ability to dynamically download and install new software. Inevitably, an application may be too demanding for a given underlying hardware revision. As these new devices are built in a monolithic way, a hardware upgrade is not an option. Nonetheless, most users do not want to buy a new device every time this happens. Thus, it is necessary to provide a mechanism that allows the processor to adapt to a given application at runtime, thereby improving its own performance. This thesis presents three major approaches to such runtime dynamic application acceleration.
1 Introduction
1.1 Motivation
1.2 Targets and Aims
1.3 Thesis Outline
2 AMIDAR - A Runtime Reconfigurable Processor
2.1 Overall Processor Architecture
2.2 Principle of Operation
2.3 Applicability of the AMIDAR Model
2.4 Adaptivity in AMIDAR Processors
2.5 Relations to Existing Processor Architectures
3 Applicability to Different Instruction Set Architectures
3.1 Supported Instruction Set Architectures
3.2 Selecting an ISA for Hardware Acceleration
3.3 A Detailed Look at an AMIDAR Based Java Processor
3.4 Example Token Sequence and Execution Trace
3.5 Performance Comparison of AMIDAR and IA32 Processors
4 Hotspot Evaluation
5 Runtime Reconfiguration of Processors
5.1 The Idea of Processor Reconfiguration
5.2 Targets and Aims for Efficient Processor Extensibility
6 Hardware Synthesis
6.1 The Evolution of Coarse Grain Reconfigurable Computing
6.2 The CGRA Target Architecture
6.3 Hardware Synthesis
6.4 Evaluation and Results of Hardware Synthesis
6.5 Saving Hardware With Heterogeneous CGRAs
6.6 The Size of Token Sets for Synthesized Functional Units
6.7 The Runtime Consumption of Performance Acceleration
7 Instruction Folding
7.1 The General Idea Behind Instruction Folding
7.2 General Classification of Folding Strategies
7.3 Folding Based on Instruction Type Pattern
7.4 Java Bytecode Folding Based on Behavioural Pattern
7.5 Common Applications of Instruction Folding
7.6 Instruction Folding and the AMIDAR Execution Model
8 Assembly of Microinstruction Groups
8.1 Motivation and General Idea
8.2 The Basic Token Set Assembly Algorithm
8.3 Algorithmic Extensions
8.4 Synthilation for an Unaltered Basic Processor
8.5 Synthilation Performance on Multi-ALU Processors
8.6 Runtime Characteristics of Synthilation Algorithms
9 Comparison
9.1 Speedup Comparison
9.2 Runtime and Complexity
9.3 Token Memory Consumption
9.4 Consumed Hardware Resources
10 Conclusion
10.1 Realization of Targets and Aims
10.2 The Ideal Use Case for Each Acceleration Approach
10.3 Limitations and Drawbacks
10.4 Summary
A Benchmark Applications
A.1 Cryptographic Ciphers
A.2 Hash Functions and Message Digests
A.3 Image Processing Filters
A.4 Jpeg Encoder
B Benchmark Measurement Values
B.1 Measurements of Instruction Set Evaluation
B.2 Measurement Values of Hardware Synthesis
B.3 Measurement Values of Instruction Folding
B.4 Measurement Values of Token Set Synthilation
|
9 |
Low Density Parity Check Encoder and Decoder on SiLago Coarse Grain Reconfigurable Architecture
Kong, Weijiang January 2019
Low density parity check (LDPC) code is an error correction code that has been widely adopted as an optional error-correcting operation in most of today's communication protocols. Current designs of ASIC- or FPGA-based LDPC accelerators can reach Gbit/s data rates. However, the hardware cost of ASIC-based methods and the related interface is too high for them to be integrated into coarse grain reconfigurable architectures (CGRA). Moreover, on platforms aimed at high-level synthesis or system-level synthesis, they do not provide flexibility in low-performance, low-cost design scenarios. In this degree project, we establish connectivity between the SiLago CGRA and a typical QC-LDPC code defined in the IEEE 802.11n standard. We design lightweight LDPC encoder and decoder blocks using the FSM+Datapath design pattern. The encoder provides sufficient throughput and consumes very little area and power. The decoder provides sufficient performance for low-speed modulations while consuming significantly fewer hardware resources. Both encoder and decoder are capable of cooperating with the SiLago-based DRRA through DiMArch, a standard Network on Chip (NoC) based shared memory, so extra interface hardware is no longer necessary. We verified our design through RTL simulation and synthesis. The encoder went through logic and physical synthesis, while the decoder went through only logic synthesis. The results show that our design is closely coupled with the SiLago CGRA while providing a low-performance, low-cost solution.
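
A QC-LDPC code such as the one referenced above describes its parity-check matrix compactly: a small base matrix of shift values is expanded into Z x Z blocks, each either all-zero or a cyclically shifted identity matrix. The Python sketch below shows that expansion and a syndrome check. The 2 x 3 base matrix and Z = 3 are toy values chosen for readability, not taken from IEEE 802.11n, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def circulant(z: int, shift: int) -> Matrix:
    """Z x Z identity with columns cyclically shifted right by `shift`; -1 is a zero block."""
    if shift < 0:
        return [[0] * z for _ in range(z)]
    return [[1 if c == (r + shift) % z else 0 for c in range(z)] for r in range(z)]


def expand(base: Matrix, z: int) -> Matrix:
    """Expand a base matrix of shift values into the full binary parity-check matrix."""
    h: Matrix = []
    for base_row in base:
        blocks = [circulant(z, s) for s in base_row]
        for r in range(z):
            h.append([bit for block in blocks for bit in block[r]])
    return h


def syndrome(h: Matrix, word: List[int]) -> List[int]:
    """H * word over GF(2); all zeros means every parity check is satisfied."""
    return [sum(a & b for a, b in zip(row, word)) % 2 for row in h]


if __name__ == "__main__":
    toy_base = [[0, 1, -1],
                [2, -1, 0]]          # hypothetical shift values
    h = expand(toy_base, z=3)        # 6 x 9 binary matrix
    received = [0] * 9
    received[0] = 1                  # single bit error on the all-zero codeword
    print(syndrome(h, received))
```

Storing only shift values is what makes lightweight encoder and decoder hardware feasible: a block row can be processed with a cyclic shifter instead of a general sparse-matrix lookup.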
|
10 |
Architecture and Programming Model Support for Reconfigurable Accelerators in Multi-Core Embedded Systems / Architecture et modèle de programmation pour accélérateurs reconfigurables dans les systèmes embarqués multi-coeurs
Das, Satyajit 04 June 2018
Emerging trends in embedded systems and applications demand high throughput and low power consumption. Due to the increasing demand for low-power computing and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient hardware accelerators.
The main drawback of hardware accelerators is that they are not programmable. Their utilization can therefore be low, since each performs one specific function, and increasing the number of accelerators in a system on chip (SoC) causes scalability issues. Programmable accelerators provide flexibility and solve the scalability issues. A Coarse-Grained Reconfigurable Array (CGRA) architecture, consisting of several processing elements with word-level granularity, is a promising choice of programmable accelerator. Motivated by the promising characteristics of programmable accelerators, this thesis studies the potential of CGRAs in near-threshold computing platforms and develops an end-to-end CGRA research framework. The major contributions of this framework are CGRA design, implementation, integration in a computing system, and compilation for CGRA. First, the design and implementation of a CGRA named Integrated Programmable Array (IPA) is presented. Next, the problem of mapping applications with control and data flow onto a CGRA is formulated. From this formulation, several efficient algorithms are developed using the internal resources of a CGRA, with a vision for low-power acceleration. The algorithms are integrated into an automated compilation flow. Finally, the IPA accelerator is integrated into PULP, a Parallel Ultra-Low-Power Processing-Platform, to explore heterogeneous computing.
|