11

System Level Exploration of RRAM for SRAM Replacement

Dogan, Rabia January 2013 (has links)
Effective use of chip area plays an essential role in System-on-Chip (SoC) design. On-chip memories now take up more than 50% of the total die area and are responsible for more than 40% of the total energy consumption; cache memory alone occupies 30% of the on-chip area in the latest microprocessors. This thesis project, "System Level Exploration of RRAM for SRAM Replacement", describes a Resistive Random Access Memory (RRAM) based memory organization for Coarse-Grained Reconfigurable Array (CGRA) processors. Compared to a conventional Static Random Access Memory (SRAM) based organization, the RRAM-based memory organization offers benefits in both energy and area requirements. Due to the growing problems that conventional memories face with Dynamic Voltage Scaling (DVS), emerging memory technologies have gained importance. RRAM is typically seen as a candidate to replace non-volatile memory (NVM) as Flash approaches its scaling limits. Replacing SRAM in the lowest layers of the memory hierarchy of embedded systems with RRAM is an attractive research topic: RRAM technology offers reduced energy and area requirements, but it has limitations with regard to endurance and write latency. Because the technological obstacles to solving RRAM's write-related issues are unlikely to disappear, it is beneficial to explore memory access schemes that tolerate the longer write times. Since the RRAM write time cannot realistically be reduced, we derive instruction memory and data memory access schemes that tolerate the longer writes. We present an instruction memory access scheme that copes with these problems, and, in addition to the modified instruction memory architecture, we investigate the effect of the longer write times on the data memory. Experimental results show that the proposed architectural modifications can reduce read energy consumption by a significant margin without any performance penalty.
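
As a back-of-the-envelope illustration of why a read-dominated instruction memory can favour RRAM despite its costlier writes, the sketch below evaluates a first-order energy model (reads, writes and leakage). All energy numbers and access counts are hypothetical placeholders, not figures from the thesis.

```python
# First-order memory energy model: a sketch of the trade-off explored above.
# Every number below is a made-up placeholder, not a thesis or datasheet value.

def memory_energy(n_reads, n_writes, e_read_pj, e_write_pj, leak_uw, runtime_ms):
    """Total energy in microjoules for a simple read/write/leakage model."""
    dynamic_uj = (n_reads * e_read_pj + n_writes * e_write_pj) * 1e-6  # pJ -> uJ
    static_uj = leak_uw * runtime_ms * 1e-3                            # uW * ms -> uJ
    return dynamic_uj + static_uj

# Hypothetical access profile of an instruction memory: read-dominated, which is
# exactly the case where a write-limited technology such as RRAM can pay off
# despite its higher write energy and latency.
reads, writes, runtime_ms = 10_000_000, 50_000, 100

sram = memory_energy(reads, writes, e_read_pj=1.0, e_write_pj=1.0,
                     leak_uw=200.0, runtime_ms=runtime_ms)
rram = memory_energy(reads, writes, e_read_pj=0.5, e_write_pj=5.0,
                     leak_uw=5.0, runtime_ms=runtime_ms)   # near-zero leakage

print(f"SRAM: {sram:.1f} uJ, RRAM: {rram:.1f} uJ")
```

With the placeholder numbers above, leakage dominates the SRAM total while the RRAM total is driven almost entirely by reads, which is the qualitative effect the abstract argues for.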
12

Smart Compilers for Reliable and Power-efficient Embedded Computing

January 2012 (has links)
abstract: Thanks to continuous technology scaling, intelligent, fast and smaller digital systems are now available at affordable costs. As a result, digital systems have found use in a wide range of application areas that were not even imagined before, including medical (e.g., MRI, remote or post-operative monitoring devices), automotive (e.g., adaptive cruise control, anti-lock brakes), security systems (e.g., residential security gateways, surveillance devices), and in- and out-of-body sensing (e.g., capsules swallowed by patients to measure digestive-system pH, heart monitors). Such computing systems, which are completely embedded within the application, are called embedded systems, as opposed to general-purpose computing systems. In the design of such embedded systems, power consumption and reliability are indispensable system requirements. In battery-operated portable devices, the battery is the single largest factor contributing to device cost, weight, recharging time and frequency, and ultimately usability. For example, in the Apple iPhone 4 smartphone, the battery accounts for 40% of the device weight, occupies 36% of its volume and allows only 7 hours (over 3G) of talk time. As embedded systems find use in a range of sensitive applications, from bio-medical applications to safety and security systems, the reliability of the computations performed becomes a crucial factor. At the current technology node, portable embedded systems are expected to experience failures due to soft errors at a rate of about once per year; with aggressive technology scaling, the rate is predicted to increase exponentially to once per hour. Over the years, researchers have developed techniques, implemented at different layers of the design spectrum, to improve system power efficiency and reliability. Among the layers of design abstraction, I observe that the interface between the compiler and the processor micro-architecture possesses a unique potential for efficient design optimizations: the compiler designer can observe and analyze the application software at a fine granularity, while the processor architect analyzes the system output (power, performance, etc.) for each executed instruction. If the system knowledge at these two design layers can be integrated at the compiler/micro-architecture interface, the optimizations at both layers can be modified to use the available resources efficiently and thereby achieve appreciable system-level benefits. To this effect, the thesis statement is that, "by merging system design information at the compiler and micro-architecture design layers, smart compilers can be developed that achieve reliable and power-efficient embedded computing through: i) pure compiler techniques, ii) hybrid compiler/micro-architecture techniques, and iii) compiler-aware architectures". This dissertation demonstrates, through contributions in each of the three compiler-based techniques, the effectiveness of smart compilers in achieving power efficiency and reliability in embedded systems. / Dissertation/Thesis / Ph.D. Computer Science 2012
13

Towards Energy Efficient Computing with Linux : Enabling Task Level Power Awareness and Support for Energy Efficient Accelerator

January 2013 (has links)
abstract: With increasing transistor counts and shrinking feature sizes, reducing power consumption has become a major design constraint. This has given rise to aggressive architectural changes for on-chip power management and to rapid development of energy-efficient hardware accelerators. Accordingly, the objective of this research is to help software developers leverage these hardware techniques and improve the energy efficiency of the system. To achieve this, I propose two solutions for the Linux kernel. First, optimal use of these architectural enhancements requires accurate modeling of processor power consumption. Although many models of processor power consumption are available in the literature, there is a lack of models that capture power consumption at the task level. Task-level energy models are required for an operating system (OS) to perform real-time power management, since the OS time-multiplexes tasks to share hardware resources. I propose a detailed design methodology for constructing an architecture-agnostic task-level power model and incorporating it into a modern operating system to build an online task-level power profiler. The profiler is implemented inside a recent Linux kernel and validated on an Intel Sandy Bridge processor. It has a negligible overhead of less than 1% of hardware resource consumption, and its power predictions were demonstrated on application benchmarks from SPEC to PARSEC with less than 4% error. I also demonstrate the importance of the proposed profiler for emerging architectural techniques through use-case scenarios, including heterogeneous computing and fine-grained per-core DVFS. Second, along with architectural enhancements in general-purpose processors, hardware accelerators such as Coarse-Grained Reconfigurable Architectures (CGRAs) are gaining popularity. Unlike vector processors, which rely on data parallelism, a CGRA can provide greater flexibility and compiler-level control, making it more suitable for the present SoC environment. To provide a streamlined development environment for CGRAs, I propose a flexible framework in Linux for CGRA design-space exploration. With accurate and flexible hardware models, fine-grained integration with an accurate architectural simulator, and Linux memory management and DMA support, a user can carry out a wide range of CGRA experiments in a full-system environment. / Dissertation/Thesis / M.S. Electrical Engineering 2013
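
The task-level power model itself is not spelled out in this abstract; as a sketch of the general approach, the snippet below fits a linear power model to per-task performance-counter rates and uses it to predict per-task power. The counter set, the coefficients and the training samples are purely illustrative assumptions, not values from the thesis or from any particular processor.

```python
import numpy as np

# Sketch of an architecture-agnostic, task-level power model: power is regressed
# against per-task activity counters sampled by the OS at each context switch.
# The counters and the training data below are synthetic.

def fit_power_model(counter_rates, measured_power_w):
    """Least-squares fit: power ~ c0 + sum_i c_i * counter_rate_i."""
    X = np.hstack([np.ones((counter_rates.shape[0], 1)), counter_rates])
    coeffs, *_ = np.linalg.lstsq(X, measured_power_w, rcond=None)
    return coeffs

def predict_task_power(coeffs, task_counter_rates):
    """Estimate the power a single task draws from its own counter rates."""
    return coeffs[0] + task_counter_rates @ coeffs[1:]

# Synthetic training samples: [instructions per cycle, LLC misses per kilo-instruction]
rates = np.array([[0.8, 2.0], [1.5, 0.5], [2.1, 0.1], [1.0, 5.0]])
power = np.array([9.0, 12.0, 14.5, 11.0])   # watts, made-up measurements

coeffs = fit_power_model(rates, power)
print("per-task power estimate (W):", predict_task_power(coeffs, np.array([1.2, 1.0])))
```

In an online profiler of the kind described, the fit would be done once per platform and the prediction step run at every scheduling event, which is why keeping the model linear and cheap matters.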
14

Rate Flexible Soft Decision Viterbi Decoder using SiLago

Baliga, Naveen Bantwal January 2021 (has links)
The IEEE 802.11a protocol is part of the IEEE 802 family of protocols for implementing WLAN Wi-Fi computer communication in various frequency bands. These protocols find applications worldwide, covering a wide range of devices such as mobile phones, computers, laptops and household appliances. Because wireless communication is used, the transmitted data is susceptible to noise. To recover from noise, the transmitted data is encoded using convolutional encoding and correspondingly decoded on the receiver side; the decoder used in the PHY layer of the protocol is the Viterbi decoder. This thesis investigates soft-decision Viterbi decoder implementations that meet the requirements of the IEEE 802.11a protocol. It aims to implement a rate-flexible design as a coarse-grained reconfigurable architecture using the SiLago framework. SiLago is a modular approach to ASIC design: components are designed as hardened blocks, meaning they are synthesised and pre-verified. Each block is also abuttable, like LEGO bricks, which allows users to connect compatible blocks and create designs specific to their requirements while obtaining performance similar to that of traditional ASICs. This approach significantly reduces design costs, as verification is a one-time task. The thesis discusses the strongly connected trellis Viterbi decoding algorithm and proposes a design for a soft-decision Viterbi decoder. The proposed design meets the throughput requirements of the communication protocol and can be reconfigured to work for 45 different code rates, with programmable soft-decision width and parallelism. The algorithm used is compared against MATLAB for its BER performance. Results from RTL simulations, and the advantages and disadvantages of the proposed design, are discussed. Recommendations for future improvements are also made.
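
As a behavioural reference for the decoding algorithm discussed above, the sketch below implements a soft-decision Viterbi decoder in Python for the rate-1/2, constraint-length-7 convolutional code used by IEEE 802.11a (generators 133 and 171 octal). It is not the SiLago/RTL design from the thesis: the register bit-ordering convention, the squared-Euclidean branch metric and the zero-tail termination are assumptions made for this example, puncturing (which yields the additional code rates) is omitted, and the encoder is included only so the snippet is self-contained.

```python
import numpy as np

K = 7                       # constraint length of the IEEE 802.11a code
G = (0o133, 0o171)          # generator polynomials (octal), rate 1/2
NSTATES = 1 << (K - 1)

def build_trellis():
    """Next-state and expected-output tables. Convention (an assumption of this
    sketch): the K-bit register keeps the newest input bit in its MSB, and the
    state is the register with that bit dropped."""
    nxt = np.zeros((NSTATES, 2), dtype=int)
    out = np.zeros((NSTATES, 2, 2), dtype=int)
    for s in range(NSTATES):
        for b in (0, 1):
            reg = (b << (K - 1)) | s
            nxt[s, b] = reg >> 1
            for i, g in enumerate(G):
                out[s, b, i] = bin(reg & g).count("1") & 1
    return nxt, out

def encode(bits, nxt, out):
    """Reference convolutional encoder, included only to make the demo self-contained."""
    s, coded = 0, []
    for b in bits:
        coded.extend(out[s, b])
        s = nxt[s, b]
    return np.array(coded)

def viterbi_soft(rx, nxt, out):
    """Soft-decision decode; rx holds 2 soft samples per information bit,
    with BPSK mapping 0 -> +1.0 and 1 -> -1.0."""
    n = len(rx) // 2
    pm = np.full(NSTATES, np.inf)
    pm[0] = 0.0                                  # encoder starts in state 0
    prev = np.zeros((n, NSTATES), dtype=int)     # survivor predecessor state
    dec = np.zeros((n, NSTATES), dtype=int)      # survivor decoded bit
    for t in range(n):
        r = rx[2 * t: 2 * t + 2]
        new = np.full(NSTATES, np.inf)
        for s in range(NSTATES):
            if not np.isfinite(pm[s]):
                continue
            for b in (0, 1):
                ns = nxt[s, b]
                # add-compare-select with a squared-Euclidean branch metric
                m = pm[s] + np.sum((r - (1.0 - 2.0 * out[s, b])) ** 2)
                if m < new[ns]:
                    new[ns], prev[t, ns], dec[t, ns] = m, s, b
        pm = new
    s, bits = int(np.argmin(pm)), []             # trace back from the best end state
    for t in range(n - 1, -1, -1):
        bits.append(int(dec[t, s]))
        s = prev[t, s]
    return bits[::-1]

nxt, out = build_trellis()
msg = list(np.random.randint(0, 2, 64)) + [0] * (K - 1)     # zero tail terminates the trellis
rx = 1.0 - 2.0 * encode(msg, nxt, out) + 0.4 * np.random.randn(2 * len(msg))
print("bit errors:", sum(int(a) != int(b) for a, b in zip(viterbi_soft(rx, nxt, out), msg)))
```

A hardware implementation would replace the per-state Python loop with parallel add-compare-select units and a bounded traceback memory; the strongly connected trellis lets the decoder start from any state without a known initial condition.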
15

Optimizing the instruction scheduler of high-level synthesis tool / Optimera instruktion schemaläggaren för högnivå syntes verktyg

Xu, Zihao January 2023 (has links)
With the increasing complexity of chip architecture designs that meet different application requirements, the corresponding instruction scheduler of a high-level synthesis tool needs to solve complex scheduling problems. The Dynamically Reconfigurable Resource Array (DRRA) is a novel architecture based on a Coarse-Grained Reconfigurable Architecture (CGRA) on the SiLago platform. The instruction scheduler of Vesyla-II, the dedicated High-Level Synthesis (HLS) tool targeting DRRA, needs to schedule the specific instruction sets designed for the Distributed Two-Level Control System (D2LC). These instructions have different lifetimes and are fully cooperative and persistent. Because of these features, the instruction scheduler must apply a scheduling algorithm under complex constraints. The previously existing naive algorithm shows poor scalability and low efficiency. This thesis designs and implements a new scheduling algorithm to improve the performance of a scheduler based on a constraint-programming engine. The new algorithm is based on a heuristic method: the scheduler predicts instruction order during the resource scheduling process. In addition, a test bench covering different instruction scheduling behaviours is designed; the test bench can generate schedules at the maximum boundary to profile the performance of the developed algorithm. Several experiments compare the proposed method against the previous naive algorithm, with execution time and quality of the result as the criteria for which algorithm performs better. The results show that the scheduler with the heuristic algorithm reduces execution time while achieving comparable schedule quality, and it solves all the test cases, whereas the naive algorithm can only solve part of them.
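
As an illustration of the general family of heuristics involved, the sketch below performs priority-driven list scheduling of instructions onto a limited set of resources. It is not the Vesyla-II scheduler or its constraint-programming formulation; the instruction fields, resource names and the longest-duration-first priority are assumptions made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    resource: str                                 # which kind of unit it occupies
    duration: int                                 # cycles the unit stays busy
    deps: list = field(default_factory=list)      # names of predecessor instructions

def list_schedule(instrs, units_per_resource):
    """Greedy list scheduling: repeatedly place ready instructions (longest first)
    on the earliest-free unit of their resource class."""
    done_at, schedule = {}, {}
    busy_until = {r: [0] * n for r, n in units_per_resource.items()}
    pending = {i.name: i for i in instrs}
    while pending:
        ready = [i for i in pending.values() if all(d in done_at for d in i.deps)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        ready.sort(key=lambda i: -i.duration)     # heuristic priority
        for i in ready:
            est = max((done_at[d] for d in i.deps), default=0)
            units = busy_until[i.resource]
            u = min(range(len(units)), key=lambda k: units[k])
            start = max(est, units[u])
            units[u] = done_at[i.name] = start + i.duration
            schedule[i.name] = (start, u)         # (start cycle, unit index)
            del pending[i.name]
    return schedule

instrs = [Instr("load_a", "dpu", 2), Instr("load_b", "dpu", 2),
          Instr("mac", "alu", 4, deps=["load_a", "load_b"]),
          Instr("store", "dpu", 2, deps=["mac"])]
print(list_schedule(instrs, {"dpu": 1, "alu": 1}))
```

In a constraint-programming setting, a heuristic of this kind can supply the variable and value ordering (the "order prediction" mentioned above) so that the solver reaches a good schedule quickly instead of exploring the full search space.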
16

Improving the Gameplay Experience and Guiding Bottom Players in an Interactive Mapping Game

Ambekar, Kiran 05 1900 (has links)
In game-based learning, motivating players to learn by providing them with a desirable gameplay experience is extremely important. However, this is not an easy task, considering the quality of today's commercial non-educational games. Throughout the gameplay, the player should be neither overwhelmed nor under-challenged. The best way to ensure this is to monitor the player's actions in the game, because these actions can reveal the reasons behind the player's performance as well as the competencies or knowledge the player lacks. Based on this information, in-game educational interventions in the form of hints can be provided to the player. The success of such games depends on their interactivity and motivational appeal, and thus on player retention. UNTANGLED is an online mapping game based on crowd-sourcing, developed by the Reconfigurable Computing Lab at UNT for the mapping problem of CGRAs; it is also an educational game for teaching the concepts of reconfigurable computing. This thesis performs a qualitative comparative analysis of the gameplay of low-performing UNTANGLED players, and the implications of this analysis are used to provide recommendations for improving the gameplay experience for these players by guiding them. The recommendations include strategies for reaching a high score and a compact solution, hints in the form of preset patterns, and a clustering-based approach.
17

A Design Methodology for Coarse-Grained Reconfigurable Architectures Considering Soft-Error Tolerance (ソフトエラー耐性を考慮した粗粒度再構成可能アーキテクチャの設計手法)

今川, 隆司 23 March 2015 (has links)
Kyoto University / 0048 / New-system doctoral program / Doctor of Informatics / Degree No. Kō 19136 / Jōhaku No. 582 / Shinsei||Jō||102 (University Library) / 32087 / Department of Communications and Computer Engineering, Graduate School of Informatics, Kyoto University / (Chief examiner) Prof. Takashi Sato, Prof. Hidetoshi Onodera, Prof. Naofumi Takagi / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
18

A Coarse Grained Reconfigurable Architecture Framework Supporting Macro-Dataflow Execution

Varadarajan, Keshavan 12 1900 (has links) (PDF)
A Coarse-Grained Reconfigurable Architecture (CGRA) is a processing platform consisting of an interconnection of coarse-grained computation units (e.g. Function Units (FUs) or Arithmetic Logic Units (ALUs)). These units communicate directly, through send-receive-like primitives, as opposed to the shared-memory-based communication used in multi-core processors. CGRAs are a well-researched topic and their design space is quite large. The design space can be represented as a 7-tuple (C, N, T, P, O, M, H), where the terms have the following meaning: C - choice of computation unit; N - choice of interconnection network; T - choice of number of context frames (single or multiple); P - presence of partial reconfiguration; O - choice of orchestration mechanism; M - design of the memory hierarchy; and H - host-CGRA coupling. In this thesis, we develop an architectural framework for a macro-dataflow based CGRA where we make the following choices for these parameters: C - ALU; N - Network-on-Chip (NoC); T - multiple contexts; P - support for partial reconfiguration; O - macro-dataflow based orchestration; M - data memory banks placed at the periphery of the reconfigurable fabric (the reconfigurable fabric is the name given to the interconnection of computation units); and H - loose coupling between the host processor and the CGRA, enabling our CGRA to execute an application independent of the host processor's intervention.

The motivations for developing such a CGRA are: (i) to execute applications efficiently, through reduction in reconfiguration time (i.e. the time needed to transfer instructions and data to the reconfigurable fabric) and reduction in execution time through better exploitation of all forms of parallelism: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread/Task Level Parallelism (TLP). We choose a macro-dataflow based orchestration framework in combination with partial reconfiguration so as to ease exploitation of TLP and DLP; macro-dataflow serves as a lightweight synchronization mechanism. We experiment with two variants of the macro-dataflow orchestration unit, namely a hardware-controlled orchestration unit and a compiler-controlled orchestration unit, and we employ a NoC as it helps reduce the reconfiguration overhead. (ii) To permit customization of the CGRA for a particular domain through the use of domain-specific custom Intellectual Property (IP) blocks; this improves application performance and energy efficiency. (iii) To develop a CGRA which is completely programmable and accepts any program written using the C89 standard. The compiler and the architecture were co-developed to ensure that every feature of the architecture could be automatically programmed through an application by a compiler.

In this CGRA framework, the orchestration mechanism (O) and the host-CGRA coupling (H) are kept fixed and we permit design space exploration of the other terms in the 7-tuple design space. The mode of compilation and execution remains invariant under these changes, hence the term framework. We now elucidate the compilation and execution flow for this CGRA framework. An application written in the C language is compiled and transformed into a set of temporal partitions, referred to as HyperOps in this thesis. The macro-dataflow orchestration unit selects a HyperOp for execution when all its inputs are available. The instructions and operands for a ready HyperOp are transferred to the reconfigurable fabric for execution. Each ALU (in the computation unit) is capable of waiting for the availability of its input data prior to issuing instructions. We permit the launch and execution of a temporal partition to progress in parallel, which reduces the reconfiguration overhead. We further cut launch delays by keeping loops persistent on the fabric, thus eliminating the need to re-launch their instructions.

The CGRA framework has been implemented using Bluespec System Verilog. We evaluate the performance of two CGRA instances: one for cryptographic applications and another for linear algebra kernels. We also run other general-purpose integer and floating-point applications to demonstrate the generic nature of these optimizations. We explore various microarchitectural optimizations, viz. pipeline optimizations (i.e. changing the value of T), different forms of macro-dataflow orchestration (hardware-controlled and compiler-controlled orchestration units), different execution modes including resident loops, pipeline parallelism, changes to the router, etc. As a result of these optimizations we observe a 2.5x improvement in performance compared to the base version. The reconfiguration overhead was hidden by overlapping the launch of instructions with execution; the perceived reconfiguration overhead is reduced drastically to about 9-11 cycles for each HyperOp, invariant of the size of the HyperOp. This can be mainly attributed to data-dependent instruction execution and the use of the NoC. The overhead of the macro-dataflow execution unit was reduced to a minimum with the compiler-controlled orchestration unit.

To benchmark the performance of these CGRA instances, we compare them with an Intel Core 2 Quad running at 2.66 GHz. On the cryptographic CGRA instance, running at 700 MHz, we observe one to two orders of magnitude improvement in performance for cryptographic applications, and up to one order of magnitude performance degradation for the linear algebra CGRA instance. The relatively poor performance of the linear algebra kernels can be attributed to the inability to exploit ILP across computation units interconnected by the NoC, the long latency in accessing data memory placed at the periphery of the reconfigurable fabric, and the unavailability of pipelined floating-point units (which are critical to the performance of linear algebra kernels). The superior performance of the cryptographic kernels can be attributed to a higher computation-to-load-instruction ratio, a careful choice of custom IP block, the ability to construct large HyperOps, which allows a greater portion of the communication to be performed directly (as opposed to communication through a register file in a general-purpose processor), and the use of the resident-loops execution mode. The power consumption of a computation unit employed in the cryptography CGRA instance, along with its router, is about 76 mW, as estimated by Synopsys Design Vision using the Faraday 90 nm technology library for an activity factor of 0.5. The power of other instances would depend on the specific instantiation of the domain-specific units. This implies that for a reconfigurable fabric of size 5 x 6 the total power consumption is about 2.3 W. The area and power (84 mW) dissipated by the macro-dataflow orchestration unit, which is common to both instances, are comparable to those of a single computation unit, making it an effective and low-overhead technique for exploiting TLP.
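
The firing rule described above (a HyperOp is launched once all of its inputs are available) can be sketched as follows. The class, token names and `deliver` interface are hypothetical stand-ins for the hardware- or compiler-controlled orchestration units, not a model of their implementation.

```python
# Sketch of the macro-dataflow firing rule: a HyperOp is launched onto the
# fabric only when all of its inputs have arrived. Names are illustrative.

class Orchestrator:
    def __init__(self, hyperops):
        # hyperops: {name: set of input tokens the HyperOp waits for}
        self.waiting = {name: set(inputs) for name, inputs in hyperops.items()}
        self.launch_order = []

    def deliver(self, name, token):
        """An executing HyperOp produced `token` for consumer `name`."""
        pending = self.waiting.get(name)
        if pending is None:
            return                           # already launched (or unknown)
        pending.discard(token)
        if not pending:                      # all operands available -> fire
            del self.waiting[name]
            self.launch_order.append(name)

orch = Orchestrator({"h1": {"arg0"}, "h2": {"h1.out"}, "h3": {"h1.out", "h2.out"}})
orch.deliver("h1", "arg0")       # h1 fires immediately
orch.deliver("h2", "h1.out")     # h1's result enables h2
orch.deliver("h3", "h1.out")
orch.deliver("h3", "h2.out")     # h3 fires only after both inputs arrive
print(orch.launch_order)         # ['h1', 'h2', 'h3']
```

Because HyperOps with no mutual dependences become ready independently, the same rule exposes TLP without any explicit synchronization in the application code, which is the "lightweight synchronization" role the abstract attributes to macro-dataflow.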
19

Compiling For Coarse-Grained Reconfigurable Architectures Based On Dataflow Execution Paradigm

Alle, Mythri 12 1900 (has links) (PDF)
Coarse-Grained Reconfigurable Architectures (CGRAs) can be employed to accelerate computational workloads that demand both flexibility and performance. CGRAs comprise a set of computation elements interconnected using a network; this interconnection of computation elements is referred to as a reconfigurable fabric. The size of application that can be accommodated on the reconfigurable fabric is limited by the size of the instruction buffers associated with each compute element. When an application cannot be accommodated entirely, it is partitioned such that each partition can be executed on the reconfigurable fabric. These partitions are scheduled by an orchestrator, which employs the dynamic dataflow execution paradigm. The dynamic dataflow execution paradigm has inherent support for synchronization and helps exploit the parallelism that exists across application partitions. In this thesis, we present a compiler that targets such CGRAs. The compiler is capable of accepting applications specified in the C89 standard. To enable architectural design-space exploration, it is designed so that it can be customized, through configuration parameters, for several instances of CGRAs employing the dataflow execution paradigm at the orchestrator. The focus of this thesis is to provide efficient support for various kinds of parallelism while ensuring correctness. The compiler is designed to support fine-grained task-level parallelism that exists across iterations of loops and function calls. Additionally, the compiler can support pipeline parallelism, where a loop is split into multiple stages that execute in a pipelined manner. The prototype compiler, which targets multiple instances of a CGRA, is demonstrated in this thesis. We used the compiler to target multiple variants of CGRAs employing the dataflow execution paradigm, varying the reconfigurable fabric, the orchestration mechanism employed, and the size of the instruction buffers. We chose applications from two different domains, viz. cryptography and linear algebra. The execution time of the CGRA (the best among all instances) is compared against an Intel Quad-core processor. Cryptography applications show a performance improvement ranging from more than one order of magnitude to close to two orders of magnitude. These applications have large amounts of ILP and our compiler could successfully expose the ILP available in them; the domain customization also played an important role in achieving good performance. We employed two custom functional units for accelerating cryptography applications, and the compiler could use them efficiently. In the linear algebra kernels we observe multiple iterations of a loop executing in parallel, effectively exploiting loop-level parallelism at runtime. In spite of this, we notice close to an order of magnitude performance degradation, which can be attributed to the use of non-pipelined floating-point units and the delays involved in accessing memory. Pipeline parallelism was demonstrated using this compiler for FFT and QR factorization. Thus, the compiler is capable of efficiently supporting different kinds of parallelism, supports the complete C89 standard, and can target different instances of CGRAs employing the dataflow execution paradigm.
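
As a sketch of the partitioning constraint described above (each temporal partition must fit within the per-compute-element instruction buffers), the snippet below splits an operation graph greedily in topological order. The operation costs, dependency graph and the use of Python's `graphlib` (3.9+) for ordering are illustrative assumptions; the thesis's actual partitioning algorithm is not reproduced here.

```python
from graphlib import TopologicalSorter

def partition(ops, deps, buffer_capacity):
    """ops: {op: instruction count}; deps: {op: set of predecessor ops}.
    Greedily packs ops, in topological order, into partitions that each fit
    the instruction-buffer capacity."""
    order = list(TopologicalSorter(deps).static_order())
    partitions, current, used = [], [], 0
    for op in order:
        cost = ops[op]
        if current and used + cost > buffer_capacity:
            partitions.append(current)       # close the full partition
            current, used = [], 0
        current.append(op)
        used += cost
    if current:
        partitions.append(current)
    return partitions

ops = {"ld": 2, "mul": 3, "add": 1, "st": 2}
deps = {"mul": {"ld"}, "add": {"mul"}, "st": {"add"}, "ld": set()}
print(partition(ops, deps, buffer_capacity=4))   # [['ld'], ['mul', 'add'], ['st']]
```

The resulting partitions correspond to the units the orchestrator schedules at runtime; a real compiler would additionally balance partition sizes against the inter-partition communication they create.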
