Global ETD Search

1	Scratchpad Management in Software Managed Manycore Architectures January 2017 (has links) abstract: Caches have long been used to reduce memory access latency. However, the increased complexity of cache coherence brings significant challenges in processor design as the number of cores increases. While making caches scalable is still an important research problem, some researchers are exploring the possibility of a more power-efficient SRAM called scratchpad memories or SPMs. SPMs consume significantly less area, and are more energy-efficient per access than caches, and therefore make the design of on-chip memories much simpler. Unlike caches, which fetch data from memories automatically, an SPM requires explicit instructions for data transfers. SPM-only architectures are thus named as software managed manycore (SMM), since the data movements of such architectures rely on software. SMM processors have been widely used in different areas, such as embedded computing, network processing, or even high performance computing. While SMM processors provide a low-power platform, the hardware alone does not guarantee power efficiency, if applications on such processors deliver low performance. Efficient software techniques are therefore required. A big body of management techniques for SMM architectures are compiler-directed, as inserting data movement operations by hand forces programmers to trace flow of data, which can be error-prone and sometimes difficult if not impossible. This thesis develops compiler-directed techniques to manage data transfers for embedded applications on SMMs efficiently. The techniques analyze and find out the proper program points and insert data movement instructions accordingly. The techniques manage code, stack and heap data of applications, and reduce execution time by 14%, 52% and 80% respectively compared to their predecessors on typical embedded applications. On top of managing local data, a technique is also developed for shared data in SMM architectures. Experimental results show it achieves more than 2X speedup than the previous technique on average. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2017 Computer science compiler multicore scratchpad memory SPM
2	Architectures and Theoretical Models for Shared Scratchpad Memory Systems Wittig, Robert Klaus 10 November 2021 (has links) Computer engineering is advancing rapidly. For 55 years, the performance of integrated circuits has almost doubled every 18 months. Mostly, these advancements were enabled by technological progress. Even the end of frequency scaling could not bring the ever-increasing performance growth to a halt. However, technology burdens, like noticeable leakage currents, have piled up, which shifts the focus towards architectural improvements. Especially the multi-core paradigm has proven its virtue for chip designs over the last decade. While having been introduced in high-performance computing areas, modern technology nodes also enable low-cost, low-power embedded designs to benefi t from multiple cores and accelerators. Since the majority of cores depend on memory, which requires a considerable amount of chip area, this common resource needs to be shared effi ciently. High-performance cores use shared caches to increase memory utilization. However, many accelerators do not use caches as they need predictable and fast scratchpad memory (SM). But sharing SM entails confl icts, questioning its fast and predictable nature. Hence, the question arises on how to adapt architectures for sharing while retaining SM’s advantages. This thesis presents a novel, shared SM architecture that embraces the idea of a minimal logic path between core and memory, thereby increasing the maximum operating frequency. Because of its additional capabilities, like dynamic address translation and programmable priorities, it is also well suited for heterogeneous platforms that use dynamic scheduling and require predictable behavior. Demonstrating its advantages, we analyze the characteristics of the new architecture and compare it to state-of-the-art approaches. To further mitigate confl icts, we present the conception of access interval prediction (AIP). By predicting memory accesses with a granularity of a single clock cycle, AIP guides the allocation of resources. This method maximizes memory utilization while reducing confl ict delays. With the help of various methods inspired by branch prediction, we achieve over 90 % of accurate predictions and reduce stall cycles signifi cantly. Another key contribution of this thesis is the extension of analytic models to estimate the throughput of shared SM systems. Again, the focus lies on heterogeneous systems with different priorities and access patterns. The results show a promising error reduction, boosting the used models applicability for real design use cases. info:eu-repo/classification/ddc/621.3 ddc:621.3
3	An Instruction Scratchpad Memory Allocation for the Precision Timed Architecture Prakash, Aayush 11 December 2012 (has links) This work presents a static instruction allocation scheme for the precision timed architecture’s (PRET) scratchpad memory. Since PRET provides timing instructions to control the temporal execution of programs, the objective of the allocation scheme is to ensure that the explicitly specified temporal requirements are met. Furthermore, this allocation incorporates instructions from multiple hardware threads of the PRET architecture. We formulate the allocation as an integer-linear programming problem, and we implement a tool that takes binaries, constructs a control-flow graph, performs the allocation, rewrites the binary with the new allocation, and generates an output binary for the PRET architecture. We carry out experiments on a modified version of the Malardalen benchmarks to illustrate that commonly known ACET and WCET based approaches cannot be directly applied to meet explicit timing requirements. We also show the advantage of performing the allocation across multiple threads. We present a real time benchmark controlling an Unmanned Air Vehicle as the case study. memory allocation precision timed architecture scratchpad memory Electrical and Computer Engineering
4	A Memory-Realistic SPM Allocator with WCET/ACET Tunable Performance Bai, Jia-yu 16 September 2010 (has links) Real-time systems often use SPM instead of cache, because SPM allows a program¡¦s run time to be more predictable. Real-time system need predictable runtimes, because they must schedule programs to finish within specific deadlines. A deadline should be larger than its program¡¦s worst-case execution time (WCET). Our laboratory is conducting ongoing research into scratchpad memory allocation (SPM) for reducing the WCET of a program. Compared to our previous work, this current thesis improves our memory model, our allocation algorithms, our real-time support, and our measurement benchmarks and platform. Our key accomplishments in this paper are to: 1) add, for the first time in the literature, true WCETmeas analysis to an SPM allocator, 2) to modestly improve the performance of our previous allocator, and 3) to greatly increase the applicability over that allocator, by extending the method to support recursive programs. real-time system. ACET WCET SPM allocator scratchpad memory
5	A Stack-Optimized Scratchpad Memory Allocator for Reducing Either the Average-Case or the Worst-Case Execution Time Wu, Cheng-Ying 10 August 2009 (has links) Scratchpad memory (SPM) is popular for real-time embedded systems. Whereas caches use a memory management unit (MMU) to control which data accesses go to the fast, on-chip SRAM, SPM directly maps certain addresses to the SRAM. One advantage of SPM is that it avoids the cache¡¦s costly MMU. Another advantage is that the SPM is 100% statically predictable, whereas the variables stored in the cache depend upon the dynamic execution history. This predictability is beneficial for real-time systems which must schedule tasks to finish by fixed deadlines. To set these deadlines, system designers must determine the worst-case execution times (WCETs) of the applications. The predictability of SPM makes these WCETs easier to measure. This thesis presents a new method for allocating stack and global data to the SPM. It is the first method to make use of the special properties of non-escaping variables so as to increase the effective size of the SPM. Our insight is that many local variables of caller functions can be temporarily swapped out of the SPM while the callee function executes. Ours is also the first method to support profiled WCET measurements in the allocation strategy. Most previous SPM methods optimize only for the average-case execution time (ACET), despite the fact that SPMs are often used in real-time environments where the WCET is also important. This new memory allocation strategy is also the first to be WCET/ACET tunable, a feature that is particular useful for soft real-time systems. Only one previous work considers a WCET-targeted SPM allocator. That work, however, only applies to static WCET analysis tools. Such tools are difficult to program and are not widely used. Also, they only have application to the most safety-critical of real-time systems. In contrast, our approach is the first to employ measurement-based WCET analysis (such as is most commonly used in industry) for SPM allocation. memory allocation WCET ACET scratchpad memory real-time systems
6	An Instruction Scratchpad Memory Allocation for the Precision Timed Architecture Prakash, Aayush 11 December 2012 (has links) This work presents a static instruction allocation scheme for the precision timed architecture’s (PRET) scratchpad memory. Since PRET provides timing instructions to control the temporal execution of programs, the objective of the allocation scheme is to ensure that the explicitly specified temporal requirements are met. Furthermore, this allocation incorporates instructions from multiple hardware threads of the PRET architecture. We formulate the allocation as an integer-linear programming problem, and we implement a tool that takes binaries, constructs a control-flow graph, performs the allocation, rewrites the binary with the new allocation, and generates an output binary for the PRET architecture. We carry out experiments on a modified version of the Malardalen benchmarks to illustrate that commonly known ACET and WCET based approaches cannot be directly applied to meet explicit timing requirements. We also show the advantage of performing the allocation across multiple threads. We present a real time benchmark controlling an Unmanned Air Vehicle as the case study. memory allocation precision timed architecture scratchpad memory Electrical and Computer Engineering
7	A Dynamic Scratchpad Memory Unit for Predictable Real-Time Embedded Systems Wasly, Saud 21 December 2012 (has links) Scratch-pad memory is a popular alternative to caches in real-time embedded systems due to its advantages in terms of timing predictability and power consumption. However, dynamic management of scratch-pad content is challenging in multitasking environments. To address this issue, this thesis proposes the design of a novel Real-Time Scratchpad Memory Unit (RSMU). The RSMU can be integrated into existing systems with minimal architectural modi cations. Furthermore, scratchpad management is performed at the OS level, requiring no application changes. In conjunction with a two-level scheduling scheme, the RSMU provides strong timing guarantees to critical tasks. Demonstration and evaluation of the system design is provided on an embedded FPGA platform. Real-time embedded systems scratchpad Electrical and Computer Engineering
8	A Dynamic Scratchpad Memory Unit for Predictable Real-Time Embedded Systems Wasly, Saud 21 December 2012 (has links) Scratch-pad memory is a popular alternative to caches in real-time embedded systems due to its advantages in terms of timing predictability and power consumption. However, dynamic management of scratch-pad content is challenging in multitasking environments. To address this issue, this thesis proposes the design of a novel Real-Time Scratchpad Memory Unit (RSMU). The RSMU can be integrated into existing systems with minimal architectural modi cations. Furthermore, scratchpad management is performed at the OS level, requiring no application changes. In conjunction with a two-level scheduling scheme, the RSMU provides strong timing guarantees to critical tasks. Demonstration and evaluation of the system design is provided on an embedded FPGA platform. Real-time embedded systems scratchpad Electrical and Computer Engineering
9	Optimizing Heap Data Management on Software Managed Manycore Architectures January 2017 (has links) abstract: Caches pose a serious limitation in scaling many-core architectures since the demand of area and power for maintaining cache coherence increases rapidly with the number of cores. Scratch-Pad Memories (SPMs) provide a cheaper and lower power alternative that can be used to build a more scalable many-core architecture. The trade-off of substituting SPMs for caches is however that the data must be explicitly managed in software. Heap management on SPM poses a major challenge due to the highly dynamic nature of of heap data access. Most existing heap management techniques implement a software caching scheme on SPM, emulating the behavior of hardware caches. The state-of-the-art heap management scheme implements a 4-way set-associative software cache on SPM for a single program running with one thread on one core. While the technique works correctly, it suffers from signifcant performance overhead. This paper presents a series of compiler-based efficient heap management approaches that reduces heap management overhead through several optimization techniques. Experimental results on benchmarks from MiBenchGuthaus et al. (2001) executed on an SMM processor modeled in gem5Binkert et al. (2011) demonstrate that our approach (implemented in llvm v3.8Lattner and Adve (2004)) can improve execution time by 80% on average compared to the previous state-of-the-art. / Dissertation/Thesis / Masters Thesis Computer Science 2017 Computer science Computer engineering Compiler Heap Memory Scratchpad
10	Compilation of Stream Programs onto Embedded Multicore Architectures January 2012 (has links) abstract: In recent years, we have observed the prevalence of stream applications in many embedded domains. Stream programs distinguish themselves from traditional sequential programming languages through well defined independent actors, explicit data communication, and stable code/data access patterns. In order to achieve high performance and low power, scratch pad memory (SPM) has been introduced in today's embedded multicore processors. Current design frameworks for developing stream applications on SPM enhanced embedded architectures typically do not include a compiler that can perform automatic partitioning, mapping and scheduling under limited on-chip SPM capacities and memory access delays. Consequently, many designs are implemented manually, which leads to lengthy tasks and inferior designs. In this work, optimization techniques that automatically compile stream programs onto embedded multi-core architectures are proposed. As an initial case study, we implemented an automatic target recognition (ATR) algorithm on the IBM Cell Broadband Engine (BE). Then integer linear programming (ILP) and heuristic approaches were proposed to schedule stream programs on a single core embedded processor that has an SPM with code overlay. Later, ILP and heuristic approaches for Compiling Stream programs on SPM enhanced Multicore Processors (CSMP) were studied. The proposed CSMP ILP and heuristic approaches do not optimize for cycles in stream applications. Further, the number of software pipeline stages in the implementation is dependent on actor to processing engine (PE) mapping and is uncontrollable. We next presented a Retiming technique for Throughput optimization on Embedded Multi-core processors (RTEM). RTEM approach inherently handles cycles and can accept an upper bound on the number of software pipeline stages to be generated. We further enhanced RTEM by incorporating unrolling (URSTEM) that preserves all the beneficial properties of RTEM heuristic and also scales with the number of PEs through unrolling. / Dissertation/Thesis / Ph.D. Computer Science 2012 Computer science compilation embedded multicore parallel scratchpad stream

Search results