241 |
Prototyping Hardware-compressed Memory for Multi-tenant Systems / Liu, Yuqing 18 October 2023 (has links)
Software memory compression is a common practice in operating systems. Prior works have explored hardware memory compression to reduce the load on the CPU by offloading compression to hardware. However, prior works on hardware memory compression cannot provide critical isolation in multi-tenant systems like cloud servers. Our evaluation of prior work (TMCC) shows that a tenant can be slowed down by more than 12x due to the lack of isolation.
This work, Compressed Memory Management Unit (CMMU), prototypes hardware compression for multi-tenant systems. CMMU provides critical isolation for multi-tenant systems. First, CMMU allows the OS to control individual tenants' usage of physical memory. Second, CMMU compresses a tenant's memory to an OS-specified physical usage target. Finally, CMMU notifies the OS to start swapping memory to storage if it fails to compress the memory to the target.
We prototype CMMU with a real compression module on an FPGA board. CMMU runs with a Linux kernel modified to support CMMU. The prototype virtually expands the memory capacity to 4x. CMMU stably supports the modified Linux kernel with multiple tenants and applications. While achieving this, CMMU requires only several extra cycles of overhead beyond the essential data structure accesses. ASIC synthesis results show that CMMU fits within 0.00931 mm² of silicon and operates at 3 GHz while consuming 36.90 mW of power. This is a negligible cost for modern server systems. / Master of Science / Memory is a critical resource in computer systems. Memory compression is a common technique to save memory resources. Memory compression consumes computing resources, traditionally supplied by the CPU. In other words, memory compression traditionally competes with applications for CPU computing power. The prior work, TMCC, provides a design that performs memory compression in ASIC hardware and therefore no longer competes for CPU computing power. However, TMCC provides no isolation in a multi-tenant system like a modern cloud server.
This thesis prototypes a new design, Compressed Memory Management Unit (CMMU), which provides isolation in hardware memory compression. This prototype can speed up applications by 12x compared to running without isolation, with a 4x expansion in virtual memory capacity. CMMU supports a modified Linux OS running stably. CMMU also runs at a high clock speed and adds little overhead in latency, silicon chip area, and power.
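The three-step policy described in the abstract (OS-set per-tenant target, compression toward that target, OS notification when the target cannot be met) can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Tenant` class, the `enforce_target` function, the action names, and the single `best_ratio` parameter standing in for achievable compressibility are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    """Hypothetical per-tenant record; field names are illustrative only."""
    name: str
    target_bytes: int        # OS-specified physical usage target
    uncompressed_bytes: int  # tenant footprint before compression


def enforce_target(tenant: Tenant, best_ratio: float) -> str:
    """Return the action a CMMU-like unit might take for one tenant.

    best_ratio is an assumed best achievable compression ratio
    (uncompressed size / compressed size) for this tenant's data.
    """
    if tenant.uncompressed_bytes <= tenant.target_bytes:
        return "no_action"            # already within the target
    compressed = tenant.uncompressed_bytes / best_ratio
    if compressed <= tenant.target_bytes:
        return "compress"             # compression alone meets the target
    return "notify_os_swap"           # target unreachable: OS should swap
```

Under these assumptions, a tenant holding 300 bytes against a 100-byte target is compressed when a 4x ratio is achievable, and triggers an OS notification when only 2x is achievable.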
|
242 |
Protocol constructions for communication networks / Teng, Yanpyng Albert January 1980 (has links)
No description available.
|
243 |
Computer architecture for parallel execution of high level language programs / Wang, Pong-sheng January 1980 (has links)
No description available.
|
244 |
A new methodology for designing communication protocols / Lin, Huai-An January 1983 (has links)
No description available.
|
245 |
A Comparative study of RISC vs. CISC philosophies of implementing mathematical functions / Mahalingaiah, Rupaka 10 June 2012 (has links)
A comparative study of the RISC and CISC philosophies of implementing mathematical functions is undertaken. The study tries to verify whether the RISC philosophy is suited to computers designed to run specific applications, such as real-time systems. A CISC processor is used as the platform machine, and several mathematical functions are implemented in both philosophies. / Master of Science
|
246 |
Design and prototyping of Hardware-Accelerated Locality-aware Memory Compression / Srinivas, Raghavendra 09 September 2020 (has links)
Hardware acceleration is the most sought-after technique in chip design for achieving better performance and power efficiency for critical functions that may be handled inefficiently by traditional OS/software. As technology has advanced, with 7nm products already on the market that provide better power and performance in a small area, latency-critical functions traditionally handled by software have started moving into acceleration units on the chip. This thesis describes the accelerator architecture, implementation, and prototype for one such function, namely "Locality-Aware memory compression", which is part of the "OS-controlled memory compression" scheme that has been actively deployed in today's OSes. In brief, OS-controlled memory compression is a new memory management feature that transparently, dramatically, and adaptively increases effective main memory capacity on demand as software-level memory usage increases beyond physical memory system capacity. OS-controlled memory compression has been adopted across almost all OSes (e.g., Linux, Windows, macOS, AIX) and almost all classes of computing systems (e.g., smartphones, PCs, data centers, and cloud). The OS-controlled memory compression scheme is locality aware. Still, under OS-controlled memory compression today, applications experience long-latency page faults when accessing compressed memory. To solve this performance bottleneck, an acceleration technique has been proposed to manage "Locality Aware Memory compression" within hardware, thereby enabling applications to access their OS-compressed memory directly. This accelerator is referred to as HALK throughout this work, which stands for "Hardware-accelerated Locality-aware Memory Compression". The literal meaning of the word HALK in English is 'a hidden place'. As such, this accelerator is exposed neither to the OS nor to the running applications. It is hidden entirely in the memory controller hardware and incurs minimal hardware cost. This thesis explores the development of an FPGA design prototype and gives a proof of concept for the functionality of HALK by running non-trivial micro-benchmarks. This work also provides and analyses the power, performance, and area of HALK for ASIC designs (at the 7nm technology node) and for the selected FPGA prototype design. / Master of Science / Memory capacity has become a scarce resource across many digital computing systems, spanning from smartphones to large-scale cloud systems. The slowing improvement of memory capacity per dollar further worsens this problem. To address this, almost all industry-standard OSes, such as Linux, Windows, and macOS, implement memory compression to store more data in the same space. In today's systems this is handled in software, which is very inefficient and suffers long latency, degrading responsiveness for the user. Hardware is always faster in performing computations compared to software. So, a solution that is implemented in hardware with low area and low cost is always preferred, as it can provide better performance and power efficiency. In the hardware world, such modules that perform specifically targeted software functions are called accelerators. This thesis presents the development of such a hardware accelerator to handle "Locality Aware Memory Compression" so that applications can directly access compressed data without OS intervention, thereby improving the overall performance of the system. The proposed accelerator is locality aware, which means that the least recently allocated uncompressed page is picked for compression to free up more space on demand, and the most recently allocated page is kept in an uncompressed format.
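The locality-aware policy stated at the end of the abstract (compress the least recently allocated uncompressed page, keep the most recently allocated page uncompressed) can be sketched in software as follows. This is only an illustration of the selection rule, not HALK's hardware design; the `LocalityAwarePool` class, its `max_uncompressed` budget, and the use of page identifiers in place of real page contents are assumptions made here.

```python
from collections import OrderedDict


class LocalityAwarePool:
    """Illustrative model of locality-aware victim selection (hypothetical)."""

    def __init__(self, max_uncompressed: int):
        self.max_uncompressed = max_uncompressed  # assumed uncompressed budget
        self.uncompressed = OrderedDict()         # oldest allocation first
        self.compressed = set()

    def allocate(self, page_id):
        # A newly allocated page is always placed in uncompressed form.
        self.compressed.discard(page_id)
        self.uncompressed[page_id] = None
        self.uncompressed.move_to_end(page_id)
        # Free space on demand by compressing the least recently
        # allocated uncompressed pages until the budget is met.
        while len(self.uncompressed) > self.max_uncompressed:
            victim, _ = self.uncompressed.popitem(last=False)
            self.compressed.add(victim)
```

With a budget of two uncompressed pages, allocating pages 1, 2, and 3 in order leaves pages 2 and 3 uncompressed and moves page 1 to the compressed set.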
|
247 |
Towards Using Free Memory to Improve Microarchitecture Performance / Panwar, Gagandeep 18 May 2020 (has links)
A computer system's memory is designed to accommodate the worst-case workloads with the highest memory requirement; as such, memory is underutilized when a system runs workloads with common-case memory requirements. Through a large-scale study of four production HPC systems, we find that the memory underutilization problem in HPC systems is severe. As unused memory is wasted memory, we propose exposing a compute node's unused memory to its CPU(s) through a user-transparent CPU-OS codesign. This can enable many new microarchitecture techniques that transparently leverage unused memory locations to help improve microarchitecture performance. We refer to these techniques as Free-memory-aware Microarchitecture Techniques (FMTs). In the context of HPC systems, we present a detailed example of an FMT called Free-memory-aware Replication (FMR). FMR replicates in-use data to unused memory locations to effectively reduce average memory read latency. On average across five HPC benchmark suites, FMR provides a 13% performance improvement and an 8% system-level energy improvement. / M.S. / Random-access memory (RAM), or simply memory, stores the temporary data of applications that run on a computer system. Its size is determined by the worst-case application workload that the computer system is supposed to run. Through our memory utilization study of four large multi-node high-performance computing (HPC) systems, we find that memory is severely underutilized in these systems. Unused memory is a wasted resource that does nothing. In this work, we propose techniques that can make use of this wasted memory to boost computer system performance. We call these techniques Free-memory-aware Microarchitecture Techniques (FMTs). We then present in detail an FMT for HPC systems called Free-memory-aware Replication (FMR), which provides a performance improvement of over 13%.
|
248 |
iLORE: Discovering a Lineage of Microprocessors / Furman, Samuel Lewis 29 June 2021 (has links)
Researchers, benchmarking organizations, and hardware manufacturers maintain repositories of computer component and performance information. However, this data is split across many isolated sources and is stored in a form that is not conducive to analysis. A centralized repository of said data would arm stakeholders across industry and academia with a tool to more quantitatively understand the history of computing. We propose iLORE, a data model designed to represent intricate relationships between computer system benchmarks and computer components. We detail the methods we used to implement and populate the iLORE data model using data harvested from publicly available sources. Finally, we demonstrate the validity and utility of our iLORE implementation through an analysis of the characteristics and lineage of commercial microprocessors. We encourage the research community to interact with our data and visualizations at csgenome.org. / Master of Science
|
249 |
Complete Design Methodology of a Massively Parallel and Pipelined Memristive Stateful IMPLY Logic Based Reconfigurable Architecture / Rahman, Kamela Choudhury 06 June 2016 (has links)
Continued dimensional scaling of CMOS processes is approaching fundamental limits, and therefore alternative new devices and microarchitectures are being explored to address the growing need for area scaling and performance gain. New nanotechnologies, such as memristors, are emerging. Memristors can be used to perform stateful logic with nanowire crossbars, which allows for the implementation of very large binary networks that can be easily reconfigured. This research involves the design of a memristor-based massively parallel datapath for various applications, specifically SIMD-like (Single Instruction, Multiple Data) architectures and parallel pipelines. The dissertation develops a new model of massively parallel memristor-CMOS hybrid datapath architectures at a systems level, as well as a complete methodology to design them. One innovation of the proposed approach is that the datapath design is based on space-time diagrams that use stateful IMPLY gates built from binary memristors. This notation aids in circuit minimization in logic design, calculations of delay and memristor costs, and sneak-path avoidance. Another innovation of the proposed methodology is a general, new architecture model, MsFSMD (Memristive stateful Finite State Machine with Datapath), that has two interacting sub-systems: 1) a controller composed of a memristive RAM, MsRAM, that acts as a pulse generator, along with a finite state machine realized in CMOS, a CMOS counter, CMOS multiplexers and CMOS decoders; 2) a massively parallel, pipelined datapath realized with a new variant of a CMOL-like nanowire crossbar array, MsCMOL (Memristive stateful CMOL), with binary stateful memristor-based IMPLY gates. The next contribution of the dissertation is a new type of FPGA. In contrast to the previous memristor-based FPGA (mrFPGA), the proposed MsFPGA (Memristive stateful logic Field Programmable Gate Array) uses memristors for memory, connection programming, and combinational logic implementation.
With a regular structure of square abutting blocks of memristive nanowire crossbars and their short connections, the proposed architecture is highly reconfigurable. As an example of using the proposed new FPGA to realize biologically inspired systems, the detailed design of a pipelined Euclidean Distance processor is presented and its various applications are mentioned. Euclidean Distance calculation is widely used by many neural network and associative memory based algorithms.
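As a small illustration of the stateful IMPLY gates the abstract builds on, the sketch below models material implication on single bits and composes a NAND from two IMPLY steps plus a reset of a work bit to 0. It is a logical model only, with no circuit, crossbar, or timing detail from the dissertation; the function names `imply` and `nand_via_imply` and the work-bit variable `s` are chosen here for illustration.

```python
def imply(p: int, q: int) -> int:
    """Material implication on bits: (NOT p) OR q."""
    return (1 - p) | q


def nand_via_imply(p: int, q: int) -> int:
    """Build NAND from a reset (FALSE) and two IMPLY steps.

    In stateful logic the result of each IMPLY overwrites the target
    device's state; the variable s plays that role in this model.
    """
    s = 0            # FALSE: reset the work bit
    s = imply(q, s)  # s = NOT q
    s = imply(p, s)  # s = (NOT p) OR (NOT q) = NAND(p, q)
    return s
```

Because NAND is functionally complete, this pair of operations (IMPLY and FALSE) is sufficient in principle to express any Boolean function.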
|
250 |
The Design of a Simple, Spiking Sparse Coding Algorithm for Memristive Hardware / Woods, Walt 11 March 2016 (has links)
Calculating a sparse code for signals with high dimensionality, such as high-resolution images, takes substantial time to compute on a traditional computer architecture. Memristors present the opportunity to combine storage and computing elements into a single, compact device, drastically reducing the area required to perform these calculations. This work focused on the analysis of two existing sparse coding architectures, one of which utilizes memristors, as well as the design of a new, third architecture that employs a memristive crossbar. These architectures implement either a non-spiking or spiking variety of sparse coding based on the Locally Competitive Algorithm (LCA) introduced by Rozell et al. in 2008. Each architecture receives an arbitrary number of input lines and drives an arbitrary number of output lines. Training of the dictionary used for the sparse code was implemented through external control signals that approximate Oja's rule. The resulting designs were capable of representing input in real-time: no resets would be needed between frames of a video, for instance, though some settle time would be needed. The spiking architecture proposed is novel, emphasizing simplicity to achieve lower power than existing designs.
The architectures presented were tested for their ability to encode and reconstruct 8 x 8 patches of natural images. The proposed network reconstructed patches with a normalized root-mean-square error of 0.13, while a more complicated CMOS-only approach yielded 0.095, and a non-spiking approach yielded 0.074. Having several outputs compete for representation of the input was shown to improve reconstruction quality and preserve more subtle components in the final encoding; the proposed algorithm lacks this feature. Steps to address this, by scaling input spikes according to the current expected residual without adding much complexity, were proposed for future work. The architectures were also tested with the MNIST digit database, passing a sparse code onto a basic classifier. The proposed architecture scored 81% on this test, a CMOS-only spiking variant scored 76%, and the non-spiking algorithm scored 85%. Power calculations were made for each design and compared against other publications. The overall findings showed great promise for spiking memristor-based ASICs, which consumed only 28% of the power used by non-spiking architectures and 6.6% as much power as a CMOS-only spiking architecture on this task. The spike-based nature of the novel design was also parameterized into several intuitive parameters that could be adjusted to prefer either performance or power efficiency.
The design and analysis of architectures for sparse coding should greatly reduce the amount of future work needed to implement an end-to-end classification pipeline for images or other signal data. When lower power is a primary concern, the proposed architecture should be considered as it surpassed other published algorithms. These pipelines could be used to provide low-power visual assistance, highlighting objects within high-definition video frames in real-time. The technology could also be used to help self-driving cars identify hazards more quickly and efficiently.
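For readers unfamiliar with the Locally Competitive Algorithm referenced above, the following is a minimal software sketch of the non-spiking LCA dynamics, in which each unit is driven by its match to the input and inhibited by active units with overlapping dictionary atoms. It is not the thesis's architecture: the soft-threshold activation, the Euler step size, the iteration count, and the names `lca` and `soft_threshold` are assumptions made for this example, and the dictionary atoms are assumed to be unit-norm.

```python
def soft_threshold(u: float, lam: float) -> float:
    """Shrink u toward zero by lam; values within [-lam, lam] become 0."""
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def lca(x, dictionary, lam=0.1, step=0.1, iters=200):
    """Euler-integrate du/dt = b - u - (G - I) a, with a = T_lam(u).

    dictionary is a list of unit-norm atoms, each the same length as x.
    """
    n = len(dictionary)
    b = [dot(phi, x) for phi in dictionary]               # feed-forward drive
    gram = [[dot(dictionary[i], dictionary[j]) for j in range(n)]
            for i in range(n)]                            # atom overlaps
    u = [0.0] * n                                         # internal states
    for _ in range(iters):
        a = [soft_threshold(ui, lam) for ui in u]
        u = [ui + step * (b[i] - ui
                          - sum(gram[i][j] * a[j] for j in range(n) if j != i))
             for i, ui in enumerate(u)]
    return [soft_threshold(ui, lam) for ui in u]
```

With an orthonormal dictionary there is no lateral inhibition, so each coefficient simply settles to the thresholded projection of the input onto its atom; overlapping atoms are where the competition described in the abstract comes into play.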
|