1 |
Workload-adaptation in memory controllers. Ghasempour, Mohsen. January 2015.
Advances in processor design, the growing heterogeneity of computer systems (through Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs) and custom accelerators) and the increasing number of cores and threads all put extra pressure on main memory, demanding higher performance. In modern computer systems, main memory is generally Dynamic Random Access Memory (DRAM), which is organised as a multi-level access hierarchy (e.g. rank, bank, row). This structure implies different access latencies (and power consumption), so performance varies with the memory access pattern. DRAM controllers manage access, satisfy the timing constraints, and now employ complex scheduling and prediction algorithms to mitigate the effect on performance. This complexity can limit how well a controller scales with memory size while maintaining performance. The focus of this PhD thesis is to improve the performance, reliability and scalability (with respect to memory size) of DRAM controllers. To this end, it covers three significant contributors to the performance and reliability of a memory controller: address mapping, page-closure policies and reliability monitoring. A detailed DRAM simulator is used as the evaluation platform throughout this work. The following contributions are presented in this thesis.
Hybrid Address-based Page PolicY (HAPPY): Memory controllers have used static page-closure policies to decide whether a row should be left open (open-page policy) or closed immediately (close-page policy) after use. The appropriate choice can reduce the average memory latency. Since access patterns are dynamic, static page policies cannot guarantee optimal execution time. Hybrid page policies, now implemented in state-of-the-art processors, cover dynamic scenarios: they switch between open-page and close-page policies by monitoring the pattern of row hits and conflicts and predicting future behaviour. Unfortunately, as DRAM capacity grows, fine-grained tracking and analysis of accesses becomes impractical. HAPPY is a compact, memory-address-based encoding technique that maintains or improves page-closure predictor performance while reducing the hardware overhead. As a case study, HAPPY is integrated with a state-of-the-art monitor (the Intel-adaptive open-page policy predictor employed by the Intel Xeon X5650) and with a traditional hybrid page policy. The experimental results show that applying the HAPPY encoding to the Intel-adaptive page-closure policy reduces the hardware overhead by 5x for the evaluated 64 GB memory (up to 40x for a 512 GB memory) while maintaining prediction accuracy.
Dynamic Re-arrangement of Address Mapping (DReAM): The initial location of data in DRAM is determined and controlled by the address mapping, and even modern memory controllers use a fixed, runtime-agnostic mapping. The access pattern seen at the memory interface, however, changes dynamically at run-time. This mismatch between a dynamic access pattern and a fixed address-mapping scheme means that DRAM performance cannot be exploited efficiently. DReAM is a novel hardware technique that derives a workload-specific address mapping at run-time from the application's access pattern. The experimental results show that DReAM outperforms the best evaluated baseline address mapping by 5% on average, and by up to 28%, across all the workloads.
A Run-time Memory hot-row detectOR (ARMOR): DRAM needs refreshing to avoid data loss. Data can also be corrupted within a refresh interval by crosstalk caused by repeated accesses to neighbouring rows; this is the row-hammer effect, which is perceived as a potentially serious reliability and security threat. ARMOR improves memory reliability by detecting, within the memory controller, which rows are potentially being "hammered", so that the controller can insert extra refresh operations. It can detect (and thus prevent) row-hammer errors with minimal execution-time overhead and hardware requirements. Alternatively, by adding buffers inside the memory controller to cache such hammered rows, execution times are reduced at small hardware cost. The ARMOR technique is now the basis of a patent application and is in the process of commercial exploitation.
As a final step of this PhD thesis, an adaptive memory controller was developed that integrates HAPPY, DReAM and ARMOR into a standard memory controller. Its performance and implementation cost were compared against a state-of-the-art memory controller as a baseline. The experimental results show that the adaptive memory controller outperforms the baseline by 18% on average, and by up to 35% for some workloads, while requiring roughly 6 KB to 900 KB more storage than the baseline to support a wide range of memory sizes (from 4 GB up to 512 GB).
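To make the ARMOR idea concrete, here is a minimal C++ sketch of counter-based hot-row detection; the counter table, the hammer threshold and the choice of refreshing the two physically adjacent rows are illustrative assumptions, not the thesis's actual design.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Counter-based hot-row detector: count activations per row within one
// refresh window and flag the physical neighbours of any row whose count
// crosses the hammer threshold, so extra refreshes can be scheduled.
class HotRowDetector {
public:
    explicit HotRowDetector(uint32_t threshold) : threshold_(threshold) {}

    // Called on every row activation; returns victim rows (if any) that
    // should receive an out-of-band refresh.
    std::vector<uint64_t> onActivate(uint64_t row) {
        std::vector<uint64_t> victims;
        if (++counts_[row] >= threshold_) {
            counts_[row] = 0;                        // reset after mitigation
            if (row > 0) victims.push_back(row - 1); // adjacent rows are the
            victims.push_back(row + 1);              // crosstalk victims
        }
        return victims;
    }

    // Counts only matter within one retention window (e.g. 64 ms), so
    // clear them when the regular refresh cycle completes.
    void endRefreshWindow() { counts_.clear(); }

private:
    uint32_t threshold_;                             // tolerated activations
    std::unordered_map<uint64_t, uint32_t> counts_;  // row -> activation count
};
```

A real controller would bound the table (ARMOR's contribution is precisely keeping such tracking cheap); the unbounded map here is only for clarity.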
|
2 |
Implementation of Hierarchical Architecture of Basic Memory Modules. Yang, Shang-da. 11 September 2008.
In system-on-chip designs, memories store the data accessed by processing modules, so memory access time can affect overall system performance significantly. In this research, we implemented a configurable architecture for a basic memory module and its constituent parts: memory interface, memory controller, memory array, row buffer, row decoder and column decoder. Using this configurable architecture to explore various memory module designs, we can effectively reduce design time and improve the access time of memory module designs. We realised these functionalities in the SystemC language and performed configurability experiments.
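As an illustration of the kind of configurable module explored here, the following sketch models a memory array with a row buffer, where geometry and hit/miss latencies are configuration parameters. The thesis itself uses SystemC; this is plain C++, and all names and values below are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

// Geometry and latencies are configuration parameters, mirroring the
// idea of a configurable basic memory module.
struct MemConfig {
    uint32_t rows = 4096, cols = 1024;
    uint32_t hitCycles = 2;    // access that hits the open row buffer
    uint32_t missCycles = 12;  // access that must activate a new row
};

class MemoryModule {
public:
    explicit MemoryModule(const MemConfig& c)
        : cfg_(c), cells_(uint64_t(c.rows) * c.cols, 0) {}

    // Reads one byte and reports the access latency in cycles, which
    // depends on whether the addressed row is already in the row buffer.
    uint8_t read(uint32_t row, uint32_t col, uint32_t& cycles) {
        cycles = (row == openRow_) ? cfg_.hitCycles : cfg_.missCycles;
        openRow_ = row;  // the row buffer now holds this row
        return cells_[uint64_t(row) * cfg_.cols + col];
    }

    void write(uint32_t row, uint32_t col, uint8_t v, uint32_t& cycles) {
        cycles = (row == openRow_) ? cfg_.hitCycles : cfg_.missCycles;
        openRow_ = row;
        cells_[uint64_t(row) * cfg_.cols + col] = v;
    }

private:
    MemConfig cfg_;
    std::vector<uint8_t> cells_;
    uint32_t openRow_ = UINT32_MAX;  // sentinel: no row open yet
};
```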
|
3 |
Prototyping Hardware-compressed Memory for Multi-tenant Systems. Liu, Yuqing. 18 October 2023.
Software memory compression has long been common practice in operating systems. Prior works have explored hardware memory compression, reducing the load on the CPU by offloading compression to dedicated hardware. However, prior hardware memory compression cannot provide the isolation that is critical in multi-tenant systems such as cloud servers. Our evaluation of prior work (TMCC) shows that a tenant can be slowed down by more than 12x due to the lack of isolation.
This work, the Compressed Memory Management Unit (CMMU), prototypes hardware compression for multi-tenant systems and provides this critical isolation. First, CMMU allows the OS to control each tenant's usage of physical memory. Second, CMMU compresses a tenant's memory down to an OS-specified physical usage target. Finally, CMMU notifies the OS to start swapping memory to storage if it fails to compress the memory to the target.
We prototype CMMU with a real compression module on an FPGA board, running a Linux kernel modified to support CMMU. The prototype virtually expands memory capacity to 4x and stably supports the modified Linux kernel with multiple tenants and applications, while requiring only a few extra cycles of overhead beyond the essential data-structure accesses. ASIC synthesis results show CMMU fits within 0.00931 mm² of silicon and operates at 3 GHz while consuming 36.90 mW of power, a negligible cost for modern server systems. / Master of Science / Memory is a critical resource in computer systems, and memory compression is a common technique for saving it. Compression consumes computing resources, traditionally supplied by the CPU; in other words, memory compression traditionally competes with applications for CPU computing power. The prior work, TMCC, performs memory compression in ASIC hardware, so it no longer competes for CPU computing power. However, TMCC provides no isolation in a multi-tenant system like a modern cloud server.
This thesis prototypes a new design, the Compressed Memory Management Unit (CMMU), which provides isolation in hardware memory compression. The prototype can speed up applications by 12x compared with running without isolation, while virtually expanding memory capacity by 4x. CMMU stably supports a modified Linux OS, runs at a high clock speed, and adds little overhead in latency, silicon area and power.
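The enforcement loop described above (per-tenant usage target, compress first, ask the OS to swap as a last resort) can be sketched as follows; the class, callbacks and byte accounting are illustrative assumptions, not CMMU's real interface.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

struct Tenant {
    uint64_t physicalUsage = 0;  // bytes of physical memory in use
    uint64_t target = 0;         // OS-specified ceiling, in bytes
};

class Cmmu {
public:
    // compress returns how many bytes it reclaimed from the tenant;
    // notifyOs asks the kernel to start swapping this tenant to storage.
    Cmmu(std::function<uint64_t(Tenant&)> compress,
         std::function<void(Tenant&)> notifyOs)
        : compress_(std::move(compress)), notifyOs_(std::move(notifyOs)) {}

    void enforce(Tenant& t) {
        if (t.physicalUsage <= t.target) return;       // within budget
        t.physicalUsage -= compress_(t);               // try compression first
        if (t.physicalUsage > t.target) notifyOs_(t);  // compression fell short
    }

private:
    std::function<uint64_t(Tenant&)> compress_;
    std::function<void(Tenant&)> notifyOs_;
};
```

The key design point the abstract emphasises is the ordering: hardware compression absorbs pressure before the much slower OS-driven swap path is invoked.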
|
4 |
Design and prototyping of Hardware-Accelerated Locality-aware Memory Compression. Srinivas, Raghavendra. 09 September 2020.
Hardware acceleration is a widely used chip-design technique for achieving better performance and power efficiency in critical functions that are handled inefficiently by traditional OS/software. As technology advances, with 7 nm products already in the market providing better power and performance at low area, latency-critical functions traditionally handled in software are increasingly moving into on-chip acceleration units. This thesis describes the accelerator architecture, implementation and prototype for one such function, "locality-aware memory compression", which is part of the "OS-controlled memory compression" scheme actively deployed in today's OSes. In brief, OS-controlled memory compression is a memory-management feature that transparently, dramatically and adaptively increases effective main-memory capacity on demand as software-level memory usage grows beyond the physical memory system capacity. It has been adopted across almost all OSes (e.g., Linux, Windows, macOS, AIX) and almost all classes of computing systems (e.g., smartphones, PCs, data centers and cloud). The OS-controlled memory compression scheme is locality-aware, but applications still experience long-latency page faults when accessing compressed memory. To remove this performance bottleneck, an acceleration technique is proposed that manages locality-aware memory compression within hardware, enabling applications to access their OS-compressed memory directly. This accelerator is referred to as HALK throughout this work, which stands for "Hardware-accelerated Locality-aware Memory Compression"; the literal meaning of the word HALK in English is 'a hidden place'. As such, the accelerator is exposed neither to the OS nor to the running applications: it is hidden entirely in the memory controller hardware and incurs minimal hardware cost. This thesis develops an FPGA design prototype and gives a proof of concept of HALK's functionality by running non-trivial micro-benchmarks. It also provides and analyses the power, performance and area of HALK for an ASIC design (at the 7 nm technology node) and for the selected FPGA prototype design. / Master of Science / Memory capacity has become a scarce resource across many digital computing systems, spanning from smartphones to large-scale cloud systems, and the slowing improvement of memory capacity per dollar worsens this problem. To address it, almost all industry-standard OSes (Linux, Windows, macOS, etc.) implement memory compression to store more data in the same space. In today's systems this is handled in software, which is inefficient and suffers long latency, degrading user responsiveness. Hardware is faster at such computations than software, so a low-area, low-cost hardware implementation is preferred for its better performance and power efficiency. In the hardware world, modules that perform specifically targeted software functions are called accelerators. This thesis develops such a hardware accelerator for locality-aware memory compression, allowing applications to access compressed data directly, without OS intervention, thereby improving overall system performance. The proposed accelerator is locality-aware: the least recently allocated uncompressed page is picked for compression to free up space on demand, while the most recently allocated page is kept in uncompressed form.
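The locality-aware victim selection described in the last sentence can be sketched with a simple LRU structure; the list-based implementation below is an illustrative software stand-in for HALK's actual hardware structures.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Tracks uncompressed pages in recency order: the least recently used
// page is the next candidate for compression when space runs low.
class UncompressedPool {
public:
    // Mark a page as uncompressed and most recently used.
    void touch(uint64_t page) {
        auto it = pos_.find(page);
        if (it != pos_.end()) lru_.erase(it->second);
        lru_.push_front(page);
        pos_[page] = lru_.begin();
    }

    // Pick the LRU page as the next compression victim; returns false
    // if no uncompressed page remains.
    bool pickVictim(uint64_t& page) {
        if (lru_.empty()) return false;
        page = lru_.back();
        pos_.erase(page);
        lru_.pop_back();
        return true;
    }

private:
    std::list<uint64_t> lru_;  // front = most recent, back = least recent
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;
};
```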
|
5 |
Increasing memory access efficiency through a two-level memory controller. Linck, Marcelo Melo. 22 March 2018.
escalabilidade. / Simultaneous accesses generated by memory clients in a System-on-Chip (SoC) to a single memory device impose challenges that require extra attention due to the performance bottleneck created. When considering these clients as processors, this issue becomes more evident, because the growth rate in speed for processors exceeds the same rate for memory devices, creating a performance gap. In this scenario, memory-controlling strategies are necessary to improve system performances. Studies have proven that the main cause of processor execution lagging is the memory communication. Therefore, the main contribution of this work is the implementation of a memory-controlling architecture composed of two levels: priority and memory. The priority level is responsible for interfacing with clients and scheduling memory requests according to a fixed-priority algorithm. The memory level is responsible for reordering requests and guaranteeing memory access isolation to high-priority clients. The main objective of this work is to provide latency reductions to high-priority clients in a scalable system. Experiments in this work have been conducted considering the behavioral simulation of the proposed architecture through a software simulator. The evaluation of the proposed work is divided into four parts: latency evaluation, row-hit evaluation, runtime evaluation and scalability evaluation.
|
6 |
DRAM Controller Benchmarking. Winberg, Ulf. January 2009.
In recent years, flat-screen TVs, such as LCD and plasma, have come to completely dominate the television market. In a SoC solution for digital TVs, several processors are used to obtain a decent image quality. Some of the processors need temporal information, which means that whole frames need to be stored in memory, which in turn motivates the use of SDRAM memory. As demands for resolution and image quality grow, greater pressure is put on the performance of the SoC memory subsystem not to become a bottleneck of the system. In this master thesis project, a model of an existing SoC for digital TVs is used to benchmark and evaluate the performance of an SDRAM memory controller architecture study. The two major features are the ability to reorder transactions and compatibility with DDR3. By reordering transactions, the memory controller is free to service memory requests in an order that decreases bank conflicts and read/write turnarounds. Measurements show that a utilisation of 86.5% of the total available bandwidth can be achieved, which is 18.5 percentage points more than an existing non-reordering memory controller developed by NXP.
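One common way to realise the reordering described here, sketched below, is to batch transactions by direction so that read/write turnarounds are amortised; the watermark value and queue structure are illustrative assumptions rather than the evaluated architecture.

```cpp
#include <cstddef>
#include <deque>
#include <optional>

struct Txn { bool isWrite; unsigned bank; unsigned row; };

// Direction-batched reordering: keep servicing the current direction,
// switching only when it runs dry or pending writes cross a watermark.
class ReorderingQueue {
public:
    void push(const Txn& t) { (t.isWrite ? writes_ : reads_).push_back(t); }

    std::optional<Txn> next() {
        if (!servingWrites_ && (reads_.empty() || writes_.size() >= kWatermark))
            servingWrites_ = !writes_.empty();   // switch to draining writes
        else if (servingWrites_ && writes_.empty())
            servingWrites_ = false;              // writes drained; back to reads
        auto& q = servingWrites_ ? writes_ : reads_;
        if (q.empty()) return std::nullopt;
        Txn t = q.front();
        q.pop_front();
        return t;
    }

private:
    static constexpr size_t kWatermark = 16;  // illustrative switch point
    std::deque<Txn> reads_, writes_;
    bool servingWrites_ = false;
};
```

A fuller model would also reorder within each direction to avoid bank conflicts; batching by direction alone already removes most bus turnaround penalties.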
|
7 |
A Preliminary Exploration of Memory Controller Policies on Smartphone Workloads. Narancic, Goran. 26 November 2012.
This thesis explores memory performance for smartphone workloads. We design a Video Conference Workload (VCW) to model typical smartphone usage. We describe a trace-based methodology which uses a software implementation to mimic the behaviour of specialised hardware accelerators. Our methodology stores dataflow information from the original application to maintain the relationships between requests.
We first study seven address mapping schemes with our VCW, using a first-ready, first-come-first-served (FR-FCFS) memory scheduler. Our results show the best-performing scheme is up to 82% faster than the worst. The VCW is memory intensive, with up to 86.8% bandwidth utilisation using the best-performing scheme. We also test a Web Browsing workload and a set of computer vision workloads; most are not memory intensive, with utilisation under 15%.
Finally, we compare four schedulers and find that the FR-FCFS scheduler using the Write Drain mode [8] performs best, outperforming the worst scheduler by 6.3%.
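An address-mapping scheme of the kind compared here simply slices the physical address into DRAM coordinates; the sketch below shows one such slicing. The field widths and bit order are illustrative, not any of the seven evaluated schemes.

```cpp
#include <cstdint>

struct DramCoord { uint32_t row, rank, bank, col; };

// One possible scheme ("row:rank:bank:col"): low-order bits select the
// column, then bank, then rank, with the remaining high bits as the row.
// Placing bank bits between column and row interleaves consecutive
// row-sized blocks across banks, reducing bank conflicts for streams.
DramCoord mapAddress(uint64_t phys) {
    constexpr unsigned kColBits = 10, kBankBits = 3, kRankBits = 1;
    DramCoord c;
    c.col  = phys & ((1u << kColBits) - 1);
    phys >>= kColBits;
    c.bank = phys & ((1u << kBankBits) - 1);
    phys >>= kBankBits;
    c.rank = phys & ((1u << kRankBits) - 1);
    phys >>= kRankBits;
    c.row  = static_cast<uint32_t>(phys);  // remaining high-order bits
    return c;
}
```

Swapping which address bits feed which field is exactly what distinguishes the compared schemes, which is why the best and worst can differ by as much as 82% on a memory-intensive workload.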
|
8 |
Núcleos de interface de memória DDR SDRAM para sistemas-em-chip (DDR SDRAM memory interface cores for systems-on-chip). Bonatto, Alexsandro Cristóvão. January 2009.
Many integrated systems-on-chip (SoC), especially those dedicated to multimedia applications, process large amounts of data stored in memory, and the performance of the memory ports directly affects the performance of the system. Better use of data storage and reduction of the cost and power consumption of electronic systems encourage the development of efficient memory-controller architectures, for both embedded and external memories. In video processing systems, for example, large memory arrays are needed to store several video frames while compression algorithms search for redundancies. In an FPGA implementation it is possible to use the memory blocks available inside the FPGA, but these hold only a few megabytes of data. To increase data storage capacity it is necessary to use external memory devices, and a memory-controller intellectual property (IP) core is required. Its development, however, is a very complex task, and a custom solution is not always possible. Using an FPGA for system prototyping lets the developer rapidly integrate modules into a hardware version, and in such a complex system design, test is an important issue. High-speed memory controllers are very sensitive to gate and routing delays, and a synthesis from a hardware description language (HDL) must be verified to comply with predefined timing specifications. To overcome these problems, a DDR SDRAM controller IP was developed that integrates a BIST (Built-In Self-Test) function, where the memory test is used to check the correct functioning of the DDR controller.
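The BIST function can be illustrated with a simplified march-style memory test of the kind classically used for this purpose; the read/write callbacks below stand in for the controller's DDR interface and are assumptions for illustration, not the thesis's actual BIST logic.

```cpp
#include <cstdint>
#include <functional>

using Read  = std::function<uint8_t(uint64_t)>;
using Write = std::function<void(uint64_t, uint8_t)>;

// Simplified march test: sweep up writing a background pattern, then
// verify-and-flip in both directions. A stuck cell, a bad data line or
// a timing fault in the controller's read/write path fails a check.
bool marchTest(uint64_t size, const Read& rd, const Write& wr) {
    // Ascending: write 0x00 everywhere.
    for (uint64_t a = 0; a < size; ++a) wr(a, 0x00);
    // Ascending: read 0x00, write 0xFF.
    for (uint64_t a = 0; a < size; ++a) {
        if (rd(a) != 0x00) return false;
        wr(a, 0xFF);
    }
    // Descending: read 0xFF, write 0x00.
    for (uint64_t a = size; a-- > 0; ) {
        if (rd(a) != 0xFF) return false;
        wr(a, 0x00);
    }
    // Descending: final read of 0x00.
    for (uint64_t a = size; a-- > 0; )
        if (rd(a) != 0x00) return false;
    return true;
}
```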
|