Global ETD Search

1	A Study on Flat-Address-Space Heterogeneous Memory Architectures Islam, Mahzabeen 05 1900 (has links) In this dissertation, we present a number of studies that primarily focus on data movement challenges among different types of memories (viz., 3D-DRAM, DDRx DRAM and NVM) employed together as a flat-address heterogeneous memory system. We introduce two different hardware-based techniques for prefetching data from slow off-chip phase change memory (PCM) to fast on-chip memories. The prefetching techniques efficiently fetch data from PCM and place that data into processor-resident or 3D-DRAM-resident buffers without putting high demand on bandwidth and provide significant performance improvements. Next, we explore different page migration techniques for flat-address memory systems which differ in when to migrate pages (i.e., periodically or instantaneously) and how to manage the migrations (i.e., OS-based or hardware-based approach). In the first page migration study, we present several epoch-based page migration policies for different organizations of flat-address memories consisting of two (2-level) and three (3-level) types of memory modules. These policies have resulted in significant energy savings. In the next page migration study, we devise an efficient "on-the-fly'" page migration technique which migrates a page from slow PCM to fast 3D-DRAM whenever it receives a certain number of memory accesses without waiting for any specific time interval. Furthermore, we present a light-weight hardware-assisted address reconciliation process for address management of the migrated pages. Such an on-the-fly page migration with hardware-assisted address reconciliation technique provides significant performance improvement over systems using epoch-based page migration and OS-based address management. Finally, we have developed an analytical model, which employs offline analyses of memory access counts per page and recommends whether an application is migration friendly or not. This can be useful in deciding if page migration (either epoch-based or on-the-fly based) should be used or turned off for a given application. Thus, our data management techniques and model enable significant performance improvements for flat-address heterogeneous memory systems involving NVMs. heterogeneous memory flat-address memory non-volatile memory prefetching page migration
2	Architecting heterogeneous memory systems with 3D die-stacked memory Sim, Jae Woong 21 September 2015 (has links) The main objective of this research is to efficiently enable 3D die-stacked memory and heterogeneous memory systems. 3D die-stacking is an emerging technology that allows for large amounts of in-package high-bandwidth memory storage. Die-stacked memory has the potential to provide extraordinary performance and energy benefits for computing environments, from data-intensive to mobile computing. However, incorporating die-stacked memory into computing environments requires innovations across the system stack from hardware and software. This dissertation presents several architectural innovations to practically deploy die-stacked memory into a variety of computing systems. First, this dissertation proposes using die-stacked DRAM as a hardware-managed cache in a practical and efficient way. The proposed DRAM cache architecture employs two novel techniques: hit-miss speculation and self-balancing dispatch. The proposed techniques virtually eliminate the hardware overhead of maintaining a multi-megabytes SRAM structure, when scaling to gigabytes of stacked DRAM caches, and improve overall memory bandwidth utilization. Second, this dissertation proposes a DRAM cache organization that provides a high level of reliability for die-stacked DRAM caches in a cost-effective manner. The proposed DRAM cache uses error-correcting code (ECCs), strong checksums (CRCs), and dirty data duplication to detect and correct a wide range of stacked DRAM failures—from traditional bit errors to large-scale row, column, bank, and channel failures—within the constraints of commodity, non-ECC DRAM stacks. With only a modest performance degradation compared to a DRAM cache with no ECC support, the proposed organization can correct all single-bit failures, and 99.9993% of all row, column, and bank failures. Third, this dissertation proposes architectural mechanisms to use large, fast, on-chip memory structures as part of memory (PoM) seamlessly through the hardware. The proposed design achieves the performance benefit of on-chip memory caches without sacrificing a large fraction of total memory capacity to serve as a cache. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits. Lastly, this dissertation explores a new usage model for die-stacked DRAM involving a hybrid of caching and virtual memory support. In the common case where system’s physical memory is not over-committed, die-stacked DRAM operates as a cache to provide performance and energy benefits to the system. However, when the workload’s active memory demands exceed the capacity of the physical memory, the proposed scheme dynamically converts the stacked DRAM cache into a fast swap device to avoid the otherwise grievous performance penalty of swapping to disk. Stacked DRAM Die-stacking Memory systems Heterogeneous memory Hardware management Computer architecture
3	Heterogeneous Graphene Nanoribbon-CMOS Multi-State Volatile Random Access Memory Fabric Khasanvis, Santosh 01 January 2012 (has links) (PDF) CMOS SRAM area scaling is slowing down due to several challenges faced by transistors at nanoscale such as increased leakage. This calls for new concepts and technologies to overcome CMOS scaling limitations. In this thesis, we propose a multi-state memory to store multiple bits in a single cell, enabled by graphene and graphene nanoribbon crossbar devices (xGNR). This could provide a new dimension for scaling. We present a new multi-state volatile memory fabric called Graphene Nanoribbon Tunneling Random Access Memory (GNTRAM) featuring a heterogeneous integration between graphene and CMOS. A latch based on the xGNR devices is used as the memory element which exhibits 3 stable states. We propose binary and ternary GNTRAM and compare them with respect to 16nm CMOS SRAM and 3T DRAM. Ternary GNTRAM (1.58 bits/cell) shows up to 1.77x density-per-bit benefit over CMOS SRAMs and 1.42x benefit over 3T DRAM in 16nm technology node. Ternary GNTRAM is also up to 1196x more power-efficient per bit against high-performance CMOS SRAMs during stand-by. To enable further scaling, we explore two approaches to increase the number of bits per cell. We propose quaternary GNTRAM (2 bits/cell) using these approaches and extensively benchmark these designs. The first uses additional xGNR devices in the latch to achieve 4 stable states and the quaternary memory shows up to 2.27x density benefit vs. 16nm CMOS SRAMs and 1.8x vs. 3T DRAM. It has comparable read performance in addition to being power-efficient, up to 1.32x during active period and 818x during stand-by against high performance SRAMs. However, the need for relatively high-voltage operation may ultimately limit this scaling approach. An alternative approach is also explored by increasing the stub length in the xGNR devices, which allows for storing 2 bits per cell without requiring an increased operating voltage. This approach for quaternary GNTRAM shows higher benefits in terms of power, specifically up to 4.67x in terms of active power and 3498x during stand-by against high-performance SRAMs. Multi-bit GNTRAM has the potential to realize high-density low-power nanoscale memories. Further improvements may be possible by using graphene more extensively, as graphene transistors become available in future. Graphene GNTRAM heterogeneous memory multistate memory graphene nanoribbons Computer Engineering Nanoscience and Nanotechnology
4	Energy-aware Thread and Data Management in Heterogeneous Multi-Core, Multi-Memory Systems Su, Chun-Yi 03 February 2015 (has links) By 2004, microprocessor design focused on multicore scaling"increasing the number of cores per die in each generation "as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems. I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access (NUMA) systems. I use critical path analysis to quantify memory contention in the NUMA memory system and determine thread mappings. In addition, I implement a runtime system that combines concurrent throttling and a novel thread mapping algorithm to manage thread resources and improve energy efficient execution in multi-core, NUMA systems. In addition, I propose an analytical model based on the queuing method that captures important factors in multi-core, multi-memory systems to quantify the tradeoff between performance and energy. The model considers the effect of these factors in a holistic fashion that provides a general view of performance and energy consumption in contemporary systems. Finally, I focus on resource management of future heterogeneous memory systems, which may combine two heterogeneous memories to scale out memory capacity while maintaining reasonable power use. I present a new memory controller design that combines the best aspects of two baseline heterogeneous page management policies to migrate data between two heterogeneous memories so as to optimize performance and energy. / Ph. D. Thread Management Multi-core Processors Performance Modeling and Analysis Power-Aware Computing Heterogeneous Memory Data Management
5	Optimisation des allocations de données pour des applications du Calcul Haute Performance sur une architecture à mémoires hétérogènes / Data Allocation Optimisation for High Performance Computing Application on Heterogeneous Architecture Brunie, Hugo 28 January 2019 (has links) Le Calcul Haute Performance, regroupant l’ensemble des acteurs responsables de l’amélioration des performances de calcul des applications scientifiques sur supercalculateurs, s’est donné pour objectif d’atteindre des performances exaflopiques. Cette course à la performance se caractérise aujourd’hui par la fabrication de machines hétérogènes dans lesquelles chaque composant est spécialisé. Parmi ces composants, les mémoires du système se spécialisent, et la tendance va vers une architecture composée de plusieurs mémoires aux caractéristiques complémentaires. La question se pose alors de l’utilisation de ces nouvelles machines dont la performance pratique dépend du placement des données de l’application sur les différentes mémoires. Dans cette thèse, nous avons développé une formulation du problème d’allocation de donnée sur une Architecture à Mémoires Hétérogènes. Dans cette formulation, nous avons fait apparaître le bénéfice que pourrait apporter une analyse temporelle du problème, parce que de nombreux travaux reposaient uniquement sur une approche spatiale. À partir de cette formulation, nous avons développé un outil de profilage hors ligne pour approximer les coefficients de la fonction objective afin de résoudre le problème d’allocation et d’optimiser l’allocation des données sur une architecture composée deux de mémoires principales aux caractéristiques complémentaires. Afin de réduire la quantité de modifications nécessaires pour prendre en compte la stratégie d’allocation recommandée par notre boîte à outils, nous avons développé un outil capable de rediriger automatiquement les allocations de données à partir d’un minimum d’instrumentation dans le code source. Les gains de performances obtenus sur des mini-applications représentatives des applications scientifiques codées par la communauté permet d’affirmer qu’une allocation intelligente des données est nécessaire pour bénéficier pleinement de ressources mémoires hétérogènes. Sur certaines tailles de problèmes, le gain entre un placement naïf est une allocation instruite peut atteindre un facteur ×3.75. / High Performance Computing, which brings together all the players responsible for improving the computing performance of scientific applications on supercomputers, aims to achieve exaflopic performance. This race for performance is today characterized by the manufacture of heterogeneous machines in which each component is specialized. Among these components, system memories specialize too, and the trend is towards an architecture composed of several memories with complementary characteristics. The question arises then of these new machines use whose practical performance depends on the application data placement on the different memories. Compromising code update against performance is challenging. In this thesis, we have developed a data allocation on Heterogeneous Memory Architecture problem formulation. In this formulation, we have shown the benefit of a temporal analysis of the problem, because many studies were based solely on a spatial approach this result highlight their weakness. From this formulation, we developed an offline profiling tool to approximate the coefficients of the objective function in order to solve the allocation problem and optimize the allocation of data on a composite architecture composed of two main memories with complementary characteristics. In order to reduce the amount of code changes needed to execute an application according to our toolbox recommended allocation strategy, we have developed a tool that can automatically redirect data allocations from a minimum source code instrumentation. The performance gains obtained on mini-applications representative of the scientific applications coded by the community make it possible to assert that intelligent data allocation is necessary to fully benefit from heterogeneous memory resources. On some problem sizes, the gain between a naive data placement strategy, and an educated data allocation one, can reach up to ×3.75 speedup. Profilage Allocation de données Memoires hétérogènes Optimisation combinatoire Data allocation Memory management Heterogeneous Memory Architecture Profiling Combinatorial optimization

1

Page generated in 0.071 seconds