Spelling suggestions: "subject:"allocated"" "subject:"allocate""
1 |
POSEIDON: The First Safe and Scalable Persistent Memory AllocatorDemeri, Anthony K. 20 May 2020 (has links)
With the advent of byte-addressable Non-Volatile Memory (NVMM), the need for a safe, scalable and high-performing memory allocator is inevitable. A slow memory allocator can bottleneck the entire application stack, while an unsecure memory allocator can render underlying systems and applications inconsistent upon program bugs or system failure. Unlike DRAM-based memory allocators, it is indispensable for an NVMM allocator to guarantee its heap metadata safety from both internal and external errors. An effective NVMM memory allocator should be 1) safe 2) scalable and 3) high performing. Unfortunately, none of the existing persistent memory allocators achieve all three requisites; critically, we also note: the de-facto NVMM allocator, Intel's Persistent Memory Development Kit (PMDK), is vulnerable to silent data corruption and persistent memory leaks as result of a simple heap overflow. We closely investigate the existing defacto NVMM memory allocators, especially PMDK, to study their vulnerability to metadata corruption and reasons for poor performance and scalability. We propose Poseidon, which is safe, fast and scalable. The premise of Poseidon revolves around providing a user application with per-CPU sub-heaps for scalability, while managing the heap metadata in a segregated fashion and efficiently protecting the metadata using a scalable hardware-based protection scheme, Intel's Memory Protection Keys (MPK). We evaluate Poseidon with a wide array of microbenchmarks and real-world benchmarks, noting: Poseidon outperforms the state-of-art allocators by a significant margin, showing improved scalability and performance, while also guaranteeing metadata safety. / Master of Science / Since the dawn of time, civilization has revolved around effective communication. From smoke signals to telegraphs and beyond, communication has continued to be a cornerstone of successful societies. Today, communication and collaboration occur, daily, on a global scale, such that even sub-second units of time are critical to successful societal operation. Naturally, many forms of modern communication revolve around our digital systems, such as personal computers, email servers, and social networking database applications. There is, thus, a never-ending surge of digital system development, constantly striving toward increased performance. For some time, increasing a system's dynamic random-access memory, or DRAM, has been able to provide performance gains; unfortunately, due to thermal and power constraints, such an increase is no longer feasible. Additionally, loss of power on a DRAM system causes bothersome loss of data, since the memory storage is volatile to power loss. Now, we are on the advent of an entirely new physical memory system, termed non-volatile main memory (NVMM), which has near identical performance properties to DRAM, but is operational in much larger quantities, thus allowing increased overall system speed. Alas, such a system also imposes additional requirements upon software developers; since, for NVMM, all memory updates are permanent, such that a failed update can cause persistent memory corruption. Regrettably, the existing software standard, led by Intel's Persistent Memory Development Kit (PMDK), is both unsecure (allowing for permanent memory corruption, with ease), low performance, and a bottleneck for multicore systems. Here, we present a secure, high performing solution, termed Poseidon, which harnesses the full potential of NVMM.
|
2 |
A Memory-Realistic SPM Allocator with WCET/ACET Tunable PerformanceBai, Jia-yu 16 September 2010 (has links)
Real-time systems often use SPM instead of cache, because SPM allows a program¡¦s run time to be more predictable. Real-time system need predictable runtimes, because they must schedule programs to finish within specific deadlines. A deadline should be larger than its program¡¦s worst-case execution time (WCET).
Our laboratory is conducting ongoing research into scratchpad memory allocation (SPM) for reducing the WCET of a program. Compared to our previous work, this current thesis improves our memory model, our allocation algorithms, our real-time support, and our measurement benchmarks and platform.
Our key accomplishments in this paper are to: 1) add, for the first time in the literature, true WCETmeas analysis to an SPM allocator, 2) to modestly improve the performance of our previous allocator, and 3) to greatly increase the applicability over that allocator, by extending the method to support recursive programs.
|
3 |
Dynamic Memory Management for Embedded Real-Time Multiprocessor System-on-a-ChipShalan, Mohamed A. 25 November 2003 (has links)
The aggressive evolution of the semiconductor industry smaller process geometries, higher densities, and greater chip complexity has provided design engineers the means to create complex, high-performance System-on-a-Chip (SoC) designs. Such SoC designs typically have more than one processor and huge (tens of Mega Bytes) amount of memory, all on the same chip. Dealing with the global on-chip memory allocation/deallocation in a dynamic yet deterministic way is an important issue for upcoming billion transistor multiprocessor SoC designs. To achieve this, we propose a memory management hierarchy we call Two-Level Memory Management. To implement this memory management scheme which presents a shift in the way designers look at on-chip dynamic memory allocation we present the System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU) for allocation of the global on-chip memory, which we refer to as Level Two memory management (Level One is the management of memory allocated to a particular on-chip Processing Element, e.g., an operating systems management of memory allocated to a particular processor). In this way, processing elements (heterogeneous or non-heterogeneous hardware or software) in an SoC can request and be granted portions of the global memory in a fast and deterministic time. A new tool is introduced to generate a custom optimized version of the SoCDMMU hardware. Also, a real-time operating system is modified support the new proposed SoCDMMU. We show an example where shared memory multiprocessor SoC that employs the Two-Level Memory Management and utilizes the SoCDMMU has an overall average speedup in application transition time as well as normal execution time.
|
4 |
A High Performance Register Allocator for Vector Architectures with a Unified Register-SetSu, Yu-Dan 29 June 2012 (has links)
This thesis describes a compiler optimization targeted for machines with unified, vector-based register sets. This optimization combines register allocation and instruction scheduling. It examines places where the code performs computations on scalar variables. The goal is to identify instances where the same operation is performed. For example, a program might calculate ¡§base+offset¡¨ and then calculate ¡§i+j¡¨. Even though these computations are unrelated, yet they use the same operator; if ¡§base¡¨ and ¡§i¡¨ are packed into one vector register, while ¡§offset¡¨ and ¡§j¡¨ are packed into another, then these two computations can be performed simultaneously through the vectors¡¦ parallel addition operation. This would reduce the execution time of the compiled code.
Although other researchers have considered similar packing methods, their work has been limited by the hardware that they were studying. Such hardware usually imposed high costs for moving data between scalar and vector register banks. This present thesis, however, considers a novel hardware architecture that imposes no such costs. As a consequence, we are able to obtain significant speedups.
The architecture that we consider is a Graphics Processing Unit (GPU) for embedded systems that is under development at this university. This GPU has a single register set for integers, float, and vectors.
|
5 |
Techniques for formal modelling and verification on dynamic memory allocators / Techniques de modélisation et de vérification formelles des allocateurs de mémoire dynamiquesFang, Bin 10 September 2018 (has links)
Cette thèse est une contribution à la spécification et à la vérification formelles des allocateurs de mémoire dynamiques séquentiels (SDMA, en abrégé), qui sont des composants clés des systèmes d'exploitation ou de certaines bibliothèques logiciel. Les SDMA gèrent la partie tas de la mémoire des processus. Leurs implémentations utilisent à la fois des structures de données complexes et des opérations de bas niveau. Cette thèse se concentre sur les SDMA qui utilisent des structures de données de type liste pour gérer les blocs du tas disponibles pour l'allocation (SDMA à liste).La première partie de la thèse montre comment obtenir des spécifications formelles de SDMA à liste en utilisant une approche basée sur le raffinement. La thèse définit une hiérarchie de modèles classés par la relation de raffinement qui capture une grande variété de techniques et de politiques employées par le implémentations réelles de SDMA. Cette hiérarchie forme une théorie algorithmique pour les SDMA à liste et pourrait être étendue avec d'autres politiques. Les spécifications formelles sont écrites en Event-B et les raffinements ont été prouvés en utilisant la plateforme Rodin. La thèse étudie diverses applications des spécifications formelles obtenues: le test basé sur des modèles, la génération de code et la vérification.La deuxième partie de la thèse définit une technique de vérification basée sur l'interprétation abstraite. Cette technique peut inférer des invariants précis des implémentations existantes de SDMA. Pour cela, la thèse définit un domaine abstrait dont les valeurs representent des ensembles d'états du SDMA. Le domaine abstrait est basé sur un fragment de la logique de séparation, appelé SLMA. Ce fragment capture les propriétés liées à la forme et au contenu des structures de données utilisées par le SDMA pour gérer le tas. Le domaine abstrait est défini comme un produit spécifique d'un domaine abstrait pour graphes du tas avec un domaine abstrait pour des sequences finies d'adresses mémoire. Pour obtenir des valueurs abstraites compactes, la thèse propose une organisation hiérarchique des valeurs abstraites: un premier niveau abstrait la liste de tous les blocs mémoire, alors qu'un second niveau ne sélectionne que les blocs disponibles pour l’allocation. La thèse définit les transformateurs des valeurs abstraites qui capturent la sémantique des instructions utilisées dans les implémentations des SDMA. Un prototype d'implémentation de ce domaine abstrait a été utilisé pour analyser des implémentations simples de SDMA. / The first part of the thesis demonstrates how to obtain formal specifications of free-list SDMA using a refinement-based approach. The thesis defines a hierarchy of models ranked by the refinement relation that capture a large variety of techniques and policies employed by real-work SDMA. This hierarchy forms an algorithm theory for the free-list SDMA and could be extended with other policies. The formal specifications are written in Event-B and the refinements have been proved using the Rodin platform. The thesis investigates applications of the formal specifications obtained, such as model-based testing, code generation and verification.The second part of the thesis defines a technique for inferring precise invariants of existing implementations of SDMA based abstract interpretation. For this, the thesis defines an abstract domain representing sets of states of the SDMA. The abstract domain is based on a fragment of Separation Logic, called SLMA. This fragment captures properties related with the shape and the content of data structures used by the SDMA to manage the heap. The abstract domain is defined as a specific product of an abstract domain for heap shapes with an abstract domain for finite arrays of locations. To obtain compact elements of this abstract domain, the thesis proposes an hierarchical organisation of the abstract values: a first level abstracts the list of all chunks while a second level selects only the chunks available for allocation. The thesis defines transformers of the abstract values that soundly capture the semantics of statements used in SDMA implementations. A prototype implementation of this abstract domain has been used to analyse simple implementations of SDMA
|
6 |
Simple, safe, and efficient memory management using linear pointersLiu, Likai 22 January 2016 (has links)
Efficient and safe memory management is a hard problem. Garbage collection promises automatic memory management but comes with the cost of increased memory footprint, reduced parallelism in multi-threaded programs, unpredictable pause time, and intricate tuning parameters balancing the program's workload and designated memory usage in order for an application to perform reasonably well. Existing research mitigates the above problems to some extent, but programmer error could still cause memory leak by erroneously keeping memory references when they are no longer needed. We need a methodology for programmers to become resource aware, so that efficient, scalable, predictable and high performance programs may be written without the fear of resource leak.
Linear logic has been recognized as the formalism of choice for resource tracking. It requires explicit introduction and elimination of resources and guarantees that a resource cannot be implicitly shared or abandoned, hence must be linear. Early languages based on linear logic focused on Curry-Howard correspondence. They began by limiting the expressive powers of the language and then reintroduced them by allowing controlled sharing which is necessary for recursive functions. However, only by deviating from Curry-Howard correspondence could later development actually address programming errors in resource usage.
The contribution of this dissertation is a simple, safe, and efficient approach introducing linear resource ownership semantics into C++ (which is still a widely used language after 30 years since inception) through linear pointer, a smart pointer inspired by linear logic. By implementing various linear data structures and a parallel, multi-threaded memory allocator based on these data structures, this work shows that linear pointer is practical and efficient in the real world, and that it is possible to build a memory management stack that is entirely leak free. The dissertation offers some closing remarks on the difficulties a formal system would encounter when reasoning about a concurrent linear data algorithm, and what might be done to solve these problems.
|
7 |
Efficient Connection Allocator in Network-on-ChipNam, Seungseok 20 June 2022 (has links)
As semiconductor technologies develop, a System-on-Chip (SoC) that integrates all semiconductor intellectual property (IP) cores is suggested and widely used for various applications. A traditional bus interconnection does not support transmitting data between IP cores for high performance. Because of this reason, a Network-on-Chip (NoC) has been suggested to provide an efficient and scalable solution to interconnect among all IP cores. High throughput and low latency have recently become the main important factors of NoC for achieving hard guaranteed real-time systems. In order to guarantee these factors and provide real-time service (i.e., Guaranteed Service, GS), the circuit switching (CS) approach has been widely utilized. The CS approach allocates mutually exclusive paths to transmitting data between different sources and destinations using dedicated NoC resources. However, the exclusive occupancy of the allocated path reduces the efficiency of the overall use of NoC resources. In order to solve this problem, Space-Division-Multiplexing (SDM) and Time-Division-Multiplexing (TDM) techniques have been suggested. SDM implements a circuit switching technique by assigning physically different NoC-links between different connections. Path connections of the SDM technique based on spatial resources assignment do not provide high scalability. In contrast to this, using virtual time slots for a path connection, the TDM technique can share physical links between exclusively established connections, thereby improving NoC path diversity.
For all of these mentioned techniques, the factor that significantly impacts the system efficiency or performance scaling is how the path is allocated. In recent years, a dynamic connection allocation approach that can cope with highly dynamic workloads has been gaining attention due to the sudden and diverse demands of applications in real-time systems. There are two groups in the dynamic connection allocation approach. One is a distributed allocation technique, and the other is a centralized allocation technique. While distributed allocation exploits additional logic integrated into the NoC-routers for path search and allocation, the centralized approach makes use of a central unit to manage the path allocation problem. There are several algorithms for the centralized allocation technique. Trellis search-based allocation approach shows the best performance among them.
Many algorithms related to centralized connection allocators have been studied extensively during the past decade. However, relatively little attention was paid to methodology in analyzing and evaluating the centralized connection allocation algorithms. In order to further develop the algorithms, it is necessary to understand and evaluate the centralized connection allocator by establishing a new analysis methodology. Thus, this thesis presents a performance analysis methodology for the trellis search-based allocation approach. Firstly, this thesis proposes a system model for analysis. Secondly, performance metrics are defined. Finally, the analysis results of each performance metric related to the trellis search-based allocation approach are presented. Through this analysis, the performance of the trellis search-based allocation approach can be accurately analyzed. Although a simulation is not performed, the upper limit of performance of the trellis search-based allocation approach can also be predicted through the analysis metrics. Additionally, we introduce the general formulation of the trellis search-based path allocation algorithm. The weight values among available paths through the branch metric and path metric are proposed to enable higher performance path connection. Furthermore, according to network size, topology, TDM, interface load delivery, and router internal storage, the performance of trellis search-based path allocation algorithms is also described.
In the end, the Application Specific Instruction Processor (ASIP) hardware platform customized for the trellis search-based path allocation algorithm is presented. The shortest available and lowest-cost (SALC) path search algorithm is proposed to improve the success rate of path connection in the ASIP hardware platform. We evaluate the algorithm performance and implementation synthesis results. In order to realize the dynamic connection approach, a short execution cycle of ASIP time is essential.
We develop several algorithms to achieve this short execution cycle. The first one is a rectangular region of search algorithm that allows adapting the size and form of path search region according to the particular source-destination positions and considers actual operational constraints. The average execution cycles for searching an optimum path are decreased because the unnecessary region for path-search is excluded. The second one is a path-spreading search algorithm that separates between involved routers and uninvolved routers in path search. The involved routers are selected and spread out from source to destination at each intermediate trellis-search process. The path-search overhead is considerably reduced due to the router involvements. The third one is a three-directional path-spreading search algorithm that eliminates one direction movement among four spreading movements. Because of this reason, the trellis search-based path connection algorithm, which omits the back-tracing process, can be implemented in the ASIP platform. Thus, the whole algorithm execution time can be halved. The last one is a moving regional path search algorithm that significantly reduces computation complexity by selecting a constant dimensional path-search region that affects performance and moving the region from source to destination. The moving regional path search algorithm achieves a considerable decrement of computational complexity.:1 Introduction 1
1.1 NoC-interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Connection allocation in a Network-on-Chip 7
2.1 Circuit Switching NoCs . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Guaranteed Service in NoCs . . . . . . . . . . . . . . . . . . . 7
2.1.2 Spatial-Division-Multiplexing technique . . . . . . . . . . . . 8
2.1.3 Time-Division-Multiplexing technique . . . . . . . . . . . . . 10
2.2 System architectures employing circuit switching NoCs . . . . . . . . 11
2.2.1 Static and dynamic connection allocation . . . . . . . . . . . 12
2.2.2 Distributed connection allocation technique . . . . . . . . . . 14
2.2.3 Centralized connection allocation technique . . . . . . . . . . 16
2.2.4 Algorithms for centralized connection allocation . . . . . . . . 17
2.2.4.1 Software based run-time path allocation approach . 18
2.2.4.2 Trellis search-based allocation approach . . . . . . . 19
3 Performance analysis methodology for a centralized connection allocator
23
3.1 System model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Performance metrics and analysis methodology . . . . . . . . . . . . 25
3.3 System simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Trellis search-based path allocation algorithm 45
4.1 General formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1 Trellis graph structure . . . . . . . . . . . . . . . . . . . . . . 45
4.1.2 Survivor path selection criterion . . . . . . . . . . . . . . . . . 52
ix
4.1.2.1 Branch metric and path metric . . . . . . . . . . . . 52
4.1.2.2 The shortest-available and lowest-cost path selection
criterion . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Algorithm Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Network topology . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 Network size . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.3 Time-Division-Multiplexing . . . . . . . . . . . . . . . . . . . 61
4.2.4 NoC interface load diversity . . . . . . . . . . . . . . . . . . . 63
4.2.5 The internal storage of the router . . . . . . . . . . . . . . . . 66
5 ASIP approach for Trellis search-based connection allocation 73
5.1 System model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.1 Trellis search-based ASIP platform architecture . . . . . . . . 74
5.2 Algorithm for improving success rates of path connection . . . . . . . 81
5.2.1 SALC algorithm for Trellis search-based ASIP platform . . . . 81
5.2.2 Performance evaluation of the SALC algorithm . . . . . . . . 88
5.2.2.1 Simulation results . . . . . . . . . . . . . . . . . . . 88
5.2.2.2 Synthesis results . . . . . . . . . . . . . . . . . . . . 91
5.3 Algorithm for reducing path-search time . . . . . . . . . . . . . . . . 93
5.3.1 Rectangular regional path search algorithm . . . . . . . . . . 93
5.3.2 Path-spreading search algorithm . . . . . . . . . . . . . . . . 99
5.3.3 Three directional path-spreading search algorithm . . . . . . 108
5.3.4 Moving regional path search algorithm . . . . . . . . . . . . . 114
5.3.5 Performance evaluation . . . . . . . . . . . . . . . . . . . . . 123
5.3.5.1 Simulation results . . . . . . . . . . . . . . . . . . . 123
5.3.5.2 Synthesis results . . . . . . . . . . . . . . . . . . . . 126
6 Conclusion and Future work 131
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Bibliography 135
|
8 |
Addressing Fragmentation in ZGC through Custom Allocators : Leveraging a Lean, Mean, Free-List MachineSikström, Joel January 2024 (has links)
The Java programming language manages memory automatically through the use of a garbage collector (GC). The Java Virtual Machine provides several GCs tuned for different usage scenarios. One such GC is ZGC. Both ZGC and other GCs utilize bump-pointer allocation, which allocates objects compactly but leads to the creation of unusable memory gaps over time, known as fragmentation. ZGC handles fragmentation through relocation, a process which is costly. This thesis proposes an alternative memory allocation method leveraging free-lists to reduce the need for relocation to manage fragmentation.We design and develop a new allocator tailored for ZGC, based on the TLSF allocator by Masmano et al. Previous research on the customization of allocators shows varying results and does not fully investigate usage in complex environments like a GC.Opportunities for enhancements in performance and memory efficiency are identified and implemented through the exploration of ZGC's operational boundaries. The most significant adaptation is the introduction of a 0-byte header, which leverages information within ZGC to significantly reduce internal fragmentation of the allocator. We evaluate the performance of our adapted allocator and compare it to a reference implementation of TLSF. Results show that the adapted allocator performs on par with the reference implementation for single allocations but is slightly slower for single frees and when applying allocation patterns from real-world programs. The findings of this work suggest that customizing allocators for garbage collection is worth considering and may be useful for future integration.
|
9 |
Memory management techniques for large-scale persistent-main-memory systemsOukid, Ismail, Booss, Daniel, Lespinasse, Adrien, Lehner, Wolfgang, Willhalm, Thomas, Gomes, Grégoire 10 January 2023 (has links)
Storage Class Memory (SCM) is a novel class of memory technologies that promise to revolutionize database architectures. SCM is byte-addressable and exhibits latencies similar to those of DRAM, while being non-volatile. Hence, SCM could replace both main memory and storage, enabling a novel single-level database architecture without the traditional I/O bottleneck. Fail-safe persistent SCM allocation can be considered conditio sine qua non for enabling this novel architecture paradigm for database management systems. In this paper we present PAllocator, a fail-safe persistent SCM allocator whose design emphasizes high concurrency and capacity scalability. Contrary to previous works, PAllocator thoroughly addresses the important challenge of persistent memory fragmentation by implementing an efficient defragmentation algorithm. We show that PAllocator outperforms state-of-the-art persistent allocators by up to one order of magnitude, both in operation throughput and recovery time, and enables up to 2.39x higher operation throughput on a persistent B-Tree.
|
10 |
A Memory Allocation Framework for Optimizing Power Consumption and Controlling FragmentationPanwar, Ashish January 2015 (has links) (PDF)
Large physical memory modules are necessary to meet performance demands of today's ap-
plications but can be a major bottleneck in terms of power consumption during idle periods or when systems are running with workloads which do not stress all the plugged memory resources. Contribution of physical memory in overall system power consumption becomes even more signi cant when CPU cores run on low power modes during idle periods with hardware support like Dynamic Voltage Frequency Scaling.
Our experiments show that even 10% of memory allocations can make references to all the
banks of physical memory on a long running system primarily due to the randomness in page allocation. We also show that memory hot-remove or memory migration for large blocks is often restricted, in a long running system, due to allocation policies of current Linux VM which mixes movable and unmovable pages. Hence it is crucial to improve page migration for large contiguous blocks for a practical realization of power management support provided by the hardware.
Operating systems can play a decisive role in effectively utilizing the power management support of modern DIMMs like PASR(Partial Array Self Refresh) in these situations but have not been using them so far.
We propose three different approaches for optimizing memory power consumption by in-
ducing bank boundary awareness in the standard buddy allocator of Linux kernel as well as distinguishing user and kernel memory allocations at the same time to improve the movability of memory sections (and hence memory-hotplug) by page migration techniques. Through a set of minimal changes in the standard buddy system of Linux VM, we have been able to reduce the number of active memory banks significantly (upto 80%) as well as to improve performance of memory-hotplug framework (upto 85%).
|
Page generated in 0.0634 seconds