DSM64: A Distributed Shared Memory System in User-Space
Holsapple, Stephen Alan, 01 May 2012
This paper presents DSM64: a lazy release consistent software distributed shared memory (SDSM) system built entirely in user-space. The DSM64 system is capable of executing threaded applications implemented with pthreads on a cluster of networked machines without any modifications to the target application. The DSM64 system features a centralized memory manager [1] built atop Hoard [2, 3]: a fast, scalable, and memory-efficient allocator for shared-memory multiprocessors.
I present an SDSM system written in C++ for Linux operating systems. I discuss a straightforward approach to implementing SDSM systems in a Linux environment using system-provided tools and concepts available entirely in user-space. I show that the SDSM system presented in this paper can resolve page faults over a local area network in as little as 2 milliseconds.
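The abstract does not spell out which system-provided mechanisms are used to intercept and resolve page faults in user-space. A common way to build such a system on Linux is to protect the shared region with mprotect and service the resulting SIGSEGV in a handler; the sketch below illustrates that technique only. The fetch_page_from_home helper and the region bookkeeping are hypothetical, not taken from DSM64, and error handling and signal-safety details are omitted.

```cpp
// Minimal sketch of user-space page-fault interception for an SDSM node on
// Linux. The remote-fetch function is a placeholder; this is not DSM64's
// actual code, only an illustration of the mprotect/SIGSEGV technique.
#include <cstddef>
#include <cstdint>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

static void*  g_dsm_base = nullptr;
static size_t g_dsm_len  = 0;

// Placeholder: in a real system this sends a request to the page's home
// node and copies the returned bytes into the local frame.
static void fetch_page_from_home(void* /*page*/) {}

static void segv_handler(int, siginfo_t* info, void*) {
    const uintptr_t fault = reinterpret_cast<uintptr_t>(info->si_addr);
    const uintptr_t base  = reinterpret_cast<uintptr_t>(g_dsm_base);
    if (fault < base || fault >= base + g_dsm_len)
        _exit(1);                                  // genuine crash outside the DSM region

    const long psz = sysconf(_SC_PAGESIZE);
    void* page = reinterpret_cast<void*>(fault & ~static_cast<uintptr_t>(psz - 1));

    mprotect(page, psz, PROT_READ | PROT_WRITE);   // make the frame accessible
    fetch_page_from_home(page);                    // resolve the fault over the network
    // Left read-write here for simplicity; a lazy-release-consistent system
    // would track modifications (e.g., via twins/diffs) and re-protect pages
    // at synchronization points.
}

// Registers the handler and protects the shared region so that the first
// access to each page traps into the DSM runtime.
void install_dsm_region(void* base, size_t len) {
    g_dsm_base = base;
    g_dsm_len  = len;
    mprotect(base, len, PROT_NONE);

    struct sigaction sa {};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
}
```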
In my analysis, I compare the performance characteristics of a matrix multiplication benchmark under various memory coherency models and demonstrate that the benchmark runs orders of magnitude faster under an LRC model than under a stricter coherency model. I show the effect of the coherency model on memory access patterns and memory contention, and I compare the effects of different locking strategies on execution speed and memory access patterns. Lastly, I compare the DSM64 system to a non-networked version using a system-provided allocator.
High-Concurrency Visualization on Supercomputers
Nouanesengsy, Boonthanome, 30 August 2012
No description available.
Modeling and Runtime Systems for Coordinated Power-Performance Management
Li, Bo, 28 January 2019
Emerging high-performance computing (HPC) systems must maximize efficiency to meet the Department of Energy's goal of a 20-40 megawatt power budget at 1 exaflop. To optimize efficiency, these systems provide multiple power-performance control techniques that throttle individual system components and the degree of concurrency. In this dissertation, we focus on three throttling techniques: CPU dynamic voltage and frequency scaling (DVFS), dynamic memory throttling (DMT), and dynamic concurrency throttling (DCT). We first conduct an empirical analysis of the performance and energy trade-offs of different architectures under these throttling techniques, showing their impact on performance and energy consumption on Intel x86 systems with Intel Xeon Phi and NVIDIA general-purpose graphics processing unit (GPGPU) accelerators, along with the trade-offs and the potential for improving efficiency. Furthermore, we propose a parallel performance model for coordinating DVFS, DMT, and DCT simultaneously, using a multivariate linear regression-based approach to approximate their impact on performance for prediction. Validation using 19 HPC applications/kernels on two architectures (Intel x86 and IBM BG/Q) shows up to 7% and 17% prediction error, respectively. Thereafter, we develop metrics for capturing the performance impact of DVFS, DMT, and DCT, apply an artificial neural network model to approximate the nonlinear effects on performance, and present a corresponding runtime control strategy for power capping. Our validation using 37 HPC applications/kernels shows up to a 20% performance improvement under a given power budget compared with the Intel RAPL-based method. / Ph. D. / System efficiency on high-performance computing (HPC) systems is the key to achieving the power-budget goal for exascale supercomputers. Techniques for adjusting the performance of different system components can help accomplish this goal by dynamically controlling system performance according to application behavior. In this dissertation, we focus on three techniques: adjusting CPU performance, adjusting memory performance, and changing the number of threads used to run parallel applications. First, we profile the performance and energy consumption of different HPC applications on both Intel systems with accelerators and IBM BG/Q systems, exploring the performance-energy trade-offs of these techniques and providing optimization insights. Furthermore, we propose a parallel performance model that accurately captures the impact of these techniques on performance in terms of job completion time, together with an approximation approach for performance prediction. The approximation has up to 7% and 17% prediction error on Intel x86 and IBM BG/Q systems, respectively, across 19 HPC applications. Thereafter, we apply the performance model in a runtime system designed to improve performance under a given power budget. Our runtime strategy achieves up to 20% performance improvement over the baseline method.
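The abstract states that a multivariate linear regression approximates the joint impact of DVFS, DMT, and DCT on performance but does not give the model's functional form. One plausible shape for such a model, with illustrative regressors (CPU frequency f, a memory-throttling level m, thread count c) and per-application coefficients fitted from profiled runs, is the following; the dissertation's actual terms may differ.

```latex
% Illustrative regression form only; the actual regressors, interaction
% terms, and coefficients are not given in the abstract.
\[
  \widehat{T}(f, m, c) \;=\;
      \beta_0
      \;+\; \beta_1\,\frac{1}{f}
      \;+\; \beta_2\, m
      \;+\; \beta_3\,\frac{1}{c}
      \;+\; \beta_4\,\frac{m}{f}
      \;+\; \beta_5\,\frac{1}{f\,c}
\]
```

Given such a fitted model, a runtime could search the discrete (f, m, c) settings for the configuration with the best predicted time that still satisfies the power cap.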
f-DSM: An FPGA-Accelerated Distributed Shared Memory for Heterogeneous Instruction-Set-Architecture Hardware
VSathish, Naarayanan Rao, 03 March 2022
Due to the diminishing relevance of Moore's Law, traditional multi-core systems are increasingly struggling to meet the computational demands of many emerging workloads. Heterogeneous computing, which involves exploiting higher degrees of parallelism (e.g., GPUs) and application-specific specialization (e.g., FPGAs), is increasingly used to meet this demand. An important architectural trend in this space involves using instruction-set-architecture (ISA) heterogeneity. An exemplar case is emerging I/O devices that include CPU cores with ISAs (e.g., ARM, RISC-V) that differ from that of host CPUs (e.g., x86) and have physically discrete memory. Shared-memory programming of such systems requires the Distributed Shared Memory (DSM) abstraction. Software DSM incurs significant OS overhead for maintaining memory coherency. Despite outperforming software predecessors, hardware DSM and cache-coherent interfaces require custom chips and lack the flexibility to experiment with different DSM consistency protocols. This thesis presents fDSM, an FPGA-accelerated DSM framework for ISA-heterogeneous hardware. fDSM implements a high-speed messaging layer that enables inter-node communication across ISA-different CPU cores and a DSM protocol processor that maintains virtual memory coherency using a multiple-reader single-writer DSM algorithm. Experimental studies reveal that fDSM outperforms prior art, including Popcorn Linux's software DSM abstraction using TCP/IP and state-of-the-art InfiniBand RDMA messaging layers, by 2.8X and 7%, respectively. fDSM is also reconfigurable, allowing different memory consistency models to be implemented and evaluated. / Master of Science / Moore's Law predicts that the number of transistors in a chip will double approximately every two years. Chip vendors are increasingly observing that this law is nearing its limit as transistor sizes shrink to 5nm and 3nm due to power consumption and heat dissipation issues. As a result, innovation in new computing architectures has increasingly focused on heterogeneity, i.e., the use of hardware accelerators such as graphics processors and reconfigurable logic alongside a computer's CPU (host). To improve the programmability of these architectures, which usually have physically separate memories, the shared-memory programming model is typically used to provide coherent virtual memory. The shared-memory model applied to such distributed systems, called distributed shared memory (DSM), has previously been developed in software as well as in hardware. The former usually suffers from high latency overheads, while the latter often requires custom chips and lacks the programmability needed to implement new memory consistency protocols. This thesis presents fDSM, a reconfigurable distributed shared memory framework that provides coherent shared memory between a host and a smart I/O device such as a SmartNIC. fDSM is implemented in FPGAs, which are increasingly available in hosts and smart I/O devices at commodity scale. Our prototype implementation uses ISA-heterogeneous hosts to emulate such an environment. Our experimental evaluation using applications from High-Performance Computing benchmark suites reveals that fDSM yields performance benefits over a state-of-the-art software DSM.
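The abstract says the DSM protocol processor maintains coherency with a multiple-reader single-writer algorithm but gives no protocol details. The sketch below shows a conventional home-node view of such an MRSW protocol; the page-table fields and message helpers are hypothetical placeholders, not fDSM's interface.

```cpp
// Conventional multiple-reader / single-writer (MRSW) page protocol, seen
// from the home node. Message helpers are placeholders; this is not fDSM's
// actual protocol-processor logic.
#include <cstdint>
#include <set>

enum class PageState { Invalid, Shared, Exclusive };

struct PageEntry {
    PageState     state = PageState::Invalid;
    int           owner = -1;       // node holding the only writable copy, if any
    std::set<int> sharers;          // nodes holding read-only copies
};

// Placeholder messaging layer.
static void send_page_copy(int /*to_node*/, uint64_t /*page*/) {}
static void send_invalidate(int /*to_node*/, uint64_t /*page*/) {}

// Read fault: admit another reader, downgrading the writer first if needed.
void on_read_fault(PageEntry& e, uint64_t page, int requester) {
    if (e.state == PageState::Exclusive) {
        send_invalidate(e.owner, page);     // owner writes back and keeps a read copy
        e.sharers.insert(e.owner);
        e.owner = -1;
    }
    e.sharers.insert(requester);
    e.state = PageState::Shared;
    send_page_copy(requester, page);
}

// Write fault: invalidate every other copy so there is a single writer.
void on_write_fault(PageEntry& e, uint64_t page, int requester) {
    if (e.state == PageState::Exclusive && e.owner != requester)
        send_invalidate(e.owner, page);
    for (int node : e.sharers)
        if (node != requester) send_invalidate(node, page);
    e.sharers.clear();
    e.owner = requester;
    e.state = PageState::Exclusive;
    send_page_copy(requester, page);
}
```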
Towards Low-Complexity Scalable Shared-Memory Architectures
Zeffer, Håkan, January 2006
Plentiful research has addressed low-complexity software-based shared-memory systems since the idea was first introduced more than two decades ago. However, software-coherent systems have not been very successful in the commercial marketplace. We believe there are two main reasons for this: lack of performance and/or lack of binary compatibility.
This thesis studies multiple aspects of how to design future binary-compatible high-performance scalable shared-memory servers while keeping the hardware complexity at a minimum. It starts with a software-based distributed shared-memory system relying on no specific hardware support and gradually moves towards architectures with simple hardware support.
The evaluation is made in a modern chip-multiprocessor environment with both high-performance compute workloads and commercial applications. It shows that implementing the coherence-violation detection in hardware while solving the interchip coherence in software allows for high-performing binary-compatible systems with very low hardware complexity. Our second-generation hardware-software hybrid performs on par with, and often better than, traditional hardware-only designs.
Based on our results, we conclude that it is not only possible to design simple systems while maintaining performance and the binary-compatibility envelope, but often possible to get better performance than in traditional and more complex designs.
Throughout this work, we also explore two new techniques for evaluating a new shared-memory design: adjustable simulation fidelity and statistical multiprocessor cache modeling.
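The abstract's central idea, detecting coherence violations in hardware while resolving inter-chip coherence in software, is summarized only at a high level. The sketch below is a schematic view of that division of labor with hypothetical names; it is not the thesis's design.

```cpp
// Schematic division of labor in a hardware-software coherence hybrid:
// a cheap hardware permission check flags violating accesses, and a
// software handler runs the inter-chip coherence protocol. Names and
// structure are hypothetical, not the thesis's design.
#include <cstdint>

enum class Perm : uint8_t { None, ReadOnly, ReadWrite };

// Software side: resolves the block across chips (fetch data, gather acks)
// and returns the new local permission. Placeholder body.
static Perm software_coherence_handler(uintptr_t /*block*/, bool is_write) {
    return is_write ? Perm::ReadWrite : Perm::ReadOnly;
}

// Hardware side: a per-block permission lookup done alongside the access;
// modeled here as an inline check that traps to software on violation.
inline void check_access(Perm& perm, uintptr_t block, bool is_write) {
    const bool violation = (perm == Perm::None) ||
                           (is_write && perm == Perm::ReadOnly);
    if (violation)
        perm = software_coherence_handler(block, is_write);
}
```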
Teste de programas concorrentes com memória compartilhada / Test of shared memory concurrent programs
Sarmanho, Felipe Santos, 13 April 2009
This work proposes a test model for concurrent programs that use shared memory. The model is novel in three main aspects: (1) it treats thread synchronization and communication separately, (2) it considers the synchronization arising from thread creation and termination, and (3) it employs a timestamp-based method to determine the communications exercised in a given execution of the program. Existing coverage criteria for concurrent programs were adapted to the context of programs based on the shared-memory paradigm. The ValiPThread tool was implemented in this work to support the application of the model and the defined criteria. With this tool it is possible to create test sessions that can be saved, interrupted, and resumed at any time, and to add and execute test cases while evaluating source-code coverage with respect to the test criteria. The implementation shows that the proposed model can be instantiated in a tool that supports the testing activity in the context of shared-memory programs. The work presents significant solutions to the main challenges that concurrent programming imposes on testing, notably: (1) developing new static analysis techniques for concurrent programs in the shared-memory context; (2) testing aspects crucial to concurrent programming, such as synchronization, communication, and data flow; (3) replaying an execution in a controlled, deterministic manner; (4) mapping existing coverage criteria for message-passing concurrent programs to the shared-memory context; (5) designing data-flow criteria for concurrent programs that consider shared variables; and (6) building a tool to support these activities.
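The abstract's third contribution is a timestamp-based method for determining which communications were exercised in a given execution. The sketch below conveys the flavor of such bookkeeping: each write to a shared variable is stamped, and each read records the writer it observed. The structures and names are illustrative, not ValiPThread's implementation.

```cpp
// Illustrative timestamp bookkeeping for pairing each read of a shared
// variable with the write it observed, so that the (writer, reader)
// communications exercised by one execution can be reported afterwards.
#include <mutex>
#include <vector>

struct SharedVar {
    int        value       = 0;
    int        last_writer = -1;   // thread that produced the current value
    long       write_stamp = 0;    // timestamp of that write
    std::mutex guard;
};

struct CommEvent { int writer; int reader; long stamp; };

static std::vector<CommEvent> g_exercised;   // communications seen in this run
static std::mutex             g_log_guard;

void instrumented_write(SharedVar& v, int tid, int val) {
    std::lock_guard<std::mutex> lk(v.guard);
    v.value       = val;
    v.last_writer = tid;
    ++v.write_stamp;                          // each write gets a fresh timestamp
}

int instrumented_read(SharedVar& v, int tid) {
    std::lock_guard<std::mutex> lk(v.guard);
    if (v.last_writer >= 0 && v.last_writer != tid) {
        std::lock_guard<std::mutex> lg(g_log_guard);
        g_exercised.push_back({v.last_writer, tid, v.write_stamp});
    }
    return v.value;
}
```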
Abortable and Query-abortable Types and Their Efficient Implementation
Horn, Stephanie Lorraine, 24 September 2009
We introduce abortable and query-abortable object types intended for implementation in asynchronous shared-memory systems with low contention. Implementations of such types behave like ordinary objects when accessed sequentially, but may abort operations when accessed concurrently. An aborted operation may or may not take effect, i.e., cause a state transition, and it returns no indication of which possibility occurred. Since this uncertainty can be problematic, a query-abortable type supports a QUERY operation that each process can use to determine its last non-QUERY operation on the object that caused a state transition, and the response associated with this state transition.
Our research is closely related to obstruction-free implementations (introduced by Herlihy, Luchangco and Moir) and responsive obstruction-free implementations (introduced by Attiya, Guerraoui and Kouznetsov). Like abortable and query-abortable types, these implementations may exhibit degraded behaviour in the face of contention. We show that abortable registers--registers strictly weaker than safe registers--can be used to obtain obstruction-free and responsive obstruction-free implementations for any type.
We present universal constructions for abortable and query-abortable types that are novel and efficient in the number of registers used. Specifically, they are based on a simple timestamping mechanism for detecting concurrent executions, and, in systems with n processes, use only n abortable registers or only O(n^2) single-reader, single-writer abortable registers. The timestamping mechanism we introduce is based on the inc&read counter type and appears to be interesting in its own right. As a generalization, we study the k-inc&read counter types, for k>0.
We also identify a potential problem with correctness properties based on step contention: with such properties, the composition of correct object implementations may result in an implementation that is not correct. In other words, implementations defined in terms of step contention are not always composable. To avoid this problem, we introduce a property based on interval contention, namely non-triviality, to define the correct behaviour of abortable and query-abortable object implementations.
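As a concrete illustration of abort-on-contention semantics (not of the thesis's register-based universal constructions, which avoid strong primitives), the sketch below wraps a small state in an atomic and turns a failed compare-and-swap into an abort rather than a retry. An aborted operation here is also guaranteed not to take effect, which is stronger than the abortable specification requires.

```cpp
// Abort-on-contention wrapper for a small, trivially copyable state. A failed
// compare-exchange means another operation intervened, so the call aborts
// instead of retrying. Illustrative only: it relies on compare-and-swap,
// whereas the thesis builds its constructions from abortable registers and
// a timestamping mechanism.
#include <atomic>
#include <optional>
#include <type_traits>

template <typename State>
class AbortableCell {
    static_assert(std::is_trivially_copyable<State>::value,
                  "state must be small enough to live in a std::atomic");
public:
    explicit AbortableCell(State init) : cell_(init) {}

    // Applies op to the current state. Returns the new state on success, or
    // std::nullopt (abort) if a concurrent operation changed the state first.
    template <typename Op>
    std::optional<State> apply(Op op) {
        State expected = cell_.load();
        State desired  = op(expected);
        if (cell_.compare_exchange_strong(expected, desired))
            return desired;        // uncontended: the operation took effect
        return std::nullopt;       // contention detected: abort, no effect
    }

private:
    std::atomic<State> cell_;
};

// Example: an abortable counter whose increment may abort under contention.
// AbortableCell<int> counter(0);
// auto r = counter.apply([](int v) { return v + 1; });  // nullopt == aborted
```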
An Interconnection Network for a Cache Coherent System on FPGAs
Mirian, Vincent, 12 January 2011
Field-Programmable Gate Array (FPGA) systems now comprise many processing elements: processors running software as well as hardware engines used to accelerate specific functions. To make programming such a system simpler, it is easiest to think in terms of a shared-memory environment, much like in current multi-core processor systems. This thesis introduces a novel, shared-memory, cache-coherent infrastructure for heterogeneous systems implemented on FPGAs that can then form the basis of a shared-memory programming model for heterogeneous systems. Simulation results show that the cache-coherent infrastructure outperforms the infrastructure of Woods [1] with a speedup of 1.10. The thesis explores various configurations of the cache interconnection network and the benefit of cache-to-cache cache-line data transfers, including their impact on main-memory accesses. Finally, the thesis shows that the cache-coherent infrastructure has very little overhead when using its cache coherence implementation.
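The abstract emphasizes cache-to-cache cache-line transfers and their effect on main-memory accesses. The sketch below shows the usual decision such a coherence scheme makes on a read miss, expressed as a small software model with hypothetical structures; it is not the thesis's hardware design.

```cpp
// Software model of the cache-to-cache supply decision in an MSI-style
// scheme: on a read miss, a cache that already holds the line forwards it,
// and main memory is consulted only when no cache has a copy.
#include <array>
#include <optional>

enum class LineState { Invalid, Shared, Modified };

constexpr int kNumCaches = 4;

struct LineDirectory {
    std::array<LineState, kNumCaches> state{};   // zero-initialized -> Invalid
};

// Picks a cache that can supply the line directly, if any.
std::optional<int> pick_supplier(const LineDirectory& d, int requester) {
    for (int c = 0; c < kNumCaches; ++c) {
        if (c == requester) continue;
        if (d.state[c] != LineState::Invalid)
            return c;                            // cache-to-cache transfer possible
    }
    return std::nullopt;                         // fall back to main memory
}

// Read-miss handling: forward from a peer cache when possible, downgrading
// a modified copy to Shared as part of the transfer.
void on_read_miss(LineDirectory& d, int requester) {
    if (auto supplier = pick_supplier(d, requester)) {
        if (d.state[*supplier] == LineState::Modified)
            d.state[*supplier] = LineState::Shared;   // writer downgrades on forward
    }
    d.state[requester] = LineState::Shared;
}
```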