11 |
Large object space support for software distributed shared memoryCheung, Wang-leung, Benny. January 2005 (has links)
Thesis (Ph. D.)--University of Hong Kong, 2005. / Title proper from title frame. Also available in printed format.
|
12 |
RamboNodes for the Metropolitan Ad Hoc NetworkBeal, Jacob, Gilbert, Seth 17 December 2003 (has links)
We present an algorithm to store data robustly in a large, geographically distributed network by means of localized regions of data storage that move in response to changing conditions. For example, data might migrate away from failures or toward regions of high demand. The PersistentNode algorithm provides this service robustly, but with limited safety guarantees. We use the RAMBO framework to transform PersistentNode into RamboNode, an algorithm that guarantees atomic consistency in exchange for increased cost and decreased liveness. In addition, a half-life analysis of RamboNode shows that it is robust against continuous low-rate failures. Finally, we provide experimental simulations for the algorithm on 2000 nodes, demonstrating how it services requests and examining how it responds to failures.
|
13 |
Implementation and performance evaluation of doubly-linked list protocols on a cluster of workstationsLeung, K. H. W, 梁海宏. January 1999 (has links)
published_or_final_version / Electrical and Electronic Engineering / Master / Master of Philosophy
|
14 |
On the use and performance of communication primitives in software controlled cache-coherent cluster architectures /Qin, Xiaohan, January 1997 (has links)
Thesis (Ph. D.)--University of Washington, 1997. / Vita. Includes bibliographical references (leaves [117]-125).
|
15 |
Performance of parallel algorithms on a broadcast-based architecture /Narravula, Harsha V. Katsinis, Constantine. January 2003 (has links)
Thesis (Ph. D.)--Drexel University, 2003. / Includes abstract and vita. Includes bibliographical references (leaves 85-89).
|
16 |
Contention resolution and memory load balancing algorithms on distributed shared memory multiprocessors /Akay, Mehmet Fatih. Katsinis, Constantine. January 2005 (has links)
Thesis (Ph. D.)--Drexel University, 2005. / Includes abstract and vita. Includes bibliographical references (leaves 100-103).
|
17 |
RamboNodes for the Metropolitan Ad Hoc NetworkBeal, Jacob, Gilbert, Seth 17 December 2003 (has links)
We present an algorithm to store data robustly in a large, geographically distributed network by means of localized regions of data storage that move in response to changing conditions. For example, data might migrate away from failures or toward regions of high demand. The PersistentNode algorithm provides this service robustly, but with limited safety guarantees. We use the RAMBO framework to transform PersistentNode into RamboNode, an algorithm that guarantees atomic consistency in exchange for increased cost and decreased liveness. In addition, a half-life analysis of RamboNode shows that it is robust against continuous low-rate failures. Finally, we provide experimental simulations for the algorithm on 2000 nodes, demonstrating how it services requests and examining how it responds to failures.
|
18 |
Samhita: Virtual Shared Memory for Non-Cache-Coherent SystemsRamesh, Bharath 05 August 2013 (has links)
Among the key challenges of computing today are the emergence of many-core architectures and the resulting need to effectively exploit explicit parallelism. Indeed, programmers are striving to exploit parallelism across virtually all platforms and application domains. The shared memory programming model effectively addresses the parallelism needs of mainstream computing (e.g., portable devices, laptops, desktop, servers), giving rise to a growing ecosystem of shared memory parallel techniques, tools, and design practices. However, to meet the extreme demands for processing and memory of critical problem domains, including scientific computation and data intensive computing, computing researchers continue to innovate in the high-end distributed memory architecture space to create cost-effective and scalable solutions. The emerging distributed memory architectures are both highly parallel and increasingly heterogeneous. As a result, they do not present the programmer with a cache-coherent view of shared memory, either across the entire system or even at the level of an individual node. Furthermore, it remains an open research question which programming model is best for the heterogeneous platforms that feature multiple traditional processors along with accelerators or co-processors. Hence, we have two contradicting trends. On the one hand, programming convenience and the presence of shared memory call for a shared memory programming model across the entire heterogeneous system. On the other hand, increasingly parallel and heterogeneous nodes lacking cache-coherent shared memory call for a message passing model. In this dissertation, we present the architecture of Samhita, a distributed shared memory (DSM) system that addresses the challenge of providing shared memory for non-cache-coherent systems. We define regional consistency (RegC), the memory consistency model implemented by Samhita. We present performance results for Samhita on several computational kernels and benchmarks, on both cluster supercomputers and heterogeneous systems. The results demonstrate the promising potential of Samhita and the RegC model, and include the largest scale evaluation by a significant margin for any DSM system reported to date. / Ph. D.
|
19 |
On Optimizing and Leveraging Distributed Shared Memory for High Performance, Resource Aggregation, and Cache-coherent Heterogeneous-ISA ProcessorsChuang, Ho-Ren 28 June 2022 (has links)
This dissertation focuses on the problem space of heterogeneous-ISA multiprocessors – an architectural design point that is being studied by the academic research community and increasingly available in commodity systems. Since such architectures usually lack globally coherent shared memory, software-based distributed shared memory (DSM) is often used to provide the illusion of such a memory. The DSM abstraction typically provides this illusion using a reader-replicate, writer-invalidate memory consistency protocol that operates at the granularity of memory pages and is usually implemented as a first-class operating system abstraction. This enables symmetric multiprocessing (SMP) programming frameworks, augmented with a heterogeneous-ISA compiler, to use CPU cores of different ISAs for parallel computations as if they are of the same ISA, improving programmability, especially for legacy SMP applications which therefore can run unmodified on such hardware.
Past DSMs have been plagued by poor performance, in part due to the high latency and low bandwidth of interconnect network infrastructures. The dissertation revisits DSM in light of modern interconnects that reverse this performance trend. The dissertation presents Xfetch, a bulk page prefetching mechanism designed for the DEX DSM system. Xfetch exploits spatial locality, and aggressively and sequentially prefetches pages before potential read faults, improving DSM performance. Our experimental evaluations reveal that Xfetch achieves up to ≈142% speedup over the baseline DEX DSM that does not prefetch page data.
SMP programming models often allow primitives that permit weaker memory consistency semantics, where synchronization updates can be delayed, permitting greater parallelism and thereby higher performance. Inspired by such primitives, the dissertation presents a DSM protocol called MWPF that trades-off memory consistency for higher performance in select SMP code regions, targeting heterogeneous-ISA multiprocessor systems. MWPF also overcomes performance bottlenecks of past DSM systems for heterogeneous-ISA multiprocessors such as due to significant number of invalidation messages, false page sharing, large number of read page faults, and large synchronization overheads by using efficient protocol primitives that delay and batch invalidation messages, aggressively prefetch data pages, and perform cross-domain synchronization with low overhead. Our experimental evaluations reveal that MWPF achieves, on average, 11% speedup over the baseline DSM implementation.
The dissertation presents PuzzleHype, a distributed hypervisor that enables a single virtual machine (VM) to use fragmented resources in distributed virtualized settings such as CPU cores, memory, and devices of different physical hosts, and thereby decrease resource fragmentation and increase resource utilization. PuzzleHype leverages DSM implemented in host operating systems to present an unified and consistent view of a continuous pseudo-physical address space to guest operating systems. To transparently utilize CPU and I/O resources, PuzzleHype integrates multiple physical CPUs into a single VM by migrating threads, forwarding interrupts, and by delegating I/O. Our experimental evaluations reveal that PuzzleHype yields speedups in the range of 355%–173% over baseline over-provisioning scenarios which are otherwise necessary due to resource fragmentation.
To enable a distributed hypervisor to adapt to resource and workload changes, the dissertation proposes the concept of CPU borrowing that allows a VM's virtual CPU (vCPU) to migrate to an available physical CPU (pCPU) and release it when it is no longer necessary, i.e., CPU returning. CPU borrowing can thus be used when a node is over-committed, and CPU returning can be used when the borrowed CPU resource is no longer necessary. To transparently migrate a vCPU at runtime without incurring a significant downtime, the dissertation presents a suite of techniques including leveraging thread migration, loading/restoring vCPU in KVM states, maintaining a global vCPU location table, and creating a DSM kernel thread for handling on-demand paging. Our experimental evaluations reveal that migrating vCPUs to resource-available nodes achieves a speedup of 1.4x over running the vCPUs on distributed nodes.
When a VM spans multiple nodes, it is likelihood for failure increases. To mitigate this, the dissertation presents a distributed checkpoint/restart mechanism that allows a distributed VM to tolerate failures. A user interface is introduced for sending/receiving checkpoint/restart commands to a distributed VM. We implement the checkpoint/restart technique in the native KVM tool, and extend it to a distributed mode by converting Inter-Process Communication (IPC) into message passing between nodes, pausing/resuming distributed vCPU executions, and loading/restoring runtime states on the correct set of nodes. Our experimental evaluations indicate that the overhead of checkpointing a distributed VM is ≈10% or less than that of the native KVM tool with our checkpoint support. Restarting a distributed VM is faster than native KVM with our restart support because no additional page faults occur during restarting.
The dissertation's final contribution is PopHype, a system software stack that allows simulation of cache-coherent, shared memory heterogeneous-ISA hardware. PopHype includes a Linux operating system that implements DSM as an OS abstraction for processes, i.e., allows multiple processes running on multiple (ISA-different) machines to share memory. With KVM-enabled, this OS becomes a hypervisor that allows multiple, process-based instances of an architecture emulator such as QEMU to execute in a shared address space, allowing multiple QEMU instances to emulate different ISAs in shared memory, i.e., emulate shared memory heterogeneous-ISA hardware. PopHype also includes a modified QEMU to use process-level DSM and an optimized guest OS kernel for improved performance. Our experimental studies confirm PopHype's effectiveness, and reveal that PopHype achieves an average speedup of 7.32x over a baseline that runs multiple QEMU instances in shared memory atop a single host OS. / Doctor of Philosophy / Computing devices are ubiquitous around us. Each of these devices is powered by specialized chips called processors. These processors take in instructions, process them, and produce output. Such processing is what enables us, humans, to send messages to our loved ones, take photographs, as well as carry out various business functions such as using spreadsheet software. The kinds of instructions these processors execute are classified into so-called Instruction Set Architectures or ISAs. Chip designers build processors adopting different ISAs for various applications ranging from computing on mobile phones to cloud computing data centers used by large technology companies.
Within a data center, there are typically hundreds of thousands of computing devices that serve an organization's purpose to serve millions or even billions of users. Programming these computers individually to serve a collective goal is an arduous task requiring hundreds of software engineering experts. To simplify programming these computers on a large scale, this thesis envisions an abstraction where tens of devices appear as one computing unit to the programmer, allowing them to program multiple computers as if they are one. This allows for better resource utilization in the sense that the power of multiple computing devices can be pooled together without the need to acquire newer, larger, and more-expensive computers.
Furthermore, such pooling allows the software to leverage multiple different ISAs on different computers instead of a single ISA on one computer. This thesis also envisions a way for software to run on multiple computers with potentially different ISAs without exposing the difficulty of managing them to the software engineers.
|
20 |
Exploring Selective coherence as a Solution to Self-invalidation in ArgoDSMEdberg, Christopher January 2022 (has links)
Maintaining coherency in a distributed system can prove challenging, this is especially true for distributed shared memory systems. The problem with remote synchronization in the distributed shared memory software ArgoDSM occurs when a lock operation has to cross the boundaries of a node, this causes a large number of self-invalidations (SI) or self-downgrades (SD) which is costly. The performance of the coherency protocol can be improved if the SI/SD situations can be avoided by using a suitable alternative. This work explores if the use of selective coherence operations and non-synchronizing locking can help alleviate the issue of SI and SD in ArgoDSM in order to improve performance compared to the cache-wide coherence operations that are triggered by the default locking mechanism in ArgoDSM. An implementation of the concept is done by replacing the standard coherence protocol used in locking operations with selective operations which is then used to analyze the performance compared to the baseline software. The selective coherence operations are more powerful than the default protocol when applied to synchronization-heavy benchmarks, while the baseline software performs better when there is a lower amount of parallel work being done.
|
Page generated in 0.1075 seconds