31

An Asynchronous Event Communication Technique for Soft Real-Time GPGPU Applications

Vestman, Alexander January 2015 (has links)
Context: Interactive GPGPU applications require low response time feedback from events such as user input in order to provide a positive user experience. Communication of these events must be performed asynchronously so as not to cause significant performance penalties. Objectives: In this study, the use of CPU/GPU shared virtual memory to perform asynchronous communication is explored. Previous studies have shown that shared virtual memory can increase computational performance compared to other types of memory. Methods: A communication technique that aimed to utilize the performance-increasing properties of shared virtual memory was developed and implemented. The implemented technique was then compared to an implementation using explicitly transferred memory in an experiment measuring the performance of the various stages involved in the technique. Results: The results from the experiment revealed that utilizing shared virtual memory for performing asynchronous communication was in general slightly slower than, or comparable to, using explicitly transferred memory. In some cases, where the memory access pattern was right, utilizing shared virtual memory led to a 50% reduction in execution time compared to explicitly transferred memory. Conclusions: A conclusion that shared virtual memory can be utilized for performing asynchronous communication was reached. It was also concluded that by utilizing shared virtual memory a performance increase can be achieved over explicitly transferred memory. In addition, it was concluded that careful consideration of data size and access pattern is required to utilize the performance-increasing properties of shared virtual memory.
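To make the asynchronous-communication idea concrete, here is a minimal C++ sketch of a non-blocking, single-producer event queue of the kind a CPU thread could fill while a GPU-side consumer drains it. The event layout, the queue structure, and the use of a plain atomic ring buffer are illustrative assumptions, not the technique implemented in the thesis.

```cpp
// Illustrative only: a single-producer ring buffer of input events. In a
// shared-virtual-memory setting the queue object would live in SVM-allocated
// memory so both devices observe the same slots without explicit transfers.
#include <atomic>
#include <cstddef>
#include <cstdint>

struct InputEvent { std::uint32_t type; std::int32_t x; std::int32_t y; };

constexpr std::size_t kCapacity = 256;   // power of two for cheap wrap-around

struct EventQueue {
    InputEvent slots[kCapacity];
    std::atomic<std::uint32_t> head{0};  // written by the producer (CPU)
    std::atomic<std::uint32_t> tail{0};  // advanced by the consumer (GPU side)
};

// Producer side: drop the event if the queue is full instead of blocking,
// so the interactive thread never stalls waiting on the GPU.
bool try_publish(EventQueue& q, const InputEvent& ev) {
    std::uint32_t h = q.head.load(std::memory_order_relaxed);
    std::uint32_t t = q.tail.load(std::memory_order_acquire);
    if (h - t == kCapacity) return false;            // full
    q.slots[h % kCapacity] = ev;
    q.head.store(h + 1, std::memory_order_release);  // make the event visible
    return true;
}
```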
32

An API for Adaptive Loop Scheduling in Shared Address Space Architectures

Govindaswamy, Kirthilakshmi 13 December 2003 (has links)
The parallelization of complex, irregular scientific applications with various computational requirements often results in severe load imbalance. Load balancing increases the efficient utilization of available resources in parallel and distributed applications, thereby reducing overall processor completion times. Loops are a rich source of parallelism in data parallel applications. In recent years, several loop scheduling schemes that balance processor workloads have been proposed and successfully implemented in data parallel applications. If the workload on processors is balanced, then the overall efficiency of a computation increases, which in turn reduces the computation run-time. Therefore, loop scheduling routines are incorporated into applications to ensure that the workload is balanced across all available processors. Significant research effort has been directed towards embedding the most competitive loop scheduling algorithms into specific scientific applications. However, each time a new application is developed, the developer has to rewrite the algorithm to incorporate it into that application. Certain compilers take advantage of loops present in the application and parallelize them automatically. However, automatic parallelization does not address all sources of algorithmic and systemic variance in heterogeneous environments. These limitations raise a compelling need for an application programming interface (API) for adaptive loop scheduling algorithms that can be incorporated into any scientific application. This thesis presents an API for various adaptive loop scheduling strategies for data parallel applications in a shared address space architecture, which allows for parallelization as well as adaptive load balancing of a scientific application. This API has been incorporated into a few scientific applications in order to evaluate the performance of each application using the adaptive loop scheduling routines on shared address space parallel machines against the automatic loop scheduling offered by present parallelizing compiler technology.
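As a rough sketch of what such a loop-scheduling API can look like, the C++ fragment below has worker threads repeatedly request chunks of iterations from a shared scheduler whose chunk size shrinks as work runs out, similar in spirit to guided self-scheduling. The class name, interface, and chunk-size rule are assumptions for illustration, not the API developed in the thesis.

```cpp
// Minimal adaptive chunk scheduler: each call hands out roughly
// remaining / (2 * threads) iterations, so chunks shrink near the end
// and late-arriving or slower threads still find work.
#include <algorithm>
#include <atomic>
#include <cstddef>

class ChunkScheduler {
public:
    ChunkScheduler(std::size_t n_iters, std::size_t n_threads)
        : next_(0), total_(n_iters), threads_(n_threads) {}

    // Returns false once all iterations have been handed out.
    bool next_chunk(std::size_t& begin, std::size_t& end) {
        std::size_t start, size;
        do {
            start = next_.load();
            if (start >= total_) return false;
            size = std::max<std::size_t>(1, (total_ - start) / (2 * threads_));
        } while (!next_.compare_exchange_weak(start, start + size));
        begin = start;
        end   = std::min(start + size, total_);
        return true;
    }

private:
    std::atomic<std::size_t> next_;
    const std::size_t total_;
    const std::size_t threads_;
};
```

A worker thread would then run `while (sched.next_chunk(b, e)) for (std::size_t i = b; i < e; ++i) body(i);`, so imbalance is absorbed by whichever threads finish their chunks first.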
33

Parzsweep: A Novel Parallel Algorithm for Volume Rendering of Regular Datasets

Ramswamy, Lakshmy 10 May 2003 (has links)
The sweep paradigm for volume rendering has previously been applied successfully to irregular grids. This thesis describes a parallel volume rendering algorithm called PARZSweep for regular grids that utilizes the sweep paradigm. The sweep paradigm is a concept where a plane sweeps the data volume parallel to the viewing direction. As the sweeping proceeds in increasing order of z, the faces incident on the vertices are projected onto the viewing volume to contribute to the image. The sweeping ensures that all faces are projected in the correct order, and the image thus obtained is very accurate in its details. PARZSweep is an extension of a serial algorithm for regular grids called RZSweep. The hypothesis of this research is that a parallel version of RZSweep can be designed and implemented which will utilize multiple processors to reduce rendering times. PARZSweep follows an approach called image-based task scheduling, or tiling. This approach divides the image space into tiles and allocates each tile to a processor for individual rendering. The subimages are then composited to form the complete final image. PARZSweep uses a shared memory architecture in order to take advantage of inherent cache coherency for faster communication between processors. Experiments were conducted comparing RZSweep and PARZSweep with respect to prerendering times, rendering times, and image quality. RZSweep and PARZSweep have approximately the same prerendering costs, produce exactly the same images, and PARZSweep substantially reduces rendering times. PARZSweep was evaluated for scalability with respect to the number of tiles and the number of processors. Scalability results were disappointing due to uneven data distribution.
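The image-based task scheduling ("tiling") approach described above can be sketched in a few lines of C++: the image is split into tiles and each tile is rendered by whichever worker grabs it next. This is a dynamic variant given as an assumption; the thesis may assign tiles to processors statically, and the renderer itself is only a placeholder.

```cpp
// Illustrative tiled rendering driver: tiles are handed out via an atomic
// counter and each worker renders its tile into the shared image buffer.
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

void render_tiled(int width, int height, int tile, unsigned n_threads,
                  const std::function<void(int x0, int y0, int x1, int y1)>& render_tile) {
    const int tiles_x = (width  + tile - 1) / tile;
    const int tiles_y = (height + tile - 1) / tile;
    std::atomic<int> next{0};

    auto worker = [&] {
        for (int t = next++; t < tiles_x * tiles_y; t = next++) {
            const int x0 = (t % tiles_x) * tile;
            const int y0 = (t / tiles_x) * tile;
            render_tile(x0, y0, std::min(x0 + tile, width), std::min(y0 + tile, height));
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n_threads; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```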
34

On Optimizing and Leveraging Distributed Shared Memory for High Performance, Resource Aggregation, and Cache-coherent Heterogeneous-ISA Processors

Chuang, Ho-Ren 28 June 2022 (has links)
This dissertation focuses on the problem space of heterogeneous-ISA multiprocessors – an architectural design point that is being studied by the academic research community and is increasingly available in commodity systems. Since such architectures usually lack globally coherent shared memory, software-based distributed shared memory (DSM) is often used to provide the illusion of such a memory. The DSM abstraction typically provides this illusion using a reader-replicate, writer-invalidate memory consistency protocol that operates at the granularity of memory pages and is usually implemented as a first-class operating system abstraction. This enables symmetric multiprocessing (SMP) programming frameworks, augmented with a heterogeneous-ISA compiler, to use CPU cores of different ISAs for parallel computations as if they were of the same ISA, improving programmability, especially for legacy SMP applications, which can therefore run unmodified on such hardware. Past DSMs have been plagued by poor performance, in part due to the high latency and low bandwidth of interconnect network infrastructures. The dissertation revisits DSM in light of modern interconnects that reverse this performance trend. The dissertation presents Xfetch, a bulk page prefetching mechanism designed for the DEX DSM system. Xfetch exploits spatial locality, and aggressively and sequentially prefetches pages before potential read faults, improving DSM performance. Our experimental evaluations reveal that Xfetch achieves up to ≈142% speedup over the baseline DEX DSM that does not prefetch page data. SMP programming models often provide primitives that permit weaker memory consistency semantics, where synchronization updates can be delayed, permitting greater parallelism and thereby higher performance. Inspired by such primitives, the dissertation presents a DSM protocol called MWPF that trades off memory consistency for higher performance in select SMP code regions, targeting heterogeneous-ISA multiprocessor systems. MWPF also overcomes performance bottlenecks of past DSM systems for heterogeneous-ISA multiprocessors, such as those caused by a significant number of invalidation messages, false page sharing, a large number of read page faults, and large synchronization overheads, by using efficient protocol primitives that delay and batch invalidation messages, aggressively prefetch data pages, and perform cross-domain synchronization with low overhead. Our experimental evaluations reveal that MWPF achieves, on average, 11% speedup over the baseline DSM implementation. The dissertation presents PuzzleHype, a distributed hypervisor that enables a single virtual machine (VM) to use fragmented resources in distributed virtualized settings, such as the CPU cores, memory, and devices of different physical hosts, thereby decreasing resource fragmentation and increasing resource utilization. PuzzleHype leverages DSM implemented in host operating systems to present a unified and consistent view of a continuous pseudo-physical address space to guest operating systems. To transparently utilize CPU and I/O resources, PuzzleHype integrates multiple physical CPUs into a single VM by migrating threads, forwarding interrupts, and delegating I/O. Our experimental evaluations reveal that PuzzleHype yields speedups in the range of 173%–355% over baseline over-provisioning scenarios which are otherwise necessary due to resource fragmentation.
To enable a distributed hypervisor to adapt to resource and workload changes, the dissertation proposes the concept of CPU borrowing, which allows a VM's virtual CPU (vCPU) to migrate to an available physical CPU (pCPU) and release it when it is no longer necessary, i.e., CPU returning. CPU borrowing can thus be used when a node is over-committed, and CPU returning can be used when the borrowed CPU resource is no longer necessary. To transparently migrate a vCPU at runtime without incurring significant downtime, the dissertation presents a suite of techniques including leveraging thread migration, loading/restoring vCPU states in KVM, maintaining a global vCPU location table, and creating a DSM kernel thread for handling on-demand paging. Our experimental evaluations reveal that migrating vCPUs to resource-available nodes achieves a speedup of 1.4x over running the vCPUs on distributed nodes. When a VM spans multiple nodes, its likelihood of failure increases. To mitigate this, the dissertation presents a distributed checkpoint/restart mechanism that allows a distributed VM to tolerate failures. A user interface is introduced for sending/receiving checkpoint/restart commands to a distributed VM. We implement the checkpoint/restart technique in the native KVM tool and extend it to a distributed mode by converting Inter-Process Communication (IPC) into message passing between nodes, pausing/resuming distributed vCPU executions, and loading/restoring runtime states on the correct set of nodes. Our experimental evaluations indicate that the overhead of checkpointing a distributed VM is ≈10% or less compared to that of the native KVM tool with our checkpoint support. Restarting a distributed VM is faster than native KVM with our restart support because no additional page faults occur during restarting. The dissertation's final contribution is PopHype, a system software stack that allows simulation of cache-coherent, shared memory heterogeneous-ISA hardware. PopHype includes a Linux operating system that implements DSM as an OS abstraction for processes, i.e., it allows multiple processes running on multiple (ISA-different) machines to share memory. With KVM enabled, this OS becomes a hypervisor that allows multiple, process-based instances of an architecture emulator such as QEMU to execute in a shared address space, allowing multiple QEMU instances to emulate different ISAs in shared memory, i.e., to emulate shared memory heterogeneous-ISA hardware. PopHype also includes a modified QEMU that uses process-level DSM and an optimized guest OS kernel for improved performance. Our experimental studies confirm PopHype's effectiveness, and reveal that PopHype achieves an average speedup of 7.32x over a baseline that runs multiple QEMU instances in shared memory atop a single host OS. / Doctor of Philosophy / Computing devices are ubiquitous around us. Each of these devices is powered by specialized chips called processors. These processors take in instructions, process them, and produce output. Such processing is what enables us, humans, to send messages to our loved ones, take photographs, and carry out various business functions such as using spreadsheet software. The kinds of instructions these processors execute are classified into so-called Instruction Set Architectures, or ISAs. Chip designers build processors adopting different ISAs for various applications ranging from computing on mobile phones to the cloud computing data centers used by large technology companies.
Within a data center, there are typically hundreds of thousands of computing devices that serve an organization's goal of supporting millions or even billions of users. Programming these computers individually to serve a collective goal is an arduous task requiring hundreds of software engineering experts. To simplify programming these computers at scale, this thesis envisions an abstraction in which tens of devices appear as one computing unit to the programmer, allowing them to program multiple computers as if they were one. This allows for better resource utilization in the sense that the power of multiple computing devices can be pooled together without the need to acquire newer, larger, and more expensive computers. Furthermore, such pooling allows software to leverage multiple different ISAs on different computers instead of a single ISA on one computer. This thesis also envisions a way for software to run on multiple computers with potentially different ISAs without exposing the difficulty of managing them to software engineers.
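The bulk sequential prefetching summarized in the abstract above (Xfetch) can be illustrated with a short C++ sketch of a page-based DSM read-fault handler: on a fault for page p, the handler also requests the next few pages so that later sequential reads hit locally. The helper functions and the prefetch depth are hypothetical placeholders, not functions or parameters from the dissertation's implementation.

```cpp
// Illustrative DSM read-fault handler with sequential prefetch.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize      = 4096;
constexpr std::size_t kPrefetchDepth = 8;   // pages fetched ahead of the fault

// Hypothetical coherence helpers (stubs so the sketch compiles).
void fetch_remote_page(std::uintptr_t) { /* pull page contents from its home node */ }
void map_readonly(std::uintptr_t)      { /* install a local read-only mapping      */ }

void handle_read_fault(std::uintptr_t fault_addr, std::uintptr_t region_end) {
    std::uintptr_t page = fault_addr & ~static_cast<std::uintptr_t>(kPageSize - 1);
    for (std::size_t i = 0; i <= kPrefetchDepth; ++i) {
        std::uintptr_t target = page + i * kPageSize;
        if (target >= region_end) break;    // stay inside the shared region
        fetch_remote_page(target);          // i == 0 is the faulting page itself
        map_readonly(target);
    }
}
```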
35

Contech: a shared memory parallel program analysis framework

Vassenkov, Phillip 13 January 2014 (has links)
We are in the era of multicore machines, where we must exploit thread-level parallelism for programs to run better, smarter, faster, and more efficiently. In order to increase instruction-level parallelism, processors and compilers perform heavy dataflow analyses between instructions. However, relatively little work has been done in the area of inter-thread dataflow analysis. In order to pave the way and find new ways to conserve resources across a variety of domains (i.e., execution speed, chip die area, power efficiency, and computational throughput), we propose a novel framework, termed Contech, to facilitate the analysis of multithreaded programs in terms of their communication and execution patterns. We focus the scope on shared memory programs rather than message passing programs, since their communication and execution patterns are more difficult to analyze. Discovering patterns of shared memory programs has the potential to allow general purpose computing machines to turn architectural tricks on or off according to application-specific features. Our design of Contech is modular in nature, so we can glean a large variety of information from an architecturally independent representation of the program under examination.
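As a rough illustration of the kind of data such a framework collects, the C++ sketch below has each thread append memory-access records to its own buffer, which an offline pass could later correlate across threads to uncover communication patterns. The record layout and instrumentation hook are assumptions, not Contech's actual trace format.

```cpp
// Per-thread trace of memory accesses, filled by compiler-inserted
// instrumentation and analyzed offline.
#include <cstdint>
#include <vector>

struct AccessRecord {
    std::uint64_t address;   // location touched
    std::uint64_t counter;   // per-thread ordering of the access
    bool          is_write;  // writes are the producers of inter-thread communication
};

thread_local std::vector<AccessRecord> g_trace;
thread_local std::uint64_t g_counter = 0;

// Hypothetical hook called before each instrumented load/store.
inline void record_access(const void* addr, bool is_write) {
    g_trace.push_back({reinterpret_cast<std::uint64_t>(addr), g_counter++, is_write});
}
```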
36

Cache-Efficient Aggregation: Hashing Is Sorting

Müller, Ingo, Sanders, Peter, Lacurie, Arnaud, Lehner, Wolfgang, Färber, Franz 14 June 2022 (has links)
For decades, researchers have studied the duality of hashing and sorting for the implementation of the relational operators, especially for efficient aggregation. Depending on the underlying hardware and software architecture, the specific algorithms implemented, and the data sets used in the experiments, different authors came to different conclusions about which is the better approach. In this paper we argue that in terms of cache efficiency, the two paradigms are actually the same. We support our claim by showing that the complexity of hashing is the same as the complexity of sorting in the external memory model. Furthermore, we make the similarity of the two approaches obvious by designing an algorithmic framework that allows switching seamlessly between hashing and sorting during execution. The fact that we mix hashing and sorting routines in the same algorithmic framework allows us to leverage the advantages of both approaches and makes their similarity obvious. On a more practical note, we also show how to achieve very low constant factors by tuning both the hashing and the sorting routines to modern hardware. Since we observe a complementary dependency of the constant factors of the two routines on the locality of the input, we exploit our framework to switch to the faster routine where appropriate. The result is a novel relational aggregation algorithm that is cache-efficient (independently of and without prior knowledge of input skew and output cardinality), highly parallelizable on modern multi-core systems, and operating at a speed close to the memory bandwidth, thus outperforming the state of the art by up to 3.7x.
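The hash-based side of this duality can be sketched in C++ as a two-pass group-by-sum: first partition the input on a few low key bits so each partition fits in cache, then aggregate each partition with a small hash table. The partitioning pass is exactly what a most-significant-digit radix sort would do, which is the sense in which hashing and sorting coincide; the constants below are illustrative, not the paper's tuned parameters.

```cpp
// Cache-conscious aggregation sketch: radix/hash partition, then per-partition
// hash aggregation of (key, value) rows into (key, sum) rows.
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<std::uint64_t, std::int64_t>;   // (group key, value)

std::vector<Row> aggregate_sum(const std::vector<Row>& input) {
    constexpr unsigned kBits = 6;                      // 64 partitions
    std::vector<std::vector<Row>> parts(1u << kBits);
    for (const Row& r : input)                         // pass 1: partition on low key bits
        parts[r.first & ((1u << kBits) - 1)].push_back(r);

    std::vector<Row> result;                           // pass 2: aggregate per partition
    for (const auto& part : parts) {
        std::unordered_map<std::uint64_t, std::int64_t> sums;
        for (const Row& r : part) sums[r.first] += r.second;
        result.insert(result.end(), sums.begin(), sums.end());
    }
    return result;
}
```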
37

Exploring Selective coherence as a Solution to Self-invalidation in ArgoDSM

Edberg, Christopher January 2022 (has links)
Maintaining coherence in a distributed system can prove challenging; this is especially true for distributed shared memory systems. The problem with remote synchronization in the distributed shared memory software ArgoDSM occurs when a lock operation has to cross node boundaries, which causes a large number of costly self-invalidations (SI) or self-downgrades (SD). The performance of the coherence protocol can be improved if these SI/SD situations can be avoided by using a suitable alternative. This work explores whether the use of selective coherence operations and non-synchronizing locking can alleviate the issue of SI and SD in ArgoDSM, and thereby improve performance compared to the cache-wide coherence operations triggered by the default locking mechanism in ArgoDSM. The concept is implemented by replacing the standard coherence protocol used in locking operations with selective operations, and the resulting implementation is used to compare performance against the baseline software. The selective coherence operations outperform the default protocol when applied to synchronization-heavy benchmarks, while the baseline software performs better when there is a smaller amount of parallel work being done.
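The contrast being explored can be sketched conceptually in C++: instead of a lock acquire triggering a cache-wide self-invalidation, only the pages backing the data that the lock actually protects are invalidated or written back. All function names below are hypothetical placeholders, not ArgoDSM's API.

```cpp
// Conceptual sketch: cache-wide vs. selective coherence around a DSM lock.
#include <cstddef>

struct dsm_lock { /* node-spanning lock state */ };

// Hypothetical coherence primitives (stubs for illustration).
void self_invalidate_all()                          { /* drop every locally cached page   */ }
void self_invalidate_range(void*, std::size_t)      { /* drop only the given pages         */ }
void self_downgrade_range(void*, std::size_t)       { /* write back only the given pages    */ }
void lock_acquire(dsm_lock&) {}
void lock_release(dsm_lock&) {}

// Default behaviour: correctness via cache-wide operations.
void acquire_full(dsm_lock& l) {
    lock_acquire(l);
    self_invalidate_all();                 // costly when the lock crosses node boundaries
}

// Selective variant: the caller names the data the lock actually protects.
void acquire_selective(dsm_lock& l, void* data, std::size_t bytes) {
    lock_acquire(l);
    self_invalidate_range(data, bytes);    // fetch fresh copies of just this region
}

void release_selective(dsm_lock& l, void* data, std::size_t bytes) {
    self_downgrade_range(data, bytes);     // publish only this region's writes
    lock_release(l);
}
```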
38

Real-Time Telemetry Data Processing and Large Scale Processors

Dreibelbis, Harold N., Kelsch, Dennis, James, Larry 11 1900 (has links)
International Telemetering Conference Proceedings / November 04-07, 1991 / Riviera Hotel and Convention Center, Las Vegas, Nevada / Real-time processing of telemetry data has evolved from a highly centralized single large scale computer system to multiple mini-computers or super mini-computers tied together in a loosely coupled distributed network, with each mini-computer or super mini-computer essentially performing a single function in the real-time processing sequence of events. The reasons for this evolution are many and varied. This paper will review some of the more significant factors in that evolution and will present some alternatives to a fully distributed mini-computer network that appear to offer significant real-time data processing advantages.
39

Parallel processing in power systems computation on a distributed memory message passing multicomputer

Hong, Chao, 洪潮 January 2000 (has links)
published_or_final_version / Electrical and Electronic Engineering / Doctoral / Doctor of Philosophy
40

Simulation modelling of distributed-shared memory multiprocessors

Marurngsith, Worawan January 2006 (has links)
Distributed shared memory (DSM) systems have been recognised as a compelling platform for parallel computing due to their programming advantages and scalability. DSM systems allow applications to access data in a logically shared address space by abstracting away the distinction of physical memory location. As the location of data is transparent, the sources of overhead caused by accessing distant memories are difficult to analyse. This memory locality problem has been identified as crucial to DSM performance. Many researchers have investigated the problem using simulation as a tool for conducting experiments, resulting in the progressive evolution of DSM systems. Nevertheless, both the diversity of architectural configurations and the rapid advance of DSM implementations impose constraints on simulation model design in two respects: the limited extensibility of the simulation framework, and the lack of verification applicability during a simulation run, which delays the verification process. This thesis studies simulation modelling techniques for memory locality analysis of various DSM systems implemented on top of a cluster of symmetric multiprocessors. The thesis presents a simulation technique to promote model extensibility and proposes a technique for verification applicability, called Specification-based Parameter Model Interaction (SPMI). The proposed techniques have been implemented in a new interpretation-driven simulation called DSiMCLUSTER on top of a discrete event simulation (DES) engine known as HASE. Experiments have been conducted to determine which factors are most influential on the degree of locality and to determine the possibility of maximising the stability of performance. DSiMCLUSTER has been validated against a SunFire 15K server, achieving similar cache-miss results, within ±6% on average and with a worst-case difference of less than 15%. These results confirm that the techniques used in developing DSiMCLUSTER can contribute ways to achieve both (a) a highly extensible simulation framework to keep up with the ongoing innovation of DSM architectures, and (b) verification applicability, resulting in an efficient framework for memory analysis experiments on DSM architectures.
