Global ETD Search

1	Efficient openMP over sequentially consistent distributed shared memory systems Costa Prats, Juan José 20 July 2011 Clusters are now among the most widely used platforms in high-performance computing, and most programmers use the Message Passing Interface (MPI) library to program applications on these distributed platforms and extract their maximum performance, although doing so is a complex task. On the other hand, OpenMP has become the de facto standard for programming shared-memory platforms because it is easy to use and achieves good performance with little effort. Could the two worlds be joined? Could programmers bring the ease of use of OpenMP to distributed platforms? Many researchers think so. One of the ideas developed is distributed shared memory (DSM), a software layer on top of a distributed platform that gives applications an abstract shared-memory view. Although it seems a good solution, it has drawbacks: memory coherence between the nodes of the platform is difficult to maintain (complex management, scalability issues, high overhead, and others), and the latency of remote-memory accesses can be orders of magnitude greater than on a shared bus because of the interconnection network. This research therefore improves the performance of OpenMP applications executed on distributed-memory platforms using a DSM with sequential consistency, and evaluates the results thoroughly with the NAS Parallel Benchmarks. The vast majority of DSM designs use a relaxed consistency model because it avoids some major problems in the area. In contrast, we use a sequential consistency model because we think that exposing these otherwise hidden problems may lead to solutions that can then be applied to both models. 
The main idea behind this work is that both runtimes, the OpenMP layer and the DSM layer, should cooperate to achieve good performance; otherwise they interfere with each other and degrade the final performance of applications. We develop three contributions to improve the performance of these applications: (a) a technique to avoid false sharing at runtime, (b) a technique to mimic the MPI behaviour, where produced data is forwarded to its consumers, and, finally, (c) a mechanism to avoid the network congestion caused by the DSM coherence messages. The NAS Parallel Benchmarks are used to test the contributions. The results of this work show that the impact of the false-sharing problem depends on each application. Another result is the importance of moving the data flow out of the critical path: techniques that forward data as early as possible, as MPI does, benefit the final application performance. Additionally, this data movement is usually concentrated at single points and affects application performance because of the limited bandwidth of the network. It is therefore necessary to provide mechanisms that spread this data movement over the computation time, using an otherwise idle network. Finally, the results show that the proposed contributions improve the performance of OpenMP applications in this kind of environment. Distributed shared memory OpenMP Sequential consistency Presend Preinvalidation Chop Virtual synchronization point NanosDSM 004
2	A proposed memory consistency model for Chapel Srinivasa Murthy, Karthik, 1983- 20 December 2010 A memory consistency model for a language defines the order of memory operations performed by each thread in a parallel execution. Such a constraint is necessary to prevent compiler and hardware optimizations from reordering certain memory operations, since such reordering might lead to unintuitive results. In this thesis, we propose a memory consistency model for Chapel, a parallel programming language from Cray Inc. Our memory model for Chapel is based on the idea of multiresolution and aims to provide a migration path from a program that is easy to reason about to a program that performs better. Our model allows a programmer to write a parallel program with sequential consistency semantics, and then migrate to a performance-oriented version by incrementally changing different parts of the program to follow relaxed semantics. Sequential semantics helps in reasoning about the correctness of the parallel program and is provided by the strict sequential consistency model in our proposed memory model. The performance-oriented versions can be obtained either by using the compiler sequential consistency model, which maintains the sequential semantics, or by using the relaxed consistency model, which maintains consistency only at global synchronization points. Our proposed memory model for Chapel thus combines the strict sequential consistency model, the compiler sequential consistency model, and the relaxed consistency model. We analyze the performance of the three consistency models by implementing three applications (Barnes-Hut, FFT, and Random-Access) both in Chapel and in the hybrid model of MPI and Pthread. 
We conclude the following: the strict sequential consistency model is the best model for finding algorithmic errors in the applications, though it leads to the worst performance; the relaxed consistency model gives the best performance of the three models, but relies on the programmer to enforce synchronization correctly; the performance of the compiler sequential model depends on the accuracy of the dependence analysis performed by the compiler; and the relative performance of the consistency models is the same in Chapel and in the hybrid programming model of MPI and Pthread. This shows that our model is not tightly bound to Chapel and can be applied to other programming models and languages. / text Chapel Memory models Sequential consistency Delay set analysis Parallel programming language
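To make the term "sequential consistency" in the two entries above concrete, here is a minimal Python sketch. It is an illustration and not code from either thesis. It enumerates every interleaving of the classic two-thread store-buffering litmus test and collects the results that a sequentially consistent memory can produce. The names `THREAD_0`, `THREAD_1`, `interleavings`, `run`, and `sc_outcomes` are hypothetical.

```python
from itertools import combinations

# Each thread is a list of operations in program order.
# ("store", var, value) writes shared memory; ("load", var, reg) reads it.
THREAD_0 = [("store", "x", 1), ("load", "y", "r0")]
THREAD_1 = [("store", "y", 1), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        merged, ia, ib = [], iter(a), iter(b)
        for i in range(total):
            merged.append(next(ia) if i in slots else next(ib))
        yield merged


def run(schedule):
    """Execute one global order of operations against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, var, arg in schedule:
        if kind == "store":
            memory[var] = arg
        else:
            registers[arg] = memory[var]
    return registers["r0"], registers["r1"]


def sc_outcomes():
    """Set of (r0, r1) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD_0, THREAD_1)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

Under these assumptions the result (0, 0) never appears, because some store always comes first in any single global order. Relaxed models that let a load bypass an earlier store to a different address can allow it, which is the kind of reordering a memory consistency model has to permit or forbid explicitly.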
3	SNIC-DSM: SmartNIC based DSM Infrastructure for Heterogeneous-ISA Machines Ramesh, Hemanth 14 August 2023 Heterogeneous computing is increasingly used in today's datacenters to meet the growing computational demands of applications. Heterogeneous hardware typically includes CPUs, GPUs, ASICs, and FPGAs, among others. An important emerging trend is instruction-set-architecture (ISA) heterogeneity: high-end x86 servers with attached SmartNICs and SmartSSDs that incorporate general-purpose CPUs, typically of the RISC ISA family (e.g., ARM, RISC-V). To alleviate resource congestion on server computing nodes, application workloads can be scaled out across server x86 CPUs and SmartNIC ARM CPUs using the distributed shared memory (DSM) abstraction. We present SNIC-DSM, a SmartNIC-based DSM infrastructure for heterogeneous-ISA machines. SNIC-DSM implements a low-latency messaging layer, which enables inter-node communication across multi-ISA CPUs, and a DSM protocol processor that provides memory coherency among these nodes, both implemented in the SmartNIC's FPGA logic. SNIC-DSM is reconfigurable and allows the implementation of different memory consistency protocols. Our experimental studies using compute-intensive benchmarks reveal that SNIC-DSM outperforms the state-of-the-art DSM, Popcorn Linux's software DSM, when server resource congestion is high. / Master of Science / The availability of heterogeneous computing architectures has led to the development of distributed shared memory systems, which allow compute-intensive applications to run in a distributed manner on different types of computing devices such as graphics processors, reconfigurable logic devices, and custom integrated circuits. Adopting such a heterogeneous computing strategy yields better performance and reduces power consumption. Generally, these DSM systems use a software-based approach, which offers great flexibility but suffers from software overheads. 
Hardware-based approaches are used to overcome these limitations, but they generally do not offer flexibility. This thesis presents SNIC-DSM, a reconfigurable implementation of the DSM framework. SNIC-DSM provides a platform for the host and smart networking devices such as SmartNICs to communicate with each other, and enables distributed application execution by providing memory coherency. Our experimental evaluation using High-Performance Computing benchmarks reveals that SNIC-DSM improves performance when compared with software-based DSM. Popcorn Linux Distributed Shared Memory SmartNIC Sequential Consistency Instruction-Set-Architecture Corundum
4	f-DSM: An FPGA-Accelerated Distributed Shared Memory for Heterogeneous Instruction-Set-Architecture Hardware VSathish, Naarayanan Rao 03 March 2022 Due to the diminishing relevance of Moore's Law, traditional multi-core systems are increasingly struggling to meet the computational demands of many emerging workloads. Heterogeneous computing, which involves exploiting higher degrees of parallelism (e.g., GPUs) and application-specific specialization (e.g., FPGAs), is increasingly used to meet this demand. An important architectural trend in this space involves using instruction-set-architecture (ISA) heterogeneity. An exemplar case is emerging I/O devices that include CPU cores with ISAs (e.g., ARM, RISC-V) that differ from that of host CPUs (e.g., x86) and have physically discrete memory. Shared-memory programming of such systems requires the Distributed Shared Memory (DSM) abstraction. Software DSM incurs significant OS overhead for maintaining memory coherency. Despite outperforming their software predecessors, hardware DSM and cache-coherent interfaces require custom chips and lack the flexibility to experiment with different DSM consistency protocols. This thesis presents fDSM, an FPGA-accelerated DSM framework for ISA-heterogeneous hardware. fDSM implements a high-speed messaging layer to enable inter-node communication across ISA-different CPU cores and a DSM protocol processor that maintains virtual memory coherency using a multiple-reader single-writer DSM algorithm. Experimental studies reveal that fDSM outperforms prior art, including Popcorn Linux's software DSM abstraction running over TCP/IP and over state-of-the-art InfiniBand RDMA messaging layers, by 2.8X and 7%, respectively. fDSM also provides reconfigurability and thereby allows implementation of and experimentation with different memory consistency models. / Master of Science / Moore's Law predicts that the number of transistors in a chip will double approximately every two years. 
Chip vendors are increasingly observing that this law is nearing its limit as transistor sizes shrink to 5nm and 3nm, because of power consumption and heat dissipation issues. As a result, innovation in new computing architectures has increasingly focused on heterogeneity, i.e., the use of hardware performance accelerators such as graphics processors and reconfigurable logic in conjunction with a computer's CPU (host). To improve the programmability of these architectures, which usually have physically separate memory, the shared-memory programming model is usually used to provide coherent virtual memory. The shared-memory model, when applied to such distributed systems, is called distributed shared memory (DSM) and has previously been developed in software as well as in hardware. The former usually suffers from high latency overheads, while the latter often requires custom chips and lacks programmability for implementing new memory consistency protocols. This thesis presents fDSM, a reconfigurable distributed shared memory framework that provides coherent shared memory between a host and a smart I/O device such as a SmartNIC. fDSM is implemented in FPGAs, which are increasingly available in hosts and smart I/O devices at the commodity scale. Our prototype implementation uses ISA-heterogeneous hosts to emulate such an environment. Our experimental evaluation using applications from High-Performance Computing benchmark suites reveals that fDSM yields performance benefits over a state-of-the-art software DSM. Distributed Shared Memory Sequential Consistency Popcorn Linux Field programmable gate arrays Instruction-Set-Architecture
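The "multiple-reader single-writer" algorithm named in the entry above can be pictured with a short Python sketch. It shows the general textbook idea only and makes no claim about how fDSM implements it in FPGA logic. The class `PageDirectory`, its methods, and the message names are all hypothetical.

```python
class PageDirectory:
    """Tracks which nodes may read or write one shared page."""

    def __init__(self):
        self.readers = set()   # nodes holding a read-only copy
        self.writer = None     # node holding the only writable copy, if any
        self.messages = []     # coherence messages sent, kept for inspection

    def acquire_read(self, node):
        # A writable copy held elsewhere is downgraded before the page is shared.
        if self.writer is not None and self.writer != node:
            self.messages.append(("downgrade", self.writer))
            self.readers.add(self.writer)
            self.writer = None
        if self.writer != node:
            self.readers.add(node)

    def acquire_write(self, node):
        # Every other copy is invalidated so that only one writer remains.
        holders = set(self.readers)
        if self.writer is not None:
            holders.add(self.writer)
        for other in sorted(holders - {node}):
            self.messages.append(("invalidate", other))
        self.readers = set()
        self.writer = node

    def invariant_holds(self):
        # Either several readers or one writer, never both at once.
        return self.writer is None or not self.readers


if __name__ == "__main__":
    page = PageDirectory()
    page.acquire_read("x86")
    page.acquire_read("arm")
    page.acquire_write("arm")
    print(page.writer, page.messages)
```

In this sketch a write by the `"arm"` node invalidates the copy on the `"x86"` node. That invalidation traffic is the coherence cost that the software and hardware DSM designs in these abstracts try to reduce.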
5	Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors Ubal Tena, Rafael 01 September 2010 Current superscalar processors use a reorder buffer (ROB) to keep track of in-flight instructions. The ROB is implemented as a FIFO (first in, first out) queue into which instructions are inserted in program order after being decoded, and from which they are also removed in program order at the commit stage. This structure provides simple support for speculation, precise exceptions, and register reclamation. However, retiring instructions in order can degrade performance if a long-latency operation is blocking the head of the ROB. Several proposals addressing this problem have been published. Most of them retire instructions out of order speculatively, which requires storing checkpoints to restore a valid processor state after a misspeculation. Checkpoints usually have to be implemented with costly hardware structures, and they also require other processor structures to grow, which in turn can affect the clock cycle time. This problem affects many types of current processors, regardless of the number of hardware threads and compute cores they include. This thesis studies the non-speculative out-of-order retirement of instructions in superscalar, multithreaded, and multicore processors. / Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Doctoral thesis]. Universitat Politècnica de València. 
https://doi.org/10.4995/Thesis/10251/8535 Out-of-order retirement Reorder buffer Processor architecture Multithreading Multicore Superscalar Sequential consistency 120317 - Computer science 120326 - Simulation 330406 - Computer architecture
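The head-of-queue blocking described in the entry above can be seen in a small Python sketch of the in-order baseline. It is a simplified illustration, not the thesis's proposal. It assumes that any number of instructions may retire in the same cycle, and the function name `retire_in_order` and the sample finish cycles are hypothetical.

```python
from collections import deque


def retire_in_order(finish_cycles):
    """Return the cycle at which each instruction leaves a FIFO reorder buffer.

    finish_cycles[i] is the assumed cycle at which instruction i, counted in
    program order, completes execution.
    """
    rob = deque(enumerate(finish_cycles))  # head of the queue = oldest instruction
    retired = [None] * len(finish_cycles)
    cycle = 0
    while rob:
        index, finish = rob[0]
        # Only the head may retire; younger entries wait even if already finished.
        cycle = max(cycle, finish)
        retired[index] = cycle
        rob.popleft()
    return retired


if __name__ == "__main__":
    # A long-latency instruction at the head delays three short ones behind it.
    print(retire_in_order([10, 2, 3, 4]))
```

With the sample input, the three short instructions finish early but cannot leave the buffer before the first one does. Out-of-order retirement schemes aim to shorten that wait.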
