31 |
f-DSM: An FPGA-Accelerated Distributed Shared Memory for Heterogeneous Instruction-Set-Architecture Hardware. Vsathish, Naarayanan Rao. 03 March 2022 (has links)
Due to the diminishing relevance of Moore's Law, traditional multi-core systems are increasingly struggling to meet the computational demands of many emerging workloads. Heterogeneous computing, which involves exploiting higher degrees of parallelism (e.g., GPUs) and application-specific specialization (e.g., FPGAs), is increasingly used to meet this demand. An important architectural trend in this space involves instruction-set-architecture (ISA) heterogeneity. An exemplar case is emerging I/O devices that include CPU cores with ISAs (e.g., ARM, RISC-V) that differ from that of host CPUs (e.g., x86) and that have physically discrete memory. Shared-memory programming of such systems requires the Distributed Shared Memory (DSM) abstraction. Software DSM incurs significant OS overhead for maintaining memory coherency. Despite outperforming their software predecessors, hardware DSM and cache-coherent interfaces require custom chips and lack the flexibility to experiment with different DSM consistency protocols. This thesis presents fDSM, an FPGA-accelerated DSM framework for ISA-heterogeneous hardware. fDSM implements a high-speed messaging layer to enable inter-node communication across ISA-different CPU cores, and a DSM protocol processor that maintains virtual memory coherency using a multiple-reader single-writer DSM algorithm. Experimental studies reveal that fDSM outperforms prior art, including Popcorn Linux's software DSM abstraction with its TCP/IP messaging layer and a state-of-the-art InfiniBand RDMA messaging layer, by 2.8X and 7%, respectively. fDSM also provides reconfigurability and thereby allows implementation of, and experimentation with, different memory consistency models. / Master of Science / Moore's Law predicts that the number of transistors in a chip will double approximately every two years. Chip vendors are increasingly observing that this law is nearing its limit as transistor sizes shrink to 5nm and 3nm, due to power consumption and heat dissipation issues. As a result, innovation in new computing architectures has increasingly focused on heterogeneity, i.e., the use of hardware performance accelerators such as graphics processors and reconfigurable logic in confluence with a computer's CPU (the host). To improve the programmability of these architectures, which usually have physically separate memory, the shared-memory programming model is usually used to provide coherent virtual memory. The shared-memory model applied to such distributed systems, called distributed shared memory (or DSM), has previously been developed in software as well as in hardware. The former usually suffers from high latency overheads, while the latter often requires custom chips and lacks the programmability needed for implementing new memory consistency protocols. This thesis presents fDSM, a reconfigurable distributed shared memory framework that provides coherent shared memory between a host and a smart I/O device such as a SmartNIC. fDSM is implemented in FPGAs, which are increasingly available in hosts and smart I/O devices at commodity scale. Our prototype implementation uses ISA-heterogeneous hosts to emulate such an environment. Our experimental evaluation using applications from High-Performance Computing benchmark suites reveals that fDSM yields performance benefits over a state-of-the-art software DSM.
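To make the coherence discipline concrete: a multiple-reader single-writer protocol lets many nodes cache a page read-only but forces a writer to invalidate every other copy before modifying it. A minimal Python sketch of that per-page bookkeeping follows; it illustrates the general MRSW idea, not fDSM's implementation, and all names in it are invented.

```python
# Illustrative multiple-reader/single-writer (MRSW) page bookkeeping:
# many nodes may hold read-only copies, but a write first invalidates
# all other copies and takes exclusive ownership. Not fDSM's actual code.

class MRSWPage:
    def __init__(self, owner):
        self.owner = owner        # node holding the authoritative copy
        self.readers = {owner}    # nodes holding read-only copies
        self.writable = False     # True only while one node has exclusive access

    def acquire_read(self, node):
        self.writable = False     # downgrade any writer; page becomes shared
        self.readers.add(node)    # node fetches a read-only copy from the owner

    def acquire_write(self, node):
        for reader in self.readers - {node}:
            self.invalidate(reader)   # revoke every other copy first
        self.readers = {node}
        self.owner = node
        self.writable = True

    def invalidate(self, node):
        pass  # in a real DSM this sends an invalidation message to `node`

page = MRSWPage(owner=0)
page.acquire_read(1)      # nodes 0 and 1 share the page read-only
page.acquire_write(2)     # node 2 invalidates both copies and becomes writer
```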
|
32 |
An API for adaptive loop scheduling in shared address space architectures. Govindaswamy, Kirthilakshmi. January 2003 (PDF)
Thesis (M.S.)--Mississippi State University. Department of Computer Science and Engineering. / Title from title screen. Includes bibliographical references.
|
33 |
Directory scalability in multi-agent based systems. Hussain, Shahid; Shabbir, Hassan. January 2008 (has links)
Simulation is one approach to analyzing and modeling complex real-world problems. Multi-agent based systems provide a platform for developing simulations based on the concept of agent-oriented programming. In multi-agent systems, local interaction between agents contributes to the emergence of global phenomena observed in the results of simulation runs. In MABS systems, interaction is a requirement common to all agents: to interact with each other, agents need the platform's yellow-page service to search for other agents. As more and more agents perform searches on this yellow-page directory, performance decreases due to the central bottleneck. In this thesis, we have investigated multiple solutions to this problem. The most promising solution is to integrate distributed shared memory with the directory system. With our proposed solution, empirical analysis shows a statistically significant increase in the performance of the directory service. We expect this result to make a considerable contribution to the state of the art in multi-agent platforms.
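As a concrete illustration of the remedy this abstract proposes, the sketch below (an illustration only, not the thesis's code) replicates the read-mostly yellow-page directory to every node, so lookups are served locally and only registrations pay the cost of updating all copies.

```python
# Illustrative sketch: a yellow-page directory replicated per node, in the
# spirit of backing the directory with distributed shared memory. Searches
# (the common operation) hit the local replica instead of a central server;
# registrations (rare) update every replica. All names here are invented.

class ReplicatedDirectory:
    def __init__(self, num_nodes):
        self.replicas = [{} for _ in range(num_nodes)]  # one copy per node

    def register(self, agent, service):
        for replica in self.replicas:                   # write path: update all
            replica.setdefault(service, set()).add(agent)

    def lookup(self, node_id, service):
        return self.replicas[node_id].get(service, set())  # read path: local

directory = ReplicatedDirectory(num_nodes=4)
directory.register("agent-7", "route-planning")
print(directory.lookup(3, "route-planning"))   # served without a central hop
```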
|
34 |
An in-situ visualization approach for parallel coupling and steering of simulations through distributed shared memory files. Soumagne, Jérôme. 14 December 2012 (has links)
As simulation codes become more powerful and more interactive, it is increasingly desirable to monitor a simulation in-situ, performing not only visualization but also analysis of the incoming data as it is generated. Monitoring or post-processing simulation data in-situ has obvious advantages over the conventional approach of saving to, and reloading data from, the file system; the time and space it takes to write and then read the data from disk is a significant bottleneck for both the simulation and the subsequent post-processing steps. Furthermore, the simulation may be stopped, modified, or potentially steered, thus conserving CPU resources. We present in this thesis a loosely coupled approach that enables a simulation to transfer data to a visualization server via the use of in-memory files. We show in this study how the interface, implemented on top of a widely used hierarchical data format (HDF5), allows us to efficiently decrease the I/O bottleneck by using efficient communication and data-mapping strategies. For steering, we present an interface that allows not only simple parameter changes but also complete re-meshing of grids, or operations involving the regeneration of field values over the entire computational domain, to be carried out. This approach, tested and validated on two industrial test cases, is generic enough that no particular knowledge of the underlying data model is required.
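HDF5 itself ships a "core" (in-memory) file driver that makes the in-memory-file idea easy to demonstrate; the h5py sketch below only illustrates that driver (it assumes h5py is installed) and is not the thesis's interface, which sits inside HDF5's virtual file layer.

```python
# A file opened with HDF5's "core" driver is laid out exactly like an on-disk
# HDF5 file but lives in memory (backing_store=False), avoiding the
# write-to-disk-then-reload bottleneck described above.
import numpy as np
import h5py

with h5py.File("timestep_0042.h5", "w", driver="core",
               backing_store=False) as f:
    f.create_dataset("pressure", data=np.random.rand(128, 128))
    f["pressure"].attrs["time"] = 0.42
    print(f["pressure"][0, :4])    # readable in place, no disk I/O
```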
|
35 |
ShareD-GM: Distributed Memory Architecture for the D-GM Environment. Zechlinski, Gustavo Mata. 11 September 2010 (has links)
Recent advances in computer technology have increased the use of computer clusters for running applications that require a large computational effort, making this practice a strong trend. Following this trend, the D-GM (Distributed Geometric Machine) environment is a tool composed of two software modules, VPE-GM (Visual Programming Environment for Geometric Machine) and VirD-GM (Virtual Distributed Geometric Machine), whose goals are the development of scientific-computing applications through visual programming and through parallel and/or distributed execution, respectively. The core of the D-GM environment is based on the Geometric Machine (GM model), an abstract machine model for parallel and/or concurrent computations whose definitions cover the forms of parallelism available for process execution. The main contribution of this work is the formalization and development of a distributed memory for the D-GM environment by designing, modeling, and constructing the integration between that environment and a distributed shared memory (DSM) system, aiming at a better execution dynamic with greater functionality and, possibly, increased performance of D-GM applications. This integration, whose objective is to supply a distributed shared memory module to the D-GM environment, is called ShareD-GM. Based on a study of software DSM implementations, chiefly of the characteristics that meet the requirements for implementing the D-GM distributed memory, this work adopts the Terracotta system. Two facilities of Terracotta stand out: portability, and adaptability for distributed execution on a cluster of computers with little or even no code modification (codeless clustering), both of which pay off greatly when integrating with Java applications. In addition, Terracotta does not use RMI (Remote Method Invocation) for communication among objects in a Java environment, which minimizes the overhead of data serialization (marshalling) in network transmission. Applications developed to evaluate the architecture provided by the ShareD-GM integration, such as the Smith-Waterman algorithm and the Jacobi method, showed shorter running times than the previous VirD-GM execution module.
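Terracotta's programming model, in which an ordinary object graph becomes cluster-visible without RMI plumbing in application code, has a rough single-machine analogue in Python's multiprocessing proxies. The sketch below is only that analogy (Terracotta itself is a JVM technology and ShareD-GM is built on Java); it illustrates the shared-root style of programming, not the ShareD-GM code.

```python
# Rough analogue of a DSM-backed shared object: worker processes mutate one
# logically shared dict through proxies, with no explicit messaging in the
# application code, similar in spirit to a Terracotta shared root.
from multiprocessing import Manager, Process

def worker(shared, key):
    shared[key] = key * key           # looks like a plain local write

if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict()       # the "shared root"
        procs = [Process(target=worker, args=(shared, k)) for k in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(shared))           # {0: 0, 1: 1, 2: 4, 3: 9}
```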
|
36 |
A runtime system for parallel programming with the shared-memory paradigm over heterogeneous distributed systems. Craveiro, Gisele da Silva. 31 October 2003 (has links)
Advances in hardware technology are making small-configuration SMP machines (from 2 to 8 processors) available at ever lower cost. For this reason, the inclusion of SMP nodes in a cluster of PCs, or even clusters built entirely of SMPs, is becoming a viable alternative for high-performance computing. The challenge is exploiting the computational resources these platforms provide. One alternative is a hybrid programming paradigm that uses the shared-memory architecture through multithreading within a node and the message-passing model for inter-node communication; programming in such a paradigm, however, is hard and unproductive for the application programmer. This thesis presents CPAR-Cluster, a runtime system that provides a shared-memory abstraction on top of a cluster composed of mono- and multiprocessor nodes. It is implemented at the library level and requires no special resources such as dedicated hardware or operating-system modifications. The models, strategies, and implementation aspects are presented, together with results of tests of the tool, which behaved as expected.
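The hybrid paradigm that CPAR-Cluster aims to hide from the programmer typically looks like the sketch below: threads share memory inside an SMP node while explicit messages cross node boundaries. This illustrates the paradigm, not CPAR-Cluster itself, and it assumes mpi4py and an MPI runtime are available.

```python
# Hybrid-model sketch: threads for intra-node shared memory, MPI messages
# for inter-node communication. Run with e.g.: mpirun -n 2 python hybrid.py
from threading import Thread
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

partial = [0] * 4
def work(i):
    partial[i] = (rank * 4 + i) ** 2   # threads write node-local shared memory

threads = [Thread(target=work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

local_sum = sum(partial)               # combine within the node
total = comm.reduce(local_sum, op=MPI.SUM, root=0)   # message across nodes
if rank == 0:
    print("global sum:", total)
```

The point of CPAR-Cluster is precisely that the programmer writes against one shared-variable abstraction instead of juggling both levels explicitly.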
|
37 |
Eidolon: adapting distributed applications to their environment. Potts, Daniel Paul. Computer Science & Engineering, Faculty of Engineering, UNSW. January 2008 (has links)
Grids, multi-clusters, NUMA systems, and ad-hoc collections of distributed computing devices all present diverse environments in which distributed computing applications can be run. Due to the diversity of features provided by these environments, a distributed application that is to perform well must be specifically designed and optimised for the environment in which it is deployed. Such optimisations generally affect the application's communication structure, its consistency protocols, and its communication protocols. This thesis explores approaches to improving the ability of distributed applications to share consistent data efficiently, and with improved functionality, over wide-area and diverse environments. We identify a fundamental separation of concerns for distributed applications. This is used to propose a new model, called the view model, which is a hybrid, cost-conscious approach to remote data sharing. It provides the necessary mechanisms and interconnects to improve the flexibility and functionality of data sharing without defining new programming models or protocols. We employ the view model to adapt distributed applications to their run-time environment without modifying the application or inventing new consistency or communication protocols. We explore the use of view-model properties on several programming models and their consistency protocols. In particular, we focus on programming models used in distributed-shared-memory middleware and applications, as these can benefit significantly from the properties of the view model. Our evaluation demonstrates the benefits, side effects, and potential shortcomings of the view model by comparing it with traditional models when running distributed applications across several multi-cluster scenarios. In particular, we show that the view model improves the performance of distributed applications while reducing resource usage and communication overheads.
|
38 |
Software Techniques for Distributed Shared Memory. Radovic, Zoran. January 2005 (has links)
In large multiprocessors, the access to shared memory is often nonuniform, and may vary as much as ten times for some distributed shared-memory architectures (DSMs). This dissertation identifies another important nonuniform property of DSM systems: nonuniform communication architecture, NUCA. High-end hardware-coherent machines built from large nodes, or from chip multiprocessors, are typical NUCA systems, since they have a lower penalty for reading recently written data from a neighbor's cache than from a remote cache. This dissertation identifies node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations exploiting NUCAs are presented and evaluated. NUCA-aware locks are shown to be almost twice as efficient for contended critical sections compared to traditional lock implementations. The shared-memory "illusion" provided by some large DSM systems may be implemented using either hardware, software, or a combination thereof. A software-based implementation can enable cheap cluster hardware to be used, but typically suffers from poor and unpredictable performance characteristics. This dissertation advocates a new software-hardware trade-off design point based on a new combination of techniques. The two low-level techniques, fine-grain deterministic coherence and synchronous protocol execution, as well as profile-guided protocol flexibility, are evaluated in isolation as well as in a combined setting using all-software implementations. Finally, a minimum of hardware trap support is suggested to further improve the performance of coherence protocols across cluster nodes. It is shown that all these techniques combined could result in fairly stable performance on par with hardware-based coherence.
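The node-affinity idea behind those hierarchical locks can be sketched as a two-level lock: threads contend on a per-node lock first, and only each node's winner contends globally, which keeps a contended lock (and the data it protects) on one node long enough for it to be re-read from a nearby cache. The Python sketch below illustrates the structure only; the dissertation's locks target hardware DSMs and add explicit same-node handoff policies.

```python
# Two-level lock sketch: local contention is filtered before the global lock,
# so at most one thread per node competes globally. Illustrative only.
import threading

class HierarchicalLock:
    def __init__(self, num_nodes):
        self.node_locks = [threading.Lock() for _ in range(num_nodes)]
        self.global_lock = threading.Lock()

    def acquire(self, node_id):
        self.node_locks[node_id].acquire()   # contend locally first
        self.global_lock.acquire()           # one contender per node globally

    def release(self, node_id):
        self.global_lock.release()
        self.node_locks[node_id].release()

lock = HierarchicalLock(num_nodes=2)
counter = 0

def critical(node_id):
    global counter
    lock.acquire(node_id)
    counter += 1                             # the contended critical section
    lock.release(node_id)

threads = [threading.Thread(target=critical, args=(i % 2,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 8
```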
|
39 |
Checkpointing Algorithms for Parallel Computers. Kalaiselvi, S. 02 1900 (has links)
Checkpointing is a technique widely used in parallel/distributed computers for rollback error recovery. Checkpointing is defined as the coordinated saving of process state information at specified time instances. Checkpoints help in restoring the computation from the latest saved state, in case of failure. In addition to fault recovery, checkpointing has applications in fault detection, distributed debugging and process migration.
Checkpointing in uniprocessor systems is easy because there is a single clock and events occur with respect to this clock: there is a clear demarcation between events that happen before a checkpoint and events that happen after it. In a parallel computer, a large number of processors coordinate to solve a single problem. Since there may be multiple streams of execution, checkpoints have to be introduced along all these streams simultaneously. The absence of a global clock necessitates explicit coordination to obtain a consistent global state.
Events occurring in a distributed system can be partially ordered using Lamport's happens-before relation. The happens-before relation -> is a partial ordering that identifies dependent and concurrent events in a distributed system. It is defined as follows:
· If two events a and b happen in the same process, and a happens before b, then a -> b.
· If a is the sending event of a message and b is the receiving event of the same message, then a -> b.
· If neither a -> b nor b -> a, then a and b are said to be concurrent.
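As an aside not drawn from the thesis, the happens-before relation can be checked mechanically with vector clocks, as the following minimal Python sketch shows:

```python
# Vector-clock check of Lamport's happens-before: event A happened before
# event B iff A's clock is component-wise <= B's and the clocks differ.

def happened_before(vc_a, vc_b):
    return all(x <= y for x, y in zip(vc_a, vc_b)) and vc_a != vc_b

def concurrent(vc_a, vc_b):
    return not happened_before(vc_a, vc_b) and not happened_before(vc_b, vc_a)

# Three processes; entry i counts events of process i known to the event.
a = [1, 0, 0]   # first event on process 0
b = [2, 1, 0]   # event on process 1 after receiving a message from process 0
c = [0, 0, 1]   # independent event on process 2

assert happened_before(a, b)   # a -> b
assert concurrent(a, c)        # neither a -> c nor c -> a
```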
A consistent global state may have concurrent checkpoints. In the first chapter of the thesis we discuss issues regarding the ordering of events in a parallel computer, the need for coordination among checkpoints, and other aspects related to checkpointing. Checkpointing locations can be identified either statically or dynamically. The static approach assumes that a representation of the program to be checkpointed is available, with information that enables a programmer to specify the places where checkpoints are to be taken. The dynamic approach identifies the checkpointing locations at run time. In this thesis, we have proposed algorithms for both static and dynamic checkpointing. The main contributions of this thesis are as follows:
1. Parallel computers that are being built now have faster communication and hence more efficient clock synchronisation compared to those built a few years ago. Based on efficient clock synchronisation protocols, the clock drift in current machines can be maintained within a few microseconds. We have proposed a dynamic checkpointing algorithm for parallel computers assuming bounded clock drifts.
2. The shared memory paradigm is convenient for programming, while the message passing paradigm is easy to scale. Distributed Shared Memory (DSM) systems combine the advantages of both paradigms and can easily be realised on top of a network of workstations. IEEE has recently proposed an interconnect standard called Scalable Coherent Interface (SCI) for configuring computers as a Distributed Shared Memory system. A periodic dynamic checkpointing algorithm has been proposed in the thesis for a DSM system which uses the SCI standard.
3. When information about a parallel program is available, one can make use of this knowledge to perform efficient checkpointing. A static checkpointing approach based on task graphs is proposed for parallel programs. The proposed task-graph-based static checkpointing approach has been implemented on a Parallel Virtual Machine (PVM) platform.
We now give a gist of the various chapters of the thesis. Chapter 2 gives a classification of existing checkpointing algorithms and surveys the algorithms that have been reported in the literature for checkpointing parallel/distributed systems. A point to be noted is that most of the algorithms published for checkpointing message passing systems are based on the seminal article by Chandy & Lamport. A large number of checkpointing algorithms have been published by relaxing the assumptions made in that article and by extending its features to minimise the overheads of coordination and context saving.
Checkpointing for shared memory systems primarily extends cache coherence protocols to maintain a consistent memory; all such schemes assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems which extend the cache coherence protocols used in shared memory systems; they, however, also include methods for storing the status of distributed memory in stable storage. Chapter 2 concludes with brief comments on the desirable features of a checkpointing algorithm.
In Chapter 3, we develop a dynamic checkpointing algorithm for message passing systems, assuming that the clock drift of processors in the system is bounded. Efficient clock synchronisation protocols have been implemented on recent parallel computers owing to the fact that communication between processors is very fast. Based on efficient clock synchronisation protocols, clock skew can be limited to a few microseconds. The algorithm proposed in the thesis uses clocks for checkpoint coordination and vector counts for identifying messages to be logged. The algorithm is a periodic, distributed algorithm. We prove the correctness of the algorithm and compare it with similar clock-based algorithms.
Distributed Shared Memory (DSM) systems provide the benefit of ease of programming in a scalable system. The recently proposed IEEE Scalable Coherent Interface (SCI) standard facilitates the construction of scalable coherent systems. In Chapter 4 we discuss a checkpointing algorithm for an SCI-based DSM system. SCI maintains cache coherence in hardware using a distributed cache directory which scales with the number of processors in the system. SCI recommends a two-phase transaction protocol for communication. Our algorithm is a two-phase, centralised, coordinated algorithm. Phase one initiates checkpoints, and the checkpointing activity is completed in phase two. The correctness of the algorithm is established theoretically. The chapter concludes with a discussion of the features of SCI exploited by the checkpointing algorithm proposed in the thesis.
In Chapter 5, a static checkpointing algorithm is developed assuming that the program to be executed on a parallel computer is given as a directed acyclic task graph. We assume that estimates of the time to execute each task in the task graph are given. Given the times at which checkpoints are to be taken, the algorithm identifies a set of edges where checkpointing tasks can be placed, ensuring that they form a consistent global checkpoint. The proposed algorithm eliminates coordination overhead at run time and significantly reduces the context-saving overhead by taking checkpoints along edges of the task graph. The algorithm is used as a preprocessing step before scheduling the tasks to processors. The algorithm complexity is O(km), where m is the number of edges in the graph and k is the maximum number of global checkpoints to be taken.
The static algorithm is implemented on a parallel computer with a PVM environment, as PVM is widely available and portable. The task graph of a program can be constructed manually or through program development tools. Our implementation is a collection of preprocessing and run-time routines. The preprocessing routines operate on the task graph information to generate, for each global checkpoint, a set of edges to be checkpointed, and write the information to disk. The run-time routines save the context along the marked edges. In case of recovery, the recovery algorithms read the information from stable storage and reconstruct the context. The limitation of our static checkpointing algorithm is that it can operate only on deterministic task graphs. To demonstrate the practical feasibility of the proposed approach, case studies of checkpointing some parallel programs are included in the thesis.
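To illustrate the edge-selection idea, the toy sketch below (our own reading, under the simplifying assumption that each task's estimated start and finish times have already been computed from the task graph) picks, for a desired checkpoint time t, the edges whose producer task has finished but whose consumer has not yet started; saving the data carried on those edges cuts the graph with no task mid-execution across the cut.

```python
# Toy sketch of static checkpoint placement on a task graph. Illustrative
# only; assumes per-task start/finish time estimates derived from the DAG.

def checkpoint_edges(edges, start, finish, t):
    """Edges 'in flight' at time t: producer done, consumer not yet started."""
    return [(u, v) for (u, v) in edges if finish[u] <= t < start[v]]

# Example: diamond-shaped task graph a -> b, a -> c, b -> d, c -> d
edges  = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
start  = {"a": 0,  "b": 10, "c": 10, "d": 25}
finish = {"a": 10, "b": 20, "c": 15, "d": 30}

print(checkpoint_edges(edges, start, finish, t=22))  # [('b', 'd'), ('c', 'd')]
```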
We conclude the thesis with a summary of proposed algorithms and possible directions to continue research in the area of checkpointing.
|