51 |
A Study of the Performance Benefits of Controlling Parallel Asynochrous Iteractive ApplicationsJoseph, P J 09 1900 (has links)
High performance networks of workstation are becoming increasingly popular a parallel computing platform because of their lower cost. Both message passing and software distributed shared memory (DSM) programming paradigms have been developed and employed on such distributed hardware platforms. An important performance bottleneck in these systems is the effective data transmission latency, which is poorer than in high-speed parallel computer interconnection networks.
Iterative algorithms are used in a large class of applications like solution of partial algorithms are used, optimization problems, solutions to systems of linear equations, and so on. These can be parallelized in a straight-forward fashion with cad1 node computing a part of the data set and also synchronizing and exchanging data with the other nodes as required. But these synchronous version perform poorly when message transmission delays are high, as is the case in network of workstations. The asynchronous parallel version of these algorithms provide an additional degree of freedom to address large data transmission latencies. These algorithms do not synchronize, and behave correctly in the presence of losses and delays in the propagation of updates. Thus, in shared memory systems they do not synchronize accesses to shared data and they will work correctly even in the presence of delays and losses in updates. This gives synchronous algorithms a performance advantage over their synchronous counterparts since synchronization costs are avoided and further computation can be overlapped with communication.
The message generation rate of asynchronous algorithms is however greater than that of their synchronous counterparts. Uncontrolled asynchronous runs can create a large network load resulting in large queuing delays, which in turn can increase the message generation of the asynchronous algorithms. This is especially possible in lower bandwidth network like that in network of workstations. Such a positive feedback loop leads to unstable network conditions.
Recent work has tried to improve the performance of asynchronous algorithms on a distributed shared memory (DSM) system by adaptively buffering shared memory updates depending on the network load, and transmitting multiple updates together. This reduces congestion and message transmission overhead, but could still result in slow convergence since nothing is guaranteed about, the update propagation delay. Also, although adaptive throttling of message will kick in once the network gets heavily loaded, it cannot, prevent the initial flooding. Furthermore, the processor is not freed when computation with the available values does not result in much further convergence.
In this thesis we present an alternate method of controlling iterative methods and present performance results for the same. We propose a new system-supported blocking read primitive, termed Global Read that is guaranteed to return a value of acceptable age of the specified location in a DSM system. The main idea is to enforce an upper bound on the age of shared updates seen by a node in a manner visible to the underlying system (DSM). Information about processes being blocked can be used for adapting the underlying system, especially the network, towards better performance. A reading process is throttled until its Global-Read is satisfied, thus implementing program-level flow control and also freeing the processor. The Global-Read can also help in communication-based scheduling of, processes.
Performance evaluation using a benchmark from Cray, on a network of workstations and on the IBM SP2 parallel computer, showed good performance improvements. We also present results of a systematic study wherein we implement parallel code for different iterative techniques for the solution of Lap lace equation wing PVM, anti characterize when controlled asynchrony work befit. We studied the improvements in computation time and analyzed the sources of this improvement, using various input and parallelism, on both IBM SP2 and a network of workstations. We find significant performance improvements for controlling asynchrony when the traffic generated by the asynchronous algorithm becomes more than what can be sustained by the network. But we found that the scalability of the applications is limited by the software overhead for messages. Systems which have reduced software overhead will show very good scalable performance for controlled asynchrony.
|
52 |
Υλοποίηση μεταφέρσιμου συστήματος κατανεμημένης κοινής μνήμης / Implementation of portable distributed shared memoryΚαραντάσης, Κωνσταντίνος 01 August 2007 (has links)
Η ανάπτυξη και εγκατάσταση συστάδων υπολογιστών (clusters) και διαδικτυακών πλεγμάτων υπολογισμού (computational grids), διαρκώς αυξανόμενη στις μέρες μας, διαμορφώνει ένα σαφώς κατανεμημένο περιβάλλον, ικανό για την εφαρμογή υπολογισμού στο εύρος του διαδικτύου. Στο πλαίσιο αυτό, η παράλληλη επεξεργασία καλείται να επωφεληθεί από την εγγύτητα των υπολογιστικών πόρων, όπως αυτή διαμορφώνεται από τα σύγχρονα δίκτυα υψηλών ταχυτήτων. Την ίδια στιγμή, οι προγραμματιστές παράλληλων εφαρμογών βρίσκονται σε δίλημμα ανάμεσα σε μοντέλα προγραμματισμού κοινής μνήμης ή κατανεμημένα. Με τα κατανεμημένα μοντέλα να αποτελούν την αρχική και πιο φυσική επιλογή στο περιβάλλον των συστάδων και των πλεγμάτων, ο προγραμματιστής έρχεται ξανά αντιμέτωπος με τα διαχρονικά προβλήματα που ενέχει η αποτύπωση του παραλληλισμού των εφαρμογών και ο προγραμματισμός με τη χρήση μοντέλων ανταλλαγής μηνυμάτων (message passing models). Έχοντας σαν στόχο την απαλλαγή του προγραμματιστή από τις δυσκολίες των κατανεμημένων μοντέλων, γίνεται σημαντική ερευνητική προσπάθεια για την υλοποίηση συστημάτων και εργαλείων που θα μπορέσουν να παρέχουν ένα αξιόπιστο περιβάλλον προγραμματισμού κοινής μνήμης, επιτυγχάνοντας ταυτόχρονα συγκρίσιμη απόδοση με τα αντίστοιχα μοντέλα ανταλλαγής μηνυμάτων. Ωστόσο, ένα από τα βασικά χαρακτηριστικά των σύγχρονων περιβαλλόντων υπολογισμού, που δυσχεραίνει την μεταφορά της υπάρχουσας τεχνολογίας συστημάτων κατανεμημένης κοινής μνήμης από τις συστάδες υπολογιστών στα πλέγματα, είναι η εκτεταμένη ετερογένεια που παρατηρείται στα συστήματα που συμμετέχουν σε ένα υπολογιστικό πλέγμα.
Συμμετέχοντας στην προσπάθεια πρότασης ενός εύχρηστου και αποδοτικού περιβάλλοντος προγραμματισμού, καταρχάς σε συστάδες υπολογιστών και με την προοπτική επέκτασης σε υπολογιστικά πλέγματα, στο πλαίσιο της συγκεκριμένης μεταπτυχιακής εργασίας υλοποιείται το σύστημα Pleiad. To Pleiad αποτελεί ολοκληρωμένο πρωτότυπο της αφαίρεσης κατανεμημένης κοινής μνήμης σε επίπεδο λογισμικού (Software Distributed Shared Memory - SDSM). Κύριος στόχος, δεδομένης της ετερογένειας των σύγχρονων παράλληλων συστημάτων, είναι τόσο η μεταφερσιμότητα όσο και η διαλειτουργικότητα του συστήματος και γι' αυτό το λόγο επιλέγεται για την υλοποίηση του η πλατφόρμα Java. Το σύστημα Pleiad είναι σε θέση να αξιοποιήσει τη σύγχρονη τάση στα πολυεπεξεργαστικά συστήματα, όπως αυτή καθορίζεται από την ευρεία διάθεση επεξεργαστών πολλαπλών πυρήνων, επιτρέποντας την εκτέλεση πολυνηματικών εφαρμογών στο εύρος του κατανεμημένου συστήματος. Επιπλέον η υλοποίηση λαμβάνει χώρα σε επίπεδο χρήστη (user-level), προσδίδοντας στο σύστημα μεγαλύτερη ευελιξία στο περιβάλλον των ιδεατών οργανισμών (virtual organizations - VOs) που διαθέτουν συστάδες υπολογιστών στο πλαίσιο πλεγμάτων. Τα αποτελέσματα από την πειραματική σύγκριση του συστήματος Pleiad με συναφή συστήματα είναι ενθαρρυντικά. Σε κάθε περίπτωση το πρωτότυπο του συστήματος Pleiad όπως παρουσιάζεται στη μεταπτυχιακή εργασία, αποτελεί έργο υποδομής, με αρκετά ενδιαφέροντα ζητήματα ανοικτά στην προοπτική μελλοντικής ερευνητικής δραστηριότητας. / The development and the deployment of clusters and computational grids, continuously increasing in our times, clearly form a distributed environment that is able to conduct computation at the scale of the Internet. Under these circumstances, parallel processing is urged to utilize the proximity of the afforded computational resources as it is accomplished by the advancements on high speed networks. At the same time the parallel applications programmers are quite often up against a dilemma having to choose between shared memory or distributed memory programming models. While distributed memory programming models are the most typical choice in the field of clusters and grids, the programmer encounters well known obstacles during his effort to extract the parallelism of the application. Willing to release the programmer from the need to explicitly express parallelism through message passing orchestration, much research has been done to implement middleware that provides the abstraction of shared memory programming while at the same time achieves acceptable performance compared to other message passing models. Nevertheless, one of the most fundamental characteristics of the modern, distributed computing environments that encumbers porting the existing DSM technology from clusters to grids, is the broad heterogeneity of the afforded computing resources.
Participating in the effort of providing a simple, robust and yet efficient programming environment, firstly designated for clusters with the intention support seamless parallel programming on top of grids, at the present thesis we present Pleiad. Pleiad consists our research prototype providing the abstraction of shared memory programming, implemented at the software level (Software Distributed Shared Memory - SDSM). Considering by default the heterogeneity of the contemporary parallel systems, we have defined as a target of the presented thesis to provide a simple portable and interoperable DSM system. That direction led us to choose Java as our development platform. Pleiad is also able to utilize the trend in modern multiprocessors as it is defined by the advent of multicore CPUs by enabling the execution of multithreaded applications on top of the distributed hardware architecture. Moreover, the implementation of Pleiad takes place at the user level, which is the most appropriate decision concerning the highly diverse environment of the virtual organizations that are formed as parts of a grid. The first results of the experimental evaluation of Pleiad compared to similar systems are emboldening. In any case the first prototype of Pleiad as it is presented in the current thesis provides the essential infrastructure that will be used to further address open issues concerning our research interests on the topic of distributed shared memory abstraction.
|
53 |
Interprocess Communication Mechanisms With Inter-Virtual Machine Shared MemoryKe, Xiaodi Unknown Date
No description available.
|
54 |
Portierbare numerische Simulation auf parallelen ArchitekturenRehm, W. 30 October 1998 (has links) (PDF)
The workshop ¨Portierbare numerische Simulationen auf parallelen Architekturen¨
(¨Portable numerical simulations on parallel architectures¨) was organized by the Fac-
ulty of Informatics/Professorship Computer Architecture at 18 April 1996 and held in
the framework of the Sonderforschungsbereich (Joint Research Initiative) ¨Numerische
Simulationen auf massiv parallelen Rechnern¨ (SFB 393) (¨Numerical simulations on
massiv parallel computers¨) ( http://www.tu-chemnitz.de/~pester/sfb/sfb393.html )
The SFB 393 is funded by the German National Science Foundation (DFG).
The purpose of the workshop was to bring together scientists using parallel computing
to provide integrated discussions on portability issues, requirements and future devel-
opments in implementing parallel software efficiently as well as portable on Clusters of
Symmetric Multiprocessorsystems.
I hope that the present paper gives the reader some helpful hints for further discussions
in this field.
|
55 |
Scientific Computing on Multicore ArchitecturesTillenius, Martin January 2014 (has links)
Computer simulations are an indispensable tool for scientists to gain new insights about nature. Simulations of natural phenomena are usually large, and limited by the available computer resources. By using the computer resources more efficiently, larger and more detailed simulations can be performed, and more information can be extracted to help advance human knowledge. The topic of this thesis is how to make best use of modern computers for scientific computations. The challenge here is the high level of parallelism that is required to fully utilize the multicore processors in these systems. Starting from the basics, the primitives for synchronizing between threads are investigated. Hardware transactional memory is a new construct for this, which is evaluated for a new use of importance for scientific software: atomic updates of floating point values. The evaluation includes experiments on real hardware and comparisons against standard methods. Higher level programming models for shared memory parallelism are then considered. The state of the art for efficient use of multicore systems is dynamically scheduled task-based systems, where tasks can depend on data. In such systems, the software is divided up into many small tasks that are scheduled asynchronously according to their data dependencies. This enables a high level of parallelism, and avoids global barriers. A new system for managing task dependencies is developed in this thesis, based on data versioning. The system is implemented as a reusable software library, and shown to be as efficient or more efficient than other shared-memory task-based systems in experimental comparisons. The developed runtime system is then extended to distributed memory machines, and used for implementing a parallel version of a software for global climate simulations. By running the optimized and parallelized version on eight servers, an equally sized problem can be solved over 100 times faster than in the original sequential version. The parallel version also allowed significantly larger problems to be solved, previously unreachable due to memory constraints. / UPMARC / eSSENCE
|
56 |
Parallel processing in power systems computation on a distributed memory message passing multicomputer /Hong, Chao, January 2000 (has links)
Thesis (Ph. D.)--University of Hong Kong, 2000. / Includes bibliographical references (leaves 160-169).
|
57 |
Techniques for collective physical memory ubiquity within networked clusters of virtual machinesHines, Michael R. January 2009 (has links)
Thesis (Ph. D.)--State University of New York at Binghamton, Thomas J. Watson School of Engineering and Applied Science, Department of Computer Science, 2009. / Includes bibliographical references.
|
58 |
Automatic task and data mapping in shared memory architectures / Mapeamento automático de processos e dados em arquiteturas de memória compartilhadaDiener, Matthias January 2015 (has links)
Arquiteturas paralelas modernas têm hierarquias de memória complexas, que consistem de vários níveis de memórias cache privadas e compartilhadas, bem como Non-Uniform Memory Access (NUMA) devido a múltiplos controladores de memória por sistema. Um dos grandes desafios dessas arquiteturas é melhorar a localidade e o balanceamento de acessos à memória de tal forma que a latência média de acesso à memória é reduzida. Dessa forma, o desempenho e a eficiência energética de aplicações paralelas podem ser melhorados. Os acessos podem ser melhorados de duas maneiras: (1) processos que acessam dados compartilhados (comunicação entre processos) podem ser alocados em unidades de execução próximas na hierarquia de memória, a fim de melhorar o uso das caches. Esta técnica é chamada de mapeamento de processos. (2) Mapear as páginas de memória que cada processo acessa ao nó NUMA que ele está sendo executado, assim, pode-se reduzir o número de acessos a memórias remotas em arquiteturas NUMA. Essa técnica é conhecida como mapeamento de dados. Para melhores resultados, os mapeamentos de processos e dados precisam ser realizados de forma integrada. Trabalhos anteriores nesta área executam os mapeamentos separadamente, o que limita os ganhos que podem ser alcançados. Além disso, a maioria dos mecanismos anteriores exigem operações caras, como traços de acessos à memória, para realizar o mapeamento, além de exigirem mudanças no hardware ou na aplicação paralela. Estes mecanismos não podem ser considerados soluções genéricas para o problema de mapeamento. Nesta tese, fazemos duas contribuições principais para o problema de mapeamento. Em primeiro lugar, nós introduzimos um conjunto de métricas e uma metodologia para analisar aplicações paralelas, a fim de determinar a sua adequação para um melhor mapeamento e avaliar os possíveis ganhos que podem ser alcançados através desse mapeamento otimizado. Em segundo lugar, propomos um mecanismo que executa o mapeamento de processos e dados online. Este mecanismo funciona no nível do sistema operacional e não requer alterações no hardware, os códigos fonte ou bibliotecas. Uma extensa avaliação com múltiplos conjuntos de carga de trabalho paralelos mostram consideráveis melhorias em desempenho e eficiência energética. / Reducing the cost of memory accesses, both in terms of performance and energy consumption, is a major challenge in shared-memory architectures. Modern systems have deep and complex memory hierarchies with multiple cache levels and memory controllers, leading to a Non-Uniform Memory Access (NUMA) behavior. In such systems, there are two ways to improve the memory affinity: First, by mapping tasks that share data (communicate) to cores with a shared cache, cache usage and communication performance are improved. Second, by mapping memory pages to memory controllers that perform the most accesses to them and are not overloaded, the average cost of accesses is reduced. We call these two techniques task mapping and data mapping, respectively. For optimal results, task and data mapping need to be performed in an integrated way. Previous work in this area performs the mapping only separately, which limits the gains that can be achieved. Furthermore, most previous mechanisms require expensive operations, such as communication or memory access traces, to perform the mapping, require changes to the hardware or to the parallel application, or use a simple static mapping. These mechanisms can not be considered generic solutions for the mapping problem. In this thesis, we make two contributions to the mapping problem. First, we introduce a set of metrics and a methodology to analyze parallel applications in order to determine their suitability for an improved mapping and to evaluate the possible gains that can be achieved using an optimized mapping. Second, we propose two automatic mechanisms that perform task mapping and combined task/data mapping, respectively, during the execution of a parallel application. These mechanisms work on the operating system level and require no changes to the hardware, the applications themselves or their runtime libraries. An extensive evaluation with parallel applications from multiple benchmark suites as well as real scientific applications shows substantial performance and energy efficiency improvements that are significantly higher than simple mechanisms and previous work, while maintaining a low overhead.
|
59 |
The GraphGrind framework : fast graph analytics on large shared-memory systemsSun, Jiawen January 2018 (has links)
As shared memory systems support terabyte-sized main memory, they provide an opportunity to perform efficient graph analytics on a single machine. Graph analytics is characterised by frequent synchronisation, which is addressed in part by shared memory systems. However, performance is limited by load imbalance and poor memory locality, which originate in the irregular structure of small-world graphs. This dissertation demonstrates how graph partitioning can be used to optimise (i) load balance, (ii) Non-Uniform Memory Access (NUMA) locality and (iii) temporal locality of graph partitioning in shared memory systems. The developed techniques are implemented in GraphGrind, a new shared memory graph analytics framework. At first, this dissertation shows that heuristic edge-balanced partitioning results in an imbalance in the number of vertices per partition. Thus, load imbalance exists between partitions, either for loops iterating over vertices, or for loops iterating over edges. To address this issue, this dissertation introduces a classification of algorithms to distinguish whether they algorithmically benefit from edge-balanced or vertex-balanced partitioning. This classification supports the adaptation of partitions to the characteristics of graph algorithms. Evaluation in GraphGrind, shows that this outperforms state-of-the-art graph analytics frameworks for shared memory including Ligra by 1.46x on average, and Polymer by 1.16x on average, using a variety of graph algorithms and datasets. Secondly, this dissertation demonstrates that increasing the number of graph partitions is effective to improve temporal locality due to smaller working sets. However, the increasing number of partitions results in vertex replication in some graph data structures. This dissertation resorts to using a graph layout that is immune to vertex replication and an automatic graph traversal algorithm that extends the previously established graph traversal heuristics to a 3-way graph layout choice is designed. This new algorithm furthermore depends upon the classification of graph algorithms introduced in the first part of the work. These techniques achieve an average speedup of 1.79x over Ligra and 1.42x over Polymer. Finally, this dissertation presents a graph ordering algorithm to challenge the widely accepted heuristic to balance the number of edges per partition and minimise edge or vertex cut. This algorithm balances the number of edges per partition as well as the number of unique destinations of those edges. It balances edges and vertices for graphs with a power-law degree distribution. Moreover, this dissertation shows that the performance of graph ordering depends upon the characteristics of graph analytics frameworks, such as NUMA-awareness. This graph ordering algorithm achieves an average speedup of 1.87x over Ligra and 1.51x over Polymer.
|
60 |
Enabling Multi-threaded Applications on Hybrid Shared Memory Manycore ArchitecturesJanuary 2014 (has links)
abstract: As the number of cores per chip increases, maintaining cache coherence becomes prohibitive for both power and performance. Non Coherent Cache (NCC) architectures do away with hardware-based cache coherence, but they become difficult to program. Some existing architectures provide a middle ground by providing some shared memory in the hardware. Specifically, the 48-core Intel Single-chip Cloud Computer (SCC) provides some off-chip (DRAM) shared memory some on-chip (SRAM) shared memory. We call such architectures Hybrid Shared Memory, or HSM, manycore architectures. However, how to efficiently execute multi-threaded programs on HSM architectures is an open problem. To be able to execute a multi-threaded program correctly on HSM architectures, the compiler must: i) identify all the shared data and map it to the shared memory, and ii) map the frequently accessed shared data to the on-chip shared memory. This work presents a source-to-source translator written using CETUS that identifies a conservative superset of all the shared data in a multi-threaded application and maps it to the shared memory such that it enables execution on HSM architectures. / Dissertation/Thesis / Masters Thesis Computer Science 2014
|
Page generated in 0.04 seconds