Spelling suggestions: "subject:"distributedmemory"" "subject:"distributeddaemon""
11 |
Towards Scalable Parallel Simulation of the Structural Mechanics of Piezoelectric-Controlled BeamsRotter, Jeremy Michael 13 July 1999 (has links)
In this thesis we present a parallel implementation of an engineering code which simulates the deformations caused when forces are applied to a piezoelectric-controlled smart structure. The parallel simulation, whose domain decomposition relies on the finite element representation of the structure, is created with an emphasis on scalability of both memory requirements and run time. We take into consideration sequential performance enhancements, the structure of a banded symmetric positive definite linear system, and the overhead required to completely distribute the problem across the processors. The resulting code is scalable, with the exception of a banded Cholesky factorization, which does not fully utilize the parallel environment. / Master of Science
|
12 |
Automatic Data Partitioning By Hierarchical Genetic SearchShenoy, U Nagaraj 09 1900 (has links)
CDAC / The introduction of languages like High Performance Fortran (HPF) which allow the programmer to indicate how the arrays used in the
program have to be distributed across the local memories of a multi-computer has not completely unburdened the parallel programmer from the intricacies of these architectures. In order to tap the full potential of these architectures, the compiler has to perform this crucial task of data partitioning automatically. This would not only
unburden the programmer but would make the programs more efficient since the compiler can be made more intelligent to take care of the
architectural nuances.
The topic of this thesis namely the automatic data partitioning deals with finding the best data partition for the various arrays used in
the entire program in such a way that the cost of execution of the entire program is minimized. The compiler could resort to runtime redistribution of the arrays at various points in the program if found profitable. Several aspects of this problem have been proven to be NP-complete. Other researchers have suggested heuristic solutions to solve this problem. In this thesis we propose a genetic algorithm namely the Hierarchical Genetic Search algorithm to solve this problem.
|
13 |
Hierarchical Matrix Techniques on Massively Parallel ComputersIzadi, Mohammad 12 April 2012 (has links)
Hierarchical matrix (H-matrix) techniques can be used to efficiently treat dense matrices. With an H-matrix, the storage
requirements and performing all fundamental operations, namely matrix-vector multiplication, matrix-matrix multiplication and matrix inversion
can be done in almost linear complexity.
In this work, we tried to gain even further
speedup for the H-matrix arithmetic by utilizing multiple processors. Our approach towards an H-matrix distribution
relies on the splitting of the index set.
The main results achieved in this work based on the index-wise H-distribution are: A highly scalable algorithm for the H-matrix truncation and matrix-vector multiplication, a scalable algorithm for the H-matrix matrix multiplication, a limited scalable algorithm for the H-matrix inversion for a large number of processors.
|
14 |
Effcient Simulation of Message-Passing in Distributed-Memory ArchitecturesDemaine, Erik January 1996 (has links)
In this thesis we propose a distributed-memory parallel-computer simulation system called PUPPET (Performance Under a Pseudo-Parallel EnvironmenT). It allows the evaluation of parallel programs run in a pseudo-parallel system, where a single processor is used to multitask the program's processes, as if they were run on the simulated system. This allows development of applications and teaching of parallel programming without the use of valuable supercomputing resources. We use a standard message-passing language, MPI, so that when desired (e. g. , development is complete) the program can be run on a truly parallel system without any changes. There are several features in PUPPET that do not exist in any other simulation system. Support for all deterministic MPI features is available, including collective and non-blocking communication. Multitasking (more processes than processors) can be simulated, allowing the evaluation of load-balancing schemes. PUPPET is very loosely coupled with the program, so that a program can be run once and then evaluated on many simulated systems with multiple process-to-processor mappings. Finally, we propose a new model of direct networks that ignores network traffic, greatly improving simulation speed and often not signficantly affecting accuracy.
|
15 |
Distributed Execution of Recursive Irregular ApplicationsNikhil Hegde (7043171) 13 August 2019 (has links)
Massive computing power and applications running on this power, primarily confined to expensive supercomputers a decade ago, have now become mainstream through the availability of clusters with commodity computers and high-speed interconnects running big-data era applications. The challenges associated with programming such systems, for effectively utilizing the computing power, have led to the creation of intuitive abstractions and implementations targeting average users, domain experts, and savvy (parallel) programmers. There is often a trade-off between the ease of programming and performance when using these abstractions. This thesis develops tools to bridge the gap between ease of programming and performance of irregular programs—programs that involve one or more of irregular- data structures, control structures, and communication patterns—on distributed-memory systems. <div><br></div><div>Irregular programs are focused heavily in domains ranging from data mining to bioinformatics to scientific computing. In contrast to regular applications such as stencil codes and dense matrix-matrix multiplications, which have a predictable pattern of data access and control flow, typical irregular applications operate over graphs, trees, and sparse matrices and involve input-dependent data access pattern and control flow. This makes it difficult to apply optimizations such as those targeting locality and parallelism to programs implementing irregular applications. Moreover, irregular programs are often used with large data sets that prohibit single-node execution due to memory limitations on the node. Hence, distributed solutions are necessary in order to process all the data.<br><br>In this thesis, we introduce SPIRIT, a framework consisting of an abstraction and a space-adaptive runtime system for simplifying the creation of distributed implementations of recursive irregular programs based on spatial acceleration structures. SPIRIT addresses the insufficiency of traditional data-parallel approaches and existing systems in effectively parallelizing computations involving repeated tree traversals. SPIRIT employs locality optimizations applied in a shared-memory context, introduces a novel pipeline-parallel approach to execute distributed traversals, and trades-off performance with memory usage to create a space-adaptive system that achieves a scalable performance, and outperforms implementations done in contemporary distributed graph processing frameworks.</div><div><br>We next introduce Treelogy to understand the connection between optimizations and tree-algorithms. Treelogy provides an ontology and a benchmark suite of a broader class of tree algorithms to help answer: (i) is there any existing optimization that is applicable or effective for a new tree algorithm? (ii) can a new optimization developed for a tree algorithm be applied to existing tree algorithms from other domains? We show that a categorization (ontology) based on structural properties of tree- algorithms is useful for both developers of new optimizations and new tree algorithm creators. With the help of a suite of tree traversal kernels spanning the ontology, we show that GPU, shared-, and distributed-memory implementations are scalable and the two-point correlation algorithm with vptree performs better than the standard kdtree implementation.</div><div><br>In the final part of the thesis, we explore the possibility of automatically generating efficient distributed-memory implementations of irregular programs. As manually creating distributed-memory implementations is challenging due to the explicit need for managing tasks, parallelism, communication, and load-balancing, we introduce a framework, D2P, to automatically generate efficient distributed implementations of recursive divide-conquer algorithms. D2P automatically generates a distributed implementation of a recursive divide-conquer algorithm from its specification, which is a high-level outline of a recursive formulation. We evaluate D2P with recursive Dynamic programming (DP) algorithms. The computation in DP algorithms is not irregular per se. However, when distributed, the computation in efficient recursive formulations of DP algorithms requires irregular communication. User-configurable knobs in D2P allow for tuning the amount of available parallelism. Results show that D2P programs scale well, are significantly better than those produced using a state-of-the-art framework for parallelizing iterative DP algorithms, and outperform even hand-written distributed-memory implementations in most cases.</div>
|
16 |
Análises estatísticas para a paralelização de linguagens de atribuição única para sistemas de memória distribuída / Static analysis for the parallelization of single assigment languages for distributed memory systemsNakashima, Raul Junji 24 September 2001 (has links)
Este trabalho descreve técnicas de análise estática de compilação baseadas na álgebra e programação linear que buscam otimizar a distribuição de loops forall e array em programas escritos na linguagem S/SAL visando à execução em máquinas paralelas de memória distribuídas. Na fase de alinhamento, nós trabalhamos com o alinhamento de hiperplanos onde objetivo é tentar encontrar as porções dos diferentes arrays que necessitam ser distribuídas juntas. Na fase de divisão, que tenta quebrar em partes independente dados e computações, nós usamos duas funções afins, a função de decomposição de dados e a função de decomposição de computação. A última fase, o mapeamento, distribui os elementos de computação nos elementos de processamento usando um conjunto de inequações. As técnicas foram implementadas num compilador SISAL, mas pode ser usada sem mudanças em outras linguagens de associação simples e com a adição de análise de dependências pode ser usada em linguagens imperativas. / This work describes static compiler analysis techniques based on linear algebra and linear programming for optimizing the distribution of forall loops and of array elements in programs written in the SISAL programming language for distributed memory parallel machines. In the alignment phase, attempt is made in the identification of portions of different arrays that need to be distributed jointly by means of alignment of hyperplanes. In the partitioning phase, effort is made in breaking as even possible the computation and pertinent data in independent parts, by means of using related functions: the data decomposition function and the computation decomposition function. The last phase is dedicated to the mapping, which comprises the distribution of the elements of computation into the existing processing elements by means of a set of inequations. These techniques are being implemented in a SISAL compiler, but can be also used without changes by means of other single assignment languages or, with the addition of dependency analysis when using other set of languages, as well.
|
17 |
Effcient Simulation of Message-Passing in Distributed-Memory ArchitecturesDemaine, Erik January 1996 (has links)
In this thesis we propose a distributed-memory parallel-computer simulation system called PUPPET (Performance Under a Pseudo-Parallel EnvironmenT). It allows the evaluation of parallel programs run in a pseudo-parallel system, where a single processor is used to multitask the program's processes, as if they were run on the simulated system. This allows development of applications and teaching of parallel programming without the use of valuable supercomputing resources. We use a standard message-passing language, MPI, so that when desired (e. g. , development is complete) the program can be run on a truly parallel system without any changes. There are several features in PUPPET that do not exist in any other simulation system. Support for all deterministic MPI features is available, including collective and non-blocking communication. Multitasking (more processes than processors) can be simulated, allowing the evaluation of load-balancing schemes. PUPPET is very loosely coupled with the program, so that a program can be run once and then evaluated on many simulated systems with multiple process-to-processor mappings. Finally, we propose a new model of direct networks that ignores network traffic, greatly improving simulation speed and often not signficantly affecting accuracy.
|
18 |
Análises estatísticas para a paralelização de linguagens de atribuição única para sistemas de memória distribuída / Static analysis for the parallelization of single assigment languages for distributed memory systemsRaul Junji Nakashima 24 September 2001 (has links)
Este trabalho descreve técnicas de análise estática de compilação baseadas na álgebra e programação linear que buscam otimizar a distribuição de loops forall e array em programas escritos na linguagem S/SAL visando à execução em máquinas paralelas de memória distribuídas. Na fase de alinhamento, nós trabalhamos com o alinhamento de hiperplanos onde objetivo é tentar encontrar as porções dos diferentes arrays que necessitam ser distribuídas juntas. Na fase de divisão, que tenta quebrar em partes independente dados e computações, nós usamos duas funções afins, a função de decomposição de dados e a função de decomposição de computação. A última fase, o mapeamento, distribui os elementos de computação nos elementos de processamento usando um conjunto de inequações. As técnicas foram implementadas num compilador SISAL, mas pode ser usada sem mudanças em outras linguagens de associação simples e com a adição de análise de dependências pode ser usada em linguagens imperativas. / This work describes static compiler analysis techniques based on linear algebra and linear programming for optimizing the distribution of forall loops and of array elements in programs written in the SISAL programming language for distributed memory parallel machines. In the alignment phase, attempt is made in the identification of portions of different arrays that need to be distributed jointly by means of alignment of hyperplanes. In the partitioning phase, effort is made in breaking as even possible the computation and pertinent data in independent parts, by means of using related functions: the data decomposition function and the computation decomposition function. The last phase is dedicated to the mapping, which comprises the distribution of the elements of computation into the existing processing elements by means of a set of inequations. These techniques are being implemented in a SISAL compiler, but can be also used without changes by means of other single assignment languages or, with the addition of dependency analysis when using other set of languages, as well.
|
19 |
Lokale Realisierung von Vektoroperationen auf ParallelrechnernGroh, U. 30 October 1998 (has links) (PDF)
For the basic algebraic vector operations several variants of a local
implementation on distributed memory parallel computers are presented and discussed
systematically. In particular necessary and sufficient conditions are shown for the local realizability
of the multiplication matrix by vector.
|
20 |
A Distributed Memory Implementation of LOCIGeorge, Thomas 14 December 2001 (has links)
Distributed memory systems have gained immense popularity due to their favorable price/performance ratios. This study seeks to reduce the complexities, involved in developing parallel applications for distributed memory systems. The Loci system is a coordination framework which was developed to eliminate most of the accidental complexities involved in numerical simulation software development. A distributed memory version of Loci is developed and has been tested and validated using a finite-rate chemically reacting flow solver developed in the sequential Loci framework. The application developed in the original sequential version of Loci was parallelized with minimal changes in its source code. A comparison with the results from the original sequential version guarantees a correct implementation. The performance measurements indicate that an efficient implementation has been achieved.
|
Page generated in 0.062 seconds