1 |
Virtual memory on data diffusion architectures. Buenabad-Chavez, Jorge, January 1998 (has links)
No description available.
|
2 |
Parallel load-balancing on message passing architectures. Muniz, Francisco Junqueira, January 1994 (has links)
No description available.
|
3 |
A statistical approach to parallel sorting and selection algorithms design. Loo, Alfred, January 2000 (has links)
No description available.
|
4 |
Hierarchical Matrix Techniques on Massively Parallel Computers. Izadi, Mohammad, 11 December 2012 (has links) (PDF)
Hierarchical matrix (H-matrix) techniques can be used to treat dense matrices efficiently. With an H-matrix, the storage requirements and all fundamental operations, namely matrix-vector multiplication, matrix-matrix multiplication and matrix inversion, can be handled in almost linear complexity. In this work, we tried to gain even further speedup for the H-matrix arithmetic by utilizing multiple processors. Our approach to an H-matrix distribution relies on splitting the index set.
The main results of this work, based on the index-wise H-distribution, are: a highly scalable algorithm for H-matrix truncation and matrix-vector multiplication, a scalable algorithm for H-matrix-matrix multiplication, and an algorithm for H-matrix inversion whose scalability is limited for large numbers of processors.
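To make the structure concrete, here is a rough, generic sketch (not taken from the thesis, and ignoring the parallel distribution) of the H-matrix idea in C: a block is stored either densely, as a low-rank factorization U·V^T, or subdivided into four child blocks, and matrix-vector multiplication recurses over that structure. All type and field names are invented for the example.

#include <stddef.h>

/* One H-matrix block: exactly one of the three representations is set. */
typedef struct HBlock HBlock;
struct HBlock {
    size_t rows, cols;            /* size of this block                        */
    double *dense;                /* rows*cols entries (row-major), or NULL    */
    double *U, *V;                /* rows*k and cols*k low-rank factors, or NULL */
    size_t k;                     /* rank of the low-rank representation       */
    HBlock *child[4];             /* 2x2 subdivision (TL, TR, BL, BR), or NULL */
    size_t row_split, col_split;  /* where the subdivision cuts this block     */
};

/* y += A*x for one block, recursing over the hierarchical structure. */
static void hmatvec(const HBlock *A, const double *x, double *y)
{
    if (A->dense) {                               /* small dense leaf          */
        for (size_t i = 0; i < A->rows; i++)
            for (size_t j = 0; j < A->cols; j++)
                y[i] += A->dense[i * A->cols + j] * x[j];
    } else if (A->U) {                            /* low-rank leaf: U*(V^T*x)  */
        for (size_t r = 0; r < A->k; r++) {
            double t = 0.0;
            for (size_t j = 0; j < A->cols; j++)
                t += A->V[j * A->k + r] * x[j];
            for (size_t i = 0; i < A->rows; i++)
                y[i] += A->U[i * A->k + r] * t;
        }
    } else {                                      /* recurse on the 2x2 children */
        hmatvec(A->child[0], x, y);
        hmatvec(A->child[1], x + A->col_split, y);
        hmatvec(A->child[2], x, y + A->row_split);
        hmatvec(A->child[3], x + A->col_split, y + A->row_split);
    }
}

The almost linear complexity quoted above rests on most blocks being low-rank with a small rank k, so the dense entries of those blocks are never formed explicitly; the parallel algorithms in the thesis distribute this structure over the processors by splitting the index set.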
|
5 |
Hyperplane Partitioning: An Approach To Global Data Partitioning For Distributed Memory Machines. Prakash, S R, 07 1900 (has links)
Automatic global data partitioning for Distributed Memory Machines (DMMs) is a difficult problem. Distributed memory machines are scalable, but since the memory is distributed across processors, the placement of data (arrays) onto the local memories of the different processors becomes crucial: any communication between processors for non-local data access is an order of magnitude costlier than access to local memory. Researchers have given varied solutions to this problem, most of which work only for uniform dependences in loops and suggest HPF-like distributions; for non-uniform dependences the loop was made to run sequentially.

In this work, we present a partitioning strategy called Hyperplane Partitioning which also works well for loops with non-uniform dependences. In this method, the iteration space is partitioned into as many partitions as there are logical processors, in such a way that the overall inter-processor communication is minimized. The idea is to localize as many dependences as possible, so that the overall communication, both due to non-local data accesses and inter-processor synchronizations, is reduced. These partitions are then induced onto the data spaces of the arrays referenced in the loop. Each processor then runs its part of the iteration space, keeping the data partition that it owns locally; any non-local data access is implemented by inter-processor communication at run-time.

Hyperplane Partitioning is also extended to a sequence of loops. This is done by first finding the Best Local Distribution (BLD) for every loop, and then finding the way of grouping adjacent loops (only for the purpose of finding the data partition) that gives the best global data partition. This sequence of distributions and redistributions is found by constructing a data structure called the Data Distribution Tree (DDT) and finding the least-cost path from the source to any of the leaf nodes in the DDT (a simplified sketch of this search is given below). The edge costs come from the communication cost incurred while running a loop with a particular distribution and then redistributing to suit the requirements of the next loop; for this, a communication cost estimator was developed which works well for small numbers of dimensions. To handle complete programs, we use a heuristic to find the best global distribution for the entire program.

Some optimizations are also studied: message optimization, to reduce the number of messages sent across processors; time optimization, achieved by uniform scheduling across processors; and space optimization, so that each processor keeps in its local memory only the part of the array space that it owns. Hyperplane Partitioning is implemented together with a synchronization algorithm that handles non-local memory accesses while obeying data dependence constraints, and the algorithm is proved correct. The target machine is an IBM SP2, with PVM as the message-passing library. The performance of the tool is evaluated on some standard benchmarks (ADI and RHS) and also on some programs designed by us to show its specific merits. The results show that loops with non-uniform dependences can also be run on DMMs with good speed-ups.
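The least-cost search over the DDT can be pictured as a small dynamic program. The sketch below is an illustration only, not the thesis's code: it assumes a fixed set of candidate distributions per loop, ignores the grouping of adjacent loops, and invents the names (best_global_cost, exec, redist, choice) and the cost tables, which in the tool would come from the communication cost estimator.

#include <float.h>

#define NLOOPS 3   /* loops in the sequence             */
#define NDIST  2   /* candidate distributions per loop  */

/* exec[l][d]: cost of running loop l with distribution d.
 * redist[a][b]: cost of redistributing the data from distribution a to b
 * between two consecutive loops.  Returns the minimal total cost and
 * writes the chosen distribution for each loop into choice[]. */
double best_global_cost(const double exec[NLOOPS][NDIST],
                        const double redist[NDIST][NDIST],
                        int choice[NLOOPS])
{
    double cost[NLOOPS][NDIST];
    int    prev[NLOOPS][NDIST];

    for (int d = 0; d < NDIST; d++)              /* first loop: run cost only */
        cost[0][d] = exec[0][d];

    for (int l = 1; l < NLOOPS; l++)             /* edges of the DDT          */
        for (int d = 0; d < NDIST; d++) {
            cost[l][d] = DBL_MAX;
            for (int a = 0; a < NDIST; a++) {
                double c = cost[l - 1][a] + redist[a][d] + exec[l][d];
                if (c < cost[l][d]) { cost[l][d] = c; prev[l][d] = a; }
            }
        }

    int best = 0;                                /* cheapest leaf of the DDT  */
    for (int d = 1; d < NDIST; d++)
        if (cost[NLOOPS - 1][d] < cost[NLOOPS - 1][best]) best = d;

    choice[NLOOPS - 1] = best;                   /* walk the path back up     */
    for (int l = NLOOPS - 1; l > 0; l--)
        choice[l - 1] = prev[l][choice[l]];

    return cost[NLOOPS - 1][best];
}

Each inner step charges an edge with the cost of running the next loop under a candidate distribution plus the cost of redistributing the data to it, which is how the DDT edge costs are described above; the cheapest leaf then fixes the whole sequence of distributions and redistributions.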
|
6 |
Exploration d'architecture d'accélérateurs à mémoire distribuée / Design space exploration of distributed-memory accelerators. Busseuil, Rémi, 04 December 2012 (has links)
Although current accelerator development focuses mainly on heterogeneous Multiprocessor Systems-on-Chip (MPSoC), i.e. chips composed of specialized processors, many players in the microelectronics industry are interested in another type of MPSoC, built as a grid of identical processors. Although made of processors that are individually less energy-efficient, these homogeneous MPSoCs offer greater programmability and flexibility than heterogeneous MPSoCs, which eases adapting the system to the requested load and offers a configuration space that is potentially wider and simpler to control. This thesis is set in this context: it presents the creation of a scalable homogeneous MPSoC architecture, i.e. one whose performance scales linearly, together with the development of several adaptation mechanisms and programming models on top of it.

The architecture is based on a grid of MicroBlaze-like processors, each with its own memory, connected through a 2D Network-on-Chip, and was developed jointly with a specialized, modular real-time operating system (RTOS). Thanks to a complex communication stack, several adaptation mechanisms were implemented: a task migration with "data redirection", which reduces the impact of migration for data-flow applications, and a "remote execution" mechanism. Instead of migrating the instruction code from one memory to another, the latter keeps the code in its initial memory and only moves the execution to a different processor. Experiments show that this mechanism reacts faster than task migration, but with lower adaptation performance.

This work naturally led to the creation of a shared-memory programming model on the architecture. To achieve this, a scalable hardware/software memory consistency and cache coherency mechanism was built, through the development of a PThread library. The resulting performance highlights the advantages of a NoC-based homogeneous MPSoC while using a standard multiprocessor programming model.
|
7 |
Fast Sorting on a Distributed-Memory Architecture. Cheng, David R., Shah, Viral, Gilbert, John R., Edelman, Alan, 01 1900 (has links)
We consider the often-studied problem of sorting on a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms. / Singapore-MIT Alliance (SMA)
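A hedged sketch of that structure in C with MPI follows; it is not the paper's implementation. Each rank sorts its piece locally, the exact splitters are read off at the prescribed output boundaries of the globally sorted order, keys are exchanged with MPI_Alltoallv, and a final local sort finishes the job. To keep it short, the splitters are obtained here by gathering every key on every rank rather than by the paper's parallel selection, keys are assumed distinct, and the function and variable names are invented.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts the distributed array whose local piece is `local` (n_local keys);
 * returns a freshly allocated, sorted local block whose length (*out_len)
 * is exactly this rank's share of the global array. */
int *exact_split_sort(const int *local, int n_local, int *out_len, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int *mine = malloc(n_local * sizeof(int));
    memcpy(mine, local, n_local * sizeof(int));
    qsort(mine, n_local, sizeof(int), cmp_int);             /* 1. local sort */

    /* 2. Exact splitters: the keys sitting at the output boundaries of the
     * globally sorted order.  The paper finds them by parallel selection;
     * here every key is gathered on every rank instead (simple, unscalable). */
    int *counts = malloc(p * sizeof(int)), *displs = malloc(p * sizeof(int));
    MPI_Allgather(&n_local, 1, MPI_INT, counts, 1, MPI_INT, comm);
    displs[0] = 0;
    for (int i = 1; i < p; i++) displs[i] = displs[i - 1] + counts[i - 1];
    int n = displs[p - 1] + counts[p - 1];
    int *all = malloc(n * sizeof(int));
    MPI_Allgatherv(mine, n_local, MPI_INT, all, counts, displs, MPI_INT, comm);
    qsort(all, n, sizeof(int), cmp_int);
    int *split = malloc((p - 1) * sizeof(int));
    for (int r = 1; r < p; r++)
        split[r - 1] = all[(long long)n * r / p];

    /* 3. Bucket local keys by destination rank; with distinct keys this
     * reproduces the exact output boundaries. */
    int *sendcnt = calloc(p, sizeof(int)), *sdispl = malloc(p * sizeof(int));
    for (int i = 0; i < n_local; i++) {
        int d = 0;
        while (d < p - 1 && split[d] <= mine[i]) d++;
        sendcnt[d]++;
    }
    sdispl[0] = 0;   /* mine is sorted, so buckets are contiguous and in order */
    for (int i = 1; i < p; i++) sdispl[i] = sdispl[i - 1] + sendcnt[i - 1];

    /* 4. Exchange counts, then the keys themselves. */
    int *recvcnt = malloc(p * sizeof(int)), *rdispl = malloc(p * sizeof(int));
    MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, comm);
    rdispl[0] = 0;
    for (int i = 1; i < p; i++) rdispl[i] = rdispl[i - 1] + recvcnt[i - 1];
    *out_len = rdispl[p - 1] + recvcnt[p - 1];
    int *out = malloc(*out_len * sizeof(int));
    MPI_Alltoallv(mine, sendcnt, sdispl, MPI_INT,
                  out, recvcnt, rdispl, MPI_INT, comm);

    /* 5. Final local sort (the paper merges the p received sorted runs). */
    qsort(out, *out_len, sizeof(int), cmp_int);

    free(mine); free(all); free(split); free(counts); free(displs);
    free(sendcnt); free(sdispl); free(recvcnt); free(rdispl);
    return out;
}

The exactness comes from taking the splitters at the output boundary positions themselves instead of estimating them from a sample, which is the difference from the two sample sort algorithms used for comparison in the study.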
|
8 |
A Parallelizing Compiler for Fortran. Janaki, S, 08 1900 (has links)
With the advent of Distributed Memory Machines (DMMs), much work has been undertaken to ease the work of a programmer on these systems. Data-parallel languages like Fortran D, Vienna Fortran, High Performance Fortran and C+ allow the user to specify the data distribution across processors with directives, and the compilers for these languages use the directives to compile the program into SPMD code. There are a number of old programs which are still in use, and rewriting them in new data-parallel languages is a costly effort.

Most of the work on these parallelizing compilers concentrates on efficient data communication between the processors. With advances in technology, data communication time is decreasing, which allows bigger programs to execute in the same time span. The resources of a DMM being finite puts a limit on the size of the problem that can be run, so improving the memory usage for a problem allows bigger problems to be run. Further, as communication speed increases, the overhead caused by housekeeping computations like global-to-local index transformation and owner-processor computation will degrade the performance of the resulting code. Hence a uniform and efficient method for these computations also becomes a necessity (see the sketch below).

We have implemented the parallelizing parts of a compiler using the SUIF compiler system, which accepts programs written in Fortran 77 with directives to the compiler as comments. The output of the compiler is an SPMD C program with embedded PVM calls for message communication between the processors. We have also proposed algorithms to improve data communication and minimize memory usage in the output code. A uniform method for performing owner-processor computations and global-to-local index transformations has also been implemented.
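For the standard HPF-style distributions, the housekeeping computations mentioned above have simple closed forms. The sketch below is an illustration, not the thesis's implementation: it shows owner-processor and global-to-local index computations for BLOCK and CYCLIC distributions of an N-element array over P processors, and the owner-computes guard a generated SPMD node would execute (all names are invented).

/* BLOCK: processor q owns the contiguous chunk [q*b, (q+1)*b), b = ceil(N/P). */
static int block_owner(int g, int N, int P) { int b = (N + P - 1) / P; return g / b; }
static int block_local(int g, int N, int P) { int b = (N + P - 1) / P; return g % b; }

/* CYCLIC: global index g lives on processor g mod P, at local slot g / P. */
static int cyclic_owner(int g, int P) { return g % P; }
static int cyclic_local(int g, int P) { return g / P; }

/* Owner-computes rule inside the generated SPMD loop: every processor scans
 * the full global iteration space but executes only the iterations whose
 * left-hand-side element it owns, using the local index for the access. */
void spmd_loop(double *a_local, int N, int P, int my_rank)
{
    for (int i = 0; i < N; i++) {
        if (block_owner(i, N, P) == my_rank)            /* owner-computes guard */
            a_local[block_local(i, N, P)] = 2.0 * i;    /* local access only    */
    }
}

Keeping these expressions uniform across references is what makes it cheap for the generated SPMD code to decide, for every iteration, whether the current processor owns the left-hand side and where that element sits in its local memory; a common further optimization is to hoist the guard by iterating only over the locally owned index range.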
|
9 |
Distributed Memory Based FPGA Debug. Hale, Robert Benjamin, 13 April 2020
Field-programmable gate arrays (FPGAs) are powerful integrated circuits for low-overhead custom computing needs and design prototyping. Due to the hardware nature of programming an FPGA, finding bugs in a design can be a very challenging process. Signals need to be physically probed and data recorded in real time. This is often done by dedicating some resources on the FPGA itself to an embedded logic analyzer. This method is effective but can be time- and resource-consuming. Academic research projects have produced a variety of methods for reducing this difficulty. One option that has previously been unexplored is the use of distributed LUT memory for debug trace buffers, rather than dedicated FPGA BRAM. This dissertation presents a novel, lean embedded logic analyzer that leverages leftover LUT resources on the FPGA for this purpose. Distributed Memory Debug (abbreviated as "DIME Debug") provides some amount of signal visibility into very large (90%+ LUT utilized) FPGA designs or designs where the programmer requires all available device BRAM, situations in which currently available embedded logic analyzers are likely to fail. The ubiquitous nature of LUTs on FPGAs provides opportunities to insert debug circuitry near signals of interest without disturbing placement of the user design. Using only leftover LUTs for trace buffers allows for effectively no area overhead. The DIME Debug system typically has a critical path delay in the 7-9 ns range, which can force non-ideal, slower timing constraints on the user design. A simulated-annealing-based placement algorithm and other optimizations are shown to improve timing closure results by 20-50% depending on benchmark and probe count. DIME Debug can be instrumented into a fully implemented design incrementally using the RapidWright CAD tool, resulting in debug iterations under 15 minutes even for very large benchmarks. Another interesting possibility introduced by the use of memory LUTs for debug trace buffers is preallocating these resources. Setting aside a certain number of LUTs before implementation of the user design leaves them available for incremental debug instrumentation. Experiments with a preallocation scheme show that, with virtually no penalty to the user design, debug critical paths are lowered by approximately 1 ns and 2-3x the number of trace buffers can be instrumented into most benchmarks.
|
10 |
Efficient Run-time Support For Global View Programming of Linked Data Structures on Distributed Memory Parallel Systems. Larkins, Darrell Brian, 30 July 2010 (has links)
No description available.
|