Global ETD Search

1	Virtual memory on data diffusion architectures Buenabad-Chavez, Jorge January 1998 No description available. 621.39 Distributed memory; Parallel programming
2	Parallel load-balancing on message passing architectures Muniz, Francisco Junqueira January 1994 No description available. 621.39 MIMD distributed-memory; Transputers
3	A statistical approach to parallel sorting and selection algorithms design Loo, Alfred January 2000 No description available. 005 Parallel programming; Distributed memory
4	Hierarchical Matrix Techniques on Massively Parallel Computers Izadi, Mohammad 11 December 2012 Hierarchical matrix (H-matrix) techniques can be used to treat dense matrices efficiently. With an H-matrix, storage and all fundamental operations, namely matrix-vector multiplication, matrix-matrix multiplication and matrix inversion, have almost linear complexity. In this work, we seek further speedup of H-matrix arithmetic by using multiple processors. Our approach to H-matrix distribution relies on splitting the index set. The main results based on this index-wise H-distribution are: a highly scalable algorithm for H-matrix truncation and matrix-vector multiplication, a scalable algorithm for H-matrix matrix multiplication, and an algorithm for H-matrix inversion whose scalability is limited for a large number of processors. Hierarchische Matrizen parallelen Algorithmus Distributed-Memory-Systeme Hierarchical matrices parallel algorithm Distributed-Memory-System ddc:000
5	Hyperplane Partitioning: An Approach To Global Data Partitioning For Distributed Memory Machines Prakash, S R 07 1900 Automatic Global Data Partitioning for Distributed Memory Machines (DMMs) is a difficult problem. Distributed memory machines are scalable, but because memory is distributed across processors, the placement of data (arrays) in the local memories of the processors becomes crucial: any inter-processor communication for non-local data access is an order of magnitude costlier than access to local memory. Researchers have proposed varied solutions to this problem, most of which work only for uniform dependences in loops and suggest only HPF-like distributions. For non-uniform dependences, the loop was made to run sequentially. In this work, we present a partitioning strategy called Hyperplane Partitioning, which also works well for loops with non-uniform dependences. In this method, the iteration space is partitioned into as many partitions as there are logical processors, in such a way that the overall inter-processor communication is minimized. The idea is to localize as many dependences as possible so that the overall communication, due to both non-local data access and inter-processor synchronization, is reduced. These partitions are then induced onto the data spaces of the arrays referenced in the loop. Each processor then runs its part of the iteration space, keeping the data partition that it owns locally. Any non-local data access is implemented by inter-processor communication at run time. Hyperplane Partitioning is also extended to a sequence of loops. This is done by first finding the Best Local Distribution (BLD) for every loop and then finding the best way of grouping adjacent loops (solely for the purpose of finding the data partition), which gives the best global data partition. 
This sequence of distributions and redistributions is found by constructing a data structure called the Data Distribution Tree (DDT) and finding the least-cost path from the source to any of the leaf nodes in the DDT. The edge costs come from the communication cost incurred while running a loop with a particular distribution and from the redistribution needed to suit the next loop. For this, a communication cost estimator is developed, which works well for fewer dimensions. To handle complete programs, we use a heuristic to find the best global distribution for the entire program. Several optimizations are studied: message optimization, which reduces the number of messages sent across processors; time optimization, which is done by uniform scheduling across processors; and space optimization, which keeps in each processor's local memory only the part of the array space that the processor owns. Hyperplane Partitioning is also implemented using a synchronization algorithm that handles non-local memory access while obeying data dependence constraints. The algorithm is also proved correct. The target machine is IBM-SP2, using PVM as the message-passing library. The performance of the tool is evaluated on some standard benchmarks (ADI and RHS) and on some programs we designed to show the specific merits of the tool. The results show that loops with non-uniform dependences can also be run on a DMM with good speed-ups. Computer and Information Science Parallelizing Compiler Automatic Data Partitioning Hyper-plane Partitioning Distributed Memory Machine Electronic Data Processing Multiprogramming Distributed Memory Multiprocessors Distributed Memory Multicomputers
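The least-cost search over distributions and redistributions described in entry 5 can be illustrated with a small dynamic program. This is only a sketch under simplifying assumptions, not the thesis's DDT: it treats the program as a plain sequence of loops, gives every loop the same set of candidate distributions, and uses made-up cost tables. The names `run_cost`, `redist_cost` and `cheapest_distribution_sequence` are hypothetical.

```python
def cheapest_distribution_sequence(run_cost, redist_cost):
    """Pick one distribution per loop so that total cost is minimal.

    run_cost[i][d]    : assumed cost of running loop i under distribution d
    redist_cost[a][b] : assumed cost of redistributing from a to b between loops
    Returns (total_cost, [distribution chosen for each loop]).
    """
    dists = list(run_cost[0])
    # best[d] = cheapest (cost, path) that ends with the current loop using d
    best = {d: (run_cost[0][d], [d]) for d in dists}
    for costs in run_cost[1:]:
        nxt = {}
        for d in dists:
            # cheapest predecessor distribution, including the switch to d
            prev = min(dists, key=lambda a: best[a][0] + redist_cost[a][d])
            total = best[prev][0] + redist_cost[prev][d] + costs[d]
            nxt[d] = (total, best[prev][1] + [d])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical costs for three loops and two candidate distributions.
run_cost = [
    {"row": 1, "col": 5},
    {"row": 6, "col": 1},
    {"row": 1, "col": 4},
]
redist_cost = {
    "row": {"row": 0, "col": 2},
    "col": {"row": 2, "col": 0},
}
total, path = cheapest_distribution_sequence(run_cost, redist_cost)
print(total, path)
```

With these invented numbers, redistributing twice is cheaper than keeping a single distribution throughout, which is the kind of trade-off the edge costs are meant to capture.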
6	Exploration d'architecture d'accélérateurs à mémoire distribuée / Design space exploration of distributed-memory accelerators Busseuil, Rémi 04 December 2012 Although the accelerator market is dominated by heterogeneous MultiProcessor Systems-on-Chip (MPSoC), i.e. chips built from specialized processors, there is growing interest in another type of MPSoC, composed of an array of identical processors. Although these processors are less energy-efficient, homogeneous MPSoCs are more flexible and programmable than heterogeneous ones, which makes it easier to adapt the system to the load and offers a potentially wider configuration space that is simpler to control. In this context, this thesis presents the development of a scalable homogeneous MPSoC, i.e. one with linear performance scaling, together with several adaptation mechanisms and programming models for it. The architecture is an array of MicroBlaze-like processors, each with its own memory, connected through a 2D Network-on-Chip (NoC). It was developed jointly with a specialized, modular real-time operating system (RTOS). Thanks to a complex communication stack, several adaptive mechanisms were implemented: a “redirected data” task migration mechanism, which reduces the impact of migration on data-flow applications, and a “remote execution” mechanism. Instead of migrating the instruction code from one memory to another, the latter keeps the code in its initial memory and has a different processor execute it. 
Experiments show that this mechanism reacts faster than task migration but delivers lower adaptation performance. This mechanism naturally led to the creation of a shared-memory programming model on the architecture. Doing so required a memory consistency and cache coherency mechanism, which was implemented in a scalable hardware/software fashion through the development of a PThread library. The resulting performance shows the advantages of a NoC-based homogeneous MPSoC programmed with a conventional multiprocessor programming model. Cohérence de cache Mémoire Distribuée MPSoC Cache coherency Distributed memory MPSoC
7	Fast Sorting on a Distributed-Memory Architecture Cheng, David R., Shah, Viral, Gilbert, John R., Edelman, Alan 01 1900 We consider the often-studied problem of sorting, for a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms. / Singapore-MIT Alliance (SMA) Parallel sorting distributed-memory algorithms High Performance Computing
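The idea of exact splitting in entry 7 can be illustrated with a single-process simulation. The abstract gives no algorithmic detail beyond "parallel selection", so everything below is an assumption made for illustration: processors are modelled as Python lists, keys are distinct integers, and selection is done by bisecting on the key range, which stands in for the authors' selection algorithm. The function names are hypothetical.

```python
from bisect import bisect_right


def count_le(sorted_parts, v):
    """Number of keys <= v across all locally sorted parts."""
    return sum(bisect_right(part, v) for part in sorted_parts)


def select_kth(sorted_parts, k):
    """k-th smallest key (1-based) across the parts; assumes distinct integers."""
    lo = min(part[0] for part in sorted_parts if part)
    hi = max(part[-1] for part in sorted_parts if part)
    while lo < hi:
        mid = (lo + hi) // 2
        if count_le(sorted_parts, mid) >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo


def exact_split_sort(parts):
    """Sort keys spread over len(parts) simulated processors so that
    processor i ends up with global ranks [i*n//p, (i+1)*n//p)."""
    p = len(parts)
    local = [sorted(part) for part in parts]        # local sort
    n = sum(len(part) for part in local)
    if n < p:
        raise ValueError("this sketch assumes at least one key per processor")
    # exact splitters: the keys at the target rank boundaries
    splitters = [select_kth(local, (i * n) // p) for i in range(1, p)]
    out = [[] for _ in range(p)]
    for part in local:                              # route each local slice
        start = 0
        for i, s in enumerate(splitters):
            end = bisect_right(part, s)
            out[i].extend(part[start:end])
            start = end
        out[p - 1].extend(part[start:])
    return [sorted(chunk) for chunk in out]         # stands in for a p-way merge


parts = [[9, 1, 14], [3, 20, 7, 5], [11, 2], [8, 6, 4]]
result = exact_split_sort(parts)
print(result)
```

Because the splitters sit exactly at the target ranks, the output sizes are fixed in advance rather than only approximately balanced, which is the property the abstract contrasts with sample sort.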
8	A Parallelizing Compiler for Fortran Janaki, S 08 1900 With the advent of Distributed Memory Machines (DMMs), much work has been undertaken to ease the task of programming these systems. Data parallel languages like Fortran D, Vienna Fortran, High Performance Fortran and C+ allow the user to specify data distribution across processors with directives, and the compilers for these languages use the directives to compile the program into SPMD code. A number of old programs are still in use, and rewriting them in the new data parallel languages is costly. Most of the work on these parallelizing compilers concentrates on efficient data communication between the processors. With advances in technology, data communication time is also decreasing. This allows bigger programs to execute in the same time span. The finite resources of a DMM limit the size of the problem that can be run. Improving memory usage will hence allow us to run bigger problems. Further, as communication speed increases, the overhead caused by house-keeping computations, such as global-index-to-local-index transformation and owner processor computation, will degrade the performance of the resultant code. Hence a uniform and efficient method for these computations also becomes a necessity. We have implemented the parallelizing parts of a compiler using the SUIF compiler system; it accepts programs written in Fortran77 with compiler directives written as comments. The output of the compiler is an SPMD C program with embedded PVM calls for message communication between the processors. We have also proposed algorithms to improve data communication and minimize memory usage in the output code. A uniform method for performing owner processor computations and global-to-local transformations has also been implemented. Computer and Information Science Parallelizing Compiler FORTRAN Compiler Distributed Memory Machines
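The house-keeping computations named in entry 8 (owner processor computation and global-to-local index transformation) can be sketched for one common case. The abstract does not say which distributions the compiler supports, so this assumes an HPF-style BLOCK distribution of a one-dimensional array with 0-based indices; the function names are hypothetical and are not taken from the thesis.

```python
def block_size(n: int, p: int) -> int:
    """Elements per processor under a block distribution (ceiling of n / p)."""
    return -(-n // p)


def owner(g: int, n: int, p: int) -> int:
    """Rank of the processor that owns global index g."""
    return g // block_size(n, p)


def global_to_local(g: int, n: int, p: int) -> int:
    """Position of global index g inside its owner's local block."""
    return g % block_size(n, p)


def local_to_global(rank: int, l: int, n: int, p: int) -> int:
    """Inverse mapping: global index of local element l on processor rank."""
    return rank * block_size(n, p) + l


# A 10-element array over 4 processors: blocks of 3, 3, 3 and 1 elements.
n, p = 10, 4
print([(g, owner(g, n, p), global_to_local(g, n, p)) for g in range(n)])
```

Every array reference in generated SPMD code pays for mappings like these, which is why the abstract treats their cost as significant once communication becomes fast.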
9	Distributed Memory Based FPGA Debug Hale, Robert Benjamin 13 April 2020 Field-programmable gate arrays (FPGAs) are powerful integrated circuits for low-overhead custom computing needs and design prototyping. Due to the hardware nature of programming an FPGA, finding bugs in a design can be a very challenging process. Signals need to be physically probed and data recorded in real time. This is often done by dedicating some resources on the FPGA itself towards an embedded logic analyzer. This method is effective but can be time and resource consuming. Academic research projects have produced a variety of methods for reducing this difficulty. One option that has previously been unexplored is the use of distributed LUT memory for debug trace buffers, rather than dedicated FPGA BRAM. This dissertation presents a novel, lean embedded logic analyzer that leverages leftover LUT resources on the FPGA for this purpose. Distributed Memory Debug (abbreviated as "DIME Debug") provides some amount of signal visibility into very large (90%+ LUT utilized) FPGA designs or designs where the programmer requires all available device BRAM, situations in which currently available embedded logic analyzers are likely to fail. The ubiquitous nature of LUTs on FPGAs provides opportunities to insert debug circuitry near signals of interest without disturbing placement of the user design. Using only leftover LUTs for trace buffers allows for effectively no area overhead. The DIME Debug system typically has a critical path delay in the 7-9 ns range, which can force non-ideal slower timing constraints on the user design. A simulated annealing based placement algorithm and other optimizations are shown to improve timing closure results by 20-50% depending on benchmark and probe count. DIME Debug can be instrumented into a fully implemented design incrementally using the RapidWright CAD tool, resulting in debug iterations under 15 minutes even for very large benchmarks. 
Another interesting possibility introduced by the use of memory LUTs for debug trace buffers is preallocating these resources. Setting aside a certain number of LUTs before implementation of the user design leaves them available for incremental debug instrumentation. Experiments with a preallocation scheme show that, with virtually no penalty to the user design, debug critical paths are lowered by approximately 1ns and 2-3X the number of trace buffers can be instrumented into most benchmarks. FPGA debug embedded logic analyzer distributed memory Engineering
10	Efficient Run-time Support For Global View Programming of Linked Data Structures on Distributed Memory Parallel Systems Larkins, Darrell Brian 30 July 2010 No description available. Computer Science pgas trees graphs distributed memory high performance computing

Search results