241.
On resource placements and fault-tolerant broadcasting in toroidal networks
AlMohammad, Bader Fahed AlBedaiwi (13 November 1997)
Parallel computers are classified into two groups: multiprocessors and multicomputers. A multiprocessor system usually has a shared memory through which its processors communicate. The processors of a multicomputer system, on the other hand, communicate by message passing through an interconnection network. Toroidal networks are a widely used class of interconnection networks. Compared to a hypercube, a torus has a larger diameter but offers better tradeoffs, such as higher channel bandwidth and lower node degree. Results on resource placements and fault-tolerant broadcasting in toroidal networks are presented.
Given a limited number of resources, it is desirable to distribute them over the interconnection network so that the distance between any non-resource node and its closest resource is minimized. This problem is known as distance-d placement: every non-resource node must be within distance d of at least one resource, and the number of resources used must be the smallest possible. Solutions for distance-d placements in 2D and 3D tori are proposed and compared with placements used so far in practice. Simulation experiments show that the proposed solutions are superior to the placements used in practice in terms of reducing average network latency.
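As a concrete illustration of the distance-d criterion (not the construction proposed in the thesis), the following sketch verifies a candidate placement in a k x k 2D torus using the wraparound (Lee) distance, and builds one classical lattice pattern that is a perfect placement for d = 1 when k is a multiple of 5; the modulus-5 rule is a standard textbook example, assumed here for illustration only.

```python
# Verify a distance-d resource placement in a k x k 2D torus (illustrative
# helper, not the placement construction from the thesis).

def torus_distance(a, b, k):
    """Wraparound (Lee) distance between nodes a and b in a k x k torus."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dx, k - dx) + min(dy, k - dy)

def is_distance_d_placement(resources, k, d):
    """True if every node lies within distance d of at least one resource."""
    return all(
        any(torus_distance((x, y), r, k) <= d for r in resources)
        for x in range(k) for y in range(k)
    )

def lattice_d1_placement(k):
    """Classical lattice pattern for d = 1: resources where (x + 2y) % 5 == 0.
    Perfect (each node covered exactly once) when k is a multiple of 5."""
    return [(x, y) for x in range(k) for y in range(k) if (x + 2 * y) % 5 == 0]

if __name__ == "__main__":
    k = 10
    placement = lattice_d1_placement(k)
    print(len(placement), "resources for", k * k, "nodes:",
          is_distance_d_placement(placement, k, 1))   # 20 resources: True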
The complexity of a multicomputer increases the likelihood of processor failures, so designing fault-tolerant communication algorithms is necessary for effective utilization of such a system. Broadcasting (one-to-all communication from a single node) is one of the most important communication primitives in a multicomputer. A non-redundant fault-tolerant broadcasting algorithm for a faulty toroidal network is designed. The algorithm tolerates up to 2n-2 processor failures. Compared to the optimal algorithm in a fault-free n-dimensional toroidal network, the proposed algorithm requires at most 3 extra communication steps using cut-through packet routing, and n+1 extra steps using store-and-forward routing. / Graduation date: 1998

242.
CounterDataFlow architecture: design and performance
Miller, Michael F. (Michael Frederic) (17 July 1997)
The counterflow pipeline concept was originated by Sproull and Sutherland to demonstrate the concept of asynchronous circuits. This architecture relies on distributed decision making and localized clocking and data movement. We have taken these ideas and reformulated them into a substantially faster, more scalable architecture that retains the same distributed decision making and locality of clocking and data, but adds very aggressive speculation, eliminates stalls, and has other desirable characteristics. A high-level Java simulator has been built to explore the design tradeoffs and evaluate performance. / Graduation date: 1998
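For readers unfamiliar with the counterflow idea, here is a toy discrete-time model of the original Sproull/Sutherland pipeline the thesis starts from (the faster CounterDataFlow reformulation is not modeled): instructions flow up through the stages while results flow down, and the two streams inspect each other at every stage. The four-stage depth, the add-only instruction format, and the two-instruction program are invented for illustration.

```python
# Toy counterflow pipeline: instructions move up, results move down, and an
# instruction picks up any passing result it still needs. Matching is done
# after each movement so the two streams can never slip past one another.

N = 4  # pipeline stages

def match(insns, results):
    for insn, res in zip(insns, results):
        if insn and res and res[0] in insn["needs"]:
            insn["vals"][res[0]] = res[1]
            insn["needs"].discard(res[0])

def simulate(program, results_in):
    insns = [None] * N      # insns[i] occupies stage i (0 = entry)
    results = [None] * N    # (register, value) pairs, injected at stage N-1
    completed = []
    while program or results_in or any(insns) or any(results):
        # Results move down one stage; the bottom result exits the pipe.
        results[:-1] = results[1:]
        results[-1] = results_in.pop(0) if results_in else None
        match(insns, results)
        # A fully fed instruction at the top stage executes; its result is
        # injected into the downward stream for later instructions to see.
        top = insns[-1]
        if top and not top["needs"]:
            value = sum(top["vals"].values())        # ADD is the only op here
            completed.append((top["dst"], value))
            results_in.append((top["dst"], value))
            insns[-1] = None
        # Instructions move up one stage whenever the top stage is free.
        if insns[-1] is None:
            insns[1:] = insns[:-1]
            insns[0] = program.pop(0) if program else None
        match(insns, results)
    return completed

# r2 = r0 + r1; r3 = r2 + r0, with r0 and r1 arriving as results from below.
prog = [{"dst": "r2", "needs": {"r0", "r1"}, "vals": {}},
        {"dst": "r3", "needs": {"r2", "r0"}, "vals": {}}]
print(simulate(prog, [("r0", 1), ("r1", 2)]))        # [('r2', 3), ('r3', 4)]
```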

243.
Memory optimization for a parallel sorting hardware architecture
Beyer, Dale A. (22 May 1997)
Sorting is one of the more computationally intensive tasks a computer performs. One of the most effective ways to speed it up is to use parallel algorithms. When implementing a parallel algorithm, the designer must make several decisions, among them the choice of algorithm and its physical implementation. A dedicated hardware solution is often faster than a software solution.
This thesis investigates the optimization of a hardware implementation of max-min sort and proposes an optimization to the data structures used in the algorithm. The new data structure enables faster sorting by changing the basic operation of max-min sort. Results are presented comparing the new data structure with the original. The thesis also discusses the design and performance issues of implementing the algorithm in hardware. / Graduation date: 1998
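As background, a plain software rendering of max-min sort may help: each pass scans the unsorted region once, finds both its minimum and its maximum, and swaps them to the two ends. This sketch shows only the baseline algorithm's structure; the thesis's data-structure optimization and hardware mapping are not represented.

```python
# Max-min sort: a double-ended selection sort. One scan of the unsorted
# region yields both extremes, which a hardware design could locate in
# parallel with comparator trees.

def max_min_sort(a):
    lo, hi = 0, len(a) - 1
    while lo < hi:
        imin = imax = lo
        for i in range(lo, hi + 1):          # single scan finds both extremes
            if a[i] < a[imin]:
                imin = i
            if a[i] > a[imax]:
                imax = i
        a[lo], a[imin] = a[imin], a[lo]      # minimum goes to the low end
        if imax == lo:                       # the max was displaced by the swap
            imax = imin
        a[hi], a[imax] = a[imax], a[hi]      # maximum goes to the high end
        lo, hi = lo + 1, hi - 1
    return a

print(max_min_sort([5, 3, 8, 1, 9, 2, 7]))   # [1, 2, 3, 5, 7, 8, 9]
```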

244.
Simulation studies of routing algorithms for multicomputer systems
Choi, Jangmin (28 June 1995)
Efficient routing of messages is critical to the performance of multicomputers. Many adaptive routing algorithms have been proposed to improve network efficiency; however, because they have only limited information about network conditions, they can make only short-sighted decisions when choosing a channel for a message. Such decisions may cause unnecessary waits if the algorithm chooses a channel that has a high probability of blocking the message at the next clock cycle.
In this thesis, a look-ahead routing scheme is introduced to improve the performance of conventional adaptive routing algorithms. As an adaptive routing algorithm, the look-ahead scheme makes longer-sighted decisions to avoid possible blocking at the next clock cycle.
This thesis also simulates the XY, West-First, and Double-Y channel routing algorithms, which are deterministic, partially adaptive, and fully adaptive, respectively, in a 16 x 16 mesh network, and compares and analyzes their performance. The simulation includes an examination of the look-ahead effect for the West-First and Double-Y channel routing algorithms. / Graduation date: 1996
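As a reference point for the schemes compared above, the sketch below implements next-hop selection for the deterministic XY algorithm in a 16 x 16 mesh: a message first corrects its X coordinate, then its Y coordinate. The partially and fully adaptive variants (West-First, Double-Y) and the look-ahead extension choose among several permitted channels and are omitted here.

```python
# Deterministic XY routing in a 16 x 16 mesh: route along X to the
# destination column, then along Y. Deadlock-free because dimension
# order admits no cyclic channel dependencies.

def xy_next_hop(cur, dst):
    """Return the neighbor to forward the message to, or None on arrival."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                              # correct the X dimension first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                              # then correct the Y dimension
        return (cx, cy + (1 if dy > cy else -1))
    return None

node, dst = (2, 14), (9, 3)
path = [node]
while (nxt := xy_next_hop(node, dst)) is not None:
    path.append(nxt)
    node = nxt
print(path)   # seven X hops followed by eleven Y hops
```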

245.
Data decomposition and load balancing for networked data-parallel processing
Crandall, Phyllis E. (19 April 1994)
Graduation date: 1994

246.
Scheduling non-uniform parallel loops on MIMD computers
Liu, Jie (22 September 1993)
Parallel loops are one of the main sources of parallelism in scientific applications, and many parallel loops do not have a uniform iteration execution time. To achieve good performance for such applications on a parallel computer, the iterations of a parallel loop have to be assigned to processors so that each processor has roughly the same amount of work in terms of execution time. A parallel computer with a large number of processors tends to have distributed memory, and running a parallel loop on a distributed-memory machine also requires considering data distribution. This research investigates the scheduling of non-uniform parallel loops on both shared-memory and distributed-memory parallel computers.
We present Safe Self-Scheduling (SSS), a new scheduling scheme that combines the advantages of both static and dynamic scheduling. SSS has two phases: a static scheduling phase and a dynamic self-scheduling phase that together reduce scheduling overhead while achieving a well-balanced workload. The techniques introduced in SSS can be used by other self-scheduling schemes. The static scheduling phase further improves performance by maintaining a high cache hit ratio, a result of the increased affinity of iterations to processors. SSS is also very well suited to distributed-memory machines.
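The sketch below illustrates the general shape of such a two-phase scheme; it is a hypothetical rendering, not the allocation functions derived in the thesis. A static phase gives each processor a fixed "safe" share of the iterations up front (good locality, no scheduling overhead), and a dynamic phase hands out the remaining iterations in shrinking chunks to smooth out non-uniform iteration times. The alpha factor of 0.5 and the halving chunk rule are illustrative assumptions.

```python
# Two-phase loop scheduling in the spirit of SSS: static block assignment
# followed by dynamic self-scheduling of the leftover iterations.

import threading

def schedule(n_iters, n_procs, run_iter, alpha=0.5):
    static = int(alpha * n_iters / n_procs)   # "safe" per-processor share
    next_iter = n_procs * static              # first dynamically scheduled iter
    lock = threading.Lock()

    def worker(pid):
        nonlocal next_iter
        # Static phase: a fixed contiguous block, decided before execution.
        for i in range(pid * static, (pid + 1) * static):
            run_iter(i)
        # Dynamic phase: grab shrinking chunks until the loop is exhausted.
        while True:
            with lock:
                remaining = n_iters - next_iter
                if remaining <= 0:
                    return
                chunk = max(1, remaining // (2 * n_procs))
                start, next_iter = next_iter, next_iter + chunk
            for i in range(start, start + chunk):
                run_iter(i)

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

done = []
schedule(1000, 4, done.append)            # each iteration records its index
print(len(done), "iterations executed")   # 1000 iterations executed
```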
We introduce methods to duplicate data on a number of processors. These methods eliminate data movement during computation and increase the scalability of problem size. We discuss a systematic approach to implementing a given self-scheduling scheme on a distributed-memory machine, and we show a multilevel scheduling scheme that self-schedules parallel loops on a distributed-memory machine with a large number of processors, eliminating the bottleneck of a central scheduler.
We propose a method using abstractions to automate both self-scheduling methods and data distribution methods in parallel programming environments. The abstractions are tested using CHARM, a real parallel programming environment. Methods are also developed to tolerate processor faults, caused either by physical failure or by reassignment of processors by the operating system, during the execution of a parallel loop. We tested the techniques discussed using simulations and real applications, and good results have been obtained on both shared-memory and distributed-memory parallel computers. / Graduation date: 1994

247.
Embeddings in parallel systems
Kwon, Younggeun (04 May 1993)
Graduation date: 1993

248.
Allocation of SISAL program graphs to processors using BLAS
Raisinghani, Manoj H. (07 April 1994)
There are a number of well-known techniques for extracting parallelism from a given program, ranging from hardware implementations to restructuring compilers and reorganizations of programs that specify all the available parallelism. The success rate of any one technique over all types of programs, however, is rather poor. This has pushed the research community to explore new languages and design different architectures to exploit program parallelism.
Dataflow architectures address the problem of exploiting parallelism by executing dataflow graphs. These graphs represent the data dependencies among instructions, and execution of a graph proceeds in a data-driven manner: an instruction is executed as soon as all its operands are available, without waiting for a program counter to sequence it, as is the case in conventional von Neumann architectures.
In this thesis, dataflow graphs are generated during the intermediate compilation of a functional language called SISAL (Streams and Iterations in a Single Assignment Language). The Intermediate Form (IF1) is a graphical language consisting of multiple acyclic function graphs that represent a given program. Each graph consists of a sequence of nodes and edges: the nodes specify operations, and the edges indicate the dependencies between them. The graphs are further connected to each other by implicit dependencies.
The Automator package developed in this project preprocesses these multiple IF1 graphs and translates them into a single connected graph, converting all implicit dependencies into actual ones. Complex language constructs such as ForAll, loops, and if-then-else are treated specially, together with their nested levels; there is virtually no limit to the number of nested levels the package can translate.
The Automator's prime contribution is translating real programs written in SISAL into the format required by an allocation algorithm called the Balanced Layered Allocation Scheme (BLAS).
BLAS partitions a connected graph into independent tasks and assigns them to processors in a multicomputer system. The problem of program allocation lies in maximizing parallelism while minimizing interprocessor communication costs; hence, allocation is based on the best choice of communication-to-execution ratio for each task. BLAS uses heuristic rules to find a balance between computation and communication costs in the target system, here a simulated nCUBE 3E computer with a hypercube topology.
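The following is a deliberately simplified picture of layered allocation, with an invented cost model standing in for the actual BLAS rules: nodes of a dataflow graph are grouped into layers by dependency depth (nodes within a layer are independent and can run concurrently), and each node is then assigned to the processor that minimizes a combined computation-plus-communication cost.

```python
# Layer a dataflow DAG by dependency depth, then greedily place each node
# on the processor with the lowest load-plus-communication cost. The unit
# communication charge per remote predecessor is an invented stand-in for
# BLAS's communication-to-execution balancing.

from collections import defaultdict

def layers(dag):
    """Group nodes by their longest-path depth from the entry nodes."""
    depth = {}
    def d(v):
        if v not in depth:
            preds = [u for u in dag if v in dag[u]]
            depth[v] = 1 + max((d(u) for u in preds), default=-1)
        return depth[v]
    grouped = defaultdict(list)
    for v in dag:
        grouped[d(v)].append(v)
    return [grouped[k] for k in sorted(grouped)]

def allocate(dag, exec_time, n_procs, comm_cost=1.0):
    """Assign every node of the DAG to one of n_procs processors."""
    load = [0.0] * n_procs
    place = {}
    for layer in layers(dag):
        for v in layer:
            preds = [u for u in dag if v in dag[u]]
            def cost(p):
                # pay one communication charge per predecessor placed elsewhere
                remote = sum(1 for u in preds if place[u] != p)
                return load[p] + exec_time[v] + comm_cost * remote
            best = min(range(n_procs), key=cost)
            place[v] = best
            load[best] += exec_time[v]
    return place

dag = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}           # edges u -> v
print(allocate(dag, {"a": 2, "b": 2, "c": 1, "d": 1}, n_procs=2))
```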
Simulations show that BLAS is effective in reducing the overall execution time of a program by taking into account the effect of communication costs on execution times. The results will help in understanding the effects of packing nodes (grain packing), routing issues in the network, and, in general, the problem of allocation to any processor in a network. In addition, tasks have also been assigned to adjacent processors only, rather than to any processor on the hypercube network. Adjacent allocation helps determine the trade-offs between the achieved speedups and the time it takes to completely allocate large graphs at compilation. / Graduation date: 1994

249.
A multiresolutional approach for large data visualization
Wang, Chaoli (January 2006)
Thesis (Ph.D.), Ohio State University, 2006. Includes bibliographical references (p. 116-123).

250.
Program allocation for hypercube based dataflow systems
Freytag, Vincent R. (18 March 1993)
The dataflow model of computation differs from the traditional control-flow model in that it does not use a program counter to sequence the instructions in a program. Instead, execution is based solely on the availability of operands: an instruction is executed in a dataflow computer when all of its operands are available. This asynchronous nature of the dataflow model allows the exploitation of the fine-grain parallelism inherent in programs.
Although the dataflow model of computation exploits parallelism, optimally allocating a program to processors is an NP-complete problem. One of the major issues facing designers of dataflow multiprocessors is therefore the proper allocation of programs to processors.
The problem of program allocation lies in maximizing parallelism while minimizing interprocessor communication costs. This research culminates in the proposed Balanced Layered Allocation Scheme, which uses heuristic rules to strike a balance between computation time and communication costs in dataflow multiprocessors. Specifically, the scheme uses Critical Path and Longest Directed Path heuristics when allocating instructions to processors. Simulation studies indicate that the proposed scheme is effective in reducing the overall execution time of a program by considering the effects of communication costs on computation times. / Graduation date: 1993
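As a small illustration of the Critical Path heuristic named above, the sketch below computes each node's execution-time-weighted longest downstream path in a dataflow graph; an allocator would then place nodes in decreasing critical-path order so the longest chain is never delayed. The example graph and weights are invented.

```python
# Critical path length of each node: its own execution time plus the
# heaviest path through its successors to any exit of the graph.

from functools import lru_cache

def critical_paths(dag, exec_time):
    @lru_cache(maxsize=None)
    def cp(v):
        return exec_time[v] + max((cp(s) for s in dag[v]), default=0)
    return {v: cp(v) for v in dag}

dag = {"a": ("c",), "b": ("c",), "c": ("d",), "d": ()}   # edges u -> v
times = {"a": 3, "b": 1, "c": 2, "d": 4}
ranked = sorted(critical_paths(dag, times).items(), key=lambda kv: -kv[1])
print(ranked)   # [('a', 9), ('b', 7), ('c', 6), ('d', 4)]
```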