241.
On resource placements and fault-tolerant broadcasting in toroidal networks
AlMohammad, Bader Fahed AlBedaiwi (13 November 1997)
Parallel computers are classified into two groups: multiprocessors and multicomputers. A multiprocessor system usually has a shared memory through which its processors communicate. The processors of a multicomputer system, on the other hand, communicate by message passing through an interconnection network. Toroidal networks are a widely used class of interconnection networks. Compared to a hypercube, a torus has a larger diameter but offers better tradeoffs, such as higher channel bandwidth and lower node degree. Results on resource placements and fault-tolerant broadcasting in toroidal networks are presented.
Given a limited number of resources, it is desirable to distribute them over the interconnection network so that the distance between any non-resource node and its closest resource is minimized. This problem is known as distance-d placement: every non-resource node must be within distance d of at least one resource, and the number of resources used must be the smallest possible. Solutions for distance-d placements in 2D and 3D tori are proposed and compared with placements used so far in practice. Simulation experiments show that the proposed solutions are superior to the placements used in practice in terms of reducing average network latency.
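As a concrete illustration of the distance-d criterion (not the construction proposed in the thesis), the following sketch verifies a candidate placement in a k x k 2D torus using the wraparound (Lee) distance, and builds one classical lattice pattern that is a perfect placement for d = 1 when k is a multiple of 5; the modulus-5 rule is a standard textbook example, assumed here for illustration only.

```python
# Verify a distance-d resource placement in a k x k 2D torus (illustrative
# helper, not the placement construction from the thesis).

def torus_distance(a, b, k):
    """Wraparound (Lee) distance between nodes a and b in a k x k torus."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dx, k - dx) + min(dy, k - dy)

def is_distance_d_placement(resources, k, d):
    """True if every node lies within distance d of at least one resource."""
    return all(
        any(torus_distance((x, y), r, k) <= d for r in resources)
        for x in range(k) for y in range(k)
    )

def lattice_d1_placement(k):
    """Classical lattice pattern for d = 1: resources where (x + 2y) % 5 == 0.
    Perfect (each node covered exactly once) when k is a multiple of 5."""
    return [(x, y) for x in range(k) for y in range(k) if (x + 2 * y) % 5 == 0]

if __name__ == "__main__":
    k = 10
    placement = lattice_d1_placement(k)
    print(len(placement), "resources for", k * k, "nodes:",
          is_distance_d_placement(placement, k, 1))   # 20 resources: True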
The complexity of a multicomputer increases the likelihood of processor failures, so designing fault-tolerant communication algorithms is necessary for effective utilization of such a system. Broadcasting (one-to-all communication from a single node) is one of the most important communication primitives in a multicomputer. A non-redundant fault-tolerant broadcasting algorithm for a faulty toroidal network is designed. The algorithm tolerates up to 2n-2 processor failures. Compared to the optimal algorithm in a fault-free n-dimensional toroidal network, the proposed algorithm requires at most 3 extra communication steps using cut-through packet routing, and n+1 extra steps using store-and-forward routing. / Graduation date: 1998

242.
CounterDataFlow architecture: design and performance
Miller, Michael F. (Michael Frederic) (17 July 1997)
The counterflow pipeline concept was originated by Sproull and Sutherland to demonstrate the concept of asynchronous circuits. This architecture relies on distributed decision making and localized clocking and data movement. We have taken these ideas and reformulated them into a substantially faster, more scalable architecture that retains the same distributed decision making and locality of clocking and data, but adds very aggressive speculation, eliminates stalls, and has other desirable characteristics. A high-level Java simulator has been built to explore the design tradeoffs and evaluate performance. / Graduation date: 1998
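For readers unfamiliar with the counterflow idea, here is a toy discrete-time model of the original Sproull/Sutherland pipeline the thesis starts from (the faster CounterDataFlow reformulation is not modeled): instructions flow up through the stages while results flow down, and the two streams inspect each other at every stage. The four-stage depth, the add-only instruction format, and the two-instruction program are invented for illustration.

```python
# Toy counterflow pipeline: instructions move up, results move down, and an
# instruction picks up any passing result it still needs. Matching is done
# after each movement so the two streams can never slip past one another.

N = 4  # pipeline stages

def match(insns, results):
    for insn, res in zip(insns, results):
        if insn and res and res[0] in insn["needs"]:
            insn["vals"][res[0]] = res[1]
            insn["needs"].discard(res[0])

def simulate(program, results_in):
    insns = [None] * N      # insns[i] occupies stage i (0 = entry)
    results = [None] * N    # (register, value) pairs, injected at stage N-1
    completed = []
    while program or results_in or any(insns) or any(results):
        # Results move down one stage; the bottom result exits the pipe.
        results[:-1] = results[1:]
        results[-1] = results_in.pop(0) if results_in else None
        match(insns, results)
        # A fully fed instruction at the top stage executes; its result is
        # injected into the downward stream for later instructions to see.
        top = insns[-1]
        if top and not top["needs"]:
            value = sum(top["vals"].values())        # ADD is the only op here
            completed.append((top["dst"], value))
            results_in.append((top["dst"], value))
            insns[-1] = None
        # Instructions move up one stage whenever the top stage is free.
        if insns[-1] is None:
            insns[1:] = insns[:-1]
            insns[0] = program.pop(0) if program else None
        match(insns, results)
    return completed

# r2 = r0 + r1; r3 = r2 + r0, with r0 and r1 arriving as results from below.
prog = [{"dst": "r2", "needs": {"r0", "r1"}, "vals": {}},
        {"dst": "r3", "needs": {"r2", "r0"}, "vals": {}}]
print(simulate(prog, [("r0", 1), ("r1", 2)]))        # [('r2', 3), ('r3', 4)]
```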

243.
Memory optimization for a parallel sorting hardware architecture
Beyer, Dale A. (22 May 1997)
Sorting is one of the more computationally intensive tasks a computer performs. One of the most effective ways to speed it up is to use parallel algorithms. When implementing a parallel algorithm, the designer must make several decisions, among them the choice of algorithm and its physical implementation. A dedicated hardware solution is often faster than a software solution.
This thesis investigates the optimization of a hardware implementation of max-min sort and proposes an optimization to the data structures used in the algorithm. The new data structure enables faster sorting by changing the basic operation of max-min sort. Results are presented comparing the new data structure with the original. The thesis also discusses the design and performance issues of implementing the algorithm in hardware. / Graduation date: 1998
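As background, a plain software rendering of max-min sort may help: each pass scans the unsorted region once, finds both its minimum and its maximum, and swaps them to the two ends. This sketch shows only the baseline algorithm's structure; the thesis's data-structure optimization and hardware mapping are not represented.

```python
# Max-min sort: a double-ended selection sort. One scan of the unsorted
# region yields both extremes, which a hardware design could locate in
# parallel with comparator trees.

def max_min_sort(a):
    lo, hi = 0, len(a) - 1
    while lo < hi:
        imin = imax = lo
        for i in range(lo, hi + 1):          # single scan finds both extremes
            if a[i] < a[imin]:
                imin = i
            if a[i] > a[imax]:
                imax = i
        a[lo], a[imin] = a[imin], a[lo]      # minimum goes to the low end
        if imax == lo:                       # the max was displaced by the swap
            imax = imin
        a[hi], a[imax] = a[imax], a[hi]      # maximum goes to the high end
        lo, hi = lo + 1, hi - 1
    return a

print(max_min_sort([5, 3, 8, 1, 9, 2, 7]))   # [1, 2, 3, 5, 7, 8, 9]
```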

244.
Simulation studies of routing algorithms for multicomputer systems
Choi, Jangmin (28 June 1995)
Efficient routing of messages is critical to the performance of multicomputers. Many adaptive routing algorithms have been proposed to improve network efficiency; however, because they have only limited information about network conditions, they can make only short-sighted decisions when choosing a channel for a message. Such decisions may cause unnecessary waits if the algorithm chooses a channel that has a high probability of blocking the message at the next clock cycle.
In this thesis, a look-ahead routing scheme is introduced to improve the performance of conventional adaptive routing algorithms. As an adaptive routing algorithm, the look-ahead scheme makes longer-sighted decisions to avoid possible blocking at the next clock cycle.
This thesis also simulates the XY, West-First, and Double-Y channel routing algorithms, which are deterministic, partially adaptive, and fully adaptive, respectively, in a 16 x 16 mesh network, and compares and analyzes their performance. The simulation includes an examination of the look-ahead effect for the West-First and Double-Y channel routing algorithms. / Graduation date: 1996
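As a reference point for the schemes compared above, the sketch below implements next-hop selection for the deterministic XY algorithm in a 16 x 16 mesh: a message first corrects its X coordinate, then its Y coordinate. The partially and fully adaptive variants (West-First, Double-Y) and the look-ahead extension choose among several permitted channels and are omitted here.

```python
# Deterministic XY routing in a 16 x 16 mesh: route along X to the
# destination column, then along Y. Deadlock-free because dimension
# order admits no cyclic channel dependencies.

def xy_next_hop(cur, dst):
    """Return the neighbor to forward the message to, or None on arrival."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                              # correct the X dimension first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                              # then correct the Y dimension
        return (cx, cy + (1 if dy > cy else -1))
    return None

node, dst = (2, 14), (9, 3)
path = [node]
while (nxt := xy_next_hop(node, dst)) is not None:
    path.append(nxt)
    node = nxt
print(path)   # seven X hops followed by eleven Y hops
```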

245.
Data decomposition and load balancing for networked data-parallel processing
Crandall, Phyllis E. (19 April 1994)
Graduation date: 1994

246.
Scheduling non-uniform parallel loops on MIMD computers
Liu, Jie (22 September 1993)
Parallel loops are one of the main sources of parallelism in scientific applications, and many parallel loops do not have a uniform iteration execution time. To achieve good performance for such applications on a parallel computer, the iterations of a parallel loop have to be assigned to processors so that each processor has roughly the same amount of work in terms of execution time. A parallel computer with a large number of processors tends to have distributed memory, and running a parallel loop on a distributed-memory machine also requires considering data distribution. This research investigates the scheduling of non-uniform parallel loops on both shared-memory and distributed-memory parallel computers.
We present Safe Self-Scheduling (SSS), a new scheduling scheme that combines the advantages of both static and dynamic scheduling. SSS has two phases: a static scheduling phase and a dynamic self-scheduling phase that together reduce scheduling overhead while achieving a well-balanced workload. The techniques introduced in SSS can be used by other self-scheduling schemes. The static scheduling phase further improves performance by maintaining a high cache hit ratio, a result of the increased affinity of iterations to processors. SSS is also very well suited to distributed-memory machines.
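The sketch below illustrates the general shape of such a two-phase scheme; it is a hypothetical rendering, not the allocation functions derived in the thesis. A static phase gives each processor a fixed "safe" share of the iterations up front (good locality, no scheduling overhead), and a dynamic phase hands out the remaining iterations in shrinking chunks to smooth out non-uniform iteration times. The alpha factor of 0.5 and the halving chunk rule are illustrative assumptions.

```python
# Two-phase loop scheduling in the spirit of SSS: static block assignment
# followed by dynamic self-scheduling of the leftover iterations.

import threading

def schedule(n_iters, n_procs, run_iter, alpha=0.5):
    static = int(alpha * n_iters / n_procs)   # "safe" per-processor share
    next_iter = n_procs * static              # first dynamically scheduled iter
    lock = threading.Lock()

    def worker(pid):
        nonlocal next_iter
        # Static phase: a fixed contiguous block, decided before execution.
        for i in range(pid * static, (pid + 1) * static):
            run_iter(i)
        # Dynamic phase: grab shrinking chunks until the loop is exhausted.
        while True:
            with lock:
                remaining = n_iters - next_iter
                if remaining <= 0:
                    return
                chunk = max(1, remaining // (2 * n_procs))
                start, next_iter = next_iter, next_iter + chunk
            for i in range(start, start + chunk):
                run_iter(i)

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

done = []
schedule(1000, 4, done.append)            # each iteration records its index
print(len(done), "iterations executed")   # 1000 iterations executed
```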
We introduce methods to duplicate data on a number of processors. These methods eliminate data movement during computation and increase the scalability of problem size. We discuss a systematic approach to implementing a given self-scheduling scheme on a distributed-memory machine, and we show a multilevel scheduling scheme that self-schedules parallel loops on a distributed-memory machine with a large number of processors, eliminating the bottleneck of a central scheduler.
We propose a method using abstractions to automate both self-scheduling methods and data distribution methods in parallel programming environments. The abstractions are tested using CHARM, a real parallel programming environment. Methods are also developed to tolerate processor faults, caused either by physical failure or by reassignment of processors by the operating system, during the execution of a parallel loop. We tested the techniques discussed using simulations and real applications, and good results have been obtained on both shared-memory and distributed-memory parallel computers. / Graduation date: 1994

247.
Embeddings in parallel systems
Kwon, Younggeun (04 May 1993)
Graduation date: 1993

248.
Allocation of SISAL program graphs to processors using BLAS
Raisinghani, Manoj H. (07 April 1994)
There are a number of well-known techniques for extracting parallelism from a given program, ranging from hardware implementations to restructuring compilers and reorganizations of programs that specify all the available parallelism. The success rate of any one technique over all types of programs, however, is rather poor. This has pushed the research community to explore new languages and design different architectures to exploit program parallelism.
Dataflow architectures address the problem of exploiting parallelism by executing dataflow graphs. These graphs represent the data dependencies among instructions, and execution of a graph proceeds in a data-driven manner: an instruction is executed as soon as all its operands are available, without waiting for a program counter to sequence it, as is the case in conventional von Neumann architectures.
In this thesis, dataflow graphs are generated during the intermediate compilation of a functional language called SISAL (Streams and Iterations in a Single Assignment Language). The Intermediate Form (IF1) is a graphical language consisting of multiple acyclic function graphs that represent a given program. Each graph consists of a sequence of nodes and edges: the nodes specify operations, and the edges indicate the dependencies between them. The graphs are further connected to each other by implicit dependencies.
The Automator package developed in this project preprocesses these multiple IF1 graphs and translates them into a single connected graph, converting all implicit dependencies into actual ones. Complex language constructs such as ForAll, loops, and if-then-else are treated specially, together with their nested levels; there is virtually no limit to the number of nested levels the package can translate.
The Automator's prime contribution is translating real programs written in SISAL into the format required by an allocation algorithm called the Balanced Layered Allocation Scheme (BLAS).
BLAS partitions a connected graph into independent tasks and assigns them to processors in a multicomputer system. The problem of program allocation lies in maximizing parallelism while minimizing interprocessor communication costs; hence, allocation is based on the best choice of communication-to-execution ratio for each task. BLAS uses heuristic rules to find a balance between computation and communication costs in the target system, here a simulated nCUBE 3E computer with a hypercube topology.
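The following is a deliberately simplified picture of layered allocation, with an invented cost model standing in for the actual BLAS rules: nodes of a dataflow graph are grouped into layers by dependency depth (nodes within a layer are independent and can run concurrently), and each node is then assigned to the processor that minimizes a combined computation-plus-communication cost.

```python
# Layer a dataflow DAG by dependency depth, then greedily place each node
# on the processor with the lowest load-plus-communication cost. The unit
# communication charge per remote predecessor is an invented stand-in for
# BLAS's communication-to-execution balancing.

from collections import defaultdict

def layers(dag):
    """Group nodes by their longest-path depth from the entry nodes."""
    depth = {}
    def d(v):
        if v not in depth:
            preds = [u for u in dag if v in dag[u]]
            depth[v] = 1 + max((d(u) for u in preds), default=-1)
        return depth[v]
    grouped = defaultdict(list)
    for v in dag:
        grouped[d(v)].append(v)
    return [grouped[k] for k in sorted(grouped)]

def allocate(dag, exec_time, n_procs, comm_cost=1.0):
    """Assign every node of the DAG to one of n_procs processors."""
    load = [0.0] * n_procs
    place = {}
    for layer in layers(dag):
        for v in layer:
            preds = [u for u in dag if v in dag[u]]
            def cost(p):
                # pay one communication charge per predecessor placed elsewhere
                remote = sum(1 for u in preds if place[u] != p)
                return load[p] + exec_time[v] + comm_cost * remote
            best = min(range(n_procs), key=cost)
            place[v] = best
            load[best] += exec_time[v]
    return place

dag = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}           # edges u -> v
print(allocate(dag, {"a": 2, "b": 2, "c": 1, "d": 1}, n_procs=2))
```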
Simulations show that BLAS is effective in reducing the overall execution time of a program by taking into account the effect of communication costs on execution times. The results will help in understanding the effects of packing nodes (grain packing), routing issues in the network, and, in general, the problem of allocation to any processor in a network. In addition, tasks have also been assigned to adjacent processors only, rather than to any processor on the hypercube network. Adjacent allocation helps determine the trade-offs between the achieved speedups and the time it takes to completely allocate large graphs at compilation. / Graduation date: 1994

249.
A multiresolutional approach for large data visualization
Wang, Chaoli (January 2006)
Thesis (Ph.D.), Ohio State University, 2006. Includes bibliographical references (p. 116-123).

250.
Program allocation for hypercube based dataflow systems
Freytag, Vincent R. (18 March 1993)
The dataflow model of computation differs from the traditional control-flow model in that it does not use a program counter to sequence the instructions in a program. Instead, execution is based solely on the availability of operands: an instruction is executed in a dataflow computer when all of its operands are available. This asynchronous nature of the dataflow model allows the exploitation of the fine-grain parallelism inherent in programs.
Although the dataflow model of computation exploits parallelism, optimally allocating a program to processors is an NP-complete problem. One of the major issues facing designers of dataflow multiprocessors is therefore the proper allocation of programs to processors.
The problem of program allocation lies in maximizing parallelism while minimizing interprocessor communication costs. This research culminates in the proposed Balanced Layered Allocation Scheme, which uses heuristic rules to strike a balance between computation time and communication costs in dataflow multiprocessors. Specifically, the scheme uses Critical Path and Longest Directed Path heuristics when allocating instructions to processors. Simulation studies indicate that the proposed scheme is effective in reducing the overall execution time of a program by considering the effects of communication costs on computation times. / Graduation date: 1993
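As a small illustration of the Critical Path heuristic named above, the sketch below computes each node's execution-time-weighted longest downstream path in a dataflow graph; an allocator would then place nodes in decreasing critical-path order so the longest chain is never delayed. The example graph and weights are invented.

```python
# Critical path length of each node: its own execution time plus the
# heaviest path through its successors to any exit of the graph.

from functools import lru_cache

def critical_paths(dag, exec_time):
    @lru_cache(maxsize=None)
    def cp(v):
        return exec_time[v] + max((cp(s) for s in dag[v]), default=0)
    return {v: cp(v) for v in dag}

dag = {"a": ("c",), "b": ("c",), "c": ("d",), "d": ()}   # edges u -> v
times = {"a": 3, "b": 1, "c": 2, "d": 4}
ranked = sorted(critical_paths(dag, times).items(), key=lambda kv: -kv[1])
print(ranked)   # [('a', 9), ('b', 7), ('c', 6), ('d', 4)]
```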