11 |
Performance modelling of message-passing parallel programs / Grove, Duncan A. January 2003 (PDF)
Thesis (Ph.D.)--University of Adelaide, Dept. of Computer Science, 2003. / Also available in an electronic version.
|
12 |
Performance modelling of message-passing parallel programs / Grove, Duncan A. January 2003
Thesis (Ph.D.)--University of Adelaide, Dept. of Computer Science. / Also available in a print form.
|
13 |
Virtual links for multicomputers / Wai, Siu-kit, 衛兆傑 January 1996
(Uncorrected OCR)
Abstract of thesis entitled 'Virtual Links for Multicomputers', submitted by Siu Kit Wai for the degree of Master of Philosophy at the University of Hong Kong in October 1996
In order to increase computation power, multiple autonomous computers or processors are connected to form a multicomputer. The performance boost is the result of exploiting in parallel the processing power available in individual processors. Parallel processing, however, requires the cooperation among the processors, which implies interprocessor communication. The efficiency of such communications is limited by the bandwidth and number of communication channels between directly connected processors.
Multiple processes on a processor share a few hardware communication links (channels) to communicate with processes executing on a different processor. Effective and efficient sharing of channels is important for overall system performance, so the sharing must be properly managed. When the sharing is not provided by the hardware, it can be provided in software at system level. Without a managing component, processes must be programmed to fight for and gain exclusive access to the communication links. This is usually ineffective and error-prone, and it can reduce the overall performance of the processes executing on the processor.
Flexibility is a main advantage of providing a channel-sharing mechanism at system level. Parameters such as packet size, and configuration of the system can be customized and tuned to meet the communication characteristics of different applications.
In this project, we investigate how link sharing can be provided at system level. Our approach is based on the idea of virtual links. The system is designed to be as transparent and easy to use as possible. We discuss how different parameters and configurations affect the system functionality and performance. We also compare this software solution to other existing solutions, including a hardware solution.
Computer Science / Master / Master of Philosophy
|
14 |
A cost analysis for a higher-order parallel programming model / Rangaswami, Roopa January 1996
Programming parallel computers remains a difficult task. An ideal programming environment should enable the user to concentrate on the problem-solving activity at a convenient level of abstraction, while managing the intricate low-level details without sacrificing performance. This thesis investigates a model of parallel programming based on the Bird-Meertens Formalism (BMF). This is a set of higher-order functions, many of which are implicitly parallel. Programs are expressed in terms of functions borrowed from BMF. A parallel implementation is defined for each of these functions for a particular topology, and the associated execution costs are derived. The topologies which have been considered include the hypercube, 2-D torus, tree and the linear array. An analyser estimates the costs associated with different implementations of a given program and selects a cost-effective one for a given topology. All the analysis is performed at compile time, which has the advantage of reducing run-time overheads. The cost model's accuracy in choosing a cost-effective implementation and predicting its performance has been studied for three programs. The main contribution of the thesis is the cost model, which aims to predict realistic performance and which considers several possible parallel implementations for a given program before selecting a cost-effective one.
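The idea of estimating costs for alternative implementations and picking the cheapest one for a given topology can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the thesis's cost model: the function names, the two candidate implementations of a reduction (fold), the per-operation time `t_op`, the per-message time `t_msg`, and the simplified step counts (log2 p for a hypercube, p - 1 for a linear array).

```python
def reduction_steps(topology: str, p: int) -> int:
    """Communication steps to combine p partial results (simplified assumption)."""
    if topology == "hypercube":
        if p < 1 or p & (p - 1):
            raise ValueError("hypercube size must be a power of two")
        return p.bit_length() - 1  # log2(p) pairwise-exchange rounds
    if topology == "linear_array":
        return p - 1  # partial results are passed along the chain
    raise ValueError(f"unknown topology: {topology}")


def estimate_costs(n: int, p: int, topology: str,
                   t_op: float = 1.0, t_msg: float = 50.0) -> dict:
    """Estimated cost of reducing n elements, for two hypothetical implementations."""
    # Sequential: one processor applies the operator to (approximately) n elements.
    sequential = n * t_op
    # Parallel: each processor folds n/p local elements, then partial results
    # are combined with one message and one operation per communication step.
    parallel = (n / p) * t_op + reduction_steps(topology, p) * (t_msg + t_op)
    return {"sequential": sequential, "parallel": parallel}


def choose_implementation(n: int, p: int, topology: str, **params) -> str:
    """Pick the implementation with the lowest estimated cost."""
    costs = estimate_costs(n, p, topology, **params)
    return min(costs, key=costs.get)
```

With these illustrative parameters, `choose_implementation(500, 16, "hypercube")` picks the parallel version, while the same problem on a 16-node linear array is estimated to be cheaper sequentially, because the longer combining phase outweighs the saved computation.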
|
15 |
Hardware for Fast Global Operations on Distributed Memory Multicomputers and Multiprocessors / Hall, Douglas V. 01 January 1995
"Grand Challenge" problems such as climate modeling to predict droughts and human genome mapping to predict and possibly cure diseases such as cancer require massive computing power. Three kinds of computer systems currently used in attempts to solve these problems are "Big Iron" multicomputers such as the Intel Paragon, workstation cluster multicomputers, and distributed shared memory multiprocessors such as the Cray T3D. Machines such as these are inefficient in executing some or all of a set of global program operations which are important in many of the "Grand Challenge" programs. These operations include synchronization, reduction, MAX, MIN, one-to-all broadcasting, all-to-all broadcasting, and orderly access to global shared variables. My hypothesis was that a secondary network with a wide tree topology and one or more centralized processors optimized for these operations could substantially decrease their execution time on all three types of systems. To test my hypothesis, I developed the secondary network and Coordination Processor (COP) system described in this dissertation, modeled the major blocks of the design in VHDL, and simulated these blocks to verify their logic and get realistic timing values. The analyses developed for the COP system clearly demonstrate that it can speed up a variety of common global operations by as much as 2-3 orders of magnitude when added to any of several current multicomputers and multiprocessors. Examples show that this speedup reduces overall execution time for important scientific programs and computational kernels by an average of 25% at an increase in system cost of only about 2%. Further analyses show that for these global operations the COP system has a greater combination of speed and versatility than any other system.
|
16 |
Performance Evaluation of Specialized Hardware for Fast Global Operations on Distributed Memory Multicomputers / Sankaran, Rajesh Madukkarumukumana 27 October 1995
Workstation cluster multicomputers are increasingly being applied for solving scientific problems that require massive computing power. Parallel Virtual Machine (PVM) is a popular message-passing model used to program these clusters. One of the major performance-limiting factors for cluster multicomputers is their inefficiency in performing parallel program operations involving collective communications. These operations include synchronization, global reduction, broadcast/multicast operations and orderly access to shared global variables. Hall has demonstrated that a secondary network with wide tree topology and centralized coordination processors (COP) could improve the performance of global operations on a variety of distributed architectures [Hall94a]. My hypothesis was that the efficiency of many PVM applications on workstation clusters could be significantly improved by utilizing a COP system for collective communication operations. To test my hypothesis, I interfaced the COP system with PVM. The interface software includes a virtual memory-mapped secondary network interface driver and a function library which allows the COP system to be used in place of PVM function calls in application programs. My implementation makes it possible to easily port any existing PVM application to perform fast global operations using the COP system. To evaluate the performance improvements of using a COP system, I measured the cost of various PVM global functions, derived the cost of equivalent COP library global functions, and compared the results. To analyze the effect of global operations on the overall execution time of applications, I instrumented a complex molecular dynamics PVM application and performed measurements. The measurements were performed for a sample cluster size of 5 and for message sizes up to 16 kilobytes.
The comparison of PVM and COP system global operation performance clearly demonstrates that the COP system can speed up a variety of global operations involving small-to-medium sized messages by factors of 5-25. Analysis of the example application for a sample cluster size of 5 shows that the speedup provided by my global function libraries and the COP system reduces overall execution time for this and similar applications by a factor of more than 1.5. Additionally, the performance improvement seen by applications increases as the cluster size increases, thus providing a scalable solution for performing global operations.
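The relation between a speedup of the global operations alone and the resulting overall speedup can be sketched with Amdahl's law. The abstract does not state what fraction of execution time the application spends in global operations, so `global_fraction` below is a hypothetical input, and the function name is chosen only for this illustration; this is not presented as the analysis used in the thesis.

```python
def overall_speedup(global_fraction: float, global_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the global operations get faster.

    global_fraction: share of the original run time spent in global operations
                     (hypothetical, not given in the abstract).
    global_speedup:  factor by which those operations are accelerated.
    """
    if not 0.0 <= global_fraction <= 1.0:
        raise ValueError("global_fraction must be between 0 and 1")
    if global_speedup <= 0.0:
        raise ValueError("global_speedup must be positive")
    # Time left after acceleration, relative to the original run time of 1.0.
    remaining = (1.0 - global_fraction) + global_fraction / global_speedup
    return 1.0 / remaining
```

For instance, if half of the run time were spent in global operations and those became 5 times faster, the whole application would run about 1.67 times faster; the part that is not accelerated bounds the achievable gain.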
|
17 |
Performance modelling of message-passing parallel programs / Grove, Duncan A. January 2003 (PDF)
This dissertation describes a new performance modelling system, called the Performance Evaluating Virtual Parallel Machine (PEVPM). It uses a novel bottom-up approach, where submodels of individual computation and communication events are dynamically constructed from data-dependencies, current contention levels and the performance distributions of low-level operations, which define performance variability in the face of contention.
|
18 |
Performance modelling of message-passing parallel programs / Grove, Duncan A. January 2003
Electronic publication; full text available in PDF format; abstract in HTML format. This dissertation describes a new performance modelling system, called the Performance Evaluating Virtual Parallel Machine (PEVPM). It uses a novel bottom-up approach, where submodels of individual computation and communication events are dynamically constructed from data-dependencies, current contention levels and the performance distributions of low-level operations, which define performance variability in the face of contention. Electronic reproduction. [Australia]: Australian Digital Theses Program, 2001. xvii, 295 p.: ill., charts (col.); 30 cm.
|
19 |
Locality and parallel optimizations for parallel supercomputing / Harrison, Ian, January 2003
Thesis (B.A.)--Haverford College, Dept. of Computer Science, 2003. / Includes bibliographical references.
|
20 |
Reliable low latency I/O in torus-based interconnection networks / Azeez, Babatunde 25 April 2007
In today's high performance computing environment I/O remains the main bottleneck in
achieving the optimal performance expected of the ever improving processor and
memory technologies. Interconnection networks therefore combine processing units,
system I/O and high speed switch network fabric into a new paradigm of I/O based
network. It decouples the system into computational and I/O interconnections each
allowing "any-to-any" communications among processors and I/O devices unlike the
shared model in bus architecture. The computational interconnection, a network of
processing units (compute-nodes), is used for inter-processor communication in carrying
out computation tasks, while the I/O interconnection manages the transfer of I/O requests
between the compute-nodes and the I/O or storage media through some dedicated I/O
processing units (I/O-nodes). Considering the special functions performed by the I/O
nodes, their placement and reliability become important issues in improving the overall
performance of the interconnection system.
This thesis focuses on design and topological placement of I/O-nodes in torus based
interconnection networks, with the aim of reducing I/O communication latency between
compute-nodes and I/O-nodes even in the presence of faulty I/O-nodes. We propose an
efficient and scalable relaxed quasi-perfect placement scheme using Lee distance error
correction code such that compute-nodes are at distance-t or at most distance-t+1 from an
I/O-node for a given t. This scheme provides a better and optimal alternative placement
than quasi perfect placement when perfect placement cannot be found for a particular
torus. Furthermore, in the occurrence of faulty I/O-nodes, the placement scheme is also
used in determining other alternative I/O-nodes for rerouting I/O traffic from affected
compute-nodes with minimal slowdown. In order to guarantee the quality of service
required of inter-processor communication, a scheduling algorithm was developed at the router level to prioritize message forwarding according to inter-process and I/O messages
with the former given higher priority.
Our simulation results show that relaxed quasi-perfect outperforms quasi-perfect and the
conventional I/O placement (where I/O nodes are concentrated at the base of the torus
interconnection) with little degradation in inter-process communication performance.
Also the fault tolerant redirection scheme provides a minimal slowdown, especially when
the number of faulty I/O nodes is less than half of the initial available I/O nodes.
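The distance criterion behind I/O-node placement can be illustrated with a short Python sketch. It computes the Lee distance on a k-ary torus and the worst-case distance from any compute-node to its nearest I/O-node. The function names and the two example placements on a 5x5 torus are assumptions chosen for illustration; they are not the placement scheme proposed in the thesis.

```python
from itertools import product


def lee_distance(a, b, k):
    """Lee distance between two nodes of a k-ary torus (wraparound per dimension)."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))


def worst_case_distance(io_nodes, k, dims=2):
    """Largest distance from any node of the torus to its nearest I/O-node."""
    return max(
        min(lee_distance(node, io, k) for io in io_nodes)
        for node in product(range(k), repeat=dims)
    )


# Illustrative placements of five I/O-nodes on a 5x5 torus (hypothetical examples).
spread = [(i, (2 * i) % 5) for i in range(5)]   # one I/O-node per row, shifted by two columns
concentrated = [(0, j) for j in range(5)]       # all I/O-nodes in a single row
```

With the spread placement every node is at distance at most 1 from an I/O-node, whereas concentrating the same five I/O-nodes in one row raises the worst case to 2. This is the kind of gap a distance-t placement aims to close.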
|