371 |
Performance modelling of message-passing parallel programs
Grove, Duncan A. January 2003 (has links) (PDF)
This dissertation describes a new performance modelling system, called the Performance Evaluating Virtual Parallel Machine (PEVPM). It uses a novel bottom-up approach, where submodels of individual computation and communication events are dynamically constructed from data-dependencies, current contention levels and the performance distributions of low-level operations, which define performance variability in the face of contention.
|
372 |
Performance modelling of message-passing parallel programs
Grove, Duncan A. January 2003 (has links)
Electronic publication; full text available in PDF format; abstract in HTML format. This dissertation describes a new performance modelling system, called the Performance Evaluating Virtual Parallel Machine (PEVPM). It uses a novel bottom-up approach, where submodels of individual computation and communication events are dynamically constructed from data-dependencies, current contention levels and the performance distributions of low-level operations, which define performance variability in the face of contention. Electronic reproduction.[Australia] :Australian Digital Theses Program,2001. xvii, 295 p. : ill., charts (col.) ; 30 cm.
|
373 |
Performance Modelling of Message-Passing Parallel Programs
Grove, Duncan January 2003 (has links)
Parallel computing is essential for solving very large scientific and engineering problems. An effective parallel computing solution requires an appropriate parallel machine and a well-optimised parallel program, both of which can be selected via performance modelling. This dissertation describes a new performance modelling system, called the Performance Evaluating Virtual Parallel Machine (PEVPM). Unlike previous techniques, the PEVPM system is relatively easy to use, inexpensive to apply and extremely accurate. It uses a novel bottom-up approach, where submodels of individual computation and communication events are dynamically constructed from data-dependencies, current contention levels and the performance distributions of low-level operations, which define performance variability in the face of contention. During model evaluation, the performance distribution attached to each submodel is sampled using Monte Carlo techniques, thus simulating the effects of contention. This allows the PEVPM to accurately simulate a program's execution structure, even if it is non-deterministic, and thus to predict its performance. Obtaining these performance distributions required the development of a new benchmarking tool, called MPIBench. Unlike previous tools, which simply measure average message-passing time over a large number of repeated message transfers, MPIBench uses a highly accurate and globally synchronised clock to measure the performance of individual communication operations. MPIBench was used to benchmark three parallel computers, which encompassed a wide range of network performance capabilities, namely those provided by Fast Ethernet, Myrinet and QsNet. Network contention, a problem ignored by most research in this area, was found to cause extensive performance variation during message-passing operations. For point-to-point communication, this variation was best described by Pearson 5 distributions. 
Collective communication operations could be modelled using their constituent point-to-point operations. In cases of severe contention, extreme outliers were common in the observed performance distributions, which were shown to be the result of lost messages and their subsequent retransmit timeouts. The highly accurate benchmark results provided by MPIBench were coupled with the PEVPM models of a range of parallel programs, and simulated by the PEVPM. These case studies proved that, unlike previous modelling approaches, the PEVPM technique successfully unites generality, flexibility, cost-effectiveness and accuracy in one performance modelling system for parallel programs. This makes it a valuable tool for the development of parallel computing solutions. / Thesis (Ph.D.)--Computer Science, 2003.
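The PEVPM's central mechanism, Monte Carlo sampling of per-operation performance distributions, can be illustrated with a minimal sketch. Pearson type V is the inverse-gamma distribution; the shape and scale constants and the simple linear dependency chain below are illustrative assumptions, not values or structures from the thesis.

```python
import random
import statistics

def sample_pearson5(alpha, beta):
    # Pearson type V == inverse-gamma: if G ~ Gamma(alpha, scale 1),
    # then beta / G has an inverse-gamma law with shape alpha, scale beta.
    return beta / random.gammavariate(alpha, 1.0)

def simulate_run(n_messages, alpha, beta):
    # Total time of a chain of data-dependent sends: each event's submodel
    # is sampled from its performance distribution and the samples summed.
    return sum(sample_pearson5(alpha, beta) for _ in range(n_messages))

def monte_carlo(trials=1000, n_messages=50, alpha=3.0, beta=200.0):
    times = sorted(simulate_run(n_messages, alpha, beta) for _ in range(trials))
    return {"mean": statistics.mean(times), "p95": times[int(0.95 * len(times))]}

random.seed(42)
result = monte_carlo()  # summary of the predicted run-time distribution
```

Repeating the simulation many times yields a whole predicted distribution rather than a single point estimate, which is what lets this style of model capture contention-induced variability.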
|
374 |
Multithreaded virtual processor on DSM
An, Ho Seok 15 December 1999 (has links)
Modern superscalar processors exploit instruction-level parallelism (ILP) by
issuing multiple instructions in a single cycle because of increasing demand for higher
performance in computing. However, stalls due to cache misses severely degrade the
performance by disturbing the exploitation of ILP. Multiprocessors also greatly
exacerbate the memory latency problem. In SMPs, contention due to the shared bus
located between the processors' L2 caches and the shared memory adds additional delay
to the memory latency. In distributed shared memory (DSM) systems, the memory
latency problem becomes even more severe because a miss on the local memory requires
access to remote memory. This limits performance because the processor cannot
spend its time on useful work until the reply from the remote memory is received.
There are a number of techniques that effectively reduce the memory latency.
Multithreading has emerged as one of the most promising and exciting techniques to
tolerate memory latency. This thesis aims to realize a simulator that supports a software-controlled
multithreading environment on a distributed shared memory system and to show
preliminary simulation results. / Graduation date: 2000
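The remote-memory latency problem described in this abstract can be quantified with a textbook weighted-average access-time calculation. The cycle counts and miss fractions below are assumed for illustration, not measurements from the thesis.

```python
def effective_access_time(hit_rate, local_fraction, t_cache, t_local, t_remote):
    # Weighted average latency: hits are served by the cache; misses are
    # served either by local memory or, in a DSM, by a remote node's memory.
    miss_rate = 1.0 - hit_rate
    t_miss = local_fraction * t_local + (1.0 - local_fraction) * t_remote
    return hit_rate * t_cache + miss_rate * t_miss

# assumed numbers: 95% cache hits at 2 cycles; of the misses, 60% are
# satisfied locally (100 cycles) and 40% go to remote memory (1000 cycles)
avg = effective_access_time(0.95, 0.6, 2, 100, 1000)
```

Even a small remote-miss fraction dominates the average, which is why tolerating (rather than merely reducing) this latency is attractive on DSM systems.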
|
375 |
A performance study of multithreading
Kwak, Hantak 07 December 1998 (links)
As the performance gap between processor and memory grows, memory latency
will be a major bottleneck in achieving high processor utilization. Multithreading has
emerged as one of the most promising and exciting techniques used to tolerate memory
latency by exploiting thread-level parallelism. The question, however, remains as to how
effective multithreading is at tolerating memory latency. Due to the current availability
of powerful microprocessors, high-speed networks and software infrastructure systems,
a cost-effective parallel machine is often realized using a network of workstations.
Therefore, we examine the possibility and the effectiveness of using multithreading in a
networked computing environment. Also, we propose the Multithreaded Virtual Processor
model as a means of integrating the multithreaded programming paradigm with a modern
superscalar processor, with support for fast context switching and thread scheduling. In
order to validate our idea, a simulator was developed using a POSIX compliant Pthreads
package and a generic superscalar simulator called SimpleScalar, glued together with
support for multithreading. The simulator is a powerful workbench that enables us to
study how future superscalar design and thread management should be modified to better
support multithreading. Our studies with MVP show that, in general, the performance
improvement comes not only from tolerating memory latency, but also from
data sharing among threads. / Graduation date: 1999
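The question of how effectively multithreading tolerates memory latency is often framed with a standard analytic model of coarse-grained multithreading: each thread computes for a run length of cycles, then stalls for the memory latency, and switching threads costs a fixed overhead. This is a textbook sketch, not the MVP simulator from the thesis, and the cycle counts are assumed.

```python
def utilization(n_threads, run_len, switch_cost, latency):
    """Idealized coarse-grained multithreading: a thread runs for run_len
    cycles, then stalls for latency cycles; a context switch costs
    switch_cost cycles (a textbook model, not the thesis's simulator)."""
    R, C, L = run_len, switch_cost, latency
    if (n_threads - 1) * (R + C) >= L:
        return R / (R + C)            # saturation: latency fully hidden
    return n_threads * R / (R + L)    # linear: not enough threads yet

u1 = utilization(1, 20, 2, 100)  # single-threaded baseline
u4 = utilization(4, 20, 2, 100)  # more threads hide more latency
u8 = utilization(8, 20, 2, 100)  # saturated at R / (R + C)
```

In the saturated regime the context-switch cost, not the memory latency, becomes the ceiling on utilization, which is why fast context switching matters in designs like the MVP.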
|
376 |
Similarity-based real-time concurrency control protocols
Lai, Chih 29 January 1999 (links)
Serializability is unnecessarily strict for real-time systems because most transactions
in such systems occur periodically and changes among data values over a
few consecutive periods are often insignificant. Hence, data values produced within
a short interval can be treated as if they are "similar" and interchangeable. This
notion of similarity allows higher concurrency than serializability, and the increased
concurrency may help more transactions to meet their deadlines. The similarity stack
protocol (SSP) proposed in [25, 26] utilizes the concept of similarity. The rules of SSP
are constructed based on prior knowledge of worst-case execution time (WCET) and
data requirements of transactions. As a result, SSP rules need to be re-constructed
each time a real-time application is changed. Moreover, if the WCET and data requirements
of transactions are over-estimated, the benefits provided by similarity can be
quickly overshadowed, causing feasible schedules to be rejected.
The advantages of similarity and the drawbacks of SSP motivate us to design
other similarity-based protocols that can better utilize similarity without relying on
any prior information. Since optimistic approaches usually do not require prior information
of transactions, we explore the ideas of integrating optimistic approaches
with similarity in this thesis. We develop three different protocols based on either the
forward-validation or backward-validation mechanisms. We then compare implementation
overheads, number of transaction restarts, length of transaction blocking time,
and predictabilities of these protocols. One important characteristic of our design
is that, when similarity is not applicable, our protocols can still accept serializable
histories. We also study how to extend our protocols to handle aperiodic transactions
and data freshness in this thesis. Finally, a set of simulation experiments is conducted
to compare the deadline miss rates between SSP and one of our protocols. / Graduation date: 1999
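The idea of combining optimistic validation with similarity can be sketched in a few lines: a concurrent committed write forces a restart only when its value is not "similar" to the value the validating transaction read. This is a hypothetical simplification for illustration; the epsilon bound and the forward-validation shape are assumptions, not the thesis's actual protocols.

```python
EPSILON = 0.05  # assumed similarity bound; in practice derived per data item

def similar(a, b, eps=EPSILON):
    # values produced within a short interval are treated as interchangeable
    return abs(a - b) <= eps

def validate(read_set, committed_writes):
    """Forward-style optimistic validation sketch: a conflicting committed
    write causes a restart only when its value is NOT similar to the value
    the validating transaction read."""
    for item, value_read in read_set.items():
        if item in committed_writes and not similar(committed_writes[item], value_read):
            return False  # genuine conflict: restart the transaction
    return True           # similar or untouched data: safe to commit

commits = validate({"sensor": 20.00}, {"sensor": 20.04})   # similar write
restarts = validate({"sensor": 20.00}, {"sensor": 21.00})  # dissimilar write
```

Note that when no data are similar the check degenerates to ordinary conflict detection, mirroring the abstract's point that the protocols still accept serializable histories.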
|
377 |
Low-power high-performance 32-bit 0.5 μm CMOS adder
Shah, Parag Shantu 08 July 1998 (links)
Currently, the two most critical factors of microprocessor design are performance and power. The optimum balance of these two factors is reflected in the speed-power product (SPP). 32-bit CMOS adders are used as representative circuits to investigate a method of
reducing the SPP. The purpose of this thesis is to show that sizing gates according to fan-out and removing buffer drivers can reduce the SPP. This thesis presents a method for sizing gates in large fan-out parallel prefix circuits to reduce the SPP and compares it to
other methods. Three different parallel prefix adders are used to compare propagation delay and SPP. The first adder uses the depth-optimal prefix circuit. The second adder is based on Wei, Thompson, and Chen's time-optimal adder. The third adder uses a recursive doubling formation where all cells have minimum transistor width dimensions. The component cells in the adders are static CMOS as described by Brent and Kung. For all circuits, the smallest propagation delay occurs when the highest voltage supply is
applied. The smallest SPP occurs when the lowest voltage supply is applied, but with the lowest performance. The Recursive Doubling Adder always has the lowest propagation delay for a particular set of parameters. However, its SPP is nearly equal to that of the Brent-Kung Adder and lower than that of Wei's Adder. The power-frequency analysis reveals that a decrease in Vt causes higher power consumption due to leakage. / Graduation date: 1999
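The supply-voltage tradeoff reported in the abstract can be sketched with first-order models: alpha-power-law gate delay and dynamic switching power evaluated at the delay-limited frequency. The constants k, alpha and the unit capacitance are assumed, not values fitted to the thesis's 0.5 μm process.

```python
def gate_delay(vdd, vt, k=1.0, alpha=1.3):
    # Alpha-power-law delay model: delay grows as Vdd approaches Vt
    return k * vdd / (vdd - vt) ** alpha

def dynamic_power(vdd, freq, cap=1.0):
    # Switching power: P = C * Vdd^2 * f
    return cap * vdd ** 2 * freq

def speed_power_product(vdd, vt=0.5):
    d = gate_delay(vdd, vt)
    p = dynamic_power(vdd, freq=1.0 / d)  # run at the delay-limited frequency
    return d * p  # SPP = delay * power; here it reduces to energy per operation

spp_low = speed_power_product(1.8)   # lower supply: smaller SPP, but slower
spp_high = speed_power_product(3.3)  # higher supply: faster, but larger SPP
```

This matches the abstract's observation: the highest supply minimizes delay while the lowest supply minimizes SPP, at the cost of performance.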
|
378 |
Resource placement, data rearrangement, and Hamiltonian cycles in torus networks
Bae, Myung Mun 14 November 1996 (links)
Many parallel machines, both commercial and experimental, have been or are being designed with toroidal interconnection networks. For a given number of nodes, the torus has a relatively large diameter, but better cost/performance tradeoffs, such as higher channel bandwidth and lower node degree, when compared to the hypercube. Thus, the torus is becoming a popular topology for the interconnection networks of high-performance parallel computers.
In a multicomputer, the resources, such as I/O devices or software packages, are distributed over the network. The first part of the thesis investigates efficient methods of distributing resources in a torus network. Three classes of placement methods are studied. They are (1) the distance-t placement problem: in this case, any non-resource node is at a distance of at most t from some resource node; (2) the j-adjacency problem: here, a non-resource node is adjacent to at least j resource nodes; and (3) the generalized placement problem: a non-resource node must be at a distance of at most t from at least j resource nodes.
This resource placement technique can be applied to allocating spare processors to provide fault-tolerance in the case of processor failures. Some efficient
spare processor placement methods and reconfiguration schemes are also described.
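The generalized placement condition is easy to state as a brute-force check over a torus with wraparound (Lee) distance. The sketch below is illustrative only; the example uses a classical perfect distance-1 code on a 5x5 torus, not a placement taken from the thesis.

```python
from itertools import product

def torus_distance(a, b, dims):
    # Lee distance: shortest wraparound hop count, summed over dimensions
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def is_valid_placement(resources, dims, t=1, j=1):
    """Check the generalized placement condition: every non-resource node
    must lie within distance t of at least j resource nodes."""
    res = set(resources)
    for node in product(*(range(d) for d in dims)):
        if node in res:
            continue
        if sum(1 for r in res if torus_distance(node, r, dims) <= t) < j:
            return False
    return True

# a perfect distance-1 placement on a 5x5 torus: resources at the
# nodes satisfying x + 2y = 0 (mod 5), one per Lee sphere of radius 1
placement = [(x, y) for x in range(5) for y in range(5) if (x + 2 * y) % 5 == 0]
ok = is_valid_placement(placement, (5, 5), t=1, j=1)
```

Setting t=1 and j=1 recovers the distance-1 problem; t=1 with larger j gives the j-adjacency problem, so one predicate covers all three classes in the abstract.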
In a torus-based parallel system, some algorithms give best performance if the data are distributed to processors numbered in Cartesian order; in other cases, it is better to distribute the data to processors numbered in Gray code order. Since the placement patterns may be changed dynamically, it is essential to find efficient methods of rearranging the data from Gray code order to Cartesian order and vice versa. In the second part of the thesis, some efficient methods for such data transfers between Gray code and Cartesian order are developed.
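The two numbering schemes are related by the standard reflected binary Gray code, so the rearrangement is a fixed permutation of processor ranks. The sketch below shows the conversions and a sequential reordering; the thesis develops efficient parallel data-transfer algorithms for this, which the sketch does not attempt to reproduce.

```python
def binary_to_gray(n):
    # processor number in Cartesian (binary) order -> Gray code order
    return n ^ (n >> 1)

def gray_to_binary(g):
    # inverse transform: fold the higher-order bits back down
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def gray_to_cartesian(data):
    """Rearrange a list indexed by Gray-code processor number into
    Cartesian order (sequential illustration only)."""
    out = [None] * len(data)
    for gray_rank, item in enumerate(data):
        out[gray_to_binary(gray_rank)] = item
    return out

reordered = gray_to_cartesian(list(range(8)))
```

Because adjacent Gray codes differ in one bit, Gray-code numbering maps neighboring ranks to neighboring torus nodes, which is why some algorithms prefer it.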
The last part of the thesis gives results on generating edge disjoint Hamiltonian cycles in k-ary n-cubes, hypercubes, and 2D tori. These edge disjoint cycles are quite useful for many communication algorithms. / Graduation date: 1997
|
379 |
High-performance data-parallel input/output
Moore, Jason Andrew 19 July 1996 (links)
Existing parallel file systems are proving inadequate in two important arenas:
programmability and performance. Both of these inadequacies can largely be traced
to the fact that nearly all parallel file systems evolved from Unix and rely on a Unix-oriented,
single-stream, block-at-a-time approach to file I/O. This one-size-fits-all
approach to parallel file systems is inadequate for supporting applications running
on distributed-memory parallel computers.
This research provides a migration path away from the traditional approaches
to parallel I/O at two levels. At the level seen by the programmer, we show how
file operations can be closely integrated with the semantics of a parallel language.
Principles for this integration are illustrated in their application to C*, a virtual-processor-
oriented language. The result is that traditional C file operations with
familiar semantics can be used in C* at the level where the programmer works: the virtual
processor level. To facilitate high performance within this framework, machine-independent
modes are used. Modes change the performance of file operations,
not their semantics, so programmers need not use ambiguous operations found in
many parallel file systems. An automatic mode detection technique is presented
that saves the programmer from extra syntax and low-level file system details. This
mode detection system ensures that the most commonly encountered file operations
are performed using high-performance modes.
While the high-performance modes allow fast collective movement of file data,
they must include optimizations for redistribution of file data, a common operation
in production scientific code. This need is addressed at the file system level, where
we provide enhancements to Disk-Directed I/O for redistributing file data. Two
enhancements are geared to speeding fine-grained redistributions. One uses a two-phase,
or indirect, approach to redistributing data among compute nodes. The
other relies on I/O nodes to guide the redistribution by building packets bound for
compute nodes. We model the performance of these enhancements and determine
the key parameters determining when each approach should be used. Finally, we
introduce the notion of collective prefetching and identify its performance benefits
and implementation tradeoffs. / Graduation date: 1997
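The packet-building enhancement can be pictured with a toy redistribution: an I/O node scans the elements it holds and groups them into one packet per destination compute node, here for a shift from block to cyclic distribution. The function names and the distributions are illustrative assumptions, not the Disk-Directed I/O implementation.

```python
def cyclic_owner(i, p):
    # compute node owning element i under a cyclic distribution over p nodes
    return i % p

def build_packets(n, p):
    """An I/O node groups the n file elements it holds into packets bound
    for each of p compute nodes (sketch of I/O-node-guided redistribution)."""
    packets = {dest: [] for dest in range(p)}
    for i in range(n):
        packets[cyclic_owner(i, p)].append(i)
    return packets

packets = build_packets(n=8, p=2)
# each compute node then receives one packet instead of many small messages
```

Batching fine-grained elements into per-destination packets is exactly what makes fine-grained redistributions cheaper than sending each element as its own message.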
|
380 |
Evaluation of scheduling heuristics for non-identical parallel processors
Kuo, Chun-Ho 29 September 1994 (links)
An evaluation of scheduling heuristics for non-identical
parallel processors was performed. There has been
limited research that has focused on scheduling of parallel
processors. This research generalizes the results from
prior work in this area and examines complex scheduling
rules in terms of flow time, tardiness, and proportion of
tardy jobs. Several factors affecting the system were
examined and scheduling heuristics were developed. These
heuristics combine job allocation and job sequencing
functions. A number of system features were considered in
developing these heuristics, including setup times and
processor utilization spread. The heuristics used different
sequencing rules for job sequencing including random,
Shortest Processing Time (SPT), Earliest Due Date (EDD), and
Smallest Slack (SS).
A simulation model was developed and executed to study
the system. The results of the study show that the effect
of the number of machines, the number of products, system
loading, and setup times were significant for all
performance measures. The effect of number of machines was
also found to be significant on flow time and tardiness.
Several two-factor interactions were identified as
significant for flow time and tardiness.
The SPT-based heuristic resulted in minimum job flow
times. For tardiness and proportion of tardy jobs, the EDD-based
heuristic gave the best results. Based on these
conclusions, a "Hybrid" heuristic that combined SPT and EDD
considerations was developed to provide a tradeoff between
flow time and due-date-based measures. / Graduation date: 1995
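The SPT-versus-EDD tradeoff reported in the abstract can be demonstrated on a deliberately minimal single-processor example; the job data are assumed, and the sketch ignores the setup times and non-identical machines that the thesis actually studies.

```python
def schedule_metrics(jobs):
    """jobs: list of (processing_time, due_date) in dispatch order.
    Returns (total flow time, total tardiness) on one processor."""
    clock = flow = tardiness = 0
    for proc_time, due in jobs:
        clock += proc_time
        flow += clock
        tardiness += max(0, clock - due)
    return flow, tardiness

jobs = [(4, 5), (1, 9), (3, 4), (2, 12)]  # assumed example data
spt = schedule_metrics(sorted(jobs, key=lambda j: j[0]))  # Shortest Processing Time
edd = schedule_metrics(sorted(jobs, key=lambda j: j[1]))  # Earliest Due Date
```

Even on four jobs the pattern from the study appears: SPT yields the smaller total flow time, while EDD yields the smaller total tardiness, motivating a hybrid rule.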
|