371 |
Performance modelling of message-passing parallel programs
Grove, Duncan A. January 2003 (has links) (PDF)
This dissertation describes a new performance modelling system, called the Performance Evaluating Virtual Parallel Machine (PEVPM). It uses a novel bottom-up approach, where submodels of individual computation and communication events are dynamically constructed from data-dependencies, current contention levels and the performance distributions of low-level operations, which define performance variability in the face of contention.
|
372 |
Performance modelling of message-passing parallel programs
Grove, Duncan A. January 2003 (has links)
Electronic publication; full text available in PDF format; abstract in HTML format. This dissertation describes a new performance modelling system, called the Performance Evaluating Virtual Parallel Machine (PEVPM). It uses a novel bottom-up approach, where submodels of individual computation and communication events are dynamically constructed from data-dependencies, current contention levels and the performance distributions of low-level operations, which define performance variability in the face of contention. Electronic reproduction.[Australia] :Australian Digital Theses Program,2001. xvii, 295 p. : ill., charts (col.) ; 30 cm.
|
373 |
Performance Modelling of Message-Passing Parallel Programs
Grove, Duncan January 2003 (has links)
Parallel computing is essential for solving very large scientific and engineering problems. An effective parallel computing solution requires an appropriate parallel machine and a well-optimised parallel program, both of which can be selected via performance modelling. This dissertation describes a new performance modelling system, called the Performance Evaluating Virtual Parallel Machine (PEVPM). Unlike previous techniques, the PEVPM system is relatively easy to use, inexpensive to apply and extremely accurate. It uses a novel bottom-up approach, where submodels of individual computation and communication events are dynamically constructed from data-dependencies, current contention levels and the performance distributions of low-level operations, which define performance variability in the face of contention. During model evaluation, the performance distribution attached to each submodel is sampled using Monte Carlo techniques, thus simulating the effects of contention. This allows the PEVPM to accurately simulate a program's execution structure, even if it is non-deterministic, and thus to predict its performance. Obtaining these performance distributions required the development of a new benchmarking tool, called MPIBench. Unlike previous tools, which simply measure average message-passing time over a large number of repeated message transfers, MPIBench uses a highly accurate and globally synchronised clock to measure the performance of individual communication operations. MPIBench was used to benchmark three parallel computers, which encompassed a wide range of network performance capabilities, namely those provided by Fast Ethernet, Myrinet and QsNet. Network contention, a problem ignored by most research in this area, was found to cause extensive performance variation during message-passing operations. For point-to-point communication, this variation was best described by Pearson 5 distributions. 
Collective communication operations could be modelled using their constituent point-to-point operations. In cases of severe contention, extreme outliers were common in the observed performance distributions, which were shown to be the result of lost messages and their subsequent retransmit timeouts. The highly accurate benchmark results provided by MPIBench were coupled with the PEVPM models of a range of parallel programs, and simulated by the PEVPM. These case studies proved that, unlike previous modelling approaches, the PEVPM technique successfully unites generality, flexibility, cost-effectiveness and accuracy in one performance modelling system for parallel programs. This makes it a valuable tool for the development of parallel computing solutions. / Thesis (Ph.D.)--Computer Science, 2003.
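The PEVPM's central mechanism, Monte Carlo sampling of per-operation performance distributions, can be illustrated with a minimal sketch. Pearson type V is the inverse-gamma distribution; the shape and scale constants and the simple linear dependency chain below are illustrative assumptions, not values or structures from the thesis.

```python
import random
import statistics

def sample_pearson5(alpha, beta):
    # Pearson type V == inverse-gamma: if G ~ Gamma(alpha, scale 1),
    # then beta / G has an inverse-gamma law with shape alpha, scale beta.
    return beta / random.gammavariate(alpha, 1.0)

def simulate_run(n_messages, alpha, beta):
    # Total time of a chain of data-dependent sends: each event's submodel
    # is sampled from its performance distribution and the samples summed.
    return sum(sample_pearson5(alpha, beta) for _ in range(n_messages))

def monte_carlo(trials=1000, n_messages=50, alpha=3.0, beta=200.0):
    times = sorted(simulate_run(n_messages, alpha, beta) for _ in range(trials))
    return {"mean": statistics.mean(times), "p95": times[int(0.95 * len(times))]}

random.seed(42)
result = monte_carlo()  # summary of the predicted run-time distribution
```

Repeating the simulation many times yields a whole predicted distribution rather than a single point estimate, which is what lets this style of model capture contention-induced variability.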
|
374 |
Multithreaded virtual processor on DSM
An, Ho Seok 15 December 1999 (has links)
Modern superscalar processors exploit instruction-level parallelism (ILP) by
issuing multiple instructions in a single cycle because of increasing demand for higher
performance in computing. However, stalls due to cache misses severely degrade the
performance by disturbing the exploitation of ILP. Multiprocessors also greatly
exacerbate the memory latency problem. In SMPs, contention due to the shared bus
located between the processors' L2 caches and the shared memory adds additional delay
to the memory latency. In distributed shared memory (DSM) systems, the memory
latency problem becomes even more severe because a miss on the local memory requires
access to remote memory. This limits performance because the processor cannot
spend its time on useful work until the reply from the remote memory is received.
There are a number of techniques that effectively reduce the memory latency.
Multithreading has emerged as one of the most promising and exciting techniques to
tolerate memory latency. This thesis aims to realize a simulator that supports a software-controlled
multithreading environment on a distributed shared memory system and to show
preliminary simulation results. / Graduation date: 2000
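The remote-memory latency problem described in this abstract can be quantified with a textbook weighted-average access-time calculation. The cycle counts and miss fractions below are assumed for illustration, not measurements from the thesis.

```python
def effective_access_time(hit_rate, local_fraction, t_cache, t_local, t_remote):
    # Weighted average latency: hits are served by the cache; misses are
    # served either by local memory or, in a DSM, by a remote node's memory.
    miss_rate = 1.0 - hit_rate
    t_miss = local_fraction * t_local + (1.0 - local_fraction) * t_remote
    return hit_rate * t_cache + miss_rate * t_miss

# assumed numbers: 95% cache hits at 2 cycles; of the misses, 60% are
# satisfied locally (100 cycles) and 40% go to remote memory (1000 cycles)
avg = effective_access_time(0.95, 0.6, 2, 100, 1000)
```

Even a small remote-miss fraction dominates the average, which is why tolerating (rather than merely reducing) this latency is attractive on DSM systems.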
|
375 |
A performance study of multithreading
Kwak, Hantak 07 December 1998 (links)
As the performance gap between processor and memory grows, memory latency
will be a major bottleneck in achieving high processor utilization. Multithreading has
emerged as one of the most promising and exciting techniques used to tolerate memory
latency by exploiting thread-level parallelism. The question, however, remains as to how
effective multithreading is at tolerating memory latency. Due to the current availability
of powerful microprocessors, high-speed networks and software infrastructure systems,
a cost-effective parallel machine is often realized using a network of workstations.
Therefore, we examine the possibility and the effectiveness of using multithreading in a
networked computing environment. Also, we propose the Multithreaded Virtual Processor
model as a means of integrating the multithreaded programming paradigm with a modern
superscalar processor, with support for fast context switching and thread scheduling. In
order to validate our idea, a simulator was developed using a POSIX compliant Pthreads
package and a generic superscalar simulator called SimpleScalar, glued together with
support for multithreading. The simulator is a powerful workbench that enables us to
study how future superscalar design and thread management should be modified to better
support multithreading. Our studies with MVP show that, in general, the performance
improvement comes not only from tolerating memory latency, but also from
data sharing among threads. / Graduation date: 1999
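The question of how effectively multithreading tolerates memory latency is often framed with a standard analytic model of coarse-grained multithreading: each thread computes for a run length of cycles, then stalls for the memory latency, and switching threads costs a fixed overhead. This is a textbook sketch, not the MVP simulator from the thesis, and the cycle counts are assumed.

```python
def utilization(n_threads, run_len, switch_cost, latency):
    """Idealized coarse-grained multithreading: a thread runs for run_len
    cycles, then stalls for latency cycles; a context switch costs
    switch_cost cycles (a textbook model, not the thesis's simulator)."""
    R, C, L = run_len, switch_cost, latency
    if (n_threads - 1) * (R + C) >= L:
        return R / (R + C)            # saturation: latency fully hidden
    return n_threads * R / (R + L)    # linear: not enough threads yet

u1 = utilization(1, 20, 2, 100)  # single-threaded baseline
u4 = utilization(4, 20, 2, 100)  # more threads hide more latency
u8 = utilization(8, 20, 2, 100)  # saturated at R / (R + C)
```

In the saturated regime the context-switch cost, not the memory latency, becomes the ceiling on utilization, which is why fast context switching matters in designs like the MVP.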
|
376 |
Similarity-based real-time concurrency control protocols
Lai, Chih 29 January 1999 (links)
Serializability is unnecessarily strict for real-time systems because most transactions
in such systems occur periodically and changes among data values over a
few consecutive periods are often insignificant. Hence, data values produced within
a short interval can be treated as if they are "similar" and interchangeable. This
notion of similarity allows higher concurrency than serializability, and the increased
concurrency may help more transactions to meet their deadlines. The similarity stack
protocol (SSP) proposed in [25, 26] utilizes the concept of similarity. The rules of SSP
are constructed based on prior knowledge of worst-case execution time (WCET) and
data requirements of transactions. As a result, SSP rules need to be re-constructed
each time a real-time application is changed. Moreover, if the WCET and data requirements
of transactions are over-estimated, the benefits provided by similarity can be
quickly overshadowed, causing feasible schedules to be rejected.
The advantages of similarity and the drawbacks of SSP motivate us to design
other similarity-based protocols that can better utilize similarity without relying on
any prior information. Since optimistic approaches usually do not require prior information
of transactions, we explore the ideas of integrating optimistic approaches
with similarity in this thesis. We develop three different protocols based on either the
forward-validation or backward-validation mechanisms. We then compare implementation
overheads, number of transaction restarts, length of transaction blocking time,
and predictabilities of these protocols. One important characteristic of our design
is that, when similarity is not applicable, our protocols can still accept serializable
histories. We also study how to extend our protocols to handle aperiodic transactions
and data freshness in this thesis. Finally, a set of simulation experiments is conducted
to compare the deadline miss rates between SSP and one of our protocols. / Graduation date: 1999
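The idea of combining optimistic validation with similarity can be sketched in a few lines: a concurrent committed write forces a restart only when its value is not "similar" to the value the validating transaction read. This is a hypothetical simplification for illustration; the epsilon bound and the forward-validation shape are assumptions, not the thesis's actual protocols.

```python
EPSILON = 0.05  # assumed similarity bound; in practice derived per data item

def similar(a, b, eps=EPSILON):
    # values produced within a short interval are treated as interchangeable
    return abs(a - b) <= eps

def validate(read_set, committed_writes):
    """Forward-style optimistic validation sketch: a conflicting committed
    write causes a restart only when its value is NOT similar to the value
    the validating transaction read."""
    for item, value_read in read_set.items():
        if item in committed_writes and not similar(committed_writes[item], value_read):
            return False  # genuine conflict: restart the transaction
    return True           # similar or untouched data: safe to commit

commits = validate({"sensor": 20.00}, {"sensor": 20.04})   # similar write
restarts = validate({"sensor": 20.00}, {"sensor": 21.00})  # dissimilar write
```

Note that when no data are similar the check degenerates to ordinary conflict detection, mirroring the abstract's point that the protocols still accept serializable histories.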
|
377 |
Low-power high-performance 32-bit 0.5 μm CMOS adder
Shah, Parag Shantu 08 July 1998 (links)
Currently, the two most critical factors of microprocessor design are performance and power. The optimum balance of these two factors is reflected in the speed-power product (SPP). 32-bit CMOS adders are used as representative circuits to investigate a method of
reducing the SPP. The purpose of this thesis is to show that sizing gates according to fan-out and removing buffer drivers can reduce the SPP. This thesis presents a method for sizing gates in large fan-out parallel prefix circuits to reduce the SPP and compares it to
other methods. Three different parallel prefix adders are used to compare propagation delay and SPP. The first adder uses the depth-optimal prefix circuit. The second adder is based on Wei, Thompson, and Chen's time-optimal adder. The third adder uses a recursive doubling formation where all cells have minimum transistor width dimensions. The component cells in the adders are static CMOS as described by Brent and Kung. For all circuits, the smallest propagation delay occurs when the highest voltage supply is
applied. The smallest SPP occurs when the lowest voltage supply is applied, but with the lowest performance. The Recursive Doubling Adder always has the lowest propagation delay for a particular set of parameters. However, its SPP is nearly equal to that of the Brent-Kung Adder and lower than that of Wei's Adder. The power-frequency analysis reveals that a decrease in Vt causes higher power consumption due to leakage. / Graduation date: 1999
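The supply-voltage tradeoff reported in the abstract can be sketched with first-order models: alpha-power-law gate delay and dynamic switching power evaluated at the delay-limited frequency. The constants k, alpha and the unit capacitance are assumed, not values fitted to the thesis's 0.5 μm process.

```python
def gate_delay(vdd, vt, k=1.0, alpha=1.3):
    # Alpha-power-law delay model: delay grows as Vdd approaches Vt
    return k * vdd / (vdd - vt) ** alpha

def dynamic_power(vdd, freq, cap=1.0):
    # Switching power: P = C * Vdd^2 * f
    return cap * vdd ** 2 * freq

def speed_power_product(vdd, vt=0.5):
    d = gate_delay(vdd, vt)
    p = dynamic_power(vdd, freq=1.0 / d)  # run at the delay-limited frequency
    return d * p  # SPP = delay * power; here it reduces to energy per operation

spp_low = speed_power_product(1.8)   # lower supply: smaller SPP, but slower
spp_high = speed_power_product(3.3)  # higher supply: faster, but larger SPP
```

This matches the abstract's observation: the highest supply minimizes delay while the lowest supply minimizes SPP, at the cost of performance.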
|
378 |
Resource placement, data rearrangement, and Hamiltonian cycles in torus networks
Bae, Myung Mun 14 November 1996 (links)
Many parallel machines, both commercial and experimental, have been or are being designed with toroidal interconnection networks. For a given number of nodes, the torus has a relatively large diameter, but better cost/performance tradeoffs, such as higher channel bandwidth and lower node degree, when compared to the hypercube. Thus, the torus is becoming a popular topology for the interconnection networks of high-performance parallel computers.
In a multicomputer, the resources, such as I/O devices or software packages, are distributed over the network. The first part of the thesis investigates efficient methods of distributing resources in a torus network. Three classes of placement methods are studied. They are (1) the distance-t placement problem: in this case, any non-resource node is at a distance of at most t from some resource node; (2) the j-adjacency problem: here, a non-resource node is adjacent to at least j resource nodes; and (3) the generalized placement problem: a non-resource node must be at a distance of at most t from at least j resource nodes.
This resource placement technique can be applied to allocating spare processors to provide fault-tolerance in the case of processor failures. Some efficient
spare processor placement methods and reconfiguration schemes are also described.
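The generalized placement condition is easy to state as a brute-force check over a torus with wraparound (Lee) distance. The sketch below is illustrative only; the example uses a classical perfect distance-1 code on a 5x5 torus, not a placement taken from the thesis.

```python
from itertools import product

def torus_distance(a, b, dims):
    # Lee distance: shortest wraparound hop count, summed over dimensions
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def is_valid_placement(resources, dims, t=1, j=1):
    """Check the generalized placement condition: every non-resource node
    must lie within distance t of at least j resource nodes."""
    res = set(resources)
    for node in product(*(range(d) for d in dims)):
        if node in res:
            continue
        if sum(1 for r in res if torus_distance(node, r, dims) <= t) < j:
            return False
    return True

# a perfect distance-1 placement on a 5x5 torus: resources at the
# nodes satisfying x + 2y = 0 (mod 5), one per Lee sphere of radius 1
placement = [(x, y) for x in range(5) for y in range(5) if (x + 2 * y) % 5 == 0]
ok = is_valid_placement(placement, (5, 5), t=1, j=1)
```

Setting t=1 and j=1 recovers the distance-1 problem; t=1 with larger j gives the j-adjacency problem, so one predicate covers all three classes in the abstract.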
In a torus-based parallel system, some algorithms give best performance if the data are distributed to processors numbered in Cartesian order; in other cases, it is better to distribute the data to processors numbered in Gray code order. Since the placement patterns may be changed dynamically, it is essential to find efficient methods of rearranging the data from Gray code order to Cartesian order and vice versa. In the second part of the thesis, some efficient methods for such data transfers between Gray code and Cartesian order are developed.
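The two numbering schemes are related by the standard reflected binary Gray code, so the rearrangement is a fixed permutation of processor ranks. The sketch below shows the conversions and a sequential reordering; the thesis develops efficient parallel data-transfer algorithms for this, which the sketch does not attempt to reproduce.

```python
def binary_to_gray(n):
    # processor number in Cartesian (binary) order -> Gray code order
    return n ^ (n >> 1)

def gray_to_binary(g):
    # inverse transform: fold the higher-order bits back down
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def gray_to_cartesian(data):
    """Rearrange a list indexed by Gray-code processor number into
    Cartesian order (sequential illustration only)."""
    out = [None] * len(data)
    for gray_rank, item in enumerate(data):
        out[gray_to_binary(gray_rank)] = item
    return out

reordered = gray_to_cartesian(list(range(8)))
```

Because adjacent Gray codes differ in one bit, Gray-code numbering maps neighboring ranks to neighboring torus nodes, which is why some algorithms prefer it.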
The last part of the thesis gives results on generating edge disjoint Hamiltonian cycles in k-ary n-cubes, hypercubes, and 2D tori. These edge disjoint cycles are quite useful for many communication algorithms. / Graduation date: 1997
|
379 |
High-performance data-parallel input/output
Moore, Jason Andrew 19 July 1996 (links)
Existing parallel file systems are proving inadequate in two important arenas:
programmability and performance. Both of these inadequacies can largely be traced
to the fact that nearly all parallel file systems evolved from Unix and rely on a Unix-oriented,
single-stream, block-at-a-time approach to file I/O. This one-size-fits-all
approach to parallel file systems is inadequate for supporting applications running
on distributed-memory parallel computers.
This research provides a migration path away from the traditional approaches
to parallel I/O at two levels. At the level seen by the programmer, we show how
file operations can be closely integrated with the semantics of a parallel language.
Principles for this integration are illustrated in their application to C*, a virtual-processor-
oriented language. The result is that traditional C file operations with
familiar semantics can be used in C* at the level where the programmer works: the virtual
processor level. To facilitate high performance within this framework, machine-independent
modes are used. Modes change the performance of file operations,
not their semantics, so programmers need not use ambiguous operations found in
many parallel file systems. An automatic mode detection technique is presented
that saves the programmer from extra syntax and low-level file system details. This
mode detection system ensures that the most commonly encountered file operations
are performed using high-performance modes.
While the high-performance modes allow fast collective movement of file data,
they must include optimizations for redistribution of file data, a common operation
in production scientific code. This need is addressed at the file system level, where
we provide enhancements to Disk-Directed I/O for redistributing file data. Two
enhancements are geared to speeding fine-grained redistributions. One uses a two-phase,
or indirect, approach to redistributing data among compute nodes. The
other relies on I/O nodes to guide the redistribution by building packets bound for
compute nodes. We model the performance of these enhancements and determine
the key parameters determining when each approach should be used. Finally, we
introduce the notion of collective prefetching and identify its performance benefits
and implementation tradeoffs. / Graduation date: 1997
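The packet-building enhancement can be pictured with a toy redistribution: an I/O node scans the elements it holds and groups them into one packet per destination compute node, here for a shift from block to cyclic distribution. The function names and the distributions are illustrative assumptions, not the Disk-Directed I/O implementation.

```python
def cyclic_owner(i, p):
    # compute node owning element i under a cyclic distribution over p nodes
    return i % p

def build_packets(n, p):
    """An I/O node groups the n file elements it holds into packets bound
    for each of p compute nodes (sketch of I/O-node-guided redistribution)."""
    packets = {dest: [] for dest in range(p)}
    for i in range(n):
        packets[cyclic_owner(i, p)].append(i)
    return packets

packets = build_packets(n=8, p=2)
# each compute node then receives one packet instead of many small messages
```

Batching fine-grained elements into per-destination packets is exactly what makes fine-grained redistributions cheaper than sending each element as its own message.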
|
380 |
Evaluation of scheduling heuristics for non-identical parallel processors
Kuo, Chun-Ho 29 September 1994 (links)
An evaluation of scheduling heuristics for non-identical
parallel processors was performed. There has been
limited research that has focused on scheduling of parallel
processors. This research generalizes the results from
prior work in this area and examines complex scheduling
rules in terms of flow time, tardiness, and proportion of
tardy jobs. Several factors affecting the system were
examined and scheduling heuristics were developed. These
heuristics combine job allocation and job sequencing
functions. A number of system features were considered in
developing these heuristics, including setup times and
processor utilization spread. The heuristics used different
sequencing rules for job sequencing including random,
Shortest Processing Time (SPT), Earliest Due Date (EDD), and
Smallest Slack (SS).
A simulation model was developed and executed to study
the system. The results of the study show that the effect
of the number of machines, the number of products, system
loading, and setup times were significant for all
performance measures. The effect of number of machines was
also found to be significant on flow time and tardiness.
Several two-factor interactions were identified as
significant for flow time and tardiness.
The SPT-based heuristic resulted in minimum job flow
times. For tardiness and proportion of tardy jobs, the EDD-based
heuristic gave the best results. Based on these
conclusions, a "Hybrid" heuristic that combined SPT and EDD
considerations was developed to provide a tradeoff between
flow time and due-date-based measures. / Graduation date: 1995
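The SPT-versus-EDD tradeoff reported in the abstract can be demonstrated on a deliberately minimal single-processor example; the job data are assumed, and the sketch ignores the setup times and non-identical machines that the thesis actually studies.

```python
def schedule_metrics(jobs):
    """jobs: list of (processing_time, due_date) in dispatch order.
    Returns (total flow time, total tardiness) on one processor."""
    clock = flow = tardiness = 0
    for proc_time, due in jobs:
        clock += proc_time
        flow += clock
        tardiness += max(0, clock - due)
    return flow, tardiness

jobs = [(4, 5), (1, 9), (3, 4), (2, 12)]  # assumed example data
spt = schedule_metrics(sorted(jobs, key=lambda j: j[0]))  # Shortest Processing Time
edd = schedule_metrics(sorted(jobs, key=lambda j: j[1]))  # Earliest Due Date
```

Even on four jobs the pattern from the study appears: SPT yields the smaller total flow time, while EDD yields the smaller total tardiness, motivating a hybrid rule.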
|