21

Fast Multipole-Based Elliptic PDE Solver and Preconditioner

Ibeid, Huda 07 December 2016 (has links)
Exascale systems are predicted to have approximately one billion cores, assuming gigahertz cores. Limitations on affordable network topologies for distributed memory systems of such massive scale bring new challenges to the currently dominant parallel programming model. Currently, there are many efforts to evaluate the hardware and software bottlenecks of exascale designs. It is therefore of interest to model application performance and to understand what changes need to be made to ensure extrapolated scalability. Fast multipole methods (FMM) were originally developed for accelerating N-body problems for particle-based methods in astrophysics and molecular dynamics. FMM is more than an N-body solver, however. Recent efforts to view the FMM as an elliptic PDE solver have opened the possibility to use it as a preconditioner for an even broader range of applications. In this thesis, we (i) discuss the challenges for FMM on current parallel computers and future exascale architectures, with a focus on inter-node communication, and develop a performance model that considers the communication patterns of the FMM for spatially quasi-uniform distributions, (ii) employ this performance model to guide performance and scaling improvement of FMM for all-atom molecular dynamics simulations of uniformly distributed particles, and (iii) demonstrate that, beyond its traditional use as a solver in problems for which explicit free-space kernel representations are available, the FMM has applicability as a preconditioner in finite domain elliptic boundary value problems, by equipping it with boundary integral capability for satisfying conditions at finite boundaries and by wrapping it in a Krylov method for extensibility to more general operators. Compared with multilevel methods, FMM is capable of comparable algebraic convergence rates down to the truncation error of the discretized PDE, and it has superior multicore and distributed memory scalability properties on commodity architecture supercomputers. Compared with other methods exploiting the low-rank character of off-diagonal blocks of the dense resolvent operator, FMM-preconditioned Krylov iteration may reduce the amount of communication because it is matrix-free and exploits the tree structure of FMM. Fast multipole-based solvers and preconditioners are demonstrably poised to play a leading role in exascale computing.
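To make the preconditioning idea concrete, here is a minimal sketch of a Krylov solve with a free-space-kernel preconditioner, in the spirit of the approach the abstract describes. The 1-D Poisson setup, the dense Green's-function matrix (a real FMM applies the same kernel matrix-free in O(N)), and the omission of the thesis's boundary-integral correction are all simplifying assumptions for illustration, not the thesis's implementation.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, gmres

n = 200
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)

# Discretized elliptic operator: -u'' with homogeneous Dirichlet BCs.
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2

# Free-space Green's function of -d^2/dx^2 in 1-D: G(x, y) = -|x - y| / 2.
# Dense here for illustration only; the finite-boundary correction is omitted.
G = -0.5 * np.abs(x[:, None] - x[None, :])

def greens_apply(r):
    return (G @ r) * h  # quadrature-weighted kernel application

M = LinearOperator((n, n), matvec=greens_apply)

b = np.sin(np.pi * x)                 # right-hand side f(x) = sin(pi x)
u, info = gmres(A, b, M=M, atol=1e-10)
print(info, np.abs(u - b / np.pi**2).max())  # exact solution: sin(pi x) / pi^2
```

Because the free-space kernel differs from the Dirichlet inverse only by a low-rank boundary term, the preconditioned iteration converges in a handful of steps; exploiting that effect at scale, with the boundary handled by a boundary integral, is what the thesis develops.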
22

Characterization and Exploitation of GPU Memory Systems

Lee, Kenneth Sydney 25 October 2012 (has links)
Graphics Processing Units (GPUs) are workhorses of modern high-performance computing due to their ability to achieve massive speedups on parallel applications. The massive number of threads that can be run concurrently on these systems allows applications with data-parallel computations to achieve better performance when compared to traditional CPU systems. However, the GPU is not perfect for all types of computation. The massively parallel SIMT architecture of the GPU can still be constraining in terms of achievable performance. GPU-based systems will typically only be able to achieve between 40% and 60% of their peak performance. One of the major problems affecting this efficiency is the GPU memory system, which is tailored to the needs of graphics workloads instead of general-purpose computation. This thesis intends to show the importance of memory optimizations for GPU systems. In particular, this work addresses problems of data transfer and global atomic memory contention. Using the novel AMD Fusion architecture, we gain overall performance improvements over discrete GPU systems for data-intensive applications. The fused architecture systems offer an interesting trade-off by increasing data transfer rates at the cost of some raw computational power. We characterize the performance of different memory paths that are possible because of the shared memory space present on the fused architecture. In addition, we provide a theoretical model which can be used to correctly predict the comparative performance of memory movement techniques for a given data-intensive application and system. In terms of global atomic memory contention, we show improvements in scalability and performance for global synchronization primitives by avoiding contentious global atomic memory accesses. In general, this work shows the importance of understanding the memory system of the GPU architecture to achieve better application performance. / Master of Science
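As a rough illustration of the kind of comparative model the abstract mentions, the sketch below scores memory paths with a hypothetical linear transfer-cost model; the bandwidths, latency, and compute-slowdown factors are invented stand-ins, not the thesis's measured model.

```python
# Hedged sketch: choose a memory path on a fused (shared-memory) vs.
# discrete GPU system using an assumed linear cost model.

def transfer_time(bytes_moved, bandwidth_gbs, latency_s=10e-6):
    """Time = fixed latency + size / bandwidth."""
    return latency_s + bytes_moved / (bandwidth_gbs * 1e9)

def best_path(bytes_moved, compute_time_s, paths):
    """Pick the path minimizing transfer + compute time.

    `paths` maps a path name to (bandwidth_GB/s, compute_slowdown),
    modeling the fused trade-off: faster transfers, less raw compute.
    """
    costs = {
        name: transfer_time(bytes_moved, bw) + compute_time_s * slowdown
        for name, (bw, slowdown) in paths.items()
    }
    return min(costs, key=costs.get), costs

paths = {
    "discrete_pcie": (8.0, 1.0),   # PCIe-attached GPU, full compute rate
    "fused_shared":  (20.0, 1.6),  # shared memory space, slower compute
}
# A data-intensive kernel: 1 GiB moved, 5 ms of compute on the discrete GPU.
print(best_path(2**30, 0.005, paths))
```

With these (assumed) numbers the fused path wins whenever the workload is transfer-dominated, which matches the trade-off the abstract describes.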
23

Environmental and Statistical Performance Mapping Model for Underwater Acoustic Detection Systems

McDowell, Pamela 14 May 2010 (has links)
This manuscript describes a methodology to combine environmental models, acoustic signal predictions, statistical detection models and operations research to form a framework for calculating and communicating performance. This methodology has been applied to undersea target detection systems and has come to be known as Performance Surface modeling. The term Performance Surface refers to a geo-spatial representation of the predicted performance of one or more sensors constrained by all-source forecasts for a geophysical area of operations. Recent improvements in ocean, atmospheric and underwater acoustic models, along with advances in parallel computing, provide an opportunity to forecast the effects of a complex and dynamic acoustic environment on undersea target detection system performance. This manuscript describes a new process that calculates performance in a straightforward "sonar-equation" manner utilizing spatially complex and temporally dynamic environmental models. This performance model is constructed by joining environmental acoustic signal predictions with a detection model to form a probabilistic prediction which is then combined with probabilities of target location to produce conditional, joint and marginal probabilities. These joint and marginal probabilities become the scalar estimates of system performance. This manuscript contains two invited articles recently accepted for publication. The first article describes the Performance Surface model development with sections on current applications and future extensions to a more stochastic model. The second article is written from the operational perspective of a Naval commanding officer with co-authors from the active force. Performance Surface tools have been demonstrated at the Naval Oceanographic Office (NAVOCEANO) and the Naval Oceanographic Anti-Submarine Warfare (ASW) Center (NOAC) in support of recent naval exercises. The model has also recently been a major representation for the "performance" layer of the Naval Meteorological and Oceanographic Command (NAVMETOCCOM) in its Battlespace on Demand strategy for supporting the Fleet with oceanographic products.
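The probability combination at the heart of the Performance Surface can be sketched in a few lines: a grid of conditional detection probabilities is joined with a target-location prior to produce joint probabilities and a scalar marginal. The detection field and prior below are synthetic placeholders, not operational products.

```python
import numpy as np

nx, ny = 50, 50
# Conditional detection probability per grid cell, P(detect | target at cell),
# standing in for a sonar-equation calculation over the forecast environment.
yy, xx = np.mgrid[0:ny, 0:nx]
p_det_given_loc = np.exp(-((xx - 25) ** 2 + (yy - 25) ** 2) / 300.0)

# Prior probability of target location, normalized over the grid.
p_loc = np.ones((ny, nx))
p_loc /= p_loc.sum()

# Joint probability per cell, and the scalar (marginal) performance estimate:
# P(detect) = sum over cells of P(detect | loc) * P(loc).
p_joint = p_det_given_loc * p_loc
p_detect = p_joint.sum()
print(f"marginal probability of detection: {p_detect:.3f}")
```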
24

Performance Analysis and Modeling of Parallel Applications in the Context of Architectural Rooflines

Shaila, Nashid 27 October 2016 (has links)
Understanding the performance of applications on modern multi- and manycore platforms is a difficult task and involves complex measurement, analysis, and modeling. The Roofline model is used to assess an application's performance on a given architecture. Not much work has been done with the Roofline model using real measurements. Because it can be a very useful tool for understanding application performance on a given architecture, in this thesis we demonstrate the use of architectural roofline data together with measured data for analyzing the performance of different benchmarks. We first explain how to use different toolkits to measure the performance of a program. Next, these data are used to generate the roofline plots, based on which we can decide how to make the application more efficient and remove bottlenecks. Our results show that this can be a powerful tool for analyzing the performance of applications over different architectures and different code versions.
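For reference, the Roofline bound itself is a one-line model: attainable performance is the minimum of peak compute throughput and arithmetic intensity times peak memory bandwidth. The sketch below uses illustrative machine numbers, not measured ones.

```python
def roofline(arith_intensity_flops_per_byte,
             peak_gflops=500.0, peak_bw_gbs=50.0):
    # Attainable GFLOP/s = min(peak compute, AI * peak bandwidth).
    return min(peak_gflops,
               arith_intensity_flops_per_byte * peak_bw_gbs)

# Below the ridge point (500/50 = 10 FLOP/B) a kernel is bandwidth-bound;
# above it, compute-bound.
for ai in (0.5, 2.0, 16.0):
    print(f"AI={ai:5.1f} FLOP/B -> attainable {roofline(ai):6.1f} GFLOP/s")
```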
25

Run-time optimization of adaptive irregular applications

Yu, Hao 15 November 2004 (has links)
Compared to traditional compile-time optimization, run-time optimization could offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns with the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data. In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we identified a small set of machine-independent, high-level characterization parameters and then deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Specifically, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems.
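A minimal sketch of the adaptive selection idea follows: characterize the input at run time, consult a model built off-line, and dispatch one of several functionally equivalent reduction variants. The single characterization parameter and threshold below are invented stand-ins for the dissertation's machine-independent parameters and generated prediction models.

```python
import numpy as np

def reduce_replicated_buffer(indices, values, n_bins):
    # Stand-in for the replicated-buffer parallel reduction.
    out = np.zeros(n_bins)
    np.add.at(out, indices, values)
    return out

def reduce_sorted(indices, values, n_bins):
    # Functionally equivalent variant that first orders the accesses.
    order = np.argsort(indices)
    return np.bincount(indices[order], weights=values[order],
                       minlength=n_bins)

def characterize(indices, n_bins):
    # Machine-independent input characteristic: how densely the
    # reduction's access pattern covers the output.
    return len(np.unique(indices)) / n_bins

def select_reduction(indices, values, n_bins, density_threshold=0.3):
    # The threshold plays the role of an off-line-trained prediction model.
    if characterize(indices, n_bins) < density_threshold:
        return reduce_sorted(indices, values, n_bins)
    return reduce_replicated_buffer(indices, values, n_bins)

idx = np.random.randint(0, 10_000, size=100_000)
val = np.random.rand(100_000)
result = select_reduction(idx, val, 10_000)
```

Both variants compute the same reduction; only their run-time cost differs with the access pattern, which is exactly what makes run-time selection pay off.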
26

E-AMOM: An Energy-Aware Modeling and Optimization Methodology for Scientific Applications on Multicore Systems

Lively, Charles May 2012 (has links)
Power consumption is an important constraint in achieving efficient execution on high-performance computing multicore systems. As the number of cores available on a chip continues to increase, the importance of power consumption will continue to grow. In order to achieve improved performance on multicore systems, scientific applications must make use of efficient methods for reducing power consumption and must further be refined to achieve reduced execution time. In this dissertation, we introduce a performance modeling framework, E-AMOM, to enable improved execution of scientific applications on parallel multicore systems with regard to a limited power budget. We develop models for each application based upon performance hardware counters. Our models utilize different performance counters for each application and for each performance component (runtime, system power consumption, CPU power consumption, and memory power consumption), selected via our performance-tuned principal component analysis method. Models developed through E-AMOM provide insight into the performance characteristics of each application that affect performance for each component on a parallel multicore system. Our models are more than 92% accurate across both hybrid (MPI/OpenMP) and MPI implementations for six scientific applications. E-AMOM includes an optimization component that utilizes our models to employ run-time Dynamic Voltage and Frequency Scaling (DVFS) and Dynamic Concurrency Throttling to reduce the power consumption of the scientific applications. Further, we optimize our applications based upon insights provided by the performance models to reduce the runtime of the applications. Our methods and techniques are able to save up to 18% in energy consumption for hybrid (MPI/OpenMP) and MPI scientific applications and reduce the runtime of the applications by up to 11% on parallel multicore systems.
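The modeling step can be sketched as a counter-reduction-plus-regression pipeline: reduce the hardware counters with principal component analysis, then regress the retained components against a target such as CPU power. The scikit-learn pipeline, synthetic counter data, and target below are illustrative assumptions, not E-AMOM's performance-tuned procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Rows: training runs; columns: hardware counters (cache misses,
# instructions retired, stalls, ...). Synthetic stand-in data.
counters = rng.random((40, 8))
cpu_power_watts = counters @ rng.random(8) * 10 + 50  # synthetic target

# PCA selects a small set of components; regression maps them to power.
model = make_pipeline(PCA(n_components=3), LinearRegression())
model.fit(counters, cpu_power_watts)
predicted = model.predict(counters[:5])
```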
27

Validation and integration of a rubber engine model into an MDO environment

Wemming, Hannes January 2010 (has links)
Multidisciplinary design optimization (MDO) is a technique that has found use in the field of aerospace engineering for aircraft design. It uses optimization to simultaneously solve design problems involving several disciplines. In order to predict aircraft performance, an engine performance simulation model, also called a "rubber engine", is vital. The goal of this project is to validate and integrate a rubber engine model into an MDO environment. A method for computer simulation of gas turbine aero engine performance was created. GasTurb v11, a commercial gas turbine performance simulation software package, was selected for building the simulation models. The method was validated by applying it to five jet engines of different size, type, and age. It was shown that the simulation model results are close to the engine manufacturer data in terms of SFC and net thrust at the cruise, maximum climb (MCL), and take-off (MTO) thrust ratings. The cruise, take-off, and climb SFC was in general predicted to within 2% error when compared to engine manufacturer performance data. The take-off and climb net thrust was in general predicted with less than 5% error. The integration of the rubber engine model with the MDO framework was started, and it was demonstrated that the model can run within the MDO software. Four different jet engine models have been prepared for use within the optimization software. The main conclusion is that GasTurb v11 can be used to make accurate jet engine performance simulation models and that it is possible to incorporate these models into an MDO environment.
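The validation criterion reduces to percent-error arithmetic against manufacturer data at each thrust rating; the values below are invented placeholders, not GasTurb or manufacturer numbers.

```python
def percent_error(simulated, reference):
    # Signed percent deviation of a simulated quantity from reference data.
    return 100.0 * (simulated - reference) / reference

# Placeholder cruise SFC (kg/(N*h)) and net thrust (kN) comparisons:
print(percent_error(simulated=0.0585, reference=0.0576))  # ~1.6%, within 2%
print(percent_error(simulated=31.2, reference=30.0))      # 4.0%, within 5%
```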
28

Theory, Modeling, and Analysis of the MEA of a Proton Exchange Membrane Fuel Cell

Chou, Hsuan-Jen 16 July 2002 (has links)
A mathematical model for a proton exchange membrane fuel cell (PEMFC) is the focus of this thesis. Modeling and simulations are carried out to understand the influence of operational and geometrical parameters on the internal reactions and performance of a PEMFC, and to examine the distributions of physical quantities in the membrane and catalyst layer. The modeling results are then compared with experiments, and the reasons for the observed influences on PEMFC performance are discussed. The results show that activation overpotential is the dominant loss at low current density, while diffusion and ohmic overpotentials increase substantially at high current density. A membrane with higher conductivity and reduced thickness, increased cathode gas pressure, and the use of pure oxygen all enhance the performance of a PEMFC. Catalyst layer thickness beyond 0.3 μm has almost no influence on performance, and a thin, uniform catalyst layer reduces the coating required. Comparison of modeling and experimental results shows that the experiments include contact resistance between materials, so measured performance is slightly lower than modeled performance, with the difference greater at high current density than at low current density.
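The loss decomposition the abstract describes can be sketched as a polarization curve in which activation overpotential dominates at low current density while ohmic and concentration (diffusion) losses take over at high current density; all parameter values below are illustrative, not fitted to this thesis's cell.

```python
import numpy as np

i = np.linspace(0.01, 1.4, 100)       # current density, A/cm^2

E_nernst = 1.19                        # V, open-circuit potential (assumed)
eta_act = 0.06 * np.log(i / 1e-4)      # Tafel-type activation loss
eta_ohm = 0.15 * i                     # ohmic loss, R = 0.15 ohm*cm^2
eta_conc = 3e-5 * np.exp(6.0 * i)      # empirical concentration loss

v_cell = E_nernst - eta_act - eta_ohm - eta_conc
# At the low end activation loss dominates; at the high end ohmic and
# concentration losses steepen the curve.
print(v_cell[0], v_cell[-1])
```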
30

Resource management for efficient single-ISA heterogeneous computing

Chen, Jian, doctor of electrical and computer engineering 11 July 2012 (has links)
Single-ISA heterogeneous multi-core processors (SHMP) have become increasingly important due to their potential to significantly improve the execution efficiency for diverse workloads and thereby alleviate the power density constraints in Chip Multiprocessors (CMP). The importance of SHMP is further underscored by the fact that manufacturing defects and process variation can also cause single-ISA heterogeneity in CMPs even when the CMP is originally designed as homogeneous. However, to fully exploit the execution efficiency that SHMP has to offer, programs have to be efficiently mapped/scheduled to the appropriate cores such that the hardware resources of the cores match the resource demands of the programs, which is challenging and remains an open problem. This dissertation presents a comprehensive set of off-line and on-line techniques that leverage analytical performance modeling to bridge the gap between workload diversity and hardware heterogeneity. For the off-line scenario, this dissertation presents an efficient resource demand analysis framework that can estimate the resource demands of a program based on the inherent characteristics of the program without using any detailed simulation. Based on the estimated resource demands, this dissertation further proposes a multi-dimensional program-core matching technique that projects program resource demands and core configurations onto a unified multi-dimensional space, and uses the weighted Euclidean distance between the two to identify the matching program-core pair. This dissertation also presents a dynamic and predictive application scheduler for SHMPs. It uses a set of hardware-efficient online profilers and an analytical performance model to simultaneously predict the application's performance on different cores. Based on the predicted performance, the scheduler identifies and enforces near-optimal application assignment for each scheduling interval without any trial runs or off-line profiling. Using only a few kilobytes of extra hardware, the proposed heterogeneity-aware scheduler improves the weighted speedup by 11.3% compared with the commodity OpenSolaris scheduler and by 6.8% compared with the best known research scheduler. Finally, this dissertation presents a predictive yet cost-effective mechanism to manage intra-core and/or inter-core resources in a dynamic SHMP. It also uses a set of hardware-efficient online profilers and an analytical performance model to predict the application's performance under different resource allocations. Based on the predicted performance, the resource allocator identifies and enforces near-optimal resource partitions for each epoch without any trial runs. The experimental results show that the proposed predictive resource management framework can improve the weighted speedup of the CMP system by an average of 11.6% compared with the equal partition scheme, and by 9.3% compared with an existing reactive resource management scheme.
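The program-core matching step can be sketched directly from its description: project resource demands and core configurations into one space and pick the core minimizing a weighted Euclidean distance. The dimensions, weights, and values below are invented for illustration, not the dissertation's parameters.

```python
import numpy as np

# Dimensions (assumed): issue-width demand, cache sensitivity, branch
# pressure, each normalized to [0, 1].
program_demand = np.array([0.8, 0.3, 0.6])
cores = {
    "big":    np.array([0.9, 0.8, 0.7]),
    "medium": np.array([0.6, 0.5, 0.5]),
    "little": np.array([0.2, 0.2, 0.3]),
}
weights = np.array([1.0, 0.5, 0.8])   # relative importance per dimension

def weighted_distance(demand, config, w):
    # Weighted Euclidean distance in the unified demand/configuration space.
    return np.sqrt(np.sum(w * (demand - config) ** 2))

best = min(cores, key=lambda c: weighted_distance(program_demand,
                                                  cores[c], weights))
print(best)  # the core whose resources best match the program's demands
```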
