11. Accelerating Atmospheric Modeling Through Emerging Multi-core Technologies. Linford, John Christian, 18 May 2010 (has links)
The new generations of multi-core chipset architectures achieve unprecedented levels of computational power while respecting physical and economic constraints. The cost of this power is bewildering program complexity. Atmospheric modeling is a grand-challenge problem that could make good use of these architectures if they were more accessible to the average programmer. To that end, software tools and programming methodologies that greatly simplify the acceleration of atmospheric modeling and simulation with emerging multi-core technologies are developed. A general model is developed to simulate atmospheric chemical transport and atmospheric chemical kinetics. The Cell Broadband Engine Architecture (CBEA), General Purpose Graphics Processing Units (GPGPUs), and homogeneous multi-core processors (e.g. Intel Quad-core Xeon) are introduced. These architectures are used in case studies of transport modeling and kinetics modeling and demonstrate per-kernel speedups as high as 40x. A general analysis and code generation tool for chemical kinetics called "KPPA" is developed. KPPA generates highly tuned C, Fortran, or Matlab code that uses every layer of heterogeneous parallelism in the CBEA, GPGPU, and homogeneous multi-core architectures. A scalable method for simulating chemical transport is also developed. The Weather Research and Forecasting Model with Chemistry (WRF-Chem) is accelerated with these methods with good results: real forecasts of air quality are generated for the Eastern United States 65% faster than state-of-the-art models. / Ph. D.
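The transport and kinetics kernels accelerated in this work follow a common operator-splitting pattern. As a point of reference only, here is a minimal Python sketch of one split time step on a 1D periodic grid, with a hypothetical two-species decay reaction standing in for a real chemical mechanism; it is an illustration, not the KPPA-generated code described above.

```python
import numpy as np

def advect_upwind(c, wind, dx, dt):
    """First-order upwind advection of cell concentrations c,
    assuming a constant positive wind and periodic boundaries."""
    return c - wind * dt / dx * (c - np.roll(c, 1))

def react(c_a, c_b, k, dt):
    """Toy kinetics: species A decays into B at rate k (forward Euler)."""
    loss = k * c_a * dt
    return c_a - loss, c_b + loss

# One operator-split step over a small domain (illustrative values only).
dx, dt, wind, k = 1.0, 0.1, 0.5, 0.2
a = np.linspace(1.0, 2.0, 10)   # hypothetical concentrations of species A
b = np.zeros(10)                # species B starts empty
a = advect_upwind(a, wind, dx, dt)
b = advect_upwind(b, wind, dx, dt)
a, b = react(a, b, k, dt)
```

In a real model each kernel is applied to millions of grid cells with dozens of species, which is what makes the per-kernel accelerator speedups reported above worthwhile.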
12. Profiling of RT-PICLS Code. Kelling, Jeffrey; Juckeland, Guido, 15 May 2017 (has links) (PDF)
It was observed that the RT-PICLS code run by FWKT on the hypnos cluster was producing an unusual amount of system load, according to Ganglia metrics. Since this may point to an I/O problem in the code, the code was analyzed more closely.
13. Resource placement, data rearrangement, and Hamiltonian cycles in torus networks. Bae, Myung Mun, 14 November 1996 (has links)
Many parallel machines, both commercial and experimental, have been or are being designed with toroidal interconnection networks. For a given number of nodes, the torus has a relatively larger diameter than the hypercube, but better cost/performance tradeoffs, such as higher channel bandwidth and lower node degree. Thus, the torus is becoming a popular topology for the interconnection networks of high-performance parallel computers.
In a multicomputer, resources such as I/O devices or software packages are distributed over the network. The first part of the thesis investigates efficient methods of distributing resources in a torus network. Three classes of placement methods are studied: (1) the distance-t placement problem, in which every non-resource node is at a distance of at most t from some resource node; (2) the j-adjacency problem, in which every non-resource node is adjacent to at least j resource nodes; and (3) the generalized placement problem, in which every non-resource node must be at a distance of at most t from at least j resource nodes.
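As a concrete illustration of the distance-t condition (a brute-force check, not one of the thesis' placement constructions), the following Python sketch verifies a placement on a k-ary n-dimensional torus using wrap-around (Lee) distance.

```python
from itertools import product

def torus_distance(u, v, k):
    """Lee distance between torus nodes u and v: per-dimension
    wrap-around distance, summed over the dimensions."""
    return sum(min(abs(a - b), k - abs(a - b)) for a, b in zip(u, v))

def is_distance_t_placement(resources, k, n, t):
    """True if every node of the k-ary n-dimensional torus lies within
    distance t of at least one resource node."""
    return all(
        any(torus_distance(node, r, k) <= t for r in resources)
        for node in product(range(k), repeat=n)
    )

# Example: the diagonal placement {(i, 2i mod 5)} is a perfect distance-1
# placement in a 5x5 torus: every node is within one hop of a resource.
resources = [(i, 2 * i % 5) for i in range(5)]
print(is_distance_t_placement(resources, k=5, n=2, t=1))  # True
```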
This resource placement technique can be applied to allocating spare processors to provide fault tolerance in the case of processor failures. Some efficient spare-processor placement methods and reconfiguration schemes for handling processor failures are also described.
In a torus-based parallel system, some algorithms give their best performance when the data are distributed to processors numbered in Cartesian order; in other cases, it is better to distribute the data to processors numbered in Gray code order. Since the placement patterns may change dynamically, it is essential to find efficient methods of rearranging the data from Gray code order to Cartesian order and vice versa. In the second part of the thesis, efficient methods for data transfer between Gray code order and radix (Cartesian) order are developed.
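For reference, the index conversions underlying this kind of rearrangement are the standard reflected-Gray-code maps; the Python sketch below shows them (the thesis' contribution is the efficient communication schedule, not these one-liners).

```python
def binary_to_gray(b: int) -> int:
    """Reflected Gray code of a radix-2 (Cartesian) index."""
    return b ^ (b >> 1)

def gray_to_binary(g: int) -> int:
    """Invert the Gray code by folding the higher bits back down."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# Rearranging an array from Gray code order to radix (Cartesian) order.
data_gray = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']   # element at Gray index g
data_radix = [None] * len(data_gray)
for g, value in enumerate(data_gray):
    data_radix[gray_to_binary(g)] = value
```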
The last part of the thesis gives results on generating edge disjoint Hamiltonian cycles in k-ary n-cubes, hypercubes, and 2D tori. These edge disjoint cycles are quite useful for many communication algorithms. / Graduation date: 1997
14. High-performance data-parallel input/output. Moore, Jason Andrew, 19 July 1996 (has links)
Existing parallel file systems are proving inadequate in two important arenas: programmability and performance. Both of these inadequacies can largely be traced to the fact that nearly all parallel file systems evolved from Unix and rely on a Unix-oriented, single-stream, block-at-a-time approach to file I/O. This one-size-fits-all approach to parallel file systems is inadequate for supporting applications running on distributed-memory parallel computers.

This research provides a migration path away from the traditional approaches to parallel I/O at two levels. At the level seen by the programmer, we show how file operations can be closely integrated with the semantics of a parallel language. Principles for this integration are illustrated in their application to C*, a virtual-processor-oriented language. The result is that traditional C file operations with familiar semantics can be used in C* where the programmer works: at the virtual processor level. To facilitate high performance within this framework, machine-independent modes are used. Modes change the performance of file operations, not their semantics, so programmers need not use ambiguous operations found in many parallel file systems. An automatic mode detection technique is presented that saves the programmer from extra syntax and low-level file system details. This mode detection system ensures that the most commonly encountered file operations are performed using high-performance modes.

While the high-performance modes allow fast collective movement of file data, they must include optimizations for redistribution of file data, a common operation in production scientific code. This need is addressed at the file system level, where we provide enhancements to Disk-Directed I/O for redistributing file data. Two enhancements are geared to speeding fine-grained redistributions. One uses a two-phase, or indirect, approach to redistributing data among compute nodes. The other relies on I/O nodes to guide the redistribution by building packets bound for compute nodes. We model the performance of these enhancements and identify the key parameters that determine when each approach should be used. Finally, we introduce the notion of collective prefetching and identify its performance benefits and implementation tradeoffs. / Graduation date: 1997
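To make the two-phase (indirect) redistribution idea concrete, here is a small, self-contained Python sketch under simplified assumptions (it is not the thesis' Disk-Directed I/O implementation): data laid out block-wise in the file is first read as large contiguous chunks, then exchanged among compute nodes to reach a cyclic distribution.

```python
def two_phase_redistribute(file_data, n_nodes):
    """Phase 1: each node reads one large contiguous block of the file.
    Phase 2: nodes exchange elements so the final layout is cyclic
    (element i ends up on node i % n_nodes)."""
    n = len(file_data)
    block = (n + n_nodes - 1) // n_nodes
    # Phase 1: contiguous reads, one big block per node.
    staged = [file_data[r * block:(r + 1) * block] for r in range(n_nodes)]
    # Phase 2: all-to-all style exchange into the cyclic distribution.
    final = [[] for _ in range(n_nodes)]
    for r, chunk in enumerate(staged):
        for offset, value in enumerate(chunk):
            global_index = r * block + offset
            final[global_index % n_nodes].append((global_index, value))
    return final

# Example: 12 file elements redistributed cyclically across 3 compute nodes.
print(two_phase_redistribute(list(range(12)), 3))
```

The payoff is that the file is touched only with large sequential accesses; the fine-grained shuffling happens over the interconnect, which tolerates small messages far better than disks do.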
15. Computational Parameter Selection and Simulation of Complex Sphingolipid Pathway Metabolism. Henning, Peter Allen, 22 May 2006 (has links)
Systems biology is an emerging field of study that seeks to provide systems-level understanding of biological systems through the integration of high-throughput biological data into predictive computational models. The integrative nature of this field is in sharp contrast to the reductionist methods that have been employed since the advent of molecular biology. Systems biology investigates not only the individual components of the biological system, such as metabolic pathways, organelles, and signaling cascades, but also the relationships and interactions between the components, in the hope that an understandable model of the entire system can eventually be developed. This field of study is being hailed by experts as a potentially vital technology in revolutionizing the pharmaceutical development process in the post-genomic era. This work not only provides a systems biology investigation into the principles governing de novo sphingolipid metabolism but also addresses the various computational obstacles that arise in converting high-throughput data into an insightful model.
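As a toy illustration of the kind of parameter selection this involves (a hypothetical one-parameter conversion model and synthetic data, not the sphingolipid pathway model developed in this work), a rate constant can be chosen by minimizing squared error between simulation and measurement:

```python
def simulate(k, a0=10.0, dt=0.01, steps=200):
    """Forward-Euler simulation of a toy conversion A -> B with rate
    constant k; returns the concentration of B at each step."""
    a, b, trace = a0, 0.0, []
    for _ in range(steps):
        flux = k * a * dt
        a, b = a - flux, b + flux
        trace.append(b)
    return trace

def fit_rate_constant(measured, candidates):
    """Pick the candidate k whose simulated trace best matches the data
    (least squares); a stand-in for real parameter-selection machinery."""
    def error(k):
        return sum((s - m) ** 2 for s, m in zip(simulate(k), measured))
    return min(candidates, key=error)

# Hypothetical "measurements" generated with k = 0.8, then recovered by search.
data = simulate(0.8)
print(fit_rate_constant(data, [0.1 * i for i in range(1, 21)]))  # ~0.8
```

Real pathway models couple many species and parameters, which is what makes the parameter-selection problem computationally hard.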
16. Design and Implementation of High Performance Algorithms for the (n,k)-Universal Set Problem. Luo, Ping, 14 January 2010 (has links)
The k-path problem is to find a simple path of length k. This problem is NP-complete and has applications in bioinformatics for detecting signaling pathways in protein interaction networks and for biological subnetwork matching. Existing implementations solve the problem for k up to 13. The fastest implementation has running time O*(4.32^k), which is slower than the best known algorithm, whose running time is O*(4^k). Implementing the best known algorithm for the k-path problem requires constructing an (n,k)-universal set.

In this thesis, we study practical algorithms for constructing (n,k)-universal sets. We propose six algorithm variants to handle the increasing computational time and memory space needed for k = 3, 4, ..., 8, and two major empirical techniques that cut the time and space tremendously yet generate good results. For k = 7, the size of the universal set found by our algorithm is 1576; for k = 8, it is 4611.

We implement the proposed algorithms with the OpenMP parallel interface and construct universal sets for k = 3, 4, ..., 8. Our experiments show that our algorithms for the (n,k)-universal set problem exhibit very good parallelism and hence shed light on an MPI implementation.

Ours is the first implementation effort for the (n,k)-universal set problem. We share this effort by proposing an extensible universal set construction and retrieval system, which integrates the construction algorithms and the universal sets constructed; the sets are stored in a centralized database, and an interface is provided to access the database easily.

(n,k)-universal sets have also been applied to many other NP-complete problems, such as the set splitting problem and the matching and packing problems. The small (n,k)-universal sets constructed by us will significantly reduce the time needed to solve those problems.
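For readers unfamiliar with the object being constructed: a family of length-n binary vectors is (n,k)-universal if, for every choice of k coordinate positions, its projections onto those positions realize all 2^k bit patterns. The brute-force Python checker below spells out that definition (it is illustrative only; the construction algorithms in the thesis are far more involved).

```python
from itertools import combinations, product

def is_nk_universal(vectors, n, k):
    """Check the (n,k)-universal property by enumerating every set of k
    positions and verifying that all 2**k projections occur. Exponential
    in k and only practical for small instances."""
    for positions in combinations(range(n), k):
        seen = {tuple(v[p] for p in positions) for v in vectors}
        if len(seen) < 2 ** k:
            return False
    return True

# Sanity check: all 2^n binary vectors trivially form an (n,k)-universal set.
n, k = 5, 3
all_vectors = list(product((0, 1), repeat=n))
print(is_nk_universal(all_vectors, n, k))  # True
```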
17. Memory management for high-performance applications. Berger, Emery David, January 2002 (has links) (PDF)
Thesis (Ph. D.)--University of Texas at Austin, 2002. / Vita. Includes bibliographical references. Available also from UMI Company.
18. Assessment of open-source software for high-performance computing. Rapur, Gayatri, January 2003 (has links) (PDF)
Thesis (M.S.)--Mississippi State University. Department of Computer Science and Engineering. / Title from title screen. Includes bibliographical references.
19. Power and performance modeling for high-performance computing algorithms. Choi, Jee Whan, 08 June 2015 (has links)
The overarching goal of this thesis is to provide an algorithm-centric approach to analyzing the relationship between time, energy, and power. This research is aimed at algorithm designers and performance tuners, so that they can decide how algorithms should be designed and tuned depending on whether the goal is to minimize time or to minimize energy on current and future systems.
First, we present a simple analytical cost model for energy and power. Assuming a simple von Neumann architecture with a two-level memory hierarchy, this model predicts energy and power for algorithms using just a few simple parameters, such as the number of floating point operations (FLOPs or flops) and the amount of data moved (bytes or words). Using highly optimized microbenchmarks and a small number of test platforms, we show that although this model uses only a few simple parameters, it is, nevertheless, accurate.
We can also visualize this model using energy “arch lines,” analogous to the “rooflines” in time. These “rooflines in energy” allow users to easily assess and compare different algorithms’ intensities in energy and time to various target systems’ balances in energy and time. This visualization of our model gives us many interesting insights, and as such, we refer to our analytical model as the energy roofline model.
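A minimal sketch of this style of model, with assumed illustrative constants rather than the calibrated values from the thesis: time is bounded by the slower of compute and memory traffic, and energy is a weighted sum of flops, bytes moved, and a constant-power term integrated over the run time.

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Time roofline: the run is limited by either compute or memory
    traffic (perfect overlap assumed)."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def energy_estimate(flops, bytes_moved, e_flop, e_byte, const_power,
                    peak_flops, peak_bw):
    """Energy as per-flop and per-byte costs plus constant (idle/leakage)
    power drawn for the duration of the roofline time."""
    t = roofline_time(flops, bytes_moved, peak_flops, peak_bw)
    return flops * e_flop + bytes_moved * e_byte + const_power * t

# Hypothetical machine: 100 GFLOP/s, 20 GB/s, 50 pJ/flop, 500 pJ/byte, 10 W idle.
flops, bytes_moved = 1e9, 4e8      # a kernel with intensity 2.5 flops/byte
t = roofline_time(flops, bytes_moved, 100e9, 20e9)
e = energy_estimate(flops, bytes_moved, 50e-12, 500e-12, 10.0, 100e9, 20e9)
print(f"time ~ {t * 1e3:.1f} ms, energy ~ {e:.2f} J")
```

Sweeping the arithmetic intensity and plotting energy per operation produces a curve of the kind that gives rise to the “arch lines” referred to above, just as the time model gives rise to the familiar roofline plot.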
Second, we present the results of our microbenchmarking study of the time, energy, and power costs of computation and memory access for several candidate compute-node building blocks of future high-performance computing (HPC) systems. Over a dozen server-, desktop-, and mobile-class platforms that span a range of compute and power characteristics were evaluated, including x86 (both conventional and Xeon Phi accelerator), ARM, graphics processing unit (GPU), and hybrid (AMD accelerated processing unit (APU) and other system-on-chip (SoC)) processors.
The purpose of this study was twofold: first, to extend the validation of the energy roofline model to a more comprehensive set of target systems, showing that the model works well independent of system hardware and microarchitecture; and second, to improve the model by uncovering and remedying potential shortcomings, such as incorporating the effects of power “capping,” a multi-level memory hierarchy, and different implementation strategies on power and performance.
Third, we incorporate dynamic voltage and frequency scaling (DVFS) into the energy roofline model to explore its potential for saving energy. Rather than the more traditional approach of using DVFS to reduce energy, whereby a “slack” in computation is used as an opportunity to dynamically cycle down the processor clock, the energy roofline model can be used to determine precisely how the time and energy costs of different operations, both compute and memory, change with respect to frequency and voltage settings. This information can be used to target a specific optimization goal, whether that be time, energy, or a combination of both.
In the final chapter of this thesis, we use our model to predict the energy dissipation of a real application running on a real system. The fast multipole method (FMM) kernel was executed on the GPU component of the Tegra K1 SoC under various frequency and voltage settings, and a breakdown of instructions and data access patterns was collected via performance counters. The total energy dissipation of FMM was then calculated as a weighted sum of these instructions and their associated costs in energy. Across eight different voltage and frequency settings and eight different algorithm-specific input parameters per setting, for a total of 64 test cases, the accuracy of the energy roofline model in predicting total energy dissipation was within 6.2%, with a standard deviation of 4.7%, when compared to actual energy measurements.
Despite its simplicity and its foundation on the first principles of algorithm analysis, the energy roofline model has proven to be both practical and accurate for real applications running on a real system. As such, it can be an invaluable tool for algorithm designers and performance tuners with which they can more precisely analyze the impact of their design decisions on both performance and energy efficiency.
20. xBFT: Byzantine fault tolerance with high performance, low cost, and aggressive fault isolation. Kotla, Ramakrishna Rao, 1976-, 24 September 2012
We are increasingly relying on online services to store, access, share, and disseminate critical information from anywhere and at all times. Such services include email, digital storage, photos, video, health and financial services, etc. With increasing evidence of non-fail-stop failures in practical systems, the Byzantine fault tolerant state machine replication technique is becoming increasingly attractive for building highly reliable services that tolerate such failures. However, existing Byzantine fault tolerant techniques fall short of providing high availability, high performance, and long-term data durability guarantees with competitive replication cost. In this dissertation, we present BFT replication techniques that facilitate the design and implementation of such highly reliable services by providing high availability, high performance, and high durability with competitive replication cost (hardware, software, network, management).

First, we propose CBASE, a BFT state machine replication architecture that leverages application-level parallelism to improve the throughput of the replicated system by identifying and executing independent requests concurrently. Traditional state machine replication based Byzantine fault tolerant (BFT) techniques provide high availability and security but fail to provide high throughput. This limitation stems from the fundamental assumption of generalized state machine replication techniques that all replicas execute requests sequentially in the same total order to ensure consistency across replicas. Our architecture thus provides a general way to exploit application parallelism in order to provide high throughput without compromising correctness.

Second, we present Zyzzyva, an efficient BFT agreement protocol that uses speculation to significantly reduce the performance overhead and replication cost of BFT state machine replication. In Zyzzyva, replicas respond to a client’s request without first running an expensive three-phase commit protocol to reach agreement on the order in which the request must be processed. Instead, they optimistically adopt the order proposed by the primary and respond immediately to the client. Replicas can thus become temporarily inconsistent with one another, but clients detect inconsistencies, help correct replicas converge on a single total ordering of requests, and only rely on responses that are consistent with this total order. This approach allows Zyzzyva to reduce replication overheads to near their theoretical minima.

Third, we design and implement SafeStore, a distributed storage system designed to maintain long-term data durability despite conventional hardware and software faults, environmental disruptions, and administrative failures caused by human error or malice. The architecture of SafeStore is based on fault isolation, which SafeStore applies aggressively along administrative, physical, and temporal dimensions by spreading data across autonomous storage service providers (SSPs). SafeStore also performs an efficient end-to-end audit of SSPs to detect data loss quickly and improve data durability by reducing MTTR. SafeStore offers durable storage with cost, performance, and availability competitive with traditional storage systems.

We evaluate these techniques by implementing BFT replication libraries and further demonstrate the practicality of these approaches by implementing an NFS-based replicated file system (CBASE-FS) and a durable storage system (SafeStore-FS). / text
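To illustrate the kind of independence check a CBASE-style parallelizer relies on (a simplified sketch with hypothetical request objects, not the dissertation's implementation): two requests may run concurrently only if neither writes state that the other reads or writes, and conflicting requests must still respect the agreed total order.

```python
def conflicts(req_a, req_b):
    """Two requests conflict if either one writes an object that the other
    reads or writes; such requests must execute in the agreed total order."""
    return bool(
        req_a["writes"] & (req_b["reads"] | req_b["writes"]) or
        req_b["writes"] & (req_a["reads"] | req_a["writes"])
    )

def parallel_batches(ordered_requests):
    """Group totally ordered requests into batches executed one after
    another; members of a batch are pairwise independent and can run
    concurrently, while a conflict pushes a request into a later batch."""
    batches = []
    for req in ordered_requests:
        earliest = 0  # will become one past the last batch holding a conflict
        for i, batch in enumerate(batches):
            if any(conflicts(req, other) for other in batch):
                earliest = i + 1
        if earliest == len(batches):
            batches.append([req])
        else:
            batches[earliest].append(req)
    return batches

# Hypothetical key-value store requests, listed in the agreed total order.
reqs = [
    {"id": 1, "reads": {"x"}, "writes": {"x"}},
    {"id": 2, "reads": {"y"}, "writes": set()},
    {"id": 3, "reads": {"x"}, "writes": set()},   # conflicts with request 1
]
print([[r["id"] for r in batch] for batch in parallel_batches(reqs)])
# [[1, 2], [3]]
```

As described above, CBASE identifies independent requests and executes them concurrently; the sketch only captures the ordering constraint that makes such concurrency safe.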