11

Maximizing I/O Bandwidth for Out-of-Core HPC Applications on Homogeneous and Heterogeneous Large-Scale Systems

Alturkestani, Tariq 30 September 2020 (has links)
Out-of-core simulation systems often produce a massive amount of data that cannot fit in the aggregate fast memory of the compute nodes, and they also require reading these data back for computation. As a result, I/O data movement can be a bottleneck in large-scale simulations. Advances in memory architecture have made it feasible and affordable to integrate hierarchical storage media on large-scale systems, starting from the traditional Parallel File Systems (PFSs) to intermediate fast disk technologies (e.g., node-local and remote-shared NVMe and SSD-based Burst Buffers) and up to CPU main memory and GPU High Bandwidth Memory (HBM). However, while adding additional and faster storage media increases I/O bandwidth, it pressures the CPU, as it becomes responsible for managing and moving data between these layers of storage. Simulation systems are thus vulnerable to being blocked by I/O operations. The Multilayer Buffer System (MLBS) proposed in this research demonstrates a general and versatile method for overlapping I/O with computation that helps to ameliorate the strain on the processors through asynchronous access. The main idea consists in decoupling I/O operations from computational phases using dedicated hardware resources to perform expensive context switches. MLBS monitors I/O traffic in each storage layer, allowing fair utilization of shared resources. By continually prefetching up and down across all hardware layers of the memory and storage subsystems, MLBS transforms the original I/O-bound behavior of evaluated applications and shifts it closer to a memory-bound or compute-bound regime. The evaluation on the Cray XC40 Shaheen-2 supercomputer for a representative I/O-bound application, seismic inversion, shows that MLBS outperforms state-of-the-art PFSs, i.e., Lustre, Data Elevator and DataWarp, by 6.06X, 2.23X, and 1.90X, respectively. On the IBM-built Summit supercomputer, using 2048 compute nodes equipped with a total of 12288 GPUs, MLBS achieves up to 1.4X performance speedup compared to the reference PFS-based implementation. MLBS is also demonstrated on applications from cosmology and combustion, and on classic out-of-core computational physics and linear algebra routines.
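The overlap pattern the abstract describes, fetching the next block from a slower storage layer on dedicated resources while the current block is being computed on, can be sketched with a double buffer and an asynchronous reader. The sketch below is a generic C++ illustration of that pattern under stated assumptions, not the MLBS API; the block size, file paths, and compute kernel are placeholders.

```cpp
// Minimal sketch: overlap I/O with computation using an asynchronous prefetch
// and double buffering (illustrative only; not the MLBS implementation).
#include <fstream>
#include <future>
#include <string>
#include <vector>

std::vector<double> read_block(const std::string& path) {
    // Placeholder read from a slower storage layer (e.g., PFS or burst buffer).
    std::ifstream in(path, std::ios::binary);
    std::vector<double> block(1 << 20);
    in.read(reinterpret_cast<char*>(block.data()),
            static_cast<std::streamsize>(block.size() * sizeof(double)));
    return block;
}

void compute_on(const std::vector<double>& block) {
    // Placeholder compute kernel.
    volatile double sink = 0.0;
    for (double x : block) sink = sink + x;
}

void pipeline(const std::vector<std::string>& block_paths) {
    // Prefetch the first block, then always fetch block i+1 while computing on block i.
    auto next = std::async(std::launch::async, read_block, block_paths[0]);
    for (std::size_t i = 0; i < block_paths.size(); ++i) {
        std::vector<double> current = next.get();            // wait for prefetched data
        if (i + 1 < block_paths.size())
            next = std::async(std::launch::async, read_block, block_paths[i + 1]);
        compute_on(current);                                  // overlaps with the next read
    }
}
```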
12

Optimising a fluid plasma turbulence simulation on modern high performance computers

Edwards, Thomas David January 2010 (has links)
Nuclear fusion offers the potential of almost limitless energy from sea water and lithium without the dangers of carbon emissions or long-term radioactive waste. At the forefront of fusion technology are the tokamaks, toroidal magnetic confinement devices that contain miniature stars on Earth. Nuclei can only fuse by overcoming the strong electrostatic forces between them, which requires high temperatures and pressures. The temperatures in a tokamak are so great that the Deuterium-Tritium fusion fuel forms a plasma, which must be kept hot and under pressure to maintain the fusion reaction. Turbulence in the plasma causes disruption by transporting mass and energy away from the core, reducing the efficiency of the reaction. Understanding and controlling the mechanisms of plasma turbulence is key to building a fusion reactor capable of producing sustained output. The extreme temperatures make detailed empirical observations difficult to acquire, so numerical simulations are used as an additional method of investigation. One numerical model used to study turbulence and diffusion is CENTORI, a direct two-fluid magneto-hydrodynamic simulation of a tokamak plasma developed by the Culham Centre for Fusion Energy (CCFE, formerly UKAEA:Fusion). It simulates the entire tokamak plasma with realistic geometry, evolving bulk plasma quantities like pressure, density and temperature through millions of timesteps. This requires CENTORI to run in parallel on a Massively Parallel Processing (MPP) supercomputer to produce results in an acceptable time. Any improvement in CENTORI’s performance increases the rate and/or total number of results that can be obtained from access to supercomputer resources. This thesis presents the substantial effort to optimise CENTORI on the current generation of academic supercomputers. It investigates and reviews the properties of contemporary computer architectures, then proposes, implements and executes a benchmark suite of CENTORI’s fundamental kernels. The suite is used to compare the performance of three competing memory layouts of the primary vector data structure using a selection of compilers on a variety of computer architectures. The results show there is no optimal memory layout on all platforms, so a flexible optimisation strategy was adopted to pursue “portable” optimisation, i.e. optimisations that can easily be added, adapted or removed from future platforms depending on their performance. This required designing an interface to functions and datatypes that separates CENTORI’s fundamental algorithms from repetitive, low-level implementation details. This approach offered multiple benefits, including: the clearer representation of CENTORI’s core equations as mathematical expressions in Fortran source code allows rapid prototyping and development of new features; the reduction in the total data volume by a factor of three reduces the amount of data transferred over the memory bus to almost a third; and the reduction in the number of intense floating point kernels reduces the effort of optimising the application on new platforms. The project proceeds to rewrite CENTORI using the new Application Programming Interface (API) and evaluates two optimised implementations. The first is a traditional library implementation that uses hand-optimised subroutines to implement the library functions. The second uses a dynamic optimisation engine to perform automatic stripmining to improve the performance of the memory hierarchy.
The automatic stripmining implementation uses lazy evaluation to delay calculations until absolutely necessary, allowing it to identify temporary data structures and minimise them for optimal cache use. This novel technique is combined with highly optimised implementations of the kernel operations and optimised parallel communication routines to produce a significant improvement in CENTORI’s performance. The maximum measured speed-up of the optimised versions over the original code was 3.4 times on 128 processors on HPCx, 2.8 times on 1024 processors on HECToR and 2.3 times on 256 processors on HPC-FF.
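Strip-mining of the kind the dynamic optimisation engine automates can be illustrated by hand: a fused vector expression is evaluated in cache-sized strips so that its intermediates stay cache-resident instead of being materialised at full length. The C++ sketch below is a generic illustration under that assumption, not CENTORI's API; the strip length and the two-stage expression are arbitrary.

```cpp
// Illustrative strip-mining (loop blocking): evaluate a fused vector expression
// strip by strip so the temporary never leaves cache.
#include <algorithm>
#include <cstddef>
#include <vector>

void fused_update(std::vector<double>& out, const std::vector<double>& a,
                  const std::vector<double>& b, const std::vector<double>& c) {
    constexpr std::size_t strip = 4096;   // strip length chosen to fit in L1/L2 cache
    double tmp[strip];                    // small temporary instead of a full-length array
    const std::size_t n = out.size();
    for (std::size_t start = 0; start < n; start += strip) {
        const std::size_t len = std::min(strip, n - start);
        // First stage of the expression, kept in the in-cache temporary.
        for (std::size_t i = 0; i < len; ++i)
            tmp[i] = a[start + i] * b[start + i];
        // Second stage reuses tmp while it is still cache-resident.
        for (std::size_t i = 0; i < len; ++i)
            out[start + i] = tmp[i] + c[start + i];
    }
}
```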
13

EXPLOITING SPARSENESS OF COMMUNICATION PATTERNS FOR THE DESIGN OF NETWORKS IN MASSIVELY PARALLEL SUPERCOMPUTERS

Mattox, Timothy Ian 01 January 2006 (has links)
A limited set of Processing Element (PE) pairs in a parallel computer covers the internal communications of scalable parallel programs. We take advantage of this property using the concept of Sparse Flat Neighborhood Networks (Sparse FNNs). Sparse FNNs are network designs that provide single-switch latency and full wire bandwidth for each specified PE pair, despite using relatively few network interfaces per PE and switches that have far fewer ports than there are PEs. This dissertation discusses the design problem, runtime support, and working prototype (KASY0) for Sparse FNNs. KASY0 not only demonstrated the claimed properties, but also set world records for its price/performance and performance on a specific application. Parallel supercomputers execute many portions of an application simultaneously. For scalable programs, the more PEs the system has, the greater the potential speedup. Portions executing on different PEs may be able to work independently for short periods, but the performance desired might not be achieved due to delays in communication between PEs. The set of PE pairs that will communicate often is both predictable and small relative to the number of possible PE pairings. This sparseness property can be exploited in the design and implementation of networks for massively parallel supercomputers. The sparseness of communicating pairs is rooted in the fact that each of the human-designed communication patterns commonly used in parallel programs has the property that the number of communicating pairs grows relatively slowly as the number of PEs is increased. Additionally, the number of pairs in the union of all communication patterns used in a suite of parallel programs grows surprisingly slowly due to pair synergy: the same pair often appears in multiple communication patterns. Detailed analysis of communication patterns clearly shows that the number of PE pairs actually communicating is very sparse, although the structure of the sparseness can be complex.
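The covering property at the heart of a Sparse FNN, that every specified PE pair shares at least one switch and so communicates through a single switch hop, can be checked in a few lines. The sketch below is only an illustrative verification of that property, assuming a switch is represented by the set of PEs wired to it; the dissertation's actual design heuristics are more involved.

```cpp
// Check that a candidate Sparse FNN design covers all communicating PE pairs:
// each pair must be attached to at least one common switch.
#include <set>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;            // a communicating PE pair
using Switch = std::set<int>;                // PEs attached to one switch

bool covers(const std::vector<Switch>& switches, const std::vector<Pair>& pairs) {
    for (const auto& [a, b] : pairs) {
        bool shared = false;
        for (const auto& sw : switches)
            if (sw.count(a) && sw.count(b)) { shared = true; break; }
        if (!shared) return false;           // this pair would need multiple switch hops
    }
    return true;
}
```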
14

Design and Measurement of StrongARM Comparators

Whitehead, Nathan Robert 29 October 2019 (has links)
The StrongARM comparator is utilized in many analog-to-digital converters (ADCs) because of its high power efficiency and rail-to-rail outputs. The performance of the comparator directly affects the speed, power, and accuracy of an ADC. However, StrongARM comparator performance parameters such as delay, noise, and offset measured directly from silicon prototypes are rare in the literature and often consist of small sample sets. In addition, existing techniques to measure the comparator require large chip areas, making it impractical to characterize a large number of comparators to obtain stochastic parameters such as offset and noise. This work presents novel circuit techniques to measure a large number of comparators (4,000) in a compact chip area to directly obtain silicon data including delay, noise, offset, and power. The proposed techniques also relax the requirement on the test instruments to measure the small time values. Four comparators with different transistor size ratios have been designed and measured to study the performance tradeoffs. In addition, this work presents a method utilizing supercomputing resources to simulate the large design space of the StrongARM comparator to observe the performance trends. Measurements are compared to simulations, demonstrating their accuracy and providing, for the first time, a detailed study of the performance trends with different transistor size ratios.
15

High-Concurrency Visualization on Supercomputers

Nouanesengsy, Boonthanome 30 August 2012 (has links)
No description available.
16

Optimizing Applications and Message-Passing Libraries for the QPACE Architecture

Wunderlich, Simon 18 July 2012 (has links) (PDF)
The goal of the QPACE project is to build a novel cost-efficient massively parallel supercomputer optimized for LQCD (Lattice Quantum Chromodynamics) applications. Unlike previous projects, which used custom ASICs, this is accomplished by using the general-purpose multi-core PowerXCell 8i processor tightly coupled with a custom network processor implemented on a modern FPGA. The heterogeneous architecture of the PowerXCell 8i processor, the core-independent OS-bypassing access to the custom network hardware, and the application-oriented 3D torus topology pose interesting challenges for the implementation of the applications. This work will describe and evaluate the implementation possibilities of message-passing APIs, the more general MPI and the more QCD-oriented QMP, and their performance in PPE-centric and SPE-centric scenarios. These results will then be employed to optimize HPL for the QPACE architecture. Finally, the developed approaches and concepts will be briefly discussed regarding their applicability to heterogeneous node/network architectures as is the case in the "High-speed Network Interface with Collective Operation Support for Cell BE (NICOLL)" project.
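The nearest-neighbour exchange that LQCD codes map onto a 3D torus can be written down with standard MPI Cartesian-topology calls; a hedged sketch follows. It says nothing about QPACE's QMP layer or the FPGA network processor's OS-bypassing interface, and the message size and payload are placeholders.

```cpp
// Generic MPI sketch of a nearest-neighbour exchange on a periodic 3D torus.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Dims_create(nprocs, 3, dims);              // factor the ranks into a 3D grid
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &torus);

    std::vector<double> send(1024, 1.0), recv(1024);
    for (int dim = 0; dim < 3; ++dim) {
        int minus, plus;
        MPI_Cart_shift(torus, dim, 1, &minus, &plus);
        // Exchange a face with both neighbours along this torus direction.
        MPI_Sendrecv(send.data(), 1024, MPI_DOUBLE, plus, 0,
                     recv.data(), 1024, MPI_DOUBLE, minus, 0,
                     torus, MPI_STATUS_IGNORE);
        MPI_Sendrecv(send.data(), 1024, MPI_DOUBLE, minus, 1,
                     recv.data(), 1024, MPI_DOUBLE, plus, 1,
                     torus, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
```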
17

ANALYZING SUPERCOMPUTER UTILIZATION UNDER QUEUING WITH A PRIORITY FORMULA AND A STRICT BACKFILL POLICY

Vanderlan, Michael David 01 May 2011 (has links)
Supercomputers have become increasingly important in recent years due to the growing amount of data available and the increasing demand for quicker results in the scientific community. Since supercomputers carry a high cost to build and maintain, efficiency becomes more important to the owners, administrators, and users of these supercomputers. One important factor in determining the efficiency of a supercomputer is the scheduling of jobs that are submitted by users of the system. Previous work has dealt with optimizing the schedule on the system’s end while the users are blinded from the process. The work presented in this thesis investigates a scheduling system with a backfilling policy, as implemented on the Oak Ridge National Laboratory (ORNL) supercomputer Kraken, and attempts to outline the optimal methods from the user’s point of view in the scheduling system, along with using a simulation approach to optimize the priority formula. Normally the user has no idea which scheduling algorithms are used, but the users at ORNL not only know how the scheduling works but can also view the current activity of the system. This gives an advantage to the users who are willing to benefit from this knowledge by utilizing some elementary game theory to optimize their strategies. The results will show a benefit to both the users, since they will be able to process their jobs sooner, and the system, since it will be better utilized with little expense to the administrators, through competition. Queuing models and simulation have been well studied in almost all relevant aspects of the modern world. Higher efficiency is the goal of many researchers in several different fields; the supercomputer queues are no different. Efficient use of the resources makes the system administrator pleased while benefiting the users with more timely results. Studying these queuing models through simulation should help all parties involved by increasing utilization. The simulation will be validated and the utilization improvement will be measured and reported. User-defined formulas will be developed for future users to help maximize utilization and minimize wait times.
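The core of a strict backfill rule is easy to state in code: a waiting job may start out of order only if it fits on currently idle nodes and is guaranteed to finish before the reservation held for the highest-priority queued job, so that job is never delayed. The sketch below is a generic illustration of that test together with a simple wait-time-plus-size style priority; the field names and weights are assumptions, not Kraken's actual configuration.

```cpp
// Illustrative priority formula and strict backfill test (not a real scheduler).
struct Job {
    double requested_nodes;   // node count requested
    double walltime;          // requested run time, in hours
    double hours_waiting;     // time spent in the queue so far
};

// Hypothetical wait-time-plus-size priority; production systems weight more factors.
double priority(const Job& j, double wait_weight, double size_weight) {
    return wait_weight * j.hours_waiting + size_weight * j.requested_nodes;
}

// Strict backfill: 'candidate' may start now only if it fits on the idle nodes and
// will finish before the top-priority job's reservation begins.
bool may_backfill(const Job& candidate, double now, double reservation_start,
                  double idle_nodes) {
    const bool fits_in_space = candidate.requested_nodes <= idle_nodes;
    const bool fits_in_time  = now + candidate.walltime <= reservation_start;
    return fits_in_space && fits_in_time;
}
```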
18

Mitteilungen des URZ 3/4/1994

Richter, Frank, Riedel, Wolfgang, Schier, Thomas, Schoeniger, Frank, Wagner, Jens, Ziegler, Christoph 22 August 1995 (has links)
Supercomputer in operation; Chemnitz student network inaugurated; New compute servers; New service: PC integration; TeX service of the URZ; Software news; Advent, Advent - story time
19

A Survey of Barrier Algorithms for Coarse Grained Supercomputers

Hoefler, Torsten, Mehlan, Torsten, Mietke, Frank, Rehm, Wolfgang 28 June 2005 (has links) (PDF)
There are several different algorithms available to perform a synchronization of multiple processors. Some of them support only shared memory architectures or very fine-grained supercomputers. This work gives an overview of all currently known algorithms that are suitable for distributed shared memory architectures and message-passing-based computer systems (loosely coupled or coarse-grained supercomputers). No absolute decision can be made when choosing a barrier algorithm for a machine; several architectural aspects have to be taken into account. The overview of known barrier algorithms given in this work is mostly targeted at implementors of libraries supporting collective communication (such as MPI).
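One representative algorithm from the message-passing end of the design space is the dissemination barrier, which completes in ceil(log2(P)) rounds of point-to-point messages. The sketch below expresses it with plain MPI calls as a hedged illustration; a production library would, as the abstract notes, pick and tune the algorithm per machine.

```cpp
// Dissemination barrier: in round k, rank i notifies (i + 2^k) mod P and waits
// for (i - 2^k) mod P; after ceil(log2(P)) rounds all ranks have synchronized.
#include <mpi.h>

void dissemination_barrier(MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    char token = 0;
    for (int round = 1; round < size; round <<= 1) {
        int to   = (rank + round) % size;            // partner to notify this round
        int from = (rank - round + size) % size;     // partner to wait for this round
        MPI_Sendrecv(&token, 1, MPI_CHAR, to, round,
                     &token, 1, MPI_CHAR, from, round,
                     comm, MPI_STATUS_IGNORE);
    }
    // After the last round every rank has (transitively) heard from all others.
}
```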
20

Optimalizace distribuovaného I/O subsystému projektu k-Wave / Optimization of the Distributed I/O Subsystem of the k-Wave Project

Vysocký, Ondřej January 2016 (has links)
This thesis deals with an efficient solution for the parallel I/O of the k-Wave tool, which is designed for time-domain acoustic and ultrasound simulations. k-Wave is a supercomputer application; it runs on a Lustre file system, is implemented with MPI, and stores its data in a suitable data format (HDF5). I designed three optimization methods that fit k-Wave's needs, based on accumulation and redistribution techniques. In comparison with the native write, every optimization method led to better write speed, up to 13.6 GB/s. These methods can be used to optimize any data-distributed application that suffers from write-speed issues.
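The accumulation-and-redistribution idea can be sketched in a few MPI calls: each rank buffers several time steps locally, then the buffers of a small group of ranks are funnelled to one aggregator, which issues a single large write instead of many small ones. The following C++ sketch is an assumption-laden illustration, not k-Wave's implementation; the group size is arbitrary and the HDF5 write itself is left as a comment.

```cpp
// Sketch: gather each group's accumulated buffers onto one aggregator rank,
// which then performs one large write on behalf of the group.
#include <mpi.h>
#include <vector>

void aggregate_and_write(const std::vector<float>& local, int ranks_per_writer,
                         MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Comm group;                                   // one communicator per writer group
    MPI_Comm_split(comm, rank / ranks_per_writer, rank, &group);

    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    std::vector<float> gathered;
    if (grank == 0) gathered.resize(local.size() * static_cast<std::size_t>(gsize));
    MPI_Gather(local.data(), static_cast<int>(local.size()), MPI_FLOAT,
               gathered.data(), static_cast<int>(local.size()), MPI_FLOAT,
               0, group);

    if (grank == 0) {
        // Aggregator writes one large contiguous block instead of many small ones
        // (in the real application this would go through parallel HDF5, e.g. H5Dwrite
        //  on a hyperslab; omitted here).
    }
    MPI_Comm_free(&group);
}
```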
