1. Software support for advanced applications on distributed memory multiprocessor systems. Chapman, Barbara Mary, January 1998.
No description available.
2. Towards Using Free Memory to Improve Microarchitecture Performance. Panwar, Gagandeep, 18 May 2020 (M.S.).
A computer system's memory is designed to accommodate the worst-case workloads with the highest memory requirement; as such, memory is underutilized when a system runs workloads with common-case memory requirements. Through a large-scale study of four production HPC systems, we find that the memory underutilization problem in HPC systems is very severe. As unused memory is wasted memory, we propose exposing a compute node's unused memory to its CPU(s) through a user-transparent CPU-OS codesign. This can enable many new microarchitecture techniques that transparently leverage unused memory locations to improve performance. We refer to these techniques as Free-memory-aware Microarchitecture Techniques (FMTs). In the context of HPC systems, we present a detailed example of an FMT called Free-memory-aware Replication (FMR). FMR replicates in-use data to unused memory locations to effectively reduce average memory read latency. On average across five HPC benchmark suites, FMR provides a 13% performance improvement and an 8% system-level energy improvement.

Random-access memory (RAM), or simply memory, stores the temporary data of applications that run on a computer system. Its size is determined by the worst-case application workload that the computer system is supposed to run. Through our memory utilization study of four large multi-node high-performance computing (HPC) systems, we find that memory is severely underutilized in these systems. Unused memory is a wasted resource that does nothing. In this work, we propose techniques that can make use of this wasted memory to boost computer system performance. We call these techniques Free-memory-aware Microarchitecture Techniques (FMTs). We then present in detail an FMT for HPC systems called Free-memory-aware Replication (FMR), which provides a performance improvement of over 13%.
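Purely to illustrate the replication idea in software (the thesis proposes a user-transparent CPU-OS hardware codesign, not a library), the following minimal C sketch keeps copies of hot lines in otherwise unused memory and serves reads from the copy when one exists. All names and sizes here (replica_table, fmr_replicate, fmr_read, LINE_BYTES) are hypothetical and not taken from the work.

#include <stdint.h>
#include <string.h>

#define NUM_ENTRIES 1024
#define LINE_BYTES  64

typedef struct {
    uintptr_t orig_addr;         /* address of the replicated in-use line      */
    uint8_t   copy[LINE_BYTES];  /* duplicate kept in otherwise unused memory  */
    int       valid;
} replica_entry;

static replica_entry replica_table[NUM_ENTRIES];

/* Copy a hot line into a free-memory slot (modelled here as a static table). */
static void fmr_replicate(const void *addr)
{
    uintptr_t line = (uintptr_t)addr & ~(uintptr_t)(LINE_BYTES - 1);
    replica_entry *e = &replica_table[(line / LINE_BYTES) % NUM_ENTRIES];
    memcpy(e->copy, (const void *)line, LINE_BYTES);
    e->orig_addr = line;
    e->valid = 1;
}

/* On a read, serve from the replica when one exists (the lower-latency path
   in the real design); otherwise fall back to the original location.        */
static uint64_t fmr_read(const uint64_t *addr)
{
    uintptr_t a    = (uintptr_t)addr;
    uintptr_t line = a & ~(uintptr_t)(LINE_BYTES - 1);
    replica_entry *e = &replica_table[(line / LINE_BYTES) % NUM_ENTRIES];
    if (e->valid && e->orig_addr == line) {
        uint64_t v;
        memcpy(&v, e->copy + (a - line), sizeof v);
        return v;
    }
    return *addr;
}

int main(void)
{
    static _Alignas(64) uint64_t data[8] = { 42 };
    fmr_replicate(&data[0]);              /* pretend data[0]'s line is hot */
    return fmr_read(&data[0]) == 42 ? 0 : 1;
}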
3. Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems. Rahman, Md Wasi-ur-, January 2016.
No description available.
4. Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs. Chandan, G, January 2014.
Scientific applications that operate on large data sets require huge amounts of computation power and memory. These applications are typically run on High Performance Computing (HPC) systems that consist of multiple compute nodes connected over a network interconnect such as InfiniBand. Each compute node has its own memory and does not share an address space with other nodes. A significant amount of work has been done in the past two decades on parallelizing for distributed-memory architectures. A majority of this work went into developing compiler technologies such as High Performance Fortran (HPF) and partitioned global address space (PGAS) languages. However, several steps involved in achieving good performance remained manual. Hence, the approach currently used to obtain the best performance is to rely on highly tuned libraries such as ScaLAPACK. The objective of this work is to improve automatic compiler and runtime support for distributed-memory clusters for regular programs. Regular programs typically use arrays as their main data structure, and array accesses are affine functions of outer loop indices and program parameters. Many scientific applications, such as linear-algebra kernels, stencils, partial differential equation solvers, data-mining applications, and dynamic programming codes, fall into this category.
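As a concrete illustration (not drawn from the thesis) of such a regular program, the C kernel below accesses its arrays only through subscripts that are affine functions of the loop indices i, j and the size parameter N; this is the class of loop nests the polyhedral framework can analyze and transform.

#include <stddef.h>

/* One Jacobi-style 2D stencil sweep: every subscript (i-1, i+1, j-1, j+1)
   is an affine function of the loop indices and the parameter N.          */
void jacobi_step(size_t N, double A[N][N], double B[N][N])
{
    for (size_t i = 1; i + 1 < N; i++)
        for (size_t j = 1; j + 1 < N; j++)
            B[i][j] = 0.25 * (A[i - 1][j] + A[i + 1][j] +
                              A[i][j - 1] + A[i][j + 1]);
}

int main(void)
{
    enum { N = 8 };
    static double A[N][N], B[N][N];
    A[N / 2][N / 2] = 1.0;
    jacobi_step(N, A, B);
    return B[N / 2 - 1][N / 2] == 0.25 ? 0 : 1;
}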
In this work, we propose techniques for finding computation mappings and data allocations when compiling regular programs for distributed-memory clusters. Techniques for transformation and detection of parallelism, relying on the polyhedral framework, already exist. We propose automatic techniques to determine computation placements for the identified parallelism and the allocation of data. We model the problem of finding a good computation placement as a graph partitioning problem with constraints to minimize both communication volume and load imbalance for the entire program. We show that our approach to computation mapping is more effective than those that can be developed using vendor-supplied libraries. Our approach to data allocation is driven by tiling of data spaces, along with a compiler-assisted runtime scheme to allocate and deallocate tiles on demand and reuse them. Experimental results on some sequences of BLAS calls demonstrate a mean speedup of 1.82× over versions written with ScaLAPACK. Besides enabling weak scaling for distributed memory, data tiling also improves locality for shared-memory parallelization. Experimental results on a 32-core shared-memory SMP system show a mean speedup of 2.67× over code that is not data tiled.
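The on-demand tile scheme can be pictured with a small sketch; the code below is illustrative only, and its names (tile_get, tile_release, TILE_ELEMS) are hypothetical rather than the thesis' actual runtime API. A node allocates a tile at its first access, reuses the same buffer on later accesses, and frees it at a compiler-inserted release point after the last use.

#include <stdlib.h>

#define TILE_ELEMS (256 * 256)   /* elements per data tile                  */
#define MAX_TILES  1024          /* tiles this node may ever own            */

typedef struct {
    double *buf;                 /* tile storage, NULL when not resident    */
} tile_slot;

static tile_slot slots[MAX_TILES];

/* First touch of tile t allocates it; later touches reuse the same buffer. */
static double *tile_get(int t)
{
    if (slots[t].buf == NULL)
        slots[t].buf = calloc(TILE_ELEMS, sizeof(double));
    return slots[t].buf;
}

/* Inserted by the compiler after the last use of tile t on this node.      */
static void tile_release(int t)
{
    free(slots[t].buf);
    slots[t].buf = NULL;
}

int main(void)
{
    double *a = tile_get(0);      /* allocated on demand at first access */
    if (a == NULL)
        return 1;
    a[0] = 3.14;
    double *b = tile_get(0);      /* reused: same buffer, no new allocation */
    int ok = (a == b) && (b[0] == 3.14);
    tile_release(0);              /* freed once the tile is no longer live */
    return ok ? 0 : 1;
}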