Global ETD Search

281	Improving the Efficiency of Parallel Applications on Multithreaded and Multicore Systems Curtis-Maury, Matthew 15 April 2008 (has links) The scalability of parallel applications executing on multithreaded and multicore multiprocessors is often quite limited due to large degrees of contention over shared resources on these systems. In fact, negative scalability frequently occurs such that a non-negligible performance loss is observed through the use of more processors and cores. In this dissertation, we present a prediction model for identifying efficient operating points of concurrency in multithreaded scientific applications in terms of both performance as a primary objective and power secondarily. We also present a runtime system that uses live analysis of hardware event rates through the prediction model to optimize applications dynamically. We discuss a dynamic, phase-aware performance prediction model (DPAPP), which combines statistical learning techniques, including multivariate linear regression and artificial neural networks, with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. We find that the scalability model achieves accuracy approaching 95%, sufficiently accurate to identify improved concurrency levels and thread placements from within real parallel scientific applications. Using DPAPP, we develop a prediction-driven runtime optimization scheme, called ACTOR, which throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each parallel execution phase in an application. ACTOR successfully identifies and exploits program phases where limited scalability results in a performance loss through the use of more processing elements, providing simultaneous reductions in execution time by 5%-18% and power consumption by 0%-11% across a variety of parallel applications and architectures. 
Further, we extend DPAPP and ACTOR to include support for runtime adaptation of DVFS, allowing for the synergistic exploitation of concurrency throttling and DVFS from within a single, autonomically-acting library, providing improved energy-efficiency compared to either approach in isolation. / Ph. D. power-aware computing high-performance computing performance prediction multicore processors runtime adaptation concurrency throttling
282	Scheduling on Asymmetric Architectures Blagojevic, Filip 22 July 2008 (has links) We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on the asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environment by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program using conventional processors and accelerators simultaneously. More specifically, the model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency, to maximize performance across cores. 
We evaluate our runtime policies as well as the performance model we developed, on an IBM Cell BladeCenter, as well as on a cluster composed of Playstation3 nodes, using two realistic bioinformatics applications. / Ph. D. process scheduling performance prediction high-performance computing runtime adaptation Multicore processors Cell BE
283	An Adaptive Framework for Managing Heterogeneous Many-Core Clusters Rafique, Muhammad Mustafa 21 October 2011 (has links) The computing needs and the input and result datasets of modern scientific and enterprise applications are growing exponentially. To support such applications, High-Performance Computing (HPC) systems need to employ thousands of cores and innovative data management. At the same time, an emerging trend in designing HPC systems is to leverage specialized asymmetric multicores, such as IBM Cell and AMD Fusion APUs, and commodity computational accelerators, such as programmable GPUs, which exhibit an excellent price-to-performance ratio as well as the much-needed high energy efficiency. While such accelerators have been studied in detail as stand-alone computational engines, integrating the accelerators into large-scale distributed systems with heterogeneous computing resources for data-intensive computing presents unique challenges and trade-offs. Traditional programming and resource management techniques cannot be directly applied to many-core accelerators in heterogeneous distributed settings, given the complex and custom instruction set architectures, memory hierarchies and I/O characteristics of different accelerators. In this dissertation, we explore the design space of using commodity accelerators, specifically IBM Cell and programmable GPUs, in distributed settings for data-intensive computing and propose an adaptive framework for programming and managing heterogeneous clusters. The proposed framework provides a MapReduce-based extended programming model for heterogeneous clusters, which distributes tasks between asymmetric compute nodes by considering workload characteristics and capabilities of individual compute nodes. The framework provides efficient data prefetching techniques that leverage general-purpose cores to stage the input data in the private memories of the specialized cores. 
We also explore the use of an advanced layered-architecture based software engineering approach and provide mixin-layers based reusable software components to enable easy and quick deployment of heterogeneous clusters. The framework also provides multiple resource management and scheduling policies under different constraints, e.g., energy-aware and QoS-aware, to support executing concurrent applications on multi-tenant heterogeneous clusters. When applied to representative applications and benchmarks, our framework yields significantly improved performance in terms of programming efficiency and optimal resource management as compared to conventional, hand-tuned, approaches to program and manage accelerator-based heterogeneous clusters. / Ph. D. Heterogeneous Computing High-Performance Computing Resource Sharing Resource Management and Scheduling Programming Models
284	Prediction Models for Multi-dimensional Power-Performance Optimization on Many Cores Shah, Ankur Savailal 28 May 2008 (has links) Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance prediction framework, which we deploy to address the problem of simultaneous runtime optimization of DVFS, DCT, and thread placement on multi-core systems. We present results from an implementation of the prediction framework in a runtime system linked to the Intel OpenMP runtime environment and running on a real dual-processor quad-core system as well as a dual-processor dual-core system. We show that the prediction framework derives near-optimal settings of the three power-aware program adaptation knobs that we consider. Our overall runtime optimization framework achieves significant reductions in energy (12.27% mean) and ED² (29.6% mean), through simultaneous power savings (3.9% mean) and performance improvements (10.3% mean). Our prediction and adaptation framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings. / Master of Science concurrency throttling power-aware computing runtime adaptation performance prediction high-performance computing Multicore processors
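Note on the metrics reported in entry 284: the abstract quotes mean reductions in energy and ED² without defining them. The sketch below assumes the conventional definitions, energy = average power × runtime and ED² = energy × runtime². The function names and the sample power and runtime figures are hypothetical and are not taken from the thesis.

```python
# Minimal sketch of the energy and ED^2 metrics, assuming the conventional
# definitions (E = P * T, ED^2 = E * T^2). All names and numbers here are
# illustrative assumptions, not values from the thesis.

def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy consumed by a run: average power multiplied by runtime."""
    return avg_power_watts * runtime_seconds


def ed2(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy-delay-squared product: energy multiplied by runtime squared."""
    return energy_joules(avg_power_watts, runtime_seconds) * runtime_seconds ** 2


def percent_reduction(baseline: float, optimized: float) -> float:
    """Relative improvement of `optimized` over `baseline`, in percent."""
    return 100.0 * (baseline - optimized) / baseline


if __name__ == "__main__":
    # Hypothetical run: 100 W for 10 s before tuning, 96 W for 9 s after.
    base_power, base_time = 100.0, 10.0
    opt_power, opt_time = 96.0, 9.0
    print(percent_reduction(energy_joules(base_power, base_time),
                            energy_joules(opt_power, opt_time)))
    print(percent_reduction(ed2(base_power, base_time),
                            ed2(opt_power, opt_time)))
```

Because ED² weights runtime cubically overall, a modest speedup moves it much more than it moves energy, which is consistent with the abstract reporting a larger mean reduction for ED² than for energy.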
285	Power Saving Analysis and Experiments for Large Scale Global Optimization Cao, Zhenwei 03 August 2009 (has links) Green computing, an emerging field of research that seeks to reduce excess power consumption in high performance computing (HPC), is gaining popularity among researchers. Research in this field often relies on simulation or only uses a small cluster, typically 8 or 16 nodes, because of the lack of hardware support. In contrast, System G at Virginia Tech is a 2592 processor supercomputer equipped with power aware components suitable for large scale green computing research. DIRECT is a deterministic global optimization algorithm, implemented in the mathematical software package VTDIRECT95. This thesis explores the potential energy savings for the parallel implementation of DIRECT, called pVTdirect, when used with a large scale computational biology application, parameter estimation for a budding yeast cell cycle model, on System G. Two power aware approaches for pVTdirect are developed and compared against the CPUSPEED power saving system tool. The results show that knowledge of the parallel workload of the underlying application is beneficial for power management. / Master of Science VTDIRECT95 power aware computing high performance computing DVFS large scale global optimization budding yeast problem
286	Enabling the use of Heterogeneous Computing for Bioinformatics Bijanapalli Chakri, Ramakrishna 02 October 2013 (has links) The huge amount of information in the encoded sequence of DNA and the increasing interest in uncovering new discoveries have spurred interest in accelerating the DNA sequencing and alignment processes. Heterogeneous systems, which use different types of computational units, have attracted renewed attention in high performance computing in recent years; however, the expertise in multiple domains and the skills required to program these systems hinder bioinformaticians from rapidly deploying their applications on them. This work attempts to make a heterogeneous system, the Convey HC-1, with an x86-based host processor and an FPGA-based co-processor, accessible to bioinformaticians. First, a highly efficient dynamic programming based Smith-Waterman kernel is implemented in hardware, which is able to achieve a peak throughput of 307.2 Giga Cell Updates per Second (GCUPS) on the Convey HC-1. A dynamic programming accelerator interface is provided to any application that uses Smith-Waterman. This implementation is also extended to General Purpose Graphics Processing Units (GP-GPUs), where it achieved a peak throughput of 9.89 GCUPS on an NVIDIA GTX580 GPU. Second, a well-known graphical programming tool, LabVIEW, is enabled as a programming tool for the Convey HC-1. A connection is established between the graphical interface and the Convey HC-1 to control and monitor the application running on the FPGA-based co-processor. / Master of Science Field programmable gate arrays Hardware Acceleration High Performance Computing DNA Alignment LabVIEW Heterogeneous Computing GP-GPUs
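Note on entry 286: the GCUPS figure counts how many cells of the Smith-Waterman dynamic programming matrix are updated per second. The sketch below is a plain software reference for the local-alignment recurrence with a linear gap penalty, plus the GCUPS arithmetic. It is not the thesis's FPGA or GPU kernel; the scoring parameters (match 2, mismatch -1, gap -1) and the function names are assumptions chosen for illustration.

```python
# Reference sketch of Smith-Waterman local alignment scoring with a linear
# gap penalty. Scoring values and names are illustrative assumptions; the
# hardware kernel described in the abstract is not reproduced here.

def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1,
                         gap: int = -1) -> int:
    """Return the best local alignment score between query and target."""
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # One "cell update": take the best of restarting at zero,
            # extending diagonally, or opening a gap in either sequence.
            curr[j] = max(0,
                          prev[j - 1] + sub,
                          prev[j] + gap,
                          curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(query_len: int, target_len: int, seconds: float) -> float:
    """Giga cell updates per second for one query/target comparison."""
    return (query_len * target_len) / seconds / 1e9


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "AGT"))
    print(gcups(1000, 1000, 0.001))
```

Only two rows of the matrix are kept because the score alone does not require a traceback; recovering the alignment itself would need the full matrix or stored back-pointers.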
287	Energy and Performance Models Enabling Design Space Exploration using Domain Specific Languages Umar, Mariam 25 May 2018 (has links) With the advent of exascale architectures maximizing performance while maintaining energy consumption within reasonable limits has become one of the most critical design constraints. This constraint is particularly significant in light of the power budget of 20 MWatts set by the U.S. Department of Energy for exascale supercomputing facilities. Therefore, understanding an application's characteristics, execution pattern, energy footprint, and the interactions of such aspects is critical to improving the application's performance as well as its utilization of the underlying resources. With conventional methods of analyzing performance and energy consumption trends scientists are forced to limit themselves to a manageable number of design parameters. While these modeling techniques have catered to the needs of current high-performance computing systems, the complexity and scale of exascale systems demands that large-scale design-space-exploration techniques are developed to enable comprehensive analysis and evaluations. In this dissertation we present research on performance and energy modeling of current high performance computing and future exascale systems. Our thesis is focused on the design space exploration of current and future architectures, in terms of their reconfigurability, application's sensitivity to hardware characteristics (e.g., system clock, memory bandwidth), application's execution patterns, application's communication behavior, and utilization of resources. Our research is aimed at understanding the methods by which we may maximize performance of exascale systems, minimize energy consumption, and understand the trade offs between the two. We use analytical, statistical, and machine-learning approaches to develop accurate, portable and scalable performance and energy models. 
We develop application and machine abstractions using Aspen (a domain specific language) to implement and evaluate our modeling techniques. As part of our research we develop and evaluate system-level performance and energy-consumption models that form part of an automated modeling framework, which analyzes application signatures to evaluate sensitivity of reconfigurable hardware components for candidate exascale proxy applications. We also develop statistical and machine-learning based models of the application's execution patterns on heterogeneous platforms. We also propose a communication and computation modeling and mapping framework for exascale proxy architectures and evaluate the framework for an exascale proxy application. These models serve as external and internal extensions to Aspen, which enable proxy exascale architecture implementations and thus facilitate design space exploration of exascale systems. / Ph. D. Energy Modeling Performance Modeling Aspen Domain Specific Language Analytical Modeling High-Performance Computing Exascale Computing
288	Scalable Data Management for Object-based Storage Systems Wadhwa, Bharti 19 August 2020 (has links) Parallel I/O performance is crucial to sustain scientific applications on large-scale High-Performance Computing (HPC) systems. Large scale distributed storage systems, in particular the object-based storage systems, face severe challenges for managing the data efficiently. Inefficient data management leads to poor I/O and storage performance in HPC applications and scientific workflows. Some of the main challenges for efficient data management arise from poor resource allocation, load imbalance in object storage targets, and inflexible data sharing between applications in a workflow. In addition, parallel I/O makes it challenging to shoehorn new interfaces, such as taking advantage of multiple layers of storage and support for analysis in the data path. Solving these challenges to improve performance and efficiency of object-based storage systems is crucial, especially for the upcoming era of exascale systems. This dissertation is focused on solving these major challenges in object-based storage systems by providing scalable data management strategies. In the first part of the dissertation (Chapter 3), we present a resource contention aware load balancing tool (iez) for large scale distributed object-based storage systems. In Chapter 4, we extend iez to support Progressive File Layout for the object-based storage system Lustre. In the second part (Chapter 5), we present a technique to facilitate data sharing in scientific workflows using object-based storage, with our proposed tool Workflow Data Communicator. 
In the last part of this dissertation, we present a solution for transparent data management in multi-layer storage hierarchy of present and next-generation HPC systems. This dissertation shows that by intelligently employing scalable data management techniques, scientific applications' and workflows' flexibility and performance in object-based storage systems can be enhanced manyfold. Our proposed data management strategies can guide next-generation HPC storage systems' software design to efficiently support data for scientific applications and workflows. / Doctor of Philosophy / Large scale object-based storage systems face severe challenges to manage the data efficiently for HPC applications and workflows. These storage systems often manage and share data inflexibly, without considering the load imbalance and resource contention in the underlying multi-layer storage hierarchy. This dissertation first studies how resource contention and inflexible data sharing mechanisms impact HPC applications' storage and I/O performance; and then presents a series of efficient techniques, tools and algorithms to provide efficient and scalable data management for current and next-generation HPC storage systems. Lustre Ceph High Performance Computing Parallel File Systems Parallel I/O Optimization Load Imbalance Resource Contention
289	On the Use of Containers in High Performance Computing Abraham, Subil 09 July 2020 (has links) The lightweight, portable, and flexible nature of containers is driving their widespread adoption in cloud solutions. Data analysis and deep learning applications have especially benefited from containerized solutions. As such data analysis is also being utilized in the high performance computing (HPC) domain, the need for container support in HPC has become paramount. However, container adoption in HPC faces crucial performance and I/O challenges. One obstacle is that while there have been container solutions for HPC, such solutions have not been thoroughly investigated, especially from the aspect of their impact on the crucial I/O throughput needs of HPC. To this end, this paper provides a first-of-its-kind empirical analysis of state-of-the-art representative container solutions (Docker, Podman, Singularity, and Charliecloud) in HPC environments, especially how containers interact with the HPC storage systems. We present the design of an analysis framework that is deployed on all nodes in an HPC environment, and captures aspects such as CPU, memory, network, and file I/O statistics from the nodes and the storage system. We are able to garner key insights from our analysis, e.g., Charliecloud outperforms other container solutions in terms of container start-up time, while Singularity and Charliecloud are equivalent in I/O throughput. But this comes at a cost, as Charliecloud invokes the most metadata and I/O operations on the underlying Lustre file system. By identifying such optimization opportunities, we can enhance performance of containers atop HPC and help the aforementioned applications. / Master of Science / Containers are a technology that allows applications to be packaged along with their ideal environment, all the way down to their preferred operating system. 
This allows an application to run anywhere that can support containers without a huge hit to the application performance. Hence containers have seen wide adoption for use in the cloud. These qualities have also made them very appealing for use in the world of scientific research in national labs. Modern research heavily relies on the power of computing in order to model, simulate, and test the behavior of real world entities, often making use of large amounts of data and utilizing machine learning and deep learning. Doing this often requires the high performance computing power found in supercomputers. In most cases, scientists just want to be able to write their code and expect it to just work. Their applications might depend on other source code that forms part of their standard toolkit and that they expect to also be installed in the supercomputing environment. This may not always be the case, taking the scientist's focus away from their work in order to ensure their requirements are set up in the supercomputing environment, which might require extensive cooperation with the operations team responsible for the supercomputers. Containers easily solve this problem because they can package everything together. However, the use of containers in these environments has not been extensively tested, especially for applications that are very heavy on the analysis of large quantities of data. To fill this gap, this work analyzes the performance of several state-of-the-art container technologies (Docker, Podman, Singularity, Charliecloud), with a particular focus on their interaction with the Lustre data storage systems widely used in supercomputing environments. As part of this work, we design an analysis setup that captures the behavior of various aspects of the high performance computing environment like CPU, memory, network usage and data movement while using containers to run data heavy applications. 
We garner important insights about their performance that can help inform the best choice of container technology given an environment and the kind of application that needs to be run. Container Performance High Performance Computing Parallel File Systems HPC Storage and I/O
290	Interpolants, Error Bounds, and Mathematical Software for Modeling and Predicting Variability in Computer Systems Lux, Thomas Christian Hansen 23 September 2020 (has links) Function approximation is an important problem. This work presents applications of interpolants to modeling random variables. Specifically, this work studies the prediction of distributions of random variables applied to computer system throughput variability. Existing approximation methods including multivariate adaptive regression splines, support vector regressors, multilayer perceptrons, Shepard variants, and the Delaunay mesh are investigated in the context of computer variability modeling. New methods of approximation using Box splines, Voronoi cells, and Delaunay for interpolating distributions of data with moderately high dimension are presented and compared with existing approaches. Novel theoretical error bounds are constructed for piecewise linear interpolants over functions with a Lipschitz continuous gradient. Finally, a mathematical software that constructs monotone quintic spline interpolants for distribution approximation from data samples is proposed. / Doctor of Philosophy / It is common for scientists to collect data on something they are studying. Often scientists want to create a (predictive) model of that phenomenon based on the data, but the choice of how to model the data is a difficult one to answer. This work proposes methods for modeling data that operate under very few assumptions that are broadly applicable across science. Finally, a software package is proposed that would allow scientists to better understand the true distribution of their data given relatively few observations. Approximation Theory Numerical Analysis High Performance Computing Computer Security Nonparametric Statistics Mathematical Software
