Global ETD Search

281	An Adaptive Framework for Managing Heterogeneous Many-Core Clusters Rafique, Muhammad Mustafa 21 October 2011 (has links) The computing needs and the input and result datasets of modern scientific and enterprise applications are growing exponentially. To support such applications, High-Performance Computing (HPC) systems need to employ thousands of cores and innovative data management. At the same time, an emerging trend in designing HPC systems is to leverage specialized asymmetric multicores, such as IBM Cell and AMD Fusion APUs, and commodity computational accelerators, such as programmable GPUs, which exhibit excellent price to performance ratio as well as the much needed high energy efficiency. While such accelerators have been studied in detail as stand-alone computational engines, integrating the accelerators into large-scale distributed systems with heterogeneous computing resources for data-intensive computing presents unique challenges and trade-offs. Traditional programming and resource management techniques cannot be directly applied to many-core accelerators in heterogeneous distributed settings, given the complex and custom instruction sets architectures, memory hierarchies and I/O characteristics of different accelerators. In this dissertation, we explore the design space of using commodity accelerators, specifically IBM Cell and programmable GPUs, in distributed settings for data-intensive computing and propose an adaptive framework for programming and managing heterogeneous clusters. The proposed framework provides a MapReduce-based extended programming model for heterogeneous clusters, which distributes tasks between asymmetric compute nodes by considering workload characteristics and capabilities of individual compute nodes. The framework provides efficient data prefetching techniques that leverage general-purpose cores to stage the input data in the private memories of the specialized cores. We also explore the use of an advanced layered-architecture based software engineering approach and provide mixin-layers based reusable software components to enable easy and quick deployment of heterogeneous clusters. The framework also provides multiple resource management and scheduling policies under different constraints, e.g., energy-aware and QoS-aware, to support executing concurrent applications on multi-tenant heterogeneous clusters. When applied to representative applications and benchmarks, our framework yields significantly improved performance in terms of programming efficiency and optimal resource management as compared to conventional, hand-tuned, approaches to program and manage accelerator-based heterogeneous clusters. / Ph. D. Heterogeneous Computing High-Performance Computing Resource Sharing Resource Management and Scheduling Programming Models
282	Prediction Models for Multi-dimensional Power-Performance Optimization on Many Cores Shah, Ankur Savailal 28 May 2008 (has links) Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance prediction framework, which we deploy to address the problem of simultaneous runtime optimization of DVFS, DCT, and thread placement on multi-core systems. We present results from an implementation of the prediction framework in a runtime system linked to the Intel OpenMP runtime environment and running on a real dual-processor quad-core system as well as a dual-processor dual-core system. We show that the prediction framework derives near-optimal settings of the three power-aware program adaptation knobs that we consider. Our overall runtime optimization framework achieves significant reductions in energy (12.27% mean) and ED² (29.6% mean), through simultaneous power savings (3.9% mean) and performance improvements (10.3% mean). Our prediction and adaptation framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings. / Master of Science concurrency throttling power-aware computing runtime adaptation performance prediction high-performance computing Multicore processors
283	Power Saving Analysis and Experiments for Large Scale Global Optimization Cao, Zhenwei 03 August 2009 (has links) Green computing, an emerging field of research that seeks to reduce excess power consumption in high performance computing (HPC), is gaining popularity among researchers. Research in this field often relies on simulation or only uses a small cluster, typically 8 or 16 nodes, because of the lack of hardware support. In contrast, System G at Virginia Tech is a 2592 processor supercomputer equipped with power aware components suitable for large scale green computing research. DIRECT is a deterministic global optimization algorithm, implemented in the mathematical software package VTDIRECT95. This thesis explores the potential energy savings for the parallel implementation of DIRECT, called pVTdirect, when used with a large scale computational biology application, parameter estimation for a budding yeast cell cycle model, on System G. Two power aware approaches for pVTdirect are developed and compared against the CPUSPEED power saving system tool. The results show that knowledge of the parallel workload of the underlying application is beneficial for power management. / Master of Science VTDIRECT95 power aware computing high performance computing DVFS large scale global optimization budding yeast problem
284	Enabling the use of Heterogeneous Computing for Bioinformatics Bijanapalli Chakri, Ramakrishna 02 October 2013 (has links) The huge amount of information in the encoded sequence of DNA and increasing interest in uncovering new discoveries has spurred interest in accelerating the DNA sequencing and alignment processes. The use of heterogeneous systems, that use different types of computational units, has seen a new light in high performance computing in recent years; However expertise in multiple domains and skills required to program these systems is causing an hindrance to bioinformaticians in rapidly deploying their applications into these heterogeneous systems. This work attempts to make an heterogeneous system, Convey HC-1, with an x86-based host processor and FPGA-based co-processor, accessible to bioinformaticians. First, a highly efficient dynamic programming based Smith-Waterman kernel is implemented in hardware, which is able to achieve a peak throughput of 307.2 Giga Cell Updates per Second (GCUPS) on Convey HC-1. A dynamic programming accelerator interface is provided to any application that uses Smith-Waterman. This implementation is also extended to General Purpose Graphics Processing Units (GP-GPUs), which achieved a peak throughput of 9.89 GCUPS on NVIDIA GTX580 GPU. Second, a well known graphical programming tool, LabVIEW is enabled as a programming tool for the Convey HC-1. A connection is established between the graphical interface and the Convey HC-1 to control and monitor the application running on the FPGA-based co-processor. / Master of Science Field programmable gate arrays Hardware Acceleration High Performance Computing DNA Alignment LabVIEW Heterogeneous Computing GP-GPUs
285	Energy and Performance Models Enabling Design Space Exploration using Domain Specific Languages Umar, Mariam 25 May 2018 (has links) With the advent of exascale architectures maximizing performance while maintaining energy consumption within reasonable limits has become one of the most critical design constraints. This constraint is particularly significant in light of the power budget of 20 MWatts set by the U.S. Department of Energy for exascale supercomputing facilities. Therefore, understanding an application's characteristics, execution pattern, energy footprint, and the interactions of such aspects is critical to improving the application's performance as well as its utilization of the underlying resources. With conventional methods of analyzing performance and energy consumption trends scientists are forced to limit themselves to a manageable number of design parameters. While these modeling techniques have catered to the needs of current high-performance computing systems, the complexity and scale of exascale systems demands that large-scale design-space-exploration techniques are developed to enable comprehensive analysis and evaluations. In this dissertation we present research on performance and energy modeling of current high performance computing and future exascale systems. Our thesis is focused on the design space exploration of current and future architectures, in terms of their reconfigurability, application's sensitivity to hardware characteristics (e.g., system clock, memory bandwidth), application's execution patterns, application's communication behavior, and utilization of resources. Our research is aimed at understanding the methods by which we may maximize performance of exascale systems, minimize energy consumption, and understand the trade offs between the two. We use analytical, statistical, and machine-learning approaches to develop accurate, portable and scalable performance and energy models. We develop application and machine abstractions using Aspen (a domain specific language) to implement and evaluate our modeling techniques. As part of our research we develop and evaluate system-level performance and energy-consumption models that form part of an automated modeling framework, which analyzes application signatures to evaluate sensitivity of reconfigurable hardware components for candidate exascale proxy applications. We also develop statistical and machine-learning based models of the application's execution patterns on heterogeneous platforms. We also propose a communication and computation modeling and mapping framework for exascale proxy architectures and evaluate the framework for an exascale proxy application. These models serve as external and internal extensions to Aspen, which enable proxy exascale architecture implementations and thus facilitate design space exploration of exascale systems. / Ph. D. Energy Modeling Performance Modeling Aspen Domain Specific Language Analytical Modeling High-Performance Computing Exascale Computing
286	Scalable Data Management for Object-based Storage Systems Wadhwa, Bharti 19 August 2020 (has links) Parallel I/O performance is crucial to sustain scientific applications on large-scale High-Performance Computing (HPC) systems. Large scale distributed storage systems, in particular the object-based storage systems, face severe challenges for managing the data efficiently. Inefficient data management leads to poor I/O and storage performance in HPC applications and scientific workflows. Some of the main challenges for efficient data management arise from poor resource allocation, load imbalance in object storage targets, and inflexible data sharing between applications in a workflow. In addition, parallel I/O makes it challenging to shoehorn new interfaces, such as taking advantage of multiple layers of storage and support for analysis in the data path. Solving these challenges to improve performance and efficiency of object-based storage systems is crucial, especially for upcoming era of exascale systems. This dissertation is focused on solving these major challenges in object-based storage systems by providing scalable data management strategies. In the first part of the dis-sertation (Chapter 3), we present a resource contention aware load balancing tool (iez) for large scale distributed object-based storage systems. In Chapter 4, we extend iez to support Progressive File Layout for object-based storage system: Lustre. In the second part (Chapter 5), we present a technique to facilitate data sharing in scientific workflows using object-based storage, with our proposed tool Workflow Data Communicator. In the last part of this dissertation, we present a solution for transparent data management in multi-layer storage hierarchy of present and next-generation HPC systems.This dissertation shows that by intelligently employing scalable data management techniques, scientific applications' and workflows' flexibility and performance in object-based storage systems can be enhanced manyfold. Our proposed data management strategies can guide next-generation HPC storage systems' software design to efficiently support data for scientific applications and workflows. / Doctor of Philosophy / Large scale object-based storage systems face severe challenges to manage the data efficiently for HPC applications and workflows. These storage systems often manage and share data inflexibly, without considering the load imbalance and resource contention in the underlying multi-layer storage hierarchy. This dissertation first studies how resource contention and inflexible data sharing mechanisms impact HPC applications' storage and I/O performance; and then presents a series of efficient techniques, tools and algorithms to provide efficient and scalable data management for current and next-generation HPC storage systems Lustre Ceph High Performance Computing Parallel File Systems ParallelI/O Optimization Load Imbalance Resource Contention
287	On the Use of Containers in High Performance Computing Abraham, Subil 09 July 2020 (has links) The lightweight, portable, and flexible nature of containers is driving their widespread adoption in cloud solutions. Data analysis and deep learning applications have especially benefited from containerized solutions. As such data analysis is also being utilized in the high performance computing (HPC) domain, the need for container support in HPC has become paramount. However, container adoption in HPC face crucial performance and I/O challenges. One obstacle is that while there have been container solutions for HPC, such solutions have not been thoroughly investigated, especially from the aspect of their impact on the crucial I/O throughput needs of HPC. To this end, this paper provides a first-of-its-kind empirical analysis of state-of-the-art representative container solutions (Docker, Podman, Singularity, and Charliecloud) in HPC environments, especially how containers interact with the HPC storage systems. We present the design of an analysis framework that is deployed on all nodes in an HPC environment, and captures aspects such as CPU, memory, network, and file I/O statistics from the nodes and the storage system. We are able to garner key insights from our analysis, e.g., Charliecloud outperforms other container solutions in terms of container start-up time, while Singularity and Charliecloud are equivalent in I/O throughput. But this comes at a cost, as Charliecloud invokes the most metadata and I/O operations on the underlying Lustre file system. By identifying such optimization opportunities, we can enhance performance of containers atop HPC and help the aforementioned applications. / Master of Science / Containers are a technology that allow for applications to be packaged along with its ideal environment, all the way down to its preferred operating system. This allows an application to run anywhere that can support containers without a huge hit to the application performance. Hence containers have seen wide adoption for use in the cloud. These qualities have also made it very appealing for use in the world of scientific research in national labs. Modern research heavily relies on the power of computing in order to model, simulate, and test the behavior of real world entities, often making use of large amounts of data and utilizing machine learning and deep learning. Doing this often requires the high performance computing power found in supercomputers. In most cases, scientists just want to be able to write their code and expect it to just work. Their applications might depend on other source code that form part of their standard toolkit and would expect to also be installed in the supercomputing environment. This may not always be the case, taking the scientist's focus away from their work in order ensure their requirements are set up in the supercomputing environment which might require extensive cooperation with the operations team responsible for the supercomputers. Containers easily solve this problem because it can package everything together. However, the use of containers in these environments have not been extensively tested, especially for applications that are very heavy on the analysis of large quantities of data. To fill this gap, this work analyzes the performance of several state-of-the-art container technologies (Docker, Podman, Singularity, Charliecloud), with a particular focus on its interaction with the Lustre data storage systems widely used in supercomputing environments. As part of this work, we design an analysis setup that captures the behavior of various aspects of the high performance computing environment like CPU, memory, network usage and data movement while using containers to run data heavy applications. We garner important insights about their performance that can help inform the best choice of container technology given an environment and the kind of application that needs to be run. Container Performance High Performance Computing Parallel File Systems HPC Storage and I/O
288	Interpolants, Error Bounds, and Mathematical Software for Modeling and Predicting Variability in Computer Systems Lux, Thomas Christian Hansen 23 September 2020 (has links) Function approximation is an important problem. This work presents applications of interpolants to modeling random variables. Specifically, this work studies the prediction of distributions of random variables applied to computer system throughput variability. Existing approximation methods including multivariate adaptive regression splines, support vector regressors, multilayer perceptrons, Shepard variants, and the Delaunay mesh are investigated in the context of computer variability modeling. New methods of approximation using Box splines, Voronoi cells, and Delaunay for interpolating distributions of data with moderately high dimension are presented and compared with existing approaches. Novel theoretical error bounds are constructed for piecewise linear interpolants over functions with a Lipschitz continuous gradient. Finally, a mathematical software that constructs monotone quintic spline interpolants for distribution approximation from data samples is proposed. / Doctor of Philosophy / It is common for scientists to collect data on something they are studying. Often scientists want to create a (predictive) model of that phenomenon based on the data, but the choice of how to model the data is a difficult one to answer. This work proposes methods for modeling data that operate under very few assumptions that are broadly applicable across science. Finally, a software package is proposed that would allow scientists to better understand the true distribution of their data given relatively few observations. Approximation Theory Numerical Analysis High Performance Computing Computer Security Nonparametric Statistics Mathematical Software
289	Characterization of Sparsity-aware Optimization Paths for Graph Traversal on FPGA Gondhalekar, Atharva 25 May 2023 (has links) Breath-first search (BFS) is a fundamental building block in many graph-based applications, but it is difficult to optimize for a field-programmable gate array (FPGA) due to its irregular memory-access patterns. Prior work, based on hardware description languages (HDLs) and high-level synthesis (HLS), address the memory-access bottleneck of BFS by using techniques such as data alignment and compute-unit replication on FPGAs. The efficacy of such optimizations depends on factors such as the sparsity of target graph datasets. Optimizations intended for sparse graphs may not work as effectively for dense graphs on an FPGA and vice versa. This thesis presents two sets of FPGA optimization strategies for BFS, one for near-hypersparse graphs and the other designed for sparse to moderately dense graphs. For near-hypersparse graphs, a queue-based kernel with maximal use of local memory on FPGA is implemented. For denser graphs, an array-based kernel with compute-unit replication is implemented. Across a diverse collection of graphs, our OpenCL optimization strategies for near-hypersparse graphs delivers a 5.7x to 22.3x speedup over a state-of-the-art OpenCL implementation, when evaluated on an Intel Stratix~10 FPGA. The optimization strategies for sparse to moderately dense graphs deliver 1.1x to 2.3x speedup over a state-of-the-art OpenCL implementation on the same FPGA. Finally, this work uses graph metrics such as average degree and Gini coefficient to observe the impact of graph properties on the performance of the proposed optimization strategies. / M.S. / A graph is a data structure that typically consists of two sets -- a set of vertices and a set of edges representing connections between the vertices. Graphs are used in a broad set of application domains such as the testing and verification of digital circuits, data mining of social networks, and analysis of road networks. In such application areas, breadth-first search (BFS) is a fundamental building block. BFS is used to identify the minimum number of edges needed to be traversed from a source vertex to one or many destination vertices. In recent years, several attempts have been made to optimize the performance of BFS on reconfigurable architectures such as field-programmable gate arrays (FPGAs). However, the optimization strategies for BFS are not necessarily applicable to all types of graphs. Moreover, the efficacy of such optimizations oftentimes depends on the sparsity of input graphs. To that end, this work presents optimization strategies for graphs with varying levels of sparsity. Furthermore, this work shows that by tailoring the BFS design based on the sparsity of the input graph, significant performance improvements are obtained over the state-of-the-art BFS implementations on an FPGA. High performance computing Reconfigurable computing Graph traversal Field-programmable gate arrays
290	Impact of data dependencies for real-time high performance computing. Hossain, M. Alamgir, Kabir, U., Tokhi, M.O. January 2002 (has links) No / This paper presents an investigation into the impact of data dependencies in real-time high performance sequential and parallel processing. An adaptive active vibration control algorithm is considered to demonstrate the impact of data dependencies in real-time computing. The algorithm is analysed in detail to explore the inherent data dependencies. To minimize the impact of data dependencies, an investigation into reducing memory access in sequential computing is provided. The impact of data dependencies with various interconnections is also explored and demonstrated in real-time parallel processing through a set of experiments. Active vibration control Data dependency High performance computing Memory management Finite difference method Real-time control

Search results