51

Towards a Resource Efficient Framework for Distributed Deep Learning Applications

Han, Jingoo 24 August 2022 (has links)
Distributed deep learning has achieved tremendous success in solving scientific problems over the past years. Training is challenging because it requires processing large-scale, massive datasets, especially on graphics processing units (GPUs) in the latest high-performance computing (HPC) supercomputing systems. HPC architectures exhibit different training-throughput trends from those reported in existing studies, and distributed deep learning on HPC systems relies on multiple GPUs and high-speed interconnects. Extant distributed deep learning systems are designed for non-HPC environments without considering efficiency, leading to under-utilization of expensive HPC hardware. In addition, increasing resource heterogeneity degrades resource efficiency in distributed deep learning methods, including federated learning. It is therefore important to address the growing demand for both high performance and high resource efficiency in distributed deep learning systems, spanning the latest HPC systems and federated learning systems. In this dissertation, we explore and design novel methods and frameworks that improve the resource efficiency of distributed deep learning training. We address five topics: performance analysis of deep learning on supercomputers, GPU-aware deep learning job scheduling, topology-aware virtual GPU training, heterogeneity-aware adaptive scheduling, and a token-based incentive algorithm. In the first part (Chapter 3), we analyze performance trends of distributed deep learning on recent HPC systems such as the Summitdev supercomputer at Oak Ridge National Laboratory, providing insights through a comprehensive study of how deep learning workloads affect the performance of HPC systems with large-scale parallel processing capabilities. In the second part (Chapter 4), we design and develop MARBLE, a novel deep learning job scheduler that accounts for the non-linear scalability of GPUs within a single node and improves GPU utilization by sharing GPUs among multiple deep learning training workloads. The third part (Chapter 5) proposes TOPAZ, a topology-aware virtual GPU training system designed specifically for distributed deep learning on recent HPC systems. In the fourth part (Chapter 6), we explore a holistic federated learning scheduler that employs a heterogeneity-aware adaptive selection method, coupled with resource-usage profiling and accuracy monitoring, to improve both resource efficiency and accuracy. In the fifth part (Chapter 7), we focus on providing incentives to participants in proportion to their contribution to the performance of the final federated model, with tokens used as a means of paying participants for their services and for the training infrastructure. / Doctor of Philosophy / Distributed deep learning is widely used to solve critical scientific problems with massive datasets. However, to accelerate scientific discovery, resource efficiency is also important when deploying on real-world systems such as high-performance computing (HPC) systems. Deploying existing deep learning applications on these distributed systems can leave HPC hardware resources under-utilized.
In addition, extreme resource heterogeneity has negative effects on distributed deep learning training. However, much of the prior work has not focused on the specific resource-utilization challenges of distributed deep learning on HPC systems and in heterogeneous federated settings. This dissertation addresses the challenges of improving the resource efficiency of distributed deep learning applications through performance analysis of deep learning on supercomputers, GPU-aware deep learning job scheduling, topology-aware virtual GPU training, and heterogeneity-aware adaptive federated learning scheduling and incentive algorithms.
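The heterogeneity-aware adaptive selection idea in Chapter 6 can be made concrete with a short sketch. The Python below is purely illustrative; the client fields, scoring rule, and the weight `alpha` are assumptions, not the dissertation's actual method.

```python
# Purely illustrative sketch of heterogeneity-aware participant selection for a
# federated learning round. All names and weights here are hypothetical.
from dataclasses import dataclass

@dataclass
class Client:
    cid: str
    profiled_speed: float        # e.g. samples/sec from resource-usage profiling
    recent_accuracy_gain: float  # accuracy improvement attributed to this client

def select_round(clients, k, alpha=0.5):
    """Pick k participants, trading device speed against accuracy benefit."""
    max_speed = max(c.profiled_speed for c in clients) or 1.0
    max_gain = max(c.recent_accuracy_gain for c in clients) or 1.0

    def score(c):
        # normalize both signals so the weighted sum is meaningful
        return (alpha * c.profiled_speed / max_speed
                + (1 - alpha) * c.recent_accuracy_gain / max_gain)

    return sorted(clients, key=score, reverse=True)[:k]

clients = [Client("gpu-node", 900.0, 0.02),
           Client("edge-1", 40.0, 0.30),
           Client("edge-2", 35.0, 0.01)]
print([c.cid for c in select_round(clients, k=2)])  # fast node plus a high-value slow one
```

Fast clients shorten synchronous rounds (resource efficiency), while clients whose updates recently improved the global model are favored for accuracy; `alpha` trades off the two goals.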
52

[en] AN INFRASTRUCTURE FOR DISTRIBUTED EXECUTION OF SOFTWARE COMPONENTS / [pt] UMA INFRA-ESTRUTURA PARA A EXECUÇÃO DISTRIBUÍDA DE COMPONENTES DE SOFTWARE

CARLOS EDUARDO LARA AUGUSTO 04 March 2009 (has links)
[pt] Infra-estruturas de suporte a sistemas baseados em componentes de software tipicamente incluem facilidades para instalação, execução e configuração dinâmica das dependências dos componentes de um sistema. Tais facilidades são especialmente importantes quando os componentes do sistema executam em um ambiente distribuído. Neste trabalho, investigamos alguns dos problemas que precisam ser tratados por infra-estruturas de execução de sistemas distribuídos baseados em componentes de software. Para realizar tal investigação, desenvolvemos um conjunto de serviços para o middleware OpenBus, com o intuito de prover facilidades para a execução de aplicações distribuídas. Para ilustrar e avaliar o uso dos serviços desenvolvidos, apresentamos alguns exemplos onde a infra-estrutura é utilizada para executar cenários de teste de uma aplicação distribuída. / [en] Support infrastructures for component-based software systems usually include facilities for installation, execution and dynamic configuration of the system components' dependencies. Such facilities are especially important when those system components execute in a distributed environment. In this work, we investigate some of the problems that must be handled by runtime infrastructures for distributed systems based on software components. To perform this investigation, we developed a set of services for the OpenBus middleware, aiming to provide facilities for the execution of distributed applications. To illustrate and evaluate the use of the developed services, we present some examples where the infrastructure is used to execute test scenarios of a distributed application.
53

Parameterized Verification and Synthesis for Distributed Agreement-Based Systems

Nouraldin Jaber (13796296) 19 September 2022 (has links)
Distributed agreement-based systems use common distributed agreement protocols such as leader election and consensus as building blocks for their target functionality: processes in these systems may need to agree on a leader, on the members of a group, on owners of locks, or on updates to replicated data. Such distributed agreement-based systems are common and potentially permit modular, scalable verification approaches that mimic their modular design. Interestingly, while there are many verification efforts that target agreement protocols themselves, little attention has been given to distributed agreement-based systems that build on top of these protocols.

In this work, we aim to develop a fully automated, modular, and usable parameterized verification approach for distributed agreement-based systems. To do so, we need to overcome the following challenges. First, the fully automated parameterized verification problem, i.e., the problem of algorithmically checking whether the system is correct for any number of processes, is a well-known undecidable problem. Second, to enable modular verification that leverages the inherently modular nature of these agreement-based systems, we need to support abstractions of agreement protocols. Such abstractions can replace the agreement protocols' implementations when verifying the overall system, enabling modular reasoning. Finally, even when the verification is fully automated, a system designer still needs assistance in modeling their distributed agreement-based systems.

We systematically tackle these challenges through the following contributions.

First, we support efficient, decidable verification of distributed agreement-based systems by developing a computational model, the GSP model, for reasoning about distributed (agreement-based) systems that admits decidability and cutoff results. Cutoff results enable practical verification by reducing the parameterized verification problem to the verification problem of a system with a fixed, finite number of processes. The GSP model supports generalized communication primitives and global guards, both of which are essential to enable abstractions of agreement protocols.

Then, we address the usability and modularity aspects by developing a framework, QuickSilver, tailored for modeling and modular parameterized verification of distributed agreement-based systems. QuickSilver provides an intuitive domain-specific language, called Mercury, that is equipped with two agreement primitives capable of abstracting away agreement protocols when modeling agreement-based systems, enabling modular verification. QuickSilver extends the decidability and cutoff results of the GSP model to provide fully automated, efficient parameterized verification for a large class of systems modeled in Mercury.

Finally, we leverage synthesis techniques to further enhance the usability of our approach and propose Cinnabar, a tool that supports synthesis of distributed agreement-based systems with efficiently decidable parameterized verification. Cinnabar allows a system designer to provide a sketch of their Mercury model and uses a counterexample-guided synthesis procedure to search for model completions that both belong to the efficiently decidable fragment of Mercury and are correct.

We evaluate our contributions on various distributed agreement-based systems adapted from real-world applications, such as a data store, a lock service, a surveillance system, a pathfinding algorithm for mobile robots, and more.
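The role of a cutoff result can be illustrated with a small sketch: instead of reasoning about every instance size, the property is checked for each size up to the cutoff. The Python below is illustrative only; the toy token-ring protocol, the cutoff value, and all names are assumptions and do not reflect the GSP model, Mercury, or QuickSilver.

```python
# Purely illustrative sketch of how a cutoff is used: check the property only
# for instance sizes 1..CUTOFF. The protocol and cutoff here are hypothetical.

def reachable_states(n):
    """Explicit-state search over a token ring with n processes.
    A state is (token_holder, tuple of in_critical_section flags)."""
    init = (0, tuple(False for _ in range(n)))
    seen, frontier = {init}, [init]
    while frontier:
        holder, in_cs = frontier.pop()
        succs = []
        if not in_cs[holder]:
            entered = list(in_cs)
            entered[holder] = True
            succs.append((holder, tuple(entered)))   # holder enters its critical section
            succs.append(((holder + 1) % n, in_cs))  # or passes the token along
        else:
            left = list(in_cs)
            left[holder] = False
            succs.append((holder, tuple(left)))      # holder leaves the critical section
        for s in succs:
            if s not in seen:
                seen.add(s)
                frontier.append(s)
    return seen

def mutual_exclusion_holds(n):
    return all(sum(in_cs) <= 1 for _, in_cs in reachable_states(n))

CUTOFF = 4  # hypothetical cutoff: correctness for sizes 1..4 would imply correctness for all n
print(all(mutual_exclusion_holds(n) for n in range(1, CUTOFF + 1)))  # True
```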
54

InDiGo: an infrastructure for optimization of distributed algorithms

Kolesnikov, Valeriy January 1900 (has links)
Doctor of Philosophy / Department of Computing and Information Sciences / Gurdip Singh / Many frameworks have been proposed that provide distributed algorithms encapsulated as middleware services to simplify application design. The developers of such algorithms are faced with two opposing forces. One is to design generic algorithms that are reusable in a large number of applications. Efficiency considerations, on the other hand, force the algorithms to be customized to specific operational contexts. This problem is often attacked by simply re-implementing all or large portions of an algorithm. We propose InDiGO, an infrastructure that allows the design of generic but customizable algorithms and provides tools to customize such algorithms for specific applications. InDiGO provides the following capabilities: (a) tools to generate intermediate representations of an application that can be leveraged for analysis, (b) mechanisms that allow developers to design customizable algorithms by exposing design knowledge in terms of configurable options, and (c) an optimization engine that analyzes an application to derive the information necessary to optimize the algorithms. Specifically, we optimize algorithms by removing communication that is redundant in the context of a specific application. We perform three types of optimizations: static optimization, dynamic optimization and physical topology-based optimization. We present experimental results to demonstrate the advantages of our infrastructure.
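The flavor of the static optimization described above (removing messages that no receiver consumes in a particular application) can be sketched as follows. This Python fragment is illustrative only; the data structures and names are hypothetical and are not InDiGO's actual intermediate representation or interfaces.

```python
# Purely illustrative sketch of a static optimization that prunes redundant
# communication from a generic algorithm. Names and structures are hypothetical.
def prune_redundant_sends(generic_sends, handled_types):
    """generic_sends: {sender: set of (receiver, msg_type)} emitted by the generic algorithm.
    handled_types:  {process: set of msg_type the application actually handles},
    derived offline from an intermediate representation of the application."""
    return {
        sender: {(recv, msg) for recv, msg in sends
                 if msg in handled_types.get(recv, set())}  # drop messages nobody consumes
        for sender, sends in generic_sends.items()
    }

# Example: process C never handles "heartbeat", so sends of it to C are removed.
generic = {"A": {("B", "heartbeat"), ("C", "heartbeat"), ("C", "commit")}}
handled = {"B": {"heartbeat"}, "C": {"commit"}}
print(prune_redundant_sends(generic, handled))  # keeps (B, heartbeat) and (C, commit) only
```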
55

REAL-TIME TENA-ENABLED DATA GATEWAY

Achtzehnter, Joachim, Hauck, Preston 10 1900 (has links)
International Telemetering Conference Proceedings / October 18-21, 2004 / Town & Country Resort, San Diego, California / This paper describes the TENA architecture, which has been proposed by the Foundation Initiative 2010 (FI 2010) project as the basis for future US Test Range software systems. The benefits of this new architecture are explained by comparing the future TENA-enabled range infrastructure with the current situation of largely non-interoperable range resources. Legacy equipment and newly acquired off-the-shelf equipment that does not directly support TENA can be integrated into a TENA environment using TENA Gateways. This paper focuses on issues related to the construction of such gateways, including the important issue of real-time requirements when dealing with real-world data acquisition instruments. The benefits of leveraging commercial off-the-shelf (COTS) Data Acquisition Systems that are based on true real-time operating systems are discussed in the context of TENA Gateway construction.
56

TCP PERFORMANCE ENHANCEMENT OVER IRIDIUM

Torgerson, Leigh, Hutcherson, Joseph, McKelvey, James 10 1900 (has links)
ITC/USA 2007 Conference Proceedings / The Forty-Third Annual International Telemetering Conference and Technical Exhibition / October 22-25, 2007 / Riviera Hotel & Convention Center, Las Vegas, Nevada / In support of iNET maturation, NASA-JPL has collaborated with NASA-Dryden to develop, test and demonstrate an over-the-horizon vehicle-to-ground networking capability, using Iridium as the vehicle-to-ground communications link for relaying critical vehicle telemetry. To ensure reliability concerns are met, the Space Communications Protocol Standards (SCPS) transport protocol was investigated for its performance characteristics in this environment. In particular, the SCPS-TP software performance was compared to that of the standard Transmission Control Protocol (TCP) over the Internet Protocol (IP). This paper will report on the results of this work.
57

IMPROVING REAL-TIME LATENCY PERFORMANCE ON COTS ARCHITECTURES

Bono, John, Hauck, Preston 10 1900 (has links)
International Telemetering Conference Proceedings / October 20-23, 2003 / Riviera Hotel and Convention Center, Las Vegas, Nevada / Telemetry systems designed to support the current needs of mission-critical applications often have stringent real-time requirements. These systems must guarantee a maximum worst-case processing and response time when incoming data is received. These real-time tolerances continue to tighten as data rates increase. At the same time, end-user requirements for COTS pricing efficiencies have forced many telemetry systems to now run on desktop operating systems like Windows or Unix. While these desktop operating systems offer advanced user interface capabilities, they cannot meet the real-time requirements of many mission-critical telemetry applications. Furthermore, attempts to enhance desktop operating systems to support real-time constraints have met with only limited success. This paper presents a telemetry system architecture that offers real-time guarantees while extensively leveraging inexpensive COTS hardware and software components. This is accomplished by partitioning the telemetry system onto two processors. The first processor is a NetAcquire subsystem running a real-time operating system (RTOS). The second processor runs a desktop operating system that hosts the user interface. The two processors are connected by a high-speed Ethernet IP internetwork. This architecture affords an improvement of two orders of magnitude over the real-time performance of a standalone desktop operating system.
58

Investigating performance and energy efficiency on a private cloud

Smith, James William January 2014 (has links)
Organizations are turning to private clouds due to concerns about security, privacy and administrative control. They are attracted by the flexibility and other advantages of cloud computing but are wary of breaking decades-old institutional practices and procedures. Private clouds can help to alleviate these concerns by retaining security policies and in-organization ownership, and by providing increased accountability compared with public services. This work investigates how it may be possible to develop an energy-aware private cloud system able to adapt workload allocation strategies so that overall energy consumption is reduced without loss of performance or dependability. Current literature focuses on consolidation as a method for improving the energy efficiency of cloud systems, but if consolidation is undesirable due to performance penalties, dependability or latency, then another approach is required. Given a private cloud in which the set of machines is fixed, with no machines being powered down in response to changing workloads, and a set of virtual machines to run, each with different characteristics and profiles, it is possible to vary the virtual machine placement to reduce energy consumption or improve VM performance. Through a series of experiments this work demonstrates that workload mixes can have an effect on energy consumption and the performance of applications running inside virtual machines. These experiments took the form of measuring the performance and energy usage of applications running inside virtual machines. The arrangement of these virtual machines on their hosts was varied to determine the effect of different workload mixes. The insights from these experiments have been used to create a proof-of-concept custom VM Allocator system for the OpenStack private cloud computing platform. Using CloudMonitor, a lightweight monitoring application that gathers data on system performance and energy consumption, the implementation uses a holistic view of the private cloud's state to inform workload placement decisions.
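As a rough illustration of the kind of decision such an allocator makes, the sketch below scores candidate hosts using monitored power draw and the current workload mix. Everything here (field names, the weighting constant, the CPU/I-O classification) is an assumption for illustration and is not CloudMonitor's or OpenStack's actual API.

```python
# Purely illustrative sketch of an energy-aware, workload-mix-aware VM placement
# decision. Host/VM fields and the scoring heuristic are hypothetical.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    watts: float            # current power draw reported by monitoring
    cpu_bound_vms: int = 0  # simple proxies for the current workload mix
    io_bound_vms: int = 0

def place(vm_profile, hosts):
    """Prefer hosts where the new VM complements the existing workload mix,
    so co-located VMs contend less and the marginal energy cost is lower."""
    def score(h):
        # mixing CPU-bound and I/O-bound VMs tends to reduce contention
        same_kind = h.cpu_bound_vms if vm_profile == "cpu" else h.io_bound_vms
        return h.watts + 10.0 * same_kind  # weight is an assumed tuning constant
    return min(hosts, key=score)

hosts = [Host("node-1", 180.0, cpu_bound_vms=3), Host("node-2", 195.0, io_bound_vms=2)]
print(place("cpu", hosts).name)  # -> node-2: fewer CPU-bound neighbours
```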
59

Improving the performance of distributed multi-agent based simulation

Mengistu, Dawit January 2011 (has links)
This research investigates approaches to improve the performance of multi-agent based simulation (MABS) applications executed in distributed computing environments. MABS is a type of micro-level simulation used to study dynamic systems consisting of interacting entities, and in some cases the number of simulated entities can be very large. Most of the existing publicly available MABS tools are single-threaded desktop applications that are not suited for distributed execution. For this reason, general-purpose multi-agent platforms with multi-threading support are sometimes used for deploying MABS on distributed resources. However, these platforms do not scale well for large simulations due to huge communication overheads. In this research, different strategies to deploy large-scale MABS in distributed environments are explored, e.g., tuning existing multi-agent platforms, porting single-threaded MABS tools to distributed environments, and implementing a service-oriented architecture (SOA) deployment model. Although the factors affecting the performance of distributed applications are well known, the relative significance of these factors depends on the architecture of the application and the behaviour of the execution environment. We developed mathematical performance models to understand the influence of these factors and to analyze the execution characteristics of MABS. These performance models are then used to formulate algorithms for resource management and application tuning decisions. The most important performance improvement solutions achieved in this thesis include: predictive estimation of optimal resource requirements, heuristics for agent reallocation that reduce communication overhead, and an optimistic synchronization algorithm to minimize time management overhead. Additional application tuning techniques such as agent directory caching and message aggregation for fine-grained simulations are also proposed. These solutions were experimentally validated in different types of distributed computing environments. Another contribution of this research is that all improvement measures proposed in this work are implemented at the application level. It is often the case that the improvement measures should not affect the configuration of the computing and communication resources on which the application runs. Such application-level optimizations are useful for application developers and users who have limited access to remote resources and lack authorization to carry out resource-level optimizations.
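One of the proposed tuning techniques, message aggregation for fine-grained simulations, can be sketched as follows. The snippet is illustrative only; the class and method names are hypothetical and are not taken from the thesis.

```python
# Purely illustrative sketch of message aggregation for fine-grained MABS: many
# small agent messages bound for the same remote node are batched into one
# network send to cut per-message overhead. All names are hypothetical.
from collections import defaultdict

class AggregatingOutbox:
    def __init__(self, transport, flush_threshold=64):
        self.transport = transport        # object with a send(node, payload) method
        self.flush_threshold = flush_threshold
        self.buffers = defaultdict(list)  # remote node -> pending agent messages

    def post(self, remote_node, agent_msg):
        buf = self.buffers[remote_node]
        buf.append(agent_msg)
        if len(buf) >= self.flush_threshold:
            self.flush(remote_node)

    def flush(self, remote_node):
        if self.buffers[remote_node]:
            # one network message carries many agent-level messages
            self.transport.send(remote_node, list(self.buffers[remote_node]))
            self.buffers[remote_node].clear()

    def flush_all(self):                  # e.g. at the end of a simulation step
        for node in list(self.buffers):
            self.flush(node)
```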
60

Development and Validation of Distributed Reactive Control Systems/Développement et Validation de Systèmes de Contrôle Reactifs Distribués

Meuter, Cédric 14 March 2008 (has links)
A reactive control system is a computer system that reacts to stimuli emitted by its environment in order to maintain that environment in a desired state. Distributed reactive control systems are generally composed of several processes, running in parallel on one or more computers, communicating with one another to perform the required control task. By their very nature, distributed reactive control systems are hard to design. Their distributed nature and/or the communication scheme used can introduce subtle unforeseen behaviours. When dealing with critical applications, such as plane control systems or traffic light control systems, those unintended behaviours can have disastrous consequences. It is therefore essential for the designer to ensure that this does not happen. For that purpose, rigorous and systematic techniques can (and should) be applied as early as possible in the development process. In that spirit, this work aims at providing the designer with the necessary tools to facilitate the development and validation of such distributed reactive control systems. In particular, we show how a dedicated language called dSL (Distributed Supervision language) can be used to ease the development process. We also study how validation techniques such as model-checking and testing can be applied in this context.
