Global ETD Search

41	Towards a Resource Efficient Framework for Distributed Deep Learning Applications Han, Jingoo 24 August 2022 Distributed deep learning has achieved tremendous success in solving scientific problems in research and discovery over the past years. Deep learning training is challenging because it requires training on massive datasets, especially with graphics processing units (GPUs) in the latest high-performance computing (HPC) systems. HPC architectures show different trends in training throughput from those reported in existing studies. Multiple GPUs and high-speed interconnects are used for distributed deep learning on HPC systems. Extant distributed deep learning systems are designed for non-HPC systems without considering efficiency, leading to under-utilization of expensive HPC hardware. In addition, increasing resource heterogeneity has a negative effect on resource efficiency in distributed deep learning methods, including federated learning. Thus, it is important to address the increasing demand for both high performance and high resource efficiency in distributed deep learning systems, including the latest HPC systems and federated learning systems. In this dissertation, we explore and design novel methods and frameworks to improve the resource efficiency of distributed deep learning training. We address the following five topics: performance analysis of deep learning on supercomputers, GPU-aware deep learning job scheduling, topology-aware virtual GPU training, heterogeneity-aware adaptive scheduling, and a token-based incentive algorithm. In the first part (Chapter 3), we analyze performance trends of distributed deep learning on the latest HPC systems, such as the Summitdev supercomputer at Oak Ridge National Laboratory. 
We provide insights by conducting a comprehensive performance study of how deep learning workloads affect the performance of HPC systems with large-scale parallel processing capabilities. In the second part (Chapter 4), we design and develop a novel deep learning job scheduler, MARBLE, which accounts for GPU resource efficiency based on the non-linear scalability of GPUs in a single node and improves GPU utilization by sharing GPUs among multiple deep learning training workloads. The third part of this dissertation (Chapter 5) proposes a topology-aware virtual GPU training system, TOPAZ, specifically designed for distributed deep learning on recent HPC systems. In the fourth part (Chapter 6), we explore a holistic federated learning scheduling approach that employs a heterogeneity-aware adaptive selection method to improve resource efficiency and accuracy, coupled with resource usage profiling and accuracy monitoring to achieve multiple goals. In the fifth part of this dissertation (Chapter 7), we focus on how to provide incentives to participants according to their contribution toward a high-performing final federated model, with tokens used as a means of paying for the services of the participants and the training infrastructure. / Doctor of Philosophy / Distributed deep learning is widely used for solving critical scientific problems with massive datasets. However, to accelerate scientific discovery, resource efficiency is also important for deployment on real-world systems, such as high-performance computing (HPC) systems. Deployment of existing deep learning applications on these distributed systems may lead to underutilization of HPC hardware resources. In addition, extreme resource heterogeneity has negative effects on distributed deep learning training. 
However, much of the prior work has not focused on the specific challenges of optimizing resource utilization in distributed deep learning, including on HPC systems and heterogeneous federated systems. This dissertation addresses the challenges in improving the resource efficiency of distributed deep learning applications through performance analysis of deep learning on supercomputers, GPU-aware deep learning job scheduling, topology-aware virtual GPU training, and heterogeneity-aware adaptive federated learning scheduling and incentive algorithms. Deep Learning Federated Learning HPC Distributed Systems
42	[en] AN INFRASTRUCTURE FOR DISTRIBUTED EXECUTION OF SOFTWARE COMPONENTS / [pt] UMA INFRA-ESTRUTURA PARA A EXECUÇÃO DISTRIBUÍDA DE COMPONENTES DE SOFTWARE CARLOS EDUARDO LARA AUGUSTO 04 March 2009 [en] Support infrastructures for component-based software systems usually include facilities for installation, execution and dynamic configuration of the dependencies of a system's components. Such facilities are especially important when those components execute in a distributed environment. In this work, we investigate some of the problems that must be handled by runtime infrastructures for distributed systems based on software components. To carry out this investigation, we developed a set of services for the OpenBus middleware, aiming to provide facilities for the execution of distributed applications. To illustrate and evaluate the use of the developed services, we present some examples in which the infrastructure is used to execute test scenarios of a distributed application. [en] DISTRIBUTED SYSTEMS [en] MIDDLEWARE [en] DISTRIBUTED SYSTEMS DEPLOYMENT
43	REAL-TIME TENA-ENABLED DATA GATEWAY Achtzehnter, Joachim, Hauck, Preston 10 1900 International Telemetering Conference Proceedings / October 18-21, 2004 / Town & Country Resort, San Diego, California / This paper describes the TENA architecture, which has been proposed by the Foundation Initiative 2010 (FI 2010) project as the basis for future US Test Range software systems. The benefits of this new architecture are explained by comparing the future TENA-enabled range infrastructure with the current situation of largely non-interoperable range resources. Legacy equipment and newly acquired off-the-shelf equipment that does not directly support TENA can be integrated into a TENA environment using TENA Gateways. This paper focuses on issues related to the construction of such gateways, including the important issue of real-time requirements when dealing with real-world data acquisition instruments. The benefits of leveraging commercial off-the-shelf (COTS) Data Acquisition Systems that are based on true real-time operating systems are discussed in the context of TENA Gateway construction. TENA Gateway Data Acquisition Real Time Distributed Systems
44	TCP PERFORMANCE ENHANCEMENT OVER IRIDIUM Torgerson, Leigh, Hutcherson, Joseph, McKelvey, James 10 1900 ITC/USA 2007 Conference Proceedings / The Forty-Third Annual International Telemetering Conference and Technical Exhibition / October 22-25, 2007 / Riviera Hotel & Convention Center, Las Vegas, Nevada / In support of iNET maturation, NASA-JPL has collaborated with NASA-Dryden to develop, test and demonstrate an over-the-horizon vehicle-to-ground networking capability, using Iridium as the vehicle-to-ground communications link for relaying critical vehicle telemetry. To ensure reliability concerns are met, the Space Communications Protocol Standards (SCPS) transport protocol was investigated for its performance characteristics in this environment. In particular, the SCPS-TP software performance was compared to that of the standard Transmission Control Protocol (TCP) over the Internet Protocol (IP). This paper will report on the results of this work. Network telemetry distributed systems disruption tolerant networking airborne science
45	IMPROVING REAL-TIME LATENCY PERFORMANCE ON COTS ARCHITECTURES Bono, John, Hauck, Preston 10 1900 International Telemetering Conference Proceedings / October 20-23, 2003 / Riviera Hotel and Convention Center, Las Vegas, Nevada / Telemetry systems designed to support the current needs of mission-critical applications often have stringent real-time requirements. These systems must guarantee a maximum worst-case processing and response time when incoming data is received. These real-time tolerances continue to tighten as data rates increase. At the same time, end user requirements for COTS pricing efficiencies have forced many telemetry systems to now run on desktop operating systems like Windows or Unix. While these desktop operating systems offer advanced user interface capabilities, they cannot meet the real-time requirements of many mission-critical telemetry applications. Furthermore, attempts to enhance desktop operating systems to support real-time constraints have met with only limited success. This paper presents a telemetry system architecture that offers real-time guarantees while at the same time extensively leveraging inexpensive COTS hardware and software components. This is accomplished by partitioning the telemetry system onto two processors. The first processor is a NetAcquire subsystem running a real-time operating system (RTOS). The second processor runs a desktop operating system that hosts the user interface. The two processors are connected with a high-speed Ethernet IP internetwork. This architecture affords an improvement of two orders of magnitude over the real-time performance of a standalone desktop operating system. Telemetry COTS Real Time Low Latency Deterministic Distributed Systems
46	Investigating performance and energy efficiency on a private cloud Smith, James William January 2014 Organizations are turning to private clouds due to concerns about security, privacy and administrative control. They are attracted by the flexibility and other advantages of cloud computing but are wary of breaking decades-old institutional practices and procedures. Private clouds can help to alleviate these concerns by retaining security policies and in-organization ownership and by providing increased accountability when compared with public services. This work investigates how it may be possible to develop an energy-aware private cloud system able to adapt workload allocation strategies so that overall energy consumption is reduced without loss of performance or dependability. Current literature focuses on consolidation as a method for improving the energy efficiency of cloud systems, but if consolidation is undesirable due to performance penalties, dependability concerns or latency, then another approach is required. Given a private cloud in which the set of machines is constant, with no machines being powered down in response to changing workloads, and a set of virtual machines to run, each with different characteristics and profiles, it is possible to mix the virtual machine placement to reduce energy consumption or improve the performance of the VMs. Through a series of experiments this work demonstrates that workload mixes can have an effect on energy consumption and on the performance of applications running inside virtual machines. These experiments took the form of measuring the performance and energy usage of applications running inside virtual machines. The arrangement of these virtual machines on their hosts was varied to determine the effect of different workload mixes. The insights from these experiments have been used to create a proof-of-concept custom VM Allocator system for the OpenStack private cloud computing platform. 
Using CloudMonitor, a lightweight monitoring application to gather data on system performance and energy consumption, the implementation uses a holistic view of the private cloud state to inform workload placement decisions. 004.165
47	Improving the performance of distributed multi-agent based simulation Mengistu, Dawit January 2011 This research investigates approaches to improve the performance of multi-agent based simulation (MABS) applications executed in distributed computing environments. MABS is a type of micro-level simulation used to study dynamic systems consisting of interacting entities, and in some cases, the number of simulated entities can be very large. Most of the existing publicly available MABS tools are single-threaded desktop applications that are not suited for distributed execution. For this reason, general-purpose multi-agent platforms with multi-threading support are sometimes used for deploying MABS on distributed resources. However, these platforms do not scale well for large simulations due to huge communication overheads. In this research, different strategies to deploy large-scale MABS in distributed environments are explored, e.g., tuning existing multi-agent platforms, porting single-threaded MABS tools to distributed environments, and implementing a service-oriented architecture (SOA) deployment model. Although the factors affecting the performance of distributed applications are well known, the relative significance of the factors depends on the architecture of the application and the behaviour of the execution environment. We developed mathematical performance models to understand the influence of these factors and to analyze the execution characteristics of MABS. These performance models are then used to formulate algorithms for resource management and application tuning decisions. The most important performance improvement solutions achieved in this thesis include: predictive estimation of optimal resource requirements, heuristics for agent reallocation to reduce communication overhead, and an optimistic synchronization algorithm to minimize time management overhead. 
Additional application tuning techniques such as agent directory caching and message aggregations for fine-grained simulations are also proposed. These solutions were experimentally validated in different types of distributed computing environments. Another contribution of this research is that all improvement measures proposed in this work are implemented on the application level. It is often the case that the improvement measures should not affect the configuration of the computing and communication resources on which the application runs. Such application level optimizations are useful for application developers and users who have limited access to remote resources and lack authorization to carry out resource level optimizations. agent based simulation MABS distributed systems application performance
48	Development and Validation of Distributed Reactive Control Systems/Développement et Validation de Systèmes de Contrôle Reactifs Distribués Meuter, Cédric 14 March 2008 A reactive control system is a computer system reacting to certain stimuli emitted by its environment in order to maintain it in a desired state. Distributed reactive control systems are generally composed of several processes, running in parallel on one or more computers, communicating with one another to perform the required control task. By their very nature, distributed reactive control systems are hard to design. Their distributed nature and/or the communication scheme used can introduce subtle unforeseen behaviours. When dealing with critical applications, such as plane control systems or traffic light control systems, those unintended behaviours can have disastrous consequences. It is therefore essential for the designer to ensure that this does not happen. For that purpose, rigorous and systematic techniques can (and should) be applied as early as possible in the development process. In that spirit, this work aims at providing the designer with the necessary tools to facilitate the development and validation of such distributed reactive control systems. In particular, we show how a dedicated language called dSL (Distributed Supervision Language) can be used to ease the development process. We also study how validation techniques such as model-checking and testing can be applied in this context. temporal logics distributed systems control testing model checking formal methods
49	Enabling and Achieving Self-Management for Large Scale Distributed Systems : Platform and Design Methodology for Self-Management Al-Shishtawy, Ahmad January 2010 Autonomic computing is a paradigm that aims at reducing administrative overhead by using autonomic managers to make applications self-managing. To better deal with large-scale dynamic environments, and to improve scalability, robustness, and performance, we advocate distributing management functions among several cooperative autonomic managers that coordinate their activities in order to achieve management objectives. Programming autonomic management in turn requires programming environment support and higher-level abstractions to become feasible. In this thesis we present an introductory part and a number of papers that summarize our work in the area of autonomic computing. We focus on enabling and achieving self-management for large-scale and/or dynamic distributed applications. We start by presenting our platform, called Niche, for programming self-managing component-based distributed applications. Niche supports a network-transparent view of the system architecture, simplifying the design of application self-* code. Niche provides a concise and expressive API for self-* code. The implementation of the framework relies on the scalability and robustness of structured overlay networks. We have also developed a distributed file storage service, called YASS, to illustrate and evaluate Niche. After introducing Niche, we proceed by presenting a methodology and design space for designing the management part of a distributed self-managing application in a distributed manner. We define design steps that include partitioning of management functions and orchestration of multiple autonomic managers. 
We illustrate the proposed design methodology by applying it to the design and development of an improved version of our distributed storage service YASS as a case study. We continue by presenting a generic policy-based management framework which has been integrated into Niche. Policies are sets of rules that govern the system behaviors and reflect the business goals or system management objectives. Policy-based management is introduced to simplify management and reduce overhead by setting up policies to govern system behaviors. A prototype of the framework is presented and two generic policy languages (policy engines and corresponding APIs), namely SPL and XACML, are evaluated using our self-managing file storage application YASS as a case study. Finally, we present a generic approach to achieving robust services that is based on finite state machine replication with dynamic reconfiguration of replica sets. We contribute a decentralized algorithm that maintains the set of resources hosting service replicas in the presence of churn. We use this approach to implement robust management elements as robust services that can operate despite churn. Autonomic Computing Self-Management Distributed Systems Computer science
50	An Efficient Computation of Convex Closure on Abstract Events Bedasse, Dwight Samuel January 2005 The behaviour of distributed applications can be modeled as the occurrence of events and how these events relate to each other. Event data collected according to this event model can be visualized using process-time diagrams that are constructed from a collection of traces and events. One of the main characteristics of a distributed system is the large number of events that are involved, especially in practical situations. This large number of events, and hence large process-time diagrams, makes distributed-system observation difficult for the user. However, event-predicate detection, a search mechanism able to detect and locate arbitrary predicates within a process-time diagram or event collection, can help the user to make sense of this large amount of data. Ping Xie used the convex-abstract event concept, developed by Thomas Kunz, to search for hierarchical event predicates. However, his algorithm for computing convex closure to construct compound events, and especially hierarchical compound events (i.e., compound events that contain other compound events), is inefficient. In one case it took, on average, close to four hours to search the collection of event data for a specific hierarchical event predicate. In another case, it took nearly one hour. This dissertation discusses an efficient algorithm, an extension of Ping Xie's algorithm, that employs a caching scheme to build compound and hierarchical compound events based on matched sub-patterns. In both cases cited above, the new execution times were reduced by over 94%. They now take, on average, less than four minutes. Electrical & Computer Engineering Event-detection distributed systems
