Multi-Agent-Based Collaborative Machine Learning in Distributed Resource Environments
Ahmad Esmaeili (19153444), 18 July 2024
<p dir="ltr">This dissertation presents decentralized and agent-based solutions for organizing machine learning resources, such as datasets and learning models. It aims to democratize the analysis of these resources through a simple yet flexible query structure, automate common ML tasks such as training, testing, model selection, and hyperparameter tuning, and enable privacy-centric building of ML models over distributed datasets. Based on networked multi-agent systems, the proposed approach represents ML resources as autonomous and self-reliant entities. This representation makes the resources easily movable, scalable, and independent of geographical locations, alleviating the need for centralized control and management units. Additionally, as all machine learning and data mining tasks are conducted near their resources, providers can apply customized rules independently of other parts of the system.</p>
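A minimal sketch of the core idea above: an ML resource (here, a dataset) wrapped as a self-reliant agent that answers queries locally, so raw data never crosses the agent boundary. All names (`ResourceAgent`, `handle_query`) and the query vocabulary are illustrative assumptions, not the dissertation's actual interfaces.

```python
# Hypothetical sketch: a dataset represented as an autonomous agent.
# Only aggregate answers leave the agent; the records stay private.

class ResourceAgent:
    def __init__(self, name, records):
        self.name = name
        self._records = records  # private: never returned directly

    def handle_query(self, query):
        # A simple, flexible query structure: aggregate queries only.
        if query == "count":
            return len(self._records)
        if query == "mean":
            return sum(self._records) / len(self._records)
        raise ValueError(f"unsupported query: {query}")

# Two geographically separate providers, queried without central control.
agents = [ResourceAgent("site-A", [1.0, 2.0, 3.0]),
          ResourceAgent("site-B", [4.0, 6.0])]
total = sum(a.handle_query("count") for a in agents)
```

Because each agent enforces its own query policy, a provider can apply customized rules (e.g., refusing certain queries) independently of the rest of the system.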
Resource-Aware Decentralized Federated Learning over Heterogeneous Networks
Shahryar Zehtabi (19833777), 20 November 2024
<p dir="ltr">A recent emphasis of distributed learning research has been on federated learning (FL), in which model training is conducted by the data-collecting devices. In traditional FL algorithms, trained models at the edge are periodically sent to a central server for aggregation, using a star topology as the underlying communication graph. However, assuming access to a central coordinator is not always practical, e.g., in ad hoc wireless network settings, motivating efforts to fully decentralize FL. Decentralized federated learning (DFL) thus captures FL settings where both (i) model updates and (ii) model aggregations are carried out exclusively by the clients, without a central server. Inherent challenges of distributed FL training, namely data heterogeneity and resource heterogeneity, become even more pronounced in DFL, since there is no central server to act as a coordinator. In this thesis, we present two algorithms for resource-aware DFL that achieve a desired overall performance across the clients in a shorter amount of time than conventional DFL algorithms, which do not factor the resource availability of clients into their approaches.</p><p dir="ltr">In the first project, we propose EF-HC, a novel methodology for distributed model aggregations via asynchronous, event-triggered consensus iterations over the network graph topology. We consider personalized/heterogeneous communication event thresholds at each device that weigh the change in local model parameters against the available local resources in deciding whether an aggregation would be beneficial enough to justify incurring a communication delay on the system. In the second project, we propose Decentralized Sporadic Federated Learning (DSpodFL), a DFL methodology built on a generalized notion of sporadicity in both the local gradient and aggregation processes. DSpodFL subsumes many existing decentralized optimization methods under a unified algorithmic framework by modeling the per-iteration (i) occurrence of gradient descent at each client and (ii) exchange of models between client pairs as arbitrary indicator random variables, thus capturing heterogeneous and time-varying computation/communication scenarios. We analytically characterize the convergence behavior of both algorithms for strongly convex models, using both constant and diminishing learning rates, under mild assumptions on the communication graph connectivity, data heterogeneity across clients, and gradient noise. For DSpodFL, we do the same for non-convex models as well. Our numerical experiments demonstrate that both EF-HC and DSpodFL consistently achieve improved training speeds compared with baselines under various system settings.</p>
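The sporadicity idea described above can be caricatured in a few lines: per-iteration Bernoulli indicators decide (i) whether each client takes a local gradient step and (ii) whether each pair of neighbors exchanges models. The quadratic losses, probabilities, line-graph topology, and mixing weight below are illustrative assumptions for a toy run, not the thesis's model or analysis.

```python
import random

# Toy sporadic DFL in the spirit of DSpodFL: indicator random variables
# gate both local gradient steps and pairwise model exchanges.
random.seed(0)
targets = [0.0, 2.0, 4.0]            # local minimizers; average is 2.0
x = [10.0, -5.0, 7.0]                # local models at three clients
lr, mix = 0.1, 0.5
edges = [(0, 1), (1, 2)]             # line-graph communication topology

for _ in range(500):
    # Sporadic local SGD: client i computes a gradient w.p. 0.7.
    for i in range(3):
        if random.random() < 0.7:
            x[i] -= lr * (x[i] - targets[i])
    # Sporadic aggregation: each edge exchanges models w.p. 0.5.
    for i, j in edges:
        if random.random() < 0.5:
            avg = (x[i] + x[j]) / 2
            x[i] += mix * (avg - x[i])
            x[j] += mix * (avg - x[j])

consensus_avg = sum(x) / 3           # should settle near 2.0
```

Even with skipped gradient steps and skipped exchanges, the network average drifts toward the minimizer of the average loss, which is the intuition the thesis makes rigorous under heterogeneous, time-varying computation and communication.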
FEDERATED LEARNING AMIDST DYNAMIC ENVIRONMENTS
Bhargav Ganguly (19119859), 08 November 2024
<p dir="ltr">Federated Learning (FL) is a prime example of a large-scale distributed machine learning framework that has emerged as a result of the exponential growth in data generation and processing capabilities on smart devices. This framework enables the efficient processing and analysis of vast amounts of data, leveraging the collective power of numerous devices to achieve unprecedented scalability and performance. In the FL framework, each end-user device trains a local model using its own data. Through the periodic synchronization of local models, FL achieves a global model that incorporates the insights from all participating devices. This global model can then be used for various applications, such as predictive analytics, recommendation systems, and more.</p><p dir="ltr">Despite its potential, traditional FL frameworks face significant hurdles in real-world applications. These challenges stem from two primary issues: the dynamic nature of data distributions and the efficient utilization of network resources in diverse settings. Traditional FL frameworks often rely on the assumption that data distributions remain stationary over time. However, real-world environments are inherently dynamic, with data distributions constantly evolving, which in turn becomes a potential source of <i>temporal</i> heterogeneity in FL. Another significant challenge in traditional FL frameworks is the efficient use of network resources in heterogeneous settings. Real-world networks consist of devices with varying computational capabilities, communication protocols, and network conditions. Traditional FL frameworks often struggle to adapt to these diverse, <i>spatially</i> heterogeneous settings, leading to inefficient use of network resources and increased latency.</p><p dir="ltr">The primary focus of this thesis is to investigate algorithmic frameworks that can mitigate the challenges posed by <i>temporal</i> and <i>spatial</i> system heterogeneities in FL. A significant source of <i>temporal</i> heterogeneity in FL is the dynamic drifting of client datasets over time, whereas <i>spatial</i> heterogeneities broadly subsume the diverse computational capabilities and network conditions of devices in a network. We introduce two novel FL frameworks: MASTER-FL, which addresses model staleness in the presence of <i>temporally</i> drifting datasets, and Cooperative Edge-Assisted Dynamic Federated Learning (CE-FL), which manages both <i>spatial</i> and <i>temporal</i> heterogeneities in extensive hierarchical FL networks. MASTER-FL is specifically designed to ensure that the global model remains accurate and up-to-date even in environments characterized by rapidly changing datasets. CE-FL, on the other hand, leverages server-side computing capabilities, intelligent data offloading, floating aggregation, and cooperative learning strategies to manage the diverse computational capabilities and network conditions often associated with modern FL systems.</p>
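One simple way to picture the staleness problem above is an aggregation rule that down-weights local models trained on older data snapshots. The sketch below is only an illustration of that general idea; the exponential decay factor and scalar "models" are assumptions, not MASTER-FL's actual weighting scheme.

```python
# Hedged sketch: staleness-aware weighted aggregation. A client whose
# model reflects an old data snapshot contributes less to the global model.

def aggregate(client_models, staleness, decay=0.5):
    """Weighted average of local models, down-weighting stale ones.

    client_models: local model parameters (scalars here for simplicity)
    staleness: rounds since each client last trained on fresh data
    decay: illustrative per-round down-weighting factor
    """
    weights = [decay ** s for s in staleness]
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, client_models)) / total

# Client 0 is fresh (weight 1.0); client 1 is two rounds stale (weight 0.25).
global_model = aggregate([1.0, 3.0], staleness=[0, 2])   # closer to client 0
```

The design choice illustrated here is that temporal heterogeneity is handled at aggregation time rather than by forcing all clients to resynchronize, which is one plausible reading of how a staleness-addressing framework can keep the global model current.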
Federated Learning for Connected and Automated Vehicles
Vishnu Chellapandi (11367891), 10 January 2025
<p dir="ltr">The subject of this dissertation is the development of Machine Learning (ML) algorithms and their applications in Connected and Automated Vehicles (CAVs). Decentralized ML algorithms, in the form of Federated Learning (FL), are developed so that multiple vehicles can collaboratively train models over noisy communication channels, enhancing performance while ensuring data privacy and security. Applications of FL for CAVs (FL4CAV) are analyzed, considering both centralized and decentralized FL frameworks along with the various data sources, models, and data security techniques relevant to FL in CAVs. Three innovative algorithms for Decentralized Federated Learning (DFL) that effectively handle noisy communication channels are proposed. Theoretical and experimental results demonstrate that the proposed algorithms, which share gradients through noisy channels instead of parameters, are more robust under noisy conditions than parameter-mixing algorithms. Building on this exploration of DFL, a novel decentralized noisy model update tracking algorithm is proposed to further enhance robustness and efficiency while addressing the impact of data heterogeneity. The proposed algorithm outperforms existing algorithms in handling imperfect information sharing. Expanding on these findings, applications of FL are proposed for heavy-duty diesel engines, which remain crucial due to their fuel efficiency and emissions characteristics. Finally, an FL algorithm that better predicts and controls aftertreatment temperature overshoots in real time is demonstrated.</p>
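The intuition behind the gradient-versus-parameter robustness claim above can be illustrated with a deliberately simplified toy: when updates cross a noisy channel, transmitted gradients enter the model scaled by the (small) learning rate, while transmitted parameters inject the channel noise at full strength. This single-aggregator caricature is an assumption for illustration only; the dissertation's algorithms and analysis are decentralized and far more general.

```python
import random

# Toy comparison: aggregate noisy gradients vs. noisy parameters.
random.seed(1)
noise_std, lr, steps = 1.0, 0.05, 1000
targets = [0.0, 4.0]                      # average minimizer: 2.0

def run(share_gradients):
    x = 8.0                               # shared global model
    err_acc = 0.0
    for t in range(steps):
        received = []
        for tgt in targets:
            if share_gradients:
                # each client sends its gradient over a noisy channel
                received.append((x - tgt) + random.gauss(0, noise_std))
            else:
                # each client takes a local step, sends noisy parameters
                received.append(x - lr * (x - tgt) + random.gauss(0, noise_std))
        if share_gradients:
            x -= lr * sum(received) / 2   # channel noise scaled by lr
        else:
            x = sum(received) / 2         # channel noise enters directly
        if t >= steps - 200:              # time-averaged steady-state error
            err_acc += abs(x - 2.0)
    return err_acc / 200

err_grad = run(True)
err_param = run(False)                    # noticeably larger error
```

Under these assumptions the gradient-sharing variant's steady-state error is roughly a factor of the learning rate smaller, matching the qualitative claim of the abstract.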
Automating Formal Verification of Distributed Systems via Property-Driven Reductions
Christopher Wagner (20817524), 05 March 2025
<p dir="ltr">Distributed protocols, with their immense state spaces and complex behaviors, have long been popular targets for formal verification. Cutoff reductions offer an enticing path for verifying parameterized distributed systems, which are composed of arbitrarily many processes. While parameterized verification (i.e., algorithmically checking the correctness of a system with an arbitrary number of processes) is generally undecidable, these reductions allow one to verify certain classes of parameterized systems by reducing verification of an infinite family of systems to that of a single finite instance. The finiteness of the resulting target system enables fully automated verification of the entire unbounded system family. In this work, we aim to establish pathways for automated verification via cutoff reductions that emphasize a modular approach to establishing correctness.</p><p dir="ltr">First, we consider distributed, agreement-based (DAB) systems: systems built on top of agreement protocols, such as consensus and leader election. While much attention has been paid to the correctness of the protocols themselves, relatively little consideration has been given to systems which utilize these protocols to achieve some higher-level functionality. To this end, we present the GSP model, a system model based on two types of globally-synchronous transitions: k-sender and k-maximal, the latter of which was introduced by this author. This model enables us to formalize systems built on distributed consensus and leader election, and to define conditions under which such systems may be verified automatically, despite the involvement of an arbitrary number of participant processes (a problem which is generally undecidable). Further, we identify conditions under which these systems can be verified efficiently and provide proofs of their correctness developed in part by this author. We then present QuickSilver, a user-friendly framework for designing and verifying parameterized DAB systems and, on this author's suggestion, lift the GSP decidability results to QuickSilver using this author's notion of "phase analysis," which identifies when the behavior of all processes in the system can be separated into sections of their control flow.</p><p dir="ltr">Next, we address verification of systems beyond agreement-based protocols. We find that, among parameterized systems, a class we refer to as star-networked systems has received limited attention as the subject of cutoff reductions. These systems combine heterogeneous client and server process definitions with both pairwise and broadcast communication, so they often fall outside the requirements of existing cutoff computations. We address these challenges with a novel cutoff reduction based on careful analysis of the interactions between a central process and an arbitrary number of peripheral client processes as they progress toward an error state. The key to our approach is identifying systems in which the central process coordinates primarily with a finite number of core client processes, and outside of such core clients, the system's progress can be enabled by a finite number of auxiliary clients.</p><p dir="ltr">Finally, we examine systems that are doubly-unbounded, in particular, parameterized DAB systems that additionally have unbounded data domains. We present a novel reduction which leverages value symmetry and a new notion of data saturation to reduce verification of doubly-unbounded DAB systems to model checking of small, finite-state systems. 
We also demonstrate that this domain reduction can be applied beyond DAB systems, including to star-networked systems.</p><p dir="ltr">We implement our reductions in several frameworks to enable efficient verification of sophisticated DAB and star-networked system models, including the arbitration mechanism for a consortium blockchain, a simple key-value store, and a lock server. We show that, by reducing the complexity of verification problems, cutoff reductions open up avenues for the application of a variety of verification techniques, including further reduction.</p>
Dissertation_Miller_Alexander_DTECH_5APR2023.pdf
Alexander Thomas Miller (15204598), 12 April 2023
<p>The advent and spread of counterfeit goods in medical supply chains is an opportunistic activity by actors who take advantage of information asymmetry between themselves and the rest of the supply chain. Counterfeiters, fraudsters, and companies that intend to cut corners have asymmetric knowledge of the (substandard) quality of the products they possess and are not obligated to share that knowledge; indeed, they have a disincentive to do so. This has led to a medical supply chain that is riddled with asymmetric information, from consumers all the way upstream to manufacturers. The asymmetric information present in the supply chain allows agents to take advantage of the demand and chaos in the system to act contrary to the best interest of the principal (in this case, the supply chain), as described in principal-agent theory. The problem of information asymmetry in principal-agent relationships, including those encapsulated in supply chains, is well documented in prior literature. The missing piece is research that quantifies information asymmetry metrics and uses them to assess the supply chain of goods.</p>
<p>This research explored current and proposed activities for mitigating information asymmetry, including potential applications of technology-based methods, such as distributed ledger technology, for reducing information asymmetry within medical supply chains. Five data aggregation services were searched for relevant literature, generating a final sample for analysis of 90 documents (n = 90). A qualitative meta-analysis methodology was conducted using NVivo as exploratory research to analyze the content of the corpus, extract key themes relevant to each research question, and then synthesize the frequencies of those themes, so that information asymmetry in medical supply chains can be decomposed into agents, conditions, and contributing factors.</p>
Analyses and Scalable Algorithms for Byzantine-Resilient Distributed Optimization
Kananart Kuwaranancharoen (16480956), 03 July 2023
<p>The advent of advanced communication technologies has given rise to large-scale networks comprised of numerous interconnected agents, which need to cooperate to accomplish various tasks, such as distributed message routing, formation control, robust statistical inference, and spectrum access coordination. These tasks can be formulated as distributed optimization problems, which require agents to agree on a parameter minimizing the average of their local cost functions by communicating only with their neighbors. However, distributed optimization algorithms are typically susceptible to malicious (or "Byzantine") agents that do not follow the algorithm. This thesis offers analyses and algorithms for such scenarios. As a malicious agent's function can be modeled as an unknown function with some fundamental properties, we begin in the first two parts by analyzing the region containing the potential minimizers of a sum of functions. Specifically, we explicitly characterize the boundary of this region for the sum of two unknown functions with certain properties. In the third part, we develop resilient algorithms that allow correctly functioning agents to converge to a region containing the true minimizer, under the assumption that each regular agent's function is convex. Finally, we present a general algorithmic framework that includes most state-of-the-art resilient algorithms. Under a strong convexity assumption, we derive a geometric rate of convergence of all regular agents to a ball around the optimal solution (whose size we characterize) for some algorithms within the framework.</p>
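A standard building block behind resilient algorithms of the kind discussed above is filtering extreme values before averaging: each regular agent discards the k largest and k smallest received values, bounding the influence of up to k Byzantine neighbors. The sketch below shows this coordinate-wise trimmed mean on scalars; it is an illustrative primitive, not the thesis's (more general) algorithms.

```python
# Hedged sketch: trimmed-mean aggregation tolerating up to k outliers.

def trimmed_mean(values, k):
    """Average after discarding the k smallest and k largest values."""
    s = sorted(values)
    core = s[k:len(s) - k]
    return sum(core) / len(core)

honest = [1.9, 2.0, 2.1, 2.05]        # honest neighbors near the truth (~2)
byzantine = [1000.0]                  # one adversarial outlier

estimate = trimmed_mean(honest + byzantine, k=1)   # stays near 2
naive = sum(honest + byzantine) / 5                # dragged far away
```

The contrast between `estimate` and `naive` is the whole point: a plain average gives a single Byzantine agent unbounded influence, while trimming confines the estimate to the range spanned by honest values.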
OFFLOADING NVME OVER FABRICS (NVME-OF) TO SMARTNICS ON AN AT-SCALE DISTRIBUTED TESTBED
Shoaib Basu (21194486), 01 May 2025
<p dir="ltr">The rapid growth of cloud deployments has made them foundational to modern computing infrastructure, enabling unprecedented scalability, flexibility, and resource sharing. However, traditional general-purpose CPUs are burdened with managing critical infrastructure functions such as storage, security, and networking. SmartNICs have emerged as a transformative technology, integrating specialized processing cores and accelerators directly on NIC hardware to offload and accelerate these essential functions. By handling complex tasks at line speed, SmartNICs enhance throughput, reduce latency, and alleviate CPU load, effectively addressing the limitations of conventional NICs in cloud environments. This work explores the offloading of NVMe over Fabrics (NVMe-oF) functionality to SmartNICs, investigating the performance implications of hardware-accelerated storage management.</p><p dir="ltr">The experimental approach of this study leverages SmartNIC technology to measure performance metrics directly indicative of CPU efficiency, including CPU interrupts, context switches, and I/O throughput, which demonstrate improved performance through offloading. To investigate whether these metrics indicate real-world performance gains for the CPU, this study benchmarks two database applications, PostgreSQL and MongoDB, measuring CPU performance via completed database transactions and average latency per operation. By comparing the offloaded and non-offloaded configurations, the increase in CPU efficiency due to SmartNIC offloading in distributed storage infrastructures is quantified. The number of database operations rises by at least 20% in the least efficient case, with a corresponding decrease of 10% in database transaction latency when the operations are offloaded. These results highlight the potential of SmartNICs in improving CPU performance, reducing transaction latency, and maximizing resource utilization, ultimately demonstrating their transformative impact on scalability and performance in modern cloud-native infrastructure. This study contributes to understanding the feasibility and benefits of SmartNIC deployment for NVMe-oF in cloud infrastructures, providing valuable insights into future cloud optimization strategies.</p>
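The offloaded-versus-non-offloaded comparison above boils down to computing relative changes in throughput and latency between two benchmark configurations. The sketch below shows that computation; the sample numbers are chosen to reproduce the reported 20% throughput gain and 10% latency reduction and are illustrative, not the study's actual measurements.

```python
# Hedged sketch: quantifying SmartNIC offload benefit from benchmark runs.

def relative_change(baseline, value):
    """Fractional change of `value` relative to `baseline`."""
    return (value - baseline) / baseline

# Illustrative numbers shaped to match the reported least-efficient case:
# +20% transactions per second, -10% mean latency when offloaded.
runs = {
    "non-offloaded": {"tps": 1000.0, "latency_ms": 10.0},
    "offloaded":     {"tps": 1200.0, "latency_ms": 9.0},
}

tps_gain = relative_change(runs["non-offloaded"]["tps"],
                           runs["offloaded"]["tps"])
lat_change = relative_change(runs["non-offloaded"]["latency_ms"],
                             runs["offloaded"]["latency_ms"])
```

In a real harness these figures would come from the database benchmark output (completed transactions and per-operation latency), alongside CPU-side counters such as interrupts and context switches.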
Scalable Parallel Machine Learning on High Performance Computing Systems–Clustering and Reinforcement Learning
Weijian Zheng (14226626), 08 December 2022
<p>High-performance computing (HPC) and machine learning (ML) have been widely adopted by both academia and industry to address enormous data problems at extreme scales. While research has reported on the interactions of HPC and ML, achieving high performance and scalability for parallel and distributed ML algorithms is still a challenging task. This dissertation first summarizes the major challenges of applying HPC to ML applications: 1) poor performance and scalability, 2) loss of the convergence rate, 3) lower quality of the trained model, and 4) a lack of performance optimization techniques designed for specific applications. These four challenges must be addressed anew in each ML application; this dissertation shows how to solve them for two specific ones: 1) a clustering algorithm and 2) graph optimization algorithms that use reinforcement learning (RL).</p>
<p>As to the clustering algorithm, we first propose the simulated-annealing enhanced clustering algorithm. By combining a blocked data layout with asynchronous local optimization within each thread, the simulated-annealing enhanced clustering algorithm has a convergence rate comparable to that of the K-means algorithm but with much higher performance. Experiments with synthetic and real-world datasets show that the simulated-annealing enhanced clustering algorithm is significantly faster than the MPI K-means library using up to 1024 cores. However, the optimization costs (Sum of Square Error, SSE) of the simulated-annealing enhanced clustering algorithm became higher than those of the original algorithm. To tackle this problem, we devise a new algorithm called the full-step feel-the-way clustering algorithm. In the full-step feel-the-way algorithm, there are L local steps within each block of data points, and we use the first local step's results to compute accurate global optimization costs. Our results show that the full-step algorithm can significantly reduce the global number of iterations needed to converge while obtaining low SSE costs. However, the time spent on the local steps outweighs the benefit of the saved iterations. To improve this performance, we next optimize the local-step time by incorporating a sampling-based method called reassignment-history-aware sampling. Extensive experiments with various synthetic and real-world datasets (e.g., MNIST, CIFAR-10, ENRON, and PLACES-2) show that our parallel algorithms can outperform the fastest open-source MPI K-means implementation by up to 110% on 4,096 CPU cores with comparable SSE costs.</p>
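The quantities discussed above can be made concrete with a minimal K-means-style pass over a blocked data layout, together with the Sum of Square Error (SSE) used to compare clustering quality. This is a pure-Python, 1-D sketch for clarity; the thesis's algorithms are parallel (MPI) and layer simulated-annealing and feel-the-way local steps on top of this skeleton.

```python
# Hedged sketch: one assignment/update pass of K-means over blocked data,
# plus the SSE cost used as the quality metric in the comparisons above.

def sse(points, centers, assign):
    """Sum of squared distances of points to their assigned centers."""
    return sum((p - centers[a]) ** 2 for p, a in zip(points, assign))

def kmeans_pass(blocks, centers):
    """Assign every point (block by block) and recompute the centers."""
    assign, flat = [], []
    for block in blocks:              # one cache-friendly block at a time
        for p in block:
            a = min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
            assign.append(a)
            flat.append(p)
    new_centers = []
    for c in range(len(centers)):
        members = [p for p, a in zip(flat, assign) if a == c]
        new_centers.append(sum(members) / len(members) if members else centers[c])
    return new_centers, assign, flat

blocks = [[0.0, 0.5], [9.5, 10.0]]    # two blocks, two natural clusters
centers, assign, flat = kmeans_pass(blocks, [1.0, 9.0])
cost = sse(flat, centers, assign)
```

The blocked layout matters for performance, not correctness: iterating a contiguous block per thread is what enables the asynchronous local optimization the abstract describes.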
<p>Our evaluations of the sampling-based feel-the-way algorithm establish the effectiveness of the local optimization strategy, the blocked data layout, and the sampling methods for addressing the challenges of applying HPC to ML applications. To explore more parallel strategies and optimization techniques, we focus on a more complex application: graph optimization problems solved using reinforcement learning (RL). RL has proved successful at automatically learning good heuristics for graph optimization problems. However, existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting, which has compromised RL's ability to solve large-scale graph optimization problems owing to the lack of parallelization and scalability. To address these challenges, we develop OpenGraphGym-MG, a high-performance distributed-GPU RL framework for solving graph optimization problems. OpenGraphGym-MG focuses on a class of computationally demanding RL problems in which both the RL environment and the policy model are highly computation intensive. In this work, we distribute large-scale graphs across distributed GPUs and use spatial parallelism and data parallelism to achieve scalable performance; we compare and analyze the performance of spatial and data parallelism and highlight their differences. To support graph neural network (GNN) layers that take data samples partitioned across distributed GPUs as input, we design new parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design new parallel graph environments to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we develop high-performance OpenGraphGym-MG training and inference algorithms in parallel.</p>
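A graph-optimization RL environment of the kind scaled up above has a simple single-process skeleton: the state is a graph plus a partial solution, an action adds a vertex, and the reward tracks solution quality. The sketch below uses minimum vertex cover as the example problem; the class name, reward shaping, and problem choice are illustrative assumptions, not the framework's actual API.

```python
# Hedged sketch: a tiny graph RL environment (vertex cover). The
# dissertation's contribution is distributing environments like this,
# and the GNN policy, across many GPUs; this shows only the interface.

class VertexCoverEnv:
    def __init__(self, edges):
        self.edges = edges
        self.reset()

    def reset(self):
        self.cover = set()
        return self._covered()

    def _covered(self):
        # Number of edges touched by the current partial cover.
        return sum(1 for u, v in self.edges
                   if u in self.cover or v in self.cover)

    def step(self, vertex):
        before = self._covered()
        self.cover.add(vertex)
        after = self._covered()
        done = after == len(self.edges)       # all edges covered
        return after, after - before, done    # state, reward, done

env = VertexCoverEnv(edges=[(0, 1), (1, 2), (2, 3)])
env.reset()
_, r1, d1 = env.step(1)    # vertex 1 covers edges (0,1) and (1,2)
_, r2, d2 = env.step(2)    # vertex 2 covers the remaining edge (2,3)
```

An RL agent interacts with this environment through `reset`/`step`, and the per-step reward (newly covered edges) is the kind of quality signal from which a policy can learn a greedy-like heuristic.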
<p>To summarize, after identifying the major challenges of applying HPC to ML applications, this thesis explores several parallel strategies and performance optimization techniques through two ML applications. Specifically, we propose a local optimization strategy, a blocked data layout, and sampling methods for accelerating the clustering algorithm; and we create a spatial parallelism strategy, a parallel graph environment, agent, and policy model, an optimized replay buffer, and a multi-node selection strategy for solving large optimization problems over graphs. Our evaluations prove the effectiveness of these strategies and demonstrate that our accelerations can significantly outperform state-of-the-art ML libraries and frameworks without loss of quality in the trained models.</p>
Investigation of Backdoor Attacks and Design of Effective Countermeasures in Federated Learning
Agnideven Palanisamy Sundar (11190282), 03 September 2024
<p dir="ltr">Federated Learning (FL), a novel subfield of Artificial Intelligence, decentralizes the learning process by enabling participants to benefit from a comprehensive model trained on a broader dataset without directly sharing private data. This approach integrates multiple local models into a global model, reducing the need for large individual datasets. However, the decentralized nature of FL increases its vulnerability to adversarial attacks. These include backdoor attacks, which subtly alter classification in some categories, and Byzantine attacks, which aim to degrade overall model accuracy. Detecting and defending against such attacks is challenging, as adversaries can participate in the system while masquerading as benign contributors. This thesis provides an extensive analysis of the various security attacks, highlighting the distinct elements of each and the inherent vulnerabilities of FL that facilitate these attacks. The focus is primarily on backdoor attacks, which are stealthier and more difficult to detect than Byzantine attacks. We explore defense strategies that are effective in identifying malicious participants or mitigating the impact of attacks on the global model. The primary aim of this research is to evaluate the effectiveness and limitations of existing server-level defenses and to develop innovative defense mechanisms under diverse threat models. This includes scenarios where the server collaborates with clients to thwart attacks, cases where the server remains passive but benign, and situations where no server is present, requiring clients to independently minimize and isolate attacks while enhancing main-task performance. Throughout, we ensure that the interventions do not compromise the performance of either the global or local models. The research predominantly utilizes 2D and 3D datasets to underscore the practical implications and effectiveness of the proposed methodologies.</p>
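One widely used member of the server-level defense family discussed above is norm clipping: each client update is rescaled to a common norm bound before averaging, limiting how far a single backdoored update can steer the global model. The sketch below is an illustration of that generic mechanism under assumed toy updates; the thesis's defenses, which span collaborative, passive-server, and serverless threat models, are more sophisticated.

```python
import math

# Hedged sketch: clip each client's update to a norm bound, then average.
# An oversized malicious update loses almost all of its steering power.

def clip(update, bound):
    """Rescale `update` so its Euclidean norm is at most `bound`."""
    norm = math.sqrt(sum(u * u for u in update))
    if norm <= bound:
        return update
    scale = bound / norm
    return [u * scale for u in update]

benign = [[0.1, -0.2], [0.0, 0.1], [0.2, 0.0]]   # small honest updates
backdoored = [[50.0, 50.0]]                      # oversized malicious update

clipped = [clip(u, bound=1.0) for u in benign + backdoored]
agg = [sum(col) / len(clipped) for col in zip(*clipped)]   # bounded drift
naive = [sum(col) / 4 for col in zip(*(benign + backdoored))]  # hijacked
```

Note what clipping does and does not provide: it bounds the magnitude of any one contribution (compare `agg` with `naive`), but a stealthy backdoor hidden inside a small-norm update still evades it, which is why stealthier backdoor attacks motivate the additional defenses this thesis develops.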