291

A Scalable Leader Based Consensus Algorithm

Gulati, Ishaan 10 August 2023 (has links)
Present-day commonly used systems like Cassandra, Spanner, and CockroachDB require high availability and strict consistency guarantees. High availability is attained through redundancy, and in computing, redundancy is attained through state machine replication. Protocols like Raft, Multi-Paxos, ZAB, or other variants of Paxos are commonly used to achieve state machine replication. These protocols choose one process from the many running on various machines in a distributed setting as the leader. The leader is responsible for client interactions, replicating client operations on all the followers, and maintaining a consistent view across the system. In these protocols, the leader is more loaded than the other nodes, or followers, making it a significant scalability bottleneck for multi-datacenter and edge deployments. Hardware and network heterogeneity further degrade the overall commit throughput and latency under majority agreement. This work aims to reduce the load on the leader by using reduced, dynamic, latency-aware flexible quorums while maintaining strict correctness guarantees like linearizability. In this thesis, we implement dynamic reduced-size commit quorums, a protocol we call FDRaft, to reduce the leader's load and improve throughput and latency. The commit quorums are computed from an exponentially weighted moving average of the followers' time to respond to the leader, accounting for the heterogeneity in hardware and network. The reduced commit quorum requires a bigger election quorum, but elections rarely happen, and a single leader can serve for significant durations. We evaluate this protocol using a key-value store built on FDRaft and Raft and compare multi-datacenter and edge deployments. The evaluation shows 2x improved throughput and around 55% improved latency over Raft during normal operations, and a 45% improvement over Raft with vanilla flexible quorums under failure conditions. / M.S. / In our day-to-day life, we rely heavily on different internet applications, be it Instagram for sharing pictures, Amazon for our shopping, DoorDash for our food orders, Spotify for listening to music, or Uber for traveling. These applications share many commonalities: the scale at which they operate, strict latency guarantees, high availability to serve users, and databases that maintain shared state. The data is replicated across multiple servers to provide fault tolerance against failures. This replication is achieved through state machine replication, in which multiple servers start with the same initial state and perform operations in the same order to reach the same final state. State machine replication, in turn, is achieved through a consensus algorithm. Consensus means agreement, and consensus algorithms are used to reach agreement on a particular value. Raft, Multi-Paxos, and other variants of Paxos are the commonly used consensus algorithms for reaching agreement on a particular value in a distributed setting. In these algorithms, one of the servers is chosen as the leader, responsible for client interactions and for replicating and maintaining the same state across all the servers, even when faced with server and network failures. Every time the leader receives a client operation, it starts the consensus process by forwarding the request to all the servers and commits the request after receiving agreement from a majority.
As the leader does most of the work, it is more loaded than the other servers and becomes a significant scalability bottleneck. This bottleneck is more evident in multi-datacenter and edge deployments, and hardware and network heterogeneity severely affects the overall commit throughput and latency under majority agreement. In this thesis, we reduce the load on the leader by building a smaller dynamic commit quorum with latency-aware server selection, based on an exponentially weighted moving average of the followers' response time to the leader's requests, without compromising safety and liveness properties. Our design also improves throughput and commit latency. We evaluate this protocol against multiple workloads and failure conditions and find that it outperforms Raft by 2x in throughput and around 55% in latency during normal operations. It also shows a 45% improvement in throughput and latency over Raft with vanilla flexible quorums under failure conditions.
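The quorum-selection idea described in this abstract can be illustrated with a short sketch. This is not FDRaft's actual code; the class names, the smoothing factor, and the selection rule are illustrative assumptions based only on the abstract's description of an EWMA over follower response times and the flexible-quorum intersection requirement.

```python
import random

# Hedged sketch of latency-aware commit-quorum selection (assumed design,
# not FDRaft's implementation).

ALPHA = 0.2  # EWMA smoothing factor (assumed value)

class Follower:
    def __init__(self, node_id):
        self.node_id = node_id
        self.ewma_rtt = None  # moving average of this follower's response time (ms)

    def record_response(self, rtt_ms):
        # New average = alpha * latest sample + (1 - alpha) * previous average.
        self.ewma_rtt = rtt_ms if self.ewma_rtt is None else (
            ALPHA * rtt_ms + (1 - ALPHA) * self.ewma_rtt)

def commit_quorum(followers, size):
    """Choose the `size` fastest followers (by EWMA) as the commit quorum.

    Flexible-quorum safety needs every commit quorum to intersect every
    election quorum, so shrinking the commit quorum enlarges the election
    quorum: |commit quorum| + |election quorum| must exceed the cluster size.
    """
    ranked = sorted(followers,
                    key=lambda f: f.ewma_rtt if f.ewma_rtt is not None
                    else float("inf"))
    return ranked[:size]

# Usage: five followers with jittery response times; pick the 2 fastest.
nodes = [Follower(i) for i in range(5)]
for f in nodes:
    for _ in range(20):
        f.record_response(random.uniform(1, 10) * (f.node_id + 1))
print([f.node_id for f in commit_quorum(nodes, 2)])
```

Recomputing the quorum as averages drift is what makes the quorum "dynamic": a follower that slows down is eventually displaced by a faster one without any reconfiguration of the cluster membership.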
292

Runtime Systems for Load Balancing and Fault Tolerance on Distributed Systems

Arafat, Md Humayun January 2014 (has links)
No description available.
293

Position, Attitude, and Fault-Tolerant Control of Tilting-Rotor Quadcopter

Kumar, Rumit 16 June 2017 (has links)
No description available.
294

Maximizing Parallelization Opportunities by Automatically Inferring Optimal Container Memory for Asymmetrical Map Tasks

Shrimal, Shubhendra 18 July 2016 (has links)
No description available.
295

Handling Soft and Hard Errors for Scientific Applications

Liu, Jiaqi 18 May 2017 (has links)
No description available.
296

Distributed resource allocation with scalable crash containment

Pike, Scott Mason 29 September 2004 (has links)
No description available.
297

On reliable and scalable management of wireless sensor networks

Bapat, Sandip Shriram 30 November 2006 (has links)
No description available.
298

Runtime Support for Improving Reliability in System Software

Gao, Qi 23 August 2010 (has links)
No description available.
299

Stateless Parallel Processing Architecture for Extreme Scale HPC and Auction-based Clouds

Taifi, Moussa January 2013 (has links)
Extreme scale HPC (high performance computing) applications require massively many nodes. At these scales, transient hardware and software failures, as well as network congestion and disconnections, increase linearly with the number of components. This volatility has contributed to a dramatic decrease in applications' MTBF (mean time between failures). Traditional point-to-point transmission API semantics are ill-fitted to support applications at extreme scale. In this thesis, we investigate an application-dependent network design that focuses on the sustainability of extreme scale high performance computing applications, using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space using Cassandra and ZooKeeper for tunable spatial and temporal redundancies without negative impact on application performance. We detail the various failure scenarios that can be handled seamlessly by our system and describe the advantages of Stateless Parallel Processing for HPC applications. We report our results on performance, reliability, and overall application sustainability. In preliminary tests, for the most common HPC application categories, the prototype demonstrated sustained performance while providing a reliable computing architecture that can withstand multiple failure types without manual checkpoint-restart (CPR). The feasibility of efficient non-stop HPC enables auction-based clouds for more cost-efficient HPC applications. For all HPC application categories, we first report a novel method for determining bid-aware checkpointing intervals using fluctuating cloud providers' pricing histories. Subsequently, we explore the effects of bidding in the case of virtual HPC clusters composed of EC2 Spot Instances. We expose the counter-intuitive effects of uniform versus non-uniform bidding, especially in terms of failure rate and failure model, and we propose a method for predicting the runtime of parallel applications under various bidding strategies. We then show that CPR-free HPC applications require a new optimization strategy. As extreme scale HPC and auction-based cloud computing offer the ultimate computational scale and resource efficiency, they challenge the very foundations of computer science research and development. This thesis answers some critical questions about these challenges, and we hope to pave the way for future improvements of the HPC field under increasingly harsh and volatile conditions. / Computer and Information Science
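One way to make the "bid-aware checkpointing interval" idea concrete is to estimate an eviction rate from a spot-price history and feed it into a standard checkpoint-interval formula. The sketch below is an assumption about how such a method could look, not the thesis's actual method: it counts how often the spot price crossed above a fixed bid and applies Young's first-order approximation.

```python
import math

# Hedged sketch: bid-aware checkpoint interval from spot-price history.
# The combination of price-crossing counts with Young's formula is an
# illustrative assumption, not the method reported in the thesis.

def estimated_mtbf_hours(price_history, bid):
    """Estimate mean time between out-bid evictions from hourly spot prices.

    An eviction is assumed whenever the price rises from at-or-below the
    bid to above it between two consecutive hours.
    """
    evictions = sum(
        1 for prev, cur in zip(price_history, price_history[1:])
        if prev <= bid < cur
    )
    return len(price_history) / evictions if evictions else float("inf")

def checkpoint_interval_hours(checkpoint_cost_hours, price_history, bid):
    # Young's first-order approximation: T_opt = sqrt(2 * C * MTBF).
    mtbf = estimated_mtbf_hours(price_history, bid)
    if math.isinf(mtbf):
        return float("inf")  # never out-bid in this history; no forced failures
    return math.sqrt(2 * checkpoint_cost_hours * mtbf)

# Example: 5-minute checkpoints, four weeks of hourly prices, bid of $0.12/hr.
prices = [0.10, 0.11, 0.13, 0.09, 0.10, 0.14] * 28
print(checkpoint_interval_hours(5 / 60, prices, bid=0.12))
```

A higher bid lowers the estimated eviction rate, which lengthens the interval and reduces checkpoint overhead, at the cost of paying more per hour; that trade-off is the essence of bidding strategy for spot-instance HPC clusters.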
300

A High Level Synthesis Approach for Reduced Interconnects and Fault Tolerance

Lemstra, David 01 1900 (has links)
High Level Synthesis (HLS) is a promising approach to managing design complexity at a more abstract level as integrated circuit technology edges deeper into sub-micron design. One useful facet of HLS is the ability to automatically integrate architectural components that can address potential reliability issues, which may be on the increase due to miniaturization. Research into harnessing HLS for fault tolerance (FT) has been progressing since the early 1990s. There currently exists a large body of work on methods to incorporate capabilities such as fault detection, compensation, and recovery into HLS design.

While many avenues of FT have been explored in the HLS environment, very little work has considered the effectiveness and feasibility of these techniques in the context of large HLS systems, which presumably is the raison d'être of HLS. While existing HLS FT approaches are often elegant and involve highly sophisticated techniques to achieve optimal solutions, the scalability costs of the HLS infrastructure are not well reported. The intent of this thesis is to explore the ramifications of applying common HLS techniques to large designs.

Furthermore, a new HLS tool entitled RIFT is presented that is specifically designed to mitigate infrastructure costs that mount as greater parallelism is utilized. RIFT is named for its design philosophy of "Reducing Interconnects for Fault Tolerance". RIFT iteratively builds a logical hardware representation, consisting of both the components instantiated and their interconnections, one operation at a time. It chooses the next operation to be "mapped" to the burgeoning design based on scheduling constraints as well as the extra hardware and interconnect costs required to support a particular selection. Emphasis is placed on minimizing the delay of the datapath in an effort to reduce the performance cost associated with the extra interconnects needed for FT. RIFT has been used to generate efficient solutions for FT designs requiring as many as a thousand operations. / Thesis / Master of Applied Science (MASc)
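The one-operation-at-a-time construction the abstract describes can be sketched as a greedy loop over ready operations, each step binding the operation with the lowest marginal hardware cost. The sketch below is an assumption drawn only from the abstract; the data structures, the single-unit-per-kind reuse rule, and the cost model are illustrative, not RIFT's actual algorithm.

```python
from dataclasses import dataclass

# Hedged sketch of a RIFT-style incremental mapper (assumed cost model;
# not RIFT's implementation).

@dataclass(frozen=True)
class Op:
    name: str
    kind: str          # functional-unit type, e.g. "add", "mul"
    preds: tuple = ()  # operations this one depends on

def map_operations(ops, unit_cost, wire_cost):
    """Greedily bind ready operations to functional units, one at a time.

    Each step picks the ready operation whose binding adds the least new
    hardware: reusing an existing unit of the same kind costs only an
    extra interconnect, while a new kind costs a fresh unit.
    """
    units = {}   # kind -> instantiated unit count
    wires = 0    # interconnects added for unit reuse
    done = []
    remaining = list(ops)
    while remaining:
        # Scheduling constraint: only operations with all predecessors
        # already mapped are candidates.
        ready = [o for o in remaining if all(p in done for p in o.preds)]
        op = min(ready,
                 key=lambda o: wire_cost if units.get(o.kind) else unit_cost)
        if units.get(op.kind):
            wires += 1          # reuse a unit; pay for steering interconnect
        else:
            units[op.kind] = 1  # instantiate the first unit of this kind
        done.append(op)
        remaining.remove(op)
    return units, wires

# Usage: a tiny data-flow graph of three dependent operations.
a = Op("a", "add")
b = Op("b", "mul", (a,))
c = Op("c", "add", (a,))
print(map_operations([a, b, c], unit_cost=10, wire_cost=1))
```

A real mapper would also weigh datapath delay when ranking candidates, per the abstract's emphasis on keeping interconnect-induced delay low; that term is omitted here to keep the sketch small.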
