Stateless Parallel Processing Architecture for Extreme Scale HPC and Auction-based Clouds

Extreme scale HPC (high performance computing) applications require a massive number of nodes. At these scales, transient hardware and software failures, as well as network congestion and disconnections, increase linearly with the number of components. This volatility contributes to a dramatic decrease in applications' MTBF (mean time between failures). The semantics of traditional point-to-point transmission APIs are ill-suited to applications of extreme scale. In this thesis, we investigate an application-dependent network design that focuses on the sustainability of extreme scale high performance computing applications using packet-switching-inspired statistical multiplexing of semantic data tuples and decoupled computations. We report the design and implementation of a distributed tuple space using Cassandra and ZooKeeper for tunable spatial and temporal redundancies without negative impact on application performance. We detail the various failure scenarios that our system handles seamlessly and describe the advantages of Stateless Parallel Processing for HPC applications. We report our results on performance, reliability and overall application sustainability. In preliminary tests, for the most common HPC application categories, the prototype demonstrated sustained performance while providing a reliable computing architecture that can withstand multiple failure types without manual checkpoint-restart (CPR). The feasibility of efficient non-stop HPC enables the use of auction-based clouds for more cost-efficient HPC applications. For all HPC application categories, we first report a novel method for determining bid-aware checkpointing intervals using the fluctuating pricing histories of cloud providers. Subsequently, we explore the effects of bidding in the case of virtual HPC clusters composed of EC2 Spot Instances. We expose the counter-intuitive effects of uniform versus non-uniform bidding, especially in terms of failure rate and failure model, and we propose a method for predicting the runtime of parallel applications under various bidding strategies. We then show that CPR-free HPC applications require a new optimization strategy. As extreme scale HPC and auction-based cloud computing offer the ultimate computational scale and resource efficiency, they challenge the very foundations of computer science research and development. This thesis answers some critical questions about these challenges, and we hope it paves the way for future improvements of the HPC field under increasingly harsh and volatile conditions. / Computer and Information Science

Computer Science

Fault Tolerance

Performance of Systems

Scalability

Identifier	oai:union.ndltd.org:TEMPLE/oai:scholarshare.temple.edu:20.500.12613/3629
Date	January 2013
Creators	Taifi, Moussa
Contributors	Shi, Justin Y.; Wu, Jie, 1961-; Tan, Chiu C.; Khreishah, Abdallah; Szymanski, Boleslaw
Publisher	Temple University. Libraries
Source Sets	Temple University
Language	English
Detected Language	English
Type	Thesis/Dissertation, Text
Format	178 pages
Rights	IN COPYRIGHT- This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available., http://rightsstatements.org/vocab/InC/1.0/
Relation	http://dx.doi.org/10.34944/dspace/3611, Theses and Dissertations
