Towards Data-Driven I/O Load Balancing in Extreme-Scale Storage Systems

Storage systems used by supercomputers and high performance computing (HPC) centers exhibit load imbalance and resource contention. This is mainly due to two factors: the bursty I/O of scientific applications, and a complex, distributed I/O path that lacks centralized arbitration and control. For example, the widely deployed Lustre parallel storage system, which forms the backend storage for many HPC centers, comprises numerous components connected in custom network topologies and serves the varying demands of a large number of users and applications. Consequently, some storage servers become more loaded than others, creating bottlenecks and reducing overall application I/O performance. Existing solutions focus on per-application load balancing and are therefore ineffective, as they lack a global view of the system.

In this thesis, we adopt a data-driven, quantitative approach to load balancing the I/O servers at extreme scale. To this end, we design a global mapper on the Lustre Metadata Server (MDS), which gathers runtime statistics from key storage components on the I/O path and applies Markov chain modeling and a dynamic maximum flow algorithm to decide where data should be placed in a load-balanced fashion. Evaluation using a realistic system simulator shows that our approach yields better load balancing, which in turn helps deliver higher end-to-end performance.

Master of Science

Critical jobs such as meteorological prediction are run at exascale supercomputing facilities like the Oak Ridge Leadership Computing Facility (OLCF). These centers must provide an optimally running infrastructure to support such critical workloads. The amount of data being produced and processed is increasing rapidly, requiring High Performance Computing (HPC) centers to design systems that can support this growing volume.

Lustre is a parallel filesystem deployed in many HPC centers. It is a hierarchical filesystem comprising a distributed layer of Object Storage Servers (OSSs) that perform I/O on the Object Storage Targets (OSTs). Lustre employs a traditional capacity-based round-robin approach for placing files on the OSTs, which results in only a small fraction of the OSTs being used. The round-robin approach also concentrates load on the same set of OSSs, which degrades performance. Thus, a better load-balanced file placement algorithm, one that evenly distributes load across all OSSs and OSTs, is imperative to meet future demands for data storage and processing.
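As a toy illustration of this skew (the OST/OSS names and file sizes below are invented, and this is ordinary Python rather than Lustre code), round-robin placement balances file counts but not bytes when request sizes vary:

```python
# Hypothetical topology and workload: round-robin evens out file *counts*,
# but a skewed size distribution still piles bytes onto one OST and OSS.
from collections import defaultdict
from itertools import cycle

ost_to_oss = {"ost0": "oss0", "ost1": "oss0", "ost2": "oss1", "ost3": "oss1"}
file_sizes_mb = [1, 1, 1, 500, 1, 1, 1, 500]   # bursty, size-skewed trace

ost_load = defaultdict(int)
rr = cycle(ost_to_oss)                          # cycles over OST names
for size in file_sizes_mb:
    ost_load[next(rr)] += size

oss_load = defaultdict(int)
for ost, mb in ost_load.items():
    oss_load[ost_to_oss[ost]] += mb

print(dict(ost_load))   # ost3 absorbs both 500 MB files
print(dict(oss_load))   # so oss1 becomes the hot server
```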

We approach the problem of load imbalance by splitting the system into two views: the filesystem and the applications. We first collect current filesystem usage statistics by means of a distributed monitoring tool. We then predict each application's I/O request pattern using a Markov chain model. Finally, we combine these two components into a load balancing algorithm that evens out the load on both the OSSs and the OSTs.
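A minimal sketch of these two components follows, under invented assumptions: the request classes, demands, and per-OST headroom figures are made up, and networkx's max-flow solver merely stands in for the dynamic maximum flow algorithm (whose real inputs come from the monitoring tool):

```python
# Minimal sketch (invented data, not the thesis code):
# (1) a first-order Markov chain fitted to an application's observed request
#     classes predicts its next request;
# (2) predicted demands and per-OST headroom feed a max-flow network whose
#     solution spreads the load across OSTs.
from collections import Counter, defaultdict
import networkx as nx  # stand-in solver for the dynamic max-flow step

def fit_markov(trace):
    """Estimate P(next class | current class) from an observed trace."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(trace, trace[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(row.values()) for nxt, c in row.items()}
            for cur, row in counts.items()}

def predict_next(model, current):
    """Most likely next request class, or None if the state is unseen."""
    row = model.get(current, {})
    return max(row, key=row.get) if row else None

def place(demands_mb, headroom_mb):
    """Assign predicted per-app demand to OSTs via max flow:
    src -> app (cap = demand), app -> ost, ost -> sink (cap = headroom)."""
    G = nx.DiGraph()
    for app, d in demands_mb.items():
        G.add_edge("src", app, capacity=d)
        for ost in headroom_mb:
            G.add_edge(app, ost, capacity=d)
    for ost, h in headroom_mb.items():
        G.add_edge(ost, "sink", capacity=h)
    _, flow = nx.maximum_flow(G, "src", "sink")
    return {(a, o): f for a in demands_mb for o, f in flow[a].items() if f > 0}

# Step 1: predict the next request class from an application's history.
model = fit_markov(["small", "small", "large", "small", "small", "large"])
print(predict_next(model, "large"))          # -> 'small'

# Step 2: place predicted demands against current OST headroom.
print(place({"appA": 80, "appB": 40}, {"ost0": 50, "ost1": 50, "ost2": 50}))
```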

We evaluate our algorithm on a custom-built simulator that models the behavior of the actual filesystem.
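A toy stand-in for such a harness (not the thesis simulator; the trace, the policies compared, and the imbalance metric are all invented here) replays a request trace against two placement policies and reports the max-to-mean OST load:

```python
# Toy harness (not the thesis simulator): replay a synthetic trace against
# two placement policies and report an imbalance metric (max/mean OST load).
from itertools import cycle
from statistics import mean

OSTS = ["ost0", "ost1", "ost2", "ost3"]

def simulate(trace_mb, choose_ost):
    load = {o: 0 for o in OSTS}
    for size in trace_mb:
        load[choose_ost(load)] += size
    return max(load.values()) / mean(load.values())

trace = [1, 1, 1, 500, 1, 1, 1, 500]            # bursty, size-skewed trace
rr = cycle(OSTS)
print("round-robin :", simulate(trace, lambda load: next(rr)))
print("least-loaded:", simulate(trace, lambda load: min(load, key=load.get)))
```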

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/86272
Date: 15 June 2017
Creators: Banavathi Srinivasa, Sangeetha
Contributors: Computer Science, Butt, Ali R., Raghvendra, Sharath, Polys, Nicholas F.
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertations
Detected Language: English
Type: Thesis
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
