Global ETD Search

Return to search

Relational Computing Using HPC Resources: Services and Optimizations

Computational epidemiology involves processing, analysing and managing large volumes of data. Such massive datasets cannot be handled efficiently by using traditional standalone database management systems, owing to their limitation in the degree of computational efficiency and bandwidth to scale to large volumes of data. In this thesis, we address management and processing of large volumes of data for modeling, simulation and analysis in epidemiological studies. Traditionally, compute intensive tasks are processed using high performance computing resources and supercomputers whereas data intensive tasks are delegated to standalone databases and some custom programs. DiceX framework is a one-stop solution for distributed database management and processing and its main mission is to leverage and utilize supercomputing resources for data intensive computing, in particular relational data processing.

While standalone databases are always on and a user can submit queries at any time for required results, supercomputing resources must be acquired and are available for a limited time period. These resources are relinquished either upon completion of execution or at the expiration of the allocated time period. This kind of reservation based usage style poses critical challenges, including building and launching a distributed data engine onto the supercomputer, saving the engine and resuming from the saved image, devising efficient optimization upgrades to the data engine and enabling other applications to seamlessly access the engine . These challenges and requirements cause us to align our approach more closely with cloud computing paradigms of Infrastructure as a Service(IaaS) and Platform as a Service(PaaS). In this thesis, we propose cloud computing like workflows, but using supercomputing resources to manage and process relational data intensive tasks. We propose and implement several services including database freeze and migrate and resume, ad-hoc resource addition and table redistribution. These services assist in carrying out the workflows defined.

We also propose an optimization upgrade to the query planning module of postgres-XC, the core relational data processing engine of the DiceX framework. With a knowledge of domain semantics, we have devised a more robust data distribution strategy that would enable to push down most time consuming sql operations forcefully to the postgres-XC data nodes, bypassing its query planner's default shippability criteria without compromising correctness. Forcing query push down reduces the query processing time by a factor of almost 40%-60% for certain complex spatio-temporal queries on our epidemiology datasets.

As part of this work, a generic broker service has also been implemented, which acts as an interface to the DiceX framework by exposing restful apis, which applications can make use of to query and retrieve results irrespective of the programming language or environment. / Master of Science

distributed databases

HPC

supercomputers

computational epidemiology

Identifer	oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/56586
Date	15 September 2015
Creators	Soundarapandian, Manikandan
Contributors	Computer Science, Marathe, Madhav Vishnu, Bisset, Keith R., Chen, Jiangzhuo, Vullikanti, Anil Kumar S., Gupta, Sandeep
Publisher	Virginia Tech
Source Sets	Virginia Tech Theses and Dissertation
Detected Language	English
Type	Thesis
Format	ETD, application/pdf
Rights	In Copyright, http://rightsstatements.org/vocab/InC/1.0/

Page generated in 0.0023 seconds

Relational Computing Using HPC Resources: Services and Optimizations

Description

Links & Downloads

Tags

Additional Fields