Return to search

Pingo: A Framework for the Management of Storage of Intermediate Outputs of Computational Workflows

abstract: Scientific workflows allow scientists to easily model and express the entire data processing steps, typically as a directed acyclic graph (DAG). These scientific workflows are made of a collection of tasks that usually take a long time to compute and that produce a considerable amount of intermediate datasets. Because of the nature of scientific exploration, a scientific workflow can be modified and re-run multiple times, or new scientific workflows are created that might make use of past intermediate datasets. Storing intermediate datasets has the potential to save time in computations. Since storage is limited, one main problem that needs a solution is determining which intermediate datasets need to be saved at creation time in order to minimize the computational time of the workflows to be run in the future. This research thesis proposes the design and implementation of Pingo, a system that is capable of managing the computations of scientific workflows as well as the storage, provenance and deletion of intermediate datasets. Pingo uses the history of workflows submitted to the system to predict the most likely datasets to be needed in the future, and subjects the decision of dataset deletion to the optimization of the computational time of future workflows. / Dissertation/Thesis / Masters Thesis Computer Science 2017

Identiferoai:union.ndltd.org:asu.edu/item:44289
Date January 2017
Contributorsde Armas, Jadiel (Author), Bazzi, Rida (Advisor), Huang, Dijiang (Committee member), Syrotiuk, Violet (Committee member), Arizona State University (Publisher)
Source SetsArizona State University
LanguageEnglish
Detected LanguageEnglish
TypeMasters Thesis
Format78 pages
Rightshttp://rightsstatements.org/vocab/InC/1.0/, All Rights Reserved

Page generated in 0.0018 seconds