Spelling suggestions: "subject:"cientific workflows"" "subject:"cientific workflows2""
1 |
Towards harnessing computational workflow provenance for experiment reportingAlper, Pinar January 2016 (has links)
We’re witnessing the era of Data-Oriented Science, where investigations routinely involve computational data analysis. The research lifecycle has now become more elaborate to support the sharing and re-use of scientific data. To establish the veracity of shared data, scientific communities aim for systematising 1) the process of analysing data, and, 2) the reporting of analyses and results. Scientific workflows are a prominent mechanism for systematising analyses by encoding them as automated processes and documenting process executions with Workflow Provenance. Meanwhile, systematic reporting calls for discipline-specific Experimental Metadata to be provided outlining the context of data analysis such as source/reference datasets and community resources used, analytical methods and their parameter settings. A natural expectation would be that investigations, which adopt a systematic, workflow-based approach to the analysis can be advantageous at the time of reporting. This premise holds weakly. While workflow provenance supports streamlined enactment of analyses, their auditability and verifiability, we conjecture that it has limited contribution to reporting. This dissertation focuses on eliciting the apparent disconnect of Workflow Provenance and Experimental Metadata as the provenance gap. We identify complexity, mixed granularity, and genericity as characteristics of workflow provenance that underlie this gap. In response we develop techniques for provenance abstraction, analysis and annotation. We argue that workflow provenance is accompanied with implicit information, that can be made explicit to inform these techniques. Through empirical evidence we show that workflow steps have common functional characteristics, which we capture in a taxonomy of Workflow Motifs. We show how formally defined Graph Transformations can exploit Motifs to identify causes of complexity in workflows and abstract them to structurally simpler forms. We build on insight from prior research to show how execution and provenance collection behaviour of a workflow system can anticipate the granularity characteristics of provenance. We provide declarative anticipatory rules for the static-analysis of workflows of the Taverna system. We observe that scientific context is often available in embedded form in data and argue that data can be lifted to become metadata by discipline-specific metadata extractors. We outline a framework, that can be plugged with extractors and provide operators that encapsulate generic procedures to annotate workflow provenance. We implement our techniques with technology-independent provenance models and we showcase their benefit using real-world workflows.
|
2 |
Auspice: Automatic Service Planning in Cloud/Grid EnvironmentsChiu, David T. 31 August 2010 (has links)
No description available.
|
3 |
Semantic Web Queries over Scientific DataAndrejev, Andrej January 2016 (has links)
Semantic Web and Linked Open Data provide a potential platform for interoperability of scientific data, offering a flexible model for providing machine-readable and queryable metadata. However, RDF and SPARQL gained limited adoption within the scientific community, mainly due to the lack of support for managing massive numeric data, along with certain other important features – such as extensibility with user-defined functions, query modularity, and integration with existing environments and workflows. We present the design, implementation and evaluation of Scientific SPARQL – a language for querying data and metadata combined, represented using the RDF graph model extended with numeric multidimensional arrays as node values – RDF with Arrays. The techniques used to store RDF with Arrays in a scalable way and process Scientific SPARQL queries and updates are implemented in our prototype software – Scientific SPARQL Database Manager, SSDM, and its integrations with data storage systems and computational frameworks. This includes scalable storage solutions for numeric multidimensional arrays and an efficient implementation of array operations. The arrays can be physically stored in a variety of external storage systems, including files, relational databases, and specialized array data stores, using our Array Storage Extensibility Interface. Whenever possible SSDM accumulates array operations and accesses array contents in a lazy fashion. In scientific applications numeric computations are often used for filtering or post-processing the retrieved data, which can be expressed in a functional way. Scientific SPARQL allows expressing common query sub-tasks with functions defined as parameterized queries. This becomes especially useful along with functional language abstractions such as lexical closures and second-order functions, e.g. array mappers. Existing computational libraries can be interfaced and invoked from Scientific SPARQL queries as foreign functions. Cost estimates and alternative evaluation directions may be specified, aiding the construction of better execution plans. Costly array processing, e.g. filtering and aggregation, is thus preformed on the server, saving the amount of communication. Furthermore, common supported operations are delegated to the array storage back-ends, according to their capabilities. Both expressivity and performance of Scientific SPARQL are evaluated on a real-world example, and further performance tests are run using our mini-benchmark for array queries.
|
4 |
Pingo: A Framework for the Management of Storage of Intermediate Outputs of Computational WorkflowsJanuary 2017 (has links)
abstract: Scientific workflows allow scientists to easily model and express the entire data processing steps, typically as a directed acyclic graph (DAG). These scientific workflows are made of a collection of tasks that usually take a long time to compute and that produce a considerable amount of intermediate datasets. Because of the nature of scientific exploration, a scientific workflow can be modified and re-run multiple times, or new scientific workflows are created that might make use of past intermediate datasets. Storing intermediate datasets has the potential to save time in computations. Since storage is limited, one main problem that needs a solution is determining which intermediate datasets need to be saved at creation time in order to minimize the computational time of the workflows to be run in the future. This research thesis proposes the design and implementation of Pingo, a system that is capable of managing the computations of scientific workflows as well as the storage, provenance and deletion of intermediate datasets. Pingo uses the history of workflows submitted to the system to predict the most likely datasets to be needed in the future, and subjects the decision of dataset deletion to the optimization of the computational time of future workflows. / Dissertation/Thesis / Masters Thesis Computer Science 2017
|
5 |
WorkflowDSL: Scalable Workflow Execution with ProvenanceFernando, Tharidu January 2017 (has links)
Scientific workflow systems enable scientists to perform large-scale data intensive scientific experiments using distributed computing resources. Due to the diversity of domains and complexity of technology, delivering a successful outcome efficiently requires collaboration between domain experts and technical experts. However, existing scientific workflow systems require a large investment of time to familiarise and adapt existing workflows. Thus, many scientific workflows are still being implemented by script based languages (such as Python and R) due to familiarity and extensive third party library support. In this thesis, we implement a framework that uses a domain specific language that enables domain experts to collaborate on fine-tuning workflows. Technical experts are able to use Python for task implementations. Moreover, the framework includes support for parallel execution without any specialized code. It also provides a provenance capturing framework that enables users to analyse past executions and retrieve complete lineage of any data item generated. Experiments which were performed using a real-world scientific workflow from the bioinformatics domain show that users were able to execute workflows efficiently while using our DSL for workflow composition and Python for task implementations. Moreover, we show that captured provenance can be useful for analysing past workflow executions. / Vetenskapliga arbetsflödessystem gör det möjligt för forskare att utföra storskaliga dataintensiva vetenskapliga experiment med hjälp av distribuerade datorresurser. På grund av mångfalden av domäner, och komplexitet i teknik, krävs samarbete mellan domänexperter och tekniska experter för att på ett effektivt sätt leverera en framgångsrik lösning. Befintliga vetenskapliga arbetsflödessystem kräver dock en stor investering i tid för att bekanta och anpassa befintliga arbetsflöden. Som ett resultat av detta implementeras många vetenskapliga arbetsflöden fortfarande av skriptbaserade språk (som Python och R) på grund av förtrogenhet och omfattande support från tredje part. I denna avhandling implementeras ett framework som använder ett domänsspecifikt språk som gör det möjligt för domänexperter att samarbeta med att finjustera arbetsflöden. Tekniska experter kan använda Python för att genomföra uppgifter. Dessutom innehåller ramverket stöd för parallell exekvering utan någon specialkod. Detta ger också ett ursprungsfångande framework som gör det möjligt för användare att analysera tidigare exekveringar och att hämta fullständiga härstamningar för samtliga genererade dataobjekt. Experiment som utfördes med hjälp av ett verkligt vetenskapligt arbetsflöde från bioinformatikdomänen visar att användarna effektivt kunde utföra arbetsflöden medan de använde en DSL för arbetsflödesammansättning och Python för uppdragsimplementationer. Dessutom visar vi hur fångade ursprung kan vara användbara för att analysera tidigare genomförda arbetsflödesexekveringar.
|
6 |
Similarity measures for scientific workflowsStarlinger, Johannes 08 January 2016 (has links)
In Laufe der letzten zehn Jahre haben Scientific Workflows als Werkzeug zur Erstellung von reproduzierbaren, datenverarbeitenden in-silico Experimenten an Aufmerksamkeit gewonnen, in die sowohl lokale Skripte und Anwendungen, als auch Web-Services eingebunden werden können. Über spezialisierte Online-Bibliotheken, sogenannte Repositories, können solche Workflows veröffentlicht und wiederverwendet werden. Mit zunehmender Größe dieser Repositories werden Ähnlichkeitsmaße für Scientific Workflows notwendig, etwa für Duplikaterkennung, Ähnlichkeitssuche oder Clustering von funktional ähnlichen Workflows. Die vorliegende Arbeit untersucht solche Ähnlichkeitsmaße für Scientific Workflows. Als erstes untersuchen wir ähnlichkeitsrelevante Eigenschaften von Scientific Workflows und identifizieren Charakteristika der Wiederverwendung ihrer Komponenten. Als zweites analysieren und reimplementieren wir existierende Lösungen für den Vergleich von Scientific Workflows entlang definierter Teilschritte des Vergleichsprozesses. Wir erstellen einen großen Gold-Standard Corpus von Workflowähnlichkeiten, der über 2400 Bewertungen für 485 Workflowpaare enthält, die von 15 Experten aus 6 Institutionen beigetragen wurden. Zum ersten Mal erlauben diese Vorarbeiten eine umfassende, vergleichende Evaluation verschiedener Ähnlichkeitsmaße für Scientific Workflows, in der wir einige vorige Ergebnisse bestätigen, andere aber revidieren. Als drittes stellen wir ein neue Methode für das Vergleichen von Scientific Workflows vor. Unsere Evaluation zeigt, dass diese neue Methode bessere und konsistentere Ergebnisse liefert und leicht mit anderen Ansätzen kombiniert werden kann, um eine weitere Qualitätssteigerung zu erreichen. Als viertes zweigen wir, wie die Resultate aus den vorangegangenen Schritten genutzt werden können, um aus Standardkomponenten eine Suchmaschine für schnelle, qualitativ hochwertige Ähnlichkeitssuche im Repositorymaßstab zu implementieren. / Over the last decade, scientific workflows have gained attention as a valuable tool to create reproducible in-silico experiments. Specialized online repositories have emerged which allow such workflows to be shared and reused by the scientific community. With increasing size of these repositories, methods to compare scientific workflows regarding their functional similarity become a necessity. To allow duplicate detection, similarity search, or clustering, similarity measures for scientific workflows are an essential prerequisite. This thesis investigates similarity measures for scientific workflows. We carry out four consecutive research tasks: First, we closely investigate the relevant properties of scientific workflows regarding their similarity and identify characteristics of re-use of their components. Second, we review and dissect existing approaches to scientific workflow comparison into a defined set of subtasks necessary in the process of workflow comparison, and re-implement previous approaches to each subtask. We create a large gold-standard corpus of expert-ratings on workflow similarity, with more than 2400 ratings provided for 485 pairs of workflows by 15 workflow experts from 6 institutions. For the first time, this allows comprehensive, comparative evaluation of different scientific workflow similarity measures, confirming some previous findings, but rejecting others. Third, we propose and evaluate a novel method for scientific workflow comparison. We show that this novel method provides results of both higher quality and higher consistency than previous approaches, and can easily be stacked and ensembled with other approaches for still better performance and higher speed. Fourth, we show how our findings can be leveraged to implement a search engine using off-the-shelf tools that performs fast, high quality similarity search for scientific workflows at repository-scale, a premier area of application for similarity measures for scientific workflows.
|
7 |
Informações de suporte ao escalonamento de workflows científicos para a execução em plataformas de computação em nuvem / Support information to scientific workflow scheduling for execution in cloud computing platformsTeixeira, Eduardo Cotrin 26 April 2016 (has links)
A ciência tem feito uso frequente de recursos computacionais para execução de experimentos e processos científicos, que podem ser modelados como workflows que manipulam grandes volumes de dados e executam ações como seleção, análise e visualização desses dados segundo um procedimento determinado. Workflows científicos têm sido usados por cientistas de várias áreas, como astronomia e bioinformática, e tendem a ser computacionalmente intensivos e fortemente voltados à manipulação de grandes volumes de dados, o que requer o uso de plataformas de execução de alto desempenho como grades ou nuvens de computadores. Para execução dos workflows nesse tipo de plataforma é necessário o mapeamento dos recursos computacionais disponíveis para as atividades do workflow, processo conhecido como escalonamento. Plataformas de computação em nuvem têm se mostrado um alternativa viável para a execução de workflows científicos, mas o escalonamento nesse tipo de plataforma geralmente deve considerar restrições específicas como orçamento limitado ou o tipo de recurso computacional a ser utilizado na execução. Nesse contexto, informações como a duração estimada da execução ou limites de tempo e de custo (chamadas aqui de informações de suporte ao escalonamento) são importantes para garantir que o escalonamento seja eficiente e a execução ocorra de forma a atingir os resultados esperados. Este trabalho identifica as informações de suporte que podem ser adicionadas aos modelos de workflows científicos para amparar o escalonamento e a execução eficiente em plataformas de computação em nuvem. É proposta uma classificação dessas informações, e seu uso nos principais Sistemas Gerenciadores de Workflows Científicos (SGWC) é analisado. Para avaliar o impacto do uso das informações no escalonamento foram realizados experimentos utilizando modelos de workflows científicos com diferentes informações de suporte, escalonados com algoritmos que foram adaptados para considerar as informações inseridas. Nos experimentos realizados, observou-se uma redução no custo financeiro de execução do workflow em nuvem de até 59% e redução no makespan chegando a 8,6% se comparados à execução dos mesmos workflows sendo escalonados sem nenhuma informação de suporte disponível. / Science has been using computing resources to perform scientific processes and experiments that can be modeled as workflows handling large data volumes and performing actions such as selection, analysis and visualization of these data according to a specific procedure. Scientific workflows have been used by scientists from many areas, such as astronomy and bioinformatics, and tend to be computationally intensive and heavily focused on handling large data volumes, which requires using high-performance computing platforms such as grids or clouds. For workflow execution in these platforms it is necessary to assign the workflow activities to the available computational resources, a process known as scheduling. Cloud computing platforms have proved to be a viable alternative for scientific workflows execution, but scheduling in cloud must take into account specific constraints such as limited budget or the type of computing resources to be used in execution. In this context, information such as the estimated duration of execution, or time and cost limits (here this information is generally referred to as scheduling support information) become important for efficient scheduling and execution, aiming to achieve the expected results. This work identifies support information that can be added to scientific workflow models to support efficient scheduling and execution in cloud computing platforms. We propose and analyze a classification of such information and its use in Scientific Workflows Management Systems (SWMS). To assess the impact of support information on scheduling, experiments were conducted with scientific workflow models using different support information, scheduled with algorithms that were adapted to consider the added information. The experiments have shown a reduction of up to 59% on the financial cost of workflow execution in the cloud, and a reduction reaching 8,6% on the makespan if compared to workflow execution scheduled without any available supporting information.
|
8 |
Uma arquitetura de baixo acoplamento para execução de padrões de controle de fluxo em grades / A loosely coupled architecture to run workflow control-flow patterns in gridNardi, Alexandre Ricardo 27 April 2009 (has links)
O uso de padrões de workflow para controle de fluxo em aplicações de e-Science resulta em maior produtividade por parte do cientista, permitindo que se concentre em sua área de especialização. Todavia, o uso de padrões de workflow para paralelização em grades permanece uma questão em aberto. Este texto apresenta uma arquitetura de baixo acoplamento e extensível, para permitir a execução de padrões com ou sem a presença de grade, de modo transparente ao cientista. Descreve também o Padrão Junção Combinada, que atende a diversos cenários de paralelização comumente encontrados em aplicações de e-Science. Com isso, espera-se auxiliar o trabalho do cientista, oferecendo maior flexibilidade na utilização de grades e na representação de cenários de paralelização. / The use of workflow control-flow patterns in e-Science applications results in productivity improvement, allowing the scientist to concentrate in his/her own research area. However, the use of workflow control-flow patterns for execution in grids remains an opened question. This work presents a loosely coupled and extensible architecture, allowing use of patterns with or without grids, transparently to the scientist. It also describes the Combined Join Pattern, compliant to parallelization scenarios, commonly found in e-Science applications. As a result, it is expected to help the scientist tasks, giving him or her greater flexibility in the grid usage and in representing parallelization scenarios.
|
9 |
Cost-efficient resource management for scientific workflows on the cloudPietri, Ilia January 2016 (has links)
Scientific workflows are used in many scientific fields to abstract complex computations (tasks) and data or flow dependencies between them. High performance computing (HPC) systems have been widely used for the execution of scientific workflows. Cloud computing has gained popularity by offering users on-demand provisioning of resources and providing the ability to choose from a wide range of possible configurations. To do so, resources are made available in the form of virtual machines (VMs), described as a set of resource characteristics, e.g. amount of CPU and memory. The notion of VMs enables the use of different resource combinations which facilitates the deployment of the applications and the management of the resources. A problem that arises is determining the configuration, such as the number and type of resources, that leads to efficient resource provisioning. For example, allocating a large amount of resources may reduce application execution time however at the expense of increased costs. This thesis investigates the challenges that arise on resource provisioning and task scheduling of scientific workflows and explores ways to address them, developing approaches to improve energy efficiency for scientific workflows and meet the user's objectives, e.g. makespan and monetary cost. The motivation stems from the wide range of options that enable to select cost-efficient configurations and improve resource utilisation. The contributions of this thesis are the following. (i) A survey of the issues arising in resource management in cloud computing; The survey focuses on VM management, cost efficiency and the deployment of scientific workflows. (ii) A performance model to estimate the workflow execution time for a different number of resources based on the workflow structure; The model can be used to estimate the respective user and energy costs in order to determine configurations that lead to efficient resource provisioning and achieve a balance between various conflicting goals. (iii) Two energy-aware scheduling algorithms that maximise the number of completed workflows from an ensemble under energy and budget or deadline constraints; The algorithms address the problem of energy-aware resource provisioning and scheduling for scientific workflow ensembles. (iv) An energy-aware algorithm that selects the frequency to be used for each workflow task in order to achieve energy savings without exceeding the workflow deadline; The algorithm takes into account the different requirements and constraints that arise depending on the workflow and system characteristics. (v) Two cost-based frequency selection algorithms that choose the CPU frequency for each provisioned resource in order to achieve cost-efficient resource configurations for the user and complete the workflow within the deadline; Decision making is based on both the workflow characteristics and the pricing model of the provider.
|
10 |
Informações de suporte ao escalonamento de workflows científicos para a execução em plataformas de computação em nuvem / Support information to scientific workflow scheduling for execution in cloud computing platformsEduardo Cotrin Teixeira 26 April 2016 (has links)
A ciência tem feito uso frequente de recursos computacionais para execução de experimentos e processos científicos, que podem ser modelados como workflows que manipulam grandes volumes de dados e executam ações como seleção, análise e visualização desses dados segundo um procedimento determinado. Workflows científicos têm sido usados por cientistas de várias áreas, como astronomia e bioinformática, e tendem a ser computacionalmente intensivos e fortemente voltados à manipulação de grandes volumes de dados, o que requer o uso de plataformas de execução de alto desempenho como grades ou nuvens de computadores. Para execução dos workflows nesse tipo de plataforma é necessário o mapeamento dos recursos computacionais disponíveis para as atividades do workflow, processo conhecido como escalonamento. Plataformas de computação em nuvem têm se mostrado um alternativa viável para a execução de workflows científicos, mas o escalonamento nesse tipo de plataforma geralmente deve considerar restrições específicas como orçamento limitado ou o tipo de recurso computacional a ser utilizado na execução. Nesse contexto, informações como a duração estimada da execução ou limites de tempo e de custo (chamadas aqui de informações de suporte ao escalonamento) são importantes para garantir que o escalonamento seja eficiente e a execução ocorra de forma a atingir os resultados esperados. Este trabalho identifica as informações de suporte que podem ser adicionadas aos modelos de workflows científicos para amparar o escalonamento e a execução eficiente em plataformas de computação em nuvem. É proposta uma classificação dessas informações, e seu uso nos principais Sistemas Gerenciadores de Workflows Científicos (SGWC) é analisado. Para avaliar o impacto do uso das informações no escalonamento foram realizados experimentos utilizando modelos de workflows científicos com diferentes informações de suporte, escalonados com algoritmos que foram adaptados para considerar as informações inseridas. Nos experimentos realizados, observou-se uma redução no custo financeiro de execução do workflow em nuvem de até 59% e redução no makespan chegando a 8,6% se comparados à execução dos mesmos workflows sendo escalonados sem nenhuma informação de suporte disponível. / Science has been using computing resources to perform scientific processes and experiments that can be modeled as workflows handling large data volumes and performing actions such as selection, analysis and visualization of these data according to a specific procedure. Scientific workflows have been used by scientists from many areas, such as astronomy and bioinformatics, and tend to be computationally intensive and heavily focused on handling large data volumes, which requires using high-performance computing platforms such as grids or clouds. For workflow execution in these platforms it is necessary to assign the workflow activities to the available computational resources, a process known as scheduling. Cloud computing platforms have proved to be a viable alternative for scientific workflows execution, but scheduling in cloud must take into account specific constraints such as limited budget or the type of computing resources to be used in execution. In this context, information such as the estimated duration of execution, or time and cost limits (here this information is generally referred to as scheduling support information) become important for efficient scheduling and execution, aiming to achieve the expected results. This work identifies support information that can be added to scientific workflow models to support efficient scheduling and execution in cloud computing platforms. We propose and analyze a classification of such information and its use in Scientific Workflows Management Systems (SWMS). To assess the impact of support information on scheduling, experiments were conducted with scientific workflow models using different support information, scheduled with algorithms that were adapted to consider the added information. The experiments have shown a reduction of up to 59% on the financial cost of workflow execution in the cloud, and a reduction reaching 8,6% on the makespan if compared to workflow execution scheduled without any available supporting information.
|
Page generated in 0.0724 seconds