1

Facilitating reproducible computing via scientific workflows – an integrated system approach

Cao, Yuan 04 May 2017 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Reproducible computing and research are of great importance for scientific investigation in any discipline. This thesis presents a general approach to provenance in the context of workflows for widely used scripting languages. Our solution is based on system integration and is demonstrated by integrating MATLAB with VisTrails, an open-source scientific workflow system. The integrated VisTrails-MATLAB system supports reproducible computing with true prospective and retrospective provenance at whatever granularity scientists choose for their scripts, while remaining very easy to use.
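The abstract distinguishes prospective provenance (the declared workflow) from retrospective provenance (what actually ran). As a rough illustration only, not the VisTrails-MATLAB mechanism itself, the sketch below records retrospective provenance for script-level functions with a decorator; all names are hypothetical.

```python
import functools, hashlib, json, time

provenance_log = []  # retrospective provenance: what actually ran

def provenance(step_name):
    """Record inputs, outputs, and timing of a workflow step (illustrative only)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            provenance_log.append({
                "step": step_name,
                "args_digest": hashlib.sha256(repr((args, kwargs)).encode()).hexdigest(),
                "result_digest": hashlib.sha256(repr(result).encode()).hexdigest(),
                "duration_s": round(time.time() - start, 6),
            })
            return result
        return wrapper
    return decorate

@provenance("normalize")
def normalize(values):
    total = sum(values)
    return [v / total for v in values]

if __name__ == "__main__":
    normalize([1.0, 3.0, 4.0])
    print(json.dumps(provenance_log, indent=2))
```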
2

Computational analysis of CpG site DNA methylation

Ghorbani, Mohammadmersad January 2013 (has links)
Epigenetics is the study of heritable changes that are passed to the next generation without any change to the DNA sequence, and DNA methylation is one category of such epigenetic change. DNA methylation is the attachment of a methyl group (CH3) to DNA; most of the time it occurs at sequences in which a cytosine is followed by a guanine, known as CpG sites, by addition of the methyl group to the cytosine residue. As science and technology progress, new data become available about individuals' DNA methylation profiles under different conditions, and new features are discovered that can play a role in DNA methylation. The availability of new data on DNA methylation and other features of DNA poses a challenge to bioinformatics and offers the opportunity to discover new knowledge from existing data. In this research, multiple data series were used to assign DNA methylation classes to CpG sites: a) never-methylated CpG sites, b) always-methylated CpG sites, c) CpG sites methylated in cancer/disease samples and non-methylated in normal samples, and d) CpG sites methylated in normal samples and non-methylated in cancer/disease samples. After identification of these sites and their classes, an analysis was carried out to find the features that best classify them. A matrix of features was generated using four applications in the EMBOSS software suite. The feature matrix was also generated using the gUse/WS-PGRADE portal workflow system: each of the four applications was grid-enabled and ported to the BOINC platform, the gUse portal was connected to the BOINC project via the 3G-Bridge, and each node in the workflow created a portion of the matrix, with the portions then combined to create the final matrix. This final feature matrix was used in a hill-climbing workflow, whose hill-climbing node was a Java program ported to the BOINC platform. The hill-climbing search workflow searched for a subset of features that better classifies the CpG sites, using five different measurements and three different classification methods: support vector machine, naïve Bayes and the J48 decision tree. Using this approach, the hill-climbing search found models that contain fewer than half the number of features yet give better classification results. It has also been demonstrated that the gUse/WS-PGRADE workflow system provides a modular way of generating features, so a new feature-generator application can be added without changing other parts, and that grid-enabled applications can speed up both feature generation and feature-subset selection. The distributed, workflow-based feature-generation approach used in this research is not restricted to this study and can be applied in other studies that involve feature generation; it only requires multiple binaries to generate portions of the features. The grid-enabled hill-climbing search application can also be used in other contexts, as it only requires the same feature-matrix format.
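The abstract above describes a hill-climbing search over feature subsets evaluated with several classifiers. The sketch below is an illustration under assumptions only — it uses scikit-learn's Gaussian naïve Bayes and cross-validated accuracy as a single scoring measure, not the thesis's grid-enabled Java/BOINC implementation:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def score(X, y, mask):
    """Cross-validated accuracy of the classifier on the selected feature subset."""
    if not mask.any():
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=5).mean()

def hill_climb(X, y, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape[1]) < 0.5          # random initial feature subset
    best = score(X, y, mask)
    for _ in range(max_iter):
        improved = False
        for j in rng.permutation(X.shape[1]):    # try flipping each feature in/out
            mask[j] = ~mask[j]
            s = score(X, y, mask)
            if s > best:
                best, improved = s, True
            else:
                mask[j] = ~mask[j]               # revert the flip
        if not improved:
            break                                # local optimum reached
    return mask, best
```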
3

Scientific Application: Reengineering to Add Workflow Concepts

THIAGO MANHENTE DE CARVALHO MARQUES 17 January 2017 (has links)
[en] The use of workflow techniques in scientific computing is widely adopted for running experiments and building in silico models. By analysing some challenges faced by a scientific application in the geosciences domain, we noticed that workflows could be used to represent the geological models created with the application, so as to ease the development of features that meet those challenges. Most works and tools in the scientific workflow domain, however, are designed for use in distributed computing contexts such as web services and grid computing, which makes them unsuitable for integration or use within simpler scientific applications. In this dissertation, we discuss how to make the composition and representation of workflows viable within an existing scientific application. We describe a conceptual architecture of a workflow engine designed to be used within a stand-alone application, as well as an implementation model of this architecture in a C++ application using Petri nets to model a workflow and C++ functions to represent tasks. As a proof of concept, we implemented this workflow model in an existing application and studied its impact on the application.
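To make the Petri-net-based engine idea concrete, here is a minimal hedged sketch (in Python rather than the dissertation's C++, and with all names hypothetical): places hold tokens, transitions wrap ordinary functions, and a transition fires when all of its input places are marked.

```python
class Transition:
    """A workflow task: fires when every input place holds a token."""
    def __init__(self, name, inputs, outputs, action):
        self.name, self.inputs, self.outputs, self.action = name, inputs, outputs, action

class PetriNetEngine:
    def __init__(self):
        self.marking = {}      # place name -> token count
        self.transitions = []

    def add_tokens(self, place, n=1):
        self.marking[place] = self.marking.get(place, 0) + n

    def add_transition(self, t):
        self.transitions.append(t)

    def enabled(self, t):
        return all(self.marking.get(p, 0) > 0 for p in t.inputs)

    def run(self):
        fired = True
        while fired:
            fired = False
            for t in self.transitions:
                if self.enabled(t):
                    for p in t.inputs:
                        self.marking[p] -= 1    # consume input tokens
                    t.action()                  # execute the task
                    for p in t.outputs:
                        self.add_tokens(p)      # produce output tokens
                    fired = True

engine = PetriNetEngine()
engine.add_tokens("raw_data")
engine.add_transition(Transition("load", ["raw_data"], ["loaded"], lambda: print("loading")))
engine.add_transition(Transition("build_model", ["loaded"], ["model"], lambda: print("building model")))
engine.run()
```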
4

Blockchain Use for Data Provenance in Scientific Workflow

Sigurjonsson, Sindri Már Kaldal January 2018 (has links)
In scientific workflows, data provenance plays a major role. Through data provenance, the execution of the workflow is documented and information about the data items involved is stored. This can be used to reproduce scientific experiments or to prove how the results of the workflow came to be. It is therefore vital that the provenance data stored in the provenance database is always synchronized with its corresponding workflow, to verify that the provenance database has not been tampered with. Blockchain technology has been gaining a lot of attention since Satoshi Nakamoto released the Bitcoin white paper in 2008. A blockchain consists of an append-only ledger that is stored and replicated across a peer-to-peer network and offers high tamper-resistance through its consensus protocols. In this thesis, the question of whether blockchain technology is a suitable solution for synchronizing a workflow with its provenance data was explored. A system that generates a workflow, based on a definition written in a domain-specific language, was extended to use a blockchain to synchronize the workflow itself and its results. Furthermore, the InterPlanetary File System (IPFS) was used to assist with the versioning of individual executions of the workflow: it made it possible to compare individual workflow executions in detail and to discover how they differ. The solution was analyzed with respect to the 21 CFR Part 11 regulations imposed by the FDA, to see how it could assist with fulfilling their requirements. Analysis of the system shows that the blockchain extension can be used to verify whether the synchronization between a workflow and its results has been tampered with. Experiments revealed that the size of the workflow did not have a significant effect on the execution time of the extension, and the proposed solution has a constant cost in digital currency regardless of the workflow. However, even though the extension shows some promise in assisting with the 21 CFR Part 11 requirements, analysis revealed that it does not fully comply with them, due to the complexity of the regulations.
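As a rough illustration of the synchronization idea (assumed details, not the thesis's actual extension): hash the workflow definition together with its results, anchor the digest in an append-only ledger, and later verify that the stored provenance still matches the anchored digest.

```python
import hashlib, json, time

class AppendOnlyLedger:
    """Stand-in for a blockchain: an append-only, hash-chained list of entries."""
    def __init__(self):
        self.blocks = []

    def append(self, payload):
        prev = self.blocks[-1]["block_hash"] if self.blocks else "0" * 64
        block = {"payload": payload, "prev": prev, "timestamp": time.time()}
        block["block_hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
        self.blocks.append(block)
        return block["block_hash"]

def digest(workflow_definition: str, results: dict) -> str:
    blob = json.dumps({"workflow": workflow_definition, "results": results}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

ledger = AppendOnlyLedger()
wf = "step A -> step B"
results = {"B": 42}
anchored = ledger.append(digest(wf, results))          # anchor at execution time

# later: verify that the provenance record has not been tampered with
assert digest(wf, results) == ledger.blocks[-1]["payload"]
```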
5

Resource-oriented architecture based scientific workflow modelling

Duan, Kewei January 2016 (has links)
This thesis studies the feasibility and methodology of applying state-of-the-art computer technology to scientific workflow modelling within a collaborative environment, where the people involved include scientists and engineers from other disciplines who are not computing specialists. The objective of this research is to provide a systematic, web-based methodology that lowers the barriers raised by the heterogeneity of multiple institutions, multiple platforms and geographically distributed resources implied by such a collaborative environment.
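A resource-oriented architecture exposes workflow elements as addressable web resources. Purely as a hedged sketch under assumptions (Flask is my choice of framework here, and the resource names are hypothetical; the thesis does not prescribe this stack), a workflow and its tasks might be exposed like this:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
workflows = {}   # in-memory store: workflow id -> list of task descriptions

@app.route("/workflows", methods=["POST"])
def create_workflow():
    wf_id = str(len(workflows) + 1)
    workflows[wf_id] = []
    return jsonify({"id": wf_id, "tasks": []}), 201

@app.route("/workflows/<wf_id>/tasks", methods=["POST"])
def add_task(wf_id):
    task = request.get_json()          # e.g. {"name": "mesh", "command": "run_mesh.sh"}
    workflows[wf_id].append(task)
    return jsonify(task), 201

@app.route("/workflows/<wf_id>", methods=["GET"])
def get_workflow(wf_id):
    return jsonify({"id": wf_id, "tasks": workflows[wf_id]})

if __name__ == "__main__":
    app.run(port=5000)
```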
6

Data-intensive interactive workflows for visual analytics

Khemiri, Wael 12 December 2011 (has links) (PDF)
The increasing amounts of electronic data of all forms, produced by humans (e.g. Web pages, structured content such as Wikipedia or the blogosphere, etc.) and/or automatic tools (loggers, sensors, Web services, scientific programs or analysis tools, etc.), lead to a situation of unprecedented potential for extracting new knowledge, finding new correlations, or simply making sense of the data. Visual analytics aims at combining interactive data visualization with data analysis tasks. Given the explosion in volume and complexity of scientific data, e.g. data associated with biological or physical processes or social networks, visual analytics is called to play an important role in scientific data management. Most visual analytics platforms, however, are memory-based and are therefore limited in the volume of data handled. Moreover, each new algorithm (e.g. for clustering) has to be integrated into the platform by hand. Finally, such platforms lack the capability to define and deploy well-structured processes in which users with different roles interact in a coordinated way, sharing the same data and possibly the same visualizations. This work is at the convergence of three research areas: information visualization, database query processing and optimization, and workflow modeling. It provides two main contributions: (i) we propose a generic architecture for deploying a visual analytics platform on top of a database management system (DBMS); (ii) we show how to propagate data changes to the DBMS and visualizations through the workflow process. Our approach has been implemented in a prototype called EdiFlow and validated through several applications. It clearly demonstrates that visual analytics applications can benefit from the robust storage and automatic process deployment provided by the DBMS while obtaining good performance, and thus provides scalability. Conversely, it could also be integrated into a data-intensive scientific workflow platform in order to increase its visualization features.
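Contribution (ii) above, propagating data changes from the DBMS to the visualizations, can be sketched as a simple notification loop. This is an assumed illustration (polling a SQLite change counter and notifying registered callbacks), not EdiFlow's actual mechanism:

```python
import sqlite3, time

class ChangePropagator:
    """Polls a database change counter and notifies registered visualization callbacks."""
    def __init__(self, db_path, table):
        self.conn = sqlite3.connect(db_path)
        self.table = table
        self.callbacks = []
        self.last_version = self._version()

    def _version(self):
        # data_version increases whenever another connection commits a change
        return self.conn.execute("PRAGMA data_version").fetchone()[0]

    def register(self, callback):
        self.callbacks.append(callback)

    def poll_once(self):
        current = self._version()
        if current != self.last_version:
            self.last_version = current
            rows = self.conn.execute(f"SELECT * FROM {self.table}").fetchall()
            for cb in self.callbacks:
                cb(rows)            # e.g. redraw a scatter plot with the new rows

def redraw(rows):
    print(f"visualization refreshed with {len(rows)} rows")

prop = ChangePropagator("analytics.db", "measurements")
prop.register(redraw)
# In a real deployment this loop would run in the visualization process:
# while True: prop.poll_once(); time.sleep(1)
```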
7

Re-examining Desktop Grids with Web Technologies and Cloud Computing

Abidi, Leila 03 March 2015 (has links)
The context of this thesis lies at the intersection of grid computing, new Web technologies, and Clouds and on-demand services. Since their advent in the 1990s, distributed platforms, and more specifically grid computing systems, have continued to evolve and have prompted numerous research efforts. Desktop Grids were proposed as an alternative to supercomputers by federating thousands of desktop computers, yet the implementation details of such a grid architecture, in terms of resource-sharing mechanisms, remain very hard to pin down. In parallel, the Web has completely changed the way we access information and is now an essential part of our daily lives. Devices, in turn, have evolved from desktops and laptops to tablets, media players, game consoles, smartphones and NetPCs. This evolution requires adapting and rethinking the Desktop Grid applications and middleware developed over recent years. Our contribution is a Desktop Grid middleware called RedisDG. In its operation, RedisDG is similar to most grid middleware: it can execute bag-of-tasks applications in a distributed environment, monitor nodes, and validate and certify results. The innovation of RedisDG lies in the integration of formal modelling and verification into its design phase, which is unconventional but highly relevant in our domain. Our approach is to rethink Desktop Grids from a formal framework that allows them to be developed rigorously and to better accommodate future technological developments. RedisDG is entirely based on the publish/subscribe paradigm, is developed in Python, uses Redis as an advanced key-value cache and store, and can operate on small devices such as smartphones and tablets as well as on more traditional devices (PCs).
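RedisDG is described as entirely publish/subscribe-based on top of Redis. A minimal hedged sketch of that pattern with the redis-py client follows; the channel names and message format are assumptions, not RedisDG's actual protocol:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def submit_bag_of_tasks(tasks):
    """Coordinator side: publish each independent task to the 'tasks' channel."""
    for task in tasks:
        r.publish("tasks", json.dumps(task))

def worker_loop():
    """Worker side (desktop PC, tablet, smartphone): subscribe and execute tasks."""
    pubsub = r.pubsub()
    pubsub.subscribe("tasks")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue                      # skip subscribe confirmations
        task = json.loads(message["data"])
        result = {"task_id": task["id"], "status": "done"}
        r.publish("results", json.dumps(result))   # report back for validation

# submit_bag_of_tasks([{"id": 1, "cmd": "simulate --seed 1"}])
# worker_loop()   # blocks, consuming tasks as they arrive
```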
8

Data-intensive interactive workflows for visual analytics

Khemiri, Wael 12 December 2011 (has links)
The increasing amounts of electronic data of all forms, produced by humans (e.g. Web pages, structured content such as Wikipedia or the blogosphere, etc.) and/or automatic tools (loggers, sensors, Web services, scientific programs or analysis tools, etc.), lead to a situation of unprecedented potential for extracting new knowledge, finding new correlations, or simply making sense of the data. Visual analytics aims at combining interactive data visualization with data analysis tasks. Given the explosion in volume and complexity of scientific data, e.g. data associated with biological or physical processes or social networks, visual analytics is called to play an important role in scientific data management. Most visual analytics platforms, however, are memory-based and are therefore limited in the volume of data handled. Moreover, each new algorithm (e.g. for clustering) has to be integrated into the platform by hand. Finally, such platforms lack the capability to define and deploy well-structured processes in which users with different roles interact in a coordinated way, sharing the same data and possibly the same visualizations. This work is at the convergence of three research areas: information visualization, database query processing and optimization, and workflow modeling. It provides two main contributions: (i) we propose a generic architecture for deploying a visual analytics platform on top of a database management system (DBMS); (ii) we show how to propagate data changes to the DBMS and visualizations through the workflow process. Our approach has been implemented in a prototype called EdiFlow and validated through several applications. It clearly demonstrates that visual analytics applications can benefit from the robust storage and automatic process deployment provided by the DBMS while obtaining good performance, and thus provides scalability. Conversely, it could also be integrated into a data-intensive scientific workflow platform in order to increase its visualization features.
9

Methods for Modeling and Analyzing Concurrent Software

Zeng, Reng 02 July 2013 (has links)
Concurrent software executes multiple threads or processes to achieve high performance. However, concurrency results in a huge number of different system behaviors that are difficult to test and verify. The aim of this dissertation is to develop new methods and tools for modeling and analyzing concurrent software systems at the design and code levels. This dissertation consists of several related results. First, a formal model of Mondex, an electronic purse system, is built using Petri nets from user requirements and is formally verified using model checking. Second, Petri net models are automatically mined from the event traces generated by scientific workflows. Third, partial order models are automatically extracted from instrumented concurrent program executions, and potential atomicity violation bugs are automatically verified based on the partial order models using model checking. Our formal specification and verification of Mondex have contributed to the worldwide effort to develop a verified software repository. Our method for mining Petri net models automatically from provenance offers a new approach to building scientific workflows. Our dynamic prediction tool, named McPatom, can predict several known bugs in real-world systems, including one that evades several other existing tools. McPatom is efficient and scalable, as it takes advantage of the nature of atomicity violations and considers only a pair of threads and accesses to a single shared variable at a time. However, predictive tools need to consider the trade-offs between precision and coverage. Based on McPatom, this dissertation presents two methods for improving the coverage and precision of atomicity violation predictions: 1) a post-prediction analysis method to increase coverage while ensuring precision; 2) a follow-up replaying method to further increase coverage. Both methods are implemented in a completely automatic tool.
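McPatom reportedly checks one pair of threads and one shared variable at a time. As a hedged, much-simplified illustration of that idea (not McPatom's model-checking approach), the sketch below scans an access trace for unserializable interleaving patterns on a single variable:

```python
# Each trace event: (thread_id, operation) on one shared variable, in program order.
# Unserializable patterns: (first local access, remote access, second local access).
UNSERIALIZABLE = {("R", "W", "R"), ("W", "W", "R"), ("W", "R", "W"), ("R", "W", "W")}

def atomicity_violations(trace, local_thread):
    """Report unserializable interleavings for one thread and one shared variable."""
    violations = []
    local_indices = [i for i, (t, _) in enumerate(trace) if t == local_thread]
    for a, b in zip(local_indices, local_indices[1:]):      # consecutive local accesses
        for r in range(a + 1, b):                           # remote accesses in between
            t_r, op_r = trace[r]
            if t_r == local_thread:
                continue
            pattern = (trace[a][1], op_r, trace[b][1])
            if pattern in UNSERIALIZABLE:
                violations.append((a, r, b, pattern))
    return violations

# Example: thread 1 reads a value twice while thread 2 writes it in between (R-W-R).
trace = [(1, "R"), (2, "W"), (1, "R"), (1, "W")]
print(atomicity_violations(trace, local_thread=1))
```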
10

Predictive Resource Management for Scientific Workflows

Witt, Carl Philipp 21 July 2020 (has links)
Scientific experiments produce data at unprecedented volumes and resolutions. For the extraction of insights from large sets of raw data, complex analysis workflows are necessary, and scientific workflows enable such data analyses at scale. To achieve scalability, most workflow management systems are designed as an additional layer on top of distributed resource managers, such as batch schedulers or distributed data processing frameworks. However, like distributed resource managers, they do not automatically determine the amount of resources required for executing individual tasks in a workflow. The status quo is that workflow management systems delegate the challenge of estimating resource usage to the user. This limits the performance and ease of use of scientific workflow management systems, as users often lack the time, expertise, or incentives to estimate resource usage accurately. This thesis is an investigation of how to learn and predict resource usage during workflow execution. In contrast to prior work, an integrated perspective on prediction and scheduling is taken, which introduces various challenges, such as quantifying the effects of prediction errors on system performance. The main contributions are: 1. A survey of peak memory usage prediction in batch processing environments, providing an overview of prior machine learning approaches, commonly used features, evaluation metrics, and data sets. 2. A static workflow scheduling method that uses statistical methods to predict which scheduling decisions can be improved. 3. A feedback-based approach to scheduling and predictive resource allocation, which is extensively evaluated using simulation; the results provide insights into the desirable characteristics of scheduling heuristics and prediction models. 4. A prediction model that reduces memory wastage, whose design takes into account the asymmetric costs of overestimation and underestimation, as well as the follow-up costs of prediction errors.
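The asymmetric-cost idea in contribution 4 can be made concrete with a small worked sketch. Under assumed cost definitions (not the thesis's exact model): overestimating a task's peak memory wastes the unused reservation for the task's duration, while underestimating triggers a failed attempt plus a retry at a larger allocation, so the two error directions are priced differently.

```python
def allocation_cost(allocated_gb, peak_gb, runtime_h, retry_factor=2.0):
    """Assumed cost model in GB-hours: an under-allocation fails and is retried larger."""
    if allocated_gb >= peak_gb:
        return (allocated_gb - peak_gb) * runtime_h          # wasted reservation
    # failed attempt wastes the whole first reservation, then retry with more memory
    retried = allocated_gb * retry_factor
    return allocated_gb * runtime_h + allocation_cost(retried, peak_gb, runtime_h, retry_factor)

def best_allocation(historical_peaks_gb, runtime_h, candidates_gb):
    """Pick the candidate allocation with the lowest mean cost over observed peaks."""
    return min(
        candidates_gb,
        key=lambda a: sum(allocation_cost(a, p, runtime_h) for p in historical_peaks_gb)
                      / len(historical_peaks_gb),
    )

peaks = [3.1, 3.4, 2.9, 6.0, 3.2]   # observed peak memory of earlier task instances (GB)
print(best_allocation(peaks, runtime_h=1.0, candidates_gb=[3, 4, 6, 8]))
```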
