1

Data preservation and reproducibility at the LHCb experiment at CERN

Trisovic, Ana January 2018 (has links)
This dissertation presents the first study of data preservation and research reproducibility in data science at the Large Hadron Collider at CERN. In particular, provenance capture of the experimental data and the reproducibility of physics analyses at the LHCb experiment were studied. First, the preservation of the software and hardware dependencies of the LHCb experimental data and simulations was investigated. It was found that the links between the data processing information and the datasets themselves were obscure. In order to document these dependencies, a graph database was designed and implemented. The nodes in the graph represent the data with their processing information, software and computational environment, whilst the edges represent their dependence on the other nodes. The database provides a central place to preserve information that was previously scattered across the LHCb computing infrastructure. Using the developed database, a methodology to recreate the LHCb computational environment and to execute the data processing on the cloud was implemented with the use of virtual containers. It was found that the produced physics events were identical to the official LHCb data, meaning that the system can aid in data preservation. Furthermore, the developed method can be used for outreach purposes, providing a streamlined way for a person external to CERN to process and analyse the LHCb data. Following this, the reproducibility of data analyses was studied. A data provenance tracking service was implemented within the LHCb software framework Gaudi. The service allows analysts to capture, within a dataset itself, the data-processing configuration needed to reproduce it. Furthermore, to assess the current status of the reproducibility of LHCb physics analyses, the major parts of an analysis were reproduced by following methods described in publicly and internally available documentation. This study allowed the identification of barriers to reproducibility and specific points where documentation is lacking. With this knowledge, one can specifically target areas that need improvement and encourage practices that would improve reproducibility in the future. Finally, contributions were made to the CERN Analysis Preservation portal, which is a general knowledge preservation framework developed at CERN to be used across all the LHC experiments. In particular, the functionality to preserve source code from git repositories and Docker images in one central location was implemented.
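To make the dependency graph described in this abstract concrete, here is a minimal Python sketch in which nodes carry processing metadata and edges record what each dataset depends on; the node names, fields and processing chain are illustrative assumptions, not the actual LHCb schema or graph database.

    # Minimal sketch of a provenance/dependency graph.  Node names and
    # fields are illustrative assumptions, not the LHCb database schema.

    from collections import defaultdict

    # Each node carries the metadata needed to regenerate it.
    nodes = {
        "sim-config-42":   {"type": "configuration", "options": "Sim09c"},
        "gauss-v49r5":     {"type": "application", "platform": "x86_64-slc6-gcc48-opt"},
        "dataset-MC-2016": {"type": "dataset", "events": 100000},
    }

    # edges[x] = list of nodes that x depends on.
    edges = defaultdict(list)
    edges["dataset-MC-2016"] = ["gauss-v49r5", "sim-config-42"]

    def dependencies(node, acc=None):
        """Depth-first walk returning every node needed to regenerate `node`."""
        acc = set() if acc is None else acc
        for dep in edges[node]:
            if dep not in acc:
                acc.add(dep)
                dependencies(dep, acc)
        return acc

    for dep in sorted(dependencies("dataset-MC-2016")):
        print(dep, nodes[dep])

Walking the graph from a dataset node yields the software and configuration nodes it depends on, which is the information a container-based re-execution would need.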
2

CHALLENGES IN SECURITY AUDITS IN OPEN SOURCE SYSTEMS / UTMANINGAR I SÄKERHETSREVISIONER I SYSTEM MED ÖPPEN KÄLLKOD

Nordberg, Pontus January 2019 (has links)
Today information technology is heavily integrated into almost every aspect of our lives, and with it comes a growing need for computer security. To ensure this security, and to verify that an organisation's related policies and procedures are enforced, security audits are conducted. At the same time, the use of open source software is becoming increasingly common, more a fact of life than an option. With these two trends in mind, this study analyses a selection of scientific literature on the topic, identifies the unique challenges a security audit faces in an open source environment, and aims to contribute to alleviating them. The study was performed as a literature review, in which the comparison and analysis revealed open-source-specific challenges, both technical issues and challenges stemming from how people perceive and handle open source software today. The answer to the question “What are the challenges when conducting security audits for open source systems and how can they be alleviated?” shows the main challenge to be that too much trust is placed in unverified binaries. The report offers suggestions and ideas on how to diminish this challenge through the use and integration of Reproducible Builds, answering the second part of the question.
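The Reproducible Builds check mentioned above ultimately reduces to verifying that an independently rebuilt binary is bit-for-bit identical to the shipped one. A minimal sketch follows; the file paths and helper are hypothetical examples, not part of the thesis or of the Reproducible Builds tooling.

    # Minimal sketch: compare cryptographic digests of two independently
    # built artifacts.  File paths are hypothetical examples.

    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        """Return the SHA-256 hex digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    official = sha256_of("vendor-release/app.bin")   # binary shipped upstream
    rebuilt  = sha256_of("local-rebuild/app.bin")    # binary rebuilt from source

    if official == rebuilt:
        print("Build is reproducible: digests match.")
    else:
        print("Digests differ: the shipped binary cannot be traced to the source.")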
3

Sweave. Dynamic generation of statistical reports using literate data analysis.

Leisch, Friedrich January 2002 (has links) (PDF)
Sweave combines typesetting with LaTeX and data analysis with S into integrated statistical documents. When run through R or S-Plus, all data analysis output (tables, graphs, ...) is created on the fly and inserted into a final LaTeX document. Options control which parts of the original S code are shown to or hidden from the reader. Many S users are also LaTeX users, hence no new software has to be learned. The report can be automatically updated if data or analysis change, which allows for truly reproducible research. (author's abstract) / Series: Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science"
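The "weave" step Sweave performs, running embedded code chunks and splicing their output into the surrounding text, can be illustrated with a toy Python sketch. It mimics Sweave's noweb-style <<>>= ... @ chunk delimiters but executes Python rather than S/R, and bears no relation to how Sweave is actually implemented; the document and chunk below are invented.

    # Toy illustration of the "weave" idea behind literate data analysis:
    # find code chunks in a document, run them, and splice the captured
    # output back into the text.  Purely illustrative, not Sweave itself.

    import io
    import re
    from contextlib import redirect_stdout

    SOURCE = """Mean of the sample:
    <<>>=
    data = [4, 8, 15, 16, 23, 42]
    print(sum(data) / len(data))
    @
    End of report."""

    def weave(text):
        def run_chunk(match):
            code = match.group(1)
            buf = io.StringIO()
            with redirect_stdout(buf):   # capture whatever the chunk prints
                exec(code, {})           # run the chunk in a fresh namespace
            return buf.getvalue().rstrip()
        # Replace each <<>>= ... @ chunk with the output it produces.
        return re.sub(r"<<.*?>>=\n(.*?)\n\s*@", run_chunk, text, flags=re.DOTALL)

    print(weave(SOURCE))

Re-running the weave after the data change regenerates the report with the new numbers, which is the sense in which such documents stay reproducible.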
4

A reproducible approach to equity backtesting

Arbi, Riaz 18 February 2020 (has links)
Research findings relating to anomalous equity returns should ideally be repeatable by others. Usually, only a small subset of the decisions made in a particular backtest workflow is released, which limits reproducibility. Data collection and cleaning, parameter setting, algorithm development and report generation are often done with manual point-and-click tools which do not log user actions. This problem is compounded by the fact that the trial-and-error approach of researchers increases the probability of backtest overfitting. Borrowing practices from the reproducible research community, we introduce a set of scripts that completely automate a portfolio-based, event-driven backtest. Based on free, open source tools, these scripts can completely capture the decisions made by a researcher, resulting in a distributable code package that allows easy reproduction of results.
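To make "completely capturing the decisions" concrete, here is a hedged sketch in which every research choice is a named parameter in one configuration and each run writes a manifest next to its result, so the backtest can be re-run exactly. The parameter names, the synthetic prices and the moving-average rule are illustrative assumptions, not the strategy or scripts from the thesis.

    # Minimal sketch of a fully scripted, re-runnable backtest: every
    # choice is a named parameter and each run records a manifest.
    # The strategy (a moving-average crossover on a synthetic random walk)
    # is only illustrative.

    import json
    import random

    CONFIG = {
        "seed": 7,               # makes the synthetic data deterministic
        "n_days": 250,
        "fast_window": 10,
        "slow_window": 50,
        "initial_cash": 100_000.0,
    }

    def moving_average(series, window, t):
        return sum(series[t - window:t]) / window

    def run_backtest(cfg):
        random.seed(cfg["seed"])
        prices = [100.0]
        for _ in range(cfg["n_days"] - 1):            # synthetic random walk
            prices.append(prices[-1] * (1 + random.gauss(0, 0.01)))

        cash, units = cfg["initial_cash"], 0.0
        for t in range(cfg["slow_window"], cfg["n_days"]):
            fast = moving_average(prices, cfg["fast_window"], t)
            slow = moving_average(prices, cfg["slow_window"], t)
            if fast > slow and units == 0:            # enter long
                units, cash = cash / prices[t], 0.0
            elif fast < slow and units > 0:           # exit
                cash, units = units * prices[t], 0.0
        return cash + units * prices[-1]

    final_value = run_backtest(CONFIG)
    with open("run_manifest.json", "w") as fh:        # record what was run
        json.dump({"config": CONFIG, "final_value": final_value}, fh, indent=2)
    print(final_value)

Because the data, parameters and logic are all driven by the version-controlled script and configuration, anyone with the code package can regenerate the same result.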
5

Root Cause Localization for Unreproducible Builds

Liu, Changlin 07 September 2020 (has links)
No description available.
6

Efficient and Cost-effective Workflow Based on Containers for Distributed Reproducible Experiments

Perera, Shelan January 2016 (has links)
Reproducing distributed experiments is a challenging task for many researchers, and several factors make the problem hard to solve. In order to reproduce distributed experiments, researchers need to perform complex deployments involving many dependent software stacks, numerous configurations and manual orchestration. Further, researchers need to allocate a large amount of money for clusters of machines and then spend their valuable time to perform those experiments. Many researchers also spend a lot of time validating a distributed scenario in a real environment, as most pseudo-distributed setups do not exhibit the characteristics of a real distributed system. Karamel addresses the inconvenience of manual orchestration by providing a comprehensive orchestration platform to deploy and run distributed experiments. Still, this solution may incur expenses similar to those of a manual distributed setup, since it uses virtual machines underneath. Further, it does not provide quick validation of a distributed setup or a fast feedback loop, as it takes considerable time to terminate and provision new virtual machines. Therefore, we provide a solution by integrating Docker so that it co-exists seamlessly with the virtual-machine-based deployment model. Our solution encapsulates the container-based deployment model so that users can reproduce distributed experiments in a cost-effective and efficient manner. In this project, we introduce a novel container-based deployment model that is not possible with the conventional virtual-machine-based approach. Further, we evaluate our solution with a real deployment of the Apache Hadoop TeraSort experiment, a benchmark for the Hadoop MapReduce platform, to show how this model can be used to save cost and improve efficiency.
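As a rough illustration of the container-based alternative to full virtual machines described above, the sketch below starts a few Docker containers on a private network to stand in for a small cluster, runs one command in a node, and tears everything down. The image name, network name, node names and command are placeholder assumptions, and this is not Karamel's actual integration.

    # Illustrative sketch: emulate a tiny "cluster" with Docker containers
    # instead of provisioning virtual machines, to get a fast feedback loop.
    # The image, network, node names and command are placeholder assumptions.

    import subprocess

    NETWORK = "exp-net"
    IMAGE = "ubuntu:22.04"
    NODES = ["node1", "node2", "node3"]

    def sh(*args):
        """Run a command and fail loudly if it errors."""
        subprocess.run(list(args), check=True)

    try:
        sh("docker", "network", "create", NETWORK)
        for name in NODES:
            # Keep each container alive so experiment steps can be exec'd into it.
            sh("docker", "run", "-d", "--name", name, "--network", NETWORK,
               IMAGE, "sleep", "infinity")
        # Run one experiment step on node1 (placeholder command).
        sh("docker", "exec", "node1", "uname", "-a")
    finally:
        for name in NODES:
            subprocess.run(["docker", "rm", "-f", name])   # best-effort cleanup
        subprocess.run(["docker", "network", "rm", NETWORK])

Containers start and stop in seconds, which is what gives the quick validation loop the abstract contrasts with terminating and re-provisioning virtual machines.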
7

A Reproducible Research Methodology for Designing and Conducting Faithful Simulations of Dynamic HPC Applications / Méthodologie de recherche reproductible adaptée à la conception et à la conduite de simulations d'applications scientifique multitâche dynamiques

Stanisic, Luka 30 October 2015 (has links)
L'évolution de l'informatique haute performance s'est réorientée au cours de cette dernière décennie. L'importante consommation énergétique des plates-formes modernes limite fortement la miniaturisation et l'augmentation des fréquences des processeurs. Cette contrainte énergétique a poussé les fabricants de matériels à développer de nombreuses architectures alternatives afin de répondre au besoin croissant de performance imposé par la communauté scientifique. Cependant, programmer efficacement sur une telle diversité de plate-formes et exploiter l'intégralité des ressources qu'elles offrent s'avère d'une grande difficulté. La tendance générale de conception d'application haute performance, basée sur un gros code monolithique offrant de nombreuses opportunités d'optimisation, est ainsi devenue de plus en plus difficile à appliquer en raison de la difficulté d'implémentation et de maintenance de ces codes complexes. Par conséquent, les développeurs de telles applications considèrent maintenant une approche plus modulaire et une exécution dynamique de celles-ci. Une approche populaire est d'implémenter ces applications à plus haut niveau, indépendamment de l'architecture matérielle, suivant un graphe de tâches où chacune d'entre elles correspond à un noyau de calcul soigneusement optimisé pour chaque architecture. Un système de runtime peut ensuite être utilisé pour ordonnancer dynamiquement ces tâches sur les ressources de calcul. Développer ces solutions et assurer leur bonne performance sur un large spectre de configurations reste un défi majeur. En raison de la grande complexité du matériel, de la variabilité des temps d'exécution des calculs et de la dynamicité d'ordonnancement des tâches, l'exécution des applications n'est pas déterministe et l'évaluation de la performance de ces systèmes est très difficile. Par conséquent, il y a un besoin de méthodes systématiques et reproductibles pour la conduite de recherche ainsi que de techniques d'évaluation de performance fiables pour étudier ces systèmes complexes. Dans cette thèse, nous montrons qu'il est possible de mettre en place une étude propre, cohérente et reproductible, par simulation, d'applications dynamiques. Nous proposons une méthode de travail unique basée sur deux outils connus, Git et Org-mode, pour la conduite de recherche expérimentale reproductible. Cette méthode simple permet une résolution pragmatique de problèmes comme le suivi de la provenance ou la réplication de l'analyse des données. Notre contribution à l'évaluation de performance des applications dynamiques consiste au design et à la validation de simulation/émulation hybride gros-grain de StarPU, un runtime dynamique basé sur un graphe de tâches pour architecture hybride, au dessus de SimGrid, un simulateur polyvalent pour systèmes distribués. Nous présentons comment notre solution permet l'obtention de prédictions fiables de performances d'exécutions réelles dans un large panel de machines hétérogènes sur deux classes de programme différentes, des applications d'algèbre linéaire dense et creuse, qui sont représentatives des applications scientifiques. / The evolution of High-Performance Computing systems has taken a sharp turn in the last decade. Due to the enormous energy consumption of modern platforms, miniaturization and frequency scaling of processors have reached a limit. This energy constraint has forced hardware manufacturers to develop alternative computer architecture solutions in order to answer the ever-growing need for performance imposed by scientists and society. However, efficiently programming such a diversity of platforms and fully exploiting the potential of the numerous different resources they offer is extremely challenging. The previously dominant trend for designing high-performance applications, which was based on large monolithic codes offering many optimization opportunities, has thus become more and more difficult to apply, since implementing and maintaining such complex codes is very difficult. Therefore, application developers increasingly consider modular approaches and dynamic application executions. A popular approach is to implement the application at a high level, independently of the hardware architecture, as a Directed Acyclic Graph of tasks, each task corresponding to carefully optimized computation kernels for each architecture. A runtime system can then be used to dynamically schedule those tasks on the different computing resources. Developing such solutions and ensuring their good performance on a wide range of setups is however very challenging. Due to the high complexity of the hardware, the variability in the duration of the operations performed on a machine and the dynamic scheduling of the tasks, application executions are non-deterministic and the performance evaluation of such systems is extremely difficult. Therefore, there is a definite need for systematic and reproducible methods for conducting such research, as well as reliable performance evaluation techniques for studying these complex systems. In this thesis, we show that it is possible to perform a clean, coherent, reproducible study of dynamic HPC applications using simulation. We propose a unique workflow based on two well-known and widely used tools, Git and Org-mode, for conducting reproducible experimental research. This simple workflow allows for pragmatically addressing issues such as provenance tracking and data analysis replication. Our contribution to the performance evaluation of dynamic HPC applications consists in the design and validation of a coarse-grain hybrid simulation/emulation of StarPU, a dynamic task-based runtime for hybrid architectures, on top of SimGrid, a versatile simulator for distributed systems. We present how this tool can achieve faithful performance predictions of native executions on a wide range of heterogeneous machines and for two different classes of programs, dense and sparse linear algebra applications, which are good representatives of real scientific applications.
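The task-graph execution model that runtimes such as StarPU implement, and that the thesis simulates on top of SimGrid, can be pictured with a small sketch: tasks form a DAG, and a scheduler dispatches each task whose dependencies are done onto whichever resource would finish it first. The tasks, timings and the "cpu"/"gpu" resources below are invented for illustration and have no relation to the calibrated StarPU/SimGrid models in the thesis.

    # Toy simulation of dynamic task-graph scheduling on heterogeneous
    # resources.  Task names, dependencies and per-resource timings are
    # invented for illustration only.

    # task -> (dependencies, {resource: duration})
    TASKS = {
        "A": ([],         {"cpu": 4.0, "gpu": 1.0}),
        "B": (["A"],      {"cpu": 3.0, "gpu": 2.5}),
        "C": (["A"],      {"cpu": 5.0, "gpu": 1.5}),
        "D": (["B", "C"], {"cpu": 2.0, "gpu": 2.0}),
    }
    RESOURCES = ["cpu", "gpu"]

    def simulate(tasks, resources):
        free_at = {r: 0.0 for r in resources}   # when each resource becomes idle
        done_at = {}                            # completion time of each task
        remaining = dict(tasks)
        while remaining:
            # Pick any task whose dependencies have all finished.
            name, (deps, costs) = next(
                (t, spec) for t, spec in remaining.items()
                if all(d in done_at for d in spec[0]))
            ready = max([done_at[d] for d in deps], default=0.0)
            # Greedy choice: the resource that finishes this task earliest.
            res = min(resources, key=lambda r: max(free_at[r], ready) + costs[r])
            start = max(free_at[res], ready)
            done_at[name] = free_at[res] = start + costs[res]
            print(f"{name} on {res}: {start:.1f} -> {done_at[name]:.1f}")
            del remaining[name]
        return max(done_at.values())

    print("makespan:", simulate(TASKS, RESOURCES))

This is a simple greedy list scheduler, meant only to convey why execution is non-deterministic in practice: with variable task durations, the mapping of tasks to resources, and hence the makespan, changes from run to run.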
8

Reproducible research, software quality, online interfaces and publishing for image processing / Recherche reproductible, qualité logicielle, publication et interfaces en ligne pour le traitement d'image

Limare, Nicolas 21 June 2012 (has links)
Cette thèse est basée sur une étude des problèmes de reproductibilité rencontrés dans la recherche en traitement d'image. Nous avons conçu, créé et développé un journal scientifique, Image Processing On Line (IPOL), dans lequel les articles sont publiés avec une implémentation complète des algorithmes décrits, validée par les rapporteurs. Un service web de démonstration des algorithmes est joint aux articles, permettant de les tester sur données libres et de consulter l'historique des expériences précédentes. Nous proposons également une politique de droits d'auteur et licences, adaptée aux manuscrits et aux logiciels issus de la recherche, et des règles visant à guider les rapporteurs dans leur évaluation du logiciel. Le projet scientifique que constitue IPOL nous apparaît très bénéfique à la recherche en traitement d'image. L'examen détaillé des implémentations et les tests intensifs via le service web de démonstration ont permis de publier des articles de meilleure qualité. La fréquentation d'IPOL montre que ce journal est utile au-delà de la communauté de ses auteurs, qui sont globalement satisfaits de leur expérience et apprécient les avantages en terme de compréhension des algorithmes, de qualité des logiciels produits, de diffusion des travaux et d'opportunités de collaboration. Disposant de définitions claires des objets et méthodes, et d'implémentations validées, il devient possible de construire des chaînes complexes et fiables de traitement des images. / This thesis is based on a study of reproducibility issues in image processing research. We designed, created and developed a scientific journal, Image Processing On Line (IPOL), in which articles are published together with a complete implementation of the algorithms described, validated by the reviewers. A demonstration web service is attached to each article, allowing the algorithms to be tested on freely submitted data and the history of previous experiments to be consulted. We also propose a copyright and license policy suitable for manuscripts and research software, and guidelines to help reviewers evaluate the software. The IPOL scientific project appears very beneficial to research in image processing. Thanks to the detailed examination of the implementations and the extensive testing via the demonstration web service, the published articles are of better quality. IPOL usage shows that the journal is useful beyond the community of its authors, who are generally satisfied with their experience and appreciate the benefits in terms of understanding of the algorithms, quality of the software produced, exposure of their work, and opportunities for collaboration. With clear definitions of objects and methods, and validated implementations, it becomes possible to build complex and reliable image processing chains.
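The "history of previous experiments" that such a demonstration service keeps can be sketched as a content-addressed archive: each run stores the input digest, the parameters and the output, so any published result can be located and re-examined later. The directory layout, function names and the placeholder algorithm below are assumptions for illustration, not IPOL's actual implementation.

    # Sketch of an experiment archive in the spirit of a demo service:
    # each run is stored under a digest of its input and parameters.
    # Layout, names and the placeholder algorithm are illustrative.

    import hashlib
    import json
    import pathlib
    import time

    ARCHIVE = pathlib.Path("archive")

    def run_and_archive(algorithm, input_path, params):
        payload = pathlib.Path(input_path).read_bytes()
        key = hashlib.sha256(
            payload + json.dumps(params, sort_keys=True).encode()).hexdigest()[:16]
        run_dir = ARCHIVE / key
        run_dir.mkdir(parents=True, exist_ok=True)

        result = algorithm(payload, **params)          # run the published code
        (run_dir / "output.bin").write_bytes(result)
        (run_dir / "meta.json").write_text(json.dumps({
            "input_sha256": hashlib.sha256(payload).hexdigest(),
            "params": params,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }, indent=2))
        return run_dir

    def invert(data, level=255):
        """Placeholder 'algorithm': invert raw byte values."""
        return bytes(level - b for b in data)

    # Example usage (assumes an input file exists):
    # run_and_archive(invert, "example.png", {"level": 255})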
9

Estratégia computacional para apoiar a reprodutibilidade e reuso de dados científicos baseado em metadados de proveniência. / Computational strategy to support the reproducibility and reuse of scientific data based on provenance metadata.

Silva, Daniel Lins da 17 May 2017 (has links)
A ciência moderna, apoiada pela e-science, tem enfrentado desafios de lidar com o grande volume e variedade de dados, gerados principalmente pelos avanços tecnológicos nos processos de coleta e processamento dos dados científicos. Como consequência, houve também um aumento na complexidade dos processos de análise e experimentação. Estes processos atualmente envolvem múltiplas fontes de dados e diversas atividades realizadas por grupos de pesquisadores geograficamente distribuídos, que devem ser compreendidas, reutilizadas e reproduzíveis. No entanto, as iniciativas da comunidade científica que buscam disponibilizar ferramentas e conscientizar os pesquisadores a compartilharem seus dados e códigos-fonte, juntamente com as publicações científicas, são, em muitos casos, insuficientes para garantir a reprodutibilidade e o reuso das contribuições científicas. Esta pesquisa objetiva definir uma estratégia computacional para o apoio ao reuso e a reprodutibilidade dos dados científicos, por meio da gestão da proveniência dos dados durante o seu ciclo de vida. A estratégia proposta nesta pesquisa é apoiada em dois componentes principais, um perfil de aplicação, que define um modelo padronizado para a descrição da proveniência dos dados, e uma arquitetura computacional para a gestão dos metadados de proveniência, que permite a descrição, armazenamento e compartilhamento destes metadados em ambientes distribuídos e heterogêneos. Foi desenvolvido um protótipo funcional para a realização de dois estudos de caso que consideraram a gestão dos metadados de proveniência de experimentos de modelagem de distribuição de espécies. Estes estudos de caso possibilitaram a validação da estratégia computacional proposta na pesquisa, demonstrando o seu potencial no apoio à gestão de dados científicos. / Modern science, supported by e-science, has faced challenges in dealing with the large volume and variety of data generated primarily by technological advances in the processes of collecting and processing scientific data. As a consequence, there has also been an increase in the complexity of the analysis and experimentation processes. These processes currently involve multiple data sources and numerous activities performed by geographically distributed research groups, and they must be understandable, reusable and reproducible. However, initiatives by the scientific community to provide tools and to encourage researchers to share the data and source code behind their findings, along with their scientific publications, are often insufficient to ensure the reproducibility and reuse of scientific results. This research aims to define a computational strategy to support the reuse and reproducibility of scientific data through the management of data provenance over its entire life cycle. The proposed strategy rests on two principal components: an application profile, which defines a standardized model for describing provenance metadata, and a computational architecture for managing provenance metadata, which enables these metadata to be described, stored and shared in distributed and heterogeneous environments. A functional prototype was developed to carry out two case studies on the management of provenance metadata in species distribution modeling experiments. These case studies enabled the validation of the computational strategy proposed in this research, demonstrating its potential for supporting the management of scientific data.
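A provenance record of the kind such an application profile standardizes can be sketched with the core W3C PROV concepts of entities, activities and agents; the JSON layout, identifiers and field names below are illustrative assumptions, not the profile defined in the thesis.

    # Illustrative provenance record loosely following W3C PROV concepts
    # (entity / activity / agent).  Identifiers and field names are
    # invented and do not reproduce the thesis's application profile.

    import json

    record = {
        "entity": {
            "dataset:occurrences-v2": {"type": "csv", "rows": 18234},
            "dataset:sdm-map-v1": {"type": "geotiff"},
        },
        "activity": {
            "run:sdm-2017-05-17": {
                "startedAtTime": "2017-05-17T10:00:00Z",
                "used": ["dataset:occurrences-v2"],
                "generated": ["dataset:sdm-map-v1"],
                "parameters": {"algorithm": "maxent", "resolution_km": 1},
            }
        },
        "agent": {
            "researcher:dls": {"role": "analyst"},
        },
    }

    # A serialized record like this could be stored and shared across
    # distributed, heterogeneous repositories.
    print(json.dumps(record, indent=2))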