Global ETD Search

1	Data preservation and reproducibility at the LHCb experiment at CERN Trisovic, Ana January 2018 (has links) This dissertation presents the first study of data preservation and research reproducibility in data science at the Large Hadron Collider at CERN. In particular, provenance capture of the experimental data and the reproducibility of physics analyses at the LHCb experiment were studied. First, the preservation of the software and hardware dependencies of the LHCb experimental data and simulations was investigated. It was found that the links between the data processing information and the datasets themselves were obscure. In order to document these dependencies, a graph database was designed and implemented. The nodes in the graph represent the data with their processing information, software and computational environment, whilst the edges represent their dependence on the other nodes. The database provides a central place to preserve information that was previously scattered across the LHCb computing infrastructure. Using the developed database, a methodology to recreate the LHCb computational environment and to execute the data processing on the cloud was implemented with the use of virtual containers. It was found that the produced physics events were identical to the official LHCb data, meaning that the system can aid in data preservation. Furthermore, the developed method can be used for outreach purposes, providing a streamlined way for a person external to CERN to process and analyse the LHCb data. Following this, the reproducibility of data analyses was studied. A data provenance tracking service was implemented within the LHCb software framework Gaudi. The service allows analysts to capture, within the dataset itself, the data processing configurations needed to reproduce that dataset. 
Furthermore, to assess the current status of the reproducibility of LHCb physics analyses, the major parts of an analysis were reproduced by following methods described in publicly and internally available documentation. This study allowed the identification of barriers to reproducibility and specific points where documentation is lacking. With this knowledge, one can specifically target areas that need improvement and encourage practices that would improve reproducibility in the future. Finally, contributions were made to the CERN Analysis Preservation portal, which is a general knowledge preservation framework developed at CERN to be used across all the LHC experiments. In particular, the functionality to preserve source code from git repositories and Docker images in one central location was implemented.
2	Sweave. Dynamic generation of statistical reports using literate data analysis. Leisch, Friedrich January 2002 (has links) (PDF) Sweave combines typesetting with LaTeX and data analysis with S into integrated statistical documents. When run through R or S-PLUS, all data analysis output (tables, graphs, ...) is created on the fly and inserted into a final LaTeX document. Options control which parts of the original S code are shown to or hidden from the reader. Many S users are also LaTeX users, hence no new software has to be learned. The report can be automatically updated if data or analysis change, which allows for truly reproducible research. (author's abstract) / Series: Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science"
3	A reproducible approach to equity backtesting Arbi, Riaz 18 February 2020 (has links) Research findings relating to anomalous equity returns should ideally be repeatable by others. Usually, only a small subset of the decisions made in a particular backtest workflow is released, which limits reproducibility. Data collection and cleaning, parameter setting, algorithm development and report generation are often done with manual point-and-click tools that do not log user actions. This problem is compounded by the fact that the trial-and-error approach of researchers increases the probability of backtest overfitting. Borrowing practices from the reproducible research community, we introduce a set of scripts that completely automate a portfolio-based, event-driven backtest. Based on free, open source tools, these scripts can completely capture the decisions made by a researcher, resulting in a distributable code package that allows easy reproduction of results. Equity Backtesting Reproducible Research Event-based Backtesting R RStudio
4	Reproducible research, software quality, online interfaces and publishing for image processing / Recherche reproductible, qualité logicielle, publication et interfaces en ligne pour le traitement d'image Limare, Nicolas 21 June 2012 (has links) This thesis is based on a study of reproducibility issues in image processing research. We designed, created and developed a scientific journal, Image Processing On Line (IPOL), in which articles are published with a complete implementation of the algorithms described, validated by the reviewers. A demonstration web service is attached to the articles, allowing the algorithms to be tested on freely submitted data and giving access to an archive of previous experiments. We also propose a copyright and license policy suitable for manuscripts and research software, and guidelines for reviewers evaluating the software. The IPOL scientific project seems very beneficial to research in image processing. The detailed examination of the implementations and extensive testing via the demonstration web service have made it possible to publish articles of better quality. IPOL usage shows that this journal is useful beyond the community of its authors, who are generally satisfied with their experience and appreciate the benefits in terms of understanding of the algorithms, quality of the software produced, exposure of their work and opportunities for collaboration. With clear definitions of objects and methods, and validated implementations, it becomes possible to build complex and reliable image processing chains. Software engineering Scientific publications Reproducible research Image processing
5	Estratégia computacional para apoiar a reprodutibilidade e reuso de dados científicos baseado em metadados de proveniência. / Computational strategy to support the reproducibility and reuse of scientific data based on provenance metadata. Silva, Daniel Lins da 17 May 2017 (has links) Modern science, supported by e-science, has faced challenges in dealing with the large volume and variety of data generated primarily by technological advances in the processes of collecting and processing scientific data. As a consequence, the complexity of the analysis and experimentation processes has also increased. These processes currently involve multiple data sources and numerous activities performed by geographically distributed research groups, which must be understandable, reusable and reproducible. However, initiatives by the scientific community that aim to provide tools and to encourage researchers to share their data and source code along with their scientific publications are often insufficient to ensure the reproducibility and reuse of scientific results. This research aims to define a computational strategy to support the reuse and reproducibility of scientific data through data provenance management during its entire life cycle. The proposed strategy rests on two principal components: an application profile that defines a standardized model for the description of data provenance, and a computational architecture for the management of provenance metadata that enables the description, storage and sharing of these metadata in distributed and heterogeneous environments. We developed a functional prototype and used it in two case studies on the management of provenance metadata from species distribution modeling experiments. These case studies enabled the validation of the computational strategy proposed in the research, demonstrating its potential in supporting the management of scientific data. 
Software architecture Biodiversity Data provenance Data science Informatics Metadata Reproducible research Data reuse
6	Estratégia computacional para apoiar a reprodutibilidade e reuso de dados científicos baseado em metadados de proveniência. / Computational strategy to support the reproducibility and reuse of scientific data based on provenance metadata. Daniel Lins da Silva 17 May 2017 (has links) Modern science, supported by e-science, has faced challenges in dealing with the large volume and variety of data generated primarily by technological advances in the processes of collecting and processing scientific data. As a consequence, the complexity of the analysis and experimentation processes has also increased. These processes currently involve multiple data sources and numerous activities performed by geographically distributed research groups, which must be understandable, reusable and reproducible. However, initiatives by the scientific community that aim to provide tools and to encourage researchers to share their data and source code along with their scientific publications are often insufficient to ensure the reproducibility and reuse of scientific results. This research aims to define a computational strategy to support the reuse and reproducibility of scientific data through data provenance management during its entire life cycle. The proposed strategy rests on two principal components: an application profile that defines a standardized model for the description of data provenance, and a computational architecture for the management of provenance metadata that enables the description, storage and sharing of these metadata in distributed and heterogeneous environments. We developed a functional prototype and used it in two case studies on the management of provenance metadata from species distribution modeling experiments. These case studies enabled the validation of the computational strategy proposed in the research, demonstrating its potential in supporting the management of scientific data. 
Software architecture Biodiversity Informatics Metadata Data provenance Data science Reproducible research Data reuse
7	Development of a protocol for 3-D reconstruction of brain aneurysms from volumetric image data Welch, David Michael 01 July 2010 (has links) Cerebral aneurysm formation, growth, and rupture are active areas of investigation in the medical community. To model and test the mechanical processes involved, segmentations of small aneurysms (< 5 mm) need to be performed quickly and reliably for large patient populations. In the absence of robust automatic segmentation methods, the Vascular Modeling Toolkit (VMTK) provides scripts for the complex tasks involved in computer-assisted segmentation. Though these tools give researchers a great amount of flexibility, they also make reproduction of results between investigators difficult and unreliable. We introduce a VMTK pipeline protocol that minimizes the user interaction for vessel and aneurysm segmentation and a training method for new users. This protocol allows for decision tree handling for CTA and MRA images. Furthermore, we investigate the variation between two expert users and two novice users for six patients using shape index measures developed by Ma et al. and Raghavan et al. cerebral aneurysms computer-assisted segmentation reproducible research shape indices user variation VMTK
8	Reproducible research, software quality, online interfaces and publishing for image processing Limare, Nicolas 21 June 2012 (has links) (PDF) This thesis is based on a study of reproducibility issues in image processing research. We designed, created and developed a scientific journal, Image Processing On Line (IPOL), in which articles are published with a complete implementation of the algorithms described, validated by the reviewers. A demonstration web service is attached to the articles, allowing the algorithms to be tested on freely submitted data and giving access to an archive of previous experiments. We also propose a copyright and license policy suitable for manuscripts and research software, and guidelines for the evaluation of software. The IPOL scientific project seems very beneficial to research in image processing. With the detailed examination of the implementations and extensive testing via the demonstration web service, we publish articles of better quality. IPOL usage shows that this journal is useful beyond the community of its authors, who are generally satisfied with their experience and appreciate the benefits in terms of understanding of the algorithms, quality of the software produced, exposure of their work and opportunities for collaboration. With clear definitions of objects and methods, and validated implementations, complex image processing chains become possible. Computer Science/Other Scientific publications Reproducible research Image processing
9	Reliability Generalization: a Systematic Review and Evaluation of Meta-analytic Methodology and Reporting Practice Holland, David F. (Educational consultant) 12 1900 (has links) Reliability generalization (RG) is a method for meta-analysis of reliability coefficients to estimate average score reliability across studies, determine variation in reliability, and identify study-level moderator variables influencing score reliability. A total of 107 peer-reviewed RG studies published from 1998 to 2013 were systematically reviewed to characterize the meta-analytic methods employed and to evaluate quality of reporting practice against standards for transparency in meta-analysis reporting. Most commonly, RG studies meta-analyzed alpha coefficients, which were synthesized using an unweighted, fixed-effects model applied to untransformed coefficients. Moderator analyses most frequently included multiple regression and bivariate correlations employing a fixed-effects model on untransformed, unweighted coefficients. Based on a unit-weighted scoring system, mean reporting quality for RG studies was statistically less than that for a comparison study of 198 meta-analyses in the organizational sciences across 42 indicators; however, means were not statistically significantly different between the two studies when evaluating reporting quality on 18 indicators deemed essential to ethical reporting practice in meta-analyses. Since its inception a wide variety of statistical methods have been applied to RG, and meta-analysis of reliability coefficients has extended to fields outside of psychological measurement, such as medicine and business. A set of guidelines for conducting and reporting RG studies is provided. reliability generalization reliability systematic review research synthesis meta-analysis Reproducible research. Meta-analysis.
10	A Reproducible Research Methodology for Designing and Conducting Faithful Simulations of Dynamic HPC Applications / Méthodologie de recherche reproductible adaptée à la conception et à la conduite de simulations d'applications scientifique multitâche dynamiques Stanisic, Luka 30 October 2015 (has links) The evolution of High-Performance Computing systems has taken a sharp turn in the last decade. Due to the enormous energy consumption of modern platforms, miniaturization and frequency scaling of processors have reached a limit. The energy constraints have forced hardware manufacturers to develop alternative computer architecture solutions in order to answer the ever-growing need for performance imposed by scientists and society. However, efficiently programming such a diversity of platforms and fully exploiting the potential of the numerous different resources they offer is extremely challenging. The previously dominant trend for designing high-performance applications, which was based on large monolithic codes offering many optimization opportunities, has thus become more and more difficult to apply, since implementing and maintaining such complex codes is very difficult. Therefore, application developers increasingly consider modular approaches and dynamic application executions. A popular approach is to implement the application at a high level, independently of the hardware architecture, as Directed Acyclic Graphs of tasks, each task corresponding to carefully optimized computation kernels for each architecture. A runtime system can then be used to dynamically schedule those tasks on the different computing resources. Developing such solutions and ensuring their good performance on a wide range of setups is, however, very challenging. 
Due to the high complexity of the hardware, the variability in the duration of the operations performed on a machine, and the dynamic scheduling of the tasks, the application executions are non-deterministic and the performance evaluation of such systems is extremely difficult. Therefore, there is a definite need for systematic and reproducible methods for conducting such research, as well as reliable performance evaluation techniques for studying these complex systems. In this thesis, we show that it is possible to perform a clean, coherent, reproducible study, using simulation, of dynamic HPC applications. We propose a unique workflow based on two well-known and widely used tools, Git and Org-mode, for conducting reproducible experimental research. 
This simple workflow allows for pragmatically addressing issues such as provenance tracking and data analysis replication. Our contribution to the performance evaluation of dynamic HPC applications consists in the design and validation of a coarse-grain hybrid simulation/emulation of StarPU, a dynamic task-based runtime for hybrid architectures, over SimGrid, a versatile simulator for distributed systems. We present how this tool can achieve faithful performance predictions of native executions on a wide range of heterogeneous machines and for two different classes of programs, dense and sparse linear algebra applications, that are good representatives of real scientific applications. Performance Evaluation Runtime HPC Methodology Reproducible Research Simulation 621

Search results