Science Guided Machine Learning: Incorporating Scientific Domain Knowledge for Learning Under Data Paucity and Noisy Contexts

In recent years, the abundance of labeled data has pushed machine learning (ML) research toward purely data-driven, end-to-end pipelines, e.g., in deep neural network research. However, in many situations, data is limited and of poor quality, and traditional ML pipelines are susceptible to various failures when trained on low volumes of non-representative, noisy data. We investigate whether prior domain knowledge about the process being modeled can be employed within the ML pipeline to improve model performance under data paucity and in noisy contexts. This report surveys recent developments and details novel contributions on incorporating prior domain knowledge into data-driven modeling (i.e., ML) pipelines, with a particular focus on scientific applications. Such domain knowledge exists in various forms and can be incorporated into the ML pipeline through different implicit and explicit methods, an approach we term science-guided machine learning (SGML). All the novel techniques proposed in this report are presented in the context of SGML models for fluid dynamics applications, but they generalize readily to other domains. Specifically, we present SGML pipelines that (i) incorporate prior domain knowledge into the ML model architecture, (ii) incorporate knowledge about the distribution of the target process as statistical priors for improved prediction performance, (iii) leverage prior knowledge to quantify the consistency of ML decisions with scientific principles, (iv) explicitly incorporate known mathematical relationships of scientific phenomena to influence the ML pipeline, and (v) develop science-guided transfer learning to improve performance under data paucity. Each technique is designed with a focus on simplicity and minimal implementation cost, with the goal of yielding significant improvements in model performance, especially with low data volumes or noisy data. In each application, we demonstrate through rigorous qualitative and quantitative experiments that our SGML pipelines achieve significant improvements in performance and interpretability over corresponding models that are purely data-driven and agnostic to scientific knowledge.

Doctor of Philosophy

In this work, we present techniques for incorporating scientific knowledge into machine learning (ML) pipelines. We demonstrate these techniques with ML models trained on low data volumes as well as on non-representative, noisy datasets. In both cases, we show through rigorous experimentation that incorporating scientific domain knowledge into the ML pipeline using our proposed science-guided machine learning (SGML) techniques leads to significant performance improvements.
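Technique (iv) above, explicitly incorporating known mathematical relationships into the ML pipeline, is commonly realized by adding a physics-consistency penalty to the training loss. The sketch below illustrates that general idea only; it is not the dissertation's exact formulation. The network, the constraint encoded in physics_residual (a toy mass-balance condition), and the weight lam are illustrative assumptions.

```python
# Minimal sketch of a science-guided loss: data-fit term plus a penalty for
# violating a known domain constraint. All names and the constraint itself are
# hypothetical, chosen only to illustrate the general SGML pattern.
import torch
import torch.nn as nn


class SimpleRegressor(nn.Module):
    """Small fully connected network standing in for an SGML surrogate model."""

    def __init__(self, n_in: int, n_out: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, 64), nn.ReLU(),
            nn.Linear(64, n_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def physics_residual(pred: torch.Tensor) -> torch.Tensor:
    """Hypothetical constraint: the two predicted quantities (e.g., inflow and
    outflow of a control volume) should balance, so their sum should be zero."""
    return pred[:, 0] + pred[:, 1]


def science_guided_loss(pred: torch.Tensor, target: torch.Tensor,
                        lam: float = 0.1) -> torch.Tensor:
    """Standard data-fit loss plus a weighted physics-violation penalty."""
    data_loss = nn.functional.mse_loss(pred, target)
    phys_loss = physics_residual(pred).pow(2).mean()
    return data_loss + lam * phys_loss


# Usage: one gradient step on synthetic data.
model = SimpleRegressor(n_in=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 4), torch.randn(32, 2)
loss = science_guided_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```

The weight lam trades off fitting the (possibly noisy) labels against consistency with the domain constraint, which is why this style of penalty is most useful when labeled data is scarce or noisy.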

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/111558
Date: 18 August 2022
Creators: Muralidhar, Nikhil
Contributors: Computer Science and Applications, Ramakrishnan, Narendran, Tafti, Danesh K., Lu, Chang Tien, Karpatne, Anuj, Ermon, Stefano
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Language: English
Detected Language: English
Type: Dissertation
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/