1

To what extent is your data assimilation scheme designed to find the posterior mean, the posterior mode or something else?

Hodyss, Daniel, Bishop, Craig H., Morzfeld, Matthias 30 September 2016
Recently there has been a surge in interest in coupling ensemble-based data assimilation methods with variational methods (commonly referred to as 4DVar). Here we discuss a number of important differences between ensemble-based and variational methods that ought to be considered when attempting to fuse these methods. We note that the Best Linear Unbiased Estimate (BLUE) of the posterior mean over a data assimilation window can only be delivered by data assimilation schemes that utilise the 4-dimensional (4D) forecast covariance of a prior distribution of non-linear forecasts across the data assimilation window. An ensemble Kalman smoother (EnKS) may be viewed as a BLUE-approximating data assimilation scheme. In contrast, we use the dual form of 4DVar to show that the most likely non-linear trajectory corresponding to the posterior mode across a data assimilation window can only be delivered by data assimilation schemes that create counterparts of the 4D prior forecast covariance using a tangent linear model. Since 4DVar schemes have the required structural framework to identify posterior modes, in contrast to the EnKS, they may be viewed as mode-approximating data assimilation schemes. Hence, when aspects of the EnKS and 4DVar data assimilation schemes are blended together in a hybrid, one would like to be able to understand how such changes would affect the mode- or mean-finding abilities of the data assimilation schemes. This article helps build such understanding using a series of simple examples. We argue that this understanding has important implications for both the interpretation of the hybrid state estimates and their design.
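
A toy numerical sketch of the distinction drawn above (illustrative only, not the article's experiments): for a Gaussian prior and a hypothetical non-linear forward map, the posterior mean that a BLUE-type scheme such as the EnKS targets and the posterior mode that a 4DVar-type scheme targets are different quantities. The forward map, variances, and observation value below are assumptions chosen for demonstration.

```python
# Scalar toy problem: Gaussian prior, non-linear observation operator h,
# posterior evaluated on a grid so the mean and the mode can be compared.
import numpy as np

def scalar_posterior(y_obs, h, prior_var=1.0, obs_var=0.1,
                     grid=np.linspace(-4.0, 4.0, 20001)):
    """Unnormalised Bayes rule on a grid for a scalar state."""
    prior = np.exp(-0.5 * grid**2 / prior_var)
    likelihood = np.exp(-0.5 * (y_obs - h(grid))**2 / obs_var)
    post = prior * likelihood
    dx = grid[1] - grid[0]
    post /= post.sum() * dx            # normalise numerically
    return grid, post, dx

if __name__ == "__main__":
    h = lambda x: x + 0.8 * x**2       # hypothetical non-linear forecast/observation map
    grid, post, dx = scalar_posterior(y_obs=1.0, h=h)
    post_mean = (grid * post).sum() * dx
    post_mode = grid[np.argmax(post)]
    print(f"posterior mean ~ {post_mean:.3f}, posterior mode ~ {post_mode:.3f}")
```

For a linear h the two numbers coincide; the non-linearity is what pulls the mean and the mode apart, which is the situation the abstract contrasts.
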
2

Methods for the spatial modeling and evaluation of tree canopy cover

Datsko, Jill Marie 24 May 2022
Tree canopy cover is an essential measure of forest health and productivity, which is widely studied due to its relevance to many disciplines. For example, declining tree canopy cover can be an indicator of forest health, insect infestation, or disease. This dissertation consists of three studies, focused on the spatial modeling and evaluation of tree canopy cover, drawing on recent developments and best practices in the fields of remote sensing, data collection, and statistical analysis.

The first study evaluates how well harmonic regression variables derived at the pixel level using a time series of all available Landsat images predict values of tree canopy cover. Harmonic regression approximates the reflectance curve of a given band across time, so the coefficients that result from the harmonic regression model relate to the phenology of the area of each pixel. We use a time series of all available cloud-free observations in each Landsat pixel for the NDVI, SWIR1, and SWIR2 bands to obtain harmonic regression coefficients for each variable and then use those coefficients to estimate tree canopy cover at two discrete points in time. This study compares models estimated using these harmonic regression coefficients to those estimated using Landsat median composite imagery, and to combined models. We show that (1) harmonic regression models that use a single harmonic provided the best-quality models, (2) harmonic regression coefficients from Landsat-derived NDVI, SWIR1, and SWIR2 bands improve the quality of tree canopy cover models when added to the full suite of median composite variables, (3) the harmonic regression constant for the NDVI time series is an important variable across models, and (4) there is little to no additional information in the full suite of predictors compared to the harmonic regression coefficients alone, based on the information criterion provided by principal components analysis. The second study evaluates the use of crowdsourcing with Amazon's Mechanical Turk platform to obtain photointerpreted tree canopy cover data. We collected multiple interpretations at each plot from both crowd and expert interpreters, sampled these data using a Monte Carlo framework to estimate a classification model predicting the "reliability" of each crowd interpretation using expert interpretations as a benchmark, and identified the most important variables in estimating this reliability. The results show low agreement between crowd and expert groups, as well as between individual experts. We found that variables related to fatigue had the most bearing on the "reliability" of crowd interpretations, followed by whether the interpreter used false color or natural color composite imagery during interpretation. Recommendations for further study and future implementations of crowdsourced photointerpretation are also provided. In the final study, we explored sampling methods for the purpose of model validation. We evaluated a method of stratified random sampling with optimal allocation, using measures of prediction uncertainty derived from random forest regression models, by comparing the accuracy and precision of estimates from samples drawn using this method to estimates from samples drawn using other common sampling protocols, with three large, simulated datasets as case studies.
We further tested the effect of reduced sample sizes on one of these datasets and demonstrated a method to report the accuracy of continuous models for domains that are either regionally constrained or numerically defined based on other variables or the modeled quantity itself. We show that stratified random sampling with optimal allocation provides the most precise estimates of the mean of the reference Y and of the RMSE of the population. We also demonstrate that all sampling methods provide reasonably accurate estimates on average. Additionally, we show that, as sample sizes are increased with each sampling method, the precision generally increases, eventually reaching a level of convergence where gains in estimate precision from adding additional samples would be marginal.

/ Doctor of Philosophy / Tree canopy cover is an essential measure of forest health, which is widely studied due to its relevance to many disciplines. For example, declining tree canopy cover can be an indicator of forest health, insect infestation, or disease. This dissertation consists of three studies, focused on the spatial modeling and evaluation of tree canopy cover, drawing on recent developments and best practices in the fields of remote sensing, data collection, and statistical analysis. The first study is an evaluation of the utility of harmonic regression coefficients from time-series satellite imagery, which describe the timing and magnitude of green-up and leaf loss at each location, for estimating tree canopy cover. This study compares models estimated using these harmonic regression coefficients to those estimated using median composite imagery, which takes the median reflectance value across time at each location, and to models that used both types of variables. We show that (1) harmonic regression coefficients that use a simplified formula provided higher-quality models compared to more complex alternatives, (2) harmonic regression coefficients improved the quality of tree canopy cover models when added to the full suite of median composite variables, (3) the harmonic regression constant, the coefficient that captures the average reflectance over time in the time-series vegetation index data, is an important variable across models, and (4) there is little to no additional information in the full suite of predictors compared to the harmonic regression coefficients alone.

The second study evaluates the use of crowdsourcing, which engages non-experts in paid online tasks, with Amazon's Mechanical Turk platform to obtain tree canopy cover data as interpreted from aerial images. We collected multiple interpretations at each location from both crowd and expert interpreters, sampled these data using a repeated sampling framework to estimate a classification model predicting the "reliability" of each crowd interpretation using expert interpretations as a benchmark, and identified the most important variables in estimating this "reliability". The results show low agreement between crowd and expert groups, as well as between individual experts. We found that variables related to fatigue had the most bearing on the reliability of crowd interpretations, followed by variables related to the display settings used to view imagery during interpretation. Recommendations for further study and future implementations of crowdsourced photointerpretation are also provided. In the final study, we explored sampling methods for the purpose of model validation.
We evaluated a method of stratified random sampling with optimal allocation, a sampling method that is specifically designed to improve the precision of sample estimates, using measures of prediction uncertainty, describing the variability in predictions from different models in an ensemble of regression models. We compared the accuracy and precision of estimates from samples drawn using this method to estimates from samples drawn using other common sampling protocols using three large, mathematically simulated data products as case studies. We further tested the effect of smaller sample sizes on one of these data products and demonstrated a method to report the accuracy of continuous models for different land cover classes and for classes defined using 10% tree canopy cover intervals. We show that stratified random sampling with optimal allocation provides the most precise sample estimates. We also demonstrate that all sampling methods provide reasonably accurate estimates on average and we show that, as sample sizes are increased with each sampling method, the precision generally increases, eventually leveling off where gains in estimate precision from adding additional samples would be marginal.
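
As a hedged illustration of the single-harmonic regression idea described in the abstract above, the sketch below fits NDVI(t) = c0 + a*cos(2*pi*t) + b*sin(2*pi*t) to a synthetic per-pixel time series by ordinary least squares. The synthetic data, seasonal amplitude, and function names are assumptions; this is not the dissertation's Landsat processing chain.

```python
# Single-harmonic regression for one pixel's vegetation-index time series,
# where t is measured in fractional years.
import numpy as np

def harmonic_coefficients(t, ndvi, n_harmonics=1):
    """Least-squares harmonic regression coefficients: [constant, cos1, sin1, ...]."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * k * t))
        cols.append(np.sin(2 * np.pi * k * t))
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, ndvi, rcond=None)
    return coef

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 3, 60))   # three years of cloud-free observations
    ndvi = 0.55 + 0.25 * np.sin(2 * np.pi * (t - 0.3)) + rng.normal(0, 0.03, t.size)
    const, cos1, sin1 = harmonic_coefficients(t, ndvi)
    print(f"constant={const:.3f}, cos={cos1:.3f}, sin={sin1:.3f}")
```

The constant term plays the role of the "harmonic regression constant" that the abstract identifies as an important predictor, while the cosine and sine terms encode the timing and amplitude of the seasonal cycle.
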
3

Structure and dynamics of proteins that inhibit complement activation

Maciejewski, Mateusz January 2012
NMR studies have long been used as a tool to derive structural and dynamic information. Such information has a wide range of applications, and notably is used in the study of structure-activity relationships. The aims of this work were to use NMR spectroscopy to derive structures of the molecules inhibiting the activation of the alternative pathway of the complement portion of the innate immune system (namely, the N-terminus of factor H (FH) and two small peptides, Compstatin 10 and Compstatin 20) and to consider the interdomain dynamics of proteins consisting of three modules theoretically (in silico) and experimentally (for the three N-terminal domains of FH). We focused on the three N-terminal complement control protein (CCP) domains of the important complement regulator, human factor H (i.e. FH1-3). Its three-dimensional solution structure was derived based on nuclear Overhauser effects and residual dipolar couplings (RDCs). Each of the three CCP modules in this structure was similar to the corresponding CCP in the previously derived C3b-bound structure of FH1-4, but the relative orientations of the domains were different. These orientations were additionally different from the interdomain orientations in other molecules that interact with C3b, such as DAF2-4 and CR1-15-17. The measured RDC datasets, collected under three different conditions in media containing magnetically aligned bicelles (disk-like particles formed from phospholipids), were used to estimate interdomain motions in FH1-3. A method in which the data were fitted to a structural ensemble was used to analyze such interdomain flexibility. More than 80% of the conformers of this predominantly extended three-domain molecule exhibit flexions of < 40°. Such segmental flexibility (together with the local dynamics of the hypervariable loop within domain 3) could facilitate recognition of C3b via initial anchoring, as well as eventual reorganization of modules into the conformation captured in the previously solved crystal structure of a C3b complex with FH1-4. The NMR study of the Compstatin analogues revealed unique structural features that had not previously been observed in this group of peptides. These features included two β-turns per peptide, neither of which was located in the 'canonical' regions in which β-turns were observed in previous molecular dynamics and NMR studies. The structures of Compstatin 10 and Compstatin 20 derived here were consistent with the isothermal titration calorimetry (ITC) and surface plasmon resonance (SPR) data recorded previously. In the in silico study of interdomain motion of three-domain proteins carried out here, the domains were represented as vectors attached to one another in a linear fashion. They were allowed to undergo Brownian motion biased by the potentials between the sequential vectors. The resulting trajectories were analyzed using the model-free and extended model-free formalisms. The degree of coupling of the interdomain motion with overall motion was determined, along with a representation of the overall motion. The similarity between the trajectories of the vectors transformed to this overall motion frame and the results obtained from the model-free analysis was determined.
4

Predicting Patient Satisfaction With Ensemble Methods

Rosales, Elisa Renee 30 April 2015
Health plans are constantly seeking ways to assess and improve the quality of patient experience in various ambulatory and institutional settings. Standardized surveys are a common tool used to gather data about patient experience, and a useful measurement taken from these surveys is known as the Net Promoter Score (NPS). This score represents the extent to which a patient would, or would not, recommend his or her physician on a scale from 0 to 10, where 0 corresponds to "Extremely unlikely" and 10 to "Extremely likely". A large national health plan utilized automated calls to distribute such a survey to its members and was interested in understanding what factors contributed to a patient's satisfaction. Additionally, they were interested in whether or not NPS could be predicted using responses from other questions on the survey, along with demographic data. When the distribution of various predictors was compared between the less satisfied and highly satisfied members, there was significant overlap, indicating that not even the Bayes classifier could successfully differentiate between these members. Moreover, the highly imbalanced proportion of NPS responses resulted in initially poor prediction accuracy. Thus, due to the non-linear structure of the data and the high number of categorical predictors, we leveraged flexible methods, such as decision trees, bagging, and random forests, for modeling and prediction. We further altered the prediction step in the random forest algorithm in order to account for the imbalanced structure of the data.
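
The abstract does not spell out how the random-forest prediction step was altered, so the sketch below shows one common adjustment for imbalanced classes on synthetic data: balanced class weights plus a lowered probability threshold for the rare class, instead of the default 0.5 majority rule. All data and parameter values are hypothetical.

```python
# Random forest on an imbalanced binary problem with an adjusted prediction step.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Synthetic stand-in for the survey data: roughly 10% of members in the rare class.
X, y = make_classification(n_samples=4000, n_features=12, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                random_state=0).fit(X_tr, y_tr)

# Alter the prediction step: threshold the averaged class probabilities
# rather than taking the default majority vote at 0.5.
proba_rare = forest.predict_proba(X_te)[:, 1]
y_hat = (proba_rare >= 0.3).astype(int)
print("balanced accuracy:", round(balanced_accuracy_score(y_te, y_hat), 3))
```
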
5

Machine Learning and Field Inversion approaches to Data-Driven Turbulence Modeling

Michelen Strofer, Carlos Alejandro 27 April 2021
There still is a practical need for improved closure models for the Reynolds-averaged Navier-Stokes (RANS) equations. This dissertation explores two different approaches for using experimental data to provide improved closure for the Reynolds stress tensor field. The first approach uses machine learning to learn a general closure model from data. A novel framework is developed to train deep neural networks using experimental velocity and pressure measurements. The sensitivity of the RANS equations to the Reynolds stress, required for gradient-based training, is obtained by means of both variational and ensemble methods. The second approach is to infer the Reynolds stress field for a flow of interest from limited velocity or pressure measurements of the same flow. Here, this field inversion is done using a Monte Carlo Bayesian procedure and the focus is on improving the inference by enforcing known physical constraints on the inferred Reynolds stress field. To this end, a method for enforcing boundary conditions on the inferred field is presented. The two data-driven approaches explored and improved upon here demonstrate the potential for improved practical RANS predictions. / Doctor of Philosophy / The Reynolds-averaged Navier-Stokes (RANS) equations are widely used to simulate fluid flows in engineering applications despite their known inaccuracy in many flows of practical interest. The uncertainty in the RANS equations is known to stem from the Reynolds stress tensor for which no universally applicable turbulence model exists. The computational cost of more accurate methods for fluid flow simulation, however, means RANS simulations will likely continue to be a major tool in engineering applications and there is still a need for improved RANS turbulence modeling. This dissertation explores two different approaches to use available experimental data to improve RANS predictions by improving the uncertain Reynolds stress tensor field. The first approach is using machine learning to learn a data-driven turbulence model from a set of training data. This model can then be applied to predict new flows in place of traditional turbulence models. To this end, this dissertation presents a novel framework for training deep neural networks using experimental measurements of velocity and pressure. When using velocity and pressure data, gradient-based training of the neural network requires the sensitivity of the RANS equations to the learned Reynolds stress. Two different methods, the continuous adjoint and ensemble approximation, are used to obtain the required sensitivity. The second approach explored in this dissertation is field inversion, whereby available data for a flow of interest is used to infer a Reynolds stress field that leads to improved RANS solutions for that same flow. Here, the field inversion is done via the ensemble Kalman inversion (EKI), a Monte Carlo Bayesian procedure, and the focus is on improving the inference by enforcing known physical constraints on the inferred Reynolds stress field. To this end, a method for enforcing boundary conditions on the inferred field is presented. While further development is needed, the two data-driven approaches explored and improved upon here demonstrate the potential for improved practical RANS predictions.
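
A minimal sketch of the ensemble Kalman inversion (EKI) update mentioned above, applied to a hypothetical linear forward model rather than a RANS solver. The ensemble size, observation-noise covariance, and forward map are illustrative assumptions; the point is the derivative-free, covariance-based update of the parameter ensemble.

```python
# One EKI iteration: update the parameter ensemble using ensemble cross-covariances.
import numpy as np

def eki_step(theta, G, y, gamma, rng):
    """theta: (n_members, n_params); G maps params to observations; gamma: obs covariance."""
    g = np.array([G(t) for t in theta])              # forward-model outputs (n_members, n_obs)
    dt = theta - theta.mean(0)
    dg = g - g.mean(0)
    c_tg = dt.T @ dg / (len(theta) - 1)              # parameter-output cross-covariance
    c_gg = dg.T @ dg / (len(theta) - 1)              # output covariance
    kalman = c_tg @ np.linalg.inv(c_gg + gamma)
    y_pert = y + rng.multivariate_normal(np.zeros(len(y)), gamma, len(theta))
    return theta + (y_pert - g) @ kalman.T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(20, 3))                     # hypothetical linear forward model
    truth = np.array([1.0, -2.0, 0.5])
    gamma = 0.05 * np.eye(20)
    y = A @ truth + rng.multivariate_normal(np.zeros(20), gamma)
    theta = rng.normal(size=(100, 3))                # prior ensemble of parameters
    for _ in range(10):
        theta = eki_step(theta, lambda t: A @ t, y, gamma, rng)
    print("inferred mean:", np.round(theta.mean(0), 2))
```

In the dissertation's setting the parameters would describe the Reynolds stress field and G would involve a RANS solve; additional constraints (e.g. boundary conditions on the inferred field) would be imposed on top of this basic update.
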
6

Ensemble methods for top-N recommendation

Fan, Ziwei 20 April 2018
Indiana University-Purdue University Indianapolis (IUPUI) / As the amount of information grows, the desire to efficiently filter out unnecessary information and retain information that is relevant or of interest is increasing. To extract the information that will be of interest to people efficiently, we can utilize recommender systems. Recommender systems are information filtering systems that predict the preference of a user for an item. Based on historical data of users, recommender systems are able to make relevant recommendations to users. Due to their usefulness, recommender systems have been widely used in many applications, including e-commerce and healthcare information systems. However, existing recommender systems suffer from several issues, including data sparsity and user/item heterogeneity. In this thesis, a hybrid dynamic and multi-collaborative filtering based recommendation technique has been developed to recommend search terms for physicians when they review large amounts of patient information. In addition, a local sparse linear method ensemble has been developed to tackle the issues of data sparsity and user/item heterogeneity. In health information technology systems, most physicians suffer from information overload when they review patient information. A novel hybrid dynamic and multi-collaborative filtering method has been developed to improve information retrieval from electronic health records. We tackle the problem of recommending the next search term to a physician while the physician is searching for information about a patient. In this method, I have combined first-order Markov chain and multi-collaborative filtering methods. For multi-collaborative filtering, I have developed the physician-patient collaborative filtering and transition-involved collaborative filtering methods. The developed method is tested using electronic health record data from the Indiana Network for Patient Care. The experimental results demonstrate that for 46.7% of test cases, this new method is able to correctly prioritize relevant information among the top-5 recommendations that physicians are truly interested in. The local sparse linear model ensemble has been developed to tackle both the data sparsity and the user/item heterogeneity issues for top-N recommendation. Multiple local sparse linear models are learned for all the users and items in the system. I have developed similarity-based and popularity-based methods to determine the local training data for each local model. Each local model is trained using the Sparse Linear Method (SLIM), which is a powerful recommendation technique for top-N recommendation. These learned models are then combined in various ways to produce top-N recommendations. I have developed model-results combination and model combination methods to combine all learned local models. The developed methods are tested on a benchmark dataset and its sparsified datasets. The experiments demonstrate an 18.4% improvement from such ensemble models, particularly on sparse datasets.
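
A hedged sketch of the Sparse Linear Method (SLIM) idea that underlies the local ensemble described above: each item's column of the user-item matrix is regressed on the remaining items with a non-negative elastic-net penalty, and the learned sparse item-item weights score unseen items for top-N recommendation. This approximates SLIM with scikit-learn's ElasticNet on a toy matrix; it is not the thesis's local-model or model-combination code.

```python
# SLIM-style item-item model: one sparse non-negative regression per item column.
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_slim(A, alpha=0.01, l1_ratio=0.5):
    """A: user-item matrix (n_users, n_items); returns sparse item-item weights W."""
    n_items = A.shape[1]
    W = np.zeros((n_items, n_items))
    for j in range(n_items):
        X = A.copy()
        X[:, j] = 0.0                                   # enforce w_jj = 0
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, positive=True,
                           fit_intercept=False, max_iter=5000)
        model.fit(X, A[:, j])
        W[:, j] = model.coef_
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = (rng.random((50, 12)) < 0.2).astype(float)      # toy implicit-feedback matrix
    scores = A @ fit_slim(A)                            # score all items for all users
    unseen = np.where(A[0] == 0, scores[0], -np.inf)    # do not re-recommend seen items
    print("top recommendation for user 0: item", int(np.argmax(unseen)))
```
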
7

Discovering Compact and Informative Structures through Data Partitioning

Fiterau, Madalina 01 September 2015
In many practical scenarios, prediction for high-dimensional observations can be accurately performed using only a fraction of the existing features. However, the set of relevant predictive features, known as the sparsity pattern, varies across data. For instance, features that are informative for a subset of observations might be useless for the rest. In fact, in such cases, the dataset can be seen as an aggregation of samples belonging to several low-dimensional sub-models, potentially due to different generative processes. My thesis introduces several techniques for identifying sparse predictive structures and the areas of the feature space where these structures are effective. This information allows the training of models which perform better than those obtained through traditional feature selection. We formalize Informative Projection Recovery, the problem of extracting a set of low-dimensional projections of data which jointly form an accurate solution to a given learning task. Our solution to this problem is a regression-based algorithm that identifies informative projections by optimizing over a matrix of point-wise loss estimators. It generalizes to a number of machine learning problems, offering solutions to classification, clustering and regression tasks. Experiments show that our method can discover and leverage low-dimensional structure, yielding accurate and compact models. Our method is particularly useful in applications involving multivariate numeric data in which expert assessment of the results is of the essence. Additionally, we developed an active learning framework which works with the obtained compact models in finding unlabeled data deemed to be worth expert evaluation. For this purpose, we enhance standard active selection criteria using the information encapsulated by the trained model. The advantage of our approach is that the labeling effort is expended mainly on samples which benefit models from the hypothesis class we are considering. Additionally, the domain experts benefit from the availability of informative axis aligned projections at the time of labeling. Experiments show that this results in an improved learning rate over standard selection criteria, both for synthetic data and real-world data from the clinical domain, while the comprehensible view of the data supports the labeling process and helps preempt labeling errors.
8

Learning on Complex Simulations

Banfield, Robert E 11 April 2007
This dissertation explores Machine Learning in the context of computationally intensive simulations. Complex simulations such as those performed at Sandia National Laboratories for the Advanced Strategic Computing program may contain multiple terabytes of data. The amount of data is so large that it is computationally infeasible to transfer between nodes on a supercomputer. In order to create the simulation, data is distributed spatially. For example, if this dissertation was to be broken apart spatially, the binding might be one partition, the first fifty pages another partition, the top three inches of every remaining page another partition, and the remainder confined to the last partition. This distribution of data is not conducive to learning using existing machine learning algorithms, as it violates some standard assumptions, the most important being that data is independently and identically distributed (i.i.d.). Unique algorithms must be created in order to deal with the spatially distributed data. Another problem which this dissertation addresses is learning from large data sets in general. The pervasive spread of computers into so many areas has enabled data capture from places that previously did not have available data. Various algorithms for speeding up classification of small and medium-sized data sets have been developed over the past several years. Most of these take advantage of developing a multiple classifier system in which the fusion of many classifiers results in higher accuracy than that obtained by any single classifier. Most also have a direct application to the problem of learning from large data sets. In this dissertation, a thorough statistical analysis of several of these algorithms is provided on 57 publicly available data sets. Random forests, in particular, is able to achieve some of the highest accuracy results while speeding up classification significantly. Random forests, through a classifier fusion strategy known as Probabilistic Majority Voting (PMV) and a variant referred to as Weighted Probabilistic Majority Voting (wPMV), was used on two simulations. The first simulation is of a canister being crushed in the same fashion as a human might crush a soda can. Each of half a million physical data points in the simulation contains nine attributes. In the second simulation, a casing is dropped on the ground. This simulation contains 21 attributes and over 1,500,000 data points. Results show that reasonable accuracy can be obtained by using PMV or wPMV, but this accuracy is not as high as using all of the data in a non-spatially partitioned environment. In order to increase the accuracy, a semi-supervised algorithm was developed. This algorithm is capable of increasing the accuracy several percentage points over that of using all of the non-partitioned data, and includes several benefits such as reducing the number of labeled examples which scientists would otherwise manually identify. It also depicts more accurately the real-world usage situations which scientists encounter when applying these Machine Learning techniques to new simulations.
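
A small sketch of the two fusion rules named above, assuming each locally trained classifier can return class-probability estimates: probabilistic majority voting (PMV) averages them across partitions, and weighted PMV (wPMV) applies per-classifier weights before averaging. The probabilities and weights below are made up for illustration.

```python
# PMV and wPMV fusion of class-probability estimates from distributed classifiers.
import numpy as np

def pmv(probas):
    """probas: (n_classifiers, n_samples, n_classes) -> predicted class indices."""
    return np.argmax(np.mean(probas, axis=0), axis=1)

def wpmv(probas, weights):
    """Weighted variant: weights could come from each partition's validation accuracy."""
    w = np.asarray(weights, dtype=float)[:, None, None]
    return np.argmax(np.sum(w * probas, axis=0) / w.sum(), axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    probas = rng.dirichlet(np.ones(3), size=(4, 5))   # 4 partitions, 5 samples, 3 classes
    print("PMV :", pmv(probas))
    print("wPMV:", wpmv(probas, weights=[0.9, 0.7, 0.6, 0.8]))
```
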
9

The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics

Vu, Thang May 2011
The small-sample size issue is a prevalent problem in Genomics and Proteomics today. Bootstrap, a resampling method which aims at increasing the efficiency of data usage, is considered to be an effort to overcome the problem of limited sample size. This dissertation studies the application of bootstrap to two problems of supervised learning with small sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method. Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition and has many important applications in biomedical research. Bootstrap error estimation has been shown empirically to be one of the best estimation methods in terms of root mean squared error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations. We derive the exact formulas of the first and the second moment of the zero bootstrap and the convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain the exact formulas of the bias, the variance, and the root mean squared error of the deviation from the true error of these bootstrap estimators. This includes the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weight for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions. In the second part of this work, we conduct an extensive empirical investigation of bagging, which is an application of bootstrap to ensemble classification. We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, the ensemble method did not improve the performance of these stable classifiers significantly. We give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, by formulating carefully how the error count is normalized, and investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied on both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, semi-bolstering, in addition to the out-of-bag estimator. The results from the numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased.
The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. The results of this work are expected to provide helpful guidance to practitioners who are interested in applying the bootstrap in supervised learning applications.
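
A hedged sketch of the .632 bootstrap estimator analysed in this work, applied to LDA on synthetic Gaussian data: it blends the optimistic resubstitution error with the "zero" bootstrap error (computed on points left out of each bootstrap sample) using the weights 0.368 and 0.632. The sample size, dimensions, and number of bootstrap replicates are illustrative choices.

```python
# .632 bootstrap error estimation for LDA on a small synthetic Gaussian sample.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def bootstrap_632_error(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    err_resub = np.mean(LinearDiscriminantAnalysis().fit(X, y).predict(X) != y)
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))            # bootstrap sample (with replacement)
        out = np.setdiff1d(np.arange(len(y)), idx)       # points left out of the sample
        if out.size == 0 or len(np.unique(y[idx])) < 2:
            continue
        clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
        errs.append(np.mean(clf.predict(X[out]) != y[out]))
    err_zero = float(np.mean(errs))                      # "zero" bootstrap estimate
    return 0.368 * err_resub + 0.632 * err_zero

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 30                                               # deliberately small sample per class
    X = np.vstack([rng.normal(0, 1, (n, 5)), rng.normal(0.8, 1, (n, 5))])
    y = np.repeat([0, 1], n)
    print("estimated .632 error:", round(bootstrap_632_error(X, y), 3))
```
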
10

An approach for online learning in the presence of concept changes

Jaber, Ghazal 18 October 2013
Learning from data streams is emerging as an important application area. When the environment changes, it is necessary to rely on on-line learning with the capability to adapt to changing conditions, a.k.a. concept drifts. Adapting to concept drifts entails forgetting some or all of the old acquired knowledge when the concept changes while accumulating knowledge regarding the supposedly stationary underlying concept. This tradeoff is called the stability-plasticity dilemma. Ensemble methods have been among the most successful approaches. However, the management of the ensemble, which ultimately controls how past data is forgotten, has not been thoroughly investigated so far. Our work shows the importance of the forgetting strategy by comparing several approaches. The results thus obtained lead us to propose a new ensemble method with an enhanced forgetting strategy to adapt to concept drifts. Experimental comparisons show that our method compares favorably with well-known state-of-the-art systems. The majority of previous works focused only on means to detect changes and to adapt to them. In our work, we go one step further by introducing a meta-learning mechanism that is able to detect relevant states of the environment, to recognize recurring contexts and to anticipate likely concept changes. Hence, the method we suggest deals both with the challenge of optimizing the stability-plasticity dilemma and with the anticipation and recognition of incoming concepts. This is accomplished through an ensemble method that controls an ensemble of incremental learners. The management of the ensemble of learners enables one to naturally adapt to the dynamics of the concept changes with very few parameters to set, while a learning mechanism managing the changes in the ensemble provides means for the anticipation of, and the quick adaptation to, the underlying modification of the context.
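
A generic sketch of the ensemble-with-forgetting mechanism discussed above: one base learner is trained per data chunk, the pool size is capped, and the member least accurate on the newest chunk is forgotten. This illustrates the general mechanism only; it is not the forgetting strategy or meta-learning scheme proposed in the thesis.

```python
# A capped pool of chunk-trained learners with a simple forgetting rule.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ForgettingEnsemble:
    def __init__(self, max_members=5):
        self.max_members = max_members
        self.members = []

    def update(self, X_chunk, y_chunk):
        self.members.append(DecisionTreeClassifier(max_depth=4).fit(X_chunk, y_chunk))
        if len(self.members) > self.max_members:
            # Forgetting strategy: drop the member least accurate on the newest chunk.
            accs = [m.score(X_chunk, y_chunk) for m in self.members]
            self.members.pop(int(np.argmin(accs)))

    def predict(self, X):
        votes = np.mean([m.predict(X) for m in self.members], axis=0)
        return (votes >= 0.5).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ens = ForgettingEnsemble()
    for t in range(12):                                  # a stream whose concept flips halfway
        X = rng.normal(size=(200, 3))
        w = np.array([1.0, -1.0, 0.5]) if t < 6 else np.array([-1.0, 1.0, 0.5])
        y = (X @ w > 0).astype(int)
        ens.update(X, y)
    X_new = rng.normal(size=(500, 3))
    y_new = (X_new @ np.array([-1.0, 1.0, 0.5]) > 0).astype(int)
    print("accuracy on the current concept:",
          round(float((ens.predict(X_new) == y_new).mean()), 3))
```
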
