Worlds Collide through Gaussian Processes: Statistics, Geoscience and Mathematical Programming

Gaussian process (GP) regression is the canonical method for nonlinear spatial modeling in the statistics and machine learning communities. Geostatisticians use a subtly different technique known as kriging. I highlight key similarities and differences between GPs and kriging using large-scale gold mining data. Most importantly, GPs are largely hands-off, learning automatically from the data, whereas kriging requires an expert human in the loop to guide the analysis. To emphasize this, I show an imputation method for the left-censored values frequently seen in mining data. Geologists often ignore censored values because imputing them with kriging is difficult, but GPs handle imputation with relative ease, leading to better estimates of the gold surface. My hope is that this research can serve as a springboard encouraging the mining community to consider GPs over kriging for their diverse utility after model fitting. Another common use of GPs that would be inefficient under kriging is Bayesian optimization (BO). Traditionally, BO is designed to find a global optimum by sequentially sampling from a function of interest using an acquisition function. When two or more local or global optima of that function have similar objective values, it often makes sense to target the more "robust" solution with the wider domain of attraction. Traditional BO, however, weighs these solutions the same, favoring whichever has the slightly better objective value. By combining the idea of expected improvement (EI) from the BO community with mathematical programming's concept of an adversary, I introduce a novel algorithm targeting robust solutions called robust expected improvement (REI). The adversary penalizes "peaked" areas of the objective function, making those values appear less desirable.
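The adversarial penalization can be sketched as a toy: score each candidate by its worst-case value in a small neighborhood, so sharply peaked optima look worse than flat, robust ones. This is a minimal illustration only, not the dissertation's implementation; the test function, neighborhood radius, and grid size are invented for the example.

```python
import numpy as np

def adversarial_objective(f, x, delta=0.5, n_grid=21):
    """Worst-case (adversarial) value of f over [x - delta, x + delta].
    Under minimization, sharply peaked minima score much worse than
    flat, robust ones."""
    offsets = np.linspace(-delta, delta, n_grid)
    return max(f(x + d) for d in offsets)

# Toy objective (illustrative only): a sharp minimum near x = 0 with
# value -1, and a flat, robust minimum near x = 3 with value 0.
def f(x):
    return min(50.0 * x**2 - 1.0, (x - 3.0)**2)

# Nominally the sharp minimum wins: f(0) = -1 < f(3) = 0. On the
# adversarial surface the ranking flips: the sharp basin's worst-case
# value within delta is far larger than the flat basin's.
```

Acquiring points by EI on this adversarial surface, rather than on f itself, is the essence of steering the search toward the robust basin.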
REI performs acquisitions using EI on the adversarial space, yielding data sets focused on the robust solution that retain EI's proven balance of exploration and exploitation.

/ Doctor of Philosophy /

Since its origins in the 1940s, spatial statistical modeling has adapted to fit different communities. The geostatistics community developed with an emphasis on modeling mining operations and has since evolved to cover a slew of applications, largely in two or three physical dimensions. The computer experiments community developed later, when physical experiments began moving into the virtual realm with advances in computer technology. While born from the same foundation, computer experimenters often tackle problems in ten or more dimensions. Owing to these and other differences, each community tailored its methods to its common problems. My research compares the modern instantiations of the differing methodologies on two sets of real gold mining data. Ultimately, I prefer the computer experiments methods for their ease of adaptation to downstream tasks at no cost to model performance. A statistical model is almost never a standalone development; it is created with a specific goal in mind. The first case I show of this is "imputation" of mining data. Mining data often have a detection threshold such that any observation with a very small mineral concentration is recorded at the threshold. Frequently, geostatisticians simply discard these observations because they cause problems in modeling. Statisticians instead combine the information that the concentration is low with the rest of the fully observed data to derive a best guess at the concentration of thresholded locations. Under the geostatistics framework this is cumbersome, but the computer experiments community considers imputation an easy extension. Another common modeling task is designing an experiment to best learn a surface.
The surface may be a gold deposit on Earth, an unknown virtual function, or really anything measurable. To do this, computer experimenters often use "active learning": sampling one point at a time, using that point to build a better-informed model that suggests the next point to sample, and repeating until a satisfactory number of points has been collected. Geostatisticians often prefer "one-shot" experiments, deciding all samples before collecting any, so the geostatistics framework is not well suited to active learning. Active learning can also seek the "best" location on the surface, with either the maximum or minimum response. I adapt this problem to redefine "best" as a "robust" location, where the response does not change much even if the location is not perfectly specified. As an example, consider setting operating conditions for a factory. If two settings produce a similar amount of product, but one requires an exact pressure setting or else the factory blows up, the other is certainly preferred. To design experiments that find robust locations, I borrow ideas from the mathematical programming community to develop a novel method for robust active learning.
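The expected improvement criterion underlying these acquisitions has a well-known closed form under a Gaussian predictive distribution. The following stdlib-Python sketch (minimization convention; names are illustrative, not code from the dissertation) shows how EI trades off a low predicted mean against high predictive uncertainty:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization, given a Gaussian predictive
    distribution N(mu, sigma^2) at a candidate point and the best
    (lowest) observed value f_best."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)   # no uncertainty: plain improvement
    z = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return (f_best - mu) * cdf + sigma * pdf

# Either a lower predicted mean (exploitation) or a larger predictive
# standard deviation (exploration) raises the acquisition value.
```

In an active learning loop, the next sample is the candidate maximizing this quantity under the current GP fit; REI applies the same criterion on the adversarially penalized surface.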

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/114922
Date: 04 May 2023
Creators: Christianson, Ryan Beck
Contributors: Statistics, Gramacy, Robert B., Pollyea, Ryan, House, Leanna L., Van Mullekom, Jennifer H.
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Language: English
Detected Language: English
Type: Dissertation
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/