281 | Local parametric Poisson models for fisheries data. Yee, Irene Mei Ling, January 1988.
The Poisson process is a common model for count data. However, a global Poisson model is inadequate for sparse data such as marked salmon recovery data, which exhibit substantial extraneous variation and noise. An empirical Bayes model, which aggregates information to overcome the sparseness of individual cells, is therefore developed to handle these data. The method fits a local parametric Poisson model to describe the variation at each sampling period and combines this with a conventional local smoothing technique to remove noise. Finally, overdispersion relative to the Poisson model is modelled by mixing these locally smoothed Poisson models in an appropriate way. The method is then applied to the marked salmon data to obtain overall patterns and corresponding credibility intervals for the underlying trend. (Faculty of Science, Department of Statistics)
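The abstract does not spell out the estimator, but the core building block, a locally weighted Poisson fit with gamma mixing to absorb overdispersion, can be sketched as follows. The Gaussian kernel, bandwidth, and toy recovery counts below are illustrative assumptions, not details from the thesis:

```python
import numpy as np

def local_poisson_rate(t_grid, t, y, h):
    """Kernel-weighted MLE of a locally constant Poisson rate:
    maximizing sum_i w_i * (y_i * log(lam) - lam) over lam gives
    lam_hat = sum(w * y) / sum(w)."""
    rates = []
    for t0 in t_grid:
        w = np.exp(-0.5 * ((t - t0) / h) ** 2)   # Gaussian kernel weights
        rates.append(np.sum(w * y) / np.sum(w))
    return np.array(rates)

rng = np.random.default_rng(0)
t = np.arange(52.0)                              # weekly sampling periods
mu = 5 * np.exp(-0.5 * ((t - 20) / 6) ** 2)      # smooth underlying trend
y = rng.poisson(rng.gamma(4.0, mu / 4.0))        # gamma mixing -> overdispersion

lam = local_poisson_rate(t, t, y, h=3.0)
# dispersion index > 1 signals overdispersion relative to a pure Poisson fit
print("dispersion index:", np.mean((y - lam) ** 2 / np.maximum(lam, 1e-9)))
```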
282 | Topics in Bayesian Design and Analysis for Sampling. Liu, Yutao, January 2021.
Survey sampling is an old field, but it is changing with recent advances in statistics and data science. Modern statistical techniques provide new tools to solve old problems in potentially better ways, and new problems arise as data with complex and rich information become increasingly available. This dissertation consists of three parts: the first solves an old problem with new tools, the second addresses a new problem in a data-rich setting, and the third approaches sampling from a design perspective. All three parts model survey data and auxiliary information using flexible Bayesian models.
In the first part, we consider Bayesian model-based inference for skewed survey data. Skewed data are common in sample surveys. Using probability proportional to size sampling as an example, where the values of a size variable are known for the population units, we propose two Bayesian model-based predictive methods for estimating finite population quantiles with skewed sample survey data. We assume the survey outcome to follow a skew-normal distribution given the probability of selection, and model the location and scale parameters of the skew-normal distribution as functions of the probability of selection. To allow a flexible association between the survey outcome and the probability of selection, the first method models the location parameter with a penalized spline and the scale parameter with a polynomial function, while the second method models both the location and scale parameters with penalized splines. Using a fully Bayesian approach, we obtain the posterior predictive distributions of the non-sampled units in the population, and thus the posterior distributions of the finite population quantiles. We show through simulations that our proposed methods are more efficient and yield shorter credible intervals with better coverage rates than the conventional weighted method in estimating finite population quantiles. We demonstrate the application of our proposed methods using data from the 2013 National Drug Abuse Treatment System Survey.
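A minimal sketch of the predictive mechanics (not the penalized-spline fitting itself) might look like this; the location and scale functions, skewness value, and toy population below are placeholders standing in for posterior draws:

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(1)

# toy population: inclusion probabilities pi and skewed outcomes y
N = 10_000
pi = rng.uniform(0.01, 0.20, size=N)
y = skewnorm.rvs(4.0, loc=2 + 10 * pi, scale=0.5 + 2 * pi, random_state=rng)
sampled = rng.random(N) < pi                     # PPS-like inclusion

# stand-ins for posterior draws of the fitted functions; in the paper
# these come from penalized splines (and a polynomial) fit to the sample
loc_fn   = lambda p: 2.0 + 10.0 * p              # hypothetical location fit
scale_fn = lambda p: 0.5 + 2.0 * p               # hypothetical scale fit
alpha    = 4.0                                   # hypothetical skewness draw

# posterior predictive draws for the NON-sampled units only
p_mis = pi[~sampled]
y_mis = skewnorm.rvs(alpha, loc=loc_fn(p_mis), scale=scale_fn(p_mis),
                     random_state=rng)

# a finite population quantile mixes observed and predicted values
y_pop = np.concatenate([y[sampled], y_mis])
print("predicted population median:", np.quantile(y_pop, 0.5))
print("true population median:     ", np.quantile(y, 0.5))
```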
In the second part, we consider inference from non-random samples in data-rich settings where high-dimensional auxiliary information is available both in the sample and in the target population, with survey inference being a special case. We propose a regularized prediction approach that predicts the outcomes in the population using a large number of auxiliary variables, such that the ignorability assumption is reasonable while the Bayesian framework provides straightforward quantification of uncertainty. Inspired by Little and An (2004), we also extend the approach by estimating the propensity score for a unit to be included in the sample and including it as a predictor in the machine learning models. We show through simulation studies that regularized predictions using soft Bayesian additive regression trees (SBART) yield valid inference for the population means, with coverage rates close to the nominal levels. We demonstrate the proposed methods in two real-data applications: one from a survey and one from an epidemiological study.
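A rough sketch of the two-step idea, with scikit-learn's gradient boosting standing in for SBART and a simple logistic propensity model; all data and model choices below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)

# toy population with high-dimensional auxiliaries X and a non-random sample S
N, p = 20_000, 20
X = rng.normal(size=(N, p))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=N)
s_prob = 1 / (1 + np.exp(-(X[:, 0] + X[:, 2])))   # selection depends on X
S = rng.random(N) < 0.1 * s_prob

# step 1: estimated propensity score, added as an extra predictor
ps = LogisticRegression(max_iter=1000).fit(X, S).predict_proba(X)[:, 1]
XA = np.column_stack([X, ps])

# step 2: flexible outcome model fit on the sample (gradient boosting here
# as a stand-in for soft BART), then predict the whole population
model = GradientBoostingRegressor().fit(XA[S], y[S])
y_hat = np.where(S, y, model.predict(XA))   # keep observed y where available

print("estimated pop mean:", y_hat.mean(), " true pop mean:", y.mean())
```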
In the third part, we consider survey design for multilevel regression and post-stratification (MRP), a survey adjustment technique that corrects the known discrepancy between sample and population using shared auxiliary variables. MRP has been widely applied in survey analysis, for both probability and non-probability samples; however, literature on survey design for MRP is scarce. We propose a closed-form formula for theoretical margins of error (MOEs) for various estimands, based on the variance parameters of the multilevel regression model and the sample sizes in the post-strata. We validate the theoretical MOEs against empirical MOEs in simulation studies covering various sample allocation plans, and find that the two align for various estimands. We demonstrate the sample size calculation formula in two survey design scenarios: online panels that use quota sampling, and telephone surveys with fixed total sample sizes.
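The abstract does not reproduce the formula. As a hedged illustration only: under a simple one-way multilevel model, partial pooling gives each cell mean an approximate posterior variance of 1/(n_j/σ_y² + 1/σ_α²), from which a back-of-envelope MOE for the poststratified mean follows. This ignores hyperparameter uncertainty and is an assumed stand-in, not the dissertation's exact formula:

```python
import numpy as np

def mrp_moe(n_cells, weights, sigma_y, sigma_alpha, z=1.96):
    """Back-of-envelope MOE for a poststratified mean under a one-way
    multilevel model: each cell mean has approximate posterior variance
    1 / (n_j / sigma_y^2 + 1 / sigma_alpha^2)  (partial pooling),
    and the estimand is sum_j W_j * theta_j. Hyperparameter
    uncertainty is ignored -- a sketch only."""
    v_cell = 1.0 / (n_cells / sigma_y**2 + 1.0 / sigma_alpha**2)
    return z * np.sqrt(np.sum(weights**2 * v_cell))

n_cells = np.array([50, 120, 30, 200])     # sample size per post-stratum
weights = np.array([0.3, 0.4, 0.1, 0.2])   # population shares of the strata
print(mrp_moe(n_cells, weights, sigma_y=1.0, sigma_alpha=0.5))
```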
283 | A Cognitively Diagnostic Modeling Approach to Diagnosing Misconceptions and Subskills. Elbulok, Musa, January 2021.
The objective of this project was to propose a new methodology for measuring misconceptions and subskills simultaneously, using the diagnostic information available in the incorrect alternatives of multiple-choice tests designed for that purpose. Misconceptions are systematic and persistent errors that represent a learned, intentional incorrect response (Brown & VanLehn, 1980; Ozkan & Ozkan, 2012). In prior research, Lee and Corter (2011) found that classification accuracy of their Bayesian network misconception diagnosis models improved when latent higher-order subskills and specific wrong answers were included. Here, these contributions are adapted to a cognitively diagnostic measurement approach using the multiple-choice Deterministic Inputs, Noisy "And" Gate (MC-DINA) model, first developed by de la Torre (2009b), by specifying dependencies between attributes to measure latent misconceptions and subskills simultaneously. A simulation study employing the proposed methodology (referred to as MC-DINA-H) was conducted across sample-size conditions (500, 1,000, 2,000, and 5,000 examinees) and test-length conditions (15, 30, and 60 items). Eight attributes (four misconceptions and four subskills) were included in the main simulation study. The attribute classification accuracy of the MC-DINA-H was compared to that of four less complex models; the MC-DINA-H classified attributes more accurately only when the attributes were required relatively frequently by the multiple-choice options in the diagnostic assessment. The findings suggest that each attribute should be required by at least 15-20 percent of options in the diagnostic assessment.
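MC-DINA extends the DINA item response function to coded multiple-choice options. As background, here is a minimal sketch of the underlying DINA kernel only, with toy profiles and parameters assumed and the option-level machinery omitted:

```python
import numpy as np

def dina_prob(alpha, q, slip, guess):
    """DINA item response: eta = 1 iff the examinee holds ALL attributes
    the item requires; P(correct) = 1 - slip if eta, else guess."""
    eta = np.all(alpha >= q, axis=-1)
    return np.where(eta, 1.0 - slip, guess)

alpha = np.array([[1, 0, 1, 0],    # profile holding attributes 1 and 3
                  [0, 1, 0, 1]])   # profile holding attributes 2 and 4
q = np.array([1, 0, 1, 0])         # item requiring attributes 1 and 3
print(dina_prob(alpha, q, slip=0.1, guess=0.2))   # -> [0.9, 0.2]
```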
284 | Variational Bayesian Methods for Inferring Spatial Statistics and Nonlinear Dynamics. Moretti, Antonio Khalil, January 2021.
This thesis discusses four novel statistical methods and approximate inference techniques for analyzing structured neural and molecular sequence data. The main contributions are new algorithms for approximate inference and learning in Bayesian latent variable models involving spatial statistics and nonlinear dynamics. First, we propose an amortized variational inference method to separate a set of overlapping signals into spatially localized source functions without knowledge of the original signals or the mixing process. In the second part of this dissertation, we discuss two approaches for uncovering nonlinear, smooth latent dynamics from sequential data. Both algorithms construct variational families on extensions of nonlinear state space models where the underlying systems are described by hidden stochastic differential equations. The first method proposes a structured approximate posterior describing spatially-dependent linear dynamics, together with an algorithm that relies on fixed-point iteration to achieve convergence. The second method proposes a variational backward simulation technique based on an unbiased estimate of the marginal likelihood defined through a subsampling process. In the final chapter, we develop connections between discrete and continuous variational sequential search for Bayesian phylogenetic inference. We propose a technique that uses sequential search to construct a variational objective defined on the composite space of non-clock phylogenetic trees. Each of these techniques is motivated by real problems in computational biology and is applied to provide insights into the underlying structure of complex data.
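The common engine across these methods is a variational objective (an evidence lower bound, ELBO). A minimal Monte Carlo illustration on a toy Gaussian model, with the variational parameters simply assumed rather than optimized:

```python
import numpy as np

rng = np.random.default_rng(3)

# toy model: z ~ N(0,1), x | z ~ N(z,1); variational family q(z) = N(m, s^2)
x = 1.3
m, s = 0.6, 0.8                        # variational parameters (assumed, not fit)
z = m + s * rng.normal(size=100_000)   # reparameterized draws z = m + s*eps

log_joint = -0.5 * (x - z) ** 2 - 0.5 * z ** 2 - np.log(2 * np.pi)
log_q = -0.5 * ((z - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
print("ELBO estimate:", np.mean(log_joint - log_q))
```

The reparameterization z = m + s*eps is what makes such objectives differentiable in (m, s), which is the mechanism amortized and gradient-based variational schemes build on.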
285 | Multiple Causal Inference with Bayesian Factor Models. Wang, Yixin, January 2020.
Causal inference from observational data is a vital problem, but it comes with strong assumptions. Most methods assume that we observe all confounders, variables that affect both the cause variables and the outcome variables. But whether we have observed all confounders is a famously untestable assumption. In this dissertation, we develop algorithms for causal inference from observational data, allowing for unobserved confounding. These algorithms focus on problems of multiple causal inference: scientific studies that involve many causes or many outcomes that are simultaneously of interest.
We begin with multiple causal inference with many causes. We develop the deconfounder, an algorithm that accommodates unobserved confounding by leveraging the multiplicity of the causes. How does the deconfounder work? It uses the correlation among the multiple causes as evidence for unobserved confounders, combining Bayesian factor models and predictive model checking to perform causal inference.
We study the theoretical requirements for the deconfounder to provide unbiased causal estimates, along with its limitations and trade-offs. We also show how the deconfounder connects to the proxy-variable strategy for causal identification (Miao et al., 2018) by treating subsets of causes as proxies of the unobserved confounder. We demonstrate the deconfounder in simulation studies and real-world data. As an application, we develop the deconfounded recommender, a variant of the deconfounder tailored to causal inference on recommender systems.
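A toy linear sketch of the deconfounder's two-step logic, with scikit-learn's factor analysis supplying the substitute confounder; the data-generating process and the omission of predictive model checking are simplifications, not the deconfounder as published:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# one unobserved confounder u drives all m causes and the outcome
n, m = 5000, 10
u = rng.normal(size=n)
A = u[:, None] + rng.normal(scale=0.5, size=(n, m))      # multiple causes
y = A @ np.full(m, 0.2) + 2.0 * u + rng.normal(size=n)   # true effect 0.2 each

# step 1: factor model on the causes; the inferred factor plays the role
# of the substitute confounder (predictive checks omitted in this sketch)
z_hat = FactorAnalysis(n_components=1).fit_transform(A)

# step 2: outcome regression on the causes plus the substitute confounder
naive = LinearRegression().fit(A, y).coef_
adjusted = LinearRegression().fit(np.column_stack([A, z_hat]), y).coef_[:m]
print("naive mean coef:       ", naive.mean().round(3))   # biased upward
print("deconfounded mean coef:", adjusted.mean().round(3), "(truth 0.2)")
```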
Finally, we consider multiple causal inference with many outcomes. We develop the control-outcome deconfounder, an algorithm that corrects for unobserved confounders using multiple negative control outcomes. Negative control outcomes are outcome variables for which the cause is a priori known to have no effect. The control-outcome deconfounder uses the correlation among these outcomes as evidence for unobserved confounders. We discuss the theoretical and empirical properties of the control-outcome deconfounder. We also show how the control-outcome deconfounder generalizes the method of synthetic controls (Abadie et al., 2010, 2015; Abadie and Gardeazabal, 2003), expanding its scope to nonlinear settings and non-panel data.
286 | Essays on the use of probabilistic machine learning for estimating customer preferences with limited information. Padilla, Nicolas, January 2021.
In this thesis, I explore in two essays how to augment thin historical purchase data with other sources of information, using Bayesian and probabilistic machine learning frameworks to better infer customers' preferences and their future behavior. In the first essay, I posit that firms can better manage recently acquired customers by using information from the moment of acquisition to inform those customers' future demand preferences. I develop a probabilistic machine learning model based on Deep Exponential Families that relates multiple acquisition characteristics to individual-level demand parameters, and I show that the model can flexibly capture non-linear relationships between acquisition behaviors and demand parameters. I estimate the model using data from a retail context and show that firms can better identify which new customers are the most valuable.
In the second essay, I explore how to combine the information collected throughout the customer journey (search queries, clicks, and purchases, both within and across journeys) to infer the customer's preferences and likelihood of buying, in settings with thin purchase histories and where preferences might change from one purchase journey to another.
I propose a non-parametric Bayesian model that combines these different sources of information and accounts for what I call context heterogeneity: journey-specific preferences that depend on the context of each journey. I apply the model to airline ticket purchases using data from one of the largest travel search websites and show that the model accurately infers preferences and predicts choice in an environment characterized by very thin historical data. I find strong context heterogeneity across journeys, reinforcing the idea that treating all journeys as stemming from the same set of preferences may lead to erroneous inferences.
287 | Advances in Statistical Machine Learning Methods for Neural Data Science. Zhou, Ding, January 2021.
Innovations in neural data recording techniques are revolutionizing neuroscience and presenting both challenges and opportunities for statistical data analysis. This dissertation discusses several recent advances in neural data signal processing, encoding, decoding, and dimension reduction. Chapter 1 introduces challenges in neural data science and common statistical methods used to address them. Chapter 2 develops a new method to detect neurons and extract signals from noisy calcium imaging data with irregular neuron shapes. Chapter 3 introduces a novel probabilistic framework for modeling deconvolved calcium traces. Chapter 4 proposes an improved Bayesian nonparametric extension of the hidden Markov model (HMM) that separates the strength of the self-persistence prior and transition prior. Chapter 5 introduces a more identifiable and interpretable latent variable model for Poisson observations. We develop efficient algorithms to fit each of the aforementioned methods and demonstrate their effectiveness on both simulated and real data.
288 | Modernizing Markov Chains Monte Carlo for Scientific and Bayesian Modeling. Margossian, Charles Christopher, January 2022.
The advent of probabilistic programming languages has galvanized scientists to write increasingly diverse models to analyze data. Probabilistic models use a joint distribution over observed and latent variables to describe at once elaborate scientific theories, non-trivial measurement procedures, information from previous studies, and more. To effectively deploy these models in a data analysis, we need inference procedures which are reliable, flexible, and fast. In a Bayesian analysis, inference boils down to estimating the expectation values and quantiles of the unnormalized posterior distribution. This estimation problem also arises in the study of non-Bayesian probabilistic models, a prominent example being the Ising model of Statistical Physics.
Markov chains Monte Carlo (MCMC) algorithms provide a general-purpose sampling method which can be used to construct sample estimators of moments and quantiles. Despite MCMC’s compelling theory and empirical success, many models continue to frustrate MCMC, as well as other inference strategies, effectively limiting our ability to use these models in a data analysis. These challenges motivate new developments in MCMC. The term “modernize” in the title refers to the deployment of methods which have revolutionized Computational Statistics and Machine Learning in the past decade, including: (i) hardware accelerators to support massive parallelization, (ii) approximate inference based on tractable densities, (iii) high-performance automatic differentiation and (iv) continuous relaxations of discrete systems.
The growing availability of hardware accelerators such as GPUs has in recent years motivated a general MCMC strategy whereby we run many chains in parallel, each with a short sampling phase, rather than a few chains with long sampling phases. Unfortunately, existing convergence diagnostics are not designed for the "many short chains" regime. This is notably the case for the popular R̂ statistic, which claims convergence only if the effective sample size per chain is large. We present the nested R̂, denoted nR̂, a generalization of R̂ that does not conflate short chains with poor mixing, and offers a useful diagnostic provided we run enough chains and meet certain initialization conditions. Combined with nR̂, the short-chain regime presents the opportunity to identify optimal lengths for the warmup and sampling phases, as well as the optimal number of chains: tuning parameters of MCMC that are otherwise chosen by heuristics or trial and error.
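A rough numpy sketch of the between/within decomposition behind nR̂, following the published formulation up to finite-sample correction factors, and assuming more than one chain per superchain and more than one draw per chain:

```python
import numpy as np

def nested_rhat(draws, K):
    """Nested R-hat for draws of shape (chains, samples), with the chains
    grouped into K superchains of equal size; B/W decomposition in the
    spirit of Margossian et al. (2022), up to finite-sample corrections."""
    C, N = draws.shape
    x = draws.reshape(K, C // K, N)
    chain_mean = x.mean(axis=2)                    # (K, M) per-chain means
    super_mean = chain_mean.mean(axis=1)           # (K,) superchain means
    # within: mean in-chain variance + between-chain variance in a superchain
    W = x.var(axis=2, ddof=1).mean() + chain_mean.var(axis=1, ddof=1).mean()
    B = super_mean.var(ddof=1)                     # between superchains
    return np.sqrt(1.0 + B / W)

rng = np.random.default_rng(5)
ok = rng.normal(size=(64, 10))                     # well-mixed short chains
# unmixed chains stuck near a shared per-superchain initialization
bad = ok + np.repeat(rng.normal(size=(8, 1)), 8, axis=0)
print(nested_rhat(ok, K=8), nested_rhat(bad, K=8))  # ~1.0 vs clearly > 1
```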
We next focus on semi-specialized algorithms for latent Gaussian models, arguably the most widely used class of hierarchical models. It is well understood that MCMC often struggles with the geometry of the posterior distributions these models generate. Using a Laplace approximation, we marginalize out the latent Gaussian variables and then integrate the remaining parameters with Hamiltonian Monte Carlo (HMC), a gradient-based MCMC method. This approach combines MCMC with a distributional approximation, and offers a useful alternative to pure MCMC or pure approximation methods such as variational inference. We compare the three paradigms across a range of general linear models that admit sophisticated priors, e.g. a Gaussian process or a horseshoe prior. To implement our scheme efficiently, we derive a novel automatic differentiation method called the adjoint-differentiated Laplace approximation. This algorithm propagates the minimal information needed to construct the gradient of the approximate marginal likelihood, yielding a scalable differentiation method that is orders of magnitude faster than state-of-the-art differentiation for high-dimensional hyperparameters. We then discuss the application of our algorithm to models with unconventional likelihoods, going beyond the classical setting of general linear models. This requires a non-trivial generalization of the adjoint-differentiated Laplace approximation, which we implement using higher-order adjoint methods; the resulting implementation is both more general and more efficient. We apply it to an unconventional latent Gaussian model, identifying promising features and highlighting persistent challenges.
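The adjoint-differentiated algorithm itself is beyond a short sketch, but the object it differentiates, the Laplace-approximate log marginal likelihood of a latent Gaussian model, can be illustrated with textbook-style Newton iterations (Rasmussen and Williams style; here for a Poisson likelihood, and the kernel and data are toys):

```python
import numpy as np

def laplace_log_marginal(K, y, iters=30):
    """Laplace-approximate log marginal likelihood for z ~ N(0, K),
    y_i ~ Poisson(exp(z_i)). Newton iterations locate the mode z_hat;
    the approximation is log p(y|z_hat) - 0.5 * z_hat' K^{-1} z_hat
    - 0.5 * log det(I + diag(W) K), with W = exp(z_hat)."""
    n = len(y)
    z = np.zeros(n)
    for _ in range(iters):
        W = np.exp(z)                    # negative Hessian of log p(y|z)
        b = W * z + (y - np.exp(z))      # Newton right-hand side
        A = np.eye(n) + W[:, None] * K   # I + diag(W) K
        z = K @ np.linalg.solve(A, b)    # z <- (K^{-1} + diag(W))^{-1} b
    W = np.exp(z)
    loglik = np.sum(y * z - np.exp(z))   # up to the log(y!) constant
    _, logdet = np.linalg.slogdet(np.eye(n) + W[:, None] * K)
    return loglik - 0.5 * z @ np.linalg.solve(K, z) - 0.5 * logdet

# toy squared-exponential kernel over 1-D inputs, plus jitter
x = np.linspace(0, 1, 30)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2) + 1e-6 * np.eye(30)
y = np.random.default_rng(6).poisson(2.0, size=30)
print("approximate log marginal:", laplace_log_marginal(K, y))
```

In the marginalization scheme described above, this quantity (and its gradient with respect to the kernel hyperparameters) is what HMC would explore after the latent Gaussian variables are integrated out.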
The final chapter of this dissertation focuses on a specific but rich problem: the Ising model of Statistical Physics, and its generalization as the Potts and Spin Glass models. These models are challenging because they are discrete, precluding the immediate use of gradient-based algorithms, and exhibit multiple modes, notably at cold temperatures. We propose a new class of MCMC algorithms to draw samples from Potts models by augmenting the target space with a carefully constructed auxiliary Gaussian variable. In contrast to existing methods of a similar flavor, our algorithm can take advantage of the low-rank structure of the coupling matrix and scales linearly with the number of states in a Potts model. The method is applied to a broad range of coupling and temperature regimes and compared to several sampling methods, allowing us to paint a nuanced algorithmic landscape.
289 | Confirmation Bias and Related Errors. Borthwick, Geoffrey Ludlow, 01 January 2010.
This study attempted to replicate and extend Doherty, Mynatt, Tweney, and Schiavo (1979), which introduced what is here called the Bayesian conditionals selection paradigm. The present study used this paradigm (and a script similar to that of Doherty et al.) to explore confirmation bias and related errors that can appear in both the search and the integration phases of probability revision. Despite selection differences and weak manipulations, the study provided information relevant to four important questions. First, by asking participants to estimate the values of the conditional probabilities they did not learn, this study was able to examine the use of "intuitive conditionals". It found evidence that participants used intuitive conditionals and that these were affected by the size of the actual conditionals. Second, by examining both phases in the same study, it became the first to look for inter-phase interactions. A strong correlation was found between the use of focal search strategies and focal integration strategies (r = .81, p
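For concreteness, the normatively correct integration in this paradigm is Bayes' rule over both conditionals; attending only to the focal conditional P(D|H1) is the signature confirmation-bias error. A tiny worked example with hypothetical numbers:

```python
# normative revision from two conditionals; hypothetical numbers, not the study's
p_h1 = 0.5                       # prior P(H1); P(H2) = 1 - P(H1)
p_d_h1, p_d_h2 = 0.8, 0.3        # P(D|H1) and P(D|H2) -- BOTH are required
posterior = (p_d_h1 * p_h1) / (p_d_h1 * p_h1 + p_d_h2 * (1 - p_h1))
print(posterior)                 # ~0.727
```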
290 | Bayesian inquiry: an approach to the use of experts. Yee, King G., 01 January 1976.
Subjective information is a valuable resource; however, decision-makers often ignore it because of the difficulty of eliciting it from assessors. This thesis presents a Bayesian-inquiry approach to eliciting subjective information from assessors. Based on the concepts of cascaded inference and Bayesian statistics, the approach is designed to reveal to the decision-maker how the assessor weighs his options and why he selects particular alternatives. Unlike previous work on cascaded inference, the approach here focuses on incoherency: specifically, it uses additional information to revise and check the estimates. The reassessment may be done directly or indirectly; the indirect procedure uses a second-order probability, or type II, distribution. An algorithm implementing this approach is presented. The methodology applies to any number of assessors, and procedures for aggregating and deriving surrogate distributions are also proposed.
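A small illustration of the type II idea, with hypothetical numbers: the assessor's probability p is itself given a distribution, here a conjugate Beta, which additional information then revises:

```python
from scipy.stats import beta

# second-order ("type II") distribution over the assessor's probability p
a, b = 8, 2                                   # hypothetical elicited shape
print("point estimate:", a / (a + b))         # 0.80
print("90% interval:", beta.ppf([0.05, 0.95], a, b))

# indirect reassessment: additional information (say 3 confirming and
# 2 disconfirming observations) revises the type II distribution conjugately
a2, b2 = a + 3, b + 2
print("revised estimate:", a2 / (a2 + b2))    # ~0.73
```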