Gray-Davies, Tristan Daniel
This thesis explores approaches to regression that utilise the treatment of covariates as random variables. The distribution of covariates, along with the conditional regression model Y | X, define the joint model over (Y,X), and in particular, the marginal distribution of the response Y. This marginal distribution provides a vehicle for the incorporation of prior information, as well as external, marginal data. The marginal distribution of the response provides a means of parameterisation that can yield scalable inference, simple prior elicitation, and, in the case of survival analysis, the complete treatment of truncated data. In many cases, this information can be utilised without need to specify a model for X. Chapter 2 considers the application of Bayesian linear regression where large marginal datasets are available, but the collection of response and covariate data together is limited to a small dataset. These marginal datasets can be used to estimate the marginal means and variances of Y and X, which impose two constraints on the parameters of the linear regression model. We define a joint prior over covariate effects and the conditional variance σ<sup>2</sup> via a parameter transformation, which allows us to guarantee these marginal constraints are met. This provides a computationally efficient means of incorporating marginal information, useful when incorporation via the imputation of missing values may be implausible. The resulting prior and posterior have rich dependence structures that have a natural 'analysis of variance' interpretation, due to the constraint on the total marginal variance of Y. The concept of 'marginal coherence' is introduced, whereby competing models place the same prior on the marginal mean and variance of the response. Our marginally constrained prior can be extended by placing priors on the marginal variances, in order to perform variable selection in a marginally coherent fashion. 
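The marginal constraint described above can be made concrete in the simplest case: for a single covariate, var(Y) = β²·var(X) + σ², so a prior on the explained-variance fraction induces a joint prior on (β, σ²) that satisfies the constraint by construction. The sketch below illustrates this kind of parameter transformation; the variable names, the uniform Beta prior, and the single-covariate setting are illustrative assumptions, not the thesis's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marginal estimates from large one-sample datasets
vy, vx = 4.0, 2.5          # marginal variances of Y and X
my, mx = 1.0, 0.5          # marginal means of Y and X

def sample_constrained_prior(n_draws=10000):
    """Draw (beta, sigma2, alpha) such that the implied marginal
    mean and variance of Y match (my, vy) for every draw."""
    # r2 = fraction of var(Y) explained by X ('analysis of variance')
    r2 = rng.beta(1.0, 1.0, size=n_draws)
    sign = rng.choice([-1.0, 1.0], size=n_draws)
    beta = sign * np.sqrt(r2 * vy / vx)      # so beta**2 * vx = r2 * vy
    sigma2 = (1.0 - r2) * vy                 # residual variance
    alpha = my - beta * mx                   # matches marginal mean of Y
    return beta, sigma2, alpha

beta, sigma2, alpha = sample_constrained_prior()
# Every draw satisfies the total-variance constraint exactly
assert np.allclose(beta**2 * vx + sigma2, vy)
```

The point of the transformation is that the constraint holds deterministically, so no rejection or reweighting step is needed.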
Chapter 3 constructs a Bayesian nonparametric regression model parameterised in terms of FY, the marginal distribution of the response. This naturally allows the incorporation of marginal data, and provides a natural means of specifying a prior distribution for a regression model. The construction is such that the distribution of the ordering of the response, given covariates, takes the form of the Plackett-Luce model for ranks. This facilitates a natural composite likelihood approximation that decomposes the likelihood into a term for the marginal response data, and a term for the probability of the observed ranking. This can be viewed as an extension of the partial likelihood for proportional hazards models. This convenient form leads to simple approximate posterior inference, which circumvents the need to perform MCMC, allowing scalability to large datasets. We apply the model to a US Census dataset with over 1,300,000 data points and more than 100 covariates, where the nonparametric prior is able to capture the highly non-standard distribution of incomes. Chapter 4 explores the analysis of randomised clinical trial (RCT) data for subgroup analysis, where interest lies in the optimal allocation of treatment D(X), based on covariates. Standard analyses build a conditional model Y | X,T for the response, given treatment and covariates, which can be used to deduce the optimal treatment rule. We show that the treatment of covariates as random facilitates direct testing of a treatment rule, without the need to specify a conditional model. This provides a robust, efficient, and easy-to-use methodology for testing treatment rules. This nonparametric testing approach is used as a splitting criterion in a random-forest methodology for the exploratory analysis of subgroups.
The model introduced in Chapter 3 is applied in the context of subgroup analysis, providing a Bayesian nonparametric analogue to this approach: where inference is based only on the order of the data, circumventing the requirement to specify a full data-generating model. Both approaches to subgroup analysis are applied to data from an AIDS Clinical Trial.
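The Plackett-Luce probability of an observed ordering, which underlies the composite likelihood of Chapter 3, factorises stage by stage: the top-ranked item is chosen with probability proportional to exp(score), then the next from the remaining items, and so on. A minimal sketch (the scores are illustrative; in the thesis the scores are functions of the covariates):

```python
import itertools
import numpy as np

def plackett_luce_log_prob(scores, order):
    """Log-probability of the given ordering (best first) under a
    Plackett-Luce model: at each stage the next item is drawn from
    those remaining, with probability proportional to exp(score)."""
    s = np.asarray(scores, dtype=float)[list(order)]
    # Suffix log-sum-exp, computed from the end for numerical stability
    suffix_lse = np.logaddexp.accumulate(s[::-1])[::-1]
    return float(np.sum(s - suffix_lse))

scores = [2.0, 1.0, 0.0]            # illustrative item scores
lp = plackett_luce_log_prob(scores, [0, 1, 2])

# Sanity check: the probabilities of all 3! orderings sum to one
probs_sum = sum(
    np.exp(plackett_luce_log_prob(scores, p))
    for p in itertools.permutations(range(3))
)
assert abs(probs_sum - 1.0) < 1e-9
```

Because only the ordering of the responses enters this term, the marginal distribution of the response can be specified separately, which is what allows marginal data to be incorporated directly.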
Meysam Tavakoli (8767965)
28 April 2020
<p>The main goal of data analysis is to summarize a large amount of data (our observations) with a few numbers that offer some intuition into the process that generated the data. Regardless of the method used to analyze the data, the analysis proceeds in five steps: (1) formulate the problem mathematically, (2) collect the data, (3) build a probability model for the data, (4) estimate the parameters of the model, and (5) summarize the results appropriately, a process called "statistical inference".<br></p><p>Bayesian approaches, and Bayesian nonparametrics (BNPs) in particular, have recently been shown to have a deep influence on data analysis, yet their potential in this field has only begun to be exploited [2–4]. To the best of our knowledge, however, no single resource yet explains both their concepts and their implementation at the level needed to bring the power of BNPs to bear on data analysis and accelerate their wider adoption.<br></p><p>In this dissertation, we therefore describe the concepts and implementation of an important computational tool that applies BNPs in this area, specifically in the field of biophysics. The goal is to use BNPs to understand the rules of life (in vivo) at the scale at which life occurs (the single molecule) from the fastest possible acquirable data (single photons).<br></p><p>In chapter 1, we give a brief introduction to data analysis in biophysics. Our overview is aimed at anyone, from student to established researcher, who wants to understand what statistical modeling methods can accomplish and where the field of data analysis in biophysics is headed.
For someone just getting started, we present a primer on the logic, strengths, and shortcomings of data analysis frameworks, with a focus on very recent approaches.<br></p><p>In chapter 2, we provide an overview of data analysis in single-molecule biophysics. We discuss data analysis tools, the model selection problem, and, chiefly, the Bayesian approach. We also discuss BNPs and the distinctive characteristics that make them ideal mathematical tools for modeling complex biomolecules: they offer a meaningful, clear physical interpretation and allow full posterior probabilities over molecular-level models to be deduced with a minimum of subjective choices.<br></p><p>In chapter 3, we work on spectroscopic approaches and fluorescence time traces. These traces are employed to report on the dynamical features of biomolecules. The fundamental unit of information in these time traces is the single photon. Individual photons carry information from the biomolecule that emits them to the detector on timescales as fast as microseconds. From the confocal microscope's viewpoint, it is therefore theoretically feasible to monitor biomolecular dynamics at such timescales. In practice, however, signals are stochastic, and to derive dynamical information through traditional means, such as fluorescence correlation spectroscopy (FCS) and related methods, fluorescence time trace signals are gathered and temporally autocorrelated over many minutes. So far it has been unfeasible to analyze the dynamical attributes of biomolecules on timescales near data acquisition, as this requires estimating the number of biomolecules emitting photons and their locations within the confocal volume. The mathematical structure of this problem forces us to leave the normal ("parametric") Bayesian paradigm.
Here, we utilize novel mathematical tools, BNPs, that allow us to extract, in a principled fashion, the same information normally obtained from FCS, but from the direct analysis of significantly smaller datasets, starting from individual single-photon arrivals. We specifically seek the diffusion coefficient of the molecules. Diffusion allows molecules to find each other in a cell, and at the cellular level, determining the diffusion coefficient can provide valuable insight into how molecules interact with their environment. We discuss how this method helps significantly reduce phototoxic damage to the sample and makes it possible to monitor the dynamics of biomolecules, even down to the single-molecule level, at such timescales.<br></p><p>In chapter 4, we present a new approach to inferring lifetimes. In general, fluorescence lifetime imaging (FLIM) is an approach that provides information on the number of species and their associated lifetimes. Current lifetime data analysis methods rely on either time-correlated single photon counting (TCSPC) or phasor analysis. These methods require large numbers of photons to converge to the appropriate lifetimes and do not determine how many species are responsible for those lifetimes. Here, we propose a new method to analyze lifetime data based on BNPs that precisely accounts for several experimental complexities. Using BNPs, we can identify not only the most probable number of species but also their lifetimes, with at least an order of magnitude less data than competing methods (TCSPC or phasors). To evaluate our method, we test it on both simulated and experimental data for one, two, three, and four species, with both stationary and moving molecules.
We also compare our species estimates and lifetime determinations with both TCSPC and phasor analysis for different numbers of photons used in the analysis.<br></p><p>In conclusion, the basis of every spectroscopic method is the detection of photons. Photon arrivals encode complex dynamical and chemical information, and methods for analyzing such arrivals have the capability to reveal dynamical and chemical processes on fast timescales. Here, we turn our attention to fluorescence lifetime imaging and single-spot fluorescence confocal microscopy, where individual photon arrivals report on dynamics and chemistry down to the single-molecule level. The reason this could not previously be achieved is the uncertainty in the number of chemical species and the number of molecules contributing to the signal (i.e., responsible for the contributing photons). That is, to learn dynamical or kinetic parameters (like diffusion coefficients or lifetimes), we need to be able to interpret which photon is reporting on what process. For this reason, we abandon the parametric Bayesian paradigm in favor of the nonparametric paradigm, which allows us to flexibly explore and learn the number of molecules and the chemical reaction space. We demonstrate the power of BNPs over traditional methods in single-spot confocal and FLIM analysis.<br></p>
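The lifetime analysis described in chapter 4 rests on photon arrival delays being distributed as a mixture of exponential decays, one component per fluorophore species. A minimal finite-mixture sketch of that likelihood follows; the weights and lifetimes are illustrative, and the actual method places a BNP prior over the unknown number of species rather than fixing it:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-species mixture: weights and lifetimes (in ns)
weights = np.array([0.3, 0.7])
lifetimes = np.array([1.0, 4.0])

def simulate_arrivals(n):
    """Simulate photon arrival delays from a mixture of exponential
    decays (one component per species)."""
    species = rng.choice(len(weights), size=n, p=weights)
    return rng.exponential(scale=lifetimes[species])

def mixture_loglik(t, w, tau):
    """Log-likelihood of delays t under a finite exponential mixture;
    a BNP treatment would average over the number of components."""
    dens = (w / tau) * np.exp(-t[:, None] / tau)   # n x M component densities
    return float(np.log(dens.sum(axis=1)).sum())

t = simulate_arrivals(50000)
# The generating two-species model fits better than the best
# single-species (single-exponential) model
ll_true = mixture_loglik(t, weights, lifetimes)
ll_one = mixture_loglik(t, np.array([1.0]), np.array([t.mean()]))
assert ll_true > ll_one
```

The gap between the two log-likelihoods is what lets the number of species be inferred from far fewer photons than methods that only fit a decay curve.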
Waters, Austin Severn
02 July 2014
Digital media collections hold an unprecedented source of knowledge and data about the world. Yet, even at current scales, the data exceeds by many orders of magnitude the amount a single user could browse through in an entire lifetime. Making use of such data requires computational tools that can index, search over, and organize media documents in ways that are meaningful to human users, based on the meaning of their content. This dissertation develops an automated approach to analyzing digital media content based on topic models. Its primary contribution, the Infinite-Word Topic Model (IWTM), helps extend topic modeling to digital media domains by removing model assumptions that do not make sense for them -- in particular, the assumption that documents are composed of discrete, mutually-exclusive words from a fixed-size vocabulary. While conventional topic models like Latent Dirichlet Allocation (LDA) require that media documents be converted into bags of words, IWTM incorporates clustering into its probabilistic model and treats the vocabulary size as a random quantity to be inferred based on the data. Among its other benefits, IWTM achieves better performance than LDA while automating the selection of the vocabulary size. This dissertation contributes fast, scalable variational inference methods for IWTM that allow the model to be applied to large datasets. Furthermore, it introduces a new method, Incremental Variational Inference (IVI), for training IWTM and other Bayesian non-parametric models efficiently on growing datasets. IVI allows such models to grow in complexity as the dataset grows, as their priors state that they should. Finally, building on IVI, an active learning method for topic models is developed that intelligently samples new data, resulting in models that train faster, achieve higher performance, and use smaller amounts of labeled data.
07 December 2010
This research focuses on finding the optimal maintenance policy for an item with varying failure behavior. We analyze several types of item failure rates and develop methods to solve for optimal maintenance schedules. We also illustrate nonparametric modeling techniques for failure rates, and utilize these models in the optimization methods. The general problem falls under the umbrella of stochastic optimization under uncertainty.
03 October 2013
Three statistical models are developed to address problems in Next-Generation Sequencing data. The first two models are designed for RNA-Seq data and the third is designed for ChIP-Seq data. The first of the RNA-Seq models uses a Bayesian nonparametric model to detect genes that are differentially expressed across treatments. A negative binomial sampling distribution is used for each gene’s read count such that each gene may have its own parameters. Despite the consequent large number of parameters, parsimony is imposed by a clustering inherent in the Bayesian nonparametric framework. A Bayesian discovery procedure is adopted to calculate the probability that each gene is differentially expressed. A simulation study and real data analysis show this method will perform at least as well as existing leading methods in some cases. The second RNA-Seq model shares the framework of the first model, but replaces the usual random partition prior from the Dirichlet process with a random partition prior indexed by distances from Gene Ontology (GO). The use of the external biological information yields improvements in statistical power over the original Bayesian discovery procedure. The third model addresses the problem of identifying protein binding sites for ChIP-Seq data. An exact test via a stochastic approximation is used to test the hypothesis that the treatment effect is independent of the sequence count intensity effect. The sliding window procedure for ChIP-Seq data is followed. The p-value and the adjusted false discovery rate are calculated for each window. For the sites identified as peak regions, three candidate models are proposed for characterizing the bimodality of the ChIP-Seq data, and the stochastic approximation in Monte Carlo (SAMC) method is used for selecting the best of the three. Real data analysis shows that this method produces results comparable to those of other existing methods and is advantageous in identifying bimodality of the data.
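The per-gene negative binomial likelihood underlying the first RNA-Seq model can be sketched as follows. The fixed shared dispersion and the pooled-versus-per-treatment mean comparison are illustrative simplifications: the Bayesian discovery procedure averages over clusterings of genes rather than maximising, but the likelihood term has this form.

```python
import numpy as np
from math import lgamma, log

rng = np.random.default_rng(2)

def nb_loglik(counts, mu, r):
    """Negative binomial log-likelihood in mean/shape form
    (mean mu, shape r, variance mu + mu**2 / r)."""
    return sum(
        lgamma(k + r) - lgamma(r) - lgamma(k + 1)
        + r * log(r / (r + mu)) + k * log(mu / (r + mu))
        for k in counts
    )

# Hypothetical read counts for one gene under two treatments
r = 5.0
a = rng.negative_binomial(r, r / (r + 20.0), size=8)   # mean around 20
b = rng.negative_binomial(r, r / (r + 60.0), size=8)   # mean around 60

# Shared mean (no differential expression) vs treatment-specific means
pooled = np.concatenate([a, b])
ll_null = nb_loglik(pooled, pooled.mean(), r)
ll_alt = nb_loglik(a, a.mean(), r) + nb_loglik(b, b.mean(), r)
assert ll_alt >= ll_null   # the richer model always fits at least as well
```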
Slifko, Matthew D.
11 September 2019
We live in the data explosion era. The unprecedented amount of data offers a potential wealth of knowledge but also brings about concerns regarding ethical collection and usage. Mistakes stemming from anomalous data have the potential for severe, real-world consequences, such as when building prediction models for housing prices. To combat anomalies, we develop the Cauchy-Net Mixture Model (CNMM). The CNMM is a flexible Bayesian nonparametric tool that employs a mixture between a Dirichlet Process Mixture Model (DPMM) and a Cauchy distributed component, which we call the Cauchy-Net (CN). Each portion of the model offers benefits, as the DPMM eliminates the limitation of requiring a fixed number of components and the CN captures observations that do not belong to the well-defined components by leveraging its heavy tails. Through isolating the anomalous observations in a single component, we simultaneously identify the observations in the net as warranting further inspection and prevent them from interfering with the formation of the remaining components. The result is a framework that allows for simultaneously clustering observations and making predictions in the face of anomalous data. We demonstrate the usefulness of the CNMM in a variety of experimental situations and apply the model for predicting housing prices in Fairfax County, Virginia. / Doctor of Philosophy / The CNMM is a flexible tool for identifying and isolating anomalies, while simultaneously discovering cluster structure and making predictions among the nonanomalous observations.
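The anomaly-isolating behavior of a Cauchy net can be illustrated with a toy finite mixture: Gaussian components for the well-defined clusters, plus one broad Cauchy component whose heavy tails absorb points that fit no cluster. All parameters below are illustrative fixed values; the actual CNMM places a Dirichlet process prior over the Gaussian components rather than fixing two of them.

```python
import numpy as np

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def cauchy_pdf(x, loc, scale):
    return 1.0 / (np.pi * scale * (1.0 + ((x - loc) / scale) ** 2))

# Two well-defined Gaussian clusters plus a heavy-tailed Cauchy 'net'
means, sds = np.array([0.0, 10.0]), np.array([1.0, 1.0])
w_comp, w_net = 0.45, 0.10          # mixture weights (0.45 + 0.45 + 0.10 = 1)

def net_responsibility(x):
    """Posterior probability that each point belongs to the Cauchy net,
    i.e. is flagged as anomalous rather than assigned to a cluster."""
    comp = w_comp * normal_pdf(x[:, None], means, sds)   # n x 2
    net = w_net * cauchy_pdf(x, 5.0, 5.0)                # broad, heavy-tailed
    return net / (comp.sum(axis=1) + net)

x = np.array([0.2, 9.8, 55.0])      # the last point fits neither cluster
resp = net_responsibility(x)
assert resp[2] > 0.99               # anomaly caught by the net
assert resp[0] < 0.2 and resp[1] < 0.2
```

Because the Gaussian densities vanish much faster than the Cauchy density in the tails, far-out points receive essentially all of their responsibility from the net, which is the isolation mechanism the abstract describes.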
Darnieder, William Francis
22 July 2011
No description available.
Niekum, Scott D.
01 September 2013
Robots exhibit flexible behavior largely in proportion to their degree of semantic knowledge about the world. Such knowledge is often meticulously hand-coded for a narrow class of tasks, limiting the scope of possible robot competencies. Thus, the primary limiting factor of robot capabilities is often not the physical attributes of the robot, but the limited time and skill of expert programmers. One way to deal with the vast number of situations and environments that robots face outside the laboratory is to provide users with simple methods for programming robots that do not require the skill of an expert. For this reason, learning from demonstration (LfD) has become a popular alternative to traditional robot programming methods, aiming to provide a natural mechanism for quickly teaching robots. By simply showing a robot how to perform a task, users can easily demonstrate new tasks as needed, without any special knowledge about the robot. Unfortunately, LfD often yields little semantic knowledge about the world, and thus lacks robust generalization capabilities, especially for complex, multi-step tasks. To address this shortcoming of LfD, we present a series of algorithms that draw from recent advances in Bayesian nonparametric statistics and control theory to automatically detect and leverage repeated structure at multiple levels of abstraction in demonstration data. The discovery of repeated structure provides critical insights into task invariants, features of importance, high-level task structure, and appropriate skills for the task. This culminates in the discovery of semantically meaningful skills that are flexible and reusable, providing robust generalization and transfer in complex, multi-step robotic tasks. These algorithms are tested and evaluated using a PR2 mobile manipulator, showing success on several complex real-world tasks, such as furniture assembly.
<p>An unprecedented amount of data has been collected in diverse fields such as social networks, infectious disease, and political science in this era of information explosion. These high-dimensional, complex, and heterogeneous data impose tremendous challenges on traditional statistical models. Bayesian nonparametric methods address these challenges by providing models that can fit the data with growing complexity. In this thesis, we design novel Bayesian nonparametric models for datasets from three different fields: hyperspectral image analysis, infectious disease, and voting behavior. </p><p>First, we consider the analysis of noisy and incomplete hyperspectral imagery, with the objective of removing the noise and inferring the missing data. The noise statistics may be wavelength-dependent, and the fraction of data missing (at random) may be substantial, potentially including entire bands, offering the potential to significantly reduce the quantity of data that need be measured. We achieve this objective by employing a Bayesian dictionary learning model, considering two distinct means of imposing sparse dictionary usage, and drawing the dictionary elements from a Gaussian process prior, which imposes structure on the wavelength dependence of the dictionary elements.</p><p>Second, a Bayesian statistical model is developed for analysis of the time-evolving properties of infectious disease, with a particular focus on viruses.
The model employs a latent semi-Markovian state process, and the state-transition statistics are driven by three terms: (i) a general time-evolving trend of the overall population, (ii) a semi-periodic term that accounts for effects caused by the days of the week, and (iii) a regression term that relates the probability of infection to covariates (here, specifically, the Google Flu Trends data).</p><p>Third, extensive information on 3 million randomly sampled United States citizens is used to construct a statistical model of constituent preferences for each U.S. congressional district. This model is linked to the legislative voting record of the legislator from each district, yielding an integrated model for constituency data, legislative roll-call votes, and the text of the legislation. The model is used to examine the extent to which legislators' voting records are aligned with constituent preferences, and the implications of that alignment (or lack thereof) for subsequent election outcomes. The analysis is based on a Bayesian nonparametric formalism, with fast inference via a stochastic variational Bayesian analysis.</p>
Paulo Cilas Marques Filho
19 December 2011
We define, from a known partition of a bounded interval of the real line into subintervals, a prior distribution over a class of densities with respect to Lebesgue measure, by constructing a random density whose realizations are nonnegative simple functions that take a constant value on each subinterval of the partition and integrate to one. These simple random densities are used in the Bayesian analysis of a set of absolutely continuous observables, and the prior distribution is proved to be closed under sampling. We explore the prior and posterior distributions through stochastic simulations and find Bayesian solutions to the problem of density estimation. The simulation results show the asymptotic behavior of the posterior distribution as the size of the analyzed data samples increases. When the partition is not known a priori, we propose a choice criterion based on the information contained in the sample. Although the expectation of a simple random density is always a discontinuous density, we obtain smooth estimates by solving a decision problem in which the states of nature are realizations of the simple random density and the actions are smooth densities of a suitable class.
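One natural construction matching this description places a Dirichlet prior on the masses of the subintervals, so that each realization is a piecewise-constant density and the posterior is again Dirichlet with observed counts added to the concentrations, which is what "closed under sampling" means here. The partition, prior, and data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Known partition of [0, 1] into subintervals
edges = np.array([0.0, 0.2, 0.5, 0.7, 1.0])
lengths = np.diff(edges)
alpha = np.ones(len(lengths))          # Dirichlet prior on bin masses

def draw_density(a):
    """One realization: a nonnegative simple function, constant on
    each subinterval, integrating to one."""
    masses = rng.dirichlet(a)
    return masses / lengths            # heights; sum(heights * lengths) == 1

def posterior_alpha(a, data):
    """Conjugate update: add the count of observations falling in
    each subinterval to that subinterval's concentration."""
    counts, _ = np.histogram(data, bins=edges)
    return a + counts

heights = draw_density(alpha)
assert np.isclose(np.sum(heights * lengths), 1.0)

data = rng.uniform(0, 1, size=500)
a_post = posterior_alpha(alpha, data)
assert a_post.sum() == alpha.sum() + len(data)
```

The posterior expectation of each bin's height is then its updated concentration divided by the total, scaled by the bin length, which is the discontinuous density estimate the abstract refers to before smoothing.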