Global ETD Search

21	Theoretical study of some statistical procedures applied to complex data / Etude théorique de quelques procédures statistiques pour le traitement de données complexes Cottet, Vincent R. 17 November 2017 (has links) The main part of this thesis develops the theoretical and algorithmic aspects of three distinct statistical procedures. The first problem is binary matrix completion. We propose an estimator based on a variational approximation of a pseudo-Bayesian estimator, using a loss function different from those used previously in the literature. We are able to compute non-asymptotic bounds on the integrated risk. 
The proposed estimator is much faster to compute than an MCMC-based estimate, and we show on examples that it is efficient in practice. In the second part, we study the theoretical properties of the penalized empirical risk minimizer for Lipschitz loss functions. We then apply the main results to logistic regression with the SLOPE penalty and to matrix completion. The third chapter develops an Expectation-Propagation approximation for settings where the likelihood is not explicit; an ABC approximation is then used in a second stage. This procedure can be applied to many models and is more precise and faster than the classic ABC approximation. As an example, it is applied to a spatial extremes model. Statistics Bayesian Inference Computational Statistics Machine Learning Matrix Completion Spatial Extremes 519
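For readers unfamiliar with the SLOPE penalty named in entry 21, it is commonly defined as the sorted-L1 norm: the coefficients are ordered by decreasing absolute value and paired with a non-increasing sequence of non-negative weights. The sketch below only evaluates that penalty; it is not code from the thesis, and the function name `slope_penalty` and the example weights are illustrative assumptions.

```python
def slope_penalty(beta, lambdas):
    """Sorted-L1 (SLOPE) penalty: sum_i lambdas[i] * |beta|_(i).

    |beta|_(1) >= |beta|_(2) >= ... are the absolute coefficients in
    decreasing order; lambdas must be non-negative and non-increasing.
    """
    if len(beta) != len(lambdas):
        raise ValueError("beta and lambdas must have the same length")
    if any(lam < 0 for lam in lambdas):
        raise ValueError("lambdas must be non-negative")
    if any(a < b for a, b in zip(lambdas, lambdas[1:])):
        raise ValueError("lambdas must be non-increasing")
    # Pair the largest weight with the largest absolute coefficient.
    abs_sorted = sorted((abs(b) for b in beta), reverse=True)
    return sum(lam * a for lam, a in zip(lambdas, abs_sorted))


# Illustrative values (assumed): 3*3 + 2*2 + 1*1 = 14.0
print(slope_penalty([1.0, -3.0, 2.0], [3.0, 2.0, 1.0]))
```

With all weights equal, the penalty reduces to a multiple of the ordinary L1 (lasso) norm, which is one way to see SLOPE as a generalization of the lasso penalty.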
22	GENERATIVE MODELS WITH MARGINAL CONSTRAINTS Bingjing Tang (16380291) 16 June 2023 (has links) <p> Generative models form powerful tools for learning data distributions and simulating new samples. Recent years have seen significant advances in the flexibility and applicability of such models, with Bayesian approaches like nonparametric Bayesian models and deep neural network models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) finding use in a wide range of domains. However, the black-box nature of these models means that they are often hard to interpret, and they often come with modeling implications that are inconsistent with side knowledge resulting from domain knowledge. This thesis studies situations where the modeler has side knowledge represented as probability distributions on functionals of the objects being modeled, and we study methods to incorporate this particular kind of side knowledge into flexible generative models. This dissertation covers three main parts. </p> <p><br></p> <p>The first part focuses on incorporating a special case of the aforementioned side knowledge into flexible nonparametric Bayesian models. Many times, practitioners have additional distributional information about a subset of the coordinates of the observations being modeled. The flexibility of nonparametric Bayesian models usually implies incompatibility with this side information. Such inconsistency triggers the necessity of developing methods to incorporate this side knowledge into flexible nonparametric Bayesian models. We design a specialized generative process to build in this side knowledge and propose a novel sigmoid Gaussian process conditional model. We also develop a corresponding posterior sampling method based on data augmentation to overcome a doubly intractable problem. 
We illustrate the efficacy of our proposed constrained nonparametric Bayesian model in a variety of real-world scenarios including modeling environmental and earthquake data. </p> <p><br></p> <p>The second part of the dissertation discusses neural network approaches to satisfying the said general side knowledge. Further, the generative models considered in this part broaden into black-box models. We formulate this side knowledge incorporation problem as a constrained divergence minimization problem and propose two scalable neural network approaches as its solution. We demonstrate their practicality using various synthetic and real examples. </p> <p><br></p> <p> The third part of the dissertation concentrates on a specific generative model of individual pixels of the fMRI data constructed from a latent group image. Usually there is two-fold side knowledge about the latent group image: spatial structure and partial activation zones. The former can be captured by modeling the prior for the group image with Markov random fields. The latter, which is often obtained from previous related studies, is left for future research. We propose a novel Bayesian model with Markov random fields and aim to estimate the maximum a posteriori for the group image. We also derive a variational Bayes algorithm to overcome local optima in the optimization.</p> Computational statistics Statistical data science Knowledge Constraints Nonparametric Bayesian Black-box Neural Networks Conditional Density Estimation Density Ratio Estimation Sigmoid Gaussian Processes
23	Toward a Theory of Auto-modeling Yiran Jiang (16632711) 25 July 2023 (has links) <p>Statistical modeling aims at constructing a mathematical model for an existing data set. As a comprehensive concept, statistical modeling leads to a wide range of interesting problems. Modern parametric models, such as deep nets, have achieved remarkable success in quite a few application areas with massive data. Although powerful in practice, many fitted over-parameterized models risk losing good statistical properties. For this reason, a new framework named the Auto-modeling (AM) framework is proposed. Philosophically, the mindset is to fit models to future observations rather than to the observed sample. Technically, after choosing an imputation model for generating future observations, we fit models to future observations by optimizing an approximation to the desired expected loss function based on its sample counterpart and what we call an adaptive <em>duality function</em>.</p> <p><br></p> <p>The first part of the dissertation (Chapters 2 to 7) focuses on the new philosophical perspective of the method, as well as the details of the main framework. Technical details, including essential theoretical properties of the method, are also investigated. We also demonstrate the superior performance of the proposed method via three applications: the many-normal-means problem, $n < p$ linear regression, and image classification.</p> <p><br></p> <p>The second part of the dissertation (Chapter 8) focuses on the application of the AM framework to the construction of linear regression models. Our primary objective is to shed light on the stability issue associated with commonly used data-driven model selection methods such as cross-validation (CV). Furthermore, we highlight the philosophical distinctions between CV and AM. Theoretical properties and numerical examples presented in the study demonstrate the potential and promise of AM-based linear model selection. 
Additionally, we have devised a conformal prediction method specifically tailored for quantifying the uncertainty of AM predictions in the context of linear regression.</p> Applied statistics Computational statistics Statistical theory future observations bootstrap re-sampling cross-validation image classification linear regression over-parameterization statistical modeling
24	Causal Inference in the Face of Assumption Violations Yuki Ohnishi (18423810) 26 April 2024 (has links) <p dir="ltr">This dissertation advances the field of causal inference by developing methodologies in the face of assumption violations. Traditional causal inference methodologies hinge on a core set of assumptions, which are often violated in the complex landscape of modern experiments and observational studies. This dissertation proposes novel methodologies designed to address the challenges posed by single or multiple assumption violations. By applying these innovative approaches to real-world datasets, this research uncovers valuable insights that were previously inaccessible with existing methods. </p><p><br></p><p dir="ltr">First, three significant sources of complications in causal inference that are increasingly of interest are interference among individuals, nonadherence of individuals to their assigned treatments, and unintended missing outcomes. Interference exists if the outcome of an individual depends not only on its assigned treatment, but also on the assigned treatments for other units. It commonly arises when limited controls are placed on the interactions of individuals with one another during the course of an experiment. Treatment nonadherence frequently occurs in human subject experiments, as it can be unethical to force an individual to take their assigned treatment. Clinical trials, in particular, typically have subjects that do not adhere to their assigned treatments due to adverse side effects or intercurrent events. Missing values also commonly occur in clinical studies. For example, some patients may drop out of the study due to the side effects of the treatment. Failing to account for these considerations will generally yield unstable and biased inferences on treatment effects even in randomized experiments, but existing methodologies lack the ability to address all these challenges simultaneously. 
We propose a novel Bayesian methodology to fill this gap. </p><p><br></p><p dir="ltr">My subsequent research further addresses one of the limitations of the first project: a set of assumptions about interference structures that may be too restrictive in some practical settings. We introduce the concept of the "degree of interference" (DoI), a latent variable capturing the interference structure. This concept allows for handling arbitrary, unknown interference structures to facilitate inference on causal estimands. </p><p><br></p><p dir="ltr">While randomized experiments offer a solid foundation for valid causal analysis, people are also interested in conducting causal inference using observational data due to the cost and difficulty of randomized experiments and the wide availability of observational data. Nonetheless, using observational data to infer causality requires us to rely on additional assumptions. A central assumption is that of <em>ignorability</em>, which posits that the treatment is randomly assigned conditional on the variables (covariates) included in the dataset. While crucial, this assumption is often debatable, especially when treatments are assigned sequentially to optimize future outcomes. For instance, marketers typically adjust subsequent promotions based on responses to earlier ones and speculate on how customers might have reacted to alternative past promotions. This speculative behavior introduces latent confounders, which must be carefully addressed to prevent biased conclusions. </p><p dir="ltr">In the third project, we investigate these issues by studying sequences of promotional emails sent by a US retailer. We develop a novel Bayesian approach for causal inference from longitudinal observational data that accommodates noncompliance and latent sequential confounding. </p><p><br></p><p dir="ltr">Finally, we formulate the causal inference problem for privatized data. 
In the era of digital expansion, the secure handling of sensitive data poses an intricate challenge that significantly influences research, policy-making, and technological innovation. As the collection of sensitive data becomes more widespread across academic, governmental, and corporate sectors, addressing the complex balance between making data accessible and safeguarding private information requires the development of sophisticated methods for analysis and reporting, which must include stringent privacy protections. Currently, the gold standard for maintaining this balance is differential privacy. </p><p dir="ltr">Local differential privacy is a differential privacy paradigm in which individuals first apply a privacy mechanism to their data (often by adding noise) before transmitting the result to a curator. The noise added for privacy introduces additional bias and variance into analyses. Thus, it is of great importance for analysts to account for the privacy noise in order to obtain valid inference.</p><p dir="ltr">In this final project, we develop methodologies to infer causal effects from locally privatized data under randomized experiments. We present frequentist and Bayesian approaches and discuss the statistical properties of the estimators, such as consistency and optimality under various privacy scenarios.</p> Econometric and statistical methods Applied statistics Computational statistics Statistical data science Statistical theory Causal Inference Bayesian statistics Interference Noncompliance Missing not at random (MNAR) Bayesian Nonparametrics Differential privacy
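The local differential privacy setting described in entry 24 (each individual perturbs their own data before sending it to a curator, and the analyst must correct for the resulting bias) can be illustrated with classic randomized response for a binary value. This is a textbook mechanism used here as a sketch, not the dissertation's estimator; the function names, the privacy level `epsilon = 1.0`, and the simulated data are all assumptions made for illustration.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    This satisfies epsilon-local differential privacy for a single binary value.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def debiased_mean(reports, epsilon):
    """Unbiased estimate of the true mean from privatized reports.

    E[report] = (2p - 1) * mu + (1 - p), so we invert that affine map.
    Requires epsilon > 0 so that 2p - 1 > 0.
    """
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    raw = sum(reports) / len(reports)
    return (raw - (1.0 - p)) / (2.0 * p - 1.0)


# Simulated binary outcomes (assumed data), privatized individually.
rng = random.Random(0)
truth = [1 if rng.random() < 0.3 else 0 for _ in range(100_000)]
reports = [randomized_response(b, 1.0, rng) for b in truth]

naive = sum(reports) / len(reports)      # biased toward 0.5 by the noise
corrected = debiased_mean(reports, 1.0)  # bias removed, variance inflated
print(naive, corrected, sum(truth) / len(truth))
```

The naive mean of the reports is pulled toward 0.5, while the corrected estimate is unbiased at the cost of a variance inflated by the factor 1 / (2p - 1)^2, which is the bias and variance trade-off the abstract refers to.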
25	Performance of supertree methods for estimating species trees Wang, Yuancheng January 2010 (has links) Phylogenetics is the study of ancestor-descendant relationships among different groups of organisms, for example, species or populations of interest. The datasets involved are usually sequence alignments of various subsets of taxa for various genes. A major task of phylogenetics is often to combine estimated gene trees from many loci sampled from the genes into an overall estimated species tree topology. Eventually, one can construct the tree of life that depicts the ancestor-descendant relationships for all known species around the world. If there is missing data or incomplete sampling in the datasets, then supertree methods can be used to assemble gene trees with different subsets of taxa into an estimated overall species tree topology. In this study, we assume that gene tree discordance is solely due to incomplete lineage sorting under the multispecies coalescent model (Degnan and Rosenberg, 2009). In addition, we examine the performance of the most commonly used supertree method (Wilkinson et al., 2009), namely matrix representation with parsimony (MRP), to explore its statistical properties in this setting. In particular, we show that MRP is not statistically consistent. That is, an estimated species tree topology other than the true species tree topology is more likely to be returned by MRP as the number of gene trees increases. In some situations, using longer branch lengths, randomly deleting taxa, or even introducing mutation can improve the performance of MRP so that the matching species tree topology is recovered more often. In conclusion, MRP is a supertree method that is able to handle large amounts of conflict in the input gene trees. 
However, MRP is not statistically consistent when gene trees arising from the multispecies coalescent model are used to estimate species trees. phylogenetics computational statistics gene tree species tree supertree method simulation study incomplete lineage sorting multispecies coalescent model statistically consistent expected parsimony score pruning schemes mutation model
26	Image-based modelling of pattern dynamics in a semiarid grassland of the Pilbara, Australia Sadler, Rohan January 2007 (has links) [Truncated abstract] Ecologists are increasingly interested in quantifying local interacting processes and their impacts on spatial vegetation patterns. In arid and semiarid ecosystems, theoretical models (often spatially explicit) of dynamical system behaviour have been used to provide insight into changes in vegetation patterning and productivity triggered by ecological events, such as fire and episodic rainfall. The incorporation of aerial imagery of vegetation patterning into current theoretical models remains a challenge, as few theoretical models may be inferred directly from ecological data, let alone imagery. However, if conclusions drawn from theoretical models were well supported by image data, then these models could serve as a basis for improved prediction of complex ecosystem behaviour. The objective of this thesis is therefore to develop methods for inferring theoretical models of vegetation dynamics from imagery. ... These results demonstrate how an ad hoc inference procedure returns biologically meaningful parameter estimates for a germ-grain model of T. triandra vegetation patterning, with VLSA photography as data. Various aspects of the modelling and inference procedures are discussed in the concluding chapter, including possible future extensions and alternative applications for germ-grain models. I conclude that the state-and-transition model provides an effective exploration of an ecosystem's dynamics, and complements spatially explicit models designed to test specific ecological mechanisms. Significantly, both types of models may now be inferred from image data through the methodologies I have developed, and can provide an empirical basis to theoretical models of complex vegetation dynamics used in understanding and managing arid (and other) ecological systems. 
Vegetation dynamics -- Data processing Kangaroo grass Computational statistics Spatially explicit models Semi-arid grasslands Very large scale aerial imagery
27	Multivariate analysis of high-throughput sequencing data / Analyses multivariées de données de séquençage à haut débit Durif, Ghislain 13 December 2016 (has links) The statistical analysis of Next-Generation Sequencing (NGS) data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Developments are made concerning the sparse Partial Least Squares (PLS) regression framework for supervised classification and the sparse matrix factorization framework for unsupervised exploration. In both situations, our main purpose will be the reconstruction and visualization of the data. 
First, we will present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. For instance, such a method will be used for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in such a framework is to account for the response in order to discard irrelevant variables. We will highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. 
The interest of our procedure for data reconstruction, visualization and clustering will be illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods were implemented in two R packages, "plsgenomics" and "CMF", based on high-performance computing. Computational Statistics High-dimensional data Dimension reduction Compression Variable selection Logistic regression Sparse Partial Least Squares Probabilistic matrix factorization 570.15
28	QUANTUM ACTIVATION FUNCTIONS FOR NEURAL NETWORK REGULARIZATION Christopher Alfred Hickey (16379193) 18 June 2023 (has links) <p> The Bias-Variance Trade-off, where restricting the size of a hypothesis class can limit the generalization error of a model, is a canonical problem in Machine Learning, and a particular issue for high-variance models like Neural Networks that do not have enough parameters to enter the interpolating regime. Regularization techniques add bias to a model to lower testing error at the cost of increasing training error. This paper applies quantum circuits as activation functions in order to regularize a Feed-Forward Neural Network. The network using Quantum Activation Functions is compared against a network of the same dimensions except using Rectified Linear Unit (ReLU) activation functions, which can fit any arbitrary function. The Quantum Activation Function network is then shown to have comparable training performance to ReLU networks, both with and without regularization, for the tasks of binary classification, polynomial regression, and regression on a multicollinear dataset, which is a dataset whose design matrix is rank-deficient. The Quantum Activation Function network is shown to achieve regularization comparable to networks with L2-Regularization, the most commonly used method for neural network regularization today, with regularization parameters in the range of λ ∈ [.1, .5], while still allowing the model to maintain enough variance to achieve low training error. While there are limitations to the current physical implementation of quantum computers, there is potential for future architecture, or hardware-based, regularization methods that leverage the aspects of quantum circuits that provide lower generalization error. 
</p> Numerical analysis Optimisation Computational statistics Statistical theory Stochastic analysis and modelling Quantum technologies Neural Networks Statistical Machine Learning Machine Learning Deep Learning Quantum Computing
29	Thesis_deposit.pdf Sehwan Kim (15348235) 25 April 2023 (has links) <p> Adaptive MCMC is advantageous over traditional MCMC due to its ability to automatically adjust its proposal distributions during the sampling process, providing improved sampling efficiency and faster convergence to the target distribution, especially in complex or high-dimensional problems. However, designing and validating the adaptive scheme cautiously is crucial to ensure algorithm validity and prevent the introduction of biases. This dissertation focuses on the use of Adaptive MCMC for deep learning, specifically addressing the mode collapse issue in Generative Adversarial Networks (GANs) and implementing Fiducial inference, and its application to Causal inference in individual treatment effect problems.</p> <p><br></p> <p> First, the GAN was recently introduced in the literature as a novel machine learning method for training generative models. However, GANs are very difficult to train due to the issue of mode collapse, i.e., a lack of diversity among generated data. We identify the reason why GANs suffer from this issue and lay out a new theoretical framework for GANs based on randomized decision rules such that the mode collapse issue can be essentially overcome. Under the new theoretical framework, the discriminator converges to a fixed point while the generator converges to a distribution at the Nash equilibrium.</p> <p><br></p> <p> Second, Fiducial inference has generally been considered R.A. Fisher's big blunder, but the goal he initially set, <em>making inference for the uncertainty of model parameters on the basis of observations</em>, has been continually pursued by many statisticians. By leveraging advanced statistical computing techniques such as stochastic approximation Markov chain Monte Carlo, we develop a new statistical inference method, the so-called extended Fiducial inference, which achieves the initial goal of fiducial inference. 
</p> <p><br></p> <p> Lastly, estimating the individual treatment effect (ITE) is important for decision making in various fields, particularly in health research where precision medicine is being investigated. The conditional average treatment effect (CATE) is often used for this purpose, but uncertainty quantification and an explanation of the variability of the predicted ITE are still needed for fair decision making. We discuss using extended Fiducial inference to construct prediction intervals for the ITE, and introduce a double neural net algorithm for efficient prediction and estimation of nonlinear ITE.</p> Computational statistics Statistical data science Statistical theory adaptive MCMC techniques Generative Adversarial Net Fiducial inference Causal inference Individual treatment effects Stochastic Approximation Monte Carlo
30	LB-CNN & HD-OC, DEEP LEARNING ADAPTABLE BINARIZATION TOOLS FOR LARGE SCALE IMAGE CLASSIFICATION Timothy G Reese (13163115) 28 July 2022 (has links) <p>The computer vision task of classifying natural images is a primary driving force behind modern AI algorithms. Deep Convolutional Neural Networks (CNNs) demonstrate state-of-the-art performance in large scale multi-class image classification tasks. However, due to the many layers and millions of parameters, these models are considered to be black box algorithms. The decisions of these models are further obscured due to a cumbersome multi-class decision process. There exists another approach in the literature, called class binarization, which determines the multi-class prediction outcome through a sequence of binary decisions. The focus of this dissertation is on the integration of the class-binarization approach to multi-class classification with deep learning models, such as CNNs, for addressing large scale image classification problems. Three works are presented to address the integration.</p> <p>In the first work, Error Correcting Output Codes (ECOCs) are integrated into CNNs by inserting a latent-binarization layer prior to the CNN's final classification layer. This approach encapsulates both encoding and decoding steps of ECOC into a single CNN architecture. EM and Gibbs sampling algorithms are combined with back-propagation to train CNN models with Latent Binarization (LB-CNN). The training process of LB-CNN guides the model to discover hidden relationships similar to the semantic relationships known a priori between the categories. The proposed models and algorithms are applied to several image recognition tasks, producing excellent results.</p> <p>In the second work, Hierarchically Decodeable Output Codes (HD-OCs) are proposed to compactly describe a hierarchical probabilistic binary decision process model over the features of a CNN. 
HD-OCs enforce more homogeneous assignments of the categories to the dichotomy labels. A novel concept called average decision depth is presented to quantify the average number of binary questions needed to classify an input. An HD-OC is trained using a hierarchical log-likelihood loss that is empirically shown to orient the output of the latent feature space to resemble the hierarchical structure described by the HD-OC. Experiments are conducted at several different scales of category labels. The experiments demonstrate strong performance and powerful insights into the decision process of the model.</p> <p>In the final work, the literature of enumerative combinatorics and partially ordered sets is used to establish a unifying framework of class-binarization methods under the Multivariate Bernoulli family of models. The unifying framework theoretically establishes simple relationships for transitioning between the different binarization approaches. Such relationships provide useful investigative tools for the discovery of statistical dependencies between large groups of categories. They are additionally useful for incorporating taxonomic information as well as enforcing structural model constraints. The unifying framework lays the groundwork for future theoretical and methodological work in addressing the fundamental issues of large scale multi-class classification.</p> <p><br></p> Computational statistics Statistical data science Statistical theory ECOC classification algorithms image classification techniques Hierarchical algorithms enumerative combinatorics Deep Learning Imaging Neural Networks method Convolutional neural networks image analysis class-binarization
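The class-binarization idea in entry 30, where a multi-class prediction is recovered from a set of binary decisions via Error Correcting Output Codes, can be sketched with plain nearest-codeword decoding. This is the generic Hamming-distance decoder, not the LB-CNN or HD-OC method from the dissertation; the code matrix, class count, and function names below are hypothetical and chosen only so that any two codewords differ in four positions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits, code_matrix):
    """Return the index of the class whose codeword is closest to `bits`.

    Ties are broken in favor of the lowest class index.
    """
    distances = [hamming(bits, row) for row in code_matrix]
    return distances.index(min(distances))


# Hypothetical code matrix: 4 classes, 6 binary classifiers (one per column).
# Every pair of rows differs in 4 positions, so one wrong bit is corrected.
CODE_MATRIX = [
    [0, 0, 0, 0, 0, 0],  # class 0
    [1, 1, 1, 1, 0, 0],  # class 1
    [1, 1, 0, 0, 1, 1],  # class 2
    [0, 0, 1, 1, 1, 1],  # class 3
]

# Outputs of the six binary classifiers for one input, with the second
# bit flipped relative to the class-2 codeword.
noisy_bits = [1, 0, 0, 0, 1, 1]
print(ecoc_decode(noisy_bits, CODE_MATRIX))  # nearest codeword is class 2
```

Because the minimum distance between codewords is four, a single erroneous binary decision still decodes to the intended class, which is the error-correcting property that gives these codes their name.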

Search results