About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.

Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

rstream: Streams of Random Numbers for Stochastic Simulation

L'Ecuyer, Pierre, Leydold, Josef January 2005 (has links) (PDF)
The package rstream provides a unified interface to streams of random numbers for the R statistical computing language. Its features are: independent streams of random numbers, substreams, easy handling of streams (initialize, reset), and antithetic random variates. The paper describes this package and demonstrates the usefulness of the approach with a simple example. / Series: Preprint Series / Department of Applied Statistics and Data Processing
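rstream itself is an R package, so the snippet below is not its API. As a rough Python/NumPy analogue of the features listed above (independent streams, resettable streams, antithetic variates), with an arbitrary seed chosen for illustration:

```python
# Illustration only: rstream is an R package, and this is not its API. The NumPy
# sketch below mimics the same ideas -- independent streams, resettable streams,
# and antithetic variates. The seed is arbitrary.
import numpy as np

root = np.random.SeedSequence(20240101)            # reproducible master seed
stream_seeds = root.spawn(2)                       # two statistically independent streams
streams = [np.random.default_rng(s) for s in stream_seeds]

u = streams[0].random(5)                           # uniforms from stream 0
antithetic = 1.0 - u                               # antithetic variates for variance reduction

# "Resetting" a stream: rebuild the generator from its stored seed sequence.
streams[0] = np.random.default_rng(stream_seeds[0])
assert np.allclose(u, streams[0].random(5))        # same draws after the reset

print(u, antithetic, streams[1].random(3))
```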
2

Estimating Freeway Travel Time Reliability for Traffic Operations and Planning

Yang, Shu January 2016 (has links)
Travel time reliability (TTR) has attracted increasing attention in recent years and is often listed as one of the major roadway performance and service quality measures for both traffic engineers and travelers. Measuring travel time reliability is the first step towards improving it, ensuring on-time arrivals, and reducing travel costs. Four components may be primarily considered: travel time estimation/collection, selection of the travel time data quantity, probability distribution selection, and TTR measure selection. Travel time is a key transportation performance measure because of its diverse applications, and it also serves as the foundation for estimating travel time reliability. Various modelling approaches to estimating freeway travel time have been well developed owing to the widespread installation of intelligent transportation system sensors. However, estimating accurate travel times with existing freeway travel time models is still challenging under congested conditions. Therefore, this study aimed to develop an innovative freeway travel time estimation model based on the General Motors (GM) car-following model. Since the GM model is usually used in a micro-simulation environment, the concepts of virtual leading and virtual following vehicles are proposed to allow the GM model to be used in macro-scale environments with aggregated traffic sensor data. Travel time data collected from three study corridors on I-270 in St. Louis, Missouri were used to verify the estimated travel times produced by the proposed General Motors Travel Time Estimation (GMTTE) model and two existing models, the instantaneous model and the time-slice model. The results showed that the GMTTE model outperformed the two existing models owing to its lower mean absolute percentage errors of 1.62% in free-flow conditions and 6.66% in two congested conditions. Overall, the GMTTE model demonstrated its robustness and accuracy for estimating freeway travel times.

Most travel time reliability measures are derived from continuous probability distributions and applied directly to the traffic data. However, little previous research shows a consensus on the selection of a probability distribution family for travel time reliability. Different probability distribution families could yield different values for the same travel time reliability measure (e.g., standard deviation). It is believed that the specific selection of the probability distribution family has little effect on measuring travel time reliability. Therefore, two hypotheses are proposed in the hope of accurately measuring travel time reliability, and an experiment is designed to prove them. The first hypothesis is proven by conducting the Kolmogorov–Smirnov test and checking the convergence of log-likelihoods, the Akaike information criterion with a correction for finite sample sizes (AICc), and the Bayesian information criterion (BIC); the second hypothesis is proven by examining both moment-based and percentile-based travel time reliability measures. The results of the two hypothesis tests suggest that 1) underfitting may cause disagreement in distribution selection, 2) travel time can be precisely fitted using mixture models with a larger number of mixture components (K), regardless of the distribution family, and 3) the travel time reliability measures are insensitive to the selection of distribution family.
The findings of this research allow researchers and practitioners to avoid the work of testing various distributions, and travel time reliability can be measured more accurately using mixture models owing to their higher log-likelihoods. As with travel time collection, the accuracy of the observed travel times and the optimal travel time data quantity should be determined before using the TTR data. The statistical accuracy of TTR measures should be evaluated so that their statistical behavior can be fully understood. More specifically, this issue can be formulated as a question: using a certain amount of travel time data, how accurate is the travel time reliability for a specific freeway corridor, time of day (TOD), and day of week (DOW)? A framework for answering this question has not been proposed in the past. Our study proposes a framework based on bootstrapping to evaluate the accuracy of TTR measures and answer the question. Bootstrapping is a computer-based method for assigning measures of accuracy to multiple types of statistical estimators without requiring a specific probability distribution. Three scenarios representing three traffic flow conditions (free-flow, congestion, and transition) were used to fully understand the accuracy of TTR measures under different traffic conditions. The results of the accuracy measurements primarily showed that 1) the proposed framework can facilitate assessment of the accuracy of TTR, and 2) stabilization of the TTR measures did not necessarily correspond to statistical accuracy. The findings of our study also suggested that moment-based TTR measures may not be statistically sufficient for measuring freeway TTR. Additionally, our study suggested that 4 or 5 weeks of travel time data are enough for measuring freeway TTR under free-flow conditions, 40 weeks for congested conditions, and 35 weeks for transition conditions.

A considerable number of studies have contributed to measuring travel time reliability. Travel time distribution estimation is considered an important starting input for measuring travel time reliability. Kernel density estimation (KDE) is used to estimate the travel time distribution, instead of parametric probability distributions such as the lognormal distribution or two-state models. The Hasofer–Lind–Rackwitz–Fiessler (HL-RF) algorithm, widely used in the field of reliability engineering, is applied in this work; it is used to compute the reliability index of a system based on its previous performance. The computing procedure for the travel time reliability of corridors on a freeway is first introduced, and network travel time reliability is developed afterwards. Given probability distributions estimated by the KDE technique and an anticipated travel time from travelers, the two equations for corridor and network travel time reliability can be used to address the question, "How reliable is my perceived travel time?" Travel time reliability is defined here in the sense of "on-time performance", and it is formulated inherently from the perspective of travelers. Further, the major advantages of the proposed method are: 1) it demonstrates an alternative way to estimate travel time distributions when the choice of probability distribution family is still uncertain; and 2) it is flexible enough to be applied at different levels of the roadway system (e.g., an individual roadway segment or a network).
A user-defined anticipated travel time can be input, and travelers can utilize the computed travel time reliability information to plan their trips in advance, in order to better manage trip time, reduce cost, and avoid frustration.
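The bootstrap-based accuracy assessment described above lends itself to a short illustration. The following Python sketch (synthetic travel times, illustrative sample size and measure choices, not the dissertation's exact framework) attaches bootstrap confidence intervals to one moment-based and one percentile-based TTR measure:

```python
# Hedged sketch (not the dissertation's exact framework): bootstrap confidence
# intervals for two travel-time-reliability measures on synthetic corridor data.
import numpy as np

rng = np.random.default_rng(7)
# Synthetic travel times (minutes) for one corridor / time-of-day, 5 weeks of weekdays.
travel_times = rng.lognormal(mean=np.log(12.0), sigma=0.25, size=5 * 5)

def ttr_measures(x):
    """Moment-based and percentile-based reliability measures."""
    return {
        "std_dev": x.std(ddof=1),                             # moment-based
        "planning_time_index": np.percentile(x, 95) / x.mean(),  # percentile-based
    }

# Nonparametric bootstrap: resample days with replacement, recompute measures.
B = 2000
boot = {k: np.empty(B) for k in ("std_dev", "planning_time_index")}
for b in range(B):
    sample = rng.choice(travel_times, size=travel_times.size, replace=True)
    for k, v in ttr_measures(sample).items():
        boot[k][b] = v

for k, draws in boot.items():
    lo, hi = np.percentile(draws, [2.5, 97.5])
    print(f"{k}: point={ttr_measures(travel_times)[k]:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

How the width of such intervals changes as more weeks of data are added is the kind of accuracy signal the proposed framework is designed to track.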
3

Model-based Learning: t-Families, Variable Selection, and Parameter Estimation

Andrews, Jeffrey Lambert 27 August 2012 (has links)
The phrase model-based learning describes the use of mixture models in machine learning problems. This thesis focuses on a number of issues surrounding the use of mixture models in statistical learning tasks, including clustering, classification, discriminant analysis, variable selection, and parameter estimation. After motivating the importance of statistical learning via mixture models, five papers are presented. For ease of consumption, the papers are organized into three parts: mixtures of multivariate t-families, variable selection, and parameter estimation. / Natural Sciences and Engineering Research Council of Canada through a doctoral postgraduate scholarship.
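As a minimal, hedged illustration of the model-based clustering idea (a Gaussian mixture stands in here for the multivariate t-families treated in the thesis; data, seed, and component count are synthetic):

```python
# Hedged illustration of model-based clustering: each cluster is one component of a
# finite mixture, and cluster labels come from posterior component membership. A
# Gaussian mixture stands in for the multivariate t-families studied in the thesis.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.8, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)          # hard clustering from the fitted mixture
resp = gmm.predict_proba(X)      # soft memberships, as used inside the EM algorithm
print(gmm.means_)
print(labels[:5], resp[:2])
```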
4

Joint Posterior Inference for Latent Gaussian Models and extended strategies using INLA

Chiuchiolo, Cristian 06 June 2022 (has links)
Bayesian inference is particularly challenging for hierarchical statistical models, as computational complexity becomes a significant issue. Sampling-based methods like the popular Markov Chain Monte Carlo (MCMC) can provide accurate solutions, but they can carry a high computational burden. An attractive alternative is the Integrated Nested Laplace Approximations (INLA) approach, which is faster when applied to the broad class of Latent Gaussian Models (LGMs). The method computes fast and empirically accurate deterministic posterior marginal approximations of the model's unknown parameters. In the first part of this thesis, we discuss how to extend the software's applicability to joint posterior inference by constructing a new class of joint posterior approximations, which also add marginal corrections for location and skewness. As these approximations result from a combination of a Gaussian Copula and internally pre-computed accurate Gaussian Approximations, we name this class Skew Gaussian Copula (SGC). By computing moments and the correlation structure of a mixture representation of these distributions, we achieve new fast and accurate deterministic approximations for linear combinations in a subset of the model's latent field. The same mixture approximates a full joint posterior density through Monte Carlo sampling on the hyperparameter set. We set up highly skewed examples based on Poisson and Binomial hierarchical models and verify these new approximations using INLA and MCMC. The new skewness correction from the Skew Gaussian Copula is more consistent with the outcomes provided by the default INLA strategies. In the last part, we propose an extension of the parametric fit employed by the Simplified Laplace Approximation strategy in INLA when approximating posterior marginals. By default, the strategy matches log derivatives from a third-order Taylor expansion of each Laplace Approximation marginal with those derived from Skew Normal distributions. We consider a fourth-order term and adapt an Extended Skew Normal distribution to produce a more accurate approximation fit when skewness is large. We set up similarly skewed data simulations with Poisson and Binomial likelihoods and show that the posterior marginal results from the new extended strategy are more accurate and coherent with the MCMC ones than its original version.
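As a much-simplified sketch of the basic building block behind INLA, a Laplace (Gaussian) approximation of a posterior for a single latent quantity under a Poisson likelihood can be computed as follows; the counts, prior, and finite-difference Hessian are illustrative, and this is neither the INLA software nor the Skew Gaussian Copula construction:

```python
# Much-simplified, hedged sketch of the building block behind INLA: a Laplace
# (Gaussian) approximation of a posterior, here for a single log-rate eta with a
# Gaussian prior and a Poisson likelihood. Not the INLA software and not the Skew
# Gaussian Copula construction; the counts and prior are illustrative.
import numpy as np
from scipy import optimize
from scipy.stats import norm, poisson

y = np.array([3, 0, 2, 1, 4])            # illustrative Poisson counts
prior_mean, prior_sd = 0.0, 1.0          # prior on eta = log(rate)

def neg_log_post(eta):
    lam = np.exp(eta)
    return -(np.sum(poisson.logpmf(y, lam)) + norm.logpdf(eta, prior_mean, prior_sd))

# The posterior mode and the curvature at the mode define the Gaussian approximation.
mode = optimize.minimize_scalar(neg_log_post).x
h = 1e-4
hess = (neg_log_post(mode + h) - 2 * neg_log_post(mode) + neg_log_post(mode - h)) / h**2
laplace_sd = 1.0 / np.sqrt(hess)

print(f"Laplace approximation: eta | y ~ N({mode:.3f}, {laplace_sd:.3f}^2)")
```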
5

Sparse Latent-Space Learning for High-Dimensional Data: Extensions and Applications

White, Alexander James 05 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / The successful treatment and potential eradication of many complex diseases, such as cancer, begins with elucidating the convoluted mapping of molecular profiles to phenotypical manifestation. Our observed molecular profiles (e.g., genomics, transcriptomics, epigenomics) are often high-dimensional and are collected from patient samples falling into heterogeneous disease subtypes. Interpretable learning from such data calls for sparsity-driven models. This dissertation addresses the high dimensionality, sparsity, and heterogeneity issues when analyzing multiple-omics data, where each method is implemented with a concomitant R package. First, we examine challenges in submatrix identification, which aims to find subgroups of samples that behave similarly across a subset of features. We resolve issues such as two-way sparsity, non-orthogonality, and parameter tuning with an adaptive thresholding procedure on the singular vectors computed via orthogonal iteration. We validate the method with simulation analysis and apply it to an Alzheimer’s disease dataset. The second project focuses on modeling relationships between large, matched datasets. Exploring regressional structures between large data sets can provide insights such as the effect of long-range epigenetic influences on gene expression. We present a high-dimensional version of mixture multivariate regression to detect patient clusters, each with different correlation structures of matched-omics datasets. Results are validated via simulation and applied to matched-omics data sets. In the third project, we introduce a novel approach to modeling spatial transcriptomics (ST) data with a spatially penalized multinomial model of the expression counts. This method recovers the low-rank structure of zero-inflated ST data under spatial smoothness constraints. We validate the model using manual cell structure annotations of human brain samples. We then apply this technique to additional ST datasets. / 2025-05-22
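As a hedged sketch of the submatrix-identification step described above (leading singular vectors via power iteration, then thresholding small entries), with a planted signal block and an illustrative thresholding rule rather than the dissertation's adaptive procedure:

```python
# Hedged sketch of the submatrix-identification idea: estimate the leading singular
# vectors by power iteration, then threshold small entries so that only a subset of
# samples (rows) and features (columns) stays active. The planted block and the
# 3-times-median thresholding rule are illustrative.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))      # background noise
X[:20, :30] += 2.0                   # planted submatrix signal

# Leading left/right singular vectors via power iteration.
u = rng.normal(size=100)
for _ in range(50):
    v = X.T @ u
    v /= np.linalg.norm(v)
    u = X @ v
    u /= np.linalg.norm(u)

# Simple data-driven threshold: keep entries well above the median magnitude.
row_mask = np.abs(u) > 3 * np.median(np.abs(u))
col_mask = np.abs(v) > 3 * np.median(np.abs(v))
print("rows kept:", np.where(row_mask)[0])
print("cols kept:", np.where(col_mask)[0])
```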
6

Exact Markov Chain Monte Carlo for a Class of Diffusions

Qi Wang (14157183) 05 December 2022 (has links)
This dissertation focuses on the simulation efficiency of the Markov process for two scenarios: stochastic differential equations (SDEs) and simulated weather data.

For SDEs, we propose a novel Gibbs sampling algorithm that allows sampling from a particular class of SDEs without any discretization error and show that the proposed algorithm improves the sampling efficiency by orders of magnitude over existing popular algorithms.

In the weather data simulation study, we investigate how representative the simulated data are for three popular stochastic weather generators. Our results suggest the need for more than a single realization when generating weather data to obtain suitable representations of climate.
7

An Efficient Implementation of a Robust Clustering Algorithm

Blostein, Martin January 2016 (has links)
Clustering and classification are fundamental problems in statistical and machine learning, with a broad range of applications. A common approach is the Gaussian mixture model, which assumes that each cluster or class arises from a distinct Gaussian distribution. This thesis studies a robust, high-dimensional extension of the Gaussian mixture model that automatically detects outliers and noise, and a computationally efficient implementation thereof. The contaminated Gaussian distribution is a robust elliptical distribution that allows for automatic detection of "bad points", and is used to make the usual factor analysis model robust. In turn, the mixtures of contaminated Gaussian factor analyzers (MCGFA) algorithm allows high-dimensional, robust clustering, classification and detection of bad points. A family of MCGFA models is created through the introduction of different constraints on the covariance structure. A new, efficient implementation of the algorithm is presented, along with an account of its development. The fast implementation permits thorough testing of the MCGFA algorithm, and its performance is compared to two natural competitors: parsimonious Gaussian mixture models (PGMM) and mixtures of modified t factor analyzers (MMtFA). The algorithms are tested systematically on simulated and real data. / Thesis / Master of Science (MSc)
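A rough stand-in for the behavior described above (not the MCGFA algorithm): fit an ordinary Gaussian mixture and flag the lowest-density points as candidate "bad points", which mimics the automatic outlier detection that contaminated mixtures perform internally. The data, seed, and 5% flagging rule are illustrative:

```python
# Hedged stand-in (not the MCGFA algorithm): fit an ordinary Gaussian mixture and
# flag points with unusually low density as candidate "bad points", illustrating
# the automatic outlier detection that contaminated mixtures perform internally.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
clean = np.vstack([rng.normal((0, 0), 0.5, (200, 2)),
                   rng.normal((4, 4), 0.5, (200, 2))])
noise = rng.uniform(-4, 8, (20, 2))              # scattered contamination
X = np.vstack([clean, noise])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_dens = gmm.score_samples(X)                  # per-point log-likelihood
threshold = np.quantile(log_dens, 0.05)          # flag the lowest-density 5%
bad = log_dens < threshold
print(f"{bad.sum()} points flagged as potential outliers")
```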
8

Sparse Deep Learning and Stochastic Neural Network

Yan Sun (12425889) 13 May 2022 (has links)
Deep learning has achieved state-of-the-art performance on many machine learning tasks, but the deep neural network (DNN) model still suffers from a few issues. An over-parametrized neural network generally has a better optimization landscape, but it is computationally expensive, hard to interpret, and usually cannot correctly quantify prediction uncertainty. On the other hand, a small DNN model can suffer from local traps and be hard to optimize. In this dissertation, we tackle these issues from two directions: sparse deep learning and stochastic neural networks.

For sparse deep learning, we propose a Bayesian neural network (BNN) model with a mixture-of-normals prior. Theoretically, we establish posterior consistency and structure selection consistency, which ensure that the sparse DNN model can be consistently identified. We also demonstrate the asymptotic normality of the prediction, which ensures that the prediction uncertainty is correctly quantified. Computationally, we propose a prior annealing approach to optimize the posterior of the BNN. The proposed methods have computational complexity similar to the standard stochastic gradient descent method for training DNNs. Experimental results show that our model performs well on high-dimensional variable selection as well as neural network pruning.

For stochastic neural networks, we propose a Kernel-Expanded Stochastic Neural Network model, or K-StoNet model for short. We reformulate the DNN as a latent variable model and incorporate support vector regression (SVR) as the first hidden layer. The latent variable formulation breaks the training into a series of convex optimization problems, and the model can be easily trained using the imputation-regularized optimization (IRO) algorithm. We provide theoretical guarantees for the convergence of the algorithm and the prediction uncertainty quantification. Experimental results show that the proposed model can achieve good prediction performance and provide correct confidence regions for prediction.
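As a small, hedged sketch of the mixture-of-normals ("spike-and-slab"-style) prior mentioned above, written as a standalone log-prior in NumPy rather than the dissertation's BNN training procedure; the mixing weight and component scales are illustrative:

```python
# Hedged sketch of a mixture-of-normals prior on network weights: a narrow component
# pushes most weights toward zero (sparsity), a wide component keeps the important
# ones. Shown as a standalone log-prior penalty; not the dissertation's BNN trainer.
import numpy as np

def mixture_normal_log_prior(w, pi=0.9, sigma_spike=0.01, sigma_slab=1.0):
    """Log density of w under pi*N(0, sigma_spike^2) + (1-pi)*N(0, sigma_slab^2)."""
    def log_norm(x, s):
        return -0.5 * (x / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))
    a = np.log(pi) + log_norm(w, sigma_spike)
    b = np.log(1 - pi) + log_norm(w, sigma_slab)
    return np.logaddexp(a, b).sum()          # stable log-sum-exp over the two components

weights = np.array([0.002, -0.004, 0.8, 0.001, -1.2])
# In MAP-style training, -mixture_normal_log_prior(weights) would be added to the loss.
print(mixture_normal_log_prior(weights))
```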
9

Particle-based Parameter Inference in Stochastic Volatility Models: Batch vs. Online / Partikelbaserad parameterskattning i stokastiska volatilitetsmodeller: batch vs. online

Toft, Albin January 2019 (has links)
This thesis focuses on comparing an online parameter estimator to an offline estimator, both based on the PaRIS algorithm, when estimating parameter values for a stochastic volatility model. By modeling the stochastic volatility model as a hidden Markov model, estimators based on particle filters can be implemented in order to estimate the unknown parameters of the model. The results from this thesis imply that the proposed online estimator could be considered superior to the offline counterpart. The results are, however, somewhat inconclusive, and further research regarding the subject is recommended. / This thesis focuses on comparing an online and an offline parameter estimator in stochastic volatility models. The two parameter estimators compared are both based on the PaRIS algorithm. By modeling a stochastic volatility model as a hidden Markov chain, particle-based parameter estimators could be used to estimate the unknown parameters of the model. The results presented in this thesis indicate that the online implementation of the PaRIS algorithm can be regarded as the better alternative compared with the offline implementation. The results are, however, not entirely conclusive, and further research in the area is recommended.
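As a hedged illustration of the particle-filtering machinery referred to above (a plain bootstrap particle filter, not the PaRIS algorithm), for a standard stochastic volatility model with illustrative parameter values:

```python
# Hedged sketch (not the PaRIS algorithm): a plain bootstrap particle filter for the
# standard stochastic volatility model
#   x_t = phi * x_{t-1} + sigma * v_t,   y_t = beta * exp(x_t / 2) * e_t,
# used here to estimate the log-likelihood for a fixed parameter value.
import numpy as np

rng = np.random.default_rng(42)
phi, sigma, beta = 0.95, 0.3, 0.7            # illustrative "true" parameters
T = 200

# Simulate data from the model.
x = np.zeros(T)
x[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))
for t in range(1, T):
    x[t] = phi * x[t - 1] + sigma * rng.normal()
y = beta * np.exp(x / 2) * rng.normal(size=T)

def bootstrap_pf_loglik(y, phi, sigma, beta, N=1000):
    """Particle-filter estimate of log p(y_{1:T} | theta)."""
    particles = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), N)   # stationary init
    loglik = 0.0
    for t in range(len(y)):
        if t > 0:                                # propagate through the state equation
            particles = phi * particles + sigma * rng.normal(size=N)
        sd = beta * np.exp(particles / 2)        # observation standard deviation
        logw = -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * (y[t] / sd) ** 2
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())           # incremental likelihood contribution
        particles = rng.choice(particles, size=N, p=w / w.sum())  # multinomial resampling
    return loglik

print("estimated log-likelihood:", bootstrap_pf_loglik(y, phi, sigma, beta))
```

Particle-based parameter estimation broadly builds on filters like this one; PaRIS additionally approximates smoothed additive functionals online, which is what the estimators compared in the thesis rely on.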
10

Adaptive Sampling Methods for Stochastic Optimization

Daniel Andres Vasquez Carvajal (10631270) 08 December 2022 (has links)
This dissertation investigates the use of sampling methods for solving stochastic optimization problems using iterative algorithms. Two sampling paradigms are considered: (i) adaptive sampling, where, before each iterate update, the sample size for estimating the objective function and the gradient is adaptively chosen; and (ii) retrospective approximation (RA), where iterate updates are performed using a chosen fixed sample size for as long as progress is deemed statistically significant, at which time the sample size is increased. We investigate adaptive sampling within the context of a trust-region framework for solving stochastic optimization problems in $\mathbb{R}^d$, and retrospective approximation within the broader context of solving stochastic optimization problems on a Hilbert space.

In the first part of the dissertation, we propose Adaptive Sampling Trust-Region Optimization (ASTRO), a class of derivative-based stochastic trust-region (TR) algorithms developed to solve smooth stochastic unconstrained optimization problems in $\mathbb{R}^{d}$ where the objective function and its gradient are observable only through a noisy oracle or using a large dataset. Efficiency in ASTRO stems from two key aspects: (i) adaptive sampling, which ensures that the objective function and its gradient are sampled only to the extent needed, so that small sample sizes are chosen when the iterates are far from a critical point and large sample sizes are chosen when the iterates are near a critical point; and (ii) quasi-Newton Hessian updates using BFGS. We prove three main results for ASTRO and for general stochastic trust-region methods that estimate function and gradient values adaptively, using sample sizes that are stopping times with respect to the sigma algebra of the generated observations. The first asserts strong consistency when the adaptive sample sizes have a mild logarithmic lower bound, assuming that the oracle errors are light-tailed. The second and third results characterize the iteration and oracle complexities in terms of certain risk functions. Specifically, the second result asserts that the best achievable $\mathcal{O}(\epsilon^{-1})$ iteration complexity (of squared gradient norm) is attained when the total relative risk associated with the adaptive sample size sequence is finite; and the third result characterizes the corresponding oracle complexity in terms of the total generalized risk associated with the adaptive sample size sequence. We report encouraging numerical results in certain settings.

In the second part of this dissertation, we consider the use of RA as an alternate adaptive sampling paradigm to solve smooth stochastic constrained optimization problems in infinite-dimensional Hilbert spaces. RA generates a sequence of subsampled deterministic infinite-dimensional problems that are approximately solved within a dynamic error tolerance. The bottleneck in RA becomes solving this sequence of problems efficiently. To this end, we propose a progressive subspace expansion (PSE) framework to solve smooth deterministic optimization problems in infinite-dimensional Hilbert spaces with a trust-region sequential quadratic programming (TR-SQP) solver. The infinite-dimensional optimization problem is discretized, and a sequence of finite-dimensional problems is solved in which the problem dimension is progressively increased. Additionally, (i) we solve this sequence of finite-dimensional problems only to the extent necessary, i.e., we spend just enough computational work to solve each problem within a dynamic error tolerance, and (ii) we use the solution of the current optimization problem as the initial guess for the subsequent problem. We prove two main results for PSE. The first establishes convergence to a first-order critical point for a subsequence of iterates generated by the PSE TR-SQP algorithm. The second characterizes the relationship between the error tolerance and the problem dimension, and provides an oracle complexity result for the total amount of computational work incurred by PSE. This amount of computational work is closely connected to three quantities: the convergence rate of the finite-dimensional spaces to the infinite-dimensional space, the rate of increase of the cost of making oracle calls in finite-dimensional spaces, and the convergence rate of the solution method used. We also show encouraging numerical results on an optimal control problem supporting our theoretical findings.
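As a hedged sketch of the adaptive-sampling principle described above, grow the Monte Carlo sample size until the gradient estimate's standard error is small relative to its norm, so that sample sizes stay small far from a critical point and grow near one. The sketch uses a plain gradient step rather than ASTRO's trust region, and the test problem, constants, and stopping rule are illustrative:

```python
# Hedged sketch of the adaptive-sampling idea (not the ASTRO algorithm): estimate the
# gradient of E[F(x, xi)] by Monte Carlo, and grow the per-iteration sample size until
# the estimated standard error is small relative to the estimated gradient norm.
import numpy as np

rng = np.random.default_rng(0)

def stochastic_grad(x, n):
    """n noisy gradient observations of f(x) = 0.5*||x||^2 with additive noise xi."""
    noise = rng.normal(scale=1.0, size=(n, x.size))
    return x + noise                      # each row is one gradient observation

x = np.array([5.0, -3.0])
step, kappa = 0.2, 1.0                    # kappa controls the required relative accuracy
n = 10                                    # initial sample size

for k in range(50):
    while True:
        g = stochastic_grad(x, n)
        g_bar = g.mean(axis=0)
        se = np.linalg.norm(g.std(axis=0, ddof=1)) / np.sqrt(n)
        if se <= kappa * np.linalg.norm(g_bar) or n >= 100_000:
            break                         # sample is adequate (or budget cap reached)
        n *= 2                            # otherwise, adaptively increase the sample size
    x = x - step * g_bar                  # plain gradient step (ASTRO uses a trust region)

print("final iterate:", x, "final sample size:", n)
```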
