
RISK INTERPRETATION OF DIFFERENTIAL PRIVACY

Jiajun Liang (13190613) 31 July 2023 (has links)
<p><br></p><p>How to set privacy parameters is a crucial problem for the consistent application of DP in practice. Current privacy parameters do not provide direct guidance on this question. Moreover, different databases may leak different amounts of information, allowing attackers to enhance their attacks with the available information. This dissertation provides an additional interpretation of the current DP notions by introducing a framework that directly considers the worst-case average failure probability of attackers under different levels of knowledge. </p><p><br></p><p>To achieve this, we introduce a novel measure of attacker knowledge and establish a dual relationship between (type I error, type II error) and (prior, average failure probability). Leveraging this framework, we propose an interpretable paradigm for consistently setting privacy parameters on different databases with varying levels of leaked information. </p><p><br></p><p>Furthermore, we characterize the minimax limit of private parameter estimation, driven by $1/(n(1-2p)^2)+1/n$, where $p$ represents the worst-case probability risk and $n$ is the number of data points. This characterization is more interpretable than the current lower bound $\min\{1/(n\epsilon^2),1/(n\delta^2)\}+1/n$ for $(\epsilon,\delta)$-DP. Additionally, we identify the phase transition of private parameter estimation based on this limit and provide suggestions for protocol designs that achieve optimal private estimation. </p><p><br></p><p>Last, we consider a federated learning setting where the data are stored in a distributed manner and privacy-preserving interactions are required. We extend the proposed interpretation to federated learning, considering two scenarios: protecting against privacy breaches at the local nodes and at the center. Specifically, we consider a non-convex sparse federated parameter estimation problem and apply it to generalized linear models. We tackle two challenges in this setting. First, initialization is difficult because the privacy requirements limit the number of queries to the database. Second, we must overcome the heterogeneity in the distributions of the local nodes to identify low-dimensional structures.</p>
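The dual relationship between testing errors and the attacker's average failure probability can be made concrete on the simplest ε-DP mechanism, binary randomized response. The sketch below is an illustrative aside, not the dissertation's framework: it computes the Bayes-optimal attacker's average failure probability for a given ε and prior.

```python
import numpy as np

def attacker_failure(eps, prior):
    """Average failure probability of a Bayes-optimal attacker against binary
    randomized response (report the true bit with prob e^eps / (1 + e^eps)),
    when the attacker's prior is P(X = 1) = prior."""
    q = 1.0 / (1.0 + np.exp(eps))  # flip probability of the mechanism
    fail = 0.0
    for y in (0, 1):
        # joint probabilities P(X = x, Y = y); the attacker guesses the larger,
        # so it fails with probability equal to the smaller of the two
        p1 = prior * ((1 - q) if y == 1 else q)
        p0 = (1 - prior) * (q if y == 1 else (1 - q))
        fail += min(p0, p1)
    return fail
```

At ε = 0 the output is independent of the data and the failure probability equals min(prior, 1 − prior); it decays toward 0 as ε grows, which is the kind of direct risk statement the abstract argues privacy parameters should support.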

Pragmatic Statistical Approaches for Power Analysis, Causal Inference, and Biomarker Detection

Fan Wu (16536675) 26 July 2023 (has links)
<p>Mediation analyses play a critical role in social and personality psychology research. However, current approaches for assessing power and sample size in mediation models have limitations, particularly when dealing with complex mediation models and multiple mediator sequential models. These limitations stem from limited software options and the substantial computational time required. In this part, we address these challenges by extending the joint significance test and product of coefficients test to incorporate the fourth-pathed mediated effect and generalized kth-pathed mediated effect. Additionally, we propose a model-based bootstrap method and provide convenient R tools for estimating power in complex mediation models. Through our research, we demonstrate that power decreases as the number of mediators increases and as the influence of coefficients varies. We summarize our results and discuss the implications of power analysis in relation to mediator complexity and coefficient influence. We provide insights for researchers seeking to optimize study designs and enhance the reliability of their findings in complex mediation models. </p> <p>Matching is a crucial step in causal inference, as it allows for more robust and reasonable analyses by creating better-matched pairs. However, in real-world scenarios, data are often collected and stored by different local institutions or separate departments, posing challenges for effective matching due to data fragmentation. Additionally, the harmonization of such data needs to prioritize privacy preservation. In this part, we propose a new hierarchical framework that addresses these challenges by implementing differential privacy on raw data to protect sensitive information while maintaining data utility. We also design a data access control system with three different access levels for designers based on their roles, ensuring secure and controlled access to the matched datasets. 
Simulation studies and analyses of datasets from the 2017 Atlantic Causal Inference Conference Data Challenge are conducted to showcase the flexibility and utility of our framework. Through this research, we contribute to the advancement of statistical methodologies in matching and privacy-preserving data analysis, offering a practical solution for data integration and privacy protection in causal inference studies. </p> <p>Biomarker discovery is a complex and resource-intensive process, encompassing discovery, qualification, verification, and validation stages prior to clinical evaluation. Streamlining this process by efficiently identifying relevant biomarkers in the discovery phase holds immense value. In this part, we present a likelihood ratio-based approach to accurately identify truly relevant protein markers in discovery studies. Leveraging the observation of unimodal underlying distributions of expression profiles for irrelevant markers, our method demonstrates promising performance when evaluated on real experimental data. Additionally, to address non-normal scenarios, we introduce a kernel ratio-based approach, which we evaluate using non-normal simulation settings. Through extensive simulations, we observe the high effectiveness of the kernel method in discovering the set of truly relevant markers, resulting in precise biomarker identifications with elevated sensitivity and a low empirical false discovery rate.  </p>
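The joint significance test mentioned above can be illustrated with a Monte Carlo power estimate for the simplest one-mediator chain X → M → Y; the generalized kth-pathed version in the text checks every path in a longer chain. This is a hypothetical sketch (complete mediation, normal errors, and the z critical value 1.96 in place of the t quantile), not the authors' R tools:

```python
import numpy as np

def ols_t(x, y):
    # slope t-statistic for the simple regression y ~ x (with intercept)
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    b = (xc @ yc) / (xc @ xc)
    resid = yc - b * xc
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    return b / se

def joint_sig_power(a=0.3, b=0.3, n=200, reps=500, z=1.96, rng=None):
    """Power of the joint significance test: the mediated effect a*b is
    declared significant when both path coefficients are significant."""
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        m = a * x + rng.normal(size=n)   # X -> M path
        y = b * m + rng.normal(size=n)   # M -> Y path
        if abs(ols_t(x, m)) > z and abs(ols_t(m, y)) > z:
            hits += 1
    return hits / reps

power = joint_sig_power(rng=0)  # estimated power for a = b = 0.3, n = 200
```

Repeating this with longer mediator chains reproduces the pattern reported above: each extra significance requirement can only lower the joint power.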

GRAPH-BASED ANALYSIS OF NON-RANDOM MISSING DATA PROBLEMS WITH LOW-RANK NATURE: STRUCTURED PREDICTION, MATRIX COMPLETION AND SPARSE PCA

Hanbyul Lee (17586345) 09 December 2023 (has links)
<p dir="ltr">In most theoretical studies on missing data analysis, data is typically assumed to be missing according to a specific probabilistic model. However, such assumption may not accurately reflect real-world situations, and sometimes missing is not purely random. In this thesis, our focus is on analyzing incomplete data matrices without relying on any probabilistic model assumptions for the missing schemes. To characterize a missing scheme deterministically, we employ a graph whose adjacency matrix is a binary matrix that indicates whether each matrix entry is observed or not. Leveraging its graph properties, we mathematically represent the missing pattern of an incomplete data matrix and conduct a theoretical analysis of how this non-random missing pattern affects the solvability of specific problems related to incomplete data. This dissertation primarily focuses on three types of incomplete data problems characterized by their low-rank nature: structured prediction, matrix completion, and sparse PCA.</p><p dir="ltr">First, we investigate a basic structured prediction problem, which involves recovering binary node labels on a fixed undirected graph, where noisy binary observations corresponding to edges are given. Essentially, this setting parallels a simple binary rank-1 symmetric matrix completion problem, where missing entries are determined by a fixed undirected graph. Our aim is to establish the fundamental limit bounds of this problem, revealing a close association between the limits and graph properties, such as connectivity.</p><p dir="ltr">Second, we move on to the general low-rank matrix completion problem. In this study, we establish provable guarantees for exact and approximate low-rank matrix completion problems that can be applied to any non-random missing pattern, by utilizing the observation graph corresponding to the missing scheme. 
We theoretically and experimentally show that the standard constrained nuclear norm minimization algorithm can successfully recover the true matrix when the observation graph is well-connected and has similar node degrees. We also verify that matrix completion is achievable with a near-optimal sample complexity rate when the observation graph has uniform node degrees and its adjacency matrix has a large spectral gap.</p><p dir="ltr">Finally, we address the sparse PCA problem, featuring an approximate low-rank attribute. Missing data is common in situations where sparse PCA is useful, such as single-cell RNA sequence data analysis. We propose a semidefinite relaxation of the non-convex $\ell_1$-regularized PCA problem to solve sparse PCA on incomplete data. We demonstrate that the method is particularly effective when the observation pattern has favorable properties. Our theory is substantiated through synthetic and real data analysis, showcasing the superior performance of our algorithm compared to other sparse PCA approaches, especially when the observed data pattern has specific characteristics.</p>
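As a sketch of the nuclear-norm machinery discussed above, here is a soft-impute-style proximal iteration for the regularized cousin of the constrained problem, run on an arbitrary 70%-observed mask rather than a structured observation graph. The regularization weight and iteration count are illustrative choices, not values from the thesis:

```python
import numpy as np

def soft_impute(M, mask, lam=0.5, n_iter=300):
    """Nuclear-norm-regularized completion: repeatedly fill the missing
    entries with the current estimate, then soft-threshold the singular
    values (the proximal step for the nuclear norm)."""
    X = np.where(mask, M, 0.0)
    for _ in range(n_iter):
        filled = np.where(mask, M, X)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        X = (U * np.maximum(s - lam, 0.0)) @ Vt
    return X

rng = np.random.default_rng(0)
u, v = rng.normal(size=20), rng.normal(size=20)
M = np.outer(u, v)                    # rank-1 ground truth
mask = rng.random(M.shape) < 0.7      # ~70% of entries observed
X_hat = soft_impute(M, mask)
rel_err = np.linalg.norm(X_hat - M) / np.linalg.norm(M)
```

With a random mask the node degrees of the observation graph are roughly uniform, which is the favorable regime identified in the text; adversarially concentrated missingness (e.g., an entire unobserved row) makes recovery impossible regardless of the algorithm.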

<b>Deep Neural Network Structural Vulnerabilities And Remedial Measures</b>

Yitao Li (9148706) 02 December 2023 (has links)
<p dir="ltr">In the realm of deep learning and neural networks, there has been substantial advancement, but the persistent DNN vulnerability to adversarial attacks has prompted the search for more efficient defense strategies. Unfortunately, this becomes an arms race. Stronger attacks are being develops, while more sophisticated defense strategies are being proposed, which either require modifying the model's structure or incurring significant computational costs during training. The first part of the work makes a significant progress towards breaking this arms race. Let’s consider natural images, where all the feature values are discrete. Our proposed metrics are able to discover all the vulnerabilities surrounding a given natural image. Given sufficient computation resource, we are able to discover all the adversarial examples given one clean natural image, eliminating the need to develop new attacks. For remedial measures, our approach is to introduce a random factor into DNN classification process. Furthermore, our approach can be combined with existing defense strategy, such as adversarial training, to further improve performance.</p>

An Evaluation of Approaches for Generative Adversarial Network Overfitting Detection

Tung Tien Vu (12091421) 20 November 2023 (has links)
<p dir="ltr">Generating images from training samples solves the challenge of imbalanced data. It provides the necessary data to run machine learning algorithms for image classification, anomaly detection, and pattern recognition tasks. In medical settings, having imbalanced data results in higher false negatives due to a lack of positive samples. Generative Adversarial Networks (GANs) have been widely adopted for image generation. GANs allow models to train without computing intractable probability while producing high-quality images. However, evaluating GANs has been challenging for the researchers due to a need for an objective function. Most studies assess the quality of generated images and the variety of classes those images cover. Overfitting of training images, however, has received less attention from researchers. When the generated images are mere copies of the training data, GAN models will overfit and will not generalize well. This study examines the ability to detect overfitting of popular metrics: Maximum Mean Discrepancy (MMD) and Fréchet Inception Distance (FID). We investigate the metrics on two types of data: handwritten digits and chest x-ray images using Analysis of Variance (ANOVA) models.</p>

Statistical Methods for Offline Deep Reinforcement Learning

Danyang Wang (18414336) 20 April 2024 (has links)
<p dir="ltr">Reinforcement learning (RL) has been a rapidly evolving field of research over the past years, enhancing developments in areas such as artificial intelligence, healthcare, and education, to name a few. Regardless of the success of RL, its inherent online learning nature presents obstacles for its real-world applications, since in many settings, online data collection with the latest learned policy can be expensive and/or dangerous (such as robotics, healthcare, and autonomous driving). This challenge has catalyzed research into offline RL, which involves reinforcement learning from previously collected static datasets, without the need for further online data collection. However, most existing offline RL methods depend on two key assumptions: unconfoundedness and positivity (also known as the full-coverage assumption), which frequently do not hold in the context of static datasets. </p><p dir="ltr">In the first part of this dissertation, we simultaneously address these two challenges by proposing a novel policy learning algorithm: PESsimistic CAusal Learning (PESCAL). We utilize the mediator variable based on Front-Door Criterion, to remove the confounding bias. Additionally, we adopt the pessimistic principle to tackle the distributional shift problem induced by the under-coverage issue. This issue refers to the mismatch of distributions between the action distributions induced by candidate policies, and the policy that generates the observational data (known as the behavior policy). Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm, by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. 
Moreover, we provide theoretical guarantees for the proposed algorithms and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.</p><p dir="ltr">In the second part of this dissertation, in contrast to the first part, which approaches the distributional shift issue implicitly by penalizing the value function as a whole, we explicitly constrain the learned policy not to deviate significantly from the behavior policy, while still enabling flexible adjustment of the degree of constraint. Building upon the offline reinforcement learning algorithm TD3+BC (Fujimoto and Gu, 2021), we propose a model-free actor-critic algorithm with an adjustable behavior cloning (BC) term. We employ an ensemble of networks to quantify the uncertainty of the estimated value function, thus addressing the issue of overestimation. Moreover, we introduce a convenient and intuitively simple method for controlling the degree of BC, through a Bernoulli random variable based on a user-specified confidence level for different offline datasets. Our proposed algorithm, named Ensemble-based Actor Critic with Adaptive Behavior Cloning (EABC), is straightforward to implement, exhibits low variance, and achieves strong performance across all D4RL benchmarks.</p>
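The shape of a TD3+BC-style actor objective with a Bernoulli gate can be sketched on dummy arrays. This is a guess at the structure, not the actual EABC implementation: the λ normalization follows the published TD3+BC objective, while the Bernoulli switch on the BC penalty is an assumed reading of the "user-specified confidence level" mechanism described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# dummy batch: critic values Q(s, pi(s)), policy actions, and dataset actions
q_vals = rng.normal(loc=5.0, size=32)   # stand-in critic estimates
pi_a = rng.normal(size=(32, 4))         # actions proposed by the current policy
data_a = rng.normal(size=(32, 4))       # behavior-policy actions from the offline dataset

alpha = 2.5                             # TD3+BC weighting hyperparameter
lam = alpha / np.abs(q_vals).mean()     # scale-invariant weight on the Q term (TD3+BC)

conf = 0.7                              # user-specified confidence in the dataset
gate = int(rng.binomial(1, conf))       # assumed Bernoulli switch on the BC term

bc = ((pi_a - data_a) ** 2).sum(axis=1).mean()  # behavior cloning penalty
actor_loss = -lam * q_vals.mean() + gate * bc   # maximize Q while staying near the data
```

Averaged over gradient steps, the gate applies the BC penalty with probability `conf`, so a high-quality dataset (large `conf`) keeps the policy close to the data while a low-quality one loosens the constraint.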

GENERATIVE MODELS WITH MARGINAL CONSTRAINTS

Bingjing Tang (16380291) 16 June 2023 (has links)
<p> Generative models form powerful tools for learning data distributions and simulating new samples. Recent years have seen significant advances in the flexibility and applicability of such models, with Bayesian approaches like nonparametric Bayesian models and deep neural network models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) finding use in a wide range of domains. However, the black-box nature of these models means that they are often hard to interpret, and they often carry modeling implications that are inconsistent with side knowledge arising from domain expertise. This thesis studies situations where the modeler has side knowledge represented as probability distributions on functionals of the objects being modeled, and we study methods to incorporate this particular kind of side knowledge into flexible generative models. The dissertation covers three main parts. </p> <p><br></p> <p>The first part focuses on incorporating a special case of the aforementioned side knowledge into flexible nonparametric Bayesian models. Practitioners often have additional distributional information about a subset of the coordinates of the observations being modeled. The flexibility of nonparametric Bayesian models usually implies incompatibility with this side information, and such inconsistency necessitates methods for incorporating the side knowledge into flexible nonparametric Bayesian models. We design a specialized generative process to build in this side knowledge and propose a novel sigmoid Gaussian process conditional model. We also develop a corresponding posterior sampling method based on data augmentation to overcome a doubly intractable problem. We illustrate the efficacy of our proposed constrained nonparametric Bayesian model in a variety of real-world scenarios, including modeling environmental and earthquake data.
</p> <p><br></p> <p>The second part of the dissertation discusses neural network approaches to satisfying the general side knowledge described above. Furthermore, the generative models considered in this part broaden to black-box models. We formulate the side knowledge incorporation problem as a constrained divergence minimization problem and propose two scalable neural network approaches as its solution. We demonstrate their practicality using various synthetic and real examples. </p> <p><br></p> <p> The third part of the dissertation concentrates on a specific generative model for individual pixels of fMRI data constructed from a latent group image. There are usually two kinds of side knowledge about the latent group image: spatial structure and partial activation zones. The former can be captured by modeling the prior for the group image with Markov random fields; the latter, which is often obtained from previous related studies, is left for future research. We propose a novel Bayesian model with Markov random fields and aim to compute the maximum a posteriori estimate of the group image. We also derive a variational Bayes algorithm to overcome local optima in the optimization.</p>

Causal Inference in the Face of Assumption Violations

Yuki Ohnishi (18423810) 26 April 2024 (has links)
<p dir="ltr">This dissertation advances the field of causal inference by developing methodologies in the face of assumption violations. Traditional causal inference methodologies hinge on a core set of assumptions, which are often violated in the complex landscape of modern experiments and observational studies. This dissertation proposes novel methodologies designed to address the challenges posed by single or multiple assumption violations. By applying these innovative approaches to real-world datasets, this research uncovers valuable insights that were previously inaccessible with existing methods. </p><p><br></p><p dir="ltr">First, three significant sources of complications in causal inference that are increasingly of interest are interference among individuals, nonadherence of individuals to their assigned treatments, and unintended missing outcomes. Interference exists if the outcome of an individual depends not only on its assigned treatment, but also on the assigned treatments for other units. It commonly arises when limited controls are placed on the interactions of individuals with one another during the course of an experiment. Treatment nonadherence frequently occurs in human subject experiments, as it can be unethical to force an individual to take their assigned treatment. Clinical trials, in particular, typically have subjects that do not adhere to their assigned treatments due to adverse side effects or intercurrent events. Missing values also commonly occur in clinical studies. For example, some patients may drop out of the study due to the side effects of the treatment. Failing to account for these considerations will generally yield unstable and biased inferences on treatment effects even in randomized experiments, but existing methodologies lack the ability to address all these challenges simultaneously. We propose a novel Bayesian methodology to fill this gap. 
</p><p><br></p><p dir="ltr">My subsequent research further addresses one of the limitations of the first project: a set of assumptions about interference structures that may be too restrictive in some practical settings. We introduce a concept of the ``degree of interference" (DoI), a latent variable capturing the interference structure. This concept allows for handling arbitrary, unknown interference structures to facilitate inference on causal estimands. </p><p><br></p><p dir="ltr">While randomized experiments offer a solid foundation for valid causal analysis, people are also interested in conducting causal inference using observational data due to the cost and difficulty of randomized experiments and the wide availability of observational data. Nonetheless, using observational data to infer causality requires us to rely on additional assumptions. A central assumption is that of \emph{ignorability}, which posits that the treatment is randomly assigned based on the variables (covariates) included in the dataset. While crucial, this assumption is often debatable, especially when treatments are assigned sequentially to optimize future outcomes. For instance, marketers typically adjust subsequent promotions based on responses to earlier ones and speculate on how customers might have reacted to alternative past promotions. This speculative behavior introduces latent confounders, which must be carefully addressed to prevent biased conclusions. </p><p dir="ltr">In the third project, we investigate these issues by studying sequences of promotional emails sent by a US retailer. We develop a novel Bayesian approach for causal inference from longitudinal observational data that accommodates noncompliance and latent sequential confounding. </p><p><br></p><p dir="ltr">Finally, we formulate the causal inference problem for the privatized data. 
In the era of digital expansion, the secure handling of sensitive data poses an intricate challenge that significantly influences research, policy-making, and technological innovation. As the collection of sensitive data becomes more widespread across academic, governmental, and corporate sectors, balancing data accessibility against the safeguarding of private information requires sophisticated methods for analysis and reporting that include stringent privacy protections. Currently, the gold standard for maintaining this balance is differential privacy. </p><p dir="ltr">Local differential privacy is a differential privacy paradigm in which individuals first apply a privacy mechanism to their data (often by adding noise) before transmitting the result to a curator. The noise added for privacy introduces additional bias and variance into analyses, so it is of great importance for analysts to account for the privacy noise in order to draw valid inferences.</p><p dir="ltr">In this final project, we develop methodologies to infer causal effects from locally privatized data under randomized experiments. We present frequentist and Bayesian approaches and discuss the statistical properties of the estimators, such as consistency and optimality, under various privacy scenarios.</p>
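A standard building block of the locally privatized setting is binary randomized response followed by a debiasing step that inverts the known mechanism. The sketch below illustrates that building block only; it is not the frequentist or Bayesian causal estimators developed in the project:

```python
import numpy as np

def randomize(bits, eps, rng):
    # local DP: each user reports the true bit with prob p = e^eps / (1 + e^eps)
    p = np.exp(eps) / (1.0 + np.exp(eps))
    keep = rng.random(bits.shape) < p
    return np.where(keep, bits, 1 - bits)

def debias_mean(reports, eps):
    # invert the known response mechanism for an unbiased proportion estimate
    p = np.exp(eps) / (1.0 + np.exp(eps))
    return (reports.mean() - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(0)
bits = (rng.random(20000) < 0.3).astype(int)   # true binary outcomes, proportion 0.3
reports = randomize(bits, eps=1.0, rng=rng)    # privatized reports seen by the curator
est = debias_mean(reports, eps=1.0)            # debiased estimate of the proportion
```

The raw report mean is pulled toward 0.5 by the noise, while the debiased estimator is unbiased at the cost of inflated variance; estimating treatment-arm means this way and differencing them gives the simplest privatized effect estimate.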

A GENERAL FRAMEWORK FOR CUSTOMER CONTENT PRINT QUALITY DEFECT DETECTION AND ANALYSIS

Runzhe Zhang (11442742) 11 July 2022 (has links)
<p>Print quality (PQ) is one of the most significant issues with electrophotographic printers. There are many causes of PQ issues, such as limitations of the electrophotographic process, faulty printer components, or other failures of the print mechanism. These causes can produce different PQ issues, like streaks, bands, gray spots, text fading, and color fading defects. It is important to analyze the nature and causes of different print defects to more efficiently repair printers and improve the electrophotographic process. </p> <p><br></p> <p>We design a general framework for print quality detection and analysis on customer content. The framework takes as input the original digital image saved on the computer together with the scanned printed image, and includes two main modules: image pre-processing, and print defect feature extraction and classification. The first module, image pre-processing, includes image registration, color calibration, and region of interest (ROI) extraction. The ROI extraction step extracts four different kinds of ROI from the digital master image, because different ROIs exhibit different print defects: for example, the symbol ROI includes the text fading defect, and the raster ROI includes the color fading defect. The second module includes detection and analysis algorithms for the print defects of each ROI, classifying the defects of each ROI by severity using their feature vectors. This module comprises four important defect detection methods: uniform color area streak detection, symbol ROI color text fading detection, raster ROI color fading detection using a novel unsupervised clustering method, and raster ROI streak detection. We introduce the details of these algorithms in this thesis. </p> <p><br></p> <p>We also present two other print quality projects: print margin skew detection, and print velocity simulation and estimation.
For print margin skew detection, we propose an algorithm that uses Hough line detection to detect print margin and skew errors, validated against actual scanned images. For print velocity simulation and estimation, we propose a print velocity simulation tool, design a specific print velocity test page, and develop a print velocity estimation algorithm based on dynamic time warping. </p>
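As a simplified stand-in for the skew computation (a least-squares line fit to synthetic edge points, instead of Hough line detection on a real scan), assuming margin-edge points have already been extracted:

```python
import numpy as np

def skew_angle_deg(xs, ys):
    """Fit a line to margin-edge points and return its tilt from horizontal, in degrees."""
    slope, _ = np.polyfit(xs, ys, 1)   # least-squares fit; Hough voting would be used on a real scan
    return np.degrees(np.arctan(slope))

# synthetic top-margin edge points from a page skewed by 2 degrees, with pixel noise
rng = np.random.default_rng(0)
xs = np.linspace(0, 100, 50)
ys = 10.0 + np.tan(np.radians(2.0)) * xs + rng.normal(scale=0.05, size=xs.size)
angle = skew_angle_deg(xs, ys)
```

Hough line detection is preferred on real scans because it is robust to outlier edge points (content touching the margin), which a plain least-squares fit is not.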

Thesis_deposit.pdf

Sehwan Kim (15348235) 25 April 2023 (has links)
<p>    Adaptive MCMC is advantageous over traditional MCMC due to its ability to automatically adjust its proposal distributions during the sampling process, providing improved sampling efficiency and faster convergence to the target distribution, especially in complex or high-dimensional problems. However, the adaptive scheme must be designed and validated cautiously to ensure the algorithm's validity and to prevent the introduction of biases. This dissertation focuses on the use of adaptive MCMC for deep learning, specifically addressing the mode collapse issue in Generative Adversarial Networks (GANs), implementing fiducial inference, and applying it to causal inference in individual treatment effect problems.</p> <p><br></p> <p>    First, the GAN was recently introduced in the literature as a novel machine learning method for training generative models. However, GANs are very difficult to train due to the issue of mode collapse, i.e., a lack of diversity among the generated data. We identify the reason why GANs suffer from this issue and lay out a new theoretical framework for GANs based on randomized decision rules under which the mode collapse issue can essentially be overcome. Under the new theoretical framework, the discriminator converges to a fixed point while the generator converges to a distribution at the Nash equilibrium.</p> <p><br></p> <p>    Second, fiducial inference was generally considered R.A. Fisher's big blunder, but the goal he initially set, <em>making inference for the uncertainty of model parameters on the basis of observations</em>, has been continually pursued by many statisticians. By leveraging advanced statistical computing techniques such as stochastic approximation Markov chain Monte Carlo, we develop a new statistical inference method, the so-called extended fiducial inference, which achieves the initial goal of fiducial inference.
</p> <p><br></p> <p>    Lastly, estimating the individual treatment effect (ITE) is important for decision making in various fields, particularly in health research, where precision medicine is being investigated. The conditional average treatment effect (CATE) is often used for this purpose, but uncertainty quantification and explanation of the variability of the predicted ITE are still needed for fair decision making. We discuss using extended fiducial inference to construct prediction intervals for the ITE, and introduce a double neural network algorithm for efficient prediction and estimation of nonlinear ITEs.</p>
