
RISK INTERPRETATION OF DIFFERENTIAL PRIVACY

Jiajun Liang (13190613) 31 July 2023 (has links)
<p><br></p><p>How to set privacy parameters is a crucial problem for the consistent application of DP in practice. Current privacy parameters do not provide direct guidance on this question. Moreover, different databases may leak different amounts of information, allowing attackers to enhance their attacks with the available information. This dissertation provides an additional interpretation of the current DP notions by introducing a framework that directly considers the worst-case average failure probability of attackers under different levels of knowledge. </p><p><br></p><p>To achieve this, we introduce a novel measure of attacker knowledge and establish a dual relationship between (type I error, type II error) and (prior, average failure probability). Leveraging this framework, we propose an interpretable paradigm for consistently setting privacy parameters on different databases with varying levels of leaked information. </p><p><br></p><p>Furthermore, we characterize the minimax limit of private parameter estimation, driven by $1/(n(1-2p)^2)+1/n$, where $p$ represents the worst-case probability risk and $n$ is the number of data points. This characterization is more interpretable than the current lower bound $\min\{1/(n\epsilon^2),1/(n\delta^2)\}+1/n$ for $(\epsilon,\delta)$-DP. Additionally, we identify the phase transition of private parameter estimation based on this limit and provide suggestions for protocol designs that achieve optimal private estimation. </p><p><br></p><p>Last, we consider a federated learning setting where the data are stored in a distributed manner and privacy-preserving interactions are required. We extend the proposed interpretation to federated learning, considering two scenarios: protecting against privacy breaches at the local nodes and at the center. Specifically, we consider a non-convex sparse federated parameter estimation problem and apply it to generalized linear models. We tackle two challenges in this setting. First, initialization is difficult because the privacy requirements limit the number of queries to the database. Second, we must overcome the heterogeneity in the distributions of the local nodes to identify low-dimensional structures.</p>
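The dual relationship between testing errors and the attacker's average failure probability can be made concrete on the simplest ε-DP mechanism, binary randomized response. The sketch below is an illustrative aside, not the dissertation's framework: it computes the Bayes-optimal attacker's average failure probability for a given ε and prior.

```python
import numpy as np

def attacker_failure(eps, prior):
    """Average failure probability of a Bayes-optimal attacker against binary
    randomized response (report the true bit with prob e^eps / (1 + e^eps)),
    when the attacker's prior is P(X = 1) = prior."""
    q = 1.0 / (1.0 + np.exp(eps))  # flip probability of the mechanism
    fail = 0.0
    for y in (0, 1):
        # joint probabilities P(X = x, Y = y); the attacker guesses the larger,
        # so it fails with probability equal to the smaller of the two
        p1 = prior * ((1 - q) if y == 1 else q)
        p0 = (1 - prior) * (q if y == 1 else (1 - q))
        fail += min(p0, p1)
    return fail
```

At ε = 0 the output is independent of the data and the failure probability equals min(prior, 1 − prior); it decays toward 0 as ε grows, which is the kind of direct risk statement the abstract argues privacy parameters should support.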

Pragmatic Statistical Approaches for Power Analysis, Causal Inference, and Biomarker Detection

Fan Wu (16536675) 26 July 2023 (has links)
<p>Mediation analyses play a critical role in social and personality psychology research. However, current approaches for assessing power and sample size in mediation models have limitations, particularly when dealing with complex mediation models and multiple mediator sequential models. These limitations stem from limited software options and the substantial computational time required. In this part, we address these challenges by extending the joint significance test and product of coefficients test to incorporate the fourth-pathed mediated effect and generalized kth-pathed mediated effect. Additionally, we propose a model-based bootstrap method and provide convenient R tools for estimating power in complex mediation models. Through our research, we demonstrate that power decreases as the number of mediators increases and as the influence of coefficients varies. We summarize our results and discuss the implications of power analysis in relation to mediator complexity and coefficient influence. We provide insights for researchers seeking to optimize study designs and enhance the reliability of their findings in complex mediation models. </p> <p>Matching is a crucial step in causal inference, as it allows for more robust and reasonable analyses by creating better-matched pairs. However, in real-world scenarios, data are often collected and stored by different local institutions or separate departments, posing challenges for effective matching due to data fragmentation. Additionally, the harmonization of such data needs to prioritize privacy preservation. In this part, we propose a new hierarchical framework that addresses these challenges by implementing differential privacy on raw data to protect sensitive information while maintaining data utility. We also design a data access control system with three different access levels for designers based on their roles, ensuring secure and controlled access to the matched datasets. 
Simulation studies and analyses of datasets from the 2017 Atlantic Causal Inference Conference Data Challenge are conducted to showcase the flexibility and utility of our framework. Through this research, we contribute to the advancement of statistical methodologies in matching and privacy-preserving data analysis, offering a practical solution for data integration and privacy protection in causal inference studies. </p> <p>Biomarker discovery is a complex and resource-intensive process, encompassing discovery, qualification, verification, and validation stages prior to clinical evaluation. Streamlining this process by efficiently identifying relevant biomarkers in the discovery phase holds immense value. In this part, we present a likelihood ratio-based approach to accurately identify truly relevant protein markers in discovery studies. Leveraging the observation of unimodal underlying distributions of expression profiles for irrelevant markers, our method demonstrates promising performance when evaluated on real experimental data. Additionally, to address non-normal scenarios, we introduce a kernel ratio-based approach, which we evaluate using non-normal simulation settings. Through extensive simulations, we observe the high effectiveness of the kernel method in discovering the set of truly relevant markers, resulting in precise biomarker identifications with elevated sensitivity and a low empirical false discovery rate.  </p>
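The joint significance test mentioned above can be illustrated with a Monte Carlo power estimate for the simplest one-mediator chain X → M → Y; the generalized kth-pathed version in the text checks every path in a longer chain. This is a hypothetical sketch (complete mediation, normal errors, and the z critical value 1.96 in place of the t quantile), not the authors' R tools:

```python
import numpy as np

def ols_t(x, y):
    # slope t-statistic for the simple regression y ~ x (with intercept)
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    b = (xc @ yc) / (xc @ xc)
    resid = yc - b * xc
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    return b / se

def joint_sig_power(a=0.3, b=0.3, n=200, reps=500, z=1.96, rng=None):
    """Power of the joint significance test: the mediated effect a*b is
    declared significant when both path coefficients are significant."""
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        m = a * x + rng.normal(size=n)   # X -> M path
        y = b * m + rng.normal(size=n)   # M -> Y path
        if abs(ols_t(x, m)) > z and abs(ols_t(m, y)) > z:
            hits += 1
    return hits / reps

power = joint_sig_power(rng=0)  # estimated power for a = b = 0.3, n = 200
```

Repeating this with longer mediator chains reproduces the pattern reported above: each extra significance requirement can only lower the joint power.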

GRAPH-BASED ANALYSIS OF NON-RANDOM MISSING DATA PROBLEMS WITH LOW-RANK NATURE: STRUCTURED PREDICTION, MATRIX COMPLETION AND SPARSE PCA

Hanbyul Lee (17586345) 09 December 2023 (has links)
<p dir="ltr">In most theoretical studies on missing data analysis, data is typically assumed to be missing according to a specific probabilistic model. However, such assumption may not accurately reflect real-world situations, and sometimes missing is not purely random. In this thesis, our focus is on analyzing incomplete data matrices without relying on any probabilistic model assumptions for the missing schemes. To characterize a missing scheme deterministically, we employ a graph whose adjacency matrix is a binary matrix that indicates whether each matrix entry is observed or not. Leveraging its graph properties, we mathematically represent the missing pattern of an incomplete data matrix and conduct a theoretical analysis of how this non-random missing pattern affects the solvability of specific problems related to incomplete data. This dissertation primarily focuses on three types of incomplete data problems characterized by their low-rank nature: structured prediction, matrix completion, and sparse PCA.</p><p dir="ltr">First, we investigate a basic structured prediction problem, which involves recovering binary node labels on a fixed undirected graph, where noisy binary observations corresponding to edges are given. Essentially, this setting parallels a simple binary rank-1 symmetric matrix completion problem, where missing entries are determined by a fixed undirected graph. Our aim is to establish the fundamental limit bounds of this problem, revealing a close association between the limits and graph properties, such as connectivity.</p><p dir="ltr">Second, we move on to the general low-rank matrix completion problem. In this study, we establish provable guarantees for exact and approximate low-rank matrix completion problems that can be applied to any non-random missing pattern, by utilizing the observation graph corresponding to the missing scheme. 
We theoretically and experimentally show that the standard constrained nuclear norm minimization algorithm can successfully recover the true matrix when the observation graph is well-connected and has similar node degrees. We also verify that matrix completion is achievable with a near-optimal sample complexity rate when the observation graph has uniform node degrees and its adjacency matrix has a large spectral gap.</p><p dir="ltr">Finally, we address the sparse PCA problem, featuring an approximate low-rank attribute. Missing data is common in situations where sparse PCA is useful, such as single-cell RNA sequence data analysis. We propose a semidefinite relaxation of the non-convex $\ell_1$-regularized PCA problem to solve sparse PCA on incomplete data. We demonstrate that the method is particularly effective when the observation pattern has favorable properties. Our theory is substantiated through synthetic and real data analysis, showcasing the superior performance of our algorithm compared to other sparse PCA approaches, especially when the observed data pattern has specific characteristics.</p>
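As a sketch of the nuclear-norm machinery discussed above, here is a soft-impute-style proximal iteration for the regularized cousin of the constrained problem, run on an arbitrary 70%-observed mask rather than a structured observation graph. The regularization weight and iteration count are illustrative choices, not values from the thesis:

```python
import numpy as np

def soft_impute(M, mask, lam=0.5, n_iter=300):
    """Nuclear-norm-regularized completion: repeatedly fill the missing
    entries with the current estimate, then soft-threshold the singular
    values (the proximal step for the nuclear norm)."""
    X = np.where(mask, M, 0.0)
    for _ in range(n_iter):
        filled = np.where(mask, M, X)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        X = (U * np.maximum(s - lam, 0.0)) @ Vt
    return X

rng = np.random.default_rng(0)
u, v = rng.normal(size=20), rng.normal(size=20)
M = np.outer(u, v)                    # rank-1 ground truth
mask = rng.random(M.shape) < 0.7      # ~70% of entries observed
X_hat = soft_impute(M, mask)
rel_err = np.linalg.norm(X_hat - M) / np.linalg.norm(M)
```

With a random mask the node degrees of the observation graph are roughly uniform, which is the favorable regime identified in the text; adversarially concentrated missingness (e.g., an entire unobserved row) makes recovery impossible regardless of the algorithm.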

<b>Deep Neural Network Structural Vulnerabilities And Remedial Measures</b>

Yitao Li (9148706) 02 December 2023 (has links)
<p dir="ltr">In the realm of deep learning and neural networks, there has been substantial advancement, but the persistent DNN vulnerability to adversarial attacks has prompted the search for more efficient defense strategies. Unfortunately, this becomes an arms race. Stronger attacks are being develops, while more sophisticated defense strategies are being proposed, which either require modifying the model's structure or incurring significant computational costs during training. The first part of the work makes a significant progress towards breaking this arms race. Let’s consider natural images, where all the feature values are discrete. Our proposed metrics are able to discover all the vulnerabilities surrounding a given natural image. Given sufficient computation resource, we are able to discover all the adversarial examples given one clean natural image, eliminating the need to develop new attacks. For remedial measures, our approach is to introduce a random factor into DNN classification process. Furthermore, our approach can be combined with existing defense strategy, such as adversarial training, to further improve performance.</p>

An Evaluation of Approaches for Generative Adversarial Network Overfitting Detection

Tung Tien Vu (12091421) 20 November 2023 (has links)
<p dir="ltr">Generating images from training samples solves the challenge of imbalanced data. It provides the necessary data to run machine learning algorithms for image classification, anomaly detection, and pattern recognition tasks. In medical settings, having imbalanced data results in higher false negatives due to a lack of positive samples. Generative Adversarial Networks (GANs) have been widely adopted for image generation. GANs allow models to train without computing intractable probability while producing high-quality images. However, evaluating GANs has been challenging for the researchers due to a need for an objective function. Most studies assess the quality of generated images and the variety of classes those images cover. Overfitting of training images, however, has received less attention from researchers. When the generated images are mere copies of the training data, GAN models will overfit and will not generalize well. This study examines the ability to detect overfitting of popular metrics: Maximum Mean Discrepancy (MMD) and Fréchet Inception Distance (FID). We investigate the metrics on two types of data: handwritten digits and chest x-ray images using Analysis of Variance (ANOVA) models.</p>

Statistical Methods for Offline Deep Reinforcement Learning

Danyang Wang (18414336) 20 April 2024 (has links)
<p dir="ltr">Reinforcement learning (RL) has been a rapidly evolving field of research over the past years, enhancing developments in areas such as artificial intelligence, healthcare, and education, to name a few. Regardless of the success of RL, its inherent online learning nature presents obstacles for its real-world applications, since in many settings, online data collection with the latest learned policy can be expensive and/or dangerous (such as robotics, healthcare, and autonomous driving). This challenge has catalyzed research into offline RL, which involves reinforcement learning from previously collected static datasets, without the need for further online data collection. However, most existing offline RL methods depend on two key assumptions: unconfoundedness and positivity (also known as the full-coverage assumption), which frequently do not hold in the context of static datasets. </p><p dir="ltr">In the first part of this dissertation, we simultaneously address these two challenges by proposing a novel policy learning algorithm: PESsimistic CAusal Learning (PESCAL). We utilize the mediator variable based on Front-Door Criterion, to remove the confounding bias. Additionally, we adopt the pessimistic principle to tackle the distributional shift problem induced by the under-coverage issue. This issue refers to the mismatch of distributions between the action distributions induced by candidate policies, and the policy that generates the observational data (known as the behavior policy). Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm, by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. 
Moreover, we provide theoretical guarantees for the proposed algorithms and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.</p><p dir="ltr">In the second part of this dissertation, in contrast to the first part, which approaches the distributional shift issue implicitly by penalizing the value function as a whole, we explicitly constrain the learned policy not to deviate significantly from the behavior policy, while still enabling flexible adjustment of the degree of constraint. Building upon the offline reinforcement learning algorithm TD3+BC (Fujimoto and Gu, 2021), we propose a model-free actor-critic algorithm with an adjustable behavior cloning (BC) term. We employ an ensemble of networks to quantify the uncertainty of the estimated value function, thus addressing the issue of overestimation. Moreover, we introduce a convenient and intuitively simple method for controlling the degree of BC, through a Bernoulli random variable based on a user-specified confidence level for different offline datasets. Our proposed algorithm, named Ensemble-based Actor Critic with Adaptive Behavior Cloning (EABC), is straightforward to implement, exhibits low variance, and achieves strong performance across all D4RL benchmarks.</p>
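The shape of a TD3+BC-style actor objective with a Bernoulli gate can be sketched on dummy arrays. This is a guess at the structure, not the actual EABC implementation: the λ normalization follows the published TD3+BC objective, while the Bernoulli switch on the BC penalty is an assumed reading of the "user-specified confidence level" mechanism described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# dummy batch: critic values Q(s, pi(s)), policy actions, and dataset actions
q_vals = rng.normal(loc=5.0, size=32)   # stand-in critic estimates
pi_a = rng.normal(size=(32, 4))         # actions proposed by the current policy
data_a = rng.normal(size=(32, 4))       # behavior-policy actions from the offline dataset

alpha = 2.5                             # TD3+BC weighting hyperparameter
lam = alpha / np.abs(q_vals).mean()     # scale-invariant weight on the Q term (TD3+BC)

conf = 0.7                              # user-specified confidence in the dataset
gate = int(rng.binomial(1, conf))       # assumed Bernoulli switch on the BC term

bc = ((pi_a - data_a) ** 2).sum(axis=1).mean()  # behavior cloning penalty
actor_loss = -lam * q_vals.mean() + gate * bc   # maximize Q while staying near the data
```

Averaged over gradient steps, the gate applies the BC penalty with probability `conf`, so a high-quality dataset (large `conf`) keeps the policy close to the data while a low-quality one loosens the constraint.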

GENERATIVE MODELS WITH MARGINAL CONSTRAINTS

Bingjing Tang (16380291) 16 June 2023 (has links)
<p> Generative models form powerful tools for learning data distributions and simulating new samples. Recent years have seen significant advances in the flexibility and applicability of such models, with Bayesian approaches like nonparametric Bayesian models and deep neural network models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) finding use in a wide range of domains. However, the black-box nature of these models means that they are often hard to interpret, and they often carry modeling implications that are inconsistent with side knowledge arising from domain expertise. This thesis studies situations where the modeler has side knowledge represented as probability distributions on functionals of the objects being modeled, and we study methods to incorporate this particular kind of side knowledge into flexible generative models. The dissertation covers three main parts. </p> <p><br></p> <p>The first part focuses on incorporating a special case of the aforementioned side knowledge into flexible nonparametric Bayesian models. Practitioners often have additional distributional information about a subset of the coordinates of the observations being modeled. The flexibility of nonparametric Bayesian models usually implies incompatibility with this side information, and such inconsistency necessitates methods for incorporating the side knowledge into flexible nonparametric Bayesian models. We design a specialized generative process to build in this side knowledge and propose a novel sigmoid Gaussian process conditional model. We also develop a corresponding posterior sampling method based on data augmentation to overcome a doubly intractable problem. We illustrate the efficacy of our proposed constrained nonparametric Bayesian model in a variety of real-world scenarios, including modeling environmental and earthquake data.
</p> <p><br></p> <p>The second part of the dissertation discusses neural network approaches to satisfying the general side knowledge described above. Furthermore, the generative models considered in this part broaden to black-box models. We formulate the side knowledge incorporation problem as a constrained divergence minimization problem and propose two scalable neural network approaches as its solution. We demonstrate their practicality using various synthetic and real examples. </p> <p><br></p> <p> The third part of the dissertation concentrates on a specific generative model for individual pixels of fMRI data constructed from a latent group image. There are usually two kinds of side knowledge about the latent group image: spatial structure and partial activation zones. The former can be captured by modeling the prior for the group image with Markov random fields; the latter, which is often obtained from previous related studies, is left for future research. We propose a novel Bayesian model with Markov random fields and aim to compute the maximum a posteriori estimate of the group image. We also derive a variational Bayes algorithm to overcome local optima in the optimization.</p>

Causal Inference in the Face of Assumption Violations

Yuki Ohnishi (18423810) 26 April 2024 (has links)
<p dir="ltr">This dissertation advances the field of causal inference by developing methodologies in the face of assumption violations. Traditional causal inference methodologies hinge on a core set of assumptions, which are often violated in the complex landscape of modern experiments and observational studies. This dissertation proposes novel methodologies designed to address the challenges posed by single or multiple assumption violations. By applying these innovative approaches to real-world datasets, this research uncovers valuable insights that were previously inaccessible with existing methods. </p><p><br></p><p dir="ltr">First, three significant sources of complications in causal inference that are increasingly of interest are interference among individuals, nonadherence of individuals to their assigned treatments, and unintended missing outcomes. Interference exists if the outcome of an individual depends not only on its assigned treatment, but also on the assigned treatments for other units. It commonly arises when limited controls are placed on the interactions of individuals with one another during the course of an experiment. Treatment nonadherence frequently occurs in human subject experiments, as it can be unethical to force an individual to take their assigned treatment. Clinical trials, in particular, typically have subjects that do not adhere to their assigned treatments due to adverse side effects or intercurrent events. Missing values also commonly occur in clinical studies. For example, some patients may drop out of the study due to the side effects of the treatment. Failing to account for these considerations will generally yield unstable and biased inferences on treatment effects even in randomized experiments, but existing methodologies lack the ability to address all these challenges simultaneously. We propose a novel Bayesian methodology to fill this gap. 
</p><p><br></p><p dir="ltr">My subsequent research further addresses one of the limitations of the first project: a set of assumptions about interference structures that may be too restrictive in some practical settings. We introduce a concept of the ``degree of interference" (DoI), a latent variable capturing the interference structure. This concept allows for handling arbitrary, unknown interference structures to facilitate inference on causal estimands. </p><p><br></p><p dir="ltr">While randomized experiments offer a solid foundation for valid causal analysis, people are also interested in conducting causal inference using observational data due to the cost and difficulty of randomized experiments and the wide availability of observational data. Nonetheless, using observational data to infer causality requires us to rely on additional assumptions. A central assumption is that of \emph{ignorability}, which posits that the treatment is randomly assigned based on the variables (covariates) included in the dataset. While crucial, this assumption is often debatable, especially when treatments are assigned sequentially to optimize future outcomes. For instance, marketers typically adjust subsequent promotions based on responses to earlier ones and speculate on how customers might have reacted to alternative past promotions. This speculative behavior introduces latent confounders, which must be carefully addressed to prevent biased conclusions. </p><p dir="ltr">In the third project, we investigate these issues by studying sequences of promotional emails sent by a US retailer. We develop a novel Bayesian approach for causal inference from longitudinal observational data that accommodates noncompliance and latent sequential confounding. </p><p><br></p><p dir="ltr">Finally, we formulate the causal inference problem for the privatized data. 
In the era of digital expansion, the secure handling of sensitive data poses an intricate challenge that significantly influences research, policy-making, and technological innovation. As the collection of sensitive data becomes more widespread across academic, governmental, and corporate sectors, balancing data accessibility against the safeguarding of private information requires sophisticated methods for analysis and reporting that include stringent privacy protections. Currently, the gold standard for maintaining this balance is differential privacy. </p><p dir="ltr">Local differential privacy is a differential privacy paradigm in which individuals first apply a privacy mechanism to their data (often by adding noise) before transmitting the result to a curator. The noise added for privacy introduces additional bias and variance into analyses, so it is of great importance for analysts to account for the privacy noise in order to draw valid inferences.</p><p dir="ltr">In this final project, we develop methodologies to infer causal effects from locally privatized data under randomized experiments. We present frequentist and Bayesian approaches and discuss the statistical properties of the estimators, such as consistency and optimality, under various privacy scenarios.</p>
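A standard building block of the locally privatized setting is binary randomized response followed by a debiasing step that inverts the known mechanism. The sketch below illustrates that building block only; it is not the frequentist or Bayesian causal estimators developed in the project:

```python
import numpy as np

def randomize(bits, eps, rng):
    # local DP: each user reports the true bit with prob p = e^eps / (1 + e^eps)
    p = np.exp(eps) / (1.0 + np.exp(eps))
    keep = rng.random(bits.shape) < p
    return np.where(keep, bits, 1 - bits)

def debias_mean(reports, eps):
    # invert the known response mechanism for an unbiased proportion estimate
    p = np.exp(eps) / (1.0 + np.exp(eps))
    return (reports.mean() - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(0)
bits = (rng.random(20000) < 0.3).astype(int)   # true binary outcomes, proportion 0.3
reports = randomize(bits, eps=1.0, rng=rng)    # privatized reports seen by the curator
est = debias_mean(reports, eps=1.0)            # debiased estimate of the proportion
```

The raw report mean is pulled toward 0.5 by the noise, while the debiased estimator is unbiased at the cost of inflated variance; estimating treatment-arm means this way and differencing them gives the simplest privatized effect estimate.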

A GENERAL FRAMEWORK FOR CUSTOMER CONTENT PRINT QUALITY DEFECT DETECTION AND ANALYSIS

Runzhe Zhang (11442742) 11 July 2022 (has links)
<p>Print quality (PQ) is one of the most significant issues with electrophotographic printers. There are many causes of PQ issues, such as limitations of the electrophotographic process, faulty printer components, or other failures of the print mechanism. These causes can produce different PQ issues, like streaks, bands, gray spots, text fading, and color fading defects. It is important to analyze the nature and causes of different print defects to more efficiently repair printers and improve the electrophotographic process. </p> <p><br></p> <p>We design a general framework for print quality detection and analysis on customer content. The framework takes as input the original digital image saved on the computer together with the scanned printed image, and includes two main modules: image pre-processing, and print defect feature extraction and classification. The first module, image pre-processing, includes image registration, color calibration, and region of interest (ROI) extraction. The ROI extraction step extracts four different kinds of ROI from the digital master image, because different ROIs exhibit different print defects: for example, the symbol ROI includes the text fading defect, and the raster ROI includes the color fading defect. The second module includes detection and analysis algorithms for the print defects of each ROI, classifying the defects of each ROI by severity using their feature vectors. This module comprises four important defect detection methods: uniform color area streak detection, symbol ROI color text fading detection, raster ROI color fading detection using a novel unsupervised clustering method, and raster ROI streak detection. We introduce the details of these algorithms in this thesis. </p> <p><br></p> <p>We also present two other print quality projects: print margin skew detection, and print velocity simulation and estimation.
For print margin skew detection, we propose an algorithm that uses Hough line detection to detect print margin and skew errors, validated against actual scanned images. For print velocity simulation and estimation, we propose a print velocity simulation tool, design a specific print velocity test page, and develop a print velocity estimation algorithm based on dynamic time warping. </p>
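As a simplified stand-in for the skew computation (a least-squares line fit to synthetic edge points, instead of Hough line detection on a real scan), assuming margin-edge points have already been extracted:

```python
import numpy as np

def skew_angle_deg(xs, ys):
    """Fit a line to margin-edge points and return its tilt from horizontal, in degrees."""
    slope, _ = np.polyfit(xs, ys, 1)   # least-squares fit; Hough voting would be used on a real scan
    return np.degrees(np.arctan(slope))

# synthetic top-margin edge points from a page skewed by 2 degrees, with pixel noise
rng = np.random.default_rng(0)
xs = np.linspace(0, 100, 50)
ys = 10.0 + np.tan(np.radians(2.0)) * xs + rng.normal(scale=0.05, size=xs.size)
angle = skew_angle_deg(xs, ys)
```

Hough line detection is preferred on real scans because it is robust to outlier edge points (content touching the margin), which a plain least-squares fit is not.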

Thesis_deposit.pdf

Sehwan Kim (15348235) 25 April 2023 (has links)
<p>    Adaptive MCMC is advantageous over traditional MCMC due to its ability to automatically adjust its proposal distributions during the sampling process, providing improved sampling efficiency and faster convergence to the target distribution, especially in complex or high-dimensional problems. However, the adaptive scheme must be designed and validated cautiously to ensure the algorithm's validity and to prevent the introduction of biases. This dissertation focuses on the use of adaptive MCMC for deep learning, specifically addressing the mode collapse issue in Generative Adversarial Networks (GANs), implementing fiducial inference, and applying it to causal inference in individual treatment effect problems.</p> <p><br></p> <p>    First, the GAN was recently introduced in the literature as a novel machine learning method for training generative models. However, GANs are very difficult to train due to the issue of mode collapse, i.e., a lack of diversity among the generated data. We identify the reason why GANs suffer from this issue and lay out a new theoretical framework for GANs based on randomized decision rules under which the mode collapse issue can essentially be overcome. Under the new theoretical framework, the discriminator converges to a fixed point while the generator converges to a distribution at the Nash equilibrium.</p> <p><br></p> <p>    Second, fiducial inference was generally considered R.A. Fisher's big blunder, but the goal he initially set, <em>making inference for the uncertainty of model parameters on the basis of observations</em>, has been continually pursued by many statisticians. By leveraging advanced statistical computing techniques such as stochastic approximation Markov chain Monte Carlo, we develop a new statistical inference method, the so-called extended fiducial inference, which achieves the initial goal of fiducial inference.
</p> <p><br></p> <p>    Lastly, estimating the individual treatment effect (ITE) is important for decision making in various fields, particularly in health research, where precision medicine is being investigated. The conditional average treatment effect (CATE) is often used for this purpose, but uncertainty quantification and explanation of the variability of the predicted ITE are still needed for fair decision making. We discuss using extended fiducial inference to construct prediction intervals for the ITE, and introduce a double neural network algorithm for efficient prediction and estimation of nonlinear ITEs.</p>
