About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations (NDLTD).

Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
1

Statistical Methods for Incomplete Covariates and Two-Phase Designs

McIsaac, Michael 18 December 2012 (has links)
Incomplete data is a pervasive problem in health research, and as a result statistical methods enabling inference based on partial information play a critical role. This thesis explores estimation of regression coefficients and associated inferences when variables are incompletely observed. In the later chapters, we focus primarily on settings with incomplete covariate data which arise by design, as in studies with two-phase sampling schemes, as opposed to incomplete data which arise due to events beyond the control of the scientist. We consider the problem in which "inexpensive" auxiliary information can be used to inform the selection of individuals for collection of data on the "expensive" covariate. In particular, we explore how parameter estimation relates to the choice of sampling scheme. Efficient sampling designs are defined by choosing the optimal sampling criteria within a particular class of selection models under a two-phase framework. We compare the efficiency of these optimal designs to simple random sampling and balanced sampling designs under a variety of frameworks for inference. As a prelude to the work on two-phase designs, we first review and study issues related to incomplete data arising due to chance.

In Chapter 2, we discuss several models by which missing data can arise, with an emphasis on issues in clinical trials. The likelihood function is used as a basis for discussing different missing data mechanisms for incomplete responses in short-term and longitudinal studies, as well as for missing covariates. We briefly discuss common ad hoc strategies for dealing with incomplete data, such as complete-case analyses and naive methods of imputation, and we review more broadly appropriate approaches for dealing with incomplete data in terms of asymptotic and empirical frequency properties. These methods include the EM algorithm, multiple imputation, and inverse probability weighted estimating equations. Simulation studies are reported which demonstrate how to implement these procedures and examine performance empirically. We further explore the asymptotic bias of these estimators when the nature of the missing data mechanism is misspecified. We consider specific types of model misspecification in methods designed to account for the missingness and compare the limiting values of the resulting estimators.

In Chapter 3, we focus on methods for two-phase studies in which covariates are incomplete by design. In the second phase of the two-phase study, subject to correct specification of key models, optimal sub-sampling probabilities can be chosen to minimise the asymptotic variance of the resulting estimator. These optimal phase-II sampling designs are derived and the empirical and asymptotic relative efficiencies resulting from these designs are compared to those from simple random sampling and balanced sampling designs. We further examine the effect on efficiency of utilising external pilot data to estimate parameters needed for derivation of optimal designs, and we explore the sensitivity of these optimal sampling designs to misspecification of preliminary parameter estimates and to the misspecification of the covariate model at the design stage. Designs which are optimal for analyses based on inverse probability weighted estimating equations are shown to result in efficiency gains for several different methods of analysis and are shown to be relatively robust to misspecification of the parameters or models used to derive the optimal designs. Furthermore, these optimal designs for inverse probability weighted estimating equations are shown to be well behaved when necessary design parameters are estimated using relatively small external pilot studies. We also consider efficient two-phase designs explicitly in the context of studies involving clustered and longitudinal responses. Model-based methods are discussed for estimation and inference. Asymptotic results are used to derive optimal sampling designs and the relative efficiencies of these optimal designs are again compared with simple random sampling and balanced sampling designs. In this more complex setting, balanced sampling designs are demonstrated to be inefficient and it is not obvious when balanced sampling will offer greater efficiency than a simple random sampling design. We explore the relative efficiency of phase-II sampling designs based on increasing amounts of information in the longitudinal responses and show that the balanced design may become less efficient when more data is available at the design stage. In contrast, the optimal design is able to exploit additional information to increase efficiency whenever more data is available at phase-I.

In Chapter 4, we consider an innovative adaptive two-phase design which breaks the phase-II sampling into a phase-IIa sample obtained by a balanced or proportional sampling strategy, and a phase-IIb sample collected according to an optimal sampling design based on the data in phases I and IIa. This approach exploits the previously established robustness of optimal inverse probability weighted designs to overcome the difficulties associated with the fact that derivations of optimal designs require a priori knowledge of parameters. The efficiency of this hybrid design is compared to those of the proportional and balanced sampling designs, and to the efficiency of the true optimal design, in a variety of settings. The efficiency gains of this adaptive two-phase design are particularly apparent in the setting involving clustered response data, and it is natural to consider this approach in settings with complex models for which it is difficult to even speculate on suitable parameter values at the design stage.
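Editor's note: the inverse probability weighted (IPW) approach this abstract builds on is easy to see in miniature. Below is a minimal sketch, not the thesis's method: a logistic outcome model, an "expensive" covariate collected only on a phase-II subsample, and a fixed, response-dependent toy selection rule. All names and design choices here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Phase I: outcome Y and a cheap auxiliary covariate Z are observed on everyone.
n = 5000
Z = rng.normal(size=n)
X = 0.5 * Z + rng.normal(size=n)           # "expensive" covariate, correlated with Z
Y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.0 * X))))

# Phase II: collect X with probabilities driven by the cheap phase-I data.
# This toy design oversamples the less common Y = 1 stratum.
pi = np.where(Y == 1, 0.8, 0.2)
R = rng.binomial(1, pi)                    # R = 1 if X was actually collected

def ipw_logistic(Y, X, R, pi, iters=25):
    """Solve the IPW estimating equation
       sum_i (R_i / pi_i) x_i (Y_i - expit(x_i' beta)) = 0  by Newton-Raphson."""
    sel = R == 1
    Xd = np.column_stack([np.ones(sel.sum()), X[sel]])   # intercept + covariate
    y, w = Y[sel], 1.0 / pi[sel]
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-Xd @ beta))
        score = Xd.T @ (w * (y - mu))
        info = (Xd * (w * mu * (1 - mu))[:, None]).T @ Xd
        beta += np.linalg.solve(info, score)
    return beta

print(ipw_logistic(Y, X, R, pi))   # should land near the true values (-0.5, 1.0)
```

An unweighted complete-case fit would shift at least the intercept here, since selection depends on Y; the 1/pi weights restore consistency, and the thesis's optimal designs go a step further by choosing pi to minimise the asymptotic variance of the resulting estimator.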
2

Stable Mixing of Complete and Incomplete Information

Corduneanu, Adrian, Jaakkola, Tommi 08 November 2001 (has links)
An increasing number of parameter estimation tasks involve the use of at least two information sources, one complete but limited, the other abundant but incomplete. Standard algorithms such as EM (or em) used in this context are unfortunately not stable in the sense that they can lead to a dramatic loss of accuracy with the inclusion of incomplete observations. We provide a more controlled solution to this problem through differential equations that govern the evolution of locally optimal solutions (fixed points) as a function of the source weighting. This approach permits us to explicitly identify any critical (bifurcation) points leading to choices unsupported by the available complete data. The approach readily applies to any graphical model in O(n^3) time where n is the number of parameters. We use the naive Bayes model to illustrate these ideas and demonstrate the effectiveness of our approach in the context of text classification problems.
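Editor's note: a sketch of the underlying idea, under strong simplifying assumptions. The paper follows EM fixed points continuously via differential equations (which also flags bifurcations); the code below merely re-runs a source-weighted EM for a two-class univariate Gaussian model while sweeping the incomplete-source weight from 0 (complete data only) to 1, warm-starting each step at the previous solution. Data and model are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def npdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Complete source: a handful of labeled points. Incomplete source: many unlabeled.
xl = np.concatenate([rng.normal(-2, 1, 10), rng.normal(2, 1, 10)])
yl = np.concatenate([np.zeros(10, int), np.ones(10, int)])
xu = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])

def weighted_em(lam, mu, sd, pi, iters=300):
    """EM for the source-weighted objective  L(labeled) + lam * L(unlabeled)."""
    for _ in range(iters):
        # E-step: posterior class responsibilities for the unlabeled points.
        dens = np.stack([pi[k] * npdf(xu, mu[k], sd[k]) for k in range(2)])
        r = dens / dens.sum(axis=0)
        # M-step: labeled points carry weight 1, unlabeled points weight lam * r.
        for k in range(2):
            w = np.concatenate([(yl == k).astype(float), lam * r[k]])
            x = np.concatenate([xl, xu])
            mu[k] = np.average(x, weights=w)
            sd[k] = max(np.sqrt(np.average((x - mu[k]) ** 2, weights=w)), 1e-3)
        counts = np.array([(yl == 0).sum() + lam * r[0].sum(),
                           (yl == 1).sum() + lam * r[1].sum()])
        pi = counts / counts.sum()
    return mu, sd, pi

# Discrete warm-started sweep over the source weight -- a crude stand-in for the
# paper's continuous tracking of EM fixed points.
mu, sd, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for lam in np.linspace(0, 1, 11):
    mu, sd, pi = weighted_em(lam, mu, sd, pi)
    print(f"lam={lam:.1f}  class means = {mu.round(2)}")
```

If the solution path jumps abruptly between two weights, the sweep has crossed the kind of critical point the paper identifies, where the unlabeled data start pulling the fit away from choices supported by the labeled data.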
3

Seismic data processing with curvelets: a multiscale and nonlinear approach.

Herrmann, Felix J., Wang, Deli, Hennenfent, Gilles, Moghaddam, Peyman P. January 2007 (has links)
In this abstract, we present a nonlinear curvelet-based sparsity promoting formulation of a seismic processing flow, consisting of the following steps: seismic data regularization and the restoration of migration amplitudes. We show that the curvelet’s wavefront detection capability and invariance under the migration-demigration operator lead to a formulation that is stable under noise and missing data.
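Editor's note: the sparsity-promoting recovery step can be sketched with a generic transform. The code below substitutes an orthonormal DCT for the curvelet transform (a real implementation would use a curvelet library) and runs plain iterative soft thresholding (ISTA) to regularize a trace with missing samples; everything here is a toy stand-in for the authors' formulation, not their method.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(2)

# A trace that is sparse in a transform domain; the DCT stands in for curvelets.
n = 512
c0 = np.zeros(n)
c0[rng.choice(n, 15, replace=False)] = rng.normal(0, 1, 15)
signal = idct(c0, norm='ortho')            # "complete" data (synthesis)

mask = rng.random(n) < 0.5                 # half of the samples are missing
b = signal * mask                          # observed, incomplete data

# ISTA for  min_x 0.5 * ||M S x - b||^2 + lam * ||x||_1,
# where S is the (orthonormal) inverse DCT used as the synthesis operator and
# M the sampling mask; the soft threshold is what promotes sparsity.
lam, x = 0.01, np.zeros(n)
for _ in range(500):
    resid = mask * idct(x, norm='ortho') - b
    x = x - dct(mask * resid, norm='ortho')            # gradient step (||MS|| <= 1)
    x = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)  # soft threshold

rec = idct(x, norm='ortho')
print("relative error:", np.linalg.norm(rec - signal) / np.linalg.norm(signal))
```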
4

Query Processing Over Incomplete Data Streams

Ren, Weilong 19 November 2021 (has links)
No description available.
5

Robust Diagnostics for the Logistic Regression Model With Incomplete Data

范少華 Unknown Date (has links)
Atkinson and Riani (2001) applied the forward search algorithm to the detection of multiple outliers in binomial data. In this thesis, we extend the same idea to identify multiple outliers in generalized linear models when part of the data is missing. The algorithm starts with an imputation method to fill in the missing observations in the data, and then uses the forward search algorithm to confirm outliers. The proposed method can overcome the masking effect, which commonly occurs when multiple outliers exist in the data. Real data are used to illustrate the procedure, and satisfactory results are obtained.
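Editor's note: a rough sketch of the two-stage procedure described, under simplifying assumptions (mean imputation rather than a model-based fill-in, and a generic logistic fit). The forward search admits well-fitting units first, so planted outliers enter last and stand out, illustrating how the method sidesteps masking; all data and tuning choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Toy binary-response data with five planted outliers and missing covariate cells.
n = 200
X = rng.normal(size=(n, 2))
X[:5] = [3.0, -3.0]                            # extreme covariates ...
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - X[:, 1]))))
y[:5] = 0                                      # ... with contaminated responses
X[5:][rng.random((n - 5, 2)) < 0.1] = np.nan   # ~10% of the other cells missing

# Stage 1 (imputation): fill the missing cells, here simply with column means.
X = np.where(np.isnan(X), np.nanmean(X, axis=0), X)

# Stage 2 (forward search): start from a small well-fitting subset and grow it one
# unit at a time, refitting and admitting the best-fitting outside unit. Outliers
# resist entry, so they join last with large residuals -- unlike a single full-data
# fit, where multiple outliers can mask one another.
model = LogisticRegression(C=1e6, max_iter=1000)   # effectively unpenalised fit
res = np.abs(y - model.fit(X, y).predict_proba(X)[:, 1])
order = np.argsort(res)
subset = list(order[y[order] == 0][:10]) + list(order[y[order] == 1][:10])

entry_res = []
while len(subset) < n:
    model.fit(X[subset], y[subset])
    res = np.abs(y - model.predict_proba(X)[:, 1])
    outside = np.setdiff1d(np.arange(n), subset)
    best = outside[np.argmin(res[outside])]
    entry_res.append(res[best])
    subset.append(int(best))

print(np.round(entry_res[-8:], 3))             # jump at the end flags the outliers
```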
6

Non-parametric Bayesian Learning with Incomplete Data

Wang, Chunping January 2010 (has links)
<p>In most machine learning approaches, it is usually assumed that data are complete. When data are partially missing due to various reasons, for example, the failure of a subset of sensors, image corruption or inadequate medical measurements, many learning methods designed for complete data cannot be directly applied. In this dissertation we treat two kinds of problems with incomplete data using non-parametric Bayesian approaches: classification with incomplete features and analysis of low-rank matrices with missing entries.</p><p>Incomplete data in classification problems are handled by assuming input features to be generated from a mixture-of-experts model, with each individual expert (classifier) defined by a local Gaussian in feature space. With a linear classifier associated with each Gaussian component, nonlinear classification boundaries are achievable without the introduction of kernels. Within the proposed model, the number of components is theoretically ``infinite'' as defined by a Dirichlet process construction, with the actual number of mixture components (experts) needed inferred based upon the data under test. With a higher-level DP we further extend the classifier for analysis of multiple related tasks (multi-task learning), where model components may be shared across tasks. Available data could be augmented by this way of information transfer even when tasks are only similar in some local regions of feature space, which is particularly critical for cases with scarce incomplete training samples from each task. The proposed algorithms are implemented using efficient variational Bayesian inference and robust performance is demonstrated on synthetic data, benchmark data sets, and real data with natural missing values.</p><p>Another scenario of interest is to complete a data matrix with entries missing. The recovery of missing matrix entries is not possible without additional assumptions on the matrix under test, and here we employ the common assumption that the matrix is low-rank. Unlike methods with a preset fixed rank, we propose a non-parametric Bayesian alternative based on the singular value decomposition (SVD), where missing entries are handled naturally, and the number of underlying factors is imposed to be small and inferred in the light of observed entries. Although we assume missing at random, the proposed model is generalized to incorporate auxiliary information including missingness features. We also make a first attempt in the matrix-completion community to acquire new entries actively. By introducing a probit link function, we are able to handle counting matrices with the decomposed low-rank matrices latent. The basic model and its extensions are validated on</p><p>synthetic data, a movie-rating benchmark and a new data set presented for the first time.</p> / Dissertation
7

Partial least squares structural equation modelling with incomplete data : an investigation of the impact of imputation methods

Mohd Jamil, J. B. January 2012 (has links)
Despite considerable advances in missing data imputation methods over the last three decades, the problem of missing data remains largely unsolved. Many techniques have emerged in the literature as candidate solutions. These techniques can be categorised into two classes: statistical methods of data imputation and computational intelligence methods of data imputation. Because of the longstanding use of statistical methods for missing data problems, computational intelligence methods have been slow to gain attention, even though they achieve comparable accuracy. The merits of both classes have been discussed at length in the literature, but few studies compare the two classes in depth. This thesis contributes to knowledge by, firstly, conducting a comprehensive comparison of standard statistical methods of data imputation, namely mean substitution (MS), regression imputation (RI), expectation maximization (EM), tree imputation (TI) and multiple imputation (MI), on missing completely at random (MCAR) data sets. Secondly, this study compares the efficacy of these methods with a computational intelligence method of data imputation, namely a neural network (NN), on missing not at random (MNAR) data sets. Significant differences in the performance of the methods are presented. Thirdly, a novel procedure for handling missing data is presented: a hybrid combination of each of these statistical methods with a NN, known here as the post-processing procedure, was adopted to approximate MNAR data sets. Simulation studies for each of these imputation approaches have been conducted to assess the impact of missing values on partial least squares structural equation modelling (PLS-SEM), based on the estimated accuracy of both structural and measurement parameters. The best method for dealing with each missing data mechanism is identified. Several significant insights were drawn from the simulation results. For the problem of MCAR data handled with statistical methods of imputation, MI performs better than the other methods for all percentages of missing data. Another unique contribution emerges when comparing the results before and after the NN post-processing procedure: the improvement in accuracy may result from the neural network's ability to derive meaning from the imputed data set produced by the statistical methods. Based on these results, the NN post-processing procedure can help MS produce significant improvements in the accuracy of the approximated values. This is a promising result, as MS is the weakest method in this study. It is also informative because MS is often the default method available to users of PLS-SEM software.
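Editor's note: the flavour of the mean-substitution-versus-model-based comparison can be reproduced with off-the-shelf imputers. This sketch is not the thesis's PLS-SEM pipeline or its NN post-processing; it simply pits mean substitution against scikit-learn's iterative (regression-based) imputer on MCAR data, with illustrative data and parameters.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(5)

# Three correlated indicators, then MCAR deletion of 20% of the cells.
n = 1000
z = rng.normal(size=(n, 1))
X = z + rng.normal(0, 0.5, size=(n, 3))
Xmiss = X.copy()
Xmiss[rng.random(X.shape) < 0.2] = np.nan

for name, imp in [("mean substitution (MS)", SimpleImputer(strategy="mean")),
                  ("regression-based (iterative)", IterativeImputer(random_state=0))]:
    Xhat = imp.fit_transform(Xmiss)
    rmse = np.sqrt(np.mean((Xhat - X)[np.isnan(Xmiss)] ** 2))
    print(f"{name}: imputation RMSE = {rmse:.3f}")
```

Mean substitution ignores the correlation between the indicators, so its error sits near the marginal standard deviation, while the regression-based imputer exploits that correlation, which is exactly the weakness of MS that the thesis's NN post-processing is reported to repair.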
8

Multiple prediction from incomplete data with the focused curvelet transform

Herrmann, Felix J. January 2007 (has links)
Incomplete data represents a major challenge for successful prediction and subsequent removal of multiples. In this paper, a new method is presented that tackles this challenge in a two-step approach. During the first step, the recently developed curvelet-based recovery by sparsity-promoting inversion (CRSI) is applied to the data, followed by a prediction of the primaries. During the second, high-resolution step, the estimated primaries are used to improve the frequency content of the recovered data by combining the focal transform, defined in terms of the estimated primaries, with the curvelet transform. This focused curvelet transform leads to an improved recovery, which can subsequently be used as input for a second stage of multiple prediction and primary-multiple separation.
9

Session Clustering Using Mixtures of Proportional Hazards Models

Mair, Patrick, Hudec, Marcus January 2008 (has links) (PDF)
Emanating from classical Weibull mixture models, we propose a framework for clustering survival data with various proportionality restrictions imposed. By introducing mixtures of Weibull proportional hazards models on a multivariate data set, a parametric clustering approach based on the EM algorithm is carried out. The problem of non-response in the data is considered. The application example is a real-life data set stemming from the analysis of a world-wide operating eCommerce application. Sessions are clustered according to the dwell times a user spends on certain page areas. The solution allows for the interpretation of the navigation behavior in terms of survival and hazard functions. A software implementation by means of an R package is provided. (author's abstract) / Series: Research Report Series / Department of Statistics and Mathematics
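Editor's note: a sketch of the starting point only, EM for an unrestricted two-component Weibull mixture of dwell times, without the proportionality restrictions or non-response handling the abstract adds. The authors provide an R package; this Python analogue and all its parameters are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

# Dwell times from two latent session types (shape/scale values are illustrative).
t = np.concatenate([weibull_min.rvs(0.8, scale=5, size=300, random_state=1),
                    weibull_min.rvs(2.5, scale=40, size=300, random_state=2)])

def fit_weibull(t, w, start):
    """Weighted Weibull MLE for the M-step (optimised on the log scale)."""
    def nll(p):
        k, lam = np.exp(p)
        return -np.sum(w * weibull_min.logpdf(t, k, scale=lam))
    return np.exp(minimize(nll, np.log(start), method="Nelder-Mead").x)

# EM for a two-component Weibull mixture: alternate responsibilities and
# weighted maximum likelihood until the cluster parameters settle.
params = [np.array([1.0, 2.0]), np.array([1.0, 20.0])]   # (shape, scale) guesses
pi = np.array([0.5, 0.5])
for _ in range(50):
    dens = np.stack([pi[k] * weibull_min.pdf(t, params[k][0], scale=params[k][1])
                     for k in range(2)])
    r = dens / dens.sum(axis=0)               # E-step: cluster responsibilities
    pi = r.mean(axis=1)                       # M-step: mixing weights
    params = [fit_weibull(t, r[k], params[k]) for k in range(2)]

print("mixing weights:", pi.round(2))
print("(shape, scale) per cluster:", [p.round(2) for p in params])
```

The fitted shape and scale parameters per cluster are what give the interpretation in terms of survival and hazard functions: a shape below 1 corresponds to a decreasing hazard (sessions most likely to end early), a shape above 1 to an increasing one.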
