91 |
Predicting the Unobserved: A statistical analysis of missing data techniques for binary classification. Säfström, Stella. January 2019.
The aim of the thesis is to investigate how the classification performance of random forest and logistic regression differs, given an imbalanced data set with MCAR missing data. Performance is measured in terms of accuracy and sensitivity. Two analyses are performed: one on a simulated data set and one on data from the Swedish population registries. The simulated data set is constructed with the same class imbalance of 1:5. The missing values are handled using three different techniques: complete case analysis, predictive mean matching and mean imputation. The thesis concludes that logistic regression and random forest are on average equally accurate, with some instances of random forest outperforming logistic regression. Logistic regression consistently outperforms random forest with regard to sensitivity. This implies that logistic regression may be the best option for studies whose goal is to accurately predict outcomes in the minority class. None of the missing data techniques stood out in terms of performance.
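As an illustration of the experimental design described above, the following Python sketch (using scikit-learn) simulates a 1:5 imbalanced classification task, injects MCAR missingness, and compares logistic regression and random forest under mean imputation and complete case analysis, reporting accuracy and sensitivity. The sample sizes, missingness rate and features are placeholder assumptions, and predictive mean matching is omitted for brevity; the thesis used its own simulation design and Swedish registry data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)

# Simulated data with a 1:5 minority:majority class imbalance (class 1 is the minority).
X, y = make_classification(n_samples=6000, n_features=10, weights=[5/6, 1/6], random_state=0)

# Inject 20% MCAR missingness: every cell is missing with the same probability.
mask = rng.random(X.shape) < 0.20
X_miss = X.copy()
X_miss[mask] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_miss, y, stratify=y, random_state=0)

models = {"logistic regression": LogisticRegression(max_iter=1000),
          "random forest": RandomForestClassifier(random_state=0)}

# Technique 1: mean imputation (fitted on the training data only).
imp = SimpleImputer(strategy="mean").fit(X_tr)
# Technique 2: complete case analysis (drop training rows with any missing value).
cc = ~np.isnan(X_tr).any(axis=1)

for name, model in models.items():
    model.fit(imp.transform(X_tr), y_tr)
    pred = model.predict(imp.transform(X_te))
    print(f"{name} + mean imputation: acc={accuracy_score(y_te, pred):.3f} "
          f"sens={recall_score(y_te, pred):.3f}")
    model.fit(X_tr[cc], y_tr[cc])
    pred = model.predict(imp.transform(X_te))  # test rows still need filling
    print(f"{name} + complete case:   acc={accuracy_score(y_te, pred):.3f} "
          f"sens={recall_score(y_te, pred):.3f}")
```

Sensitivity here is the recall of the minority class, matching the thesis's focus on minority-class prediction.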
|
92 |
Imputation of microsatellite alleles from SNP haplotypes for parental verification in Nellore cattle. Souza, Milla Albuquerque de. 31 January 2013.
Molecular marker techniques have been applied in bovine population studies, genealogy verification and paternity testing. Among molecular markers, microsatellites (MS) are widely used; however, some technical problems have motivated the development of alternatives, such as single nucleotide polymorphism (SNP) markers. This created the need to identify SNP haplotypes that are in agreement with each MS allele, so that MS genotypes could be converted into SNP genotypes and vice versa through genotype imputation. The objective of this study was to apply a method for imputing MS alleles from SNP haplotypes for paternity verification in Nellore cattle, and also to identify a smaller set of SNP of sufficient quality to optimize genotyping and reduce its cost. SNP and MS genotyping was performed for 99 Nellore trios from EMBRAPA Pecuária Sudeste (EMBRAPA Southeast Livestock), and the existence of null alleles was checked with the MICRO-CHECKER program. SNP located near each MS marker were selected, and the BEAGLE program was used to identify the linkage phase of the genotypes. Subsequently, MS alleles were imputed from the SNP haplotypes, and paternity was verified with the CERVUS program. The accuracy of MS allele imputation was assessed by calculating the concordance between imputed and reported MS alleles. The SPS115 marker was removed from the analysis owing to evidence of null alleles, due to the excess of observed homozygotes. The most informative marker was TGLA122, with a polymorphic information content (PIC) of 0.8. Deviations from Hardy-Weinberg equilibrium (P < 0.05) were found for the loci ETH225 and TGLA57. A larger set of SNP was necessary to impute MS alleles for the marker BM1824. Parentage verification rates were 97.1% for genotyped MS alleles and 96.3% for imputed MS alleles. Using imputed MS alleles, only 4% of the 99 offspring were not assigned paternity when only the sire was known, and 1% when both parents were known. The technique achieved greater than 96% accuracy for MS imputation and made it possible to impute multi-allelic genotypes from bi-allelic data. These results will have an immediate impact for researchers and breeders' associations aiming at the transition from MS- to SNP-based parentage verification.
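As a sketch of the accuracy measure used above — concordance between imputed and reported microsatellite alleles — the following Python fragment compares unphased allele pairs per animal at one marker. The genotype arrays are made-up placeholders, not the thesis data.

```python
import numpy as np

def allele_concordance(observed, imputed):
    """Fraction of alleles where the imputed call matches the observed call.

    observed, imputed: integer arrays of shape (n_animals, 2) holding the two
    allele lengths at one microsatellite marker; genotypes are unphased, so
    each pair is sorted before comparison.
    """
    obs = np.sort(observed, axis=1)
    imp = np.sort(imputed, axis=1)
    return (obs == imp).mean()

# Toy example: 4 animals genotyped at one marker (allele sizes in base pairs).
observed = np.array([[181, 183], [183, 183], [179, 181], [181, 185]])
imputed  = np.array([[181, 183], [183, 183], [181, 179], [181, 183]])
print(f"concordance: {allele_concordance(observed, imputed):.2f}")  # 0.88 (7 of 8 alleles match)
```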
|
93 |
Strategies for the treatment of variables with missing data during the development of predictive models. Assunção, Fernando. 09 May 2012.
Predictive models have been increasingly used by the market to help companies mitigate risk, grow portfolios, retain customers and prevent fraud, among other goals. During model development, however, it is common for some of the predictive variables to have unfilled values (missing data), making it necessary to adopt a procedure to treat these variables. Given this scenario, this study discusses methodologies for handling missing data in predictive models, encouraging the use of some that are already known in academia but not yet used by the market. The work describes seven methodologies, all of which were submitted to an empirical application using a data set from the development of a credit score model. Seven models were developed on this data set (one for each methodology), and their results were evaluated and compared through performance measures widely used by the market (KS, Gini, ROC and approval curve). In this application, the best-performing techniques were the one that treats missing data as a separate category (a technique already used by the market) and the methodology that groups missing data into the conceptually most similar category. The worst performance came from the methodology that simply discards the variable with missing data, another procedure commonly seen in the market.
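The best-performing treatment — coding missing values as a category of their own and scoring the resulting model with KS — can be sketched as follows in Python. The toy applicant data, variable names and missingness pattern are illustrative assumptions, not the credit score base used in the study.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy applicant data: 'income_band' has missing values (None).
n = 2000
income = rng.choice(np.array(["low", "mid", "high", None], dtype=object),
                    size=n, p=[0.3, 0.3, 0.2, 0.2])
default = rng.random(n) < np.where(income == "low", 0.3, 0.1)
df = pd.DataFrame({"income_band": income, "default": default})

# Treat missing as its own category, then one-hot encode.
df["income_band"] = df["income_band"].fillna("MISSING")
X = pd.get_dummies(df["income_band"])

score = LogisticRegression().fit(X, df["default"]).predict_proba(X)[:, 1]

# KS: maximum distance between the score distributions of goods and bads.
d = df["default"].to_numpy()
ks = ks_2samp(score[d], score[~d]).statistic
print(f"KS = {ks:.3f}")
```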
|
94 |
Statistical methods for certain large, complex data challenges. Li, Jun. 15 November 2018.
Big data involves large-volume, complex, growing data sets that present opportunities as well as challenges. This thesis focuses on statistical methods for several specific large, complex data challenges, each involving the representation of data with a complex format, the utilization of complicated information, and/or intensive computational cost.
The first problem we work on is hypothesis testing for multilayer network data, motivated by an example in computational biology. We show how to represent the complex structure of a multilayer network as a single data point within the space of supra-Laplacians, and then develop a central limit theorem and hypothesis testing theory for multilayer networks in that space. We develop both global and local testing strategies for mean comparison and investigate sample size requirements. The methods were applied to the motivating computational biology example and compared with the classic Gene Set Enrichment Analysis (GSEA); the comparison yields additional biological insights.
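For readers unfamiliar with the representation, the sketch below builds a supra-Laplacian for a toy two-layer network: the block diagonal of the intra-layer Laplacians plus uniform inter-layer coupling between copies of the same node. The coupling constant and adjacency matrices are illustrative assumptions; the thesis treats this object as a single data point in the space of supra-Laplacians.

```python
import numpy as np

def laplacian(A):
    """Combinatorial graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

def supra_laplacian(layers, coupling=1.0):
    """Supra-Laplacian of a multiplex network.

    layers: list of (n, n) adjacency matrices over the same n nodes.
    Each node is coupled to its own copies in every other layer with
    weight `coupling`.
    """
    m, n = len(layers), layers[0].shape[0]
    # Block diagonal of intra-layer Laplacians.
    L = np.zeros((m * n, m * n))
    for k, A in enumerate(layers):
        L[k*n:(k+1)*n, k*n:(k+1)*n] = laplacian(A)
    # Inter-layer coupling: the Laplacian of the complete graph on the m
    # copies of each node (m*I - J), applied node-wise via a Kronecker product.
    K = coupling * (m * np.eye(m) - np.ones((m, m)))
    return L + np.kron(K, np.eye(n))

# Two toy layers on 3 shared nodes (e.g., two interaction types).
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
A2 = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], float)
L_supra = supra_laplacian([A1, A2])
print(L_supra.shape)                          # (6, 6)
print(np.allclose(L_supra.sum(axis=1), 0))    # rows sum to zero, as for any Laplacian
```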
The second problem is the source detection problem in epidemiology, which is one of the most important issues for control of epidemics. Ideally, we want to locate the sources based on all history data. However, this is often infeasible, because the history data is complex, high-dimensional and cannot be fully observed. Epidemiologists have recognized the crucial role of human mobility as an important proxy to a complete history, but little in the literature to date uses this information for source detection. We recast the source detection problem as identifying a relevant mixture component in a multivariate Gaussian mixture model. Human mobility within a stochastic PDE model is used to calibrate the parameters. The capability of our method is demonstrated in the context of the 2000-2002 cholera outbreak in the KwaZulu-Natal province.
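A hedged sketch of the final identification step — choosing the mixture component (candidate source) with the highest posterior responsibility for the observed outbreak pattern — is given below. The means, covariances and priors would come from the mobility-calibrated model; here they are arbitrary stand-ins.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Each candidate source k induces a Gaussian over the observed case pattern x.
# In the thesis these parameters are calibrated from a mobility-driven
# stochastic PDE model; here they are arbitrary stand-in values.
means = [np.array([2.0, 0.5]), np.array([0.0, 1.5]), np.array([1.0, 1.0])]
covs = [np.eye(2) * s for s in (0.5, 0.8, 0.6)]
priors = np.array([0.3, 0.3, 0.4])

x = np.array([1.8, 0.7])  # observed (summarised) epidemic pattern

# Posterior responsibility of each source: prior * likelihood, normalised.
lik = np.array([multivariate_normal.pdf(x, m, c) for m, c in zip(means, covs)])
post = priors * lik
post /= post.sum()

for k, p in enumerate(post):
    print(f"source {k}: posterior probability {p:.3f}")
print("most likely source:", int(np.argmax(post)))
```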
The third problem is multivariate time series imputation, a classic problem in statistics. To address the common problem of low signal-to-noise ratio in high-dimensional multivariate time series, we propose state-space models that provide more precise inference of missing values by clustering multivariate time series components in a nonparametric way. The models are suitable for large-scale time series owing to their efficient parameter estimation.
|
95 |
Missing value substitution: an approach based on an evolutionary algorithm for data clustering. Silva, Jonathan de Andrade. 29 April 2010.
The substitution of missing values, also known as imputation, is an important data preparation task for data mining applications. This work proposes and evaluates an algorithm for missing value imputation based on an evolutionary algorithm for data clustering. The algorithm relies on the assumption that (previously unknown) clusters in the data can provide useful information for the imputation process. To assess the proposed method experimentally, missing values were simulated in six classification data sets using two missingness mechanisms widely used in controlled experiments: MCAR and MAR. Imputation algorithms have traditionally been assessed by measures of prediction capability. However, these traditional measures do not estimate the influence of the imputation method on the final modeling task (e.g., classification). This work reports experimental results from both the prediction perspective and the perspective of bias inserted into classification problems. Across different scenarios, the proposed algorithm performs, in general, similarly to six other imputation algorithms reported in the literature. Finally, the statistical analyses suggest that better prediction results do not necessarily imply less classification bias.
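To make the two missingness mechanisms concrete, the Python sketch below injects MCAR missingness (independent of everything) and MAR missingness (depending on another, observed feature) into a complete toy data set; the 20% rate and the conditioning feature are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))  # complete toy data: 3 numeric features

# MCAR: each value in column 0 is missing with the same probability.
X_mcar = X.copy()
X_mcar[rng.random(n) < 0.20, 0] = np.nan

# MAR: missingness in column 0 depends on the *observed* column 1 —
# here, values are more likely to be missing when column 1 is large.
X_mar = X.copy()
p_miss = 1 / (1 + np.exp(-2 * X[:, 1]))   # logistic in column 1
p_miss *= 0.20 / p_miss.mean()            # rescale to ~20% missing overall
X_mar[rng.random(n) < p_miss, 0] = np.nan

print(f"MCAR missing rate: {np.isnan(X_mcar[:, 0]).mean():.2f}")
print(f"MAR  missing rate: {np.isnan(X_mar[:, 0]).mean():.2f}")
```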
|
96 |
Exploratory Visualization of Data with Variable Quality. Huang, Shiping. 11 January 2005.
Data quality, which refers to the correctness, uncertainty, completeness and other aspects of data, has become an increasingly prevalent concern and has been addressed across multiple disciplines. Data quality issues can be introduced in any of the data manipulation processes, such as data collection, transformation, and visualization. Data visualization is a process of data mining and analysis using graphical presentation and interpretation. The correctness and completeness of visualization discoveries depend to a large extent on the quality of the original data. Without the integration of quality information with the data presentation, the analysis of data using visualization is incomplete at best and can lead to inaccurate or incorrect conclusions at worst. This thesis addresses the issue of data quality visualization. Incorporating data quality measures into data displays is challenging in that the display is apt to become cluttered when faced with many dimensions and data records. We both investigate the incorporation of data quality information into traditional multivariate data display techniques and develop novel visualization and interaction tools that operate in data quality space. We validate our results using several data sets that have variable quality associated with dimensions, records, and data values.
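One simple instance of the idea — encoding a per-record quality measure directly in a traditional display — is sketched below with matplotlib, mapping quality to point opacity in a scatter plot. The data and quality scores are synthetic placeholders, not the tools developed in the thesis.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n = 300
x, y = rng.normal(size=n), rng.normal(size=n)
quality = rng.uniform(0.1, 1.0, size=n)   # 1.0 = fully trusted record

# Encode per-record quality as opacity: unreliable records fade into the
# background, so patterns driven by them are visually discounted.
rgba = np.zeros((n, 4))
rgba[:, :3] = (0.27, 0.51, 0.71)          # steelblue
rgba[:, 3] = quality                      # alpha channel carries quality
plt.scatter(x, y, c=rgba, s=25, edgecolors="none")
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.title("Per-record data quality mapped to opacity")
plt.show()
```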
|
97 |
Genetics of disease resistance: application to bovine tuberculosis. Tsairidou, Smaragda. January 2016.
Bovine Tuberculosis (bTB) is a disease of significant economic importance, being one of the most persistent animal health problems in the UK and the Republic of Ireland and increasingly constituting a public health concern, especially in the developing world. Limitations of the currently available diagnostic and control methods, along with our incomplete understanding of bTB transmission, prevent successful eradication. This thesis addresses the development of a complementary control strategy based on animal genetics, which will allow us to identify animals genetically predisposed to be more resistant to disease. Specifically, the aim of my PhD project is to investigate the genetic architecture of resistance to bTB and demonstrate the feasibility of whole-genome prediction for the control of bTB in cattle. Genomic selection for disease resistance in livestock populations will assist in reducing herd-level incidence and the severity of potential outbreaks. The first objective was to explore the estimation of breeding values for bTB resistance in UK dairy cattle, and to test these genomic predictions in situations where disease phenotypes are not available on selection candidates. Using dense SNP chip data, the results of Chapter 2 demonstrate that genomic selection for bTB resistance is feasible (h² = 0.23, SE = 0.06) and that bTB resistance can be predicted using genetic markers, with an estimated prediction accuracy of r(g, ĝ) = 0.33 in these data. It was shown that genotypes help to predict disease state (AUC ≈ 0.58) and that animals lacking bTB phenotypes can be selected based on their genotypes. In Chapter 3, a novel approach is presented to identify loci displaying heterozygote (dis)advantage associated with resistance to M. bovis, hypothesising underlying non-additive genetic variation, and these results are compared with those obtained from standard genome scans. A marker was identified suggesting an association between locus heterozygosity and increased susceptibility to bTB, i.e. a heterozygote disadvantage, with heterozygotes being significantly more frequent among cases than among controls (χ² = 11.50, p < 0.001). Secondly, this thesis conducted a meta-analysis of two dairy cattle populations with bTB phenotypes and SNP chip genotypes, identifying genomic regions underlying bTB resistance and testing genomic predictions by means of cross-validation. In Chapter 4, exploration of the genetic architecture of the trait revealed that bTB resistance is a moderately polygenic, complex trait, with clusters of causal variants spread across a few major chromosomes collectively controlling the trait. A region putatively associated with bTB resistance was identified on chromosome 6, and this chromosome as a whole was shown to contribute a major proportion (chromosomal h² = 0.051) of the observed variation in this data set. Genomic prediction for bTB was shown to be feasible even when only distantly related populations are combined (r(g, ĝ) = 0.33, SE = 0.05), with the chromosomal heritability results suggesting that the accuracy arises from the SNPs capturing linkage disequilibrium between markers and QTL, as well as additive relationships between animals (~80% of the estimated genomic h² is due to relatedness).
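The heterozygote disadvantage finding rests on a 2x2 chi-square test of heterozygosity against case/control status, which can be sketched as follows; the counts below are made up for illustration and do not reproduce the reported χ² = 11.50.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts at one locus: heterozygous vs homozygous animals
# among bTB cases and controls (illustrative numbers only).
#                  het   hom
table = np.array([[310, 190],    # cases
                  [240, 260]])   # controls

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.2f}, p = {p:.4g}")

# A heterozygote *disadvantage* corresponds to heterozygotes being
# over-represented among cases relative to controls.
het_rate_cases, het_rate_controls = table[:, 0] / table.sum(axis=1)
print(f"het rate: cases {het_rate_cases:.2f} vs controls {het_rate_controls:.2f}")
```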
To extend the analysis, in Chapter 5, high-density genotypes were inferred by means of genotype imputation, anticipating that these analyses would allow genomic regions associated with bTB resistance to be identified more precisely and would increase the prediction accuracy. Genotype imputation was successful; however, using all imputed genotypes added little information. The limiting factor was found to be the number of animals and the trait definitions rather than the density of genotypes. Thirdly, a quantitative genetic analysis of actual Single Intradermal Comparative Cervical Test (SICCT) values collected during bTB herd testing was conducted, aiming to investigate whether selection for bTB resistance is likely to have an impact on the SICCT diagnostic test. This analysis demonstrated that the SICCT has a negligibly low heritability (h² = 0.0104, SE = 0.0032) and that any effect on responsiveness to the test is likely to be small. In conclusion, breeding for disease resistance in livestock is feasible, and we can predict the risk of bTB in cattle using genomic information. Further, putative QTLs associated with bTB resistance were identified, and exploration of the genetic architecture of bTB resistance revealed a moderately polygenic trait. These results suggest that, as larger data sets with more phenotyped and genotyped animals become available, we can breed for bTB resistance and implement genomic selection technology in breeding programmes aiming to improve the disease status and overall health of the livestock population. Because selection is based on genomic information, it can be continued even as the epidemic declines.
|
98 |
Statistical Learning Methods for Personalized Medical Decision Making. Liu, Ying. January 2016.
The theme of my dissertation is merging statistical modeling with medical domain knowledge and machine learning algorithms to assist in making personalized medical decisions. In its simplest form, making personalized medical decisions about treatment choices and disease diagnosis modality choices can be transformed into classification or prediction problems in machine learning, where the optimal decision for an individual is a decision rule that yields the best future clinical outcome or maximizes diagnostic accuracy. However, challenges emerge when analyzing complex medical data. On one hand, statistical modeling is needed to deal with inherent practical complications such as missing data, patients' loss to follow-up, and ethical and resource constraints in randomized controlled clinical trials. On the other hand, new data types and larger scales of data call for innovations combining statistical modeling, domain knowledge and information technologies. This dissertation contains three parts, addressing the estimation of an optimal personalized rule for choosing treatment, the estimation of an optimal individualized rule for choosing a disease diagnosis modality, and methods for variable selection in the presence of missing data.
In the first part of this dissertation, we propose a method to find optimal dynamic treatment regimens (DTRs) in Sequential Multiple Assignment Randomized Trial (SMART) data. DTRs are sequential decision rules, tailored at each stage of treatment by potentially time-varying patient features and intermediate outcomes observed in previous stages. The complexity, patient heterogeneity, and chronicity of many diseases and disorders call for learning optimal DTRs that best dynamically tailor treatment to each individual's response over time. We propose a robust and efficient approach, referred to as Augmented Multistage Outcome-Weighted Learning (AMOL), to identify optimal DTRs from sequential multiple assignment randomized trials. We improve outcome-weighted learning (Zhao et al. 2012) to allow for negative outcomes; we propose methods to reduce the variability of weights to achieve numeric stability and higher efficiency; and finally, for multiple-stage trials, we introduce robust augmentation to improve efficiency by drawing information from Q-function regression models at each stage. The proposed AMOL remains valid even if the regression model is misspecified. We formally justify that a proper choice of augmentation guarantees smaller stochastic errors in value function estimation for AMOL, and we establish convergence rates for AMOL. The comparative advantage of AMOL over existing methods is demonstrated in extensive simulation studies and in applications to two SMART data sets: a two-stage trial for attention deficit hyperactivity disorder and the STAR*D trial for major depressive disorder.
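As a rough illustration of the outcome-weighted learning idea underlying AMOL — recasting treatment rule estimation as weighted classification, with weights proportional to outcome over randomization probability — the sketch below fits a weighted SVM for a single-stage trial. The simulated trial and the simple outcome shift used to make weights non-negative are assumptions for illustration; AMOL handles negative outcomes more carefully than this.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 2))                  # patient features
A = rng.choice([-1, 1], size=n)              # randomised treatment, P(A) = 0.5
# Outcome: treatment 1 helps when feature 0 is positive, and vice versa.
Y = X[:, 0] * A + rng.normal(scale=0.5, size=n)

# Outcome-weighted learning: classify A with weights Y / P(A | X).
# Weights must be non-negative for standard solvers; one crude device
# (a stand-in for the more careful handling in the dissertation) is shifting Y.
w = (Y - Y.min()) / 0.5

rule = SVC(kernel="linear").fit(X, A, sample_weight=w)

# The fitted classifier is the estimated individualised treatment rule.
x_new = np.array([[1.2, 0.0], [-1.2, 0.0]])
print("recommended treatments:", rule.predict(x_new))  # expect [1, -1]
```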
The second part of the dissertation introduces a machine learning algorithm to estimate personalized decision rules for medical diagnosis/screening that maximize a weighted combination of sensitivity and specificity. Using subject-specific risk factors and feature variables, such rules administer screening tests with balanced sensitivity and specificity, and thus protect low-risk subjects from the unnecessary pain and stress caused by false positive tests, while achieving high sensitivity for subjects at high risk. We conducted a simulation study mimicking a real breast cancer study and found significant improvements in sensitivity and specificity when comparing our personalized screening strategy (assigning mammography+MRI to high-risk patients and mammography alone to low-risk subjects, based on a composite score of their risk factors) to a one-size-fits-all strategy (assigning mammography+MRI or mammography alone to all subjects). When applied to Parkinson's disease (PD) FDG-PET and fMRI data, we showed that the method provides individualized modality selection that can improve AUC, and that it can provide interpretable decision rules for choosing a brain imaging modality for early detection of PD. To the best of our knowledge, this is the first proposal in the literature of an automatic, data-driven learning algorithm for personalized diagnosis/screening strategies.
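A minimal version of the decision target — a rule on a composite risk score chosen to maximize a weighted combination of sensitivity and specificity — can be sketched as a threshold search; the synthetic scores and the 0.6/0.4 weighting below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic composite risk scores: diseased subjects score higher on average.
risk = np.concatenate([rng.normal(1.0, 1.0, 200),     # diseased
                       rng.normal(-0.5, 1.0, 800)])   # healthy
disease = np.concatenate([np.ones(200, bool), np.zeros(800, bool)])

def weighted_sens_spec(threshold, w=0.6):
    """w * sensitivity + (1 - w) * specificity for the rule: assign the
    intensive test (e.g. mammography + MRI) when risk >= threshold."""
    flag = risk >= threshold
    sens = flag[disease].mean()
    spec = (~flag[~disease]).mean()
    return w * sens + (1 - w) * spec

thresholds = np.linspace(risk.min(), risk.max(), 200)
best = max(thresholds, key=weighted_sens_spec)
print(f"best threshold: {best:.2f}, objective: {weighted_sens_spec(best):.3f}")
```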
In the last part of the dissertation, we propose a method, Multiple Imputation Random Lasso (MIRL), to select important variables and to predict the outcome in the presence of missing data, for an epidemiological study of Eating and Activity in Teens. In this study, 80% of individuals have at least one variable missing; therefore, using variable selection methods developed for complete data after list-wise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL, which combines penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms the other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and its advantage is greater when the correlation among variables is high and the missing proportion is high. MIRL also shows improved performance relative to other applicable methods when applied to the study of Eating and Activity in Teens, for boys and girls separately, and for a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
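The general shape of MIRL — several stochastic imputations, a lasso on bootstrap samples of each, and selection by how often each variable survives — can be sketched as follows with scikit-learn; the imputation count, penalty and stability threshold are illustrative, not the tuning used in the dissertation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Lasso

rng = np.random.default_rng(11)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # truth: variables 0 and 3
X[rng.random((n, p)) < 0.2] = np.nan                    # 20% missing values

n_imputations, n_boot = 5, 20
selected = np.zeros(p)
for m in range(n_imputations):
    # One stochastic imputation per m (a different seed each time).
    Xm = IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                     # bootstrap sample
        coef = Lasso(alpha=0.1).fit(Xm[idx], y[idx]).coef_
        selected += coef != 0                           # count survivals

stability = selected / (n_imputations * n_boot)
print("selection frequency per variable:", np.round(stability, 2))
print("stable set (freq >= 0.8):", np.where(stability >= 0.8)[0])
```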
|
99 |
Hidden Markov model for imputation of molecular marker genotypes: an application in QTL mapping using a Bayesian approach. Medeiros, Elias Silva de. 28 August 2014.
Many quantitative traits are significantly influenced by genetic factors; in general, several genes contribute to the variation of one or more quantitative traits. Missing information about genotypes at molecular markers is a common problem in genetic mapping studies and, consequently, in mapping the loci that control these phenotypic traits (QTL). Unobserved data arise mainly from genotyping errors and uninformative markers. To solve this problem, a hidden Markov model was used to infer the missing genotypes, and accuracy measures demonstrated the successful application of this imputation technique. Once the data are imputed, they are no longer treated as random variables in the Bayesian inference, resulting in a reduction of the parameter space of the model. Another major difficulty in QTL mapping is that the number of QTL influencing a given trait is unknown, which raises several problems; one of them is the dimension of the parameter space and, consequently, the sampling of the posterior distribution. To circumvent this problem, the Reversible Jump Markov chain Monte Carlo method was used, since it allows models with different numbers of parameters to be visited between iterations. The Bayesian approach detected five QTL for the studied trait. All analyses were implemented in the statistical software R.
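To illustrate the core computation — a hidden Markov model yielding posterior probabilities for a missing marker genotype given its flanking markers — below is a small hand-rolled forward-backward pass over a toy three-marker, two-state chain. The transition and emission values are made up, not the map-distance-based parameters a real analysis would use.

```python
import numpy as np

# Toy HMM along a chromosome: hidden state = parental origin (0 or 1),
# observations = marker genotypes {0, 1}. Marker 1 is missing and is imputed.
states = 2
pi = np.array([0.5, 0.5])                  # initial state distribution
T = np.array([[0.9, 0.1], [0.1, 0.9]])     # transition between adjacent markers
E = np.array([[0.9, 0.1], [0.1, 0.9]])     # emission P(genotype | state)

obs = [0, None, 0]                         # observed genotypes; None = missing

def emis(t, s):
    # A missing observation carries no information: likelihood 1 for any state.
    return 1.0 if obs[t] is None else E[s, obs[t]]

n = len(obs)
fwd = np.zeros((n, states))
bwd = np.ones((n, states))
fwd[0] = [pi[s] * emis(0, s) for s in range(states)]
for t in range(1, n):
    for s in range(states):
        fwd[t, s] = emis(t, s) * np.dot(fwd[t - 1], T[:, s])
for t in range(n - 2, -1, -1):
    for s in range(states):
        bwd[t, s] = sum(T[s, s2] * emis(t + 1, s2) * bwd[t + 1, s2]
                        for s2 in range(states))

# Posterior over the hidden state at the missing marker, then over genotypes.
post_state = fwd[1] * bwd[1]
post_state /= post_state.sum()
post_geno = post_state @ E
# -> roughly [0.86, 0.14]: agreeing flanking markers make genotype 0 likely.
print("P(genotype | flanking markers):", np.round(post_geno, 3))
```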
|
100 |
Methods for handling missing data in cohort studies where outcomes are truncated by death. Wen, Lan. January 2018.
This dissertation addresses problems found in observational cohort studies where the repeated outcomes of interest are truncated by both death and by dropout. In particular, we consider methods that make inference for the population of survivors at each time point, otherwise known as 'partly conditional inference'. Partly conditional inference distinguishes between the reasons for missingness; failure to make this distinction will cause inference to be based not only on pre-death outcomes which exist but also on post-death outcomes which fundamentally do not exist. Such inference is called 'immortal cohort inference'. Investigations of health and cognitive outcomes in two studies - the 'Origins of Variance in the Old Old' and the 'Health and Retirement Study' - are conducted. Analysis of these studies is complicated by outcomes of interest being missing because of death and dropout. We show, first, that linear mixed models and joint models (that model both the outcome and survival processes) produce immortal cohort inference. This makes the parameters in the longitudinal (sub-)model difficult to interpret. Second, a thorough comparison of well-known methods used to handle missing outcomes - inverse probability weighting, multiple imputation and linear increments - is made, focusing particularly on the setting where outcomes are missing due to both dropout and death. We show that when the dropout models are correctly specified for inverse probability weighting, and the imputation models are correctly specified for multiple imputation or linear increments, then the assumptions of multiple imputation and linear increments are the same as those of inverse probability weighting only if the time of death is included in the dropout and imputation models. Otherwise they may not be. Simulation studies show that each of these methods gives negligibly biased estimates of the partly conditional mean when its assumptions are met, but potentially biased estimates if its assumptions are not met. In addition, we develop new augmented inverse probability weighted estimating equations for making partly conditional inference, which offer double protection against model misspecification. That is, as long as one of the dropout and imputation models is correctly specified, the partly conditional inference is valid. Third, we describe methods that can be used to make partly conditional inference for non-ignorable missing data. Both monotone and non-monotone missing data are considered. We propose three methods that use a tilt function to relate the distribution of an outcome at visit j among those who were last observed at some time before j to those who were observed at visit j. Sensitivity analyses to departures from ignorable missingness assumptions are conducted on simulations and on real datasets. The three methods are: i) an inverse probability weighted method that up-weights observed subjects to represent subjects who are still alive but are not observed; ii) an imputation method that replaces missing outcomes of subjects who are alive with their conditional mean outcomes given past observed data; and iii) a new augmented inverse probability method that combines the previous two methods and is doubly-robust against model misspecification.
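As a hedged sketch of the first of these ideas — inverse probability weighting that up-weights observed survivors to represent survivors who dropped out, yielding a partly conditional mean — consider the following simulation; the logistic observation model and toy cohort are stand-ins, not the dissertation's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n = 2000
y0 = rng.normal(0, 1, n)                                     # baseline outcome (fully observed)
alive = rng.random(n) < 1 / (1 + np.exp(-(1 - 0.5 * y0)))    # survival to visit 1
y1 = np.where(alive, y0 + rng.normal(0.5, 1, n), np.nan)     # exists only if alive

# Dropout among survivors depends on the observed baseline outcome (MAR).
observed = alive & (rng.random(n) < 1 / (1 + np.exp(-(0.5 + 0.8 * y0))))

# Model P(observed | alive, past data); weight observed survivors by its inverse.
pi = LogisticRegression().fit(y0[alive, None], observed[alive]) \
                         .predict_proba(y0[alive, None])[:, 1]
w = 1 / pi[observed[alive]]

# Partly conditional mean: E[Y1 | alive at visit 1] — not an 'immortal cohort' mean,
# because only survivors (for whom Y1 exists) enter the estimand.
ipw_mean = np.sum(w * y1[observed]) / np.sum(w)
print(f"naive complete-case mean:     {y1[observed].mean():.3f}")
print(f"IPW partly conditional mean:  {ipw_mean:.3f}")
print(f"true mean among survivors:    {y1[alive].mean():.3f}")
```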
|