
Sensitivity Analysis in Handling Discrete Data Missing at Random in Hierarchical Linear Models via Multivariate Normality

Zheng, Xiyu 01 January 2016 (has links)
Abstract: In a two-level hierarchical linear model (HLM2), the outcome as well as the covariates may have missing values at any of the levels. One way to analyze all available data is to estimate, by maximum likelihood (ML), a multivariate normal joint distribution of the incompletely observed variables, including the outcome, conditional on the completely observed covariates; to draw multiple imputations (MI) of the missing values given the estimated joint model; and then to analyze the hierarchical model given the MI [1,2]. The assumption is that data are missing at random (MAR). While this method yields efficient estimation of the hierarchical model, it often handles discrete missing data under multivariate normality. In this thesis, we evaluate how robust the method is when it is used to estimate a hierarchical linear model given discrete missing data. We simulate incompletely observed data from a series of hierarchical linear models with discrete covariates MAR, estimate the models by the method, and assess the sensitivity of handling discrete missing data under the multivariate normal joint distribution by computing bias, root mean squared error, standard error, and coverage probability of the estimated hierarchical linear models in a series of simulation studies. The aim is to evaluate the performance of the method in handling binary covariates MAR. We let the missing patterns of the level-1 and level-2 binary covariates depend on completely observed variables and assess how the method handles binary missing data across different values of the success probabilities and missing rates. Based on the simulation results, the missing data analysis is robust under certain parameter settings: it performs very well for estimation of the level-1 fixed and random effects across varying success probabilities and missing rates, but the level-2 binary covariate is not well estimated when its missing rate exceeds 10%. The rest of the thesis is organized as follows. Section 1 introduces the background, including conventional methods for hierarchical missing data analysis, the different missing data mechanisms, and the innovation and significance of this study. Section 2 explains the efficient missing data method. Section 3 presents the sensitivity analysis of the missing data method and explains how we carry out the simulation study using SAS, the software package HLM7, and R. Section 4 reports the results and offers recommendations for researchers who want to use the missing data method for binary covariates MAR in HLM2. Section 5 presents an illustrative analysis of the National Growth and Health Study (NGHS) by the missing data method. The thesis ends with a list of references to guide future study and the simulation code we used.
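
A minimal R sketch of one simulation cell of the kind this abstract describes: a binary level-1 covariate is made MAR given completely observed variables, imputed under a normal model, rounded back to {0, 1}, and the two-level model is refit on each completed data set. Here mice's univariate "norm" imputation stands in for the joint multivariate normal ML model of the thesis, and lme4 stands in for HLM7; both substitutions, and all parameter values, are illustrative assumptions.

```r
# Sketch: binary level-1 covariate MAR, imputed under a normal model, rounded
# back to {0,1}, two-level model refit and pooled. mice's "norm" method stands
# in for the thesis's joint multivariate normal model; pooling lmer fits
# requires the broom.mixed package to be installed.
library(mice)
library(lme4)

set.seed(1)
J <- 50; n <- 20                             # 50 level-2 units, 20 subjects each
g <- rep(1:J, each = n)
u <- rnorm(J)[g]                             # level-2 random intercepts
x <- rbinom(J * n, 1, 0.5)                   # binary level-1 covariate
z <- rnorm(J * n)                            # completely observed covariate
y <- 1 + 2 * x + 0.5 * z + u + rnorm(J * n)

# MAR: missingness in x depends only on the completely observed y and z
x_mis <- ifelse(runif(J * n) < plogis(-2 + 0.5 * z + 0.3 * y), NA, x)

dat <- data.frame(y, x = x_mis, z, g)
imp <- mice(dat, m = 10, method = c("", "norm", "", ""), printFlag = FALSE)

fits <- lapply(1:10, function(k) {
  d <- complete(imp, k)
  d$x <- as.integer(d$x > 0.5)               # round normal imputations to {0,1}
  lmer(y ~ x + z + (1 | g), data = d)
})
summary(pool(as.mira(fits)))                 # pooled x effect vs. the truth (2)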

Estratégias para tratamento de variáveis com dados faltantes durante o desenvolvimento de modelos preditivos / Strategies for treatment of variables with missing data during the development of predictive models

Assunção, Fernando 09 May 2012 (has links)
Predictive models have been increasingly used by the market to assist companies in risk mitigation, portfolio growth, customer retention, and fraud prevention, among other goals. During model development, however, it is common for some of the predictive variables to have unfilled values (missings), so some procedure must be adopted to treat these variables. Given this scenario, this study discusses frameworks for handling missing data in predictive models, encouraging the use of some that are already known in academia but not yet used by the market. The work describes seven methods, all of which were submitted to an empirical application using a data set from the development of a Credit Score model. Seven models were developed on this data set (one for each method described), and their results were evaluated and compared through performance metrics widely used by the market (KS, Gini, ROC curve, and Approval curve). In this application, the best-performing techniques were the one that treats missing data as a separate category (a technique already used by the market) and the one that groups the missing data into the conceptually most similar category. The worst performer was the framework that simply drops the variable containing missing values, another procedure commonly seen in the market.
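
A brief R sketch of the two best-performing treatments named above, on hypothetical data: recoding missing values as their own factor level versus merging them into the conceptually closest category, then comparing a KS statistic between the two scorecards. The variable names, the logistic scoring model, and the choice of "low" as the closest category are illustrative assumptions, not the thesis's data or models.

```r
# Sketch: two treatments of a missing categorical predictor in a credit-score
# model. Data, variable names, and the "closest" category are hypothetical.
set.seed(2)
n <- 5000
income_band <- sample(c("low", "mid", "high", NA), n, TRUE, c(.3, .3, .2, .2))
default     <- rbinom(n, 1, ifelse(is.na(income_band), .25,
                      c(low = .30, mid = .15, high = .05)[income_band]))

# Treatment 1: missing values become their own category
x1 <- factor(ifelse(is.na(income_band), "MISSING", income_band))
# Treatment 2: missing values merged into the conceptually most similar category
x2 <- factor(ifelse(is.na(income_band), "low", income_band))

ks_stat <- function(score, bad) {        # KS: max gap between bad/good score CDFs
  ks.test(score[bad == 1], score[bad == 0])$statistic
}
m1 <- glm(default ~ x1, family = binomial)
m2 <- glm(default ~ x2, family = binomial)
c(KS_separate = ks_stat(fitted(m1), default),
  KS_grouped  = ks_stat(fitted(m2), default))
```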

Avaliação de redes Bayesianas para imputação em variáveis qualitativas e quantitativas. / Evaluating Bayesian networks for imputation with qualitative and quantitative variables.

Magalhães, Ismenia Blavatsky de 29 March 2007 (has links)
Bayesian networks are structures that combine probability distributions with graphs. Although Bayesian networks first appeared in the 1980s, and the first attempts to solve problems arising from non-response date back to the 1930s and 1940s, the use of such structures specifically for imputation is quite recent: in 2002 by official statistical institutes, and in 2003 in the context of data mining. The purpose of this work is to present results on the application of discrete and mixed Bayesian networks for imputation. To that end, we propose an algorithm that combines expert knowledge with experimental data observed in previous research or in part of the collected data. In applying Bayesian networks in this context, we assume that once the variables are preserved in their original relationships, the imputation method will be effective in maintaining desirable properties. Accordingly, three types of consistency already found in the literature are evaluated: database consistency, logical consistency, and statistical consistency. In addition, we propose structural consistency, defined as the ability of a network, when built from the data after imputation, to keep its structure within the equivalence class of the original network. A mixed Bayesian network is used for the first time to treat non-response in quantitative variables. A statistical consistency measure for mixed networks is computed using multiple imputation to evaluate network parameters and regression models. As applications, experiments were conducted on household and person data from the 2000 Demographic Census of the municipality of Natal and on data from a study of homicides in Campinas. The results indicate that Bayesian networks for imputation of discrete attributes are promising, particularly when the interest is in maintaining statistical consistency and the number of classes of the variable is small. Other characteristics, such as the contingency coefficient between variables, are affected by the method as the percentage of non-response increases. For continuous attributes, the median is the most sensitive to the method.
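
A small R sketch of network-based imputation and of the proposed structural-consistency check: learn a structure from the complete cases, fit the parameters, impute, relearn the structure from the imputed data, and test whether it stays in the same equivalence class (equal CPDAGs). The choice of the bnlearn package, its built-in learning.test data, and hill-climbing structure learning are illustrative assumptions, not the thesis's algorithm.

```r
# Sketch: Bayesian-network imputation plus a structural-consistency check.
# bnlearn and hill-climbing are assumptions standing in for the thesis's method.
library(bnlearn)

data(learning.test)                     # built-in discrete data set in bnlearn
d <- learning.test
d$B[sample(nrow(d), 500)] <- NA         # knock out 10% of one variable

cc  <- d[complete.cases(d), ]
dag <- hc(cc)                           # hill-climbing structure learning
fit <- bn.fit(dag, cc)                  # parameter estimation

d_imp <- impute(fit, data = d, method = "bayes-lw")  # likelihood-weighting imputation

# Structural consistency: is the relearned network in the same equivalence class?
dag_after <- hc(d_imp)
all.equal(cpdag(dag), cpdag(dag_after)) # equal CPDAGs <=> same equivalence class
```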

Statistical Learning Methods for Personalized Medical Decision Making

Liu, Ying January 2016 (has links)
The theme of my dissertation is merging statistical modeling with medical domain knowledge and machine learning algorithms to assist personalized medical decision making. In its simplest form, making personalized decisions about treatment choices and disease diagnosis modalities can be transformed into classification or prediction problems in machine learning, where the optimal decision for an individual is a decision rule that yields the best future clinical outcome or maximizes diagnostic accuracy. Challenges emerge, however, when analyzing complex medical data. On one hand, statistical modeling is needed to deal with inherent practical complications such as missing data, patients lost to follow-up, and the ethical and resource constraints of randomized controlled clinical trials. On the other hand, new data types and larger scales of data call for innovations combining statistical modeling, domain knowledge, and information technology. This dissertation contains three parts, addressing the estimation of optimal personalized rules for choosing treatment, the estimation of optimal individualized rules for choosing a disease diagnosis modality, and methods for variable selection in the presence of missing data.

In the first part, we propose a method to find optimal dynamic treatment regimens (DTRs) from Sequential Multiple Assignment Randomized Trial (SMART) data. DTRs are sequential decision rules, tailored at each stage of treatment by potentially time-varying patient features and intermediate outcomes observed in previous stages. The complexity, patient heterogeneity, and chronicity of many diseases and disorders call for learning optimal DTRs that dynamically tailor treatment to each individual's response over time. We propose a robust and efficient approach, Augmented Multistage Outcome-Weighted Learning (AMOL), to identify optimal DTRs from SMARTs. We improve outcome-weighted learning (Zhao et al. 2012) to allow for negative outcomes; we propose methods to reduce the variability of the weights to achieve numerical stability and higher efficiency; and, for multiple-stage trials, we introduce robust augmentation that improves efficiency by drawing information from Q-function regression models at each stage. AMOL remains valid even if the regression model is misspecified. We formally justify that a proper choice of augmentation guarantees smaller stochastic errors in value-function estimation, and we establish convergence rates for AMOL. Its comparative advantage over existing methods is demonstrated in extensive simulation studies and in applications to two SMART data sets: a two-stage trial for attention deficit hyperactivity disorder and the STAR*D trial for major depressive disorder.

The second part introduces a machine learning algorithm to estimate personalized decision rules for medical diagnosis/screening that maximize a weighted combination of sensitivity and specificity. Using subject-specific risk factors and feature variables, such rules administer screening tests with balanced sensitivity and specificity, protecting low-risk subjects from the unnecessary pain and stress of false positive tests while achieving high sensitivity for subjects at high risk. In a simulation study mimicking a real breast cancer study, we found significant improvements in sensitivity and specificity when comparing our personalized screening strategy (assigning mammography plus MRI to high-risk patients and mammography alone to low-risk subjects, based on a composite score of their risk factors) with one-size-fits-all strategies (assigning mammography plus MRI, or mammography alone, to all subjects). Applied to Parkinson's disease (PD) FDG-PET and fMRI data, the method provided individualized modality selection that improves AUC and yields interpretable decision rules for choosing a brain imaging modality for early detection of PD. To the best of our knowledge, this is the first automatic, data-driven learning algorithm for personalized diagnosis/screening strategies proposed in the literature.

In the last part, we propose Multiple Imputation Random Lasso (MIRL), a method to select important variables and predict the outcome in an epidemiological study of Eating and Activity in Teens in the presence of missing data. In this study, 80% of individuals have at least one variable missing, so variable selection methods developed for complete data, applied after list-wise deletion, substantially reduce prediction power. Recent work on prediction models with incomplete data cannot adequately accommodate large numbers of variables with arbitrary missing patterns. MIRL combines penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies compare MIRL with several alternatives: it outperforms the other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection, with greater advantage when the correlation among variables is high and the missing proportion is high. MIRL also shows improved performance relative to other applicable methods when applied to the Eating and Activity in Teens study for boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys at high risk of developing obesity.
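
A compact R sketch of the MIRL recipe as the abstract states it: impute with mice, fit a lasso on bootstrap resamples of each completed data set, and rank variables by how often they are selected (stability selection). This is a schematic reading of the abstract, not the authors' implementation; the resample counts, the lambda rule, and the assumption of numeric predictors are all illustrative choices.

```r
# Sketch of MIRL: multiple imputation + lasso + stability selection.
# A schematic of the abstract, not the authors' code; assumes numeric predictors.
library(mice); library(glmnet)

mirl_rank <- function(df, outcome, m = 5, B = 50) {
  imp  <- mice(df, m = m, printFlag = FALSE)
  vars <- setdiff(names(df), outcome)
  hits <- matrix(0, length(vars), 0, dimnames = list(vars, NULL))
  for (k in 1:m) {
    d <- complete(imp, k)
    for (b in 1:B) {                     # bootstrap resamples per imputation
      i    <- sample(nrow(d), replace = TRUE)
      X    <- as.matrix(d[i, vars]); y <- d[i, outcome]
      cv   <- cv.glmnet(X, y, alpha = 1) # lasso with cross-validated lambda
      beta <- coef(cv, s = "lambda.1se")[-1, 1]
      hits <- cbind(hits, as.integer(beta != 0))
    }
  }
  sort(rowMeans(hits), decreasing = TRUE)  # selection frequency per variable
}

# Usage: frequencies near 1 indicate stably selected predictors
# freqs <- mirl_rank(my_data, outcome = "y")
```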

Imputação múltipla de dados faltantes: exemplo de aplicação no Estudo Pró-Saúde / Multiple imputation of missing data: application in the Pro-Saude Program

Thaís de Paulo Rangel 05 March 2013 (has links)
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). / Missing data are a common problem in epidemiologic studies and, depending on how they occur, the resulting estimates may be biased. The literature offers several techniques for dealing with this problem, and multiple imputation has received particular attention in recent years. This dissertation presents the results of applying multiple imputation of missing data in the context of the Pro-Saude Study, a longitudinal study among civil servants at a university in Rio de Janeiro, Brazil. In the first paper, after simulating the occurrence of missing data, the color/race variable of the female servants was imputed and analyzed through a previously established survival model with self-reported history of uterine leiomyoma as the outcome. The procedure was replicated one hundred times to determine the distribution of the coefficients and standard errors of the imputed variable. Although the data were collected cross-sectionally (baseline data of the Pro-Saude Study, gathered in 1999 and 2001), the participants' follow-up histories were reconstructed from self-reported information, creating a situation in which the Cox proportional hazards model could be applied. In the scenarios evaluated, imputation showed adequate results, including in the performance analyses. The technique performed well when the missing-data mechanism was MAR (missing at random) and the percentage of missing data was 10%. Imputing the missing information and combining the estimates from the 10 resulting data sets (m = 10) produced a bias of 0.0011 for black women and 0.0015 for brown (mixed-race) women, corroborating the efficiency of multiple imputation in this scenario. Other configurations showed similar results. The second paper develops a tutorial for applying multiple imputation in epidemiologic studies, which should make the technique easier to use for Brazilian researchers not yet familiar with the procedure. The basic steps and decisions needed to impute a data set are presented, and one of the scenarios from the first paper serves as an application example. All analyses were conducted in the R statistical software, version 2.15, and the scripts used are presented at the end of the text.
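
A minimal R sketch of the workflow this abstract describes: impute a categorical covariate with mice (m = 10), fit the Cox model on each completed data set, and combine the estimates with Rubin's rules. The synthetic variables below are placeholders for the Pro-Saude data, and the MAR mechanism is an illustrative assumption.

```r
# Sketch: multiple imputation (m = 10) of a color/race covariate feeding a Cox
# model, pooled by Rubin's rules. Placeholder data, not the Pro-Saude study's.
library(mice)
library(survival)

set.seed(3)
n     <- 1000
race  <- factor(sample(c("white", "black", "brown"), n, TRUE))
age   <- rnorm(n, 45, 8)
time  <- rexp(n, rate = 0.01 * exp(0.3 * (race == "black")))
event <- rbinom(n, 1, 0.8)
race[runif(n) < plogis(-4.3 + 0.05 * age)] <- NA  # roughly 10% MAR given age

dat  <- data.frame(time, event, race, age)
imp  <- mice(dat, m = 10, printFlag = FALSE)      # polyreg for the factor race
fits <- with(imp, coxph(Surv(time, event) ~ race + age))
summary(pool(fits))                               # pooled log-hazard ratios
```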

A Monte Carlo Study: The Impact of Missing Data in Cross-Classification Random Effects Models

Alemdar, Meltem 12 August 2009 (has links)
Unlike multilevel data with a purely nested structure, cross-classified data not only may be clustered into hierarchically ordered units but also may belong to more than one unit at a given level of the hierarchy. In a cross-classified design, students at a given school might come from several different neighborhoods, and one neighborhood might send students to a number of different schools. In this scenario, schools and neighborhoods are cross-classified factors, and cross-classified random effects modeling (CCREM) should be used to analyze the data appropriately. A common problem in any multilevel analysis is missing data at any given level. Little research has been conducted in the multilevel literature on the impact of missing data, and none in the area of cross-classified models. The purpose of this study was to examine the effect of data that are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) on CCREM estimates, while exploring multiple imputation to handle the missing data. In addition, this study examined the impact of including, in the imputation model, an auxiliary variable correlated with the variable with missingness (the level-1 predictor). The study expanded on the CCREM Monte Carlo simulation work of Meyers (2004) by studying the effect of missing data, and of the method for handling it, on CCREM. The results demonstrated that, in general, multiple imputation met Hoogland and Boomsma's (1998) relative-bias criterion for parameter estimates (less than 5% in magnitude) under the different missing data patterns. For the standard error estimates, substantial relative bias (defined by Hoogland and Boomsma as greater than 10%) was found in some conditions: when multiple imputation was used to handle the missing data, substantial bias appeared in the standard errors in most cells where data were MNAR, and this bias increased as a function of the percentage of missing data.
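
A short R sketch of the design this abstract studies: a cross-classified random effects model with an incomplete level-1 predictor, imputed with mice using an auxiliary variable correlated with it. lme4's crossed random effects syntax stands in for the CCREM software used in the study, and all parameter values are illustrative assumptions.

```r
# Sketch: CCREM via lme4's crossed random effects, with MI of a level-1
# predictor aided by an auxiliary variable. Illustrative of the design only.
# (Pooling lmer fits requires the broom.mixed package to be installed.)
library(mice); library(lme4)

set.seed(4)
n <- 2000
school <- sample(1:30, n, TRUE); neighb <- sample(1:40, n, TRUE)
x   <- rnorm(n)                              # level-1 predictor (will have NAs)
aux <- 0.7 * x + rnorm(n, 0, 0.5)            # auxiliary variable correlated with x
y   <- 0.5 * x + rnorm(30)[school] + rnorm(40)[neighb] + rnorm(n)

x[runif(n) < plogis(-2 + aux)] <- NA         # MAR given the auxiliary variable

dat  <- data.frame(y, x, aux, school, neighb)
imp  <- mice(dat, m = 10, printFlag = FALSE) # aux enters the imputation model
fits <- with(imp, lmer(y ~ x + (1 | school) + (1 | neighb)))
summary(pool(fits))                          # pooled fixed effect of x (true 0.5)
```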

Nonparametric Bayesian Methods for Multiple Imputation of Large Scale Incomplete Categorical Data in Panel Studies

Si, Yajuan January 2012 (has links)
<p>The thesis develops nonparametric Bayesian models to handle incomplete categorical variables in data sets with high dimension using the framework of multiple imputation. It presents methods for ignorable missing data in cross-sectional studies, and potentially non-ignorable missing data in panel studies with refreshment samples.</p><p>The first contribution is a fully Bayesian, joint modeling approach of multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions. The approach automatically models complex dependencies while being computationally expedient. </p><p>I illustrate repeated sampling properties of the approach</p><p>using simulated data. This approach offers better performance than default chained equations methods, which are often used in such settings. I apply the methodology to impute missing background data in the 2007 Trends in International Mathematics and Science Study.</p><p>For the second contribution, I extend the nonparametric Bayesian imputation engine to consider a mix of potentially non-ignorable attrition and ignorable item nonresponse in multiple wave panel studies. Ignoring the attrition in models for panel data can result in biased inference if the reason for attrition is systematic and related to the missing values. Panel data alone cannot estimate the attrition effect without untestable assumptions about the missing data mechanism. Refreshment samples offer an extra data source that can be utilized to estimate the attrition effect while reducing reliance on strong assumptions of the missing data mechanism. </p><p>I consider two novel Bayesian approaches to handle the attrition and item non-response simultaneously under multiple imputation in a two wave panel with one refreshment sample when the variables involved are categorical and high dimensional. </p><p>First, I present a semi-parametric selection model that includes an additive non-ignorable attrition model with main effects of all variables, including demographic variables and outcome measures in wave 1 and wave 2. The survey variables are modeled jointly using Bayesian mixture of multinomial distributions. I develop the posterior computation algorithms for the semi-parametric selection model under different prior choices for the regression coefficients in the attrition model. </p><p>Second, I propose two Bayesian pattern mixture models for this scenario that use latent classes to model the dependency among the variables and the attrition. I develop a dependent Bayesian latent pattern mixture model for which variables are modeled via latent classes and attrition is treated as a covariate in the class allocation weights. And, I develop a joint Bayesian latent pattern mixture model, for which attrition and variables are modeled jointly via latent classes.</p><p>I show via simulation studies that the pattern mixture models can recover true parameter estimates, even when inferences based on the panel alone are biased from attrition. </p><p>I apply both the selection and pattern mixture models to data from the 2007-2008 Associated Press/Yahoo News election panel study.</p> / Dissertation

Nonparametric tests for interval-censored failure time data via multiple imputation

Huang, Jin-long 26 June 2008 (has links)
Interval-censored failure time data often occur in follow-up studies where subjects can only be observed periodically, so the failure time is known only to lie in an interval. In this thesis we consider the problem of comparing two or more interval-censored samples. We propose a multiple imputation method for discrete interval-censored data that imputes exact failure times from the interval-censored observations and then applies an existing test for exact data, such as the log-rank test, to the imputed data. The test statistic and covariance matrix are calculated by the proposed multiple imputation technique; the covariance matrix estimator is similar to the estimator used by Follmann, Proschan and Leifer (2003) for clustered data. Through simulation studies we find that the performance of the proposed log-rank-type test is comparable to that of the test proposed by Finkelstein (1986), and better than that of the two existing log-rank-type tests proposed by Sun (2001) and Zhao and Sun (2004), owing to differences in the method of multiple imputation and in the covariance matrix estimation. The proposed method is illustrated with an example involving patients with breast cancer. We also investigate applying the method to other two-sample comparison tests for exact data, such as Mantel's test (1967) and the integrated weighted difference test.
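
A rough R sketch of the core idea: draw an exact failure time from each censoring interval, apply the ordinary log-rank test (survival::survdiff) to each imputed data set, and average the resulting statistics. The uniform draw and the naive averaging are simplifying assumptions; the thesis imputes from an estimated discrete distribution and combines the statistics through its proposed covariance matrix estimator.

```r
# Sketch: multiple imputation of exact failure times from interval-censored
# data, then the standard log-rank test on each imputed data set. Uniform
# draws and simple averaging are simplifications of the thesis's procedure.
library(survival)

set.seed(6)
n <- 200
grp    <- rep(0:1, each = n / 2)
t_true <- rexp(n, rate = ifelse(grp == 1, 0.08, 0.12))
left   <- floor(t_true / 3) * 3             # inspections every 3 time units
right  <- left + 3                          # failure known only to lie in [left, right)

M <- 20
chisq_m <- replicate(M, {
  t_imp <- runif(n, left, right)            # impute an exact time in each interval
  survdiff(Surv(t_imp, rep(1, n)) ~ grp)$chisq
})
mean(chisq_m)                               # averaged log-rank statistic across
                                            # imputations; compare qchisq(.95, 1)
```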
