21 |
A Comparison of Last Observation Carried Forward and Multiple Imputation in a Longitudinal Clinical Trial
Carmack, Tara Lynn, 25 June 2012 (has links)
No description available.
|
22 |
The Effect of Item Parameter Uncertainty on Test Reliability
Bodine, Andrew James, 24 August 2012 (has links)
No description available.
|
23 |
Performance of Imputation Algorithms on Artificially Produced Missing at Random Data
Oketch, Tobias O, 01 May 2017 (has links)
Missing data are one of the challenges we face today in building valid statistical models. They reduce the representativeness of the data sample, so population estimates and model parameters estimated from such data are likely to be biased.
However, the missing data problem is an active area of research, and better statistical procedures have been proposed to mitigate its shortcomings. In this paper, we review the causes of missing data and various methods of handling it. Our main focus is evaluating multiple imputation (MI) methods from the multivariate imputation by chained equations (MICE) package in the statistical software R. We assess how these MI methods perform at different percentages of missing data. A multiple regression model is fit to each imputed data set and to the complete data set, and the regression coefficients from the imputed-data models are compared statistically with those from the complete data.
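A rough R sketch of the workflow described above, assuming the mice package: simulate a complete data set, create MAR missingness at several rates with mice::ampute(), impute with a few MICE methods, and compare the pooled regression coefficients with the complete-data fit. The data, model, and method subset are illustrative assumptions, not the thesis's actual setup.

```r
# Illustrative workflow: complete data -> MAR missingness -> MICE -> pooled fit
library(mice)

set.seed(2017)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
complete_dat <- data.frame(y, x1, x2)

benchmark <- coef(lm(y ~ x1 + x2, data = complete_dat))   # complete-data estimates

results <- list()
for (p in c(0.10, 0.30, 0.50)) {                  # proportions of missingness
  amp <- ampute(complete_dat, prop = p, mech = "MAR")$amp  # MAR missingness
  for (m in c("pmm", "norm", "cart")) {           # a subset of MICE methods
    imp <- mice(amp, m = 5, method = m, printFlag = FALSE, seed = 1)
    fit <- pool(with(imp, lm(y ~ x1 + x2)))       # fit and pool across imputations
    results[[paste(p, m)]] <- summary(fit)$estimate
  }
}

# Compare each element of `results` with `benchmark`.
```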
|
24 |
Performance Comparison of Imputation Algorithms on Missing at Random Data
Addo, Evans Dapaa, 01 May 2018 (has links)
Missing data continue to be an issue not only in statistics but in any field that deals with data, because almost all widely accepted, standard statistical software and methods assume complete data for every variable included in the analysis. As a result, in most studies statistical power is weakened and parameter estimates are biased, leading to weak conclusions and generalizations.
Many studies have established that multiple imputation methods are effective ways of handling missing data. This paper examines three imputation methods (predictive mean matching, Bayesian linear regression, and non-Bayesian linear regression) in the MICE package of the statistical software R, to ascertain which of the three yields parameter estimates closest to those of the complete data at different percentages of missingness. The parameter estimates from each imputed-data model are evaluated against those of the complete-data model. The paper extends the analysis by generating a pseudo data set from the original data to establish how the imputation methods perform under varying conditions.
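A minimal R sketch of the comparison described above, assuming the mice package: the three methods map to the built-in options "pmm" (predictive mean matching), "norm" (Bayesian linear regression), and "norm.nob" (non-Bayesian linear regression). The data frame, regression model, and missingness mechanism are illustrative, not the paper's actual data.

```r
library(mice)

compare_methods <- function(complete_dat, prop) {
  benchmark <- coef(lm(y ~ x1 + x2, data = complete_dat))    # complete-data fit
  amp <- ampute(complete_dat, prop = prop, mech = "MAR")$amp # MAR missingness
  sapply(c("pmm", "norm", "norm.nob"), function(m) {
    imp <- mice(amp, m = 5, method = m, printFlag = FALSE, seed = 2018)
    est <- summary(pool(with(imp, lm(y ~ x1 + x2))))$estimate
    est - benchmark            # deviation from the complete-data estimates
  })
}

# e.g. compare_methods(some_complete_data_frame, prop = 0.25)
```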
|
25 |
Comparison of Imputation Methods for Mixed Data Missing at Random
Heidt, Kaitlyn, 01 May 2019 (has links)
A statistician's job is to produce statistical models. When these models are precise and unbiased, we can apply them appropriately to new data. However, when data sets have missing values, the assumptions of statistical methods are violated and the results are biased. The statistician's objective is therefore to implement methods that remain unbiased and accurate. Research on missing data is gaining popularity as modern methods that produce unbiased and accurate results emerge, such as the MICE package in the statistical software R. Using real data, we compare four common imputation methods from the MICE package at different levels of missingness. The results are compared with the complete data set in terms of regression coefficients and adjusted R^2 values. The CART and PMM methods consistently performed better than the OTF and RF methods. The procedures were repeated on a second sample of real data, and the same conclusions were drawn.
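A hedged sketch of this kind of comparison for mixed data, assuming the mice package (plus rpart/randomForest for the "cart"/"rf" methods); only methods with standard mice names are shown, the "OTF" method is thesis-specific and not sketched, and the variable names are placeholders rather than the thesis's data.

```r
library(mice)

evaluate_method <- function(amp_dat, method) {
  imp  <- mice(amp_dat, m = 5, method = method, printFlag = FALSE, seed = 2019)
  fits <- with(imp, lm(y ~ x1 + x2 + grp))       # grp: a factor covariate
  list(coefs  = summary(pool(fits))$estimate,    # pooled coefficients
       adj_r2 = pool.r.squared(fits, adjusted = TRUE)[1, "est"])  # pooled adj. R^2
}

# e.g. lapply(c("pmm", "cart", "rf"), evaluate_method, amp_dat = my_incomplete_data),
# then compare with coef() and summary()$adj.r.squared of the complete-data lm().
```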
|
26 |
Multiple imputation in the presence of a detection limit, with applications: an empirical approach / Shawn Carl Liebenberg
Liebenberg, Shawn Carl, January 2014 (has links)
Scientists often encounter unobserved or missing measurements that are typically reported as less than a fixed detection limit. This occurs especially in the environmental sciences, where detection of low exposures is not possible due to limitations of the measuring instrument, and the resulting data are often referred to as type I and II left-censored data. Observations lying below this detection limit are therefore often ignored or `guessed', because they cannot be measured accurately. However, reliable estimates of the population parameters are nevertheless required to perform statistical analysis. The problem of dealing with values below a detection limit becomes increasingly complex when a large number of observations fall below this limit. Researchers are thus interested in developing statistically robust estimation procedures for dealing with left- or right-censored data sets (Singh and Nocerino, 2002).

This study focuses on several main components of the problems mentioned above. The imputation of censored data below a fixed detection limit is studied, particularly using the maximum likelihood procedure of Cohen (1959), and several variants thereof, in combination with four new variations of the multiple imputation concept found in the literature. Furthermore, the focus also falls strongly on estimating the density of the resulting imputed, `complete' data set by applying various kernel density estimators. It should be noted that bandwidth selection issues are not of importance in this study and will be left for further research.

In this study, the maximum likelihood estimation method of Cohen (1959) will be compared with several variant methods, to establish which of these maximum likelihood estimation procedures for censored data estimates the population parameters of the three chosen Lognormal distributions most reliably in terms of well-known discrepancy measures. These methods will be implemented in combination with four new multiple imputation procedures, respectively, to assess which of these nonparametric methods is most effective at imputing the 12 censored values below the detection limit with regard to the global discrepancy measures mentioned above. Several variations of the Parzen-Rosenblatt kernel density estimate will be fitted to the complete, filled-in data sets obtained from the previous methods, to establish which is the preferred data-driven method for estimating these densities.

The primary focus of the current study is therefore the performance of the four chosen multiple imputation methods, as well as the recommendation of methods and procedural combinations for dealing with data in the presence of a detection limit. An extensive Monte Carlo simulation study was performed to compare the various methods and procedural combinations. Conclusions and recommendations regarding the best of these methods and combinations are made based on the study's results. / MSc (Statistics), North-West University, Potchefstroom Campus, 2014
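The R sketch below illustrates one imputation cycle of the kind described above under simplifying assumptions: a single detection limit, a lognormal model fitted by maximising the censored likelihood (a stand-in for Cohen's (1959) procedure, not his exact estimator), imputation from the fitted distribution truncated at the detection limit, and a Parzen-Rosenblatt kernel density estimate of the completed sample. All data values are hypothetical.

```r
set.seed(2014)

impute_below_dl <- function(x, dl) {
  detected <- x[!is.na(x)]             # observed values (at or above the limit)
  n_cens   <- sum(is.na(x))            # number of non-detects

  # negative censored log-likelihood: n_cens values below dl, rest observed
  negloglik <- function(par) {
    mu <- par[1]; sigma <- exp(par[2])
    -(n_cens * pnorm(log(dl), mu, sigma, log.p = TRUE) +
        sum(dnorm(log(detected), mu, sigma, log = TRUE) - log(detected)))
  }
  est   <- optim(c(mean(log(detected)), log(sd(log(detected)))), negloglik)
  mu    <- est$par[1]
  sigma <- exp(est$par[2])

  # draw the imputations from the fitted lognormal truncated to (0, dl)
  u <- runif(n_cens, 0, plnorm(dl, mu, sigma))
  x[is.na(x)] <- qlnorm(u, mu, sigma)
  x
}

# Hypothetical example: repeating the call gives multiple imputations, and
# density() supplies the Parzen-Rosenblatt estimate of each completed sample.
dl   <- 0.5
xobs <- rlnorm(200)
xobs[xobs < dl] <- NA                  # censor values below the detection limit
dens <- density(impute_below_dl(xobs, dl), kernel = "gaussian")
```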
|
28 |
Statistical Approaches for Handling Missing Data in Cluster Randomized Trials
Fiero, Mallorie H., January 2016 (has links)
In cluster randomized trials (CRTs), groups of participants are randomized rather than individual participants. This design is often chosen to minimize treatment arm contamination or to enhance compliance among participants. In CRTs, we cannot assume independence among individuals within the same cluster because of their similarity, which leads to decreased statistical power compared to individually randomized trials. The intracluster correlation coefficient (ICC), which measures the proportion of total variance due to clustering, is crucial in the design and analysis of CRTs. Missing data are a common problem in CRTs and should be accommodated with appropriate statistical techniques, because they can compromise the advantages created by randomization and are a potential source of bias. In three papers, I investigate statistical approaches for handling missing data in CRTs.

In the first paper, I carry out a systematic review evaluating current practice in handling missing data in CRTs. The results show high rates of missing data in the majority of CRTs, yet handling of missing data remains suboptimal. Fourteen (16%) of the 86 reviewed trials reported carrying out a sensitivity analysis for missing data. Despite suggestions to weaken the missing data assumption of the primary analysis, only five of the trials weakened the assumption. None of the trials reported using missing not at random (MNAR) models.

Because of the low proportion of CRTs reporting an appropriate sensitivity analysis for missing data, the second paper aims to facilitate such analyses by extending the pattern mixture approach for missing clustered data under the MNAR assumption. I implement multilevel multiple imputation (MI) to account for the hierarchical structure found in CRTs, and multiply the imputed values by a sensitivity parameter, k, to examine parameters of interest under different missing data assumptions. The simulation results show that estimates of parameters of interest in CRTs can vary widely under different missing data assumptions.

A high proportion of missing data can occur in CRTs because data can be missing at the individual level as well as the cluster level. In the third paper, I use a simulation study to compare strategies for handling missing cluster-level covariates, including the linear mixed effects model, single imputation, single-level MI ignoring clustering, MI incorporating clusters as fixed effects, and MI at the cluster level using aggregated data. The results show that when the ICC is small (ICC ≤ 0.1) and the proportion of missing data is low (≤ 25%), the mixed model generates unbiased estimates of the regression coefficients and the ICC. When the ICC is higher (ICC > 0.1), MI at the cluster level using aggregated data performs well for missing cluster-level covariates, though caution should be taken if the percentage of missing data is high.
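A small R sketch of a pattern-mixture sensitivity analysis of the kind described above, assuming the mice, pan, lme4, and broom.mixed packages: the outcome is imputed with a multilevel method, each imputed value is multiplied by a sensitivity parameter k through mice's post-processing hook, and the ICC is read off a random-intercept model. The data set and variable names (y, x, cluster) are placeholders, not the dissertation's data.

```r
library(mice)
library(lme4)

ini  <- mice(dat, maxit = 0)          # dry run to get default settings
pred <- ini$predictorMatrix
meth <- ini$method
pred["y", "cluster"] <- -2            # -2 marks the cluster variable for 2l methods
meth["y"] <- "2l.pan"                 # multilevel imputation (needs the pan package)

k_values <- c(1, 0.8, 1.2)            # sensitivity multipliers for imputed values
pooled <- lapply(k_values, function(k) {
  post <- ini$post
  post["y"] <- paste("imp[[j]][, i] <- imp[[j]][, i] *", k)  # MNAR adjustment
  imp <- mice(dat, method = meth, predictorMatrix = pred, post = post,
              m = 5, printFlag = FALSE, seed = 2016)
  # pooling merMod fits requires the broom.mixed package
  pool(with(imp, lmer(y ~ x + (1 | cluster))))
})

# ICC from a random-intercept model on the observed data
vc  <- as.data.frame(VarCorr(lmer(y ~ x + (1 | cluster), data = dat)))
icc <- vc$vcov[1] / sum(vc$vcov)      # proportion of variance due to clustering
```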
|
29 |
Multiple Imputation for Missing Covariates in Contingent Valuation Surveys / 變數遺漏值的多重插補應用於條件評估法
Fei, Shih Yuan (費詩元), Unknown Date (has links)
In most studies of willingness to pay (WTP), missing data are treated as missing completely at random (MCAR) and simply deleted. However, when important variables in a study have a high proportion of missing values, this practice can bias the analysis.
Income often plays an important role in contingent valuation (CV) surveys, and it is also one of the variables that respondents are most likely to omit. In this study, we use simulation to evaluate the performance of multiple imputation (MI) for imputing missing income in WTP surveys. We consider three data situations: the complete cases remaining after deleting records with missing values, singly imputed data, and multiply imputed data, and we compare them through analyses based on a three-component mixture model. The simulation results show that MI outperforms complete-case analysis, and its advantage becomes more pronounced as the missing rate increases. We also find that MI is more reliable and stable than single imputation. We therefore regard MI as a trustworthy and well-performing approach when the missing data mechanism is not missing completely at random.
In addition, we conduct an empirical analysis using data from the Cardio Vascular Disease risk FACtor Two-township Study (CVDFACTS). We demonstrate some techniques for assessing the missing data mechanism, including comparing survival curves and fitting a logistic regression model. The empirical analysis shows that imputation does make a difference to the fitted models and the resulting estimates. / Most often, studies that focus on willingness to pay (WTP) simply ignore the missing values and treat them as if they were missing completely at random. It is well known that such a practice might cause serious bias and lead to incorrect results.
Income is one of the most influential variables in contingent valuation (CV) studies and is also the variable that respondents most often fail to report. In the present study, we evaluate the performance of multiple imputation (MI) on missing income in the analysis of WTP through a series of simulation experiments. Several approaches, such as complete-case analysis, single imputation, and MI, are considered and compared. We show that MI always performs better than complete-case analysis, especially when the missing rate gets high. We also show that MI is more stable and reliable than single imputation.
As an illustration, we use data from the Cardio Vascular Disease risk FACtor Two-township Study (CVDFACTS). We demonstrate how to assess the missing data mechanism by comparing survival curves and by fitting a logistic regression model. Based on the empirical study, we find that discarding cases with missing income can lead to results different from those obtained with multiple imputation. If the discarded cases are not missing completely at random, the remaining sample will be biased, which can be a serious problem in CV research. To conclude, MI is a useful method for dealing with missing value problems, and it is worth trying in CV studies.
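A brief R sketch of the missingness-mechanism check mentioned above: regress an indicator of missing income on observed covariates with a logistic model; covariates that predict missingness argue against MCAR. The variable names are hypothetical, not the actual CVDFACTS fields.

```r
# Indicator of missing income on a hypothetical data frame `dat`
dat$miss_income <- as.integer(is.na(dat$income))

mech_fit <- glm(miss_income ~ age + sex + education,
                family = binomial(), data = dat)
summary(mech_fit)   # significant covariates contradict the MCAR assumption

# If MCAR is doubtful, impute income (e.g. with the mice package) instead of
# dropping the incomplete cases before the willingness-to-pay analysis.
```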
|
30 |
Multiple Imputation Methods for Nonignorable Nonresponse, Adaptive Survey Design, and Dissemination of Synthetic Geographies
Paiva, Thais Viana, January 2014 (has links)
This thesis presents methods for multiple imputation that can be applied to missing data and data with confidential variables. Imputation is useful for missing data because it results in a data set that can be analyzed with complete data statistical methods. The missing data are filled in by values generated from a model fit to the observed data. The model specification will depend on the observed data pattern and the missing data mechanism. For example, when the reason why the data is missing is related to the outcome of interest, that is, nonignorable missingness, we need to alter the model fit to the observed data to generate the imputed values from a different distribution. Imputation is also used for generating synthetic values for data sets with disclosure restrictions. Since the synthetic values are not actual observations, they can be released for statistical analysis. The interest is in fitting a model that approximates well the relationships in the original data, keeping the utility of the synthetic data, while preserving the confidentiality of the original data. We consider applications of these methods to data from social sciences and epidemiology.

The first method is for imputation of multivariate continuous data with nonignorable missingness. Regular imputation methods have been used to deal with nonresponse in several types of survey data. However, in some of these studies, the assumption of missing at random is not valid since the probability of missing depends on the response variable. We propose an imputation method for multivariate data sets when there is nonignorable missingness. We fit a truncated Dirichlet process mixture of multivariate normals to the observed data under a Bayesian framework to provide flexibility. With the posterior samples from the mixture model, an analyst can alter the estimated distribution to obtain imputed data under different scenarios. To facilitate that, I developed an R application that allows the user to alter the values of the mixture parameters and visualize the imputation results automatically. I demonstrate this process of sensitivity analysis with an application to the Colombian Annual Manufacturing Survey. I also include a simulation study to show that the correct complete data distribution can be recovered if the true missing data mechanism is known, thus validating that the method can be meaningfully interpreted to do sensitivity analysis.

The second method uses the imputation techniques for nonignorable missingness to implement a procedure for adaptive design in surveys. Specifically, I develop a procedure that agencies can use to evaluate whether or not it is effective to stop data collection. This decision is based on utility measures to compare the data collected so far with potential follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios considered by the analyst. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.

The third method is for imputation of confidential data sets with spatial locations using disease mapping models. We consider data that include fine geographic information, such as census tract or street block identifiers. This type of data can be difficult to release as public use files, since fine geography provides information that ill-intentioned data users can use to identify individuals. We propose to release data with simulated geographies, so as to enable spatial analyses while reducing disclosure risks. We fit disease mapping models that predict areal-level counts from attributes in the file, and sample new locations based on the estimated models. I illustrate this approach using data on causes of death in North Carolina, including evaluations of the disclosure risks and analytic validity that can result from releasing synthetic geographies. / Dissertation
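As a toy illustration of the mixture machinery described above, the R sketch below draws values from a truncated stick-breaking mixture of multivariate normals, the kind of step used when generating imputations or synthetic records from posterior draws of the mixture parameters. The truncation level, weights, means, and covariances are placeholders, not posterior samples from the thesis's models.

```r
library(MASS)    # for mvrnorm()

set.seed(2014)
K <- 3                                  # truncation level of the DP mixture
p <- 2                                  # dimension of the data

# stick-breaking weights: v_k ~ Beta(1, alpha), w_k = v_k * prod_{l<k}(1 - v_l)
alpha <- 1
v <- rbeta(K, 1, alpha); v[K] <- 1      # close the stick at the truncation level
w <- v * cumprod(c(1, 1 - v[-K]))

# component means and covariances (placeholders for posterior draws)
mu    <- list(c(0, 0), c(3, 1), c(-2, 4))
Sigma <- list(diag(p), 0.5 * diag(p), matrix(c(1, 0.3, 0.3, 1), p))

draw_from_mixture <- function(n, w, mu, Sigma) {
  z <- sample(seq_along(w), n, replace = TRUE, prob = w)   # component labels
  t(sapply(z, function(k) mvrnorm(1, mu[[k]], Sigma[[k]])))
}

imputed <- draw_from_mixture(100, w, mu, Sigma)
# A sensitivity analysis alters the estimated distribution before drawing,
# e.g. shifting the component means:
shifted <- draw_from_mixture(100, w, lapply(mu, function(m) m + c(1, 0)), Sigma)
```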
|