81 |
Temporally-Embedded Deep Learning Model for Health Outcome Prediction / Boursalie, Omar / January 2021
Deep learning models are increasingly used to analyze health records to model disease progression. Two characteristics of health records present challenges to developers of deep learning-based medical systems. First, the veracity of estimates of missing health data must be evaluated to optimize the performance of deep learning models. Second, the most successful deep learning diagnostic models to date, called transformers, lack a mechanism to analyze the temporal characteristics of health records.
In this thesis, these two challenges are investigated using MIIDD (McMaster Imaging Information and Diagnostic Dataset), a real-world medical dataset of longitudinal health records from 340,143 patients over ten years. To address missing data, the performance of imputation models (mean, regression, and deep learning) was evaluated on this dataset. Next, techniques from adversarial machine learning were used to demonstrate how imputation can have a cascading negative impact on a deep learning model. Then, the strengths and limitations of evaluation metrics from the statistical literature (qualitative, predictive accuracy, and statistical distance) for assessing deep learning-based imputation models were investigated. This research can serve as a reference for researchers evaluating the impact of imputation on their deep learning models.
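Purely as a hedged illustration of this kind of comparison (not code or data from the thesis), the sketch below contrasts mean imputation with regression-based imputation on a synthetic two-variable dataset, scoring each by predictive accuracy (RMSE on the masked entries) and by a simple distributional check (variance ratio). All variable names, the data-generating model, and the 30% missing rate are assumptions made only for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 15, n)
bmi = 0.2 * age + rng.normal(25, 3, n)        # hypothetical variable correlated with age
X = np.column_stack([age, bmi])

mask = rng.random(n) < 0.30                   # mask 30% of bmi completely at random
X_miss = X.copy()
X_miss[mask, 1] = np.nan

imputers = {"mean": SimpleImputer(strategy="mean"),
            "regression": IterativeImputer(random_state=0)}
for name, imp in imputers.items():
    X_imp = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_imp[mask, 1] - X[mask, 1]) ** 2))
    var_ratio = X_imp[:, 1].var() / X[:, 1].var()
    print(f"{name:10s}  RMSE={rmse:.2f}  variance ratio={var_ratio:.2f}")
```

Predictive accuracy alone can flatter an imputation model that shrinks toward the mean, which is one reason distributional (statistical distance) metrics are examined alongside it.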
To analyze the temporal characteristics of health records, a new model called DTTHRE (Decoder Transformer for Temporally-Embedded Health Records Encoding) was developed and evaluated. DTTHRE predicts patients' primary diagnoses by analyzing their medical histories, including the elapsed time between visits. The proposed model successfully predicted patients' primary diagnosis in their final visit with improved predictive performance (78.54 ± 0.22%) compared to existing models in the literature. DTTHRE also increased the training examples available from limited medical datasets by predicting the primary diagnosis for each visit (79.53 ± 0.25%) with no additional training time. This research contributes towards the goal of predictive disease modeling for clinical decision support.
/ Dissertation / Doctor of Philosophy (PhD) /
In this thesis, two challenges in using deep learning models to analyze health records are investigated using a real-world medical dataset. First, an important step in analyzing health records is to estimate missing data. We investigated how imputation can have a cascading negative impact on a deep learning model's performance. A comparative analysis was then conducted to investigate the strengths and limitations of evaluation metrics from the statistical literature for assessing deep learning-based imputation models. Second, the most successful deep learning diagnostic models to date, called transformers, lack a mechanism to analyze the temporal characteristics of health records. To address this gap, we developed a new temporally-embedded transformer that analyzes patients' medical histories, including the elapsed time between visits, to predict their primary diagnoses. The proposed model successfully predicted patients' primary diagnosis in their final visit with improved predictive performance (78.54 ± 0.22%) compared to existing models in the literature.
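DTTHRE's actual architecture is specified in the thesis itself; the PyTorch sketch below is only a hedged illustration of the general idea of temporal embedding, in which a learned projection of the elapsed time between visits is added to the medical-code embeddings before causally masked (decoder-style) self-attention, and a diagnosis is predicted at every visit. All layer sizes, names, and toy inputs are invented for the example.

```python
import torch
import torch.nn as nn

class ToyTemporalTransformer(nn.Module):
    """Illustrative only: temporally embedded visit sequences, not the DTTHRE model."""
    def __init__(self, vocab_size, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.code_emb = nn.Embedding(vocab_size, d_model)   # embeds diagnosis/procedure codes
        self.time_proj = nn.Linear(1, d_model)              # embeds elapsed days between visits
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)          # per-visit diagnosis logits

    def forward(self, codes, elapsed_days):
        # codes: (batch, visits) integer codes; elapsed_days: (batch, visits) float days
        x = self.code_emb(codes) + self.time_proj(elapsed_days.unsqueeze(-1))
        L = codes.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)      # causal mask: each visit attends only to earlier visits
        return self.head(h)                  # a prediction for every visit, not just the last

model = ToyTemporalTransformer(vocab_size=500)
codes = torch.randint(0, 500, (8, 12))
days = torch.rand(8, 12) * 365
logits = model(codes, days)                  # shape (8, 12, 500)
```

Predicting at every position is what lets a single patient history contribute one training example per visit rather than one per sequence.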
|
82 |
As the World Turns Out: Economic Growth and Voter Turnout From a Global Perspective / Koch, Luther Allen / 11 June 2007
No description available.
|
83 |
Missing Data Treatments in Multilevel Latent Growth Model: A Monte Carlo Simulation Study / Jiang, Hui / 25 September 2014
No description available.
|
84 |
Multiple imputation in the presence of a detection limit, with applications: an empirical approach / Liebenberg, Shawn Carl / January 2014
Scientists often encounter unobserved or missing measurements that are typically reported as less than a fixed detection limit. This occurs especially in the environmental sciences, where detection of low exposures is not possible due to limitations of the measuring instrument, and the resulting data are often referred to as type I and type II left-censored data. Observations lying below this detection limit are therefore often ignored, or `guessed', because they cannot be measured accurately. However, reliable estimates of the population parameters are nevertheless required to perform statistical analysis. The problem of dealing with values below a detection limit becomes increasingly complex when a large number of observations lie below this limit. Researchers therefore have an interest in developing statistically robust estimation procedures for dealing with left- or right-censored data sets (Singh and Nocerino, 2002).
This study focuses on several main components related to the problems mentioned above. The imputation of censored data below a fixed detection limit is studied, particularly using the maximum likelihood procedure of Cohen (1959) and several variants thereof, in combination with four new variations of the multiple imputation concept found in the literature. Furthermore, the focus also falls strongly on estimating the density of the resulting imputed, `complete' data set by applying various kernel density estimators. It should be noted that bandwidth selection issues are not of importance in this study and are left for further research. The maximum likelihood estimation method of Cohen (1959) is compared with several variant methods to establish which of these maximum likelihood estimation procedures for censored data estimates the population parameters of the three chosen Lognormal distributions most reliably in terms of well-known discrepancy measures. These methods are implemented in combination with four new multiple imputation procedures, respectively, to assess which of these nonparametric methods is most effective at imputing the 12 censored values below the detection limit with regard to the global discrepancy measures mentioned above. Several variations of the Parzen-Rosenblatt kernel density estimate are fitted to the completed, filled-in data sets obtained from the previous methods to establish which is the preferred data-driven method for estimating these densities.
The primary focus of the current study is therefore the performance of the four chosen multiple imputation methods, as well as the recommendation of methods and procedural combinations to deal with data in the presence of a detection limit. An extensive Monte Carlo simulation study was performed to compare the various methods and procedural combinations. Conclusions and recommendations regarding the best of these methods and combinations are made based on the study's results. / MSc (Statistics), North-West University, Potchefstroom Campus, 2014
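The dissertation's own procedures are as described above; as a hedged, self-contained sketch of the general approach only, the example below fits a lognormal distribution by maximum likelihood to type I left-censored data (censored observations contribute through the CDF at the detection limit, in the spirit of Cohen's estimator), draws one naive single imputation of the censored values from the fitted, truncated distribution, and applies a Parzen-Rosenblatt kernel density estimate to the completed sample. The simulated data, detection limit, and starting values are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.lognormal(mean=1.0, sigma=0.5, size=200)   # true log-scale parameters: mu=1.0, sigma=0.5
dl = np.quantile(x, 0.25)                          # detection limit censoring about 25% of values
observed = x[x >= dl]
n_cens = int(np.sum(x < dl))

def neg_loglik(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                      # keeps sigma positive during optimization
    ll_obs = stats.norm.logpdf(np.log(observed), mu, sigma).sum() - np.log(observed).sum()
    ll_cen = n_cens * stats.norm.logcdf(np.log(dl), mu, sigma)   # P(X < detection limit)
    return -(ll_obs + ll_cen)

fit = minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])

# One naive single imputation: draw censored values from the fitted lognormal,
# truncated above at the detection limit, then estimate the completed sample's density.
u = rng.uniform(0, stats.norm.cdf(np.log(dl), mu_hat, sigma_hat), size=n_cens)
imputed = np.exp(stats.norm.ppf(u, mu_hat, sigma_hat))
completed = np.concatenate([observed, imputed])
kde = stats.gaussian_kde(completed)                # Parzen-Rosenblatt kernel density estimate
print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}, n_censored={n_cens}")
```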
|
86 |
Evaluation verschiedener Imputationsverfahren zur Aufbereitung großer Datenbestände am Beispiel der SrV-Studie von 2013 / Evaluation of Different Imputation Methods for Preparing Large Data Sets, Illustrated with the 2013 SrV Study / Meister, Romy / 09 March 2016
Missing values are a serious problem in surveys. The literature suggests replacing them with realistic values using imputation methods. This master's thesis examines four different imputation techniques with respect to their ability to handle missing data: mean imputation, conditional mean imputation, the Expectation-Maximization (EM) algorithm, and the Markov chain Monte Carlo method. The first three methods were additionally evaluated in a simulation based on a large real data set. To analyse the quality of these techniques, a metric variable of the original data set was chosen and missing values were generated for it, considering different percentages of missingness and the common missing data mechanisms. After the replacement of the simulated missing values, several statistical parameters, such as quantiles, the arithmetic mean, and the variance, were calculated for all completed data sets and compared with the parameters of the original data set. The results established by the empirical data analysis show that the Expectation-Maximization algorithm estimates all considered statistical parameters of the complete data set far better than the other analysed imputation methods, even though the assumption of a multivariate normal distribution could not be met. It is found that both mean imputation and conditional mean imputation produce acceptable estimates of the arithmetic mean under the assumption of missing completely at random, whereas other parameters, such as the variance, are not estimated as well. In general, the accuracy of all estimators from the three imputation methods decreases with an increasing percentage of missingness. The results lead to the conclusion that the Expectation-Maximization algorithm should be preferred over mean and conditional mean imputation.
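As a hedged, self-contained illustration of why these methods behave differently (not the thesis's code, data, or exact procedures), the sketch below compares mean imputation, conditional mean imputation, and a simple EM algorithm for a bivariate normal with one partially observed variable. Mean imputation recovers the arithmetic mean but shrinks the variance, while EM recovers both; every data-generating value is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
y1 = rng.normal(0, 1, n)
y2 = 0.8 * y1 + rng.normal(0, 0.6, n)        # true mean(y2) = 0, var(y2) = 1.0
miss = rng.random(n) < 0.4                   # 40% of y2 missing completely at random

# (1) Mean imputation: fill with the observed mean of y2.
y2_mean_imp = np.where(miss, y2[~miss].mean(), y2)

# (2) Conditional mean imputation: regress y2 on y1 using the complete cases.
b1, b0 = np.polyfit(y1[~miss], y2[~miss], 1)
y2_cond_imp = np.where(miss, b0 + b1 * y1, y2)

# (3) EM for a bivariate normal with y2 partially observed.
mu = np.array([y1.mean(), y2[~miss].mean()])
S = np.cov(y1[~miss], y2[~miss])
for _ in range(200):
    beta = S[0, 1] / S[0, 0]
    resid_var = S[1, 1] - S[0, 1] ** 2 / S[0, 0]
    y2_hat = np.where(miss, mu[1] + beta * (y1 - mu[0]), y2)   # E-step: E[y2 | y1]
    mu = np.array([y1.mean(), y2_hat.mean()])                  # M-step: update moments,
    s12 = np.mean((y1 - mu[0]) * (y2_hat - mu[1]))             # adding the conditional
    s22 = np.mean((y2_hat - mu[1]) ** 2) + miss.mean() * resid_var   # variance for missing rows
    S = np.array([[y1.var(), s12], [s12, s22]])

print("true       mean=%.3f var=%.3f" % (y2.mean(), y2.var()))
print("mean imp   mean=%.3f var=%.3f" % (y2_mean_imp.mean(), y2_mean_imp.var()))
print("cond mean  mean=%.3f var=%.3f" % (y2_cond_imp.mean(), y2_cond_imp.var()))
print("EM         mean=%.3f var=%.3f" % (mu[1], S[1, 1]))
```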
|
87 |
Statistical Approaches for Handling Missing Data in Cluster Randomized Trials / Fiero, Mallorie H. / January 2016
In cluster randomized trials (CRTs), groups of participants are randomized rather than individual participants. This design is often chosen to minimize treatment arm contamination or to enhance compliance among participants. In CRTs, we cannot assume independence among individuals within the same cluster because of their similarity, which leads to decreased statistical power compared to individually randomized trials. The intracluster correlation coefficient (ICC), which measures the proportion of total variance due to clustering, is crucial in the design and analysis of CRTs. Missing data are a common problem in CRTs and should be accommodated with appropriate statistical techniques, because missing data can compromise the advantages created by randomization and are a potential source of bias. In three papers, I investigate statistical approaches for handling missing data in CRTs.
In the first paper, I carry out a systematic review evaluating current practice for handling missing data in CRTs. The results show high rates of missing data in the majority of CRTs, yet handling of missing data remains suboptimal. Fourteen (16%) of the 86 reviewed trials reported carrying out a sensitivity analysis for missing data. Despite suggestions to weaken the missing data assumption from the primary analysis, only five of the trials weakened the assumption. None of the trials reported using missing not at random (MNAR) models.
Because so few CRTs reported an appropriate sensitivity analysis for missing data, the second paper aims to facilitate such analyses by extending the pattern mixture approach for missing clustered data under the MNAR assumption. I implement multilevel multiple imputation (MI) to account for the hierarchical structure found in CRTs, and multiply the imputed values by a sensitivity parameter, k, to examine parameters of interest under different missing data assumptions. The simulation results show that estimates of parameters of interest in CRTs can vary widely under different missing data assumptions.
A high proportion of missing data can occur in CRTs because missing data can arise at the individual level as well as the cluster level. In the third paper, I use a simulation study to compare missing data strategies for handling missing cluster-level covariates, including the linear mixed effects model, single imputation, single-level MI ignoring clustering, MI incorporating clusters as fixed effects, and MI at the cluster level using aggregated data. The results show that when the ICC is small (ICC ≤ 0.1) and the proportion of missing data is low (≤ 25%), the mixed model generates unbiased estimates of regression coefficients and the ICC. When the ICC is higher (ICC > 0.1), MI at the cluster level using aggregated data performs well for missing cluster-level covariates, though caution should be taken if the percentage of missing data is high.
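As a hedged sketch of the sensitivity-parameter idea only, and deliberately ignoring the cluster structure and multilevel imputation model for brevity, the example below multiply imputes a missing continuous outcome under MAR, multiplies the imputed values by a sensitivity parameter k, and pools the treatment-effect estimates with Rubin's rules. The data-generating values, number of imputations, and choice of k grid are all assumptions for the example, not settings from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M = 400, 20                                   # participants and number of imputations
arm = rng.integers(0, 2, n)                      # 0 = control, 1 = intervention
y = 1.0 * arm + rng.normal(0, 2, n)              # true treatment effect = 1.0
miss = rng.random(n) < 0.3                       # 30% of outcomes missing

def treatment_effect(y_full):
    d = y_full[arm == 1].mean() - y_full[arm == 0].mean()
    se2 = (y_full[arm == 1].var(ddof=1) / (arm == 1).sum()
           + y_full[arm == 0].var(ddof=1) / (arm == 0).sum())
    return d, se2

for k in (0.8, 1.0, 1.2):                        # k = 1 reproduces the MAR imputation
    ests, variances = [], []
    for _ in range(M):
        y_imp = y.copy()
        for a in (0, 1):                         # impute within arm from the observed data
            sel = miss & (arm == a)
            obs = y[~miss & (arm == a)]
            y_imp[sel] = k * rng.normal(obs.mean(), obs.std(ddof=1), sel.sum())
        d, se2 = treatment_effect(y_imp)
        ests.append(d)
        variances.append(se2)
    qbar = np.mean(ests)                                         # Rubin's rules pooling
    total_var = np.mean(variances) + (1 + 1 / M) * np.var(ests, ddof=1)
    print(f"k={k:.1f}  effect={qbar:.2f}  se={np.sqrt(total_var):.2f}")
```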
|
88 |
變數遺漏值的多重插補應用於條件評估法 / Multiple Imputation for Missing Covariates in Contingent Valuation Survey / Fei, Shih Yuan (費詩元) / Unknown Date
In most studies of willingness to pay (WTP), missing data are treated as missing completely at random (MCAR) and simply deleted. However, when important variables in a study have a high proportion of missing values, this practice can bias the analysis.
Income often plays an important role in contingent valuation (CV) surveys, and it is also one of the variables that respondents are most likely to omit. In this study, we evaluate through simulation the performance of multiple imputation for imputing missing income in WTP surveys. We consider three data scenarios: the complete cases remaining after deleting missing data, singly imputed data, and multiply imputed data, and we compare them through analyses based on a three-component mixture model. The simulation results show that multiple imputation outperforms complete-case analysis, and the advantage becomes more pronounced as the missing rate increases. We also find that multiple imputation is more reliable and stable than single imputation. Therefore, when the missing data mechanism is not missing completely at random, we consider multiple imputation a trustworthy and well-performing approach.
In addition, we conduct an empirical analysis using data from the Cardio Vascular Disease risk FACtor Two-township Study (CVDFACTS), a long-term cardiovascular disease follow-up study conducted in two townships (Zhudong and Puzi). We demonstrate techniques for assessing the missing data mechanism, including comparing survival curves and fitting a logistic regression model. Through the empirical analysis, we find that imputation does make a difference in the model analysis and estimation. / Most often, studies on willingness to pay (WTP) simply ignore missing values and treat them as if they were missing completely at random. It is well known that such a practice may cause serious bias and lead to incorrect results.
Income is one of the most influential variables in contingent valuation (CV) studies and is also the variable that respondents most often fail to report. In the present study, we evaluate the performance of multiple imputation (MI) for missing income in the analysis of WTP through a series of simulation experiments. Several approaches, including complete-case analysis, single imputation, and MI, are considered and compared. We show that MI always performs better than complete-case analysis, especially when the missing rate is high. We also show that MI is more stable and reliable than single imputation.
As an illustration, we use data from the Cardio Vascular Disease risk FACtor Two-township Study (CVDFACTS). We demonstrate how to assess the missing data mechanism by comparing survival curves and by fitting a logistic regression model. Based on the empirical study, we find that discarding cases with missing income can lead to conclusions that differ from those obtained with multiple imputation. If the discarded cases are not missing completely at random, the remaining sample will be biased, which can be a serious problem in CV research. To conclude, MI is a useful method for dealing with missing value problems and is worth trying in CV studies.
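The mechanism check described here can be illustrated with a short, hedged sketch on simulated data (not the CVDFACTS data and not the authors' code): regress an indicator of missing income on observed covariates, and a clear association argues against MCAR and in favour of imputation under MAR. The variable names and the missingness model below are assumptions for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
age = rng.normal(45, 12, n)
wtp = 100 + 2 * age + rng.normal(0, 30, n)                 # hypothetical willingness to pay
p_miss = 1 / (1 + np.exp(-(age - 45) / 10))                # missingness depends on age, so not MCAR
income_missing = (rng.random(n) < p_miss).astype(int)

# If the missingness indicator is predictable from observed covariates, MCAR is doubtful
# and a complete-case analysis may be biased; MI under MAR is the safer default.
X = sm.add_constant(np.column_stack([age, wtp]))
fit = sm.Logit(income_missing, X).fit(disp=0)
print(fit.summary())
```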
|
89 |
Multiple Imputation Methods for Nonignorable Nonresponse, Adaptive Survey Design, and Dissemination of Synthetic Geographies / Paiva, Thais Viana / January 2014
This thesis presents methods for multiple imputation that can be applied to missing data and data with confidential variables. Imputation is useful for missing data because it results in a data set that can be analyzed with complete data statistical methods. The missing data are filled in by values generated from a model fit to the observed data. The model specification will depend on the observed data pattern and the missing data mechanism. For example, when the reason why the data are missing is related to the outcome of interest, that is, nonignorable missingness, we need to alter the model fit to the observed data to generate the imputed values from a different distribution. Imputation is also used for generating synthetic values for data sets with disclosure restrictions. Since the synthetic values are not actual observations, they can be released for statistical analysis. The interest is in fitting a model that approximates well the relationships in the original data, keeping the utility of the synthetic data, while preserving the confidentiality of the original data. We consider applications of these methods to data from social sciences and epidemiology.
The first method is for imputation of multivariate continuous data with nonignorable missingness. Regular imputation methods have been used to deal with nonresponse in several types of survey data. However, in some of these studies, the assumption of missing at random is not valid since the probability of missing depends on the response variable. We propose an imputation method for multivariate data sets when there is nonignorable missingness. We fit a truncated Dirichlet process mixture of multivariate normals to the observed data under a Bayesian framework to provide flexibility. With the posterior samples from the mixture model, an analyst can alter the estimated distribution to obtain imputed data under different scenarios. To facilitate that, I developed an R application that allows the user to alter the values of the mixture parameters and visualize the imputation results automatically. I demonstrate this process of sensitivity analysis with an application to the Colombian Annual Manufacturing Survey. I also include a simulation study to show that the correct complete data distribution can be recovered if the true missing data mechanism is known, thus validating that the method can be meaningfully interpreted to do sensitivity analysis.
The second method uses the imputation techniques for nonignorable missingness to implement a procedure for adaptive design in surveys. Specifically, I develop a procedure that agencies can use to evaluate whether or not it is effective to stop data collection. This decision is based on utility measures to compare the data collected so far with potential follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios considered by the analyst. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.
The third method is for imputation of confidential data sets with spatial locations using disease mapping models. We consider data that include fine geographic information, such as census tract or street block identifiers. This type of data can be difficult to release as public use files, since fine geography provides information that ill-intentioned data users can use to identify individuals. We propose to release data with simulated geographies, so as to enable spatial analyses while reducing disclosure risks. We fit disease mapping models that predict areal-level counts from attributes in the file, and sample new locations based on the estimated models. I illustrate this approach using data on causes of death in North Carolina, including evaluations of the disclosure risks and analytic validity that can result from releasing synthetic geographies. / Dissertation
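The dissertation's approach relies on fitted disease mapping models; purely as a hedged toy sketch of the release step (with a stand-in matrix of predicted areal intensities instead of a fitted model, and invented sizes), the example below draws synthetic tract identifiers from an attribute-specific areal distribution and checks utility by comparing tract-level counts.

```python
import numpy as np

rng = np.random.default_rng(5)
n_tracts, n_records = 50, 2000

# Stand-in for model output: predicted relative intensity of each tract for two attribute groups.
pred_rate = np.abs(rng.normal(1.0, 0.3, size=(2, n_tracts)))
probs = pred_rate / pred_rate.sum(axis=1, keepdims=True)
group = rng.integers(0, 2, n_records)

# Original (confidential) tracts and released synthetic tracts, both drawn group by group.
true_tract = np.array([rng.choice(n_tracts, p=probs[g]) for g in group])
synthetic_tract = np.array([rng.choice(n_tracts, p=probs[g]) for g in group])

# A crude utility check: how well synthetic tract-level counts track the original counts.
orig_counts = np.bincount(true_tract, minlength=n_tracts)
syn_counts = np.bincount(synthetic_tract, minlength=n_tracts)
print("correlation of tract counts:", round(np.corrcoef(orig_counts, syn_counts)[0, 1], 3))
```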
|
90 |
Sensitivity Analysis in Handling Discrete Data Missing at Random in Hierarchical Linear Models via Multivariate Normality / Zheng, Xiyu / 01 January 2016
Abstract
In a two-level hierarchical linear model (HLM2), the outcome as well as the covariates may have missing values at any of the levels. One way to analyze all available data in the model is to estimate, by maximum likelihood (ML), a multivariate normal joint distribution of the variables subject to missingness (including the outcome), conditional on the completely observed covariates; draw multiple imputations (MI) of the missing values given the estimated joint model; and analyze the hierarchical model given the MI [1,2]. The assumption is that data are missing at random (MAR). While this method yields efficient estimation of the hierarchical model, it often estimates the model given discrete missing data that are handled under multivariate normality. In this thesis, we evaluate how robust the method is for estimating a hierarchical linear model given discrete missing data. We simulate incompletely observed data from a series of hierarchical linear models given discrete covariates MAR, estimate the models by the method, and assess the sensitivity of handling discrete missing data under the multivariate normal joint distribution by computing the bias, root mean squared error, standard error, and coverage probability of the estimated hierarchical linear models in a series of simulation studies. We aim to evaluate the performance of the method in handling binary covariates MAR: we let the missing patterns of the level-1 and level-2 binary covariates depend on completely observed variables and assess how the method handles binary missing data under different success probabilities and missing rates.
Based on the simulation results, the missing data analysis is robust under certain parameter settings. The efficient analysis performs very well for estimation of the level-1 fixed and random effects across varying success probabilities and missing rates. The level-2 binary covariate, however, is not well estimated under MAR when its missing rate is greater than 10%.
The rest of the thesis is organized as follows. Section 1 introduces background information, including conventional methods for hierarchical missing data analysis, the different missing data mechanisms, and the innovation and significance of this study. Section 2 explains the efficient missing data method. Section 3 presents the sensitivity analysis of the missing data method and explains how we carried out the simulation study using SAS, the software package HLM7, and R. Section 4 presents the results and useful recommendations for researchers who want to use the missing data method for binary covariates MAR in HLM2. Section 5 presents an illustrative analysis of the National Growth and Health Study (NGHS) using the missing data method. The thesis ends with a list of useful references that will guide future study, and the simulation code we used.
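The evaluation criteria named above (bias, root mean squared error, standard error, and coverage probability) can be summarized with a small helper function. The sketch below is a generic illustration, not the thesis's SAS, HLM7, or R code, and the replication results it is applied to are fabricated solely to show the calculation.

```python
import numpy as np

def simulation_summary(estimates, std_errors, true_value, z=1.96):
    """Bias, RMSE, mean standard error, and 95% coverage across simulation replications."""
    estimates = np.asarray(estimates)
    std_errors = np.asarray(std_errors)
    bias = estimates.mean() - true_value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    covered = ((estimates - z * std_errors <= true_value)
               & (true_value <= estimates + z * std_errors))
    return {"bias": bias, "rmse": rmse,
            "mean_se": std_errors.mean(), "coverage": covered.mean()}

# Fabricated replication results for a hypothetical level-1 fixed effect of 0.5:
rng = np.random.default_rng(6)
est = rng.normal(0.5, 0.1, 1000)
se = np.full(1000, 0.1)
print(simulation_summary(est, se, true_value=0.5))
```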
|