Improving the accuracy of statistics used in de-identification and model validation (via the concordance statistic) pertaining to time-to-event data

Time-to-event data are very common in medical research. Clinicians and patients need analyses of these data to be accurate, because the results are often used to interpret disease screening results, inform treatment decisions, and identify at-risk patient groups (e.g., by sex, race, or gene expression). This thesis tackles three statistical issues pertaining to time-to-event data.

The first issue arose from an Institute for Clinical and Evaluative Sciences lung cancer registry data set, which was de-identified by censoring patients at an earlier date. As a result, the observed times of censored patients were underestimated. Five methods were proposed to account for the underestimation introduced by de-identification. A simulation study was then conducted to compare how effectively each method reduced bias and mean squared error, and improved coverage probabilities, for four different Kaplan-Meier (KM) estimates. The simulation results demonstrated that situations with relatively large numbers of censored patients required methods with larger perturbation. In these scenarios, the fourth proposed method (which perturbed censored times such that they were censored in the final year of study) yielded estimates with the smallest bias and mean squared error and the largest coverage probability. In contrast, when there were smaller numbers of censored patients, any manipulation of the altered data set worsened the accuracy of the estimates.
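To make concrete why shifting censoring dates earlier can change a KM estimate, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator in plain Python. It is not the thesis's code and does not implement any of the five proposed methods; the function name `kaplan_meier` and the toy data are hypothetical and chosen only for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.

    times  : observed follow-up times (event or censoring time)
    events : 1 if the event was observed, 0 if the patient was censored
    Returns a list of (t, S(t)) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same observed time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Both events and censored patients leave the risk set after t.
        n_at_risk -= removed
    return curve


# Hypothetical toy data: the same five patients, with the two censored
# patients recorded at their true censoring times and at earlier times.
events = [1, 0, 1, 0, 1]
original = kaplan_meier([1, 2, 3, 4, 5], events)
shifted = kaplan_meier([1, 1.5, 3, 2.5, 5], events)
print(original)
print(shifted)
```

In this toy example, the patient censored at time 4 is moved to time 2.5 and so drops out of the risk set before the event at time 3, which lowers the estimated survival at that time. The example only illustrates the mechanism and says nothing about the size or direction of the effect in the registry data.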

The second issue arises when investigating model validation via the concordance (c) statistic. The c-statistic is intended for measuring the accuracy of statistical models that assess the risk associated with a binary outcome. Among pairs of patients in which exactly one experienced the event, it estimates the proportion of pairs in which the patient with the higher predicted risk was the one who experienced the event. The definition of a c-statistic cannot be uniquely extended to time-to-event outcomes, and so many proposals have been made. The second project developed a parametric c-statistic that assumes the true survival times are exponentially distributed, in order to invoke the memoryless property. A simulation study was conducted that compared this c-statistic with two other time-to-event c-statistics, so that three c-statistics were compared under three different definitions of concordance in the time-to-event setting. The c-statistic developed by the authors yielded the smallest bias when censoring was present in the data, even when the exponential parametric assumption did not hold, and therefore appears to be the most robust to censored data. Thus, it is recommended to use this c-statistic to validate prediction models applied to censored data.
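For readers unfamiliar with the binary-outcome c-statistic described above, the following minimal Python sketch computes it by direct pairwise comparison. It illustrates the standard binary definition only, not the parametric time-to-event c-statistic developed in the thesis. The function name `c_statistic` and the example values are hypothetical, and counting tied predicted risks as half-concordant is an assumed (though common) convention.

```python
def c_statistic(risks, outcomes):
    """Binary-outcome concordance (c) statistic by pairwise comparison.

    risks    : predicted risks, one per patient
    outcomes : 1 if the patient experienced the event, 0 otherwise
    Only pairs with different outcomes are usable. A pair is concordant
    when the patient who had the event has the higher predicted risk.
    Ties in predicted risk count as 0.5 (an assumed convention).
    """
    risks = list(risks)
    outcomes = list(outcomes)
    concordant = 0.0
    usable_pairs = 0
    for i in range(len(risks)):
        for j in range(i + 1, len(risks)):
            if outcomes[i] == outcomes[j]:
                continue  # same outcome: pair carries no information
            usable_pairs += 1
            if outcomes[i] == 1:
                risk_event, risk_non_event = risks[i], risks[j]
            else:
                risk_event, risk_non_event = risks[j], risks[i]
            if risk_event > risk_non_event:
                concordant += 1.0
            elif risk_event == risk_non_event:
                concordant += 0.5
    if usable_pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return concordant / usable_pairs


# Hypothetical example: four patients, two of whom had the event.
print(c_statistic([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```

With time-to-event outcomes, censoring means it is not always known which patient in a pair had the event first, which is why the definition does not extend uniquely and several competing c-statistics exist.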

The third project in this thesis developed and assessed the appropriateness of an empirical time-to-event c-statistic that is derived by estimating the survival times of censored patients via the EM algorithm. A simulation study was conducted for various sample sizes, censoring levels, and correlation rates. A non-parametric bootstrap was employed, and the mean and standard error of the bias of four different time-to-event c-statistics were compared, including the empirical EM c-statistic developed by the authors. The newly developed c-statistic yielded the smallest mean bias and standard error in all simulated scenarios, and therefore appears to be the most appropriate for estimating the concordance of a time-to-event model. Thus, it is recommended to use this c-statistic to validate prediction models applied to censored data.

Thesis / Doctor of Philosophy (PhD)

Identifier oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/25786
Date January 2020
Creators Caetano, Samantha-Jo
Contributors Pond, Gregory R., Mathematics and Statistics
Source Sets McMaster University
Language English
Detected Language English
Type Thesis