111

Nonparametric tests for interval-censored failure time data via multiple imputation

Huang, Jin-long 26 June 2008 (has links)
Interval-censored failure time data often occur in follow-up studies where subjects can be followed only periodically, so the failure time is known only to lie within an interval. In this paper we consider the problem of comparing two or more interval-censored samples. We propose a multiple imputation method for discrete interval-censored data that imputes exact failure times from the interval-censored observations and then applies an existing test for exact data, such as the log-rank test, to the imputed data. The test statistic and covariance matrix are calculated by our proposed multiple imputation technique; the covariance matrix estimator is similar to the estimator used by Follmann, Proschan and Leifer (2003) for clustered data. Through simulation studies we find that the performance of the proposed log-rank type test is comparable to that of the test proposed by Finkelstein (1986), and better than that of the two existing log-rank type tests proposed by Sun (2001) and Zhao and Sun (2004), owing to differences in the method of multiple imputation and in the covariance matrix estimation. The proposed method is illustrated by means of an example involving patients with breast cancer. We also investigate applying our method to other two-sample comparison tests for exact data, such as Mantel's test (1967) and the integrated weighted difference test.
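The sketch below illustrates the general impute-then-test idea in Python; it is not the thesis's own procedure. Exact times are drawn uniformly within each (L, R] interval (the thesis imputes from a discrete survival estimate) and the per-imputation log-rank statistics are simply averaged rather than pooled with the clustered-data covariance estimator described in the abstract. The toy intervals and the use of the lifelines package are assumptions for illustration.

```python
# Sketch only: a naive multiple-imputation log-rank comparison for
# interval-censored two-sample data. R = np.inf marks right censoring
# at the last follow-up time L.
import numpy as np
from lifelines.statistics import logrank_test  # assumed dependency

def impute_exact(L, R, rng):
    """Draw exact failure times uniformly within each (L, R] interval."""
    finite_R = np.where(np.isinf(R), L + 1.0, R)   # placeholder bound for censored rows
    draws = rng.uniform(L, finite_R)
    t = np.where(np.isinf(R), L, draws)            # censored subjects keep time L
    event = (~np.isinf(R)).astype(int)
    return t, event

rng = np.random.default_rng(1)
L1, R1 = np.array([2., 4., 1., 5.]), np.array([4., 7., np.inf, 9.])
L2, R2 = np.array([1., 3., 2., 6.]), np.array([3., 5., 4., np.inf])

stats = []
for _ in range(20):                                # M = 20 imputations
    t1, e1 = impute_exact(L1, R1, rng)
    t2, e2 = impute_exact(L2, R2, rng)
    res = logrank_test(t1, t2, event_observed_A=e1, event_observed_B=e2)
    stats.append(res.test_statistic)

# Naive summary; the thesis instead pools the statistics with a proper
# multiple-imputation covariance estimator.
print("mean log-rank statistic over imputations:", np.mean(stats))
```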
112

A Comparison of Multiple Imputation Methods for Missing Covariate Values in Recurrent Event Data

Huo, Zhao January 2015 (has links)
Multiple imputation (MI) is a commonly used approach for handling missing data. This thesis studies missing covariates in recurrent event data and discusses ways to include the survival outcomes in the imputation model. The MI methods under consideration combine the event indicator D with, respectively, the right-censored event time T, the logarithm of T, and the cumulative baseline hazard H0(T). After imputation, we proceed to the complete-data analysis. The Cox proportional hazards (PH) model and the PWP model are chosen as the analysis models, and the coefficient estimates are of substantive interest. A Monte Carlo simulation study is conducted to compare the different MI methods, with relative bias and mean square error used in the evaluation. In addition, an empirical study is conducted on cardiovascular disease event data containing missing values. Overall, the results show that MI based on the Nelson-Aalen estimate of H0(T) is preferred in most circumstances.
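A minimal sketch of the idea the abstract favours, letting the survival outcome enter the imputation model through the event indicator D and the Nelson-Aalen estimate of H0(T); the simulated data, the linear imputation model, and the single stochastic draw are illustrative assumptions, not the thesis's implementation.

```python
# Illustrative only: impute a missing covariate from the event indicator D
# and the Nelson-Aalen estimate of the cumulative hazard H0(T).
import numpy as np

rng = np.random.default_rng(0)

def nelson_aalen(time, event):
    """Nelson-Aalen cumulative hazard evaluated at each subject's own time."""
    order = np.argsort(time)
    at_risk = len(time) - np.arange(len(time))     # risk-set sizes at sorted times
    H_sorted = np.cumsum(event[order] / at_risk)   # sum of d_i / n_i (ties ignored)
    H = np.empty_like(H_sorted)
    H[order] = H_sorted
    return H

# Simulated data: covariate x is missing for roughly 30% of subjects.
n = 200
x = rng.normal(size=n)
time = rng.exponential(scale=np.exp(-0.5 * x))
event = rng.binomial(1, 0.7, size=n)
miss = rng.random(n) < 0.3
x_obs = np.where(miss, np.nan, x)

# Imputation model: x ~ D + H0(T), fitted on the observed rows.
H = nelson_aalen(time, event)
X_design = np.column_stack([np.ones(n), event, H])
obs = ~np.isnan(x_obs)
beta, *_ = np.linalg.lstsq(X_design[obs], x_obs[obs], rcond=None)
resid_sd = np.std(x_obs[obs] - X_design[obs] @ beta)

# One stochastic imputation; proper MI repeats this with parameter draws.
x_filled = x_obs.copy()
x_filled[miss] = X_design[miss] @ beta + rng.normal(scale=resid_sd, size=miss.sum())
```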
113

Estimating market values for non-publicly-traded U.S. life insurers

Zhao, Liyan 28 August 2008 (has links)
Not available / text
114

Multiple Imputation on Missing Values in Time Series Data

Oh, Sohae January 2015 (has links)
Financial stock market data frequently contain missing values for various reasons. One reason is that daily stock prices are not observed when markets close for holidays. This creates gaps in information, making it difficult to predict the following day's stock prices. In this situation, information during the holiday can be "borrowed" from other countries' stock markets, since global stock prices tend to show similar movements and are in fact highly correlated. The main goal of this study is to combine stock index data from markets around the world and to develop an algorithm that imputes the missing values in an individual stock index by sharing information between the different time series. To develop an imputation algorithm that accommodates time-series-specific features, we take a multiple imputation approach using a dynamic linear model for time-series and panel data. The algorithm assumes an ignorable missing data mechanism, such as missingness due to holidays. The posterior distributions of the parameters, including the missing values, are simulated using Markov chain Monte Carlo (MCMC) methods, and estimates from the sets of draws are then combined using Rubin's combination rule to yield the final inference for the data set. Specifically, we use the Gibbs sampler and Forward Filtering Backward Sampling (FFBS) to simulate the joint posterior distribution and the posterior predictive distribution of the latent variables and other parameters. A simulation study is conducted to check the validity and performance of the algorithm using two error-based measures: Root Mean Square Error (RMSE) and Normalized Root Mean Square Error (NRMSE). We compared the overall trend of the imputed time series with the complete data set and inspected the in-sample predictability of the algorithm using the Last Value Carried Forward (LVCF) method as a benchmark. The algorithm is applied to real stock price index data from the US, Japan, Hong Kong, the UK, and Germany. From both the simulation and the application, we conclude that the imputation algorithm performs well enough to achieve our original goal of predicting the opening price after a holiday, outperforming the benchmark method. We believe this multiple imputation algorithm can be used in many applications that deal with time series containing missing values, such as financial, economic, and biomedical data. / Thesis
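Because the per-imputation estimates are pooled with Rubin's combination rule, a short sketch of that rule may help; the per-imputation estimates and variances below are hypothetical.

```python
# Rubin's combination rule for M imputed analyses (hypothetical inputs).
import numpy as np

def rubin_combine(estimates, variances):
    """Pool point estimates and within-imputation variances across imputations."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()                 # pooled point estimate
    u_bar = variances.mean()                 # average within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    return q_bar, u_bar + (1 + 1 / m) * b    # Rubin's total variance

q, t = rubin_combine([1.02, 0.98, 1.05], [0.04, 0.05, 0.04])
print(f"pooled estimate {q:.3f}, total variance {t:.3f}")
```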
115

Partial least squares structural equation modelling with incomplete data : an investigation of the impact of imputation methods

Mohd Jamil, J. B. January 2012 (has links)
Despite considerable advances in missing data imputation methods over the last three decades, the problem of missing data remains largely unsolved. Many techniques have emerged in the literature as candidate solutions. These techniques can be categorised into two classes: statistical methods of data imputation and computational intelligence methods of data imputation. Because of the longstanding use of statistical methods for handling missing data problems, computational intelligence methods have been slow to gain wide attention, even though they offer comparable accuracy. The merits of both classes have been discussed at length in the literature, but only limited studies make a substantial comparison between them. This thesis contributes to knowledge by, firstly, conducting a comprehensive comparison of standard statistical methods of data imputation, namely mean substitution (MS), regression imputation (RI), expectation maximization (EM), tree imputation (TI) and multiple imputation (MI), on missing completely at random (MCAR) data sets. Secondly, the study compares the efficacy of these methods with a computational intelligence method of data imputation, namely a neural network (NN), on missing not at random (MNAR) data sets. The significance of the differences in performance between the methods is presented. Thirdly, a novel procedure for handling missing data is presented: a hybrid combination of each of these statistical methods with a NN, referred to here as the post-processing procedure, was adopted to approximate MNAR data sets. Simulation studies for each of these imputation approaches were conducted to assess the impact of missing values on partial least squares structural equation modelling (PLS-SEM), based on the estimated accuracy of both structural and measurement parameters. The best method for dealing with each missing data mechanism is identified. Several significant insights were deduced from the simulation results. For the MCAR problem, among the statistical methods of data imputation, MI performs better than the other methods for all percentages of missing data. Another unique contribution is found when comparing the results before and after the NN post-processing procedure. The improvement in accuracy may result from the neural network's ability to derive meaning from the imputed data set produced by the statistical methods. Based on these results, the NN post-processing procedure is able to help MS produce a significant improvement in the accuracy of the approximated values. This is a promising result, as MS is the weakest method in this study. It is also informative, as MS is often the default method available to users of PLS-SEM software.
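A hedged sketch of the kind of MCAR comparison described above, using scikit-learn stand-ins: SimpleImputer for mean substitution (MS) and IterativeImputer as a regression-style imputer. The simulated data, the 20% missingness rate, and the RMSE criterion are assumptions; the thesis's PLS-SEM evaluation and NN post-processing step are not reproduced here.

```python
# Stand-in comparison on MCAR data: mean substitution vs. an iterative
# regression imputer, judged by RMSE of the imputed values.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(42)
n, p = 500, 4
cov = 0.6 * np.ones((p, p)) + 0.4 * np.eye(p)          # correlated predictors
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

X_miss = X.copy()
X_miss[rng.random((n, p)) < 0.2] = np.nan              # 20% MCAR missingness

for name, imputer in [("mean substitution", SimpleImputer(strategy="mean")),
                      ("iterative regression", IterativeImputer(random_state=0))]:
    X_hat = imputer.fit_transform(X_miss)
    mask = np.isnan(X_miss)
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(f"{name}: RMSE of imputed values = {rmse:.3f}")
```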
116

Topics and Applications in Synthetic Data

Loong, Bronwyn 07 September 2012 (has links)
Releasing synthetic data in place of observed values is a method of statistical disclosure control for the public dissemination of survey data collected by national statistical agencies. The overall goal is to limit the risk of disclosing survey respondents' identities or sensitive attributes while retaining enough detail in the synthetic data to preserve the inferential conclusions drawn about the target population in potential future legitimate statistical analyses. This thesis presents three new research contributions in the analysis and application of synthetic data. Firstly, to understand differences in the types of input available to the imputer, typically an agency, and to the analyst, we present a definition of congeniality in the context of multiple imputation for synthetic data. Our definition is motivated by common examples of uncongeniality, specifically ignorance of the original survey design in the analysis of fully synthetic data, and situations in which the imputation model and the analysis procedure condition on different sets of records. We conclude that our definition provides a framework to help the imputer identify the source of a discrepancy between observed and synthetic data analytic results. Motivated by our definition, we derive an alternative approach to synthetic data inference that recovers the observed-data sampling distribution of sufficient statistics given the synthetic data. Secondly, we address the problem of negative method-of-moments variance estimates for fully synthetic data, which the current inferential methods can produce. We apply the adjustment for density maximization (ADM) method to variance estimation and demonstrate that ADM offers an alternative approach that yields positive variance estimates. Thirdly, we present a new application of synthetic data techniques to confidentialize survey data from a large-scale healthcare study; to date, applications of synthetic data techniques to healthcare survey data are rare. We discuss the identification of variables for synthesis, the specification of imputation models, and working measures of disclosure risk assessment. Following a comparison of observed and synthetic data analytic results based on published studies, we conclude that use of synthetic data for our healthcare survey is best suited to exploratory data analysis. / Statistics
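For context on the negative variance estimates mentioned above, the sketch below applies the standard fully synthetic data combining rules (Raghunathan, Reiter and Rubin, 2003), whose method-of-moments variance estimator (1 + 1/m)·b_m − ū_m can go negative; the per-dataset estimates are hypothetical and the ADM adjustment itself is not implemented here.

```python
# Combining rules for fully synthetic data (hypothetical per-dataset inputs);
# the method-of-moments variance estimate can be negative, as shown here.
import numpy as np

def fully_synthetic_combine(estimates, variances):
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()              # pooled point estimate
    u_bar = variances.mean()              # average within-dataset variance
    b = estimates.var(ddof=1)             # between-dataset variance
    T = (1 + 1 / m) * b - u_bar           # can fall below zero
    return q_bar, T

q, T = fully_synthetic_combine([0.51, 0.49, 0.50], [0.02, 0.02, 0.02])
print(f"pooled estimate {q:.3f}, variance estimate {T:.4f}")   # negative here
```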
117

Statistical methods for the analysis of genetic association studies

Su, Zhan January 2008 (has links)
One of the main biological goals of recent years is to determine the genes in the human genome that cause disease. Recent technological advances have made genome-wide association studies possible, and these have uncovered numerous genetic regions implicated in human diseases. The current approach to analysing data from these studies is based on testing association at single SNPs, but this is widely accepted as underpowered for detecting rare and poorly tagged variants. In this thesis we propose several novel approaches to analysing large-scale association data, which aim to improve upon the power offered by traditional approaches. We combine an established imputation framework with a sophisticated disease model that allows for multiple disease-causing mutations at a single locus. To evaluate our methods, we have developed a fast and realistic method to simulate association data conditional on population genetic data. The simulation results show that our methods remain powerful even if the causal variant is not well tagged, there are haplotypic effects or there is allelic heterogeneity. Our methods are further validated by analysis of the recent WTCCC genome-wide association data, where we have detected confirmed disease loci, known regions of allelic heterogeneity and new signals of association. One of our methods can also identify the high-risk haplotype backgrounds that harbour the disease alleles, and can therefore be used for fine-mapping. We believe that incorporating our methods into future association studies will help progress the understanding of genetic diseases.
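As a point of reference, the sketch below shows the standard single-SNP association test (a chi-square on the 2×3 case/control-by-genotype table) that the thesis argues is underpowered; the simulated genotypes, allele frequency, and risk model are illustrative assumptions, and the imputation-based multi-variant methods of the thesis are not shown.

```python
# Baseline single-SNP test: chi-square on the 2x3 case/control-by-genotype table.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
geno = rng.binomial(2, 0.3, size=2000)        # 0/1/2 copies of the risk allele
case = rng.binomial(1, 0.1 + 0.05 * geno)     # additive effect on disease risk

table = np.array([[np.sum((case == c) & (geno == g)) for g in (0, 1, 2)]
                  for c in (0, 1)])            # rows: control, case
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")
```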
118

Epidemiological Study of Coccidioidomycosis in Greater Tucson, Arizona

Tabor, Joseph Anthony January 2009 (has links)
The goal of this dissertation is to characterize the distribution and determinants of coccidioidomycosis in greater Tucson, Arizona, using landscape ecology and complex survey methods to control for environmental factors that affect Coccidioides exposure. Notifiable coccidioidomycosis cases reported to the health department in Arizona have increased dramatically since 1997 and indicate a potential epidemic of unknown cause. Epidemic determination is confounded by concurrent changes in notifiable-disease reporting compliance, misdiagnosis, and the changing demographics of susceptible populations. A stratified, two-stage, address-based telephone survey of greater Tucson, Arizona, was conducted in 2002 and 2003. Subjects were recruited from direct marketing data by census block groups and landscape strata as determined using a geographic information system (GIS), and were interviewed about potential risk factors. Address-level state health department notifiable-disease surveillance data were compared with self-reported survey data to estimate the true disease frequency. Comparing state surveillance data with the survey data, no coccidioidomycosis epidemic was detectable from 1992 to 2006 after adjusting the surveillance data for reporting compliance; state health department surveillance captured only 20% of the probable reportable cases in 2001. Utilizing survey data and geographic coding, spatial and temporal disease frequency was found to be highly variable at the census block-group scale, indicating that localized soil disturbance events are a major group-level risk factor. Poststratification by 2000 census demographic data adjusted for selection bias into the survey and for response rate. Being Hispanic showed a similar odds ratio for self-reported coccidioidomycosis diagnosis as being of non-Hispanic White race-ethnicity when other risk factors were controlled for. Cigarette smoking in the home and having a home located in the low Hispanic foothills or low Hispanic riparian strata were associated with elevated odds ratios for coccidioidomycosis. Sample stratification by landscape and demographics controlled for differential classification of susceptibility and exposure between strata. Clustered, address-based telephone surveys provide a feasible and valid method to recruit populations from address-based lists, using a GIS to design the survey and population survey statistical methods for the analysis. Notifiable coccidioidomycosis case surveillance can be improved by including reporting compliance in the analysis. Pathogen exposure and host susceptibility are important, predictable group-level determinants of coccidioidomycosis that were controlled for by stratified sampling using a landscape ecology approach.
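A small illustration of the poststratification step mentioned above, in which survey weights are rescaled so that weighted demographic totals match census counts; the strata and counts are hypothetical and do not come from the study.

```python
# Poststratification weights: scale sample counts up to census totals.
census_counts = {"hispanic": 300_000, "non_hispanic_white": 500_000, "other": 200_000}
sample_counts = {"hispanic": 150, "non_hispanic_white": 400, "other": 100}

weights = {g: census_counts[g] / sample_counts[g] for g in census_counts}
for group, w in weights.items():
    print(f"{group}: each respondent represents about {w:,.0f} people")
```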
119

Imputing Genotypes Using Regularized Generalized Linear Regression Models

Griesman, Joshua 14 June 2012 (has links)
As genomic sequencing technologies continue to advance, researchers are furthering their understanding of the relationships between genetic variants and expressed traits (Hirschhorn and Daly, 2005). However, missing data can significantly limit the power of a genetic study. Here, the use of a regularized generalized linear model, denoted GLMNET, is proposed to impute missing genotypes. The method aims to address certain limitations of earlier regression approaches to genotype imputation, particularly multicollinearity among predictors. The performance of the GLMNET-based method is compared with that of the phase-based method fastPHASE. Two simulation settings were evaluated: a sparse-missing model and a small-panel expansion model. The sparse-missing model simulated a scenario in which SNPs were missing in a random fashion across the genome. The small-panel expansion model simulated a set of test individuals genotyped at only a small subset of the SNPs of the large panel. Each imputation method was tested on two data sets: Canadian Holstein cattle data and human HapMap CEU data. Although the proposed method was able to perform with high accuracy (>90% in all simulations), fastPHASE performed with higher accuracy (>94%). However, the new method, which was coded in R, was able to impute genotypes with better time efficiency than fastPHASE, and this could be further improved by optimizing in a compiled language.
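A hedged sketch of the regularized-regression idea: a missing SNP genotype is predicted from neighbouring SNPs under an elastic-net penalty, with scikit-learn's LogisticRegression standing in for the R glmnet implementation used in the thesis; the simulated genotypes are illustrative, not the Holstein or HapMap panels.

```python
# Elastic-net genotype imputation with a scikit-learn stand-in for glmnet.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 1000, 20
X = rng.binomial(1, 0.4, size=(n, p)) + rng.binomial(1, 0.4, size=(n, p))  # 0/1/2 codes

# Target SNP in linkage disequilibrium with the first two predictor SNPs.
target = (0.7 * X[:, 0] + 0.3 * X[:, 1]
          + rng.normal(0, 0.3, n)).round().clip(0, 2).astype(int)

miss = rng.random(n) < 0.1                      # treat 10% of genotypes as missing
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000)
model.fit(X[~miss], target[~miss])
imputed = model.predict(X[miss])
print("imputation accuracy:", np.mean(imputed == target[miss]))
```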
120

A Genetic Characterization of the Hays Converter

Fleming, Allison 03 April 2013 (has links)
This thesis gives a genetic overview of the Hays Converter, a beef breed developed in Canada in the 1950s. Pedigree records were examined to determine genetic diversity and inbreeding; a positive rate of inbreeding and a decrease in genetic diversity were found. Single-trait and bivariate animal models were used to estimate genetic parameters and trends for growth, ultrasound, and carcass traits. An increasing genetic trend was found for the growth traits for which the breed was selected. The accuracy of imputation from 6k to 50k marker panels using a reference group of 100 animals was determined. Imputation was performed with high accuracy (>0.93) for pure Hays Converter animals, but was found to be unsuccessful when individuals had large contributions from additional breeds. This work forms the foundation for future management and advancement of the breed while outlining its history and progress. / Daniel P. Hays
