41

A Comparison of Multiple Imputation Methods for Missing Covariate Values in Recurrent Event Data

Huo, Zhao January 2015 (has links)
Multiple imputation (MI) is a commonly used approach to impute missing data. This thesis studies missing covariates in recurrent event data and discusses ways to include the survival outcomes in the imputation model. The MI methods under consideration combine the event indicator D with, respectively, the right-censored event times T, the logarithm of T, and the cumulative baseline hazard H0(T). After imputation, we can then proceed to the complete-data analysis. The Cox proportional hazards (PH) model and the PWP model are chosen as the analysis models, and the coefficient estimates are of substantive interest. A Monte Carlo simulation study is conducted to compare the different MI methods; relative bias and mean squared error are used in the evaluation. Furthermore, an empirical study is conducted on cardiovascular disease event data containing missing values. Overall, the results show that MI based on the Nelson-Aalen estimate of H0(T) is preferred in most circumstances.
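As a concrete picture of the approach this abstract describes — putting the event indicator D and the Nelson-Aalen estimate of H0(T) into the covariate imputation model — here is a minimal sketch. The normal linear imputation model, the toy data, and all names are illustrative assumptions, not the thesis's code; in a full analysis each completed dataset would be fitted with a Cox PH or PWP model and the estimates pooled with Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(0)

def nelson_aalen(time, event):
    """Nelson-Aalen estimate of the cumulative baseline hazard H0,
    evaluated at each subject's own observed time (no ties assumed)."""
    order = np.argsort(time)
    d_sorted = event[order]
    n = len(time)
    at_risk = n - np.arange(n)              # risk-set size at each ordered time
    H_sorted = np.cumsum(d_sorted / at_risk)
    H = np.empty(n)
    H[order] = H_sorted
    return H

def impute_once(x, miss, D, H0):
    """One proper imputation of the missing covariate values from a
    Bayesian normal linear model x ~ 1 + D + H0(T) (flat priors)."""
    Z = np.column_stack([np.ones(len(x)), D, H0])
    obs = ~miss
    beta_hat, *_ = np.linalg.lstsq(Z[obs], x[obs], rcond=None)
    resid = x[obs] - Z[obs] @ beta_hat
    dof = obs.sum() - Z.shape[1]
    sigma2 = resid @ resid / rng.chisquare(dof)        # draw sigma^2 | data
    cov = sigma2 * np.linalg.inv(Z[obs].T @ Z[obs])
    beta = rng.multivariate_normal(beta_hat, cov)      # draw beta | sigma^2
    x_imp = x.copy()
    x_imp[miss] = Z[miss] @ beta + rng.normal(0.0, np.sqrt(sigma2), miss.sum())
    return x_imp

# Toy survival-style data: times T, event indicator D, covariate x
# with 30% of its values missing completely at random.
n = 200
x = rng.normal(size=n)
T = rng.exponential(scale=np.exp(-0.5 * x))
D = (rng.uniform(size=n) < 0.7).astype(float)
miss = rng.uniform(size=n) < 0.3
H0 = nelson_aalen(T, D)

# m = 5 completed datasets; each would then be analysed (e.g. Cox PH / PWP)
# and the coefficient estimates combined with Rubin's rules.
imputations = [impute_once(x, miss, D, H0) for _ in range(5)]
```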
42

Estimating market values for non-publicly-traded U.S. life insurers

Zhao, Liyan 28 August 2008 (has links)
Not available / text
43

Multiple Imputation on Missing Values in Time Series Data

Oh, Sohae January 2015 (has links)
Financial stock market data frequently contain missing values, for various reasons. One reason is that, because markets close for holidays, daily stock prices are not always observed. This creates gaps in information, making it difficult to predict the following day's stock prices. In this situation, information during the holiday can be "borrowed" from other countries' stock markets, since global stock prices tend to show similar movements and are in fact highly correlated. The main goal of this study is to combine stock index data from various markets around the world and develop an algorithm that imputes the missing values in an individual stock index through "information-sharing" between the different time series. To develop an imputation algorithm that accommodates time-series-specific features, we take a multiple imputation approach using a dynamic linear model for time-series and panel data. The algorithm assumes an ignorable missing-data mechanism, which is plausible for missingness due to holidays. The posterior distribution of the parameters, including the missing values, is simulated using Markov chain Monte Carlo (MCMC) methods, and estimates from the sets of draws are then combined using Rubin's combination rule, rendering the final inference for the data set. Specifically, we use the Gibbs sampler and Forward Filtering and Backward Sampling (FFBS) to simulate the joint posterior distribution and the posterior predictive distribution of the latent variables and other parameters. A simulation study is conducted to check the validity and performance of the algorithm using two error-based measurements: Root Mean Square Error (RMSE) and Normalized Root Mean Square Error (NRMSE). We compared the overall trend of the imputed time series with the complete data set, and inspected the in-sample predictability of the algorithm using the Last Value Carried Forward (LVCF) method as a benchmark. The algorithm is applied to real stock price index data from the US, Japan, Hong Kong, the UK, and Germany. From both the simulation and the application, we conclude that the imputation algorithm performs well enough to achieve our original goal of predicting the opening price after a holiday, outperforming the benchmark method. We believe this multiple imputation algorithm can be used in many applications that deal with time series with missing values, such as financial, economic, and biomedical data. / Thesis
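A minimal single-draw FFBS sketch for a local-level dynamic linear model with holiday-style gaps. For clarity the observation and evolution variances V and W are held fixed here, whereas the study samples them (and the cross-market structure) inside a Gibbs sampler; the model choice, values, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def ffbs_impute(y, V=0.1, W=0.05, m0=0.0, C0=1.0):
    """One FFBS draw of the latent level for the local-level DLM
    y_t = mu_t + v_t,  mu_t = mu_{t-1} + w_t,
    followed by one posterior-predictive draw for each missing y_t."""
    n = len(y)
    m = np.empty(n); C = np.empty(n)   # filtered means / variances
    a = np.empty(n); R = np.empty(n)   # one-step-ahead means / variances
    prev_m, prev_C = m0, C0
    # Forward filtering (skip the Kalman update where y_t is missing).
    for t in range(n):
        a[t] = prev_m
        R[t] = prev_C + W
        if np.isnan(y[t]):
            m[t], C[t] = a[t], R[t]
        else:
            K = R[t] / (R[t] + V)                 # Kalman gain
            m[t] = a[t] + K * (y[t] - a[t])
            C[t] = (1 - K) * R[t]
        prev_m, prev_C = m[t], C[t]
    # Backward sampling of mu_n, ..., mu_1.
    mu = np.empty(n)
    mu[-1] = rng.normal(m[-1], np.sqrt(C[-1]))
    for t in range(n - 2, -1, -1):
        B = C[t] / R[t + 1]
        h = m[t] + B * (mu[t + 1] - a[t + 1])
        H = C[t] - B * B * R[t + 1]
        mu[t] = rng.normal(h, np.sqrt(max(H, 0.0)))
    # Impute missing observations from y_t | mu_t ~ N(mu_t, V).
    y_imp = y.copy()
    gaps = np.isnan(y)
    y_imp[gaps] = rng.normal(mu[gaps], np.sqrt(V))
    return y_imp

# Toy index series with "holiday" gaps; repeated draws give multiple imputations.
y = np.cumsum(rng.normal(0, 0.2, 100)) + rng.normal(0, 0.3, 100)
y[[20, 21, 50]] = np.nan
imputations = [ffbs_impute(y) for _ in range(5)]
```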
44

Topics and Applications in Synthetic Data

Loong, Bronwyn 07 September 2012 (has links)
Releasing synthetic data in place of observed values is a method of statistical disclosure control for the public dissemination of survey data collected by national statistical agencies. The overall goal is to limit the risk of disclosing survey respondents' identities or sensitive attributes, while retaining enough detail in the synthetic data to preserve the inferential conclusions drawn about the target population in legitimate future statistical analyses. This thesis presents three new research contributions to the analysis and application of synthetic data. Firstly, to understand differences in the information available to the imputer (typically an agency) and the analyst, we present a definition of congeniality in the context of multiple imputation for synthetic data. Our definition is motivated by common examples of uncongeniality, specifically ignorance of the original survey design in the analysis of fully synthetic data, and situations where the imputation model and the analysis procedure condition on different sets of records. We conclude that our definition provides a framework to help the imputer identify the source of a discrepancy between observed-data and synthetic-data analytic results. Motivated by our definition, we derive an alternative approach to synthetic data inference that recovers the observed-data sampling distribution of sufficient statistics given the synthetic data. Secondly, we address the problem of negative method-of-moments variance estimates for fully synthetic data, which the current inferential methods can produce. We apply the adjustment for density maximization (ADM) method to variance estimation, and demonstrate that ADM offers an alternative approach that produces positive variance estimates. Thirdly, we present a new application of synthetic data techniques to confidentialize survey data from a large-scale healthcare study; to date, applications of synthetic data techniques to healthcare survey data are rare. We discuss the identification of variables for synthesis, the specification of imputation models, and working measures of disclosure risk assessment. After comparing observed and synthetic data analytic results based on published studies, we conclude that, for our healthcare survey, synthetic data are best suited to exploratory data analysis. / Statistics
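For context on the negative-variance issue raised in the second contribution, here is a sketch of the standard Raghunathan-Reiter-Rubin combining rules for fully synthetic data; the numbers are made up purely to show the method-of-moments variance estimate going negative, which is what ADM is brought in to fix.

```python
import numpy as np

def combine_fully_synthetic(estimates, variances):
    """Combining rules for m fully synthetic datasets: returns the pooled
    point estimate and the method-of-moments total variance T_f, which
    (unlike Rubin's rule for ordinary MI) can come out negative."""
    q = np.asarray(estimates)
    v = np.asarray(variances)
    m = len(q)
    q_bar = q.mean()
    b_m = q.var(ddof=1)                 # between-synthesis variance
    v_bar = v.mean()                    # average within-synthesis variance
    T_f = (1 + 1 / m) * b_m - v_bar     # can be negative when b_m is small
    return q_bar, T_f

# Example: little between-synthesis spread relative to within-variance
# drives T_f below zero.
q_bar, T_f = combine_fully_synthetic([1.02, 0.98, 1.01], [0.05, 0.06, 0.055])
print(q_bar, T_f)   # T_f < 0 here
```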
45

Epidemiological Study of Coccidioidomycosis in Greater Tucson, Arizona

Tabor, Joseph Anthony January 2009 (has links)
The goal of this dissertation is to characterize the distribution and determinants of coccidioidomycosis in greater Tucson, Arizona, using landscape ecology and complex survey methods to control for environmental factors that affect Coccidioides exposure. Notifiable coccidioidomycosis cases reported to the health department in Arizona have increased dramatically since 1997 and indicate a potential epidemic of unknown causes. Epidemic determination is confounded by concurrent changes in notifiable-disease reporting compliance, misdiagnosis, and the changing demographics of susceptible populations. A stratified, two-stage, address-based telephone survey of greater Tucson, Arizona, was conducted in 2002 and 2003. Subjects were recruited from direct-marketing data by census block groups and landscape strata determined using a geographic information system (GIS), and were interviewed about potential risk factors. Address-level state health department notifiable-disease surveillance data were compared with self-reported survey data to estimate the true disease frequency. Comparing state surveillance data with the survey data, no coccidioidomycosis epidemic was detectable from 1992 to 2006 after adjusting the surveillance data for reporting compliance; state health department surveillance captured only 20% of the probable reportable cases in 2001. Utilizing survey data and geographic coding, spatial and temporal disease frequency was found to be highly variable at the census block-group scale, indicating that localized soil-disturbance events are a major group-level risk factor. Poststratification by 2000 census demographic data adjusted for selection bias into the survey and for response rate. Hispanics showed an odds ratio of self-reported coccidioidomycosis diagnosis similar to that of non-Hispanic Whites when other risk factors were controlled for. Cigarette smoking in the home, and having a home located in the low-Hispanic foothills and low-Hispanic riparian strata, were associated with elevated odds ratios for coccidioidomycosis. Sample stratification by landscape and demographics controlled for differential classification of susceptibility and exposures between strata. Clustered, address-based telephone surveys provide a feasible and valid method to recruit populations from address-based lists, using a GIS for survey design and population-survey statistical methods for the analysis. Notifiable coccidioidomycosis case surveillance can be improved by including reporting compliance in the analysis. Pathogen exposure and host susceptibility are important, predictable group-level determinants of coccidioidomycosis that were controlled for by stratified sampling using a landscape ecology approach.
46

Some Recent Advances in Non- and Semiparametric Bayesian Modeling with Copulas, Mixtures, and Latent Variables

Murray, Jared January 2013 (has links)
This thesis develops flexible non- and semiparametric Bayesian models for mixed continuous, ordered and unordered categorical data. These methods have a range of possible applications; those considered in this thesis are drawn primarily from the social sciences, where multivariate, heterogeneous datasets with complex dependence and missing observations are the norm.

The first contribution is an extension of the Gaussian factor model to Gaussian copula factor models, which accommodate continuous and ordinal data with unspecified marginal distributions. I describe how this model is the most natural extension of the Gaussian factor model, preserving its essential dependence structure and the interpretability of the factor loadings and latent variables. I adopt an approximate likelihood for posterior inference and prove that, if the Gaussian copula model is true, the approximate posterior distribution of the copula correlation matrix asymptotically converges to the correct parameter under nearly any marginal distributions. I demonstrate with simulations that this method is both robust and efficient, and illustrate its use in an application from political science.

The second contribution is a novel nonparametric hierarchical mixture model for continuous, ordered and unordered categorical data. The model includes a hierarchical prior used to couple the component indices of two separate models, which are also linked by local multivariate regressions. This structure effectively overcomes the limitations of existing mixture models for mixed data, namely their overly strong local independence assumptions. In the proposed model local independence is replaced by local conditional independence, so that the induced model can more readily adapt to structure in the data. I demonstrate the utility of this model as a default engine for multiple imputation of mixed data in a large repeated-sampling study using data from the Survey of Income and Program Participation. I show that it improves substantially on its most popular competitor, multiple imputation by chained equations (MICE), while enjoying certain theoretical properties that MICE lacks.

The third contribution is a latent variable model for density regression. Most existing density regression models are quite flexible but somewhat cumbersome to specify and fit, particularly when the regressors are a combination of continuous and categorical variables. The majority of these methods rely on extensions of infinite discrete mixture models that incorporate covariate dependence in the mixture weights, the atoms, or both. I take a fundamentally different approach, introducing a continuous latent variable which depends on covariates through a parametric regression; in turn, the observed response depends on the latent variable through an unknown function. I demonstrate that a spline prior for the unknown function is quite effective relative to Dirichlet process mixture models in density estimation settings (i.e., without covariates), even though Dirichlet process mixtures have better asymptotic theoretical properties. The spline formulation also enjoys a number of computational advantages over more flexible priors on functions. Finally, I demonstrate the utility of this model in regression applications using a dataset on U.S. wages from the Census Bureau, where I estimate the return to schooling as a smooth function of the quantile index. / Dissertation
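A rough, moment-based sketch of the Gaussian copula idea in the first contribution: map each margin to normal scores, estimate the copula correlation, and read off factor loadings by eigendecomposition. This is only a crude stand-in for the thesis's approximate-likelihood posterior inference (ties in ordinal margins are broken arbitrarily here), and all data and names are assumed for illustration.

```python
import numpy as np
from scipy.stats import norm

def normal_scores(x):
    """Map a margin to approximate Gaussian latent values via ranks
    (a crude stand-in for rank-based likelihood methods)."""
    ranks = np.argsort(np.argsort(x)) + 1.0
    return norm.ppf(ranks / (len(x) + 1.0))

def copula_factor_loadings(X, k=1):
    """Estimate the copula correlation matrix from normal scores and
    extract k factor loadings by eigendecomposition (principal factors)."""
    Z = np.column_stack([normal_scores(X[:, j]) for j in range(X.shape[1])])
    R = np.corrcoef(Z, rowvar=False)            # copula correlation estimate
    vals, vecs = np.linalg.eigh(R)
    top = np.argsort(vals)[::-1][:k]
    return vecs[:, top] * np.sqrt(vals[top])    # loadings for the top k factors

# Toy mixed data: one continuous, one ordinal, one binary margin,
# all driven by a single latent factor f.
rng = np.random.default_rng(2)
f = rng.normal(size=500)
X = np.column_stack([
    f + rng.normal(0, 0.5, 500),
    np.digitize(f + rng.normal(0, 0.8, 500), [-1, 0, 1]),  # ordinal margin
    np.digitize(f + rng.normal(0, 0.8, 500), [0]),         # binary margin
])
print(copula_factor_loadings(X, k=1))
```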
47

Comparative approaches to handling missing data, with particular focus on multiple imputation for both cross-sectional and longitudinal models.

Hassan, Ali Satty Ali. January 2012 (has links)
Much data-based research is characterized by the unavoidable problem of incompleteness as a result of missing or erroneous values. This thesis discusses some of the strategies and basic issues in statistical data analysis for addressing the missing data problem, and deals with both missing covariates and missing outcomes. We restrict our attention to methodologies that address a specific missing data pattern, namely monotone missingness. The thesis is divided into two parts. The first part places particular emphasis on the so-called missing at random (MAR) assumption, but focuses the bulk of its attention on multiple imputation techniques. The main aim of this part is to investigate various modelling techniques through application studies, to identify the most appropriate techniques, and to gain insight into their suitability for incomplete data analysis. The thesis first deals with the problem of missing covariate values when estimating regression parameters under a monotone missing covariate pattern. The study is devoted to a comparison of different imputation techniques, namely Markov chain Monte Carlo (MCMC), regression, propensity score (PS) and last observation carried forward (LOCF). The results from the application study indicated which methods are preferable for dealing with monotone missing covariates: of the methods explored, the MCMC and regression imputation methods were preferable to the PS and LOCF methods for estimating regression parameters. The study is also concerned with a comparative analysis of techniques applied to incomplete Gaussian longitudinal outcome (response) data subject to random dropout. Three different methods are assessed and investigated, namely multiple imputation (MI), inverse probability weighting (IPW) and direct likelihood analysis. The findings in general favoured MI over IPW in the case of continuous outcomes, even when the MAR mechanism holds. The findings further suggest that the use of MI and direct likelihood techniques leads to accurate and equivalent results, as both techniques arrive at the same substantive conclusions. The study also compares and contrasts several statistical methods for analyzing incomplete non-Gaussian longitudinal outcomes when the underlying study is subject to ignorable dropout. The methods considered include weighted generalized estimating equations (WGEE), multiple imputation after generalized estimating equations (MI-GEE) and the generalized linear mixed model (GLMM). The current study found that the MI-GEE method was considerably robust, doing better than all the other methods for both small and large sample sizes, regardless of the dropout rates. The primary interest of the second part of the thesis falls under the non-ignorable dropout (MNAR) modelling frameworks that rely on sensitivity analysis in modelling incomplete Gaussian longitudinal data. The aim of this part is to deal with non-random dropout by explicitly modelling the assumed dropout process, incorporating this additional sub-model into the model for the measurement data, and assessing the sensitivity of the modelling assumptions. The study pays attention to the analysis of repeated Gaussian measures subject to potentially non-random dropout, in order to study the influence the dropout process might exert on inference from the data.
We consider the construction of a particular type of selection model, namely the Diggle-Kenward model, as a tool for assessing the sensitivity of a selection model to its modelling assumptions. The major conclusion drawn was that there was evidence in favour of an MAR process rather than an MCAR process in the context of the assumed model. In addition, further insight into the data was obtained by comparing various sensitivity analysis frameworks. Lastly, two families of models were compared and contrasted to investigate the potential influence that dropout might exert on the dependent measurement data, and to deal with incomplete sequences. The models were based on the selection and pattern-mixture frameworks used for sensitivity analysis to jointly model the distribution of the dropout process and the longitudinal measurement process. The results of the sensitivity analyses were in agreement and hence led to similar parameter estimates. Additional confidence in the findings was gained as both models led to similar results for significant effects, such as marginal treatment effects. / Thesis (M.Sc.)-University of KwaZulu-Natal, Pietermaritzburg, 2012.
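As a concrete picture of the multiple imputation step compared throughout the first part, here is a minimal chained-equations sketch using scikit-learn's IterativeImputer (a MICE-style engine) on a toy monotone-dropout pattern. The data, dropout mechanism, and settings are illustrative assumptions, not the thesis's setup.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)

# Toy longitudinal-style data with a monotone (dropout) missingness pattern:
# once a subject drops out, every later visit is missing.
n, t = 300, 4
latent = rng.normal(size=(n, 1))
Y = latent + rng.normal(0, 1, (n, t))
dropout = rng.integers(2, t + 1, size=n)   # first missing visit (t + 1 = completer)
for i in range(n):
    Y[i, dropout[i]:] = np.nan

# Chained-equations MI: with sample_posterior=True each fit draws imputed
# values from a posterior predictive, so repeated runs yield m completed
# datasets that can be analysed and pooled.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(Y)
    for s in range(5)
]
```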
48

Survival analysis for breast cancer

Liu, Yongcai 21 September 2010 (has links)
This research carries out a survival analysis for patients with breast cancer. The influence of clinical and pathologic features, as well as molecular markers, on survival time is investigated. Special attention focuses on whether the molecular markers can provide additional information to help predict clinical outcome and guide therapies for breast cancer patients. Three outcomes, breast cancer specific survival (BCSS), local relapse survival (LRS) and distant relapse survival (DRS), are examined using two datasets: the large dataset with missing values in markers (n=1575) and the small (complete) dataset consisting of patient records without any missing values (n=910). Results show that some molecular markers, such as YB1, could join ER, PR and HER2 in being integrated into clinical practice for cancer. Further clinical research is needed to establish the importance of CK56. The 10-year survival probability at the mean of all the covariates (clinical variables and markers) for BCSS, LRS, and DRS is 77%, 91%, and 72% respectively. Because a large portion of the dataset is missing, a sophisticated multiple imputation method is needed to estimate the missing values so that an unbiased and more reliable analysis can be achieved. In this study, three multiple imputation (MI) methods, data augmentation (DA), multivariate imputation by chained equations (MICE) and AREG, are employed and compared. Results show that AREG is the preferred MI approach. The reliability of the MI results is demonstrated using various techniques. This work will hopefully shed light on the determination of appropriate MI methods for other similar research situations.
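Whatever the imputation engine (DA, MICE, or AREG), the per-dataset estimates — here, say, Cox log-hazard ratios — are pooled with Rubin's rules before inference. A minimal sketch with made-up numbers:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Rubin's rules: pool a coefficient (e.g. a Cox log-hazard ratio)
    estimated on each of m multiply imputed datasets.
    Assumes the between-imputation variance b is positive."""
    q = np.asarray(estimates)
    u = np.asarray(variances)
    m = len(q)
    q_bar = q.mean()                        # pooled point estimate
    u_bar = u.mean()                        # within-imputation variance
    b = q.var(ddof=1)                       # between-imputation variance
    T = u_bar + (1 + 1 / m) * b             # total variance
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2   # Rubin's df
    return q_bar, np.sqrt(T), df

# e.g. log-hazard ratios and their squared standard errors from Cox fits
# on five completed datasets (numbers purely illustrative).
est, se, df = rubin_pool([0.42, 0.45, 0.39, 0.44, 0.41],
                         [0.012, 0.011, 0.013, 0.012, 0.012])
print(est, se, df)
```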
49

Multiple imputation for marginal and mixed models in longitudinal data with informative missingness

Deng, Wei, January 2005 (has links)
Thesis (Ph. D.)--Ohio State University, 2005. / Title from first page of PDF file. Document formatted into pages; contains xiii, 108 p.; also includes graphics. Includes bibliographical references (p. 104-108). Available online via OhioLINK's ETD Center.
50

A Monte Carlo study of the impact of missing data in cross-classification random effects models

Alemdar, Meltem. January 2008 (has links)
Thesis (Ph. D.)--Georgia State University, 2008. / Title from title page (Digital Archive@GSU, viewed July 20, 2010). Carolyn F. Furlow, committee chair; Philo A. Hutcheson, Phillip E. Gagne, Sheryl A. Gowen, committee members. Includes bibliographical references (p. 96-100).
