31

Quantifying Power and Bias in Cluster Randomized Trials Using Mixed Models vs. Cluster-Level Analysis in the Presence of Missing Data: A Simulation Study

Vincent, Brenda January 2016 (has links)
In cluster randomized trials (CRTs), groups rather than individuals are randomized to treatment arms, while the outcome is assessed on the individuals within each cluster. Individuals within a cluster tend to be more similar to one another than individuals in a randomly selected sample; if this dependence is ignored, standard errors may be underestimated. To adjust for the correlation between individuals within clusters, two main approaches are used to analyze CRTs: cluster-level and individual-level analysis. In a cluster-level analysis, summary measures are obtained for each cluster and the two sets of cluster-specific measures are then compared, for example with a t-test of the cluster means. A mixed model that takes cluster membership into account is an example of an individual-level analysis. We used a simulation study to quantify and compare the power and bias of these two methods, further taking into account the effect of missing data. Complete datasets were generated and data were then deleted to simulate missing completely at random (MCAR) and missing at random (MAR) data. A balanced design, with two treatment groups and two time points, was assumed. Cluster size, variance components (within-subject, within-cluster and between-cluster variance) and the proportion of missingness were varied to simulate common scenarios seen in practice. For each combination of parameters, 1,000 datasets were generated and analyzed. Results of our simulation study indicate that cluster-level analysis resulted in a substantial loss of power when data were MAR. Individual-level analysis had higher power and remained unbiased, even with a small number of clusters.
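A minimal sketch of the kind of comparison this abstract describes, simplified to a single time point: generate clustered two-arm data, delete outcomes under a MAR mechanism tied to an observed baseline covariate, then analyze one replicate both with a t-test of cluster means and with a random-intercept mixed model. All parameter values, variable names, and the use of statsmodels/scipy are illustrative assumptions, not the thesis code.

```python
# Illustrative sketch (not the thesis code): one simulated CRT replicate with MAR
# dropout, analyzed two ways -- a t-test of cluster means vs. a linear mixed model.
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_clusters, cluster_size, effect = 20, 30, 0.4     # assumed parameters

rows = []
for c in range(n_clusters):
    arm = c % 2                                    # balanced: half the clusters per arm
    u_c = rng.normal(0, 0.5)                       # between-cluster random effect
    for i in range(cluster_size):
        baseline = rng.normal(0, 1)
        y = effect * arm + u_c + baseline + rng.normal(0, 1)
        rows.append({"cluster": c, "arm": arm, "baseline": baseline, "y": y})
df = pd.DataFrame(rows)

# MAR deletion: probability of a missing outcome depends on the observed baseline.
p_miss = 1 / (1 + np.exp(-(df["baseline"] - 1)))
observed = df[rng.uniform(size=len(df)) > p_miss]

# Cluster-level analysis: t-test on cluster means.
means = observed.groupby(["cluster", "arm"])["y"].mean().reset_index()
t, p_cluster = ttest_ind(means.loc[means.arm == 1, "y"],
                         means.loc[means.arm == 0, "y"])

# Individual-level analysis: random-intercept mixed model.
fit = smf.mixedlm("y ~ arm + baseline", observed, groups=observed["cluster"]).fit()
print(f"cluster-level p = {p_cluster:.3f}; mixed-model arm estimate = {fit.params['arm']:.3f}")
```

Repeating this over many generated datasets and mechanisms, as the abstract describes, is what yields the empirical power and bias comparison.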
32

Predicting HIV Status Using Neural Networks and Demographic Factors

Tim, Taryn Nicole Ho 15 February 2007 (has links)
Student Number: 0006036T - MSc(Eng) project report - School of Electrical and Information Engineering - Faculty of Engineering and the Built Environment / Demographic and medical history information obtained from annual South African antenatal surveys is used to estimate the risk of acquiring HIV. The estimation system consists of a classifier: a neural network trained to perform binary classification, using supervised learning with the survey data. The survey information contains discrete variables such as age, gravidity and parity, as well as the qualitative variables race and location, which make up the input to the neural network. HIV status is the output. A multilayer perceptron with a logistic function is trained with a cross-entropy error function, providing a probabilistic interpretation of the output. Predictive and classification performance is measured, and sensitivity and specificity are illustrated on the Receiver Operating Characteristic (ROC) curve. An auto-associative neural network is trained on complete datasets and, when presented with partial data, global optimisation methods are used to approximate the missing entries. The effect of the imputed data on the network prediction is investigated.
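A hedged sketch of the classifier component only, on synthetic stand-in data (the antenatal survey data are not reproduced here): a small multilayer perceptron with logistic units trained for binary classification and evaluated with an ROC curve. Feature codings, prevalence, and all parameters are invented for illustration; the auto-associative imputation stage described in the abstract is not shown.

```python
# Illustrative sketch on synthetic stand-in data: an MLP for binary HIV-status
# classification evaluated with an ROC curve, roughly mirroring the pipeline described.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
n = 2000
# Stand-ins for age, gravidity, parity, and coded race/location.
X = np.column_stack([
    rng.integers(15, 45, n),          # age
    rng.integers(0, 8, n),            # gravidity
    rng.integers(0, 8, n),            # parity
    rng.integers(0, 4, n),            # race (coded)
    rng.integers(0, 9, n),            # location (coded)
]).astype(float)
logit = -4 + 0.08 * X[:, 0] + 0.2 * X[:, 1] - 0.1 * X[:, 2]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
# Logistic hidden units; scikit-learn trains classifiers with log-loss (cross entropy).
clf = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, probs)              # 1 - specificity vs. sensitivity
print("AUC:", round(roc_auc_score(y_te, probs), 3))
```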
33

Computational intelligence techniques for missing data imputation

Nelwamondo, Fulufhelo Vincent 14 August 2008 (has links)
Despite considerable advances in missing data imputation techniques over the last three decades, the problem of missing data remains largely unsolved. Many techniques have emerged in the literature as candidate solutions, including Expectation Maximisation (EM) and the combination of autoassociative neural networks and genetic algorithms (NN-GA). The merits of both these techniques have been discussed at length in the literature, but they have never been compared to each other. This thesis contributes to knowledge by, firstly, conducting a comparative study of these two techniques. The significance of the difference in performance of the methods is presented. Secondly, predictive analysis methods suitable for the missing data problem are presented. The predictive analysis in this problem is aimed at determining whether the data in question are predictable and hence at helping to choose the estimation techniques accordingly. Thirdly, a novel treatment of missing data for online condition monitoring problems is presented. An ensemble of three autoencoders together with hybrid genetic algorithms (GA) and fast simulated annealing was used to approximate missing data. Several significant insights were deduced from the simulation results. It was deduced that, for the problem of missing data using computational intelligence approaches, the choice of optimisation method plays a significant role in prediction. Although it was observed that hybrid GA and Fast Simulated Annealing (FSA) can converge to the same search space and to almost the same values, they differ significantly in duration. This unique contribution demonstrates that particular attention has to be paid to the choice of optimisation techniques and their decision boundaries. Another unique contribution of this work was not only to demonstrate that dynamic programming is applicable to the problem of missing data, but also to show that it is efficient in addressing it. An NN-GA model was built to impute missing data using the principle of dynamic programming. This approach makes it possible to modularise the problem of missing data for maximum efficiency. With the advancements in parallel computing, various modules of the problem could be solved by different processors working together in parallel. Furthermore, a method is proposed for imputing missing data in non-stationary time series data that learns incrementally even when there is concept drift. This method works by measuring heteroskedasticity to detect concept drift and explores an online learning technique. New directions for research, in which missing data can be estimated for non-stationary applications, are opened by the introduction of this novel method. Thus, this thesis has uniquely opened the doors of research to this area. Many other methods need to be developed so that they can be compared to the unique existing approach proposed in this thesis. Another novel technique for dealing with missing data for the online condition monitoring problem was also presented and studied. The problem of classifying in the presence of missing data was addressed, where no attempt is made to recover the missing values. The problem domain was then extended to regression. The proposed technique performs better than the NN-GA approach, both in accuracy and in time efficiency during testing. The advantage of the proposed technique is that it eliminates the need for finding the best estimate of the data and hence saves time.
Lastly, instead of using complicated techniques to estimate missing values, an imputation approach based on rough sets is explored. Empirical results obtained using both real and synthetic data are given, and they provide valuable and promising insight into the problem of missing data. The work has significantly confirmed that rough sets can be reliable for missing data estimation in larger, real databases.
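A small sketch of the core autoassociative idea from this abstract: train a network to reproduce its inputs on complete records, then, for a record with gaps, search for the missing values that minimize the network's reconstruction error over the known structure. Here scipy's differential evolution stands in for the GA / fast-simulated-annealing hybrid, and an MLPRegressor plays the autoencoder; the data and all parameters are illustrative assumptions.

```python
# Illustrative sketch of autoassociative imputation: optimize the missing entries of a
# record so that the trained network's reconstruction error is minimized. Differential
# evolution is used here as a stand-in for the hybrid GA/FSA search described.
import numpy as np
from sklearn.neural_network import MLPRegressor
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
n, d = 500, 4
Z = rng.normal(size=(n, 2))
X = np.column_stack([Z[:, 0], Z[:, 0] + 0.1 * rng.normal(size=n),
                     Z[:, 1], Z[:, 1] + 0.1 * rng.normal(size=n)])  # correlated features

# Autoassociative network: inputs and targets are the same records.
auto = MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000, random_state=0).fit(X, X)

record = X[0].copy()
missing = np.array([1, 3])                       # indices of the unknown entries
true_vals = record[missing].copy()

def reconstruction_error(vals):
    trial = record.copy()
    trial[missing] = vals
    return float(np.sum((auto.predict(trial.reshape(1, -1))[0] - trial) ** 2))

bounds = [(X[:, j].min(), X[:, j].max()) for j in missing]
result = differential_evolution(reconstruction_error, bounds, seed=0)
print("imputed:", np.round(result.x, 3), " true:", np.round(true_vals, 3))
```

The thesis's observation about optimisers applies directly here: swapping the search routine changes runtime far more than it changes the recovered values, once both converge.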
34

Missing imputation methods explored in big data analytics

Brydon, Humphrey Charles January 2018 (has links)
Philosophiae Doctor - PhD (Statistics and Population Studies) / The aim of this study is to look at the methods and processes involved in imputing missing data and, more specifically, complete missing blocks of data. A further aim is to look at the effect that the imputed data have on the accuracy of various predictive models constructed on those data, and hence to determine whether the imputation method involved is suitable. Identifying the missingness mechanism present in the data should be the first step towards identifying a possible imputation method. The identification of a suitable imputation method is easier if the mechanism can be identified as one of the following: missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR). Predictive models constructed on the completed data sets are shown to be less accurate when the data sets employed a hot-deck imputation method. Data sets that employed either a single or multiple Markov chain Monte Carlo (MCMC) or Fully Conditional Specification (FCS) imputation method are shown to result in predictive models that are more accurate. The addition of an iterative bagging technique in the modelling procedure is shown to produce highly accurate prediction estimates. The bagging technique is applied to variants of the neural network, a decision tree and a multiple linear regression (MLR) modelling procedure. A stochastic gradient boosted decision tree (SGBT) is also constructed as a comparison to the bagged decision tree. Final models are constructed from 200 iterations of the various modelling procedures using a 60% sampling ratio in the bagging procedure. It is further shown that the addition of the bagging technique in the MLR modelling procedure can, under certain conditions, produce an MLR model that is more accurate than the other, more advanced modelling procedures. The evaluation of predictive models constructed on imputed data is shown to vary based on the type of fit statistic used. The average squared error reports little difference in accuracy levels compared with the Mean Absolute Prediction Error (MAPE); the MAPE fit statistic is able to magnify the differences in the prediction errors reported. The Normalized Mean Bias Error (NMBE) results show that all predictive models constructed produced estimates that were over-predictions, although these varied depending on the data set and modelling procedure used. The Nash-Sutcliffe efficiency (NSE) was used as a comparison statistic for the accuracy of the predictive models in the context of imputed data. The NSE statistic showed that the estimates of the models constructed on imputed data sets employing a multiple imputation method were highly accurate, whereas the estimates from the predictive models constructed on the hot-deck imputed data were inaccurate and a mean substitution from the fully observed data would have been a better method of imputation. The conclusion reached in this study is that the choice of imputation method, as well as that of the predictive model, depends on the data used. Four unique combinations of imputation methods and modelling procedures were identified for the data considered in this study.
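A hedged sketch of one pipeline of the kind this abstract evaluates: fill a missing block with an FCS-style (chained-equations) imputer, then fit a bagged model with a 60% sampling ratio and evaluate with MAPE. scikit-learn's IterativeImputer and BaggingRegressor are stand-ins for the thesis's MCMC/FCS and bagging procedures; the data, missingness pattern, and all parameters are illustrative.

```python
# Minimal sketch, assuming synthetic data: FCS-style imputation of a missing block
# followed by a bagged model (200 iterations, 60% sampling ratio) scored with MAPE.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Knock out a block of one predictor for a quarter of the rows.
X_miss = X.copy()
block_rows = rng.choice(n, size=n // 4, replace=False)
X_miss[block_rows, 1] = np.nan

X_imp = IterativeImputer(random_state=0).fit_transform(X_miss)   # FCS-style fill-in

X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y, test_size=0.3, random_state=0)
# BaggingRegressor's default base learner is a decision tree; 200 estimators and a
# 60% sampling ratio roughly mirror the bagging setup described in the abstract.
bagged = BaggingRegressor(n_estimators=200, max_samples=0.6,
                          random_state=0).fit(X_tr, y_tr)
print("held-out MAPE:", round(mean_absolute_percentage_error(y_te, bagged.predict(X_te)), 4))
```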
35

Bayesian Nonresponse Models for the Analysis of Data from Small Areas: An Application to BMD and Age in NHANES III

Liu, Ning 28 April 2003 (has links)
We analyze data on bone mineral density (BMD) and age for white females aged 20+ in the third National Health and Nutrition Examination Survey. For the sample the age of each individual is known, but some individuals did not have their BMD measured, mainly because they did not show up at the mobile examination centers. We have data from 35 counties, the small areas. We use two types of models to analyze the data. In the ignorable nonresponse model, BMD does not depend on whether an individual responds or not; in the nonignorable nonresponse model, BMD is related to whether he/she responds. We incorporate this relationship in our model by using a Bayesian approach. We further divide these two types of models into continuous and categorical data models. Our nonignorable nonresponse models have one important feature: they are "close" to the ignorable nonresponse model, thereby reducing the effects of the untestable assumptions so common in nonresponse models. In the continuous data models, because the ages of all nonrespondents are known and there is a relation between BMD and age, age is used as a covariate. In the categorical data models BMD has three levels (normal, osteopenia, osteoporosis) and age has two levels (younger than 50 years, at least 50 years). Thus, age is a supplemental margin for the $2 \times 3$ categorical table. Our research on the categorical models is much deeper than on the continuous models. Our models are hierarchical, a feature that allows a "borrowing of strength" across the counties. Individual inference for most of the counties is unreliable because there is large variation; this "borrowing of strength" is therefore necessary because it permits a substantial reduction in variation. The joint posterior density of the parameters for each model is complex. Thus, we fit each model using Markov chain Monte Carlo methods to obtain samples from the posterior density. These samples are used to make inference about BMD and age, and the relation between BMD and age. For the continuous data models, we show that there is an important relation between BMD and age by using a deviance measure, and we show that the nonignorable nonresponse models are to be preferred. For the categorical data models, we are able to estimate the proportion of individuals in each BMD and age cell of the categorical table, and we can assess the relation between BMD and age using the Bayes factor. A sensitivity analysis shows that there are differences, typically small, in inference among models that permit different levels of association between BMD and age. A simulation study shows that there is not much difference in inference between the ignorable nonresponse models and the nonignorable nonresponse models. As expected, BMD depends on age, and this inference can be obtained for some small counties. For the data we use, there are virtually no young individuals with osteoporosis. The nonignorable nonresponse models generalize the ignorable nonresponse models and therefore allow broader inference.
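A toy simulation (not the NHANES analysis) of why the ignorable/nonignorable distinction matters: when the probability of a missing BMD measurement depends on BMD itself, a naive complete-case summary is biased, which is exactly what a nonresponse model must correct. All numbers below are invented for illustration.

```python
# Toy simulation: nonignorable nonresponse (missingness depends on the unobserved BMD)
# biases the complete-case mean relative to the true population mean.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.uniform(20, 85, n)
bmd = 1.10 - 0.004 * (age - 20) + rng.normal(0, 0.10, n)   # BMD declines with age

# Nonignorable nonresponse: lower BMD -> more likely to skip the examination.
p_respond = 1 / (1 + np.exp(-(bmd - 0.85) * 8))
respond = rng.uniform(size=n) < p_respond

print("true mean BMD:          ", round(bmd.mean(), 4))
print("complete-case mean BMD: ", round(bmd[respond].mean(), 4))   # biased upward
```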
36

Techniques to handle missing values in a factor analysis

Turville, Christopher, University of Western Sydney, Faculty of Informatics, Science and Technology January 2000 (has links)
A factor analysis typically involves a large collection of data, and it is common for some of the data to be unrecorded. This study investigates the ability of several techniques to handle missing values in a factor analysis, including complete cases only, all available cases, imputing means, an iterative component method, singular value decomposition and the EM algorithm. A data set representative of that used for a factor analysis is simulated. Some of these data are then randomly removed to represent missing values, and the performance of the techniques is investigated over a wide range of conditions. Several criteria are used to investigate the abilities of the techniques to handle missing values in a factor analysis. Overall, there is no one technique that performs best for all of the conditions studied. The EM algorithm is generally the most effective technique except when ill-conditioned matrices are present or when computing time is of concern. Some theoretical concerns are introduced regarding the effects that changes in the correlation matrix have on the loadings of a factor analysis. A complicated expression is derived showing that the change in factor loadings resulting from a change in the elements of a correlation matrix involves components of the eigenvectors and eigenvalues. / Doctor of Philosophy (PhD)
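A minimal sketch of one of the simpler techniques this abstract compares, iterative low-rank (SVD) imputation: fill missing cells with column means, reconstruct the matrix from a truncated SVD, overwrite only the missing cells, and repeat until the fill-ins stabilize, then compute the correlation matrix that a factor analysis would use. The rank, tolerance, and simulated factor structure below are illustrative assumptions.

```python
# Minimal sketch of iterative SVD imputation ahead of a factor analysis.
import numpy as np

def svd_impute(X, rank=2, n_iter=50, tol=1e-6):
    X = X.astype(float)
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    filled = np.where(miss, col_means, X)            # initial fill: column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        new_fill = np.where(miss, approx, X)          # keep observed cells fixed
        if np.max(np.abs(new_fill - filled)) < tol:
            break
        filled = new_fill
    return filled

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 2))                          # two latent factors
loadings = rng.normal(size=(2, 6))
data = F @ loadings + 0.3 * rng.normal(size=(200, 6))
holes = data.copy()
holes[rng.uniform(size=holes.shape) < 0.1] = np.nan    # 10% missing completely at random

completed = svd_impute(holes)
corr = np.corrcoef(completed, rowvar=False)            # input to the factor analysis
print(np.round(corr, 2))
```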
37

Statistical modeling of longitudinal survey data with binary outcomes

Ghosh, Sunita 20 December 2007
Data obtained from longitudinal surveys using complex multi-stage sampling designs contain cross-sectional dependencies among units caused by inherent hierarchies in the data, as well as within-subject correlation arising from repeated measurements. The statistical methods used for analyzing such data should account for stratification, clustering and unequal probability of selection, as well as within-subject correlations due to repeated measurements.

The complex multi-stage design approach has been used in the longitudinal National Population Health Survey (NPHS). This ongoing survey collects information on health determinants and outcomes in a sample of the general Canadian population.

This dissertation compares the model-based and design-based approaches used to determine the risk factors of asthma prevalence in the Canadian female population of the NPHS (marginal model). Weighted, unweighted and robust statistical methods were used to examine the risk factors of the incidence of asthma (event history analysis) and of recurrent asthma episodes (recurrent survival analysis). Missing data analysis was used to study the bias associated with incomplete data. To determine the risk factors of asthma prevalence, the Generalized Estimating Equations (GEE) approach was used for marginal modeling (model-based approach), followed by Taylor linearization and bootstrap estimation of standard errors (design-based approach). The incidence of asthma (event history analysis) was estimated using weighted, unweighted and robust methods. Recurrent event history analysis was conducted using the Andersen and Gill, Wei, Lin and Weissfeld (WLW), and Prentice, Williams and Peterson (PWP) approaches. To assess the presence of bias associated with missing data, weighted GEE and pattern-mixture models were used.

The prevalence of asthma in the Canadian female population was 6.9% (6.1-7.7) at the end of Cycle 5. When comparing model-based and design-based approaches for asthma prevalence, the design-based method provided unbiased estimates of standard errors. The overall incidence of asthma in this population, excluding those with asthma at baseline, was 10.5/1000/year (9.2-12.1). For the event history analysis, the robust method provided the most stable estimates and standard errors.

For recurrent event history analysis, the WLW method provided stable standard error estimates. Finally, for the missing data approach, the pattern-mixture model produced the most stable standard errors.

To conclude, design-based approaches should be preferred over model-based approaches for analyzing complex survey data, as the former provide the most unbiased parameter estimates and standard errors.
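A hedged sketch of the model-based (marginal) side only: fitting a GEE with a binomial family and an exchangeable working correlation to synthetic repeated binary outcomes. Variable names, the data-generating model, and all parameters are illustrative, and the design-based weighting/bootstrap comparison from the abstract is not reproduced here.

```python
# Sketch of a marginal model for a repeated binary outcome: GEE, binomial family,
# exchangeable working correlation, on synthetic longitudinal data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subjects, n_cycles = 500, 5
rows = []
for sid in range(n_subjects):
    age0 = rng.uniform(20, 70)
    smoker = rng.integers(0, 2)
    frailty = rng.normal(0, 0.8)                    # induces within-subject correlation
    for cycle in range(n_cycles):
        logit = -3 + 0.02 * (age0 + 2 * cycle) + 0.6 * smoker + frailty
        asthma = int(rng.uniform() < 1 / (1 + np.exp(-logit)))
        rows.append({"id": sid, "cycle": cycle, "age": age0 + 2 * cycle,
                     "smoker": smoker, "asthma": asthma})
df = pd.DataFrame(rows)

model = smf.gee("asthma ~ age + smoker", groups="id", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```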
39

Missing Data Problems in Machine Learning

Marlin, Benjamin 01 August 2008 (has links)
Learning, inference, and prediction in the presence of missing data are pervasive problems in machine learning and statistical data analysis. This thesis focuses on the problems of collaborative prediction with non-random missing data and classification with missing features. We begin by presenting and elaborating on the theory of missing data due to Little and Rubin. We place a particular emphasis on the missing at random assumption in the multivariate setting with arbitrary patterns of missing data. We derive inference and prediction methods in the presence of random missing data for a variety of probabilistic models including finite mixture models, Dirichlet process mixture models, and factor analysis. Based on this foundation, we develop several novel models and inference procedures for both the collaborative prediction problem and the problem of classification with missing features. We develop models and methods for collaborative prediction with non-random missing data by combining standard models for complete data with models of the missing data process. Using a novel recommender system data set and experimental protocol, we show that each proposed method achieves a substantial increase in rating prediction performance compared to models that assume missing ratings are missing at random. We describe several strategies for classification with missing features including the use of generative classifiers, and the combination of standard discriminative classifiers with single imputation, multiple imputation, classification in subspaces, and an approach based on modifying the classifier input representation to include response indicators. Results on real and synthetic data sets show that in some cases performance gains over baseline methods can be achieved by methods that do not learn a detailed model of the feature space.
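A small sketch of one strategy mentioned in this abstract, modifying the classifier's input representation to include response indicators: scikit-learn's SimpleImputer with add_indicator=True appends a missingness flag for each incomplete feature, so the classifier can exploit non-random missingness. The synthetic data and missingness mechanism below are illustrative assumptions.

```python
# Sketch: classification with missing features using response (missingness) indicators
# appended to the imputed inputs, compared against plain mean imputation.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Non-random missingness: feature 1 is more often unobserved when the label is 1,
# so the indicator itself carries signal.
X_obs = X.copy()
drop = rng.uniform(size=n) < np.where(y == 1, 0.5, 0.1)
X_obs[drop, 1] = np.nan

with_flags = make_pipeline(SimpleImputer(strategy="mean", add_indicator=True),
                           LogisticRegression(max_iter=1000))
without_flags = make_pipeline(SimpleImputer(strategy="mean"),
                              LogisticRegression(max_iter=1000))
print("with indicators:   ", cross_val_score(with_flags, X_obs, y, cv=5).mean().round(3))
print("without indicators:", cross_val_score(without_flags, X_obs, y, cv=5).mean().round(3))
```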
