  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Performance Comparison of Multiple Imputation Methods for Quantitative Variables for Small and Large Data with Differing Variability

Onyame, Vincent 01 May 2021 (has links)
Missing data continues to be one of the main problems in data analysis, as it reduces sample representativeness and consequently causes biased estimates. Multiple imputation has been established as an effective approach to handling missing data. In this study, we examined multiple imputation methods for quantitative variables on twelve data sets of varying size and variability that were pseudo-generated from an original data set. The methods examined are predictive mean matching, Bayesian linear regression, and non-Bayesian linear regression, as implemented in the MICE (Multivariate Imputation by Chained Equations) package for the statistical software R. The parameter estimates from linear regressions fitted to the imputed data were compared with those from the complete data to determine which method came closest across all twelve data sets.
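A minimal sketch of this kind of comparison using the mice package in R (this is not the thesis's own code; the data frame df, its complete counterpart df_complete, and the variables y, x1, and x2 are hypothetical names, and all columns are assumed to be numeric):

    # Impute with three mice methods and pool linear-regression estimates for comparison.
    library(mice)

    methods <- c(pmm      = "pmm",      # predictive mean matching
                 bayes    = "norm",     # Bayesian linear regression
                 nonbayes = "norm.nob") # linear regression, non-Bayesian

    fits <- lapply(methods, function(m) {
      imp <- mice(df, m = 5, method = m, seed = 2021, printFlag = FALSE)
      pool(with(imp, lm(y ~ x1 + x2)))  # fit on each imputed data set, then pool
    })

    lapply(fits, summary)                         # pooled estimates per imputation method
    summary(lm(y ~ x1 + x2, data = df_complete))  # complete-data benchmark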
52

Determining the Size of a Galaxy's Globular Cluster Population through Imputation of Incomplete Data with Measurement Uncertainty

Richard, Michael R. 11 1900 (has links)
A globular cluster is a collection of stars that orbits the center of its galaxy as a single satellite. Understanding what influences the formation of these clusters provides insight into galaxy structure and early galactic development. We continue the work of Harris et al. (2013), who identified a set of predictors that accurately determined the number of clusters, Ngc, through analysis of an incomplete dataset. We aimed to improve upon these results through imputation of the missing data. A small amount of precision was gained for the slope of Ngc ~ R_e*sigma_e, while the intercept suffered a small loss of precision. Estimates of intrinsic variance also increased with the addition of imputed data. We also found galaxy morphological type to be a significant predictor of Ngc in a model with R_e*sigma_e. Although it increased the precision of the slope and reduced the residual variance, its overall contribution was negligible. / Thesis / Master of Science (MSc)
53

Distribution of Metal Ions in Prostate and Urine during Prostate Carcinogenesis

Xiao, Hong 26 September 2011 (has links)
No description available.
54

A Comparison of Last Observation Carried Forward and Multiple Imputation in a Longitudinal Clinical Trial

Carmack, Tara Lynn 25 June 2012 (has links)
No description available.
55

The Effect of Item Parameter Uncertainty on Test Reliability

Bodine, Andrew James 24 August 2012 (has links)
No description available.
56

Log Linear Models for Prediction and Analysis of Networks

Ouzienko, Vladimir January 2012 (has links)
The heightened research activity in the interdisciplinary field of network science can be attributed to the emergence of social network applications. Researchers understood early on that data describing how entities interconnect is highly valuable and offers a deeper understanding of the entities themselves, which is why so many studies of various kinds of networks have appeared in the last 10-15 years. The study of networks from the perspective of computer science usually has two objectives. The first is to develop statistical mechanisms capable of accurately describing and modeling observed real-world networks; a good fit of such a mechanism supports the correctness of the model's assumptions and leads to a better understanding of the network. The second goal is more practical: a well-performing model can be used to predict what will happen to the network in the future, and the information gleaned from the network can be leveraged to predict what will happen to the network's entities. One important leitmotif of network research and analysis is the wide adoption of log-linear models. In this work we apply this philosophy to the study and evaluation of log-linear statistical models in various types of networks. We begin by proposing a new Temporal Exponential Random Graph Model (tERGM) for analysis and prediction in binary temporal social networks. We then extend the model for application to partially observed networks that change over time. Lastly, we generalize the tERGM model to predict real-valued weighted links in temporal non-social networks. Log-linear models are not limited to networks that change over time; they can also be applied to static networks. One such static network is the social network of patients undergoing hemodialysis. Hemodialysis is prescribed to people suffering from end-stage renal disease; the treatment requires attending the hemodialysis clinic on a fixed schedule for a prolonged period, and this is how social ties are formed. The new log-linear Social Latent Vectors (SLV) model was applied to study such static social networks. The results of the SLV experiments suggest that the social relationships patients form influence individual patients' clinical outcomes. The study demonstrates how social network analysis can be applied to better understand the network's constituents. / Computer and Information Science
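As a baseline illustration of the log-linear network models this work builds on, a standard (non-temporal) exponential random graph model can be fitted with the ergm package in R. This is not the thesis's tERGM or SLV code; adj is a hypothetical binary adjacency matrix for a single network snapshot:

    # Fit a simple log-linear (exponential-family) random graph model.
    library(network)
    library(ergm)

    net <- network(adj, directed = TRUE)  # build a network object from the adjacency matrix
    fit <- ergm(net ~ edges + mutual)     # density and reciprocity terms, on the log-odds scale
    summary(fit)                          # estimated coefficients and standard errors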
57

A Study of Machine Learning Approaches for Biomedical Signal Processing

Shen, Minjie 10 June 2021 (has links)
The introduction of high-throughput molecular profiling technologies provides the capability of studying diverse biological systems at the molecular level. However, due to various limitations of measurement instruments, data preprocessing is often required in biomedical research, and improper preprocessing has a negative impact on downstream analysis tasks. This thesis studies two important preprocessing topics: missing value imputation and between-sample normalization. Missing data is a major issue in quantitative proteomics data analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, comparative assessment of the accuracy of existing methods remains inconclusive, mainly because the true missing mechanisms are complex and the existing evaluation methodologies are imperfect; moreover, few studies have provided an outlook on current and future developments. We first report an assessment of eight representative methods collectively targeting three typical missing mechanisms. The selected methods are compared on both realistic simulation and real proteomics datasets, and performance is evaluated using three quantitative measures. We then discuss fused regularization matrix factorization, a popular low-rank matrix factorization framework with similarity and/or biological regularization that can be extended to integrate multi-omics data such as gene expression or clinical variables. We further explore the potential application of convex analysis of mixtures, a biologically inspired latent variable modeling strategy, to missing value imputation, and provide preliminary results on proteomics data together with an outlook on future development directions. While a few winners emerged from our comparative assessment, data-driven evaluation of imputation methods is imperfect because performance is evaluated indirectly on artificially masked values rather than authentic missing values, and imputation accuracy may vary with signal intensity. Fused regularization matrix factorization offers a way to incorporate external information, and convex analysis of mixtures presents a biologically plausible new approach. Data normalization is essential to ensure accurate inference and comparability of gene expression across samples or conditions. Ideally, gene expression should be rescaled based on consistently expressed reference genes. However, for normalizing biologically diverse samples, the most commonly used reference genes have exhibited striking expression variability, and distribution-based approaches can be problematic when the differentially expressed genes are strongly asymmetric. We introduce a Cosine score based iterative normalization (Cosbin) strategy to normalize biologically diverse samples. The between-sample normalization is based on iteratively identified consistently expressed genes, where differentially expressed genes are sequentially eliminated according to scale-invariant Cosine scores. We evaluate the performance of Cosbin and four other representative normalization methods (Total count, TMM/edgeR, DESeq2, DEGES/TCC) on both idealistic and realistic simulation data sets; Cosbin consistently outperforms the other methods across various performance criteria. Implemented in open-source R scripts and applicable to grouped or individual samples, the Cosbin tool will allow biologists to detect subtle yet important molecular signals across known or novel phenotypic groups.
/ Master of Science / Data preprocessing is often required in biomedical research due to various limitations of measurement instruments. This thesis studies two important preprocessing topics: missing value imputation and between-sample normalization. Missing data is a major issue in quantitative proteomics data analysis, and imputation is the process of substituting plausible values for missing ones. We propose a more realistic assessment workflow that preserves the original data distribution and use it to assess eight representative general-purpose imputation strategies. We also explore two biologically inspired imputation approaches: fused regularization matrix factorization (FRMF) and convex analysis of mixtures (CAM) imputation. FRMF integrates external information such as clinical variables and multi-omics data into imputation, while CAM imputation incorporates biological assumptions; we show that integrating biological information improves imputation performance. Data normalization is required to ensure correct comparison, and for gene expression data, between-sample normalization is needed. We propose a Cosine score based iterative normalization (Cosbin) strategy to normalize biologically diverse samples and show that Cosbin significantly outperforms other methods in both idealistic and realistic simulations. Implemented in open-source R scripts and applicable to grouped or individual samples, the Cosbin tool will allow biologists to detect subtle yet important molecular signals across known or novel cell types.
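A rough sketch of the iterative cosine-score idea described above, assuming a non-negative genes-by-samples expression matrix and per-sample group labels. This illustrates the principle only; it is not the published Cosbin implementation, and every name in it is made up:

    # Iteratively (1) rescale samples using the currently retained "consistent" genes,
    # (2) score each gene's cross-group consistency with a scale-invariant cosine score,
    # and (3) drop the least consistent (likely differentially expressed) genes.
    cosbin_like <- function(expr, group, n_iter = 5, drop_frac = 0.1) {
      stopifnot(all(rowSums(expr) > 0))   # assumes all-zero genes were removed beforehand
      keep <- rep(TRUE, nrow(expr))
      norm <- expr
      for (i in seq_len(n_iter)) {
        size <- colSums(norm[keep, , drop = FALSE])
        norm <- sweep(expr, 2, size / mean(size), "/")           # between-sample rescaling
        grp_means <- t(apply(norm, 1, function(g) tapply(g, group, mean)))
        ones <- rep(1, ncol(grp_means))                          # perfectly consistent profile
        score <- as.vector(grp_means %*% ones) /
          (sqrt(rowSums(grp_means^2)) * sqrt(length(ones)))      # cosine to the all-ones vector
        cutoff <- quantile(score[keep], probs = drop_frac)
        keep <- keep & (score > cutoff)                          # eliminate least consistent genes
      }
      list(normalized = norm, consistent_genes = which(keep))
    }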
58

Modélisation des données d'enquêtes cas-cohorte par imputation multiple : application en épidémiologie cardio-vasculaire / Modeling of case-cohort data by multiple imputation : application to cardio-vascular epidemiology

Marti Soler, Helena 04 May 2012 (has links)
The weighted estimators generally used for analyzing case-cohort studies are not fully efficient. However, case-cohort surveys are a special type of incomplete data in which the observation process is controlled by the study organizers, so methods for analyzing Missing At Random (MAR) data can be appropriate, in particular multiple imputation, which uses all the available information and approximates the partial maximum likelihood estimator. This approach is based on the generation of several plausible complete data sets that take into account the uncertainty about the missing values. It makes it possible to adapt any statistical tool available for cohort data, for instance estimators of the predictive ability of a model or of an additional variable, which raise specific problems with case-cohort data. We have shown that the imputation model must be estimated on all the completely observed subjects (cases and non-cases), including the case indicator among the explanatory variables. We validated this approach with several sets of simulations: 1) completely simulated data where the true parameter values were known, 2) case-cohort data simulated from the PRIME cohort, without any phase-1 variable (completely observed) strongly predictive of the phase-2 variable (incompletely observed), and 3) case-cohort data simulated from the NWTS cohort, where a phase-1 variable strongly predictive of the phase-2 variable was available. These simulations showed that multiple imputation generally provided unbiased estimates of the risk ratios. For the phase-1 variables, they were almost as precise as the estimates provided by the full cohort, slightly more precise than the calibrated estimator of Breslow et al., and clearly more precise than classical weighted estimators. For the phase-2 variables, the multiple imputation estimator was generally unbiased, with better precision than classical weighted estimators and precision similar to that of the calibrated estimator. The simulations based on the NWTS cohort data gave less satisfactory results for the effects involving the phase-2 variable: the multiple imputation estimators were slightly biased and less precise than the weighted estimators. This can be explained by the interaction terms involving the phase-2 variable in the analysis model, which required estimating separate imputation models in different strata that sometimes included too few cases to satisfy the asymptotic conditions. We recommend using multiple imputation to obtain more precise risk ratio estimates, while checking that they are similar to those provided by the weighted analyses. Our simulations also showed that multiple imputation provided estimates of a model's predictive value (Harrell's C) or of an additional variable's added value (difference of C indices, NRI or IDI) similar to those obtained from the full cohort.
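A minimal sketch of the recommended imputation setup with the mice and survival packages in R, assuming a full-cohort data frame dat with hypothetical variables time (follow-up time), case (event indicator), a phase-1 covariate age, and a phase-2 covariate expensive that is only observed in the subcohort and the cases (none of these names come from the thesis):

    # Impute the phase-2 variable using all completely observed subjects,
    # with the case indicator included among the predictors of the imputation model.
    library(mice)
    library(survival)

    pred <- make.predictorMatrix(dat)
    pred["expensive", ] <- 0
    pred["expensive", c("case", "time", "age")] <- 1   # case indicator in the imputation model

    imp <- mice(dat, m = 20, predictorMatrix = pred, seed = 1, printFlag = FALSE)
    fit <- with(imp, coxph(Surv(time, case) ~ expensive + age))  # full-cohort Cox model per imputation
    summary(pool(fit))                                           # pooled log hazard ratios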
59

EXTENT OF LINKAGE DISEQUILIBRIUM, CONSISTENCY OF GAMETIC PHASE AND IMPUTATION ACCURACY WITHIN AND ACROSS CANADIAN DAIRY BREEDS

Larmer, Steven 09 August 2012 (has links)
Some dairy breeds have too few genotyped animals for within-breed genomic selection to be carried out with sufficient accuracy. Therefore, the level of linkage disequilibrium within each breed and the consistency of gametic phase across breeds were studied. High correlations of phase (>0.9) were found between all breed pairs at the SNP density studied. The efficacy of imputing animals genotyped on lower-density (6k and 50k) panels was then explored as a cost-effective way to increase the size of the reference population with 777k genotypes. The results showed high accuracies (>0.92) in all imputation scenarios studied, using both a within-breed and a multi-breed reference population for imputation. Given the results of both studies, it was concluded that pooling breeds into a common reference population should be a viable option for accurate genomic selection in breeds with few genotyped individuals. / NSERC, USDA, CDN, DairyGen, Ayrshire Canada, Guernsey Canada, Semex, L'Alliance Boviteq Inc.
60

Métodos de imputação de dados aplicados na área da saúde / Data imputation methods applied in the health field

Nunes, Luciana Neves January 2007 (has links)
Missing data is a very common problem in health research. The most direct way of dealing with it is to exclude the subjects with missing values in one or more variables, probably because most traditional statistical techniques were developed for complete data sets. However, this exclusion may produce invalid inferences, especially if the subjects who remain in the analysis differ from those who were excluded. In the last two decades, imputation methods were developed to address this problem; the underlying idea is to fill in the missing data with plausible values. Multiple imputation is the most complex of these methods. The objective of this dissertation is to disseminate the multiple imputation method, which it pursues through two papers. The first paper describes two multiple imputation techniques and applies them to a real data set. The second paper compares multiple imputation with two single imputation techniques through an application to a risk model for surgical mortality. The applications used secondary data previously analyzed by Klück (2004).
