11 |
Handling missing data problems in criminology: an introduction
Wang, Xue January 2016
University of Macau / Faculty of Social Sciences / Department of Sociology
|
12 |
Intégration de données hétérogènes complexes à partir de tableaux de tailles déséquilibrées / Integrating heterogeneous complex data from unbalanced datasets
Imbert, Alyssa 19 October 2018
Advances in high-throughput sequencing technologies have enabled clinical studies to produce large, complex datasets. Several features make these datasets hard to analyze: high dimensionality, heterogeneity at the biological level (data acquired at different biological scales and at different time points of the experiment), heterogeneity of data types, noise (due to biological heterogeneity or to errors in the data), and missing data (a single value or an entire individual). Integrating such data is therefore an important challenge for computational biology. This thesis is part of a clinical research project on obesity, DiOGenes, for which we developed methods for data analysis and integration. The project is based on a dietary intervention conducted in eight European countries and investigates the effect of different diets on weight-loss maintenance and on metabolic and cardiovascular risk factors after a phase of calorie restriction in obese individuals. My work focused on the analysis of transcriptomic data (RNA-Seq) with missing individuals and on the integration of transcriptomic data (new QuantSeq protocol) with clinical data. The first part is devoted to missing data and network inference from RNA-Seq expression data. In longitudinal transcriptomic studies, some individuals are not observed at some time steps for experimental reasons. We propose a hot-deck multiple imputation method (hd-MI) that incorporates external information measured on the same individuals and on other individuals, and that improves the quality of network inference.
The second part presents an integrative study of clinical data and transcriptomic data (measured by QuantSeq) based on a network approach. We show the value of this new protocol for acquiring transcriptomic data and analyze the data with a network inference approach linked to clinical variables of interest.
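As a rough illustration of the hot-deck multiple imputation idea mentioned in this abstract, the sketch below fills in each missing individual by copying the profile of a randomly chosen donor among the individuals closest on an auxiliary variable, and repeats the draw several times to obtain several completed datasets. It is a generic hot-deck scheme, not the hd-MI algorithm of the thesis; the function name, the single numeric auxiliary variable, the parameters `m` and `k`, and the toy data are all assumptions made for illustration.

```python
import random


def hot_deck_multiple_imputation(expression, auxiliary, m=5, k=2, seed=0):
    """Return m completed copies of `expression` (hypothetical helper).

    expression: dict mapping individual id -> expression profile, or None if
                the individual was not observed.
    auxiliary:  dict mapping individual id -> one numeric external measurement
                that is available for every individual.
    """
    rng = random.Random(seed)
    donors = [ind for ind, profile in expression.items() if profile is not None]
    if not donors:
        raise ValueError("at least one observed individual is needed as a donor")

    completed = []
    for _ in range(m):
        filled = dict(expression)
        for ind, profile in expression.items():
            if profile is not None:
                continue
            # Donor pool: the k observed individuals closest on the auxiliary variable.
            pool = sorted(donors, key=lambda d: abs(auxiliary[d] - auxiliary[ind]))[:k]
            # Hot-deck step: copy the profile of one donor drawn at random.
            filled[ind] = expression[rng.choice(pool)]
        completed.append(filled)
    return completed


# Toy data (invented for illustration): "id3" has no expression profile.
expression = {"id1": [5.0, 2.0], "id2": [4.8, 2.2], "id3": None, "id4": [1.0, 9.0]}
auxiliary = {"id1": 30.1, "id2": 29.5, "id3": 29.9, "id4": 41.0}

completed_datasets = hot_deck_multiple_imputation(expression, auxiliary, m=5, k=2)
for dataset in completed_datasets:
    print(dataset["id3"])
```

Any downstream analysis (network inference in the abstract) would then be run once per completed dataset and the results combined, which is what distinguishes multiple imputation from a single fill-in.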
|
13 |
Evaluating PrediXcan’s Ability to Predict Differential Expression Between Alcoholics and Non-Alcoholics
Drake, John E, Jr 01 January 2019
PrediXcan is a recent software tool for imputing gene expression from genotype data alone. Using an overlapping set of transcriptome datasets from postmortem brain tissue of donors with alcohol use disorder and neurotypical controls, generated on two different platforms (Arraystar and Affymetrix), and an additional unrelated transcriptome dataset from lung tissue, we sought to evaluate PrediXcan’s ability to impute gene expression and identify differentially expressed genes. On the Arraystar platform, 1.3% of the genes matched between the measured and imputed expression had a Pearson correlation ≥ 0.5. Our attempt to replicate this finding using expression data from the Affymetrix platform led to a similarly poor outcome (2.7%). Our third attempt, using the transcriptome data from lung tissue, produced similar results (1.1%), but performance improved markedly after filtering out genes with a low predicted R², a model metric provided by the PrediXcan authors. For example, filtering out genes with a predicted R² below 0.6 left 16 genes and a Pearson correlation of 0.365 between the measured and imputed expression. We were unable to reproduce similar performance gains by filtering the Arraystar or Affymetrix alcohol use disorder datasets. Given that PrediXcan can impute only a narrow portion of the transcriptome, which filtering reduces significantly further, we believe caution is warranted when interpreting results derived from PrediXcan.
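The evaluation described in this abstract (per-gene Pearson correlation between measured and imputed expression, the share of genes reaching a threshold, and filtering on predicted R²) can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline and not PrediXcan's API; the function names, the dictionary layout, the gene labels and all numbers are assumptions invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def fraction_well_imputed(measured, imputed, threshold=0.5):
    """Share of genes present in both datasets whose correlation is >= threshold."""
    shared = sorted(set(measured) & set(imputed))
    if not shared:
        return 0.0
    hits = sum(1 for gene in shared if pearson(measured[gene], imputed[gene]) >= threshold)
    return hits / len(shared)


def filter_by_predicted_r2(imputed, predicted_r2, min_r2):
    """Keep only genes whose model-reported predicted R² is at least min_r2."""
    return {gene: values for gene, values in imputed.items()
            if predicted_r2.get(gene, 0.0) >= min_r2}


# Toy data (invented): expression of two genes across four samples.
measured = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [1.0, 2.0, 3.0, 4.0]}
imputed = {"GENE_A": [1.1, 1.9, 3.2, 3.8], "GENE_B": [4.0, 1.0, 3.0, 2.0]}
predicted_r2 = {"GENE_A": 0.7, "GENE_B": 0.1}

share_all = fraction_well_imputed(measured, imputed)
share_filtered = fraction_well_imputed(
    measured, filter_by_predicted_r2(imputed, predicted_r2, min_r2=0.6))
print(share_all, share_filtered)
```

Filtering raises the share of well-imputed genes here only because it discards the poorly modelled gene, which mirrors the abstract's caveat that filtering shrinks the set of genes that can be evaluated.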
|
14 |
The development of a spatial-temporal data imputation technique for the applications of environmental monitoring
Huang, Ya-Chen 12 September 2006
In recent years, sustainable development has become one of the most important international issues. Many indicators related to sustainable development have been proposed and implemented, such as Island Taiwan and Urban Taiwan. However, the missing values that accompany environmental monitoring data posed serious problems when we set out to build a sustainable development indicator for the marine environment. Data are the source of summarized information such as indicators, so when missing values degrade data quality, the accuracy of any estimate based on those data is open to doubt. It is therefore important to apply suitable data pre-processing so that subsequent data analysis yields reliable information. Missing values in environmental monitoring data have several causes, for example instrument breakdown, damaged samples, failure to record a measurement, mismatched records when merging data, and records lost during data processing. The patterns of missing data are also diverse: at a given sampling time, the records from several sampling sites may be partially or completely missing; conversely, part or all of the time series at a given sampling site may be missing. Missing values in environmental monitoring data are thus related to both the spatial and the temporal dimension. Existing data imputation techniques were developed for particular types of data, or interpolate missing values using either geographic data distributions or time-series functions; analyses that accommodate both spatial and temporal information are rare. The current study integrates the related analysis procedures and develops a computing process that uses both the spatial and the temporal dimensions inherent in environmental monitoring data. Such a data imputation process can enhance the accuracy of estimated missing values.
|
15 |
On two topics with no bridge: bridge sampling with dependent draws and bias of the multiple imputation variance estimator
Romero, Martin. January 2003
Thesis (Ph. D.)--University of Chicago, Dept. of Statistics, December 2003. / Includes bibliographical references. Also available on the Internet.
|
16 |
A Monte Carlo Study of Single Imputation in Survey Sampling
Xu, Nuo January 2013
Missing values in sample surveys can lead to biased estimation if left untreated. Imputation is a popular way to deal with missing values. In this paper, building on the research of Särndal (1994, 2005), a Monte Carlo simulation is conducted to study how the estimators perform in different situations and how different imputation methods perform for different response distributions.
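A Monte Carlo comparison of single-imputation methods of the kind this abstract describes might look like the sketch below. It is not the paper's design and does not reproduce Särndal's estimators; the data-generating model, the response mechanism, the two imputation methods compared (respondent-mean and regression imputation), and every name and parameter value are assumptions chosen for illustration.

```python
import random

TRUE_MEAN = 10.0  # E[y] = 2 * E[x] = 2 * 5 under the assumed model below


def simulate_once(rng, n=200):
    """Draw one sample, delete some responses, and impute them in two ways."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    # Assumed response mechanism: units with large x respond less often.
    responded = [rng.random() < (0.9 if x < 5 else 0.4) for x in xs]

    # With n = 200 the respondent set is, in practice, never empty or degenerate.
    obs_x = [x for x, r in zip(xs, responded) if r]
    obs_y = [y for y, r in zip(ys, responded) if r]
    mean_x = sum(obs_x) / len(obs_x)
    mean_y = sum(obs_y) / len(obs_y)
    sxx = sum((x - mean_x) ** 2 for x in obs_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(obs_x, obs_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Single imputation 1: replace each missing y by the respondent mean.
    mean_imputed = [y if r else mean_y for y, r in zip(ys, responded)]
    # Single imputation 2: replace each missing y by its regression prediction.
    reg_imputed = [y if r else intercept + slope * x
                   for x, y, r in zip(xs, ys, responded)]
    return sum(mean_imputed) / n, sum(reg_imputed) / n


def monte_carlo(reps=200, seed=1):
    """Average bias of the estimated mean under each imputation method."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(reps)]
    bias_mean_imp = sum(r[0] for r in results) / reps - TRUE_MEAN
    bias_reg_imp = sum(r[1] for r in results) / reps - TRUE_MEAN
    return bias_mean_imp, bias_reg_imp


bias_mean_imp, bias_reg_imp = monte_carlo()
print(f"mean imputation bias:       {bias_mean_imp:+.3f}")
print(f"regression imputation bias: {bias_reg_imp:+.3f}")
```

Because response depends on x in this assumed setup, respondent-mean imputation underestimates the mean, while regression imputation uses x to correct for it; changing the response mechanism or the error distribution changes the comparison, which is the kind of question a Monte Carlo study is meant to answer.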
|
17 |
Efficient Estimation in a Regression Model with Missing Responses
Crawford, Scott 2012 August 1900
This article examines methods to efficiently estimate the mean response in a linear model with an unknown error distribution under the assumption that the responses are missing at random. We show how the asymptotic variance is affected by the estimator of the regression parameter and by the imputation method. To estimate the regression parameter the Ordinary Least Squares method is efficient only if the error distribution happens to be normal. If the errors are not normal, then we propose a One Step Improvement estimator or a Maximum Empirical Likelihood estimator to estimate the parameter efficiently.
In order to investigate the impact that imputation has on estimation of the mean response, we compare the Listwise Deletion method and the Propensity Score method (which do not use imputation at all), and two imputation methods. We show that Listwise Deletion and the Propensity Score method are inefficient. Partial Imputation, where only the missing responses are imputed, is compared to Full Imputation, where both missing and non-missing responses are imputed. Our results show that in general Full Imputation is better than Partial Imputation. However, when the regression parameter is estimated very poorly, then Partial Imputation will outperform Full Imputation. The efficient estimator for the mean response is the Full Imputation estimator that uses an efficient estimator of the parameter.
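The three estimators of the mean response contrasted in this abstract can be written down compactly. The sketch below is an illustration under stated assumptions, not the article's estimator: it assumes a one-covariate regression through the origin and fits the slope by ordinary least squares on the complete cases (the abstract notes that OLS is efficient only for normal errors). The function names and the toy data are invented. A through-origin model is used because, with an intercept fitted by OLS on the complete cases, the residuals of the observed responses sum to zero and Partial and Full Imputation would give the same number.

```python
def fit_slope_through_origin(xs, ys):
    """OLS slope of y on x without intercept, using only observed responses."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)


def mean_response_estimates(xs, ys):
    """Estimate the mean response three ways; None in ys marks a missing response."""
    beta = fit_slope_through_origin(xs, ys)
    fitted = [beta * x for x in xs]
    observed = [y for y in ys if y is not None]
    n = len(ys)
    return {
        # Listwise deletion: average the observed responses only.
        "listwise_deletion": sum(observed) / len(observed),
        # Partial imputation: keep observed responses, impute only the missing ones.
        "partial_imputation": sum(y if y is not None else f
                                  for y, f in zip(ys, fitted)) / n,
        # Full imputation: replace every response, observed or not, by its fitted value.
        "full_imputation": sum(fitted) / n,
    }


# Toy data (invented): the third response is missing.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.5, None, 3.9]

estimates = mean_response_estimates(xs, ys)
for name, value in estimates.items():
    print(f"{name}: {value:.4f}")
```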
|
18 |
Practical importance sampling methods for finite mixture models and multiple imputation
Steele, Russell John, January 2002
Thesis (Ph. D.)--University of Washington, 2002. / Vita. Includes bibliographical references (p. 109-119).
|
19 |
Model Selection and Multivariate Inference Using Data Multiply Imputed for Disclosure Limitation and Nonresponse
Kinney, Satkartar K. January 2007
Thesis (Ph. D.)--Duke University, 2007.
|
20 |
The handling, analysis and reporting of missing data in patient reported outcome measures for randomised controlled trials
Rombach, Ines January 2016
Missing data is a potential source of bias in the results of randomised controlled trials (RCTs), which can have a negative impact on guidance derived from them, and ultimately patient care. This thesis aims to improve the understanding, handling, analysis and reporting of missing data in patient reported outcome measures (PROMs) for RCTs. A review of the literature provided evidence of discrepancies between recommended methodology and current practice in the handling and reporting of missing data. In particular, missed opportunities to minimise missing data, the use of inappropriate analytical methods and a lack of sensitivity analyses were noted. Missing data patterns were examined and found to vary between PROMs as well as across RCTs. Separate analyses illustrated difficulties in predicting missing data, resulting in uncertainty about assumed underlying missing data mechanisms. Simulation work was used to assess the comparative performance of statistical approaches for handling missing data that are available in standard statistical software. Multiple imputation (MI) at either the item, subscale or composite score level was considered for missing PROMs data at a single follow-up time point. The choice of an MI approach depended on a multitude of factors, with MI at the item level being more beneficial than its alternatives for high proportions of item missingness. The approaches performed similarly for high proportions of unit-nonresponse; however, convergence issues were observed for MI at the item level. Maximum likelihood (ML), MI and inverse probability weighting (IPW) were evaluated for handling missing longitudinal PROMs data. MI was less biased than ML when additional post-randomisation data were available, while IPW introduced more bias compared to both ML and MI. A case study was used to explore approaches to sensitivity analyses to assess the impact of missing data. It was found that trial results could be susceptible to varying assumptions about missing data, and the importance of interpreting the results in this context was reiterated. This thesis provides researchers with guidance for the handling and reporting of missing PROMs data in order to decrease bias arising from missing data in RCTs.
|