21 |
The Role of Missing Data Imputation in Clinical Studies. Peng, Zhimin. January 2018.
No description available.
|
22 |
Statistical Analysis of Longitudinal Data with a Case Study. Liu, Kai. January 2015.
Preterm birth is the leading cause of neonatal mortality and long-term morbidity. Neonatologists can adjust nutrition for preterm neonates to control their weight gain so that the risk of long-term morbidity is minimized. This optimization of the growth trajectories of preterm infants can be achieved by studying a cohort of selected healthy preterm infants with weights observed from day 1 to day 21. However, missing values in such data pose a big challenge. In fact, missing data is a common problem faced by most applied researchers. Most statistical software packages deal with missing data by simply deleting subjects with missing items. Analyses carried out on such incomplete data yield biased estimates of the parameters of interest and consequently lead to misleading or invalid inference. Even though many statistical methods offer some robustness, it is better to handle missing data by imputing plausible values and then carrying out a suitable analysis on the completed data. In this thesis, several imputation methods are first introduced and discussed. Once the data are completed using any of these methods, the growth trajectories for this cohort of preterm infants can be presented as percentile growth curves, which can then serve as references for the population of preterm babies. To quantify the growth rate explicitly, predictive models are established for weights at days 7, 14 and 21; I have used both univariate and multivariate linear models on the completed data. The resulting predictive models can then be used to calculate the target weight at days 7, 14 and 21 for any other infant, given the information at birth. Neonatologists can then adjust the amount of nutrition given to preterm infants so that they grow neither too fast nor too slow, avoiding later-life complications. / Thesis / Master of Science (MSc)
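To make the two-step workflow concrete (complete the data, then fit a predictive model), here is a minimal Python sketch on synthetic data. The column names, the mean-fill imputation, and all coefficients are illustrative assumptions, not the thesis's actual data or models:

```python
# Sketch: impute a partially missing day-14 weight, then fit a
# predictive linear model on the completed data (synthetic example).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
growth = pd.DataFrame({
    "birth_weight": rng.normal(1500, 300, n),   # grams (hypothetical)
    "gest_age": rng.normal(31, 2, n),           # weeks (hypothetical)
})
growth["w14"] = (growth["birth_weight"] * 1.15
                 + 20 * growth["gest_age"] + rng.normal(0, 50, n))

# Introduce missingness in the outcome, then impute with a simple
# mean fill (a crude stand-in for the imputation methods in the thesis).
mask = rng.random(n) < 0.2
w14_obs = growth["w14"].where(~mask)
growth["w14_completed"] = w14_obs.fillna(w14_obs.mean())

# Predictive model on the completed data.
X = growth[["birth_weight", "gest_age"]]
model = LinearRegression().fit(X, growth["w14_completed"])
target = model.predict(pd.DataFrame({"birth_weight": [1400], "gest_age": [30]}))
print(f"Predicted day-14 weight for a 1400 g, 30-week infant: {target[0]:.0f} g")
```

In practice the mean fill would be replaced by one of the imputation methods the thesis compares, since mean imputation alone understates variability.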
|
23 |
The wild bootstrap resampling in regression imputation algorithm with a Gaussian Mixture Model. Mat Jasin, A.; Neagu, Daniel; Csenki, Attila. 08 July 2018.
Unsupervised learning of a finite Gaussian mixture model (FGMM) is used to learn the distribution of the population data. This paper proposes the use of wild bootstrapping to create variability in the imputed data in single missing-data imputation. We compare the performance and accuracy of the proposed method against multiple imputation from the R package Amelia II using RMSE, R-squared, MAE and MAPE. The proposed method shows better performance than multiple imputation (MI), which is widely regarded as the gold standard among missing-data imputation techniques.
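A minimal sketch of the core mechanism, wild-bootstrap perturbation of imputed regression predictions, on synthetic data. The FGMM density-estimation step of the paper is deliberately replaced here by a single linear regression, so this illustrates the bootstrap mechanics only:

```python
# Sketch: regression imputation with wild-bootstrapped residuals
# (Rademacher weights), so imputed values carry residual variability
# rather than just the fitted mean.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.8, size=n)
miss = rng.random(n) < 0.3                 # MCAR missingness in y

# Fit the imputation model on the complete cases.
X_obs = np.column_stack([np.ones((~miss).sum()), x[~miss]])
beta, *_ = np.linalg.lstsq(X_obs, y[~miss], rcond=None)
resid = y[~miss] - X_obs @ beta

# Wild bootstrap: resample residuals and flip their signs at random,
# then add them to the regression predictions for the missing cases.
X_mis = np.column_stack([np.ones(miss.sum()), x[miss]])
signs = rng.choice([-1.0, 1.0], size=miss.sum())      # Rademacher weights
boot_resid = rng.choice(resid, size=miss.sum(), replace=True)
y_imputed = X_mis @ beta + signs * boot_resid
print(f"imputed {miss.sum()} values; first few: {np.round(y_imputed[:3], 2)}")
```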
|
24 |
Avaliação de redes Bayesianas para imputação em variáveis qualitativas e quantitativas / Evaluating Bayesian networks for imputation with qualitative and quantitative variables. Magalhães, Ismenia Blavatsky de. 29 March 2007.
Bayesian networks are structures that combine probability distributions with graphs. Although Bayesian networks first appeared in the 1980s and the first attempts to solve the problems caused by non-response date back to the 1930s and 1940s, the use of such structures specifically for imputation is quite recent: in 2002 by official statistical institutes, and in 2003 in the context of data mining. The purpose of this work is to present some results on the application of discrete and mixed Bayesian networks for imputation. To that end, an algorithm is proposed that combines expert knowledge with experimental data from previous research or part of the collected data. In applying Bayesian networks in this context, it is assumed that once the variables are preserved in their original relation, the imputation method will be effective in maintaining desirable properties. Accordingly, three types of consistency already found in the literature are evaluated: database consistency, logical consistency and statistical consistency. In addition, structural consistency is proposed, defined as the ability of a network, when rebuilt from the data after imputation, to keep its structure within the equivalence class of the original network. A mixed Bayesian network is used for the first time for the treatment of non-response in quantitative variables. A measure of statistical consistency for mixed networks is computed using multiple imputation to evaluate network parameters and regression models. As an application, experiments were conducted using simple networks based on household and person data from the 2000 Demographic Census of the municipality of Natal and on data from a study of homicides in the City of Campinas. The results indicate that Bayesian networks for imputation of discrete attributes are promising, particularly when the interest is in maintaining statistical consistency and the number of classes of the variable is small. Other characteristics, such as the contingency coefficient between variables, are affected by the method as the percentage of non-response increases. For continuous attributes, the median is the most sensitive to the method.
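The notion of statistical consistency can be illustrated with a toy check of the kind described above: compare an association measure (here the contingency coefficient, which the thesis reports as sensitive to non-response) before missingness and after imputation. The mode-fill imputer below is a deliberately crude stand-in for the Bayesian-network imputation studied in the thesis, and the data are synthetic:

```python
# Sketch: does imputation preserve the association between two
# discrete variables? Compare contingency coefficients before and
# after a naive mode imputation.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def contingency_coefficient(a, b):
    chi2 = chi2_contingency(pd.crosstab(a, b))[0]
    return np.sqrt(chi2 / (chi2 + len(a)))

rng = np.random.default_rng(2)
n = 1000
a = rng.choice(["low", "mid", "high"], size=n)
# b depends on a 60% of the time, so the variables are associated.
b = np.where(rng.random(n) < 0.6, a, rng.choice(["low", "mid", "high"], size=n))
df = pd.DataFrame({"a": a, "b": b})

before = contingency_coefficient(df["a"], df["b"])

# Knock out 30% of b and impute with the marginal mode.
missing = rng.random(n) < 0.3
b_imp = df["b"].where(~missing)
b_imp = b_imp.fillna(b_imp.mode()[0])
after = contingency_coefficient(df["a"], b_imp)

print(f"contingency coefficient: before={before:.3f}, after={after:.3f}")
```

The drop in the coefficient after imputation mirrors the attenuation the thesis observes as the non-response percentage grows; a network-based imputer that respects the variables' joint structure should attenuate it less.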
|
26 |
Multiple Imputation in der Praxis: ein sozialwissenschaftliches Anwendungsbeispiel / Multiple imputation in practice: a socio-scientific example of use. Böwing-Schmalenbrock, Melanie; Jurczok, Anne. January 2011.
Multiple imputation has established itself and proven adequate as a method for handling missing observations, at least in theory. Application-oriented explanations and introductions are scarce, and in practice many social scientists therefore forgo this necessary step of data preparation. Despite (or perhaps because of) the continuous development of programs and options for carrying out multiple imputation, the user is confronted with numerous challenges for which solutions are sometimes hard to find. The difficulties range from the analysis and preparation of the target variable, through the choice of software and the selection of predictors, to formulating a suitable model and evaluating the results. This paper outlines how multiple imputation works and where it is applicable, and develops an approach that has proven useful for implementing the method step by step, even for beginners. It discusses specific potential difficulties and offers concrete solutions; the particular structure of the missing values is paramount throughout. The imputation process and all associated steps are illustrated with a worked example: the multiple imputation of the total assets of wealthy households.
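A minimal sketch of the generic workflow the paper walks through: impute m times, analyse each completed dataset, then pool point estimates and variances with Rubin's rules. The hot-deck draw stands in for a full imputation model, and the "household assets" data are synthetic:

```python
# Sketch: multiple imputation with Rubin's-rules pooling for the mean
# of a partially observed variable (synthetic wealth data).
import numpy as np

rng = np.random.default_rng(3)
n, m = 300, 20
wealth = rng.lognormal(mean=12, sigma=1, size=n)   # hypothetical assets
missing = rng.random(n) < 0.25

estimates, variances = [], []
for _ in range(m):
    completed = wealth.copy()
    # Stochastic imputation: draw from observed values (hot deck).
    completed[missing] = rng.choice(wealth[~missing], size=missing.sum())
    estimates.append(completed.mean())
    variances.append(completed.var(ddof=1) / n)    # within-imputation variance

q_bar = np.mean(estimates)                         # pooled point estimate
u_bar = np.mean(variances)                         # average within variance
b = np.var(estimates, ddof=1)                      # between-imputation variance
total_var = u_bar + (1 + 1 / m) * b                # Rubin's total variance
print(f"pooled mean = {q_bar:.0f}, SE = {np.sqrt(total_var):.0f}")
```

The between-imputation term is what a single imputation silently drops, which is exactly the uncertainty the method is designed to recover.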
|
27 |
Méthodes d'analyse des données incomplètes incorporant l'incertitude attribuable aux valeurs manquantes / Methods for analysing incomplete data that incorporate the uncertainty attributable to missing values. Bernard, Francis. January 2013.
When analysing survey data, one is often confronted with the problem of missing data. One of the most common solutions is to resort to single imputation methods. Unfortunately, these methods suffer from an important handicap: the usual estimates based on the observed and imputed values wrongly treat the imputed values as known, even though some uncertainty remains about the values to be imputed. In particular, confidence intervals for the parameters of interest based on the data completed in this way do not incorporate the uncertainty attributable to the missing values. Methods based on resampling, and multiple imputation (a generalization of single imputation), are both suitable standard solutions to the missing-data problem, because they incorporate this uncertainty. An alternative is two-level multiple imputation, a generalization of conventional multiple imputation developed in Shen's 2000 thesis [51], which exploits situations where the nature of the missing values suggests performing the imputation in two stages rather than one. We describe these methods for analysing incomplete data that incorporate the uncertainty attributable to missing values, raise some interesting issues relating to their use, and propose appropriate solutions. Finally, we illustrate the application of conventional multiple imputation and two-level multiple imputation with simple, concrete examples.
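The two-stage structure can be sketched as a nested loop: each of the m first-stage imputations spawns n second-stage imputations, giving m × n completed datasets. The hot-deck draws below are placeholder imputation models under synthetic data, and the nested pooling rules are only indicated in a comment:

```python
# Sketch: two-level (nested) multiple imputation in the spirit of
# Shen (2000), where values missing for two different reasons are
# imputed in two stages.
import numpy as np

rng = np.random.default_rng(4)
n_obs, m, n_inner = 200, 5, 4
y = rng.normal(50, 10, n_obs)
stage1 = rng.random(n_obs) < 0.15               # e.g. unit non-response
stage2 = (rng.random(n_obs) < 0.15) & ~stage1   # e.g. item non-response
fully_observed = ~stage1 & ~stage2

estimates = []
for i in range(m):                      # first-stage imputations
    d1 = y.copy()
    d1[stage1] = rng.choice(y[fully_observed], size=stage1.sum())
    for j in range(n_inner):            # second-stage imputations, nested
        d2 = d1.copy()
        d2[stage2] = rng.choice(y[fully_observed], size=stage2.sum())
        estimates.append(d2.mean())

# Pooling would use the nested analogue of Rubin's rules, which splits
# the between-imputation variance into between-nest and within-nest parts.
print(f"{m * n_inner} completed-data estimates, grand mean = {np.mean(estimates):.2f}")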
|
28 |
Inequality of opportunity: measurement and impact on economic growth / Inégalité d'opportunité : mesure et effet sur la croissance économique. Teyssier, Geoffrey. 17 November 2017.
This thesis is about the measurement of inequality of opportunity and its impact on economic growth. Chapter 1 studies the axiomatic properties of two prominent measurement approaches. In both cases, the population is partitioned into groups of people sharing the same circumstances, those income determinants that are beyond individual control (e.g. sex or parental background) and that shape one's opportunities. Inequality of opportunity is then measured by applying an inequality index to a counterfactual distribution in which each individual is attributed the representative income of his group. The first approach takes the representative income of a group to be its arithmetic mean. When a large number of small groups are considered, these means can be poorly estimated. To mitigate this issue, the second approach, called parametric, assumes that circumstances have no interaction effects and takes the representative income to be the OLS-predicted value of income regressed on circumstances. Chapter 1 shows that the parametric approach has poor axiomatic properties, especially with respect to a between-group version of the transfer principle. Chapter 2 provides a methodology to circumvent the current lack of microdata on parental-background circumstances, a major driver of inequality of opportunity. The idea is to retrieve the parental background of adults living with their parents from the structure of household survey data, and then to apply a missing-data procedure, multiple imputation, to obtain estimates of inequality of opportunity that are representative of the overall adult population. These estimates are shown to be close to their "true" counterparts, based on direct questions about parental background contained in the Brazilian PNAD 1996 survey. Chapter 3 empirically investigates a recent and promising explanation for the inconclusiveness of the traditional growth-inequality literature: income inequality as a whole does not matter for growth, while its components, inequality of opportunity and the residual component, inequality of effort, do. This explanation is validated in Brazil at the municipality level over the period 1980-2010, where inequalities of opportunity and effort are respectively detrimental and beneficial to subsequent growth, as expected. Their effects are robust and significant, in contrast to that of total income inequality.
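The first (non-parametric) approach lends itself to a short sketch: partition the population by circumstances, replace each income by its group mean, and apply an inequality index (here the mean log deviation, a standard choice) to the counterfactual distribution. All data and effect sizes below are synthetic assumptions:

```python
# Sketch: ex-ante, group-mean measurement of inequality of opportunity
# via the mean log deviation (MLD) on a counterfactual distribution.
import numpy as np
import pandas as pd

def mean_log_deviation(income):
    income = np.asarray(income, dtype=float)
    return np.mean(np.log(income.mean() / income))

rng = np.random.default_rng(5)
n = 5000
df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "parent_edu": rng.choice(["low", "high"], size=n),   # circumstances
})
# Synthetic incomes: parental education shifts the income distribution.
base = np.where(df["parent_edu"] == "high", 1.6, 1.0)
df["income"] = rng.lognormal(mean=np.log(1000 * base), sigma=0.5)

# Counterfactual: everyone receives their circumstance group's mean.
df["group_mean"] = df.groupby(["sex", "parent_edu"])["income"].transform("mean")

total = mean_log_deviation(df["income"])
opportunity = mean_log_deviation(df["group_mean"])
print(f"total MLD = {total:.3f}, of which opportunity = {opportunity:.3f} "
      f"({100 * opportunity / total:.0f}%)")
```

The parametric variant criticized in Chapter 1 would replace `group_mean` with the OLS-fitted value of log income regressed on the circumstance dummies.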
|
29 |
New models for implementation of genome-wide evaluation in black poplar breeding program / Nouveaux modèles pour la mise en oeuvre de l'évaluation pangénomique dans le programme d'amélioration du peuplier. Pegard, Marie. 19 December 2018.
Forest species are unique in many ways compared to other domesticated species. Forest trees have long juvenile phases, leading to long and costly selection cycles and requiring selection in several independent stages. Even if this method is operationally effective, it remains costly in time and resources, diluting both the intensity and the accuracy of selection. In view of these constraints, trees are good candidates for the implementation of genomic evaluation. Genomic selection (GS) ranks and selects individuals based on the information contained in their genomes, without a phenotypic evaluation step, and thus accelerates the selection process. This work aimed to identify the situations, criteria and factors under which GS could be a feasible option for poplar. Our study showed that the benefits of genomic evaluation are context-dependent: genomic evaluation is most effective in the less advantageous situations, and it also benefits from densifying low- to medium-density genetic information through a high-quality imputation step. Genomic selection could be an interesting option at an early stage, when the accuracy of selection is generally low and genetic variability is abundant. Our work also showed that it is important to evaluate performance with alternative criteria, such as those related to ranking, especially when these criteria fit the operational context of the breeding programme under study.
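The closing point about ranking-related criteria can be illustrated with a sketch that scores hypothetical genomic predictions both by rank correlation and by the overlap of the selected top fraction, which is what matters when a breeder keeps only the best candidates. The data and the accuracy level are synthetic assumptions, not results from the thesis:

```python
# Sketch: ranking-oriented evaluation of genomic predictions via
# Spearman rank correlation and top-10% selection overlap.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(6)
n = 1000
true_bv = rng.normal(size=n)                   # true breeding values
predicted = 0.7 * true_bv + rng.normal(scale=0.7, size=n)  # genomic predictions

rho, _ = spearmanr(true_bv, predicted)

top = int(0.10 * n)                            # breeder selects the top 10%
sel_true = set(np.argsort(true_bv)[-top:])
sel_pred = set(np.argsort(predicted)[-top:])
overlap = len(sel_true & sel_pred) / top

print(f"Spearman rank correlation = {rho:.2f}, top-10% overlap = {overlap:.2f}")
```

Two models with the same predictive correlation can differ noticeably in top-fraction overlap, which is why ranking criteria can change the operational verdict on a genomic evaluation scheme.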
|
30 |
Impact of pre-imputation SNP-filtering on genotype imputation results. Roshyara, Nab Raj; Kirsten, Holger; Horn, Katrin; Ahnert, Peter; Scholz, Markus. 10 September 2014.
Background: Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, research on and understanding of the impact of initial SNP-data quality control on imputation results is still limited. In this paper, we aim to evaluate the effect of different strategies of pre-imputation quality filtering on the performance of the widely used imputation algorithms MaCH and IMPUTE2. Results: We considered three scenarios: imputation of partially missing genotypes with and without an external reference panel, as well as imputation of completely untyped SNPs using an external reference panel. We first created various datasets by applying different SNP quality filters and masking certain percentages of randomly selected high-quality SNPs. We imputed these SNPs and compared the results between the different filtering scenarios using established and newly proposed measures of imputation quality. While the established measures assess the certainty of imputation results, our newly proposed measures focus on agreement with the true genotypes. These measures showed that pre-imputation SNP-filtering can be detrimental to imputation quality. Moreover, the strongest drivers of imputation quality were in general the burden of missingness and the number of SNPs used for imputation. We also found that using a reference panel always improves the imputation quality of partially missing genotypes. MaCH performed slightly better than IMPUTE2 in most of our scenarios; again, these results were more pronounced when using our newly defined measures of imputation quality. Conclusion: Even moderate filtering has a detrimental effect on imputation quality. Therefore, little or no SNP filtering prior to imputation appears to be the best strategy for imputing small to moderately sized datasets. Our results also showed that for these datasets, MaCH performs slightly better than IMPUTE2 in most scenarios, at the cost of increased computing time.
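The distinction between the two families of quality measures can be sketched as follows: a certainty-style score computed from the imputed genotype probabilities alone, versus agreement measures (best-guess concordance, dosage r²) computed against masked true genotypes. The "imputation output" below is simulated, not produced by MaCH or IMPUTE2:

```python
# Sketch: certainty vs. agreement-with-truth measures of genotype
# imputation quality on simulated per-SNP genotype probabilities.
import numpy as np

rng = np.random.default_rng(7)
n_snps = 2000
true_geno = rng.choice([0, 1, 2], size=n_snps, p=[0.49, 0.42, 0.09])

# Fake imputation output: probabilities for genotypes 0/1/2, mostly
# concentrated on the truth, with 15% of SNPs badly imputed.
probs = np.full((n_snps, 3), 0.1)
probs[np.arange(n_snps), true_geno] = 0.8
bad = rng.random(n_snps) < 0.15
probs[bad] = rng.dirichlet([1, 1, 1], size=bad.sum())

dosage = probs @ np.array([0.0, 1.0, 2.0])
best_guess = probs.argmax(axis=1)

certainty = probs.max(axis=1).mean()             # uses no truth at all
concordance = (best_guess == true_geno).mean()   # agreement with truth
dosage_r2 = np.corrcoef(dosage, true_geno)[0, 1] ** 2

print(f"mean certainty = {certainty:.3f}, concordance = {concordance:.3f}, "
      f"dosage r^2 = {dosage_r2:.3f}")
```

Certainty can stay high even when agreement is poor, which is why the paper's truth-based measures expose filtering effects that the established certainty measures smooth over.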
|