141 |
Estimation of Regression Coefficients under a Truncated Covariate with Missing Values. Reinhammar, Ragna, January 2019.
By means of a Monte Carlo study, this paper investigates the relative performance of Listwise Deletion, the EM algorithm and the default algorithm in the MICE package for R (PMM) in estimating regression coefficients with a left-truncated covariate that has missing values. The intention is to investigate whether these three frequently used missing data techniques are robust against left truncation when missing values are MCAR or MAR. The results suggest that no technique is uniformly superior across all combinations of factors studied. The EM algorithm is unaffected by left truncation under MCAR but negatively affected by strong left truncation under MAR. Compared to the default MICE algorithm, the performance of EM is more stable across distributions and combinations of sample size and missing rate. The default MICE algorithm is improved by left truncation but is sensitive to the missingness pattern and missing rate. Compared to Listwise Deletion, the EM algorithm is less robust against left truncation when missing values are MAR. However, the decline in performance of the EM algorithm is not large enough for it to be completely outperformed by Listwise Deletion, especially when the missing rate is moderate. Listwise Deletion may be robust against left truncation but is inefficient.
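As a hedged illustration of the kind of Monte Carlo comparison described above, the Python sketch below simulates a left-truncated covariate with MCAR missingness and contrasts listwise deletion with a chained-equations imputation. Everything here is an assumption for illustration: scikit-learn's IterativeImputer (BayesianRidge-based) stands in for the MICE/PMM algorithm, a single imputation replaces pooled multiple imputations, and all parameter values are toy choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(8)

def one_replication(n=500, miss_rate=0.3, truncation=-0.5):
    # Left-truncated covariate x1, a correlated helper x2, outcome y.
    x1 = rng.normal(size=4 * n)
    x1 = x1[x1 > truncation][:n]
    x2 = 0.6 * x1 + rng.normal(0, 0.8, n)
    y = 1.0 + 2.0 * x1 + rng.normal(0, 1.0, n)
    x1_obs = x1.copy()
    x1_obs[rng.random(n) < miss_rate] = np.nan         # MCAR missingness in x1

    def slope(x1v, yv):
        # OLS slope of y on x1 (with intercept).
        X = np.column_stack([np.ones_like(x1v), x1v])
        return np.linalg.lstsq(X, yv, rcond=None)[0][1]

    keep = ~np.isnan(x1_obs)                           # listwise deletion
    b_lw = slope(x1_obs[keep], y[keep])

    data = np.column_stack([x1_obs, x2, y])            # chained-equations imputation
    x1_imp = IterativeImputer(random_state=0).fit_transform(data)[:, 0]
    b_imp = slope(x1_imp, y)
    return b_lw, b_imp

results = np.array([one_replication() for _ in range(50)])
print("mean slope estimates (true value 2.0): "
      "listwise %.3f, imputed %.3f" % tuple(results.mean(axis=0)))
```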
|
142 |
Estimation of missing data by virtual metrology for the improvement of the Run-To-Run controller in the field of semiconductors. Jebri, Mohamed Ali, 26 January 2018.
This work addresses virtual metrology (VM) for estimating missing data during semiconductor manufacturing processes. Virtual metrology also provides software measurements (estimates) of the process outputs to feed the run-to-run (R2R) controllers set up for quality control of the manufactured products. To address the measurement delay caused by the static sampling imposed by the metrology strategy and the equipment in place, the contribution of this thesis is to introduce intelligent dynamic sampling. This strategy is based on an algorithm that uses a neighborhood condition to skip a physical measurement even when the static sampling plan requires one. This reduces the number of physical measurements, the cycle time and the cost of production. The approach is implemented in a virtual metrology (VM) module that we developed and that can be integrated into an R2R control loop. The results were validated on academic examples and on real data from a chemical mechanical planarization (CMP) process provided by our partner STMicroelectronics of Rousset. These real data also allowed the virtual metrology estimates to be validated and then supplied to the R2R controllers that require them.
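A minimal sketch of how such a neighborhood condition might look, assuming Euclidean distance over process-context vectors, an illustrative threshold eps, and a k-nearest-neighbour average as a stand-in for the VM model; none of this is the thesis's actual algorithm.

```python
import numpy as np

def needs_real_measurement(x_new, X_measured, eps=0.5):
    """Neighborhood condition: request a physical measurement only when no
    previously measured process context lies within distance eps of x_new."""
    if X_measured.shape[0] == 0:
        return True
    dists = np.linalg.norm(X_measured - x_new, axis=1)
    return dists.min() > eps

def vm_estimate(x_new, X_measured, y_measured, k=3):
    """Virtual-metrology estimate: average the outputs of the k nearest
    measured contexts (an illustrative stand-in for the VM model)."""
    k = min(k, len(y_measured))
    idx = np.argsort(np.linalg.norm(X_measured - x_new, axis=1))[:k]
    return y_measured[idx].mean()

rng = np.random.default_rng(0)
X_measured = rng.normal(size=(20, 3))            # contexts already measured
y_measured = X_measured @ [1.0, -0.5, 2.0] + rng.normal(0, 0.1, 20)

x_new = rng.normal(size=3)
if needs_real_measurement(x_new, X_measured, eps=1.0):
    print("measure physically")
else:
    print("skip metrology, VM estimate:", vm_estimate(x_new, X_measured, y_measured))
```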
|
143 |
Practical considerations for genotype imputation and multi-trait multi-environment genomic prediction in a tropical maize breeding program. Oliveira, Amanda Avelar de, 17 June 2019.
The availability of molecular markers covering the entire genome, such as single nucleotide polymorphism (SNP) markers, together with the computational resources for processing large amounts of data, enabled the development of an approach to marker-assisted selection for quantitative traits known as genomic selection. In the last decade, genomic selection has been successfully implemented in a wide variety of animal and plant species, showing its benefits over traditional marker-assisted selection and selection based only on pedigree information. However, some practical challenges may still limit the wide implementation of this method in a plant breeding program, for example the cost of high-density genotyping of a large number of individuals and the application of more complex models that take into account multiple traits and environments. Thus, this study aimed to i) investigate SNP calling and imputation strategies that allow cost-effective high-density genotyping, and ii) evaluate the application of multivariate genomic selection models to data from multiple traits and environments. This work was divided into two chapters. In the first chapter, we compared the accuracy of four imputation methods: NPUTE, Beagle, KNNI and FILLIN, using genotyping-by-sequencing (GBS) data from 1,060 maize inbred lines genotyped at different depths of coverage. In addition, two SNP calling and imputation strategies were evaluated. Our results indicated that combining SNP calling and imputation strategies can enhance cost-effective genotyping, resulting in higher imputation accuracies. In the second chapter, multivariate genomic selection models, for multiple traits and environments, were compared with their univariate versions. We used data from 415 hybrids evaluated in the second season in four years (2006-2009) for grain yield, number of ears and grain moisture. Hybrid genotypes were inferred in silico based on their parental inbred lines using SNP markers obtained via GBS. However, genotypic information was available for only 257 hybrids, motivating the use of the H matrix, which combines genetic information based on pedigree and molecular markers. Our results demonstrated that the use of multi-trait multi-environment models can improve predictive ability, especially for predicting the performance of hybrids that have not yet been evaluated in any environment.
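For the genotype-imputation side, the sketch below illustrates KNN-style imputation of a SNP matrix, in the spirit of KNNI; scikit-learn's generic KNNImputer, the 0/1/2 coding, the 20% masking rate and k = 5 are illustrative assumptions, not the thesis's exact pipeline.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
# Toy SNP matrix: 50 inbred lines x 200 markers coded 0/1/2 (minor-allele counts).
geno = rng.integers(0, 3, size=(50, 200)).astype(float)

# Mask 20% of calls at random to mimic missing genotypes.
mask = rng.random(geno.shape) < 0.20
geno_missing = geno.copy()
geno_missing[mask] = np.nan

# KNN imputation: each missing call is filled from the k most similar lines,
# then rounded back to the nearest genotype class.
imputed = KNNImputer(n_neighbors=5).fit_transform(geno_missing)
imputed = np.clip(np.rint(imputed), 0, 2)

accuracy = (imputed[mask] == geno[mask]).mean()
print(f"imputation accuracy on masked calls: {accuracy:.2f}")
```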
|
144 |
Physical and Mental Health Status of Adults with Serious Mental Illness Participating in a Jail Diversion Intervention. Telford, Robin, 01 May 2014.
Adults with mental illnesses are at increased risk of being diagnosed with one or more comorbid physical illnesses compared to the general population. Many of the disparities faced by adults with serious mental illnesses (SMI) can be attributed to medication side effects, increased risk for metabolic diseases, inability to communicate about severity and monitor physical health symptoms, poor health behaviors, high rates of smoking, and poor quality health care. The rate of physical illness among adults with mental illnesses is even higher among those who have been involved with the criminal justice system. To understand the relationship between physical and mental illnesses, longitudinal study designs are needed, as they can clarify the temporal relationship between physical and mental illnesses. Despite the benefits of longitudinal studies, they also pose challenges, including missing data.
The first manuscript of this dissertation explores the physical and mental health status of adults with mental illnesses. Secondary data were used from three different studies: a sample of adults with SMI enrolled in a mental health court jail diversion program (n=91); a sample of Medicaid enrollees with SMI in Florida (n=688) who were part of a larger Substance Abuse and Mental Health Services Administration (SAMHSA) study; and a sample of inpatient and outpatient adults with SMI from five different study sites (n=969). The samples were combined into two data sets, consisting of the jail diversion sample and the SAMHSA sample, and the jail diversion sample and the 5-site sample. Participants in these samples answered questions on the Short-Form Health Survey (SF-12), recent arrests, drug and alcohol use, socio-demographic information, and mental illness symptom severity (measured only in the criminal justice and 5-site samples).
Overall, the mental and physical health status scores were significantly lower for all of the participants compared to the general population mean scores. The participants reporting a recent arrest had a higher physical health score compared to those who did not have a recent arrest, and in the jail diversion and 5-site sample, had a lower mental health status score than those without a recent arrest. After taking age, drug and alcohol use, and psychiatric symptom severity into account, arrest was no longer associated with the physical health status score in either of the data sets. In the jail diversion and 5-site data set, arrest was still significantly associated with mental health status score after controlling for age, drug and alcohol use, and psychiatric symptom severity.
The second manuscript of this dissertation explores the analysis of missing data in a longitudinal study to determine the missing data mechanisms and missing data patterns, and subsequently, how to prepare the data for analysis by using multiple imputation or maximum likelihood estimation. Secondary data were drawn from the same jail diversion sample as in the first manuscript. Data were collected at baseline, three months, six months, and nine months. Only participants with the potential to have data collected at these time points were included (n=50).
Analysis revealed missing data due to missing item-level information, missing participant data at one time point but complete data at a subsequent time point, and missing participant data for those who dropped out of the study completely. The missing data mechanism for the item-level data was missing completely at random, whereas the participant-level missing data were missing at random. Multiple imputation was used for the item-level data and for the participant-level missing data. Maximum likelihood estimation was also used for the participant-level missing data and compared to the multiple imputation results. Findings suggest that multiple imputation produced more accurate parameter estimates, possibly due to the small sample size.
The findings from this study indicate that more research is needed to fully understand the physical illnesses experienced by adults with mental illnesses who are involved with the criminal justice system. Understanding mental and physical illness comorbidity is important in public health, as it dictates appropriate treatments and training for behavioral health practitioners and staff. In addition, missing data in longitudinal studies cannot be ignored, since they can bias the results, and appropriate techniques for exploring the missing data must be used. When missing data are ignored in analyses, the results can be incorrect and may fail to detect treatment effects, preventing effective programs from receiving necessary funding. Ignoring missing data can also affect funding for behavioral health services by underestimating the prevalence and severity of mental illnesses. Future research should focus on exploring how mental and physical health are related in adults with a recent arrest compared to the general population, and on ways to integrate services that address both.
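Since the manuscript compares multiple imputation with maximum likelihood estimation, a minimal sketch of how multiply imputed estimates are pooled with Rubin's rules may be useful; the five estimates and variances below are fabricated toy numbers for illustration only.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool point estimates and within-imputation variances from m imputed
    data sets using Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()                      # pooled point estimate
    ubar = variances.mean()                      # within-imputation variance
    b = estimates.var(ddof=1)                    # between-imputation variance
    t = ubar + (1 + 1 / m) * b                   # total variance
    return qbar, np.sqrt(t)

# Toy example: a regression coefficient estimated on m = 5 imputed data sets.
est = [0.42, 0.38, 0.45, 0.40, 0.44]
var = [0.010, 0.012, 0.011, 0.009, 0.010]
coef, se = pool_rubin(est, var)
print(f"pooled coefficient: {coef:.3f} (SE {se:.3f})")
```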
|
145 |
A Study of Missing Data Imputation and Predictive Modeling of Strength Properties of Wood Composites. Zeng, Yan, 01 August 2011.
Problem: Real-time process and destructive test data were collected from a wood composite manufacturer in the U.S. to develop real-time predictive models of two key strength properties, Modulus of Rupture (MOR) and Internal Bond (IB), of a wood composite manufacturing process. Sensor malfunctions and data send/retrieval problems led to null fields in the company's data warehouse, resulting in information loss. Many manufacturers attempt to build accurate predictive models by excluding entire records with null fields or by substituting summary statistics such as the mean or median for the null fields. However, predictive model errors in validation may be higher in the presence of such information loss. In addition, the selection of predictive modeling methods poses another challenge to many wood composite manufacturers.
Approach: This thesis consists of two parts addressing the above issues: 1) how to improve data quality using missing data imputation; and 2) which predictive modeling method is better in terms of prediction precision, measured by root mean square error of prediction (RMSEP). The first part summarizes an application of missing data imputation methods in predictive modeling. After variable selection, two missing data imputation methods were selected from six candidate methods. Predictive models on the imputed data were developed using partial least squares regression (PLSR) and compared with models on the non-imputed data using ten-fold cross-validation. Root mean square error of prediction (RMSEP) and normalized RMSEP (NRMSEP) were calculated. The second part presents a series of comparisons among four predictive modeling methods using imputed data without variable selection.
Results: The first part concludes that the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov chain Monte Carlo (MCMC) simulation achieved the most precise results of the methods compared. Predictive models based on imputed datasets generated more precise predictions (average NRMSEP of 5.8% for the MOR model and 7.2% for the IB model) than models based on non-imputed datasets (average NRMSEP of 6.3% for MOR and 8.1% for IB). The second part finds that Bayesian Additive Regression Trees (BART) produced more precise predictions (average NRMSEP of 7.7% for the MOR model and 8.6% for the IB model) than the other three methods: PLSR, the LASSO, and the adaptive LASSO.
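A minimal sketch of the precision metric used above: ten-fold cross-validated predictions from a PLS model, summarized as RMSEP and NRMSEP. The synthetic process data, the choice of three PLS components, and normalization of RMSEP by the response range are assumptions for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
# Toy stand-in for real-time process variables (X) and a strength property (y).
X = rng.normal(size=(300, 15))
y = X[:, :3] @ [2.0, -1.0, 0.5] + rng.normal(0, 0.5, 300)

# Ten-fold cross-validated predictions from a PLS regression model.
pls = PLSRegression(n_components=3)
y_hat = cross_val_predict(pls, X, y, cv=10).ravel()

rmsep = np.sqrt(np.mean((y - y_hat) ** 2))
nrmsep = rmsep / (y.max() - y.min())           # normalized by the response range
print(f"RMSEP: {rmsep:.3f}, NRMSEP: {nrmsep:.1%}")
```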
|
146 |
Jackknife Empirical Likelihood Method and its Applications. Yang, Hanfang, 01 August 2012.
In this dissertation, we investigate jackknife empirical likelihood methods motivated by recent research in statistics and related fields. The computational intensity of empirical likelihood can be significantly reduced by using jackknife empirical likelihood methods without losing accuracy or stability. We demonstrate that the proposed jackknife empirical likelihood methods can handle several challenging open problems, with elegant asymptotic properties and accurate simulation results in finite samples. These problems include ROC curves with missing data, the difference of two ROC curves for two-dimensional correlated data, a novel inference for the partial AUC, and the difference of two quantiles with one or two samples. In addition, empirical likelihood methodology can be successfully applied to the linear transformation model using adjusted estimating equations. Comprehensive simulation studies of coverage probabilities and average interval lengths for these topics demonstrate that the proposed jackknife empirical likelihood methods perform well in finite samples under various settings. Moreover, several related real-data problems are studied to support our conclusions. Finally, we discuss some promising and feasible extensions of our jackknife EL procedures for future study.
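A minimal sketch of the jackknife empirical likelihood idea for a one-sample statistic (here the variance): jackknife pseudo-values are treated as approximately i.i.d. and standard empirical likelihood for a mean is applied to them. The statistic, sample size, and Brent root-finding for the Lagrange multiplier are illustrative choices, not any of the dissertation's specific ROC or quantile procedures.

```python
import numpy as np
from scipy.optimize import brentq

def jackknife_pseudovalues(data, stat):
    """Jackknife pseudo-values: n*T(all) - (n-1)*T(leave-one-out)."""
    n = len(data)
    t_full = stat(data)
    loo = np.array([stat(np.delete(data, i)) for i in range(n)])
    return n * t_full - (n - 1) * loo

def neg2_log_el_ratio(v, mu):
    """-2 log empirical likelihood ratio for the mean of v at mu."""
    z = v - mu
    if z.min() >= 0 or z.max() <= 0:
        return np.inf                                # mu outside the convex hull
    # Solve sum z_i / (1 + lam * z_i) = 0 for the Lagrange multiplier lam.
    lo = (-1 + 1e-10) / z.max()
    hi = (-1 + 1e-10) / z.min()
    lam = brentq(lambda l: np.sum(z / (1 + l * z)), lo, hi)
    return 2 * np.sum(np.log1p(lam * z))

rng = np.random.default_rng(3)
x = rng.normal(size=60)

v = jackknife_pseudovalues(x, np.var)                # pseudo-values of the variance
stat = neg2_log_el_ratio(v, mu=1.0)                  # test H0: variance = 1
print(f"-2 log EL ratio at variance = 1: {stat:.3f} (chi2_1 critical value: 3.84)")
```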
|
147 |
Weather data for building simulation: New actual weather files for North Europe combining observed weather and modeled solar radiation. Lundström, Lukas, January 2012.
Dynamic building simulation is increasingly necessary for accurately quantifying potential energy saving measures in retrofit projects, in order to comply with the new, stricter EU directives implemented in member states' legislation and building codes. For good results, the simulation model needs to be accurately calibrated. This requires actual weather data, representative of the climate surrounding the given building, so that the model can be calibrated against actual energy bills from the same period. The main objective of this degree project is to combine observed weather data (temperature, humidity, wind, etc.) with modeled solar radiation data from the SMHI STRÅNG model system, and to transform these data into AMY (Actual Meteorological Year) files for use with building simulation software. This procedure yields actual weather datasets that cover most urban and semi-urban areas in Northern Europe while keeping the accuracy of observed weather data. A tool called the Real-Time Weather Converter was developed to handle data retrieval and merging, filling of missing data points, and creation of the final AMY file. Modeled solar radiation data from STRÅNG had previously been validated only against a Swedish solar radiation network; the author now validated it with wider geographic coverage. The results show that the STRÅNG model system performs well for Sweden but less so outside it. Some areas outside Sweden (mainly Central Europe) show reasonably good results for some periods, but the results are not as consistent in the long run as for Sweden. The missing-data fill scheme developed for the Real-Time Weather Converter performs better than interpolation for outdoor-temperature gaps of about 9 to 48 hours. For gaps between 2 and 5 days, the fill scheme still gives slightly better results than linear interpolation. Akima spline interpolation performs better than linear interpolation for outdoor-temperature gaps in the interval of 2 to about 8 hours. Temperature uncertainty was studied using data from the period 1981-2010 for selected sites. The uncertainty in yearly mean temperature, expressed as standard deviation (SD), is about 1 °C for the Nordic countries. On a monthly basis the variation in mean temperature is much stronger (for the Nordic countries it ranges from 3.5 to 4.7 °C for winter months), while summer months vary less (SD in the range of 1.3 to 1.9 °C). The same pattern is visible at more southern latitudes but with much lower variation, lower still for sites near coastal areas; for example, coastal Camborne, UK, has a monthly SD of 0.7 to 1.7 °C and a yearly SD of 0.5 °C. Mean direct irradiance SD for the studied sites ranges from 5 to 19 W/m2 on a yearly basis, and from 40 to 60 W/m2 for summer months on a monthly basis. However, the sample base was small and covered inconsistent time periods, so these numbers are only indicative. The direct radiation parameter of the commonly used IWEC (International Weather for Energy Calculations) files was found to have a very strong negative bias of about 20 to 40% for Northern Europe. These files should be used with care, especially if solar radiation has a significant impact on the building being modeled. Note that a newer set of files, IWEC2, can be purchased from ASHRAE; these files do not seem to be systematically biased for North Europe but were not studied in this paper.
The STRÅNG model system does capture the trend, also outside Sweden, and is thus a very useful source of solar radiation data for model calibration.
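A minimal sketch of the gap-filling comparison reported above, contrasting linear and Akima spline interpolation over a short outdoor-temperature gap; the synthetic diurnal temperature series and the 6-hour gap are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import Akima1DInterpolator

rng = np.random.default_rng(4)
hours = np.arange(0, 240)                      # ten days of hourly records
temp = 5 + 8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.5, hours.size)

# Knock out a 6-hour gap to mimic missing observations.
gap = slice(100, 106)
known = np.ones(hours.size, dtype=bool)
known[gap] = False

linear = np.interp(hours[gap], hours[known], temp[known])
akima = Akima1DInterpolator(hours[known], temp[known])(hours[gap])

for name, fill in [("linear", linear), ("akima", akima)]:
    rmse = np.sqrt(np.mean((fill - temp[gap]) ** 2))
    print(f"{name:6s} RMSE over the gap: {rmse:.2f} degC")
```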
|
148 |
Detecting Disguised Missing Data. Belen, Rahime, 01 February 2009.
In some applications, explicit codes are provided for missing data, such as NA (not available); however, many applications do not provide such explicit codes, and missing values are instead recorded as legitimate data values. Such missing values are known as disguised missing data. Disguised missing data may negatively affect the quality of data analysis; for example, the results of discovered association rules in the KDD-Cup-98 data sets clearly showed the need for data quality management prior to analysis. In this thesis, to tackle the problem of disguised missing data, we analyze the embedded unbiased sample heuristic (EUSH), demonstrate the method's drawbacks, and propose a new methodology based on the chi-square two-sample test. The proposed method does not require any domain background knowledge and compares favorably with EUSH.
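A minimal sketch of the intuition behind a chi-square two-sample test for disguised missing data: if a candidate value is a disguise recorded at random, the records carrying it should look like a random sample of the whole data set, so another attribute's distribution in that group should not differ from the rest. The toy attributes, the 15% disguise rate, and the use of scipy's chi2_contingency are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(5)
n = 4000
age = rng.choice(["young", "old"], size=n, p=[0.5, 0.5])
# True income band, correlated with age in this toy data set.
income = np.where(age == "young",
                  rng.choice(["low", "high"], n, p=[0.7, 0.3]),
                  rng.choice(["low", "high"], n, p=[0.3, 0.7]))

# Disguise: 15% of respondents skip the question and the form records "0".
income = income.astype(object)
income[rng.random(n) < 0.15] = "0"

def two_sample_chi2(candidate):
    """Compare the age distribution of records holding `candidate`
    against all other records."""
    in_group = income == candidate
    table = [[np.sum((age == a) & in_group) for a in ("young", "old")],
             [np.sum((age == a) & ~in_group) for a in ("young", "old")]]
    return chi2_contingency(table)[1]

for value in ("0", "low"):
    p = two_sample_chi2(value)
    verdict = ("looks like a random sample -> possible disguise"
               if p > 0.05 else "correlates with age -> likely genuine")
    print(f"income == {value!r}: p = {p:.3f} ({verdict})")
```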
|
149 |
Design and analysis of response selective samples in observational studies. Grünewald, Maria, January 2011.
Outcome-dependent sampling may increase efficiency in observational studies. It is, however, not always obvious how to sample efficiently, or how to analyze the resulting data without introducing bias. This thesis describes a general framework for efficiency calculations in multistage sampling, with a focus on what is sometimes referred to as ascertainment sampling. A method for correcting for the sampling scheme in the analysis of ascertainment samples is also presented. Simulation-based methods are used to overcome computational issues in both the efficiency calculations and the analysis of data. (At the time of the doctoral defense, Paper 1 was unpublished, with status: submitted.)
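A toy illustration of outcome-dependent (ascertainment-style) sampling and one standard correction, inverse-probability weighting; the thesis's own likelihood-based correction and efficiency framework are more general than this sketch, and all numbers below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 200_000
x = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-(-3.0 + 1.0 * x[:, 0])))   # rare outcome, true slope 1.0
y = rng.random(n) < p

# Ascertainment: keep every case but only 5% of controls.
keep = y | (rng.random(n) < 0.05)
xs, ys = x[keep], y[keep]
w = np.where(ys, 1.0, 1 / 0.05)                 # inverse sampling probabilities

naive = LogisticRegression(C=1e6).fit(xs, ys)            # ignores the design
ipw = LogisticRegression(C=1e6).fit(xs, ys, sample_weight=w)

print("naive intercept/slope:", naive.intercept_[0].round(2), naive.coef_[0, 0].round(2))
print("IPW   intercept/slope:", ipw.intercept_[0].round(2), ipw.coef_[0, 0].round(2))
```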
|
150 |
Bayesian model estimation and comparison for longitudinal categorical data. Tran, Thu Trung, January 2008.
In this thesis, we address issues of model estimation for longitudinal categorical data and of model selection for these data with missing covariates. Longitudinal survey data capture the responses of each subject repeatedly through time, allowing the variation in the measured variable of interest across time for one subject to be separated from the variation in that variable among all subjects. Questions concerning persistence, patterns of structure, interaction of events and stability of multivariate relationships can be answered through longitudinal data analysis. Longitudinal data require special statistical methods because they must take into account the correlation between observations recorded on one subject. A further complication in analysing longitudinal data is accounting for the non-response or drop-out process. Potentially, the missing values are correlated with variables under study and hence cannot simply be excluded. Firstly, we investigate a Bayesian hierarchical model for the analysis of categorical longitudinal data from the Longitudinal Survey of Immigrants to Australia. Data for each subject are observed on three separate occasions, or waves, of the survey. One feature of the data set is that observations for some variables are missing for at least one wave. A model for the employment status of immigrants is developed by introducing, at the first stage of a hierarchical model, a multinomial model for the response; subsequent terms are then introduced to explain wave and subject effects. To estimate the model, we use the Gibbs sampler, which allows missing data for both the response and explanatory variables to be imputed at each iteration of the algorithm, given appropriate prior distributions. After accounting for significant covariate effects in the model, the results show that the relative probability of remaining unemployed diminished with time following arrival in Australia. Secondly, we examine the Bayesian model selection techniques of the Bayes factor and the Deviance Information Criterion for our regression models with missing covariates. Computing Bayes factors involves computing the often complex marginal likelihood p(y|model), and various authors have presented methods to estimate this quantity. Here, we take the approach of path sampling via power posteriors (Friel and Pettitt, 2006). The appeal of this method is that for hierarchical regression models with missing covariates, a common occurrence in longitudinal data analysis, it is straightforward to calculate and interpret, since integration over all parameters, including the imputed missing covariates and the random effects, is carried out automatically with minimal added complexities of modelling or computation. We apply this technique to compare models for the employment status of immigrants to Australia. Finally, we develop a model choice criterion based on the Deviance Information Criterion (DIC), similar to Celeux et al. (2006), but suitable for use with generalized linear models (GLMs) when covariates are missing at random. We define three different DICs: the marginal, where the missing data are averaged out of the likelihood; the complete, where the joint likelihood for response and covariates is considered; and the naive, where the likelihood is found assuming the missing values are parameters. These three versions have different computational complexities.
We investigate through simulation the performance of these three DICs for GLMs with normally, binomially and multinomially distributed responses and missing covariates having a normal distribution. We find that the marginal DIC and the estimate of the effective number of parameters, pD, have desirable properties, appropriately indicating the true model for the response under differing amounts of missingness in the covariates. We find that the complete DIC is generally inappropriate in this context, as it is extremely sensitive to the degree of missingness of the covariate model. Our new methodology is illustrated by analysing the results of a community survey.
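A minimal sketch of the basic DIC computation from posterior draws, DIC = Dbar + pD with pD = Dbar - D(theta_bar); the "posterior draws" below are fabricated for a toy normal-mean model, and the marginal/complete/naive variants discussed in the thesis differ only in which likelihood enters the deviance.

```python
import numpy as np
from scipy.stats import norm

def dic(deviance_draws, theta_draws, deviance_at):
    """DIC = Dbar + pD, with pD = Dbar - D(theta_bar), computed from
    posterior draws of the parameters."""
    dbar = deviance_draws.mean()                   # posterior mean deviance
    d_hat = deviance_at(theta_draws.mean(axis=0))  # deviance at posterior mean
    pd = dbar - d_hat                              # effective number of parameters
    return dbar + pd, pd

# Toy example: normal model for data y with unknown mean, known sd = 1,
# using fabricated "posterior draws" to stand in for MCMC output.
rng = np.random.default_rng(7)
y = rng.normal(0.3, 1.0, size=100)
mu_draws = rng.normal(y.mean(), 1 / np.sqrt(len(y)), size=(5000, 1))

def deviance(mu):
    return -2 * norm.logpdf(y, loc=mu[0], scale=1.0).sum()

dev_draws = np.array([deviance(m) for m in mu_draws])
dic_value, pd = dic(dev_draws, mu_draws, deviance)
print(f"DIC = {dic_value:.1f}, pD = {pd:.2f}  (pD should be near 1)")
```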
|