61

Model selection criteria in the presence of missing data based on the Kullback-Leibler discrepancy

Sparks, JonDavid 01 December 2009 (has links)
An important challenge in statistical modeling involves determining an appropriate structural form for a model to be used in making inferences and predictions. Missing data is a very common occurrence in most research settings and can easily complicate the model selection problem. Many useful procedures have been developed to estimate parameters and standard errors in the presence of missing data; however, few methods exist for determining the actual structural form of a model when the data is incomplete. In this dissertation, we propose model selection criteria based on the Kullback-Leibler discrepancy that can be used in the presence of missing data. The criteria are developed by accounting for missing data using principles related to the expectation maximization (EM) algorithm and bootstrap methods. We formulate the criteria for three specific modeling frameworks: the normal multivariate linear regression model, the generalized linear model, and the normal longitudinal regression model. In each framework, a simulation study is presented to investigate the performance of the criteria relative to their traditional counterparts. We consider a setting where the missingness is confined to the outcome, and also a setting where the missingness may occur in the outcome and/or the covariates. The results from the simulation studies indicate that our criteria provide better protection against underfitting than their traditional analogues. We outline the implementation of our methodology for a general discrepancy measure. An application is presented where the proposed criteria are utilized in a study that evaluates the driving performance of individuals with Parkinson's disease under low-contrast (fog) conditions in a driving simulator.
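For context, the Kullback-Leibler discrepancy that criteria of this family estimate, and the AIC-type form it leads to, can be written as follows (standard definitions, not specific to this dissertation):

\[
d(\theta) = \mathrm{E}_0\!\left[-2\log f(Y \mid \theta)\right], \qquad \mathrm{AIC} = -2\log L(\hat{\theta}) + 2k,
\]

where \(\mathrm{E}_0\) denotes expectation under the true generating model, \(\hat{\theta}\) is the maximum likelihood estimate, and \(k\) is the number of model parameters. The criteria proposed here adapt estimators of this discrepancy to incomplete data via EM and bootstrap principles.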
62

Longitudinal Analysis of Resource Competitiveness and Homelessness Among Young Adults

Prante, Matt F. 01 August 2013 (has links)
Homelessness occurs when individual resources are not enough for the demands of a given environment. Exploring homelessness as a process of resource loss on a continuum of poverty leads to research and explanations concerning how people transition from being housed to being homeless. This study assessed the influence of age, gender, and race, along with a set of eleven resource competitiveness variables, on the risk of youth becoming homeless. Resource competitiveness variables were: parental income, personal income, possession of a driver's license (DL), live-in partner, parenthood, education and training, annual weeks-employed, substance abuse, and incarceration history. The data came from the Bureau of Labor Statistics' National Longitudinal Survey of Youth 1997 (NLSY97). The sample was restricted to those who were homeless or unstably housed and were between the ages of 18 and 24 (n = 141). Each case was then matched by age, gender, and race to two individuals randomly selected from the remaining NLSY97 sample (n = 282), resulting in an overall N of 423. A growth model was used to analyze the data longitudinally. Partnership, education and training, DL, annual weeks-employed, and personal income were significantly associated with experiences of homelessness and unstable housing. All were negatively related, except for age, which was positively related to incidents of homelessness and unstable housing. Comparisons across the homeless, unstably housed, and control samples showed incremental changes in nearly all the covariates in this study in relation to changes in housing status, supporting the importance of studying homelessness as a point on a continuum of resource loss rather than as a discrete state of being.
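For reference, a standard two-level linear growth model of the kind used in such longitudinal analyses has the generic form (the study's exact specification may differ):

\[
y_{ti} = \pi_{0i} + \pi_{1i}\,t + \varepsilon_{ti}, \qquad \pi_{0i} = \beta_{00} + \beta_{01}x_i + r_{0i}, \qquad \pi_{1i} = \beta_{10} + \beta_{11}x_i + r_{1i},
\]

where \(y_{ti}\) is the housing outcome for person \(i\) at wave \(t\), and \(x_i\) stands for person-level covariates such as the resource competitiveness variables above.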
63

Comparação de método de imputação para dados de precipitação diária / Comparison of imputation method for daily precipitation data

Teodoro, Valiana Alves 28 August 2019 (has links)
The main causes of reduced agricultural productivity are climatic events, and precipitation is the meteorological variable of greatest importance for agricultural production. Discontinuity and missing data are common problems in meteorological databases. In this context, grid-point (Gridpoint) precipitation data are an excellent source of information for climatological research. To overcome missing-data problems and build a complete database, an imputation process is required. The objective of this work was therefore to compare imputation methodologies, using univariate and multiple approaches, and to evaluate their performance in different missing-data scenarios, with the root mean square error (RMSE) as the metric.
For daily precipitation series with missing data, imputation was carried out with multiple imputation by chained equations (MICE), using information on month, year, and grid-point precipitation. Four models were considered, in which daily precipitation depended on: month; month and year; grid-point precipitation; and month, year, and grid-point precipitation. The RMSE was used as the metric, and the imputations were checked by comparing the observed and imputed data with the Kolmogorov-Smirnov test and with plots of the mean and variance of the imputations. The model with the largest number of variables was chosen to impute the missing data in the daily precipitation series. In this work, grid-point precipitation data proved valuable as auxiliary information for imputing daily precipitation series. For a complete daily precipitation series, the study then compared and evaluated imputation methods under the univariate and multiple approaches. In the univariate approach, different configurations of the Kalman filter, the Weighted Moving Average, and Seasonal Decomposition were used. In the multiple approach, the MICE method was used with different models. Missing data were generated in a daily precipitation series both at random and in consecutive blocks, and the RMSE was used as the metric. The results showed that the Kalman filter provided the lowest RMSE values in all missing-data scenarios and produced the best estimates of daily precipitation values. The Kalman filter can thus be an important methodology for imputing daily precipitation data, guaranteeing a complete time series for analyses in several sectors, among them agriculture.
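As a rough illustration of the multiple-imputation side of this comparison, below is a minimal MICE-style sketch using scikit-learn's IterativeImputer as a stand-in for the MICE implementation (an assumption: the abstract does not name its software), with hypothetical column names and simulated data:

```python
# A minimal MICE-style sketch: IterativeImputer fills each incomplete column
# from the others by chained regressions. The column names and the simulated
# relation between grid-point and station precipitation are assumptions.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
days_per_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
n = sum(days_per_month)  # 365

df = pd.DataFrame({
    "month": np.repeat(np.arange(1, 13), days_per_month),
    "year": np.full(n, 2018),
    "grid_precip": rng.gamma(shape=0.8, scale=5.0, size=n),
})
# Station rainfall tracks the grid-point product plus noise (assumed model).
df["daily_precip"] = 0.9 * df["grid_precip"] + rng.normal(scale=1.0, size=n)
truth = df["daily_precip"].copy()

# Delete 15% of the station series completely at random (MCAR).
missing = rng.random(n) < 0.15
df.loc[missing, "daily_precip"] = np.nan

imputer = IterativeImputer(max_iter=20, sample_posterior=True, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# RMSE on the artificially deleted entries, the metric used in the thesis.
rmse = np.sqrt(np.mean((completed.loc[missing, "daily_precip"] - truth[missing]) ** 2))
print(f"RMSE on imputed entries: {rmse:.2f}")
```

With sample_posterior=True, repeated fits under different seeds yield multiple completed data sets, which is the essence of multiple imputation.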
64

Learning from Incomplete Data

Ghahramani, Zoubin, Jordan, Michael I. 24 January 1995 (has links)
Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing features. In this paper we review the problem of learning from incomplete data from two statistical perspectives: the likelihood-based and the Bayesian. The goal is two-fold: to place current neural network approaches to missing data within a statistical framework, and to describe a set of algorithms, derived from the likelihood-based framework, that handle clustering, classification, and function approximation from incomplete data in a principled and efficient manner. These algorithms are based on mixture modeling and make two distinct appeals to the Expectation-Maximization (EM) principle (Dempster, Laird, and Rubin 1977), both for the estimation of mixture components and for coping with the missing data.
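In that framework, each EM iteration takes an expectation over everything unobserved, both the mixture indicators \(Z\) and the missing feature values, and then maximizes (standard form):

\[
Q(\theta \mid \theta^{(t)}) = \mathrm{E}\!\left[\log p(X^{\mathrm{obs}}, X^{\mathrm{mis}}, Z \mid \theta) \,\middle|\, X^{\mathrm{obs}}, \theta^{(t)}\right], \qquad \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}).
\]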
65

Anomaly detection in unknown environments using wireless sensor networks

Li, YuanYuan 01 May 2010 (has links)
This dissertation addresses the problem of distributed anomaly detection in Wireless Sensor Networks (WSN). A challenge in designing such systems is that the sensor nodes are battery powered, often have different capabilities, and generally operate in dynamic environments. Programming such sensor nodes at a large scale can be a tedious job if the system is not carefully designed. Data modeling in distributed systems is important for determining the normal operation mode of the system. Being able to model the expected sensor signatures for typical operations greatly simplifies the human designer's job by enabling the system to autonomously characterize the expected sensor data streams. This, in turn, allows the system to perform autonomous anomaly detection to recognize when unexpected sensor signals are detected. This type of distributed sensor modeling can be used in a wide variety of sensor networks, such as for detecting the presence of intruders, detecting sensor failures, and so forth. The advantage of this approach is that the human designer does not have to characterize the anomalous signatures in advance. The contributions of this approach include: (1) providing a way for a WSN to autonomously model sensor data with no prior knowledge of the environment; (2) enabling a distributed system to detect anomalies in both sensor signals and temporal events online; (3) providing a way to automatically extract semantic labels from temporal sequences; (4) providing a way for WSNs to save communication power by transmitting compressed temporal sequences; (5) enabling the system to detect time-related anomalies without prior knowledge of abnormal events; and (6) providing a novel missing-data estimation method that utilizes temporal and spatial information to replace missing values. The algorithms have been designed, developed, evaluated, and validated experimentally on synthesized data and in real-world sensor network applications.
66

Study and validation of data structures with missing values. Application to survival analysis

Serrat i Piè, Carles 21 May 2001 (has links)
In this work we have approached three different methodologies, nonparametric, parametric and semiparametric, to deal with data patterns with missing values in a survival analysis context.
The first two approaches have been developed under the assumption that the investigator has enough information to assume that the non-response mechanism is MCAR (Missing Completely at Random) or MAR (Missing at Random). In this situation, we have adapted a bootstrap and bilinear multiple imputation scheme to draw the distribution of the parameters of interest. We have also analyzed the drawbacks encountered in obtaining correct inferences, and we have proposed some strategies to take into account the information provided by other fully observed covariates. However, in many situations it is impossible to assume the ignorability of the non-response probabilities. We therefore focus on developing a method for survival analysis when we have a non-ignorable non-response pattern, using a semiparametric perspective. First, for right-censored samples with completely observed covariates, we propose the Grouped Kaplan-Meier estimator (GKM) as an alternative to the standard KM estimator when we are interested in the survival at a finite number of fixed times of interest. However, when the covariates are partially observed, neither the stratified GKM estimator nor the stratified KM estimator can be computed directly from the sample. Hence, we propose a class of estimating equations to obtain semiparametric estimates of these probabilities, and we substitute these estimates into the stratified GKM estimator. We refer to this new estimation procedure as the Estimated Grouped Kaplan-Meier (EGKM) estimator. We prove that the GKM and EGKM estimators are square-root consistent and asymptotically normally distributed, and a consistent estimator for their limiting variances is derived. The advantage of the EGKM estimator is that it provides asymptotically unbiased estimates of the survival under a flexible selection model for the non-response probability pattern. We illustrate the method with a cohort of HIV-infected patients with tuberculosis. At the end of the application, a sensitivity analysis that includes all types of non-response patterns, from MCAR to non-ignorable, allows the investigator to draw conclusions after analyzing all the plausible scenarios and evaluating the impact of the non-ignorable assumptions in the non-response mechanism on the resulting inferences. We close the semiparametric approach by exploring the behaviour of the EGKM estimator for finite samples through a simulation study. Simulations performed under scenarios with different levels of censoring, non-response probability patterns and sample sizes show the good properties of the proposed estimator. For instance, the empirical coverage probabilities tend to the nominal ones when the non-response pattern used in the analysis is close to the true non-response pattern that generated the data. In particular, the estimator is especially efficient in the least informative scenarios (e.g., around 80% censoring and 50% missing data).
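For reference, the standard Kaplan-Meier estimator that the GKM and EGKM estimators build on is

\[
\hat{S}(t) = \prod_{t_i \le t}\left(1 - \frac{d_i}{n_i}\right),
\]

where the \(t_i\) are the observed event times, \(d_i\) is the number of events at \(t_i\), and \(n_i\) is the number of subjects at risk just before \(t_i\); the grouped variants evaluate survival only at a finite set of fixed times.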
67

A Monte Carlo Study: The Impact of Missing Data in Cross-Classification Random Effects Models

Alemdar, Meltem 12 August 2009 (has links)
Unlike multilevel data with a purely nested structure, cross-classified data may not only be clustered into hierarchically ordered units but may also belong to more than one unit at a given level of a hierarchy. In a cross-classified design, students at a given school might come from several different neighborhoods, and one neighborhood might have students who attend a number of different schools. In this type of scenario, schools and neighborhoods are considered to be cross-classified factors, and cross-classified random effects modeling (CCREM) should be used to analyze these data appropriately. A common problem in any type of multilevel analysis is the presence of missing data at any given level. Little research has been conducted in the multilevel literature about the impact of missing data, and none in the area of cross-classified models. The purpose of this study was to examine the effect of data that are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) on CCREM estimates, while exploring multiple imputation to handle the missing data. In addition, this study examined the impact of including an auxiliary variable that is correlated with the variable with missingness (the level-1 predictor) in the imputation model for multiple imputation. This study expanded on the CCREM Monte Carlo simulation work of Meyers (2004) by studying the effect of missing data and methods for handling these missing data with CCREM. The results demonstrated that, in general, multiple imputation met Hoogland and Boomsma's (1998) relative-bias criterion for parameter estimates (less than 5% in magnitude) under the different types of missing-data patterns. For the standard error estimates, substantial relative bias (defined by Hoogland and Boomsma as greater than 10%) was found in some conditions. When multiple imputation was used to handle the missing data, substantial bias was found in the standard errors in most cells where data were MNAR. This bias increased as a function of the percentage of missing data.
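The multiple-imputation estimates examined here are pooled across \(m\) completed data sets by Rubin's rules (standard forms):

\[
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad T = \bar{U} + \left(1 + \frac{1}{m}\right)B, \qquad B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2,
\]

where \(\hat{Q}_j\) is the estimate from the \(j\)th imputed data set, \(\bar{U}\) is the average within-imputation variance, and \(B\) is the between-imputation variance; the standard-error bias reported above reflects how well the total variance \(T\) tracks the true sampling variance.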
68

Statistical Evaluation of Continuous-Scale Diagnostic Tests with Missing Data

Wang, Binhuan 12 June 2012 (has links)
The receiver operating characteristic (ROC) curve is the standard statistical methodology for assessing the accuracy of diagnostic tests or biomarkers. Currently, the most widely used statistical methods for inference on ROC curves are complete-data-based parametric, semiparametric, or nonparametric methods. However, these methods cannot be used in diagnostic applications with missing data. In practice, missing diagnostic data occur commonly, for reasons such as medical tests being too expensive, too time-consuming, or too invasive. This dissertation aims to develop new nonparametric statistical methods for evaluating the accuracy of diagnostic tests or biomarkers in the presence of missing data. Specifically, novel nonparametric statistical methods will be developed for different types of missing data for (i) inference on the area under the ROC curve (AUC, a summary index of the diagnostic accuracy of the test) and (ii) joint inference on the sensitivity and the specificity of a continuous-scale diagnostic test. In this dissertation, we provide a general framework that combines empirical likelihood and general estimating equations with nuisance parameters for the joint inference of sensitivity and specificity with missing diagnostic data. The proposed methods have sound theoretical properties. The theoretical development is challenging because the proposed profile log-empirical-likelihood ratio statistics are not the standard sum of independent random variables. The new methods combine the power of likelihood-based approaches and the jackknife method in ROC studies. Therefore, they are expected to be more robust, more accurate, and less computationally intensive than existing methods in the evaluation of competing diagnostic tests.
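The AUC in aim (i) is \(P(X > Y)\) for test results \(X\) from diseased and \(Y\) from non-diseased subjects; with complete data it has the familiar nonparametric (Mann-Whitney) estimator

\[
\widehat{\mathrm{AUC}} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} I\left(x_i > y_j\right),
\]

and the dissertation's task is to extend such inference to settings where some of the \(x_i\) or \(y_j\) are missing.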
69

Second Level Cluster Dependencies: A Comparison of Modeling Software and Missing Data Techniques

Larsen, Ross Allen Andrew August 2010 (has links)
Dependencies in multilevel models at the second level have never been thoroughly examined. For certain designs, first-level subjects are independent over time, but second-level subjects may exhibit nonzero covariances over time. Following a review of relevant literature, the first study investigated which widely used computer programs adequately take these dependencies into account in their analyses. This was accomplished through a simulation study in SAS and examples of analyses with Mplus and LISREL. The second study investigated the impact of two different missing-data techniques for such designs, in the case where data are missing at the first level, with a simulation study in SAS. The first study simulated data produced in a multiyear study, varying the numbers of subjects at the first and second levels, the number of data waves, the magnitude of effects at both the first and second level, and the magnitude of the second-level covariance. Results showed that SAS and the MULTILEV component in LISREL analyze such data well, while Mplus does not. The second study compared two missing-data techniques in the presence of a second-level dependency: multiple imputation (MI) and full information maximum likelihood (FIML). They were compared in a SAS simulation study in which the data were simulated with all the factors of the first study, with the addition of missing data varied in amount and pattern (missing completely at random or missing at random). Results showed that FIML is superior to MI because it produces lower bias and correctly estimates standard errors.
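For context, FIML avoids imputation altogether by maximizing the observed-data log-likelihood, with each case contributing the marginal density of only its observed components (standard form):

\[
\ell(\theta) = \sum_{i=1}^{N}\log f\!\left(\mathbf{y}_{\mathrm{obs},i} \mid \theta\right) = \sum_{i=1}^{N}\log \int f\!\left(\mathbf{y}_{\mathrm{obs},i}, \mathbf{y}_{\mathrm{mis},i} \mid \theta\right)\, d\mathbf{y}_{\mathrm{mis},i}.
\]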
70

Partial least squares structural equation modelling with incomplete data : an investigation of the impact of imputation methods

Mohd Jamil, J. B. January 2012 (has links)
Despite considerable advances in missing-data imputation methods over the last three decades, the problem of missing data remains largely unsolved. Many techniques have emerged in the literature as candidate solutions. These techniques can be categorised into two classes: statistical methods of data imputation and computational intelligence methods of data imputation. Due to the longstanding use of statistical methods in handling missing-data problems, computational intelligence methods have been slow to gain attention, even though they offer comparable accuracy. The merits of both classes have been discussed at length in the literature, but only limited studies make significant comparisons between them. This thesis contributes to knowledge by, first, conducting a comprehensive comparison of standard statistical methods of data imputation, namely mean substitution (MS), regression imputation (RI), expectation maximization (EM), tree imputation (TI) and multiple imputation (MI), on missing completely at random (MCAR) data sets. Second, this study compares the efficacy of these methods with a computational intelligence method of data imputation, namely a neural network (NN), on missing not at random (MNAR) data sets. The significant differences in the performance of the methods are presented. Third, a novel procedure for handling missing data is presented. A hybrid combination of each of these statistical methods with a NN, known here as the post-processing procedure, was adopted to approximate MNAR data sets. Simulation studies for each of these imputation approaches were conducted to assess the impact of missing values on partial least squares structural equation modelling (PLS-SEM), based on the estimated accuracy of both structural and measurement parameters. The best method for dealing with each particular missing-data mechanism is identified. Several significant insights were deduced from the simulation results. For the MCAR problem, among the statistical methods of data imputation, MI performs better than the other methods for all percentages of missing data. Another unique contribution emerges when comparing the results before and after the NN post-processing procedure. The improvement in accuracy may result from the neural network's ability to derive meaning from the imputed data set produced by the statistical methods. Based on these results, the NN post-processing procedure is capable of helping MS produce significant improvements in the accuracy of the approximated values. This is a promising result, as MS is the weakest method in this study. This evidence is also informative, as MS is often the default method available to users of PLS-SEM software.
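Below is a minimal sketch of the mean-substitution-plus-NN post-processing idea, under stated assumptions: a single incomplete column, an MLP regressor standing in for the thesis's neural network, and simulated data. It illustrates the hybrid concept rather than the thesis's exact procedure.

```python
# A hedged sketch of the statistical-imputation-plus-NN "post-processing"
# hybrid described above: mean substitution fills the gaps first, then a
# neural network trained on complete cases refines the imputed entries.
# Data, column layout, and network settings are illustrative assumptions.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
# Make the last column depend on the first three, so it is recoverable.
X[:, 3] = X[:, :3] @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.1, size=n)
truth = X[:, 3].copy()

mask = rng.random(n) < 0.2            # delete 20% of the last column
X_miss = X.copy()
X_miss[mask, 3] = np.nan

# Step 1: mean substitution (MS), the weakest baseline in the study.
ms_filled = SimpleImputer(strategy="mean").fit_transform(X_miss)

# Step 2: NN post-processing. Learn the incomplete column from the fully
# observed columns on complete cases, then overwrite the mean-imputed values.
nn = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
nn.fit(X_miss[~mask, :3], X_miss[~mask, 3])
pp_filled = ms_filled.copy()
pp_filled[mask, 3] = nn.predict(X_miss[mask, :3])

for name, filled in [("MS only", ms_filled), ("MS + NN", pp_filled)]:
    rmse = np.sqrt(np.mean((filled[mask, 3] - truth[mask]) ** 2))
    print(f"{name}: RMSE = {rmse:.3f}")
```

In the thesis the refined values would then feed the PLS-SEM fit, with accuracy judged through the structural and measurement parameter estimates.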
