Global ETD Search

71	Predicting the Unobserved: A statistical analysis of missing data techniques for binary classification Säfström, Stella January 2019 The aim of the thesis is to investigate how the classification performance of random forest and logistic regression differs, given an imbalanced data set with data missing completely at random (MCAR). Performance is measured in terms of accuracy and sensitivity. Two analyses are performed: one with a simulated data set and one application using data from the Swedish population registries. The simulated data set is constructed to have the same class imbalance, 1:5. The missing values are handled using three different techniques: complete case analysis, predictive mean matching and mean imputation. The thesis concludes that logistic regression and random forest are on average equally accurate, with some instances of random forest outperforming logistic regression. Logistic regression consistently outperforms random forest with regard to sensitivity. This implies that logistic regression may be the better option for studies where the goal is to accurately predict outcomes in the minority class. None of the missing data techniques stood out in terms of performance. Random forest logistic regression imputation classification MCAR missing data imbalanced data Probability Theory and Statistics
72	Avaliação de redes Bayesianas para imputação em variáveis qualitativas e quantitativas. / Evaluating Bayesian networks for imputation with qualitative and quantitative variables. Magalhães, Ismenia Blavatsky de 29 March 2007 Bayesian networks are structures that combine probability distributions with graphs. Although Bayesian networks first appeared in the 1980s, and the first attempts to solve the problems caused by non-response date back to the 1930s and 1940s, the use of structures of this kind specifically for imputation is quite recent: in 2002 by official statistical institutes, and in 2003 in the context of data mining. The purpose of this work is to present some results on the application of discrete and mixed Bayesian networks to imputation. To that end, an algorithm is proposed that combines expert knowledge with experimental data observed in previous surveys or in part of the collected data. Applying Bayesian networks in this context rests on the hypothesis that, once the variables are preserved in their original relationship, the imputation method will be effective in maintaining desirable properties. Accordingly, three types of consistency already present in the literature are evaluated: database consistency, logical consistency and statistical consistency. In addition, structural consistency is proposed, defined as the ability of a network to keep its structure within the equivalence class of the original network when it is rebuilt from the data after imputation. For the first time, a mixed Bayesian network is used to treat non-response in quantitative variables. A measure of statistical consistency for mixed networks is computed, using multiple imputation to evaluate network parameters and regression models. As an application, experiments were conducted using simple networks based on data on dwellings and people from the 2000 Demographic Census for the city of Natal and on data from a study of homicides in the city of Campinas. The results indicate that Bayesian networks for imputation of discrete attributes are promising, particularly when the interest is in maintaining statistical consistency and the variable has a small number of classes. Other characteristics, such as the contingency coefficient between variables, are affected by the method as the percentage of non-response increases. For continuous attributes, the median is the most sensitive to the method. Bayesian networks Imputation Missing data Multiple imputation Non-response
73	Dados filogenômicos para inferência de relações evolutivas entre espécies do gênero Cereus Mill. (Cactaceae, Cereeae) / Phylogenomic data for inference of evolutionary relationships among species of the genus Cereus Mill. (Cactaceae, Cereeae) Bombonato, Juliana Rodrigues 04 June 2018 Phylogenomic studies using Next Generation Sequencing (NGS) are becoming increasingly common. The use of markers from a reduced-representation genomic library, in this case Double Digest Restriction Site Associated DNA Sequencing (ddRADSeq), is promising for this purpose, at least considering its cost-effectiveness in large datasets of non-model groups as well as the genome-wide representation recovered in the data. Here we used ddRADSeq to infer the species-level phylogeny of the genus Cereus (Cactaceae). This genus comprises about 25 recognized, predominantly South American species distributed among four subgenera. Our sample includes representatives of Cereus, species from the closely allied genera Cipocereus and Praecereus, and outgroups. The ddRADSeq library was prepared using the enzymes EcoRI and HpaII. After quality control (fragment size and quantification), the library was sequenced on an Illumina HiSeq 2500. The bioinformatic processing of the raw FASTQ files included adapter checking, quality filtering (FastQC, MultiQC and SeqyClean software) and SNP calling (iPyRAD software). Three scenarios of permissiveness to missing data were run in iPyRAD, recovering datasets with 333 (up to 40% missing data), 1440 (up to 60% missing data) and 6141 (up to 80% missing data) loci. For each dataset, Maximum Likelihood (ML) trees were generated using two supermatrices: linked SNPs and loci. In general, we observed some inconsistencies between ML trees generated with different software (IQ-TREE and RAxML) or based on different matrix types (linked SNPs and loci). On the other hand, accuracy and resolution improved with the largest dataset (up to 80% missing data). Overall, we present a phylogeny with unprecedented resolution for the genus Cereus, which was resolved as a likely monophyletic group composed of four main clades with high support for their internal relationships. Further, our data contribute to the debate about allowing more missing data when conducting phylogenetic analyses with RAD loci.
74	Exploratory Visualization of Data with Variable Quality Huang, Shiping 11 January 2005 Data quality, which refers to correctness, uncertainty, completeness and other aspects of data, has become an increasingly prevalent concern and has been addressed across multiple disciplines. Data quality issues can be introduced and presented in any of the data manipulation processes, such as data collection, transformation, and visualization. Data visualization is a process of data mining and analysis using graphical presentation and interpretation. The correctness and completeness of discoveries made through visualization depend to a large extent on the quality of the original data. Without the integration of quality information with data presentation, the analysis of data using visualization is incomplete at best and can lead to inaccurate or incorrect conclusions at worst. This thesis addresses the issue of data quality visualization. Incorporating data quality measures into data displays is challenging in that the display is apt to be cluttered when faced with multiple dimensions and data records. We investigate the incorporation of data quality information in traditional multivariate data display techniques and also develop novel visualization and interaction tools that operate in data quality space. We validate our results using several data sets that have variable quality associated with dimensions, records, and data values. Visualization Uncertainty Missing Data Imputation Data Quality Electronic data processing Quality control Visualization Data processing
75	Imputação múltipla de dados faltantes: exemplo de aplicação no Estudo Pró-Saúde / Multiple imputation of missing data: application in the Pro-Saude Program Thaís de Paulo Rangel 05 March 2013 Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Missing data are a common problem in epidemiologic studies and, depending on how they occur, the estimates of the parameters of interest may be biased. The literature offers several techniques for dealing with the issue, and multiple imputation has received particular attention in recent years. This dissertation presents the results of applying multiple imputation in the context of the Pro-Saude Study, a longitudinal study among civil servants at a university in Rio de Janeiro, Brazil. In the first paper, after missing data were simulated, the color/race variable of the female participants was imputed and analyzed with a previously established survival model, which had self-reported history of uterine leiomyoma as the outcome. The procedure was replicated 100 times to determine the distribution of the coefficients and standard errors of the estimates for the variable of interest. Although the data used were collected cross-sectionally (baseline data of the Pro-Saude Study, gathered in 1999 and 2001), the participants' follow-up history was reconstructed from their self-reports, creating a situation in which the Cox proportional hazards model could be applied. In the scenarios evaluated, imputation showed satisfactory results, including in the performance assessment. The technique performed well when the missing data mechanism was MAR (Missing At Random) and the non-response rate was 10%. Imputing the missing data and combining the estimates from the 10 resulting datasets (m = 10) gave a bias of 0.0011 for the black category and 0.0015 for the brown (mixed-race) category, which corroborates the efficiency of multiple imputation in this scenario. Other configurations showed similar results. In the second paper, a tutorial was developed to guide the application of multiple imputation in epidemiologic studies, which should make the technique easier to use for Brazilian researchers not yet familiar with the procedure. The basic steps and decisions needed to impute a dataset are presented, and one of the scenarios from the first paper is used as an application example. All analyses were performed in the R statistical software, version 2.15, and the scripts are presented at the end of the text. Missing data Multiple imputation Survival analysis Tutorial EPIDEMIOLOGY
77	Model selection criteria in the presence of missing data based on the Kullback-Leibler discrepancy Sparks, JonDavid 01 December 2009 An important challenge in statistical modeling involves determining an appropriate structural form for a model to be used in making inferences and predictions. Missing data is a very common occurrence in most research settings and can easily complicate the model selection problem. Many useful procedures have been developed to estimate parameters and standard errors in the presence of missing data; however, few methods exist for determining the actual structural form of a model when the data is incomplete. In this dissertation, we propose model selection criteria based on the Kullback-Leibler discrepancy that can be used in the presence of missing data. The criteria are developed by accounting for missing data using principles related to the expectation maximization (EM) algorithm and bootstrap methods. We formulate the criteria for three specific modeling frameworks: the normal multivariate linear regression model, a generalized linear model, and a normal longitudinal regression model. In each framework, a simulation study is presented to investigate the performance of the criteria relative to their traditional counterparts. We consider a setting where the missingness is confined to the outcome, and also a setting where the missingness may occur in the outcome and/or the covariates. The results from the simulation studies indicate that our criteria provide better protection against underfitting than their traditional analogues. We outline the implementation of our methodology for a general discrepancy measure. An application is presented where the proposed criteria are utilized in a study that evaluates the driving performance of individuals with Parkinson's disease under low contrast (fog) conditions in a driving simulator. 
AIC Bootstrap EM Algorithm Kullback-Leibler discrepancy Missing Data Model Selection Biostatistics
78	Longitudinal Analysis of Resource Competitiveness and Homelessness Among Young Adults Prante, Matt F. 01 August 2013 Homelessness occurs when individual resources are not enough for the demands of a given environment. Exploring homelessness as a process of resource loss on a continuum of poverty leads to research and explanations concerning how people transition from being housed to being homeless. This study assessed the influence of age, gender, and race along with a set of eleven resource competitiveness variables on the risk of youth becoming homeless. Resource competitiveness variables were: parental income, personal income, possession of a driver's license (DL), live-in partner, parenthood, education and training, annual weeks-employed, substance abuse, and incarceration history. The data came from the Bureau of Labor Statistics' National Longitudinal Survey of Youth 1997 (NLSY97). This sample was restricted to those that were homeless or unstably housed and were between the ages of 18 and 24 (n = 141). Each case was then matched by age, gender, and race to two individuals randomly selected from the remaining NLSY97 sample (n = 282). This resulted in an overall N of 423. A growth model was used to analyze the data longitudinally. Partnership, education and training, DL, annual weeks-employed, and personal income were significantly associated with experiences of homelessness and unstable housing. All were negatively related, except for age, which was positively related to incidents of homelessness and unstable housing. Comparisons across the homeless, unstably housed, and control samples showed incremental changes in nearly all the covariates in this study, in relation to changes in housing status, supporting the importance of studying homelessness as a point on a continuum of resource loss versus a discrete state of being. homelessness longitudinal missing data NLSY97 resources youth Psychology Social and Behavioral Sciences
79	Comparação de método de imputação para dados de precipitação diária / Comparison of imputation method for daily precipitation data Teodoro, Valiana Alves 28 August 2019 Climatic events are the main cause of reductions in agricultural productivity, and precipitation is a meteorological variable of great importance for agricultural production. Meteorological databases often suffer from discontinuity and missing data. In this sense, grid point precipitation (Gridpoint) data are an excellent source of information for climatological research. To overcome missing data problems and build a complete database, an imputation process is required. The objective of this work was therefore to compare two imputation methodologies, the MICE method and the Kalman filter, and to compare their imputation performance in different missing data scenarios, using root mean square error (RMSE) as the metric. For daily precipitation series that had missing data, imputation was carried out by multiple imputation by chained equations (MICE), using month, year and grid point precipitation as information. Four models were used, in which daily precipitation depended on: month; month and year; grid point precipitation; or month, year and daily grid point precipitation. RMSE was used as the metric and, to check the imputations, the similarity between the observed and imputed data was analyzed with the Kolmogorov-Smirnov test and with plots of the mean and variance of the imputations. The model with the largest number of variables was chosen to impute the missing data in the daily precipitation series. In this work, grid point precipitation data proved important and advantageous as information for imputing daily precipitation series. For a complete daily precipitation series, the work focuses on comparing and evaluating imputation methods under the univariate and multiple approaches. In the univariate approach, different configurations of the Kalman filter, Weighted Moving Average, and Seasonal Decomposition were used. In the multiple approach, the MICE method was used with different models. Missing values were generated in the daily precipitation series both at random and in contiguous stretches and then estimated, with RMSE as the metric. The results showed that the Kalman filter provided the lowest RMSE values in all missing data scenarios and produced better estimates of daily precipitation values. The Kalman filter can be an important methodology for imputing daily precipitation data, ensuring a complete time series for analyses in several sectors, among them agriculture. Gridpoint Kalman filters MICE Missing data
80	Learning from Incomplete Data Ghahramani, Zoubin, Jordan, Michael I. 24 January 1995 Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing features. In this paper we review the problem of learning from incomplete data from two statistical perspectives---the likelihood-based and the Bayesian. The goal is two-fold: to place current neural network approaches to missing data within a statistical framework, and to describe a set of algorithms, derived from the likelihood-based framework, that handle clustering, classification, and function approximation from incomplete data in a principled and efficient manner. These algorithms are based on mixture modeling and make two distinct appeals to the Expectation-Maximization (EM) principle (Dempster, Laird, and Rubin 1977)---both for the estimation of mixture components and for coping with the missing data. AI MIT Artificial Intelligence missing data mixture models statistical learning EM algorithm maximum likelihood neural networks
