1 |
The development of a spatial-temporal data imputation technique for the applications of environmental monitoring
Huang, Ya-Chen, 12 September 2006
In recent years, sustainable development has become one of the most important international issues. Many indicators related to sustainable development have been proposed and implemented, such as Island Taiwan and Urban Taiwan. However, the missing values that accompany environmental monitoring data posed serious problems when we studied how to build a sustainable development indicator for the marine environment. Data are the origin of summarized information such as indicators, so when missing values degrade data quality, the accuracy of any estimate based on that data set is open to doubt. It is therefore important to apply suitable data pre-processing so that reliable information can be obtained from subsequent analysis. Missing values in environmental monitoring data have several causes, for example: machine breakdown, ruined samples, forgotten recordings, mismatched records when merging data, and records lost during data processing. The patterns of missing data are also diverse. At a given sampling time, the records at several sampling sites may be partially or completely missing; conversely, at a given sampling site, part or all of the time series may be missing. The missing values of environmental monitoring data are therefore related to both the spatial and the temporal dimensions. Existing imputation techniques have been developed for particular types of data, or interpolate missing values based on either geographic data distributions or time-series functions. Analyses that accommodate both spatial and temporal information are rarely seen. The current study attempts to integrate the related analysis procedures and develop a computing process that uses both the spatial and temporal dimensions inherent in environmental monitoring data. Such a data imputation process can enhance the accuracy of estimated missing values.
|
2 |
DATA PREPROCESSING MANAGEMENT SYSTEM
Anumalla, Kalyani, January 2007
No description available.
|
3 |
Methodological and Clinical Issues in the Analysis of Data from HIV Cardiovascular Research: Validity of Ultrasound Methods, Impact of Anti-Retroviral Therapy on Atherosclerosis, and Imputation of Missing Values
Odueyungbo, Adefowope, 11 1900
Background and Objectives: There are methodological and clinical challenges in conducting HIV-related research. A subset of these challenges includes: the lack of a universally accepted method to quantify subclinical atherosclerosis in HIV patients; the lack of a universally accepted ultrasound protocol, although ultrasound imaging techniques aimed at quantifying atheroma burden and endothelial dysfunction have been proposed; conflicting inferences on the nature of the relationship between anti-retroviral therapy (ART) and cardiovascular disease (CVD) due to small sample sizes; and missing data from longitudinal studies and ultrasound data. The objective of this thesis is to investigate selected aspects of the aforementioned issues and to provide recommendations for future research.
Methods:
Project 1: We compared the construct validity of carotid artery intima media thickness (IMT) and brachial artery flow mediated vasodilation (FMD), two non-invasive ultrasound techniques used to measure the extent of subclinical atherosclerosis. Baseline and one-year follow-up data were obtained for a sample of 257 subjects aged 35 years or older, recruited into an ongoing study of cardiovascular risk in HIV. An ultrasound technique with a strong, statistically significant association with known CVD risk factors was judged to have good construct validity. The relationship between baseline IMT or FMD and known CVD risk factors was studied using multiple regression analysis. We modelled the relationship between progression of IMT or FMD and risk factors using fixed-effects models.
Project 2: To investigate the relationship between ARTs and IMT (as a surrogate for CVD) more precisely, we pooled cross-sectional baseline, record-level data for 1,032 patients recruited across three cohort studies in Canada, France and the USA in a meta-analysis. We investigated the association between exposure to ARTs and CVD using hierarchical linear models.
Project 3: On missing data, we used Monte Carlo simulation to study the impact of an inclusive strategy for conducting multiple imputation (MI) on the efficiency of regression parameter estimates. In an inclusive strategy, all final analysis variables are included in a multivariate normal model to impute plausible values for missing data. This issue is not well studied for longitudinal HIV data.
Results and Conclusions:
Project 1: Baseline IMT was significantly associated with age (p < 0.001), male gender (p = 0.034), current smoking status (p < 0.001), systolic blood pressure (p < 0.001) and total:HDL cholesterol ratio (p = 0.004). IMT progression was significantly associated with age (p < 0.001), male gender (p = 0.0051) and current smoking status (p = 0.011). Neither the extent nor the progression of FMD was significantly associated with any of the examined vascular risk factors. IMT was judged to have better construct validity than FMD.
Project 2: Similar to some (but not all) previous studies, ARTs do not appear to lead to CVD independent of traditional risk factors. However, exploratory analysis of two-way interactions suggests statistically significant moderating effects between ARTs and traditional risk factors. These results warrant further investigation into potential moderating effects between ARTs and known CVD risk factors.
Project 3: In conducting MI, simulation results show that a strategy that includes all final analysis model variables in the imputation model provides the least combined variability and bias for final regression estimates. This is important to note because final regression estimates are used in making clinically relevant inferences in practice.
Thesis / Doctor of Philosophy (PhD)
|
4 |
Bayesian Mixtures and Gene Expression Profiling with Missing Data
Chang, Xiaoqing, January 2008
No description available.
|
5 |
Evaluation verschiedener Imputationsverfahren zur Aufbereitung großer Datenbestände am Beispiel der SrV-Studie von 2013 / Evaluation of different imputation methods for preparing large data sets, using the 2013 SrV study as an example
Meister, Romy, 09 March 2016
Missing values are a serious problem in surveys. The literature suggests replacing them with realistic values using imputation methods. This master's thesis examines four imputation techniques with respect to their ability to handle missing data: mean imputation, conditional mean imputation, the Expectation-Maximization algorithm, and the Markov chain Monte Carlo method. The first three methods were additionally evaluated by simulation on a large real data set. To analyse the quality of these techniques, a metric variable of the original data set was chosen and missing values were generated in it under different percentages of missingness and common missing-data mechanisms. After the simulated missing values were replaced, several statistical parameters, such as quantiles, arithmetic mean and variance, were calculated for all completed data sets and compared with the parameters of the original data set. The results of this empirical analysis show that the Expectation-Maximization algorithm estimates all considered statistical parameters of the complete data set far better than the other imputation methods analysed, although the assumption of a multivariate normal distribution was not met. Under the assumption of missing completely at random, both mean and conditional mean imputation produce statistically significant estimators for the arithmetic mean, whereas other parameters, such as the variance, do not show the expected effects. In general, the accuracy of all estimators from the three imputation methods decreases as the percentage of missingness increases. The results lead to the conclusion that the Expectation-Maximization algorithm should be preferred over mean and conditional mean imputation.
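The evaluation design described above (delete values from a complete variable, impute, then compare summary statistics) can be sketched for the simplest of the four methods, mean imputation under MCAR. This is only an illustration under stated assumptions: the data are synthetic normal draws rather than the SrV survey, and the function names `make_mcar` and `mean_impute` are hypothetical.

```python
import random
import statistics


def make_mcar(values, missing_rate, rng):
    """Return a copy of values with entries set to None completely at random."""
    return [None if rng.random() < missing_rate else v for v in values]


def mean_impute(values):
    """Replace every None with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumption: synthetic stand-in for the "metric variable" of the real data set.
original = [rng.gauss(50.0, 10.0) for _ in range(1000)]

for rate in (0.1, 0.3, 0.5):
    completed = mean_impute(make_mcar(original, rate, rng))
    print(
        f"missing={rate:.0%}  "
        f"mean={statistics.fmean(completed):.2f} "
        f"(orig {statistics.fmean(original):.2f})  "
        f"var={statistics.pvariance(completed):.2f} "
        f"(orig {statistics.pvariance(original):.2f})"
    )
```

Because every gap is filled with the same value, the completed series keeps the mean of the observed values but its variance shrinks as the missing share grows, which is in line with the abstract's remark that the variance is not recovered by mean imputation.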
|
6 |
Data envelopment analysis with sparse data
Gullipalli, Deep Kumar, January 1900
Master of Science / Department of Industrial & Manufacturing Systems Engineering / David H. Ben-Arieh / The quest for continuous improvement among organizations and the issue of missing data in data analysis are never-ending. This thesis brings these two topics under one roof: evaluating the productivity of organizations with sparse data. The study uses Data Envelopment Analysis (DEA) to determine the efficiency of 41 member clinics of the Kansas Association of Medically Underserved (KAMU) in the presence of missing data. The primary focus of this thesis is to develop new, reliable methods to determine the missing values and to execute DEA.
DEA is a linear programming methodology for evaluating the relative technical efficiency of homogeneous Decision Making Units using multiple inputs and outputs. The effectiveness of DEA depends on the quality and quantity of the data used. DEA outcomes are susceptible to missing data, which creates a need to supplement sparse data in a reliable manner. Determining missing values more precisely improves the robustness of the DEA methodology.
Three methods to determine the missing values are proposed in this thesis, each based on a different approach. The first, the Average Ratio Method (ARM), uses the average of all the ratios between two variables. The second is based on a modified Fuzzy C-Means clustering algorithm that can handle missing data; the issues associated with this clustering algorithm are resolved to improve its effectiveness. The third is based on an interval approach: missing values are replaced by interval ranges estimated by experts, and crisp efficiency scores are identified along similar lines to how DEA determines efficiency scores using the best set of weights.
There is no unique way to evaluate the effectiveness of these methods. Here, effectiveness is tested by choosing a complete dataset and treating varying proportions of the data as missing. The best set of recovered missing values from the above methods serves as the input for executing DEA. Results show that the DEA efficiency scores generated with recovered values are close to the actual efficiency scores that would be generated with the complete data.
In summary, this thesis provides an effective and practical approach for replacing missing values needed for DEA.
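The abstract gives only a one-line description of the Average Ratio Method, so the following is a sketch of one plausible reading and not the thesis's actual procedure: the mean of y/x over units where both variables are present is used to scale x wherever y is missing. The function name and the clinic figures are hypothetical.

```python
def average_ratio_impute(x, y):
    """Fill None entries of y using the mean of y/x over complete pairs.

    Assumption: this is one plausible reading of "the average of all the
    ratios between two variables"; the thesis may define it differently.
    """
    ratios = [yi / xi for xi, yi in zip(x, y)
              if xi is not None and yi is not None and xi != 0]
    if not ratios:
        raise ValueError("no complete pairs to estimate a ratio from")
    mean_ratio = sum(ratios) / len(ratios)
    # Only fill y where x is available; other entries are left untouched.
    return [xi * mean_ratio if yi is None and xi is not None else yi
            for xi, yi in zip(x, y)]


# Hypothetical clinic data: staff counts (x) and patient visits (y).
staff = [4, 10, 6, 8]
visits = [400, 1100, None, 760]
print(average_ratio_impute(staff, visits))
```

A design note: averaging the per-unit ratios weights every clinic equally, whereas dividing the column totals would weight large clinics more heavily. The abstract does not say which variant the thesis uses.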
|
7 |
Caracterização da estrutura de interação genótipo e ambiente utilizando modelo AMMI e W-AMMI por meio de Biplot / Characterization of structure of genotype and environment interaction using AMMI and W-AMMI models through Biplot
Hirai, Welinton Yoshio, 05 February 2019
Statistics is a very important tool in genetic breeding because of the need to analyse, in certain species, characteristics of adaptability and stability. One measure that helps the researcher assess these behaviours is the genotype × environment interaction (GEI). Many methodologies help characterize this effect. One of them is the AMMI (Additive Main effects and Multiplicative Interaction) model, in which the effects are estimated using ANOVA (Analysis of Variance) and the structure of the interaction is characterized by PCA (Principal Component Analysis). However, the model assumes homogeneity of variances. For cases where this assumption does not hold, a generalization of the AMMI model has been proposed, the W-AMMI (Weighted AMMI), which uses a weighted SVD (Singular Value Decomposition). The objective of this work was to evaluate the GEI with the AMMI and W-AMMI models through biplots. Two data sets were used. The first experiment was carried out at the Instituto Agronômico de Campinas (IAC) to evaluate a hybrid grape (SR 0.501-17) grafted onto four rootstocks (IAC 766, IAC 572, IAC 571-6 and IAC 313) in two municipalities of the state of São Paulo (Votuporanga and Jundiaí) in the years 2012, 2013 and 2014. The second experiment was carried out by EMBRAPA-Fortaleza with the aim of characterizing the melon fruit from 92 families in 3 different environments. In the first analysis, even though the interaction factor of the joint ANOVA was not significant, the AMMI approach was pursued in order to characterize the stability behaviour of the rootstocks in the different environments. Because of the high heterogeneity among the environments, the W-AMMI model characterized the GEI better. The data for the second experiment contained missing values, so a distribution-free imputation method based on SVD was used. Because the 3 environments lacked heterogeneity, the W-AMMI model behaved similarly to the AMMI model in describing the GEI. It is concluded that even when the genotype and environment factors are independent, it would be worthwhile for the researcher to use the AMMI model as a complement to the analysis, given the multivariate complexity that this factor can present. In addition, for experiments with homogeneity of variance, the W-AMMI model does not improve the characterization, which is consistent with the purpose of the methodology.
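The additive step of AMMI described above can be sketched as a double-centring of the genotype × environment table of means: the grand mean and the genotype and environment main effects are removed, leaving the interaction residuals that PCA/SVD would then decompose. The sketch stops before the decomposition, and the function name and yield figures are hypothetical, not taken from the thesis.

```python
def interaction_residuals(table):
    """Double-centre a genotype x environment table of means.

    Returns (grand_mean, genotype_effects, environment_effects, residuals),
    where residuals[i][j] = y_ij - grand - g_i - e_j.
    """
    n_gen, n_env = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_gen * n_env)
    gen_eff = [sum(row) / n_env - grand for row in table]
    env_eff = [sum(table[i][j] for i in range(n_gen)) / n_gen - grand
               for j in range(n_env)]
    resid = [[table[i][j] - grand - gen_eff[i] - env_eff[j]
              for j in range(n_env)] for i in range(n_gen)]
    return grand, gen_eff, env_eff, resid


# Hypothetical mean yields: 3 genotypes (rows) in 2 environments (columns).
yields = [[1.0, 5.0], [4.0, 2.0], [3.0, 3.0]]
grand, gen_eff, env_eff, resid = interaction_residuals(yields)
for row in resid:
    print([round(v, 3) for v in row])
```

Each row and each column of the residual matrix sums to zero, and a purely additive table gives all-zero residuals, so whatever remains is the interaction structure that the biplot visualises.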
|
8 |
Estratégias para tratamento de variáveis com dados faltantes durante o desenvolvimento de modelos preditivos / Strategies for treatment of variables with missing data during the development of predictive models
Assunção, Fernando, 09 May 2012
Predictive models have been increasingly used by the market to assist companies in risk mitigation, portfolio growth, customer retention and fraud prevention, among other objectives. During model development, however, it is common for some of the predictive variables to have unfilled data (missing values), so a procedure for treating these variables must be adopted. Given this scenario, the aim of this study is to discuss methodologies for dealing with missing data in predictive models, encouraging the use of some that are already known in academia but not yet used by the market. This work describes seven methodologies, all of which were submitted to an empirical application using a data set from the development of a Credit Score model. Seven models were developed on this data set (one for each methodology described), and their results were evaluated and compared using performance metrics widely used by the market (KS, Gini, ROC curve and Approval curve). In this application, the best-performing techniques were the one that treats missing data as a separate category (a technique already used by the market) and the one that groups the missing data into the conceptually most similar category. The worst-performing methodology was the one that simply does not use the variable with missing data, another procedure commonly seen in the market.
|
9 |
Substituição de valores ausentes: uma abordagem baseada em um algoritmo evolutivo para agrupamento de dados / Missing value substitution: an approach based on evolutionary algorithm for clustering data
Silva, Jonathan de Andrade, 29 April 2010
The substitution of missing values, also known as imputation, is an important data preparation task in data mining applications. This work proposes and evaluates an algorithm for missing value imputation that is based on an evolutionary algorithm for clustering. The algorithm rests on the assumption that (previously unknown) clusters of data can provide useful information for the imputation process. To assess the proposed algorithm experimentally, missing values were simulated on six classification data sets using two mechanisms widely used in controlled experiments: MCAR and MAR. Imputation algorithms have traditionally been assessed by measures of prediction capability. However, these traditional measures do not estimate the influence of imputation methods on the final modelling task (e.g., classification). This work describes experimental results obtained from the perspectives of prediction and of inserted bias in classification problems. The results illustrate different scenarios in which the proposed algorithm performs, in general, similarly to six other imputation algorithms reported in the literature. Finally, the statistical analyses reported suggest that better prediction results do not necessarily imply less classification bias.
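The two missingness mechanisms named above differ only in what the deletion probability depends on. The sketch below shows that difference with boolean masks; it is not the thesis's simulation code, and the function names, rates and the median split used for MAR are assumptions chosen for illustration.

```python
import random


def mcar_mask(n_rows, rate, rng):
    """MCAR: every row has the same chance of losing its value."""
    return [rng.random() < rate for _ in range(n_rows)]


def mar_mask(covariate, low_rate, high_rate, rng):
    """MAR: the chance of losing a value depends on an observed covariate.

    Rows whose covariate exceeds the (upper) median use high_rate,
    all other rows use low_rate. The split rule is an assumption.
    """
    cutoff = sorted(covariate)[len(covariate) // 2]
    return [rng.random() < (high_rate if c > cutoff else low_rate)
            for c in covariate]


rng = random.Random(42)  # fixed seed so the run is repeatable
age = [rng.randint(18, 80) for _ in range(10)]  # hypothetical observed covariate
print("MCAR:", mcar_mask(len(age), 0.3, rng))
print("MAR: ", mar_mask(age, 0.1, 0.6, rng))
```

A `True` entry marks a value to delete from the target attribute. Under MAR the mask can be predicted in part from `age`, which is fully observed; under MCAR it cannot be predicted from any variable.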
|
10 |
Análise do número de grupos em bases de dados incompletas utilizando agrupamentos nebulosos e reamostragem Bootstrap / Analysis the number of clusters present in incomplete datasets using a combination of the fuzzy clustering and resampling bootstrapping
Milagre, Selma Terezinha, 18 July 2008
Data clustering is widely used in exploratory analysis, which is often necessary in research areas such as medicine, biology and statistics to evaluate potential hypotheses for subsequent studies. In real data sets, incomplete data, in which the values of one or more attributes are unknown, are very common. This work presents a method capable of identifying the number of clusters present in incomplete data sets, using a combination of fuzzy clustering and bootstrap resampling. The quality of classification is based on traditional comparison measures such as F1, Cross-Classification, Hubert and others. The studies were carried out on eight data sets. The first four are artificial data sets, and the fifth and sixth are the wine and iris data sets. The seventh and eighth are formed from a Brazilian collection of 119 Bradyrhizobium strains. To evaluate all the information without introducing estimates, the Fuzzy C-Means (FCM) algorithm was modified using an index vector of attributes, which indicates whether an attribute value is observed or not, and the calculations of the centers and of the distance to the centers were changed accordingly. The simulations were run for 2 to 8 clusters using 100 sub-samples. The percentages of missing values used were 2%, 5%, 10%, 20% and 30%. The results demonstrate that the proposed method is capable of identifying relevant partitions, even with high rates of incomplete data, without the need for any assumption about the data set. The Hubert and adjusted Rand index measures gave the best experimental results.
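The abstract does not spell out how the center and distance calculations were changed, so the following sketch assumes a partial-distance approach: distances are computed over observed attributes only and rescaled, and each center coordinate is averaged only over the points that have that attribute. Missing values are represented by `None`, and the function names and the fuzzifier `m` are illustrative choices.

```python
def partial_sq_distance(point, center):
    """Squared distance over observed attributes, rescaled to all attributes."""
    observed = [k for k, v in enumerate(point) if v is not None]
    if not observed:
        raise ValueError("point has no observed attributes")
    total = sum((point[k] - center[k]) ** 2 for k in observed)
    return total * len(point) / len(observed)


def update_center(points, memberships, m=2.0):
    """Membership-weighted mean per attribute, skipping missing entries.

    Assumes every attribute is observed in at least one point that has
    non-zero membership in this cluster.
    """
    center = []
    for k in range(len(points[0])):
        pairs = [(u ** m, p[k]) for p, u in zip(points, memberships)
                 if p[k] is not None]
        weight = sum(w for w, _ in pairs)
        center.append(sum(w * v for w, v in pairs) / weight)
    return center


# Hypothetical 2-attribute points; the first has its second attribute missing.
points = [[1.0, None], [3.0, 2.0]]
print(update_center(points, memberships=[1.0, 1.0]))
print(partial_sq_distance(points[0], center=[0.0, 0.0]))
```

No value is ever estimated for the missing attribute, which matches the stated goal of using all the information without introducing estimates. The remaining FCM steps (membership update, iteration to convergence) are unchanged and are left out of the sketch.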
|