1 |
變數遺漏值的多重插補應用於條件評估法 / Multiple imputation for missing covariates in contingent valuation survey
費詩元, Fei, Shih Yuan Unknown Date
Most studies of willingness to pay (WTP) simply discard missing values, treating them as if they were missing completely at random (MCAR). When an important variable has a high missing rate, however, this practice can seriously bias the analysis and lead to incorrect results.
Income is one of the most influential variables in contingent valuation (CV) studies and is also the variable that respondents are most likely to leave unanswered. In the present study, we evaluate the performance of multiple imputation (MI) for missing income in the analysis of WTP through a series of simulation experiments. Three approaches are compared: complete-case analysis, single imputation, and MI, each analyzed with a three-component mixture model. The simulations show that MI consistently performs better than complete-case analysis, and that the advantage grows as the missing rate increases. MI is also more stable and reliable than single imputation. When the missing mechanism is not MCAR, we therefore consider MI a trustworthy and well-performing approach.
As an illustration, we use data from the Cardio Vascular Disease risk FACtor Two-township Study (CVDFACTS), a long-term follow-up study of cardiovascular disease in the Zhudong (竹東) and Puzi (朴子) areas. We demonstrate how to assess the missing mechanism by comparing survival curves and fitting a logistic regression model. The empirical study shows that discarding cases with missing income yields model fits and estimates that differ from those obtained with multiple imputation. If the discarded cases are not missing completely at random, the remaining sample will be biased, which can be a serious problem in CV research. We conclude that MI is a useful method for handling missing values and is worth trying in CV studies.
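The abstract does not say how the analyses of the imputed datasets were combined. A common choice is Rubin's rules, sketched below under that assumption: the pooled estimate is the mean across imputations, and the total variance adds the between-imputation spread to the average within-imputation variance. The function name `pool_rubin` and the numbers are hypothetical and are not taken from the thesis.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed datasets with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # average within-imputation variance
    b = variance(estimates)      # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, t


# Hypothetical income coefficients and their squared standard errors
# from five imputed datasets (illustrative values only).
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.012, 0.011, 0.009, 0.008]
pooled, total_var = pool_rubin(estimates, variances)
```

A single imputation has no between-imputation term, so it understates the uncertainty; that is one way to read the abstract's remark that MI is more stable and reliable than single imputation.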
|
2 |
預測模型中遺失值之選填順序研究 / Research of acquisition order of missing values in predictive model
施雲天 Unknown Date
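Predictive models are widely used in everyday life, for example in bank credit scoring, consumer behavior analysis, and disease prediction. Whether building or applying a predictive model, however, we encounter missing values in the training or test data, and these degrade predictive performance. Missing values can be handled in many ways, including deletion, imputation, model-based methods, and machine learning; another option is to acquire the missing values directly at some cost.
This study focuses on acquiring missing values at a cost. It uses a decision tree as the predictive model, because a decision tree can accommodate missing values during construction, and looks for an acquisition method that reaches higher accuracy at lower cost. Building on the concept and logic of the Uncertainty Score in Error Sampling, we propose U-Sampling to rank the importance of different features. Whereas Error Sampling ranks by the importance of instances (row-based), U-Sampling ranks by the importance of features (column-based).
We run two sets of experiments on 8 datasets from the UCI Machine Learning Repository, with a given proportion of missing values in the training data and in the test data, respectively. We then compare U-Sampling, Random Sampling, and Error Sampling from the literature in terms of accuracy and error reduction rate. When the training data contain missing values, U-Sampling performs better on more than 70% of the datasets; when the test data contain missing values, U-Sampling performs better on 87.5% of the datasets.
We also study whether the missing rate affects the performance of these methods, which helps determine which acquisition method suits which situation. By selecting important features to fill first, U-Sampling can reach higher accuracy with fewer acquired values and thereby reduce the cost of handling missing values.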
|
3 |
Overcoming the Curse of Missing and Noisy Data in Computational Drug Design
Meng, Fanwang January 2022
Machine learning (ML) has enjoyed great success in chemistry and drug design, from designing synthetic pathways to drug screening to biomolecular property prediction. However, the generalizability and robustness of an ML model require high-quality training data, which is often difficult to obtain, especially when the training data are acquired from experimental measurements. While one can always discard all data associated with noisy and/or missing values, this often results in discarding invaluable data.
This thesis presents mathematical techniques to address this problem and applies them to problems in molecular medicinal chemistry. In Chapter 1, we show that the missing-data problem can be expressed as a matrix completion problem, and we point out how frequently matrix completion problems arise in (bio)chemical problems. In Chapter 2, we use matrix completion to impute the missing values in protein-NMR data, as a stepping-stone for understanding protein allostery. This chapter also uses several other techniques from statistical data analysis and machine learning, including denoising (from robust principal component analysis), latent feature identification from singular-value decomposition, and residue clustering by a Gaussian mixture model.
In Chapter 3, Δ-learning is used to predict free energies of hydration (Δ𝐺). The aim of this study is to correct hydration energies estimated from low-level quantum chemistry calculations with continuum solvation models, without significant additional computation. Extensive feature engineering, 8 different regression algorithms, and Gaussian process regression (with 38 different kernels) were used to construct the predictive models. The optimal model gives an MAE of 0.6249 kcal/mol and an RMSE of 1.0164 kcal/mol. Chapter 4 provides an open-source computational tool, Procrustes, to find the maximum similarities between matrices. Some examples are also given to show how to use Procrustes for chemical and biological problems. Finally, in Chapters 5 and 6, a database for permeability of the blood-brain barrier (BBB) was curated and combined with resampling strategies to form predictive models. The resulting models have promising performance and are released along with a computational tool, B3clf, for their evaluation. / Thesis / Doctor of Science (PhD)
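To make the matrix-completion view of missing data concrete, here is a minimal sketch that fills the gaps of a matrix assumed to be exactly rank 1 by alternating least squares on the observed entries. It is not the method used in the thesis, which the abstract does not specify; the function name `complete_rank1`, the rank-1 assumption, and the toy matrix are all illustrative, and every row and column is assumed to have at least one observed entry.

```python
def complete_rank1(x, iters=500):
    """Fill None entries of x with a rank-1 fit u[i] * v[j] to the observed entries."""
    n_rows, n_cols = len(x), len(x[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        # Least-squares update of each column factor with the row factors fixed.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if x[i][j] is not None]
            v[j] = sum(u[i] * x[i][j] for i in obs) / sum(u[i] ** 2 for i in obs)
        # Least-squares update of each row factor with the column factors fixed.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if x[i][j] is not None]
            u[i] = sum(v[j] * x[i][j] for j in obs) / sum(v[j] ** 2 for j in obs)
    # Keep observed values; use the low-rank fit only where data are missing.
    return [
        [x[i][j] if x[i][j] is not None else u[i] * v[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy matrix: the outer product of [1, 2, 3] and [4, 5, 6] with three entries removed.
x = [
    [4.0, 5.0, None],
    [None, 10.0, 12.0],
    [12.0, None, 18.0],
]
filled = complete_rank1(x)
```

On this noise-free example the missing entries should converge to the values implied by the outer product (6, 8, and 15). Real measurement data would need a higher rank, regularization, and some treatment of noise.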
|
4 |
Error Structure of Randomized Design Under Background Correlation with a Missing Value
Chang, Tseng-Chi 01 May 1965
The analysis of variance technique is probably the most popular statistical technique used for testing hypotheses and estimating parameters. Eisenhart presents two classes of problems solvable by the analysis of variance and the assumptions underlying each class. Cochran lists the assumptions and also discusses the consequences when these assumptions are not met. It is evident that if the assumptions are not all satisfied, the confidence placed in any result obtained in this manner is adversely affected to varying degrees, according to the extent of the violation. One of the assumptions in the analysis of variance procedures is that of uncorrelated errors. The experimenter may not always be able to meet this condition for economic or environmental reasons. In fact, Wilk questions the validity of the assumption of uncorrelated errors in any physical situation. For example, consider an experiment over a sequence of years. A correlation due to years may exist, no matter what randomization technique is used, because the outcome of the previous year determines to a great extent the outcome of this year. Another example would be the case of selecting experimental units from the same source, such as sampling students with the same background or selecting units from the same production process. In such cases a condition such as a shared background or a defect in the production process may force a correlation among the experimental units. Problems of this nature frequently occur in industrial, biological, and psychological experiments.
|
5 |
Stochastic Analysis of Networked Systems
January 2020
This dissertation presents a novel algorithm for recovering missing values of co-evolving time series with partial embedded network information. The idea is to connect two sources of data through a shared low dimensional latent space. The proposed algorithm, named NetDyna, is an Expectation-Maximization algorithm, and uses the Kalman filter and matrix factorization approaches to infer the missing values both in the time series and embedded network. The experimental results on real datasets, including a Motes dataset and a Motion Capture dataset, show that (1) NetDyna outperforms other state-of-the-art algorithms, especially with partially observed network information; (2) its computational complexity scales linearly with the time duration of time series; and (3) the algorithm recovers the embedded network in addition to missing time series values.
This dissertation also studies a load balancing algorithm, the so-called power-of-two-choices (Po2), for many-server systems (with N servers), and focuses on the convergence of the stationary distribution of Po2 to the solution of the mean-field system in both the light and heavy traffic regimes. Stein's method and state space collapse (SSC) are used to analyze both regimes.
In both regimes, the thesis first uses a state space collapse argument to show that the probability of the state being far from the mean-field solution is small. A simple application of Markov's inequality then shows that, with a proper choice of parameters, this probability is indeed very small.
Then, for states close to the solution of the mean-field model, the thesis uses Stein's method to show that the stochastic system is close to a linear mean-field model. Characterizing the generator difference identifies the dominant terms in both regimes. Note that the heavy traffic case requires lower and upper bounds for a tridiagonal matrix that arises from the linear mean-field model. The dominant term then yields the coefficient of the convergence rate.
In the end, comparisons between the theoretical predictions and numerical simulations are presented. / Dissertation/Thesis / Doctoral Dissertation Electrical Engineering 2020
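The effect of the power-of-two-choices rule can be illustrated with a static balls-into-bins experiment. This is a deliberate simplification and an assumption of this sketch: the dissertation studies a dynamic queueing system with arrivals and departures, which is not simulated here. The function name `max_load` and the problem sizes are hypothetical.

```python
import random


def max_load(n_bins, n_balls, choices, rng):
    """Place balls one at a time into the least loaded of `choices` random bins."""
    load = [0] * n_bins
    for _ in range(n_balls):
        # Sample candidate bins uniformly at random (with replacement).
        candidates = [rng.randrange(n_bins) for _ in range(choices)]
        # Send the ball to the candidate that currently holds the fewest balls.
        target = min(candidates, key=lambda b: load[b])
        load[target] += 1
    return max(load)


rng = random.Random(0)  # fixed seed so the run is reproducible
single = max_load(10_000, 10_000, choices=1, rng=rng)  # purely random placement
double = max_load(10_000, 10_000, choices=2, rng=rng)  # power-of-two-choices
```

Comparing `double` with `single` shows the much flatter load profile that sampling just two bins produces. That imbalance reduction is what makes the stationary behavior of Po2 worth analyzing in the many-server setting.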
|
6 |
Méthodes informées de factorisation matricielle pour l'étalonnage de réseaux de capteurs mobiles et la cartographie de champs de pollution / Informed method of matrix factorization for calibration of mobile sensor networks and pollution fields mapping
Dorffer, Clément 13 December 2017
Mobile crowdsensing aims to acquire geolocated and timestamped data from a crowd of mobile sensors (built into or connected to smartphones). In this thesis, we focus on processing data from environmental mobile crowdsensing.
In particular, we propose to revisit blind sensor calibration as an informed matrix factorization problem with missing entries, where the factor matrices respectively contain the calibration model, which is a function of the observed physical phenomenon (we propose approaches for affine and nonlinear sensor responses), and the calibration parameters of each sensor. Moreover, in the air quality monitoring application we consider, we assume that we have some very precise measurements, sparsely distributed in space and time, which we combine with the multiple measurements from the mobile sensors. Our approaches are "informed" because (i) the factor matrices are structured by the nature of the problem, (ii) the physical phenomenon can be decomposed sparsely in a known dictionary or approximated by a physical or geostatistical model, and (iii) we know the mean calibration function of the sensors to be calibrated. The proposed approaches outperform methods based on completion of the observed data matrix and the multi-hop calibration techniques from the literature, which are based on robust regression. Finally, the informed matrix factorization formalism also allows us to reconstruct a fine map of the observed physical phenomenon.
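For the affine case, the factorization described above can be written out as follows. This is a sketch under assumed notation, since none of the symbols appear in the abstract: $y_i$ is the true value of the phenomenon at space-time cell $i$, $x_{ij}$ is the reading of sensor $j$ in that cell, $\alpha_j$ and $\beta_j$ are that sensor's offset and gain, and $W$ is a binary mask of observed entries.

```latex
\[
x_{ij} \approx \alpha_j + \beta_j\, y_i
\quad\Longleftrightarrow\quad
X \approx G F,
\qquad
G = \begin{bmatrix} 1 & y_1 \\ \vdots & \vdots \\ 1 & y_n \end{bmatrix},
\quad
F = \begin{bmatrix} \alpha_1 & \cdots & \alpha_m \\ \beta_1 & \cdots & \beta_m \end{bmatrix}
\]
\[
\min_{G,\,F}\ \bigl\lVert W \circ (X - G F) \bigr\rVert_F^2,
\qquad
w_{ij} =
\begin{cases}
1 & \text{if sensor } j \text{ measured cell } i,\\
0 & \text{otherwise.}
\end{cases}
\]
```

The constant first column of $G$ is one example of the structure that makes the factorization "informed". The precise reference measurements can be thought of as an additional column of $X$ whose calibration parameters are fixed to $\alpha = 0$, $\beta = 1$.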
|
7 |
Novel computationally intelligent machine learning algorithms for data mining and knowledge discovery
Gheyas, Iffat A. January 2009
This thesis addresses three major issues in data mining regarding feature subset selection in large dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm: SAGA. SAGA combines the ability to avoid being trapped in local minima of Simulated Annealing with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms and the high computational efficiency of generalized regression neural networks (GRNN). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of Generalized Regression Neural Networks (GRNNs) trained on different subsets of features generated by SAGA and the predictions of base classifiers are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features which make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNN is used for both base classifiers and the top level combiner classifier. Because of GRNN, the proposed ensemble is a dynamic weighting scheme. This is in contrast to the existing ensemble approaches which belong to the simple voting and static weighting strategy. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new ones. The simulation results demonstrate the validity of the proposed ensemble model.
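A generalized regression neural network predicts with a kernel-weighted average of the training targets, which is why it behaves as a dynamic weighting scheme: training cases similar to the query receive larger weights. The sketch below shows that core computation used to impute one missing target value. It assumes a Gaussian kernel with a single width `sigma`; the function name `grnn_predict` and the data are hypothetical, and the SAGA feature selection and ensemble layers described in the abstract are not reproduced.

```python
import math


def grnn_predict(x, train_x, train_y, sigma=0.5):
    """Return the Gaussian-kernel weighted average of the training targets."""
    weights = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2 * sigma ** 2))
        for xi in train_x
    ]
    total = sum(weights)  # assumed non-zero: the query is not far from all cases
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Complete cases: two observed features and the variable to be imputed.
train_x = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (4.0, 5.0)]
train_y = [10.0, 12.0, 20.0, 30.0]

# A narrow kernel makes the closest complete case dominate the estimate.
near_copy = grnn_predict((3.0, 3.0), train_x, train_y, sigma=0.1)
# A wider kernel blends several neighbours.
blended = grnn_predict((2.5, 2.0), train_x, train_y, sigma=1.0)
```

The width `sigma` controls how local the estimate is, and because the output is a weighted average it always lies within the range of the training targets.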
|
8 |
預測模型的遺失值處理─選值順序的研究 / Handling Missing Values in Predictive Model - Research of the Order of Data Acquisition
黃秋芸, Huang, Chiu Yun Unknown Date
Business intelligence is developing rapidly, and predictive models play a key role in many business intelligence tasks. However, when we extract hidden, previously unknown, and potentially useful information from large amounts of data, data quality problems often make the analysis difficult; missing values are an especially common difficulty in the data pre-processing phase. How to handle missing values effectively when building predictive models is therefore an important issue.
Many papers have addressed strategies for handling missing values. Research on Active Feature-Value Acquisition (AFA) in particular examines the order in which missing values in the training data should be acquired. The goal of AFA is to reduce the cost of achieving a desired model accuracy by identifying the instances for which obtaining complete information is most informative. Following the AFA concept, we present a new approach, I Sampling, in which feature-values are selected for acquisition based on the attribute at the top node of the current decision tree. We train and test the approach on real data and compare it with methods from the literature under different situations and missing-data patterns, to see whether the choice and order of acquisition affect classification accuracy and to understand the strengths, weaknesses, and applicability of each method.
Experimental results demonstrate that, in some situations, our approach can induce accurate models using substantially fewer feature-value acquisitions than alternative policies. The proposed method and its validation results can inform future predictive work in academia and business: practitioners can choose an acquisition method suited to the nature of their data, reducing acquisition cost and improving the classification ability of the predictive model.
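The idea of acquiring values for the attribute at the top node of the current decision tree can be sketched as follows. The abstract does not give the tree-building details, so this sketch assumes an ID3-style tree whose root is the feature with the highest information gain, computed only over rows where that feature is observed. The names `info_gain` and `acquisition_order` and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, f):
    """Information gain of feature f, using only rows where f is observed."""
    seen = [(r[f], y) for r, y in zip(rows, labels) if r[f] is not None]
    if not seen:
        return 0.0
    groups = {}
    for value, y in seen:
        groups.setdefault(value, []).append(y)
    remainder = sum(len(g) / len(seen) * entropy(g) for g in groups.values())
    return entropy([y for _, y in seen]) - remainder


def acquisition_order(rows, labels):
    """Return the root feature and the (row, feature) cells to acquire first."""
    top = max(range(len(rows[0])), key=lambda f: info_gain(rows, labels, f))
    return top, [(i, top) for i, r in enumerate(rows) if r[top] is None]


# Toy training data with None marking a missing feature-value.
rows = [
    ("sunny", "high"), ("sunny", None), ("rain", "high"),
    ("rain", "low"), (None, "low"), ("sunny", "low"),
]
labels = ["no", "no", "yes", "yes", "yes", "no"]
top_feature, cells = acquisition_order(rows, labels)
```

Here the first feature separates the classes almost perfectly, so its single missing cell is acquired first. After each acquisition the tree would be rebuilt, and the root attribute may change.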
|
9 |
Análise e predição de desembarque de characiformes migradores do município de Santarém-PA / Analysis and prediction of landings of migratory characiforms in the municipality of Santarém-PA
Santana, Isabela Feitosa 19 July 2009
Previous issue date: 2009-07-19 / Conselho Nacional de Desenvolvimento Científico e Tecnológico / Eleven-year historical series of landings of the species Prochilodus nigricans and Semaprochilodus sp., recorded from January 1992 to December 2002 in the municipality of Santarém-PA, were used for analysis and prediction, together with series of the SOI, SSTs, and hydrological levels of the Amazonas and Tapajós rivers. Unfortunately, the landing series for jaraquis and the water-level series for the Tapajós River contained missing values, which prevented analysis and prediction; Box & Jenkins modeling, however, made it possible to fill these gaps. After estimating the missing values, we carried out a spectral analysis of all the variables mentioned and found cycles related to the El Niño and La Niña phenomena, with durations of 2 to 7 years. These events strongly influenced the variation in river levels and, consequently, the landings of these species. We also noted increases in landings at periods of 2 to 3 years. These periods may be related to the occurrence of strong floods that probably led to reproductive success in these species, increasing catches 2 or 3 years later. Other oscillations were observed in landings and river levels, such as semi-annual and intraseasonal oscillations. These oscillations are known to have some influence on precipitation in the Amazon region and therefore on fishing, but more detailed studies are still needed to better understand their effect on the fishery for these species.
Box & Jenkins models were also used to model landings in 2003 and 2004, in order to assess the efficiency of this tool for prediction. Using error metrics for the predictions, we observed that ARIMA models are efficient for short- and medium-term prediction (12 months), with the model showing a good fit in the predictions for 2003 for both species.
|
10 |
An Evaluation of Protein Quantification Methods in Shotgun Proteomics and Applications in Multi-Omics
Gardner, Miranda Lynn January 2021
No description available.
|