1 |
預測模型中遺失值之選填順序研究 / Research of acquisition order of missing values in predictive model
施雲天, Unknown Date
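This study addresses the problem of acquiring missing values at a cost. It uses a decision tree as the predictive model, because a decision tree can accommodate missing values while it is being built, and it aims to find an acquisition method that reaches higher accuracy at lower cost. We extend the concept and logic of the Uncertainty Score from earlier work on Error Sampling and propose U-Sampling, which ranks features by importance. Whereas Error Sampling ranks by the importance of instances ("subjects", row-based), U-Sampling ranks by the importance of features (column-based).
We ran two sets of experiments on eight datasets from the UCI Machine Learning Repository: in one the training data contained a given proportion of missing values, and in the other the test data did. We then compared U-Sampling, Random Sampling, and the Error Sampling method from the prior literature in terms of accuracy and error reduction rate. The results show that when the training data contain missing values, U-Sampling performs better on more than 70% of the datasets; when the test data contain missing values, U-Sampling performs better on 87.5% of the datasets.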
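The contrast between row-based and column-based ordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not define the Uncertainty Score, so the `uncertainty` function below (one minus the margin between the two largest class probabilities) is a hypothetical stand-in, and the way column scores are aggregated is likewise assumed rather than taken from the thesis.

```python
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing feature value


def uncertainty(probs: List[float]) -> float:
    """Hypothetical score: 1 minus the margin between the two top class probabilities."""
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])


def rank_rows(class_probs: List[List[float]]) -> List[int]:
    """Row-based ordering: the most uncertain instances come first."""
    scores = [uncertainty(p) for p in class_probs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def rank_columns(rows: List[Row], class_probs: List[List[float]]) -> List[int]:
    """Column-based ordering (assumed aggregation): a feature scores higher
    when its missing cells sit in rows the model is uncertain about."""
    n_cols = len(rows[0])
    totals = [0.0] * n_cols
    for row, probs in zip(rows, class_probs):
        u = uncertainty(probs)
        for j, value in enumerate(row):
            if value is None:
                totals[j] += u
    return sorted(range(n_cols), key=lambda j: -totals[j])


# Illustrative data only: three instances, three features, two classes.
rows = [
    [1.0, None, 3.0],
    [None, None, 2.0],
    [0.5, 1.5, None],
]
class_probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]

row_order = rank_rows(class_probs)           # which instance to complete first
col_order = rank_columns(rows, class_probs)  # which feature to acquire first
```

A row-based policy would spend its budget on the instances listed in `row_order`, while a column-based policy would acquire the feature at the head of `col_order` wherever it is missing.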
2 |
預測模型的遺失值處理─選值順序的研究 / Handling Missing Values in Predictive Model - Research of the Order of Data Acquisition
黃秋芸, Huang, Chiu Yun, Unknown Date
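Business knowledge is advancing rapidly, and predictive models play an important role in many business intelligence applications. However, when extracting hidden, previously unknown, and potentially useful information from large volumes of data, we often run into data quality problems that make the analysis hard to begin. Missing values in particular are a common difficulty in the data preprocessing stage. How to handle missing values effectively when building a predictive model is therefore an important issue.
Many prior studies have addressed the handling of missing values. Among them, research on Active Feature-Value Acquisition examines in depth the order in which missing values in the training data should be acquired. The idea of Active Feature-Value Acquisition is to select, from training data that contain missing values, the appropriate missing entries to fill in, so that the predictive model reaches the desired accuracy as efficiently as possible. This study continues that line of research. It gives priority to the nodes of the decision tree when deciding the order in which missing values are acquired, and it proposes a new ordering method for missing values in the training data, I Sampling. We train and test the method on real data and compare it with methods proposed in the prior literature, to determine whether different choices of which values to acquire, and in what order, affect the classification accuracy of a predictive model, and to understand each method's strengths, weaknesses, and applicability in different settings.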
The new method and the validation results presented in this study can offer reference and guidance for future managerial or academic work that involves prediction: an acquisition strategy suited to the nature of the data can be chosen, saving acquisition cost and improving the classification ability of the predictive model. / Business intelligence is growing rapidly in importance, and predictive models play a key role in many business intelligence tasks. However, when extracting hidden and previously unknown information from data, how to handle missing values is a critical problem, especially in the data preprocessing phase. It is therefore important to identify which methods best deal with missing data when building predictive models.
Several papers have studied strategies for dealing with missing values. Work on Active Feature-Value Acquisition (AFA) in particular addresses the order in which feature values should be acquired. The goal of AFA is to reduce the cost of achieving a desired model accuracy by identifying the instances for which obtaining complete information is most informative. Following the AFA concept, we present an approach, I Sampling, in which feature values are selected for acquisition based on the attribute at the top node of the current decision tree. We also compare our approach with other methods under different conditions and missing-data patterns.
Experimental results demonstrate that, in some situations, our approach can induce accurate models using substantially fewer feature-value acquisitions than alternative policies. The proposed method can aid future predictive work in academia and business: practitioners can choose the method that fits their needs and obtain informative data efficiently.
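The idea of choosing acquisitions from the attribute at the top node of the current tree can be illustrated with a small Python sketch. It is not the thesis's implementation: the abstract does not say which tree learner or split criterion is used, so information gain computed over the known values of each categorical feature is an assumption here, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

Row = List[Optional[str]]  # None marks a missing feature value


def entropy(labels: List[str]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(column: List[Optional[str]], labels: List[str]) -> float:
    """Information gain of one categorical feature, using only rows where it is known."""
    known = [(v, y) for v, y in zip(column, labels) if v is not None]
    if not known:
        return 0.0
    groups = {}
    for v, y in known:
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([y for _, y in known]) - remainder


def root_attribute(rows: List[Row], labels: List[str]) -> int:
    """Index of the feature a tree would split on first (assumed criterion: information gain)."""
    n_cols = len(rows[0])
    gains = [info_gain([r[j] for r in rows], labels) for j in range(n_cols)]
    return max(range(n_cols), key=lambda j: gains[j])


def acquisition_candidates(rows: List[Row], labels: List[str]) -> Tuple[int, List[int]]:
    """Return the root feature and the rows whose value for it is still missing."""
    j = root_attribute(rows, labels)
    return j, [i for i, r in enumerate(rows) if r[j] is None]


# Illustrative data only: two categorical features, binary label.
rows = [
    ["sunny", "high"],
    ["sunny", None],
    ["rain", "high"],
    [None, "low"],
    ["rain", "low"],
    [None, "high"],
]
labels = ["no", "no", "yes", "yes", "yes", "no"]

root, to_acquire = acquisition_candidates(rows, labels)
```

In a full acquisition loop one would presumably fill in the selected cells, rebuild the tree, and repeat until the budget runs out; that loop is omitted because the abstract does not describe it.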
3 |
含遺失值之列聯表最大概似估計量及模式的探討 / Maximum Likelihood Estimation in Contingency Tables with Missing Data
黃珮菁, Huang, Pei-Ching, Unknown Date
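When categorical data contain missing values, the traditional approach is to discard the incomplete observations, but this is usually unwise: observations that lack some classification information can still provide other important information. Especially when such observations make up the majority of the data, discarding them may increase the variance of the estimates and may even affect the final decision. How to take these partially classified observations into account and produce a complete analysis has been an important topic in recent decades. This thesis reviews five methods for analyzing data of this kind. Four of them, the single-sample method, the multiple-sample method, factorization of the likelihood equations, and the EM algorithm, apply when the data are missing at random. The fifth is a method for data that are not missing at random. / Traditionally, observations for which some variables are missing, so that they cannot be cross-classified into a contingency table, are simply excluded from the analysis. However, it is generally agreed that this practice usually affects both the accuracy and the precision of the results. The purpose of this study is to bring together some of the sound alternatives available in the literature and to provide a comprehensive review. Four methods for handling data missing at random are discussed: the single-sample method, the multiple-sample method, factorization of the likelihood, and the EM algorithm. In addition, one way of handling data not missing at random is reviewed.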
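As a rough illustration of the EM approach mentioned above, the following Python sketch estimates cell probabilities for a two-way table when some observations are classified on only one variable. It is the standard textbook form of EM for partially classified counts under a missing-at-random assumption, not code from the thesis; the function name and all counts are made up for illustration, and the sketch assumes every cell has a positive fully classified count.

```python
from typing import List


def em_contingency(
    full: List[List[float]],   # full[i][j]: counts classified on both variables
    row_only: List[float],     # row_only[i]: counts with only the row variable observed
    col_only: List[float],     # col_only[j]: counts with only the column variable observed
    iters: int = 500,
) -> List[List[float]]:
    """Estimate cell probabilities by EM, assuming the data are missing at random."""
    n_rows, n_cols = len(full), len(full[0])
    total = sum(map(sum, full)) + sum(row_only) + sum(col_only)
    # Start from a uniform table.
    p = [[1.0 / (n_rows * n_cols)] * n_cols for _ in range(n_rows)]
    for _ in range(iters):
        row_tot = [sum(p[i]) for i in range(n_rows)]
        col_tot = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]
        # E-step: spread each partially classified count over the cells it
        # could belong to, in proportion to the current cell probabilities.
        expected = [
            [
                full[i][j]
                + row_only[i] * p[i][j] / row_tot[i]
                + col_only[j] * p[i][j] / col_tot[j]
                for j in range(n_cols)
            ]
            for i in range(n_rows)
        ]
        # M-step: re-estimate probabilities from the completed table.
        p = [[expected[i][j] / total for j in range(n_cols)] for i in range(n_rows)]
    return p


# Illustrative counts only.
full = [[20.0, 10.0], [5.0, 15.0]]
p_hat = em_contingency(full, row_only=[10.0, 20.0], col_only=[8.0, 12.0])

# With row-only supplements the pattern is monotone, and the EM estimate
# should agree with the closed form obtained by factoring the likelihood.
p_monotone = em_contingency(full, row_only=[10.0, 20.0], col_only=[0.0, 0.0])
```

The second call is a useful sanity check: for a monotone pattern the closed-form estimate is (row margin from all observations) × (conditional column proportion from the fully classified counts), which gives 1/3 for the top-left cell with these counts.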