Detection of Outliers with Data Transformation (變數轉換之離群值偵測)
吳秉勳 (David Wu)
Detecting outliers in regression is difficult when the data contain many of them. Classical residual analysis and diagnostic plots can then fail to reveal them, a phenomenon known as the masking effect. To avoid it, we identify such clustered outliers with a highly robust regression estimator, the least median of squares (LMS) estimator, which attains the maximal breakdown point of 50%. The algorithm used in this dissertation to compute the LMS estimator is the forward search algorithm. The estimator found by the forward search leads to rapid and efficient detection of multiple outliers. Furthermore, 100 forward searches, each started from a randomly selected subset of the data, yield an approximate robust estimator that is sufficient to reveal the clustered outliers correctly. Finally, the detected outliers are displayed in a stalactite plot, which shows a highly stable pattern.
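The repeated forward search described above can be sketched as follows. This is a minimal illustration and not the dissertation's implementation: it assumes simple regression with one predictor, and the function names, the simulated data, and the 2.5 cutoff on scaled residuals are all hypothetical choices made for the example.

```python
import random
from statistics import median

def fit_line(points):
    """Ordinary least squares fit y = a + b*x to a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx if sxx > 0 else 0.0  # fall back to a flat line if all x coincide
    return my - b * mx, b

def forward_search_lms(points, n_starts=100, seed=0):
    """Approximate the LMS fit by repeated forward searches from random subsets."""
    rng = random.Random(seed)
    n = len(points)
    best_crit, best_fit = float("inf"), None
    for _ in range(n_starts):
        subset = rng.sample(points, 2)  # two points determine a line
        for m in range(2, n + 1):
            a, b = fit_line(subset)
            r2 = [(y - a - b * x) ** 2 for x, y in points]
            crit = median(r2)  # LMS criterion: median squared residual
            if crit < best_crit:
                best_crit, best_fit = crit, (a, b)
            # Grow the subset: keep the m + 1 observations with smallest residuals.
            order = sorted(range(n), key=r2.__getitem__)
            subset = [points[i] for i in order[: m + 1]]
    return best_fit, best_crit

# Simulated data (assumption): 42 points on y = 1 + 2x plus a cluster of 18 outliers.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(42)]
good = [(x, 1 + 2 * x + rng.gauss(0, 0.3)) for x in xs]
bad = [(rng.uniform(8, 10), rng.gauss(2, 0.3)) for _ in range(18)]
data = good + bad

(a, b), crit = forward_search_lms(data)
scale = 1.4826 * crit ** 0.5  # rough robust scale from the median squared residual
flagged = [i for i, (x, y) in enumerate(data) if abs(y - a - b * x) > 2.5 * scale]
print(f"intercept={a:.2f} slope={b:.2f} flagged={len(flagged)}")
```

Each search keeps the fit with the smallest median squared residual seen along its path, so a search that starts inside the clean part of the data can supply the robust fit even when other starts are contaminated.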
For multivariate data, the Mahalanobis distance suffers from the same masking effect, which can be remedied by another highly robust estimator, the minimum volume ellipsoid (MVE) estimator. It also attains the maximal breakdown point of 50% and can likewise be computed with the forward search algorithm. The detected outliers are again displayed in a stalactite plot.
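To illustrate masking of the Mahalanobis distance and its remedy, the sketch below compares classical distances with distances based on an approximate MVE estimate found by forward searches. It is an assumption-laden example rather than the dissertation's code: it handles bivariate data only, the function names and simulated data are hypothetical, and the rescaling by the chi-square median is one common convention.

```python
import math
import random

def mean_cov(pts):
    """Sample mean and covariance (sxx, sxy, syy) of bivariate points."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(pts, center, cov):
    """Squared Mahalanobis distances using the closed-form 2x2 inverse."""
    (mx, my), (sxx, sxy, syy) = center, cov
    det = sxx * syy - sxy ** 2
    return [(syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
             + sxx * (y - my) ** 2) / det for x, y in pts]

def forward_search_mve(pts, n_starts=100, seed=0):
    """Approximate the MVE center and scatter by repeated forward searches."""
    rng = random.Random(seed)
    n, p = len(pts), 2
    h = (n + p + 1) // 2  # number of points the ellipsoid must cover
    best = (float("inf"), None, None)
    for _ in range(n_starts):
        subset = rng.sample(pts, p + 1)
        for m in range(p + 1, n + 1):
            center, cov = mean_cov(subset)
            det = cov[0] * cov[2] - cov[1] ** 2
            if det <= 0.0:
                break  # degenerate (collinear) subset, abandon this start
            d2 = mahalanobis_sq(pts, center, cov)
            order = sorted(range(n), key=d2.__getitem__)
            d2_h = d2[order[h - 1]]
            crit = det * d2_h ** p  # proportional to the squared ellipsoid volume
            if crit < best[0]:
                # Rescale so the h-th distance equals the chi-square(2) median.
                k = d2_h / (2 * math.log(2))
                best = (crit, center, tuple(c * k for c in cov))
            subset = [pts[i] for i in order[: m + 1]]
    return best[1], best[2]

# Simulated data (assumption): 40 standard normal points and a tight cluster of 15.
rng = random.Random(2)
good = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
cluster = [(rng.gauss(6, 0.2), rng.gauss(6, 0.2)) for _ in range(15)]
pts = good + cluster

cutoff = -2 * math.log(0.025)  # 97.5% quantile of chi-square with 2 df
classical_d2 = mahalanobis_sq(pts, *mean_cov(pts))
robust_d2 = mahalanobis_sq(pts, *forward_search_mve(pts))
print(sum(d > cutoff for d in classical_d2[40:]), "cluster points flagged classically")
print(sum(d > cutoff for d in robust_d2[40:]), "cluster points flagged robustly")
```

The classical mean and covariance are pulled toward the cluster, which shrinks the cluster's distances; the subset-based estimate is meant to avoid that contamination.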
The second part of this dissertation concerns the transformation of regression data so that the residuals are approximately normal and have more nearly constant variance, which benefits subsequent analysis. During the forward search, we monitor the score statistic and other related diagnostic statistics. Together they provide a wealth of information about the choice of the transformation parameter, and the progress of the search also shows how individual outliers influence that choice.
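As a rough illustration of choosing a transformation parameter, the sketch below assumes the Box-Cox power family, which the abstract does not name, and compares residual sums of squares of the normalized transformation over a small grid. This is a simpler stand-in for the score statistic monitored in the dissertation, and every name and data value here is hypothetical.

```python
import math
import random

def normalized_box_cox(y, lam):
    """Box-Cox transform scaled by the geometric mean so that residual sums
    of squares are comparable across values of lam."""
    g = math.exp(sum(math.log(v) for v in y) / len(y))  # geometric mean of y
    if lam == 0:
        return [g * math.log(v) for v in y]
    return [(v ** lam - 1) / (lam * g ** (lam - 1)) for v in y]

def rss_simple_regression(x, z):
    """Residual sum of squares from regressing z on x with an intercept."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    szz = sum((c - mz) ** 2 for c in z)
    return szz - sxz ** 2 / sxx

# Simulated data (assumption): log(y) is linear in x, so lam = 0 should fit best.
rng = random.Random(3)
x = [0.2 * i for i in range(1, 51)]
y = [math.exp(1 + 0.3 * xi + rng.gauss(0, 0.05)) for xi in x]

grid = [-1, -0.5, 0, 0.5, 1]
rss = {lam: rss_simple_regression(x, normalized_box_cox(y, lam)) for lam in grid}
best_lam = min(grid, key=rss.get)
print("selected lambda:", best_lam)
```

Minimizing this residual sum of squares is equivalent to maximizing the profile likelihood for the parameter under normal errors; recomputing it on the growing subsets of a forward search would show how individual observations shift the selected value.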