Spelling suggestions: "subject:"are under ROC curve""
1 |
Statistical Geocomputing: Spatial Outlier Detection in Precision AgricultureChu Su, Peter 29 September 2011 (has links)
The collection of crop yield data has become much easier with the introduction of technologies such as the Global Positioning System (GPS), ground-based yield sensors, and Geographic Information Systems (GIS). This explosive growth and widespread use of spatial data has challenged the ability to derive useful spatial knowledge. In addition, outlier detection as one important pre-processing step remains a challenge because the technique and the definition of spatial neighbourhood remain non-trivial, and the quantitative assessments of false positives, false negatives, and the concept of region outlier remain unexplored. The overall aim of this study is to evaluate different spatial outlier detection techniques in terms of their accuracy and computational efficiency, and examine the performance of these outlier removal techniques in a site-specific management context.
In a simulation study, unconditional sequential Gaussian simulation is performed to generate crop yield as the response variable along with two explanatory variables. Point and region spatial outliers are added to the simulated datasets by randomly selecting observations and adding or subtracting a Gaussian error term. With simulated data which contains known spatial outliers in advance, the assessment of spatial outlier techniques can be conducted as a binary classification exercise, treating each spatial outlier detection technique as a classifier. Algorithm performance is evaluated with the area and partial area under the ROC curve up to different true positive and false positive rates. Outlier effects in on-farm research are assessed in terms of the influence of each spatial outlier technique on coefficient estimates from a spatial regression model that accounts for autocorrelation.
Results indicate that for point outliers, spatial outlier techniques that account for spatial autocorrelation tend to be better than standard spatial outlier techniques in terms of higher sensitivity, lower false positive detection rate, and consistency in performance. They are also more resistant to changes in the neighbourhood definition. In terms of region outliers, standard techniques tend to be better than spatial autocorrelation techniques in all performance aspects because they are less affected by masking and swamping effects. In particular, one spatial autocorrelation technique, Averaged Difference, is superior to all other techniques in terms of both point and region outlier scenario because of its ability to incorporate spatial autocorrelation while at the same time, revealing the variation between nearest neighbours.
In terms of decision-making, all algorithms led to slightly different coefficient estimates, and therefore, may result in distinct decisions for site-specific management.
The results outlined here will allow an improved removal of crop yield data points that are potentially problematic. What has been determined here is the recommendation of using Averaged Difference algorithm for cleaning spatial outliers in yield dataset. Identifying the optimal nearest neighbour parameter for the neighbourhood aggregation function is still non-trivial. The recommendation is to specify a large number of nearest neighbours, large enough to capture the region size. Lastly, the unbiased coefficient estimates obtained with Average Difference suggest it is the better method for pre-processing spatial outliers in crop yield data, which underlines its suitability for detecting spatial outlier in the context of on-farm research.
|
2 |
Statistical Geocomputing: Spatial Outlier Detection in Precision AgricultureChu Su, Peter 29 September 2011 (has links)
The collection of crop yield data has become much easier with the introduction of technologies such as the Global Positioning System (GPS), ground-based yield sensors, and Geographic Information Systems (GIS). This explosive growth and widespread use of spatial data has challenged the ability to derive useful spatial knowledge. In addition, outlier detection as one important pre-processing step remains a challenge because the technique and the definition of spatial neighbourhood remain non-trivial, and the quantitative assessments of false positives, false negatives, and the concept of region outlier remain unexplored. The overall aim of this study is to evaluate different spatial outlier detection techniques in terms of their accuracy and computational efficiency, and examine the performance of these outlier removal techniques in a site-specific management context.
In a simulation study, unconditional sequential Gaussian simulation is performed to generate crop yield as the response variable along with two explanatory variables. Point and region spatial outliers are added to the simulated datasets by randomly selecting observations and adding or subtracting a Gaussian error term. With simulated data which contains known spatial outliers in advance, the assessment of spatial outlier techniques can be conducted as a binary classification exercise, treating each spatial outlier detection technique as a classifier. Algorithm performance is evaluated with the area and partial area under the ROC curve up to different true positive and false positive rates. Outlier effects in on-farm research are assessed in terms of the influence of each spatial outlier technique on coefficient estimates from a spatial regression model that accounts for autocorrelation.
Results indicate that for point outliers, spatial outlier techniques that account for spatial autocorrelation tend to be better than standard spatial outlier techniques in terms of higher sensitivity, lower false positive detection rate, and consistency in performance. They are also more resistant to changes in the neighbourhood definition. In terms of region outliers, standard techniques tend to be better than spatial autocorrelation techniques in all performance aspects because they are less affected by masking and swamping effects. In particular, one spatial autocorrelation technique, Averaged Difference, is superior to all other techniques in terms of both point and region outlier scenario because of its ability to incorporate spatial autocorrelation while at the same time, revealing the variation between nearest neighbours.
In terms of decision-making, all algorithms led to slightly different coefficient estimates, and therefore, may result in distinct decisions for site-specific management.
The results outlined here will allow an improved removal of crop yield data points that are potentially problematic. What has been determined here is the recommendation of using Averaged Difference algorithm for cleaning spatial outliers in yield dataset. Identifying the optimal nearest neighbour parameter for the neighbourhood aggregation function is still non-trivial. The recommendation is to specify a large number of nearest neighbours, large enough to capture the region size. Lastly, the unbiased coefficient estimates obtained with Average Difference suggest it is the better method for pre-processing spatial outliers in crop yield data, which underlines its suitability for detecting spatial outlier in the context of on-farm research.
|
3 |
Multiple hypothesis testing and multiple outlier identification methodsYin, Yaling 13 April 2010
Traditional multiple hypothesis testing procedures, such as that of Benjamini and Hochberg, fix an error rate and determine the corresponding rejection region. In 2002 Storey proposed a fixed rejection region procedure and showed numerically that it can gain more power than the fixed error rate procedure of Benjamini and Hochberg while controlling the same false discovery rate (FDR). In this thesis it is proved that when the number of alternatives is small compared to the total number of hypotheses, Storeys method can be less powerful than that of Benjamini and Hochberg. Moreover, the two procedures are compared by setting them to produce the same FDR. The difference in power between Storeys procedure and that of Benjamini and Hochberg is near zero when the distance between the null and alternative distributions is large, but Benjamini and Hochbergs procedure becomes more powerful as the distance decreases. It is shown that modifying the Benjamini and Hochberg procedure to incorporate an estimate of the proportion of true null hypotheses as proposed by Black gives a procedure with superior power.<p>
Multiple hypothesis testing can also be applied to regression diagnostics. In this thesis, a Bayesian method is proposed to test multiple hypotheses, of which the i-th null and alternative hypotheses are that the i-th observation is not an outlier versus it is, for i=1,...,m. In the proposed Bayesian model, it is assumed that outliers have a mean shift, where the proportion of outliers and the mean shift respectively follow a Beta prior distribution and a normal prior distribution. It is proved in the thesis that for the proposed model, when there exists more than one outlier, the marginal distributions of the deletion residual of the i-th observation under both null and alternative hypotheses are doubly noncentral t distributions. The outlyingness of the i-th observation is measured by the marginal posterior probability that the i-th observation is an outlier given its deletion residual. An importance sampling method is proposed to calculate this probability. This method requires the computation of the density of the doubly noncentral F distribution and this is approximated using Patnaiks approximation. An algorithm is proposed in this thesis to examine the accuracy of Patnaiks approximation. The comparison of this algorithms output with Patnaiks approximation shows that the latter can save massive computation time without losing much accuracy.<p>
The proposed Bayesian multiple outlier identification procedure is applied to some simulated data sets. Various simulation and prior parameters are used to study the sensitivity of the posteriors to the priors. The area under the ROC curves (AUC) is calculated for each combination of parameters. A factorial design analysis on AUC is carried out by choosing various simulation and prior parameters as factors. The resulting AUC values are high for various selected parameters, indicating that the proposed method can identify the majority of outliers within tolerable errors. The results of the factorial design show that the priors do not have much effect on the marginal posterior probability as long as the sample size is not too small.<p>
In this thesis, the proposed Bayesian procedure is also applied to a real data set obtained by Kanduc et al. in 2008. The proteomes of thirty viruses examined by Kanduc et al. are found to share a high number of pentapeptide overlaps to the human proteome. In a linear regression analysis of the level of viral overlaps to the human proteome and the length of viral proteome, it is reported by Kanduc et al. that among the thirty viruses, human T-lymphotropic virus 1, Rubella virus, and hepatitis C virus, present relatively higher levels of overlaps with the human proteome than the predicted level of overlaps. The results obtained using the proposed procedure indicate that the four viruses with extremely large sizes (Human herpesvirus 4, Human herpesvirus 6, Variola virus, and Human herpesvirus 5) are more likely to be the outliers than the three reported viruses. The results with thefour extreme viruses deleted confirm the claim of Kanduc et al.
|
4 |
Multiple hypothesis testing and multiple outlier identification methodsYin, Yaling 13 April 2010 (has links)
Traditional multiple hypothesis testing procedures, such as that of Benjamini and Hochberg, fix an error rate and determine the corresponding rejection region. In 2002 Storey proposed a fixed rejection region procedure and showed numerically that it can gain more power than the fixed error rate procedure of Benjamini and Hochberg while controlling the same false discovery rate (FDR). In this thesis it is proved that when the number of alternatives is small compared to the total number of hypotheses, Storeys method can be less powerful than that of Benjamini and Hochberg. Moreover, the two procedures are compared by setting them to produce the same FDR. The difference in power between Storeys procedure and that of Benjamini and Hochberg is near zero when the distance between the null and alternative distributions is large, but Benjamini and Hochbergs procedure becomes more powerful as the distance decreases. It is shown that modifying the Benjamini and Hochberg procedure to incorporate an estimate of the proportion of true null hypotheses as proposed by Black gives a procedure with superior power.<p>
Multiple hypothesis testing can also be applied to regression diagnostics. In this thesis, a Bayesian method is proposed to test multiple hypotheses, of which the i-th null and alternative hypotheses are that the i-th observation is not an outlier versus it is, for i=1,...,m. In the proposed Bayesian model, it is assumed that outliers have a mean shift, where the proportion of outliers and the mean shift respectively follow a Beta prior distribution and a normal prior distribution. It is proved in the thesis that for the proposed model, when there exists more than one outlier, the marginal distributions of the deletion residual of the i-th observation under both null and alternative hypotheses are doubly noncentral t distributions. The outlyingness of the i-th observation is measured by the marginal posterior probability that the i-th observation is an outlier given its deletion residual. An importance sampling method is proposed to calculate this probability. This method requires the computation of the density of the doubly noncentral F distribution and this is approximated using Patnaiks approximation. An algorithm is proposed in this thesis to examine the accuracy of Patnaiks approximation. The comparison of this algorithms output with Patnaiks approximation shows that the latter can save massive computation time without losing much accuracy.<p>
The proposed Bayesian multiple outlier identification procedure is applied to some simulated data sets. Various simulation and prior parameters are used to study the sensitivity of the posteriors to the priors. The area under the ROC curves (AUC) is calculated for each combination of parameters. A factorial design analysis on AUC is carried out by choosing various simulation and prior parameters as factors. The resulting AUC values are high for various selected parameters, indicating that the proposed method can identify the majority of outliers within tolerable errors. The results of the factorial design show that the priors do not have much effect on the marginal posterior probability as long as the sample size is not too small.<p>
In this thesis, the proposed Bayesian procedure is also applied to a real data set obtained by Kanduc et al. in 2008. The proteomes of thirty viruses examined by Kanduc et al. are found to share a high number of pentapeptide overlaps to the human proteome. In a linear regression analysis of the level of viral overlaps to the human proteome and the length of viral proteome, it is reported by Kanduc et al. that among the thirty viruses, human T-lymphotropic virus 1, Rubella virus, and hepatitis C virus, present relatively higher levels of overlaps with the human proteome than the predicted level of overlaps. The results obtained using the proposed procedure indicate that the four viruses with extremely large sizes (Human herpesvirus 4, Human herpesvirus 6, Variola virus, and Human herpesvirus 5) are more likely to be the outliers than the three reported viruses. The results with thefour extreme viruses deleted confirm the claim of Kanduc et al.
|
5 |
多標記接受者操作特徵曲線下部分面積最佳線性組合之研究 / The study on the optimal linear combination of markers based on the partial area under the ROC curve許嫚荏, Hsu, Man Jen Unknown Date (has links)
本論文的研究目標是建構一個由多標記複合成的最佳疾病診斷工具,所考慮的評估準則為操作者特徵曲線在特定特異度範圍之線下面積(pAUC)。在常態分布假設下,我們推導多標記線性組合之pAUC以及最佳線性組合之必要條件。由於函數本身過於複雜使得計算困難。除此之外,我們也發現其最佳解可能不唯一,以及局部極值存在,這些情況使得現有演算法的運用受限,我們因此提出多重初始值演算法。當母體參數未知時,我們利用最大概似估計量以獲得樣本pAUC以及令其極大化之最佳線性組合,並證明樣本最佳線性組合將一致性地收斂到母體最佳線性組合。在進一步的研究中,我們針對單標記的邊際判別能力、多標記的複合判別能力以及個別標記的條件判別能力,分別提出相關統計檢定方法。這些統計檢定被運用至兩個標記選取的方法,分別是前進選擇法與後退淘汰法。我們運用這些方法以選取與疾病檢測有顯著相關的標記。本論文透過模擬研究來驗證所提出的演算法、統計檢定方法以及標記選取的方法。另外,也將這些方法運用在數組實際資料上。 / The aim of this work is to construct a composite diagnostic
tool based on multiple biomarkers under the criterion of
the partial area under a ROC curve (pAUC) for a predetermined specificity range. Recently several studies are interested in the optimal linear combination maximizing the whole area under a ROC curve (AUC). In this study, we focus on finding the optimal linear combination by a direct maximization of the pAUC under normal assumption. In order
to find an analytic solution, the first derivative of the
pAUC is derived. The form is so complicated, that a further validation on the Hessian matrix is difficult. In addition,
we find that the pAUC maximizer may not be unique and sometimes, local maximizers exist. As a result, the existing algorithms, which depend on the initial-point, are inadequate to serve our needs. We propose a new algorithm by
adopting several initial points at one time. In addition,
when the population parameters are unknown and only a
random sample data set is available, the maximizer of the sample version of the pAUC is shown to be a strong consistent estimator of its theoretical counterpart. We further focus on determining whether a biomarker set, or one specific biomarker has a significant contribution to the disease diagnosis. We propose three statistical tests for the identification of the discriminatory power. The proposed tests are applied to biomarker selection for reducing the variable number in advanced analysis. Numerical studies are performed to validate the proposed algorithm and the proposed statistical procedures.
|
Page generated in 0.0961 seconds