The collection of crop yield data has become much easier with the introduction of technologies such as the Global Positioning System (GPS), ground-based yield sensors, and Geographic Information Systems (GIS). The explosive growth and widespread use of spatial data have made it harder to derive useful spatial knowledge. In addition, outlier detection, an important pre-processing step, remains a challenge: the choice of technique and the definition of spatial neighbourhood are non-trivial, and the quantitative assessment of false positives and false negatives, along with the concept of a region outlier, remains unexplored. The overall aim of this study is to evaluate different spatial outlier detection techniques in terms of their accuracy and computational efficiency, and to examine the performance of these outlier removal techniques in a site-specific management context.
In a simulation study, unconditional sequential Gaussian simulation is used to generate crop yield as the response variable along with two explanatory variables. Point and region spatial outliers are added to the simulated datasets by randomly selecting observations and adding or subtracting a Gaussian error term. Because the spatial outliers in the simulated data are known in advance, the assessment can be conducted as a binary classification exercise that treats each spatial outlier detection technique as a classifier. Algorithm performance is evaluated with the area and partial area under the ROC curve up to different true positive and false positive rates. Outlier effects in on-farm research are assessed in terms of the influence of each spatial outlier technique on coefficient estimates from a spatial regression model that accounts for autocorrelation.
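The evaluation idea described above can be sketched in a few lines. This is a minimal illustration, not the code used in the thesis: the function names, the shift parameters, and the stand-in detector (absolute deviation from the median) are all assumptions, and the "yield" values are plain independent Gaussian draws rather than the output of a sequential Gaussian simulation.

```python
import numpy as np

def inject_point_outliers(values, n_outliers, shift_mean, shift_sd, rng):
    """Add or subtract a Gaussian term at randomly chosen observations.

    shift_mean and shift_sd are hypothetical parameters chosen for illustration.
    Returns the contaminated values and a boolean mask of the true outliers.
    """
    values = np.asarray(values, dtype=float).copy()
    idx = rng.choice(values.size, size=n_outliers, replace=False)
    sign = rng.choice([-1.0, 1.0], size=n_outliers)
    values[idx] += sign * rng.normal(shift_mean, shift_sd, size=n_outliers)
    labels = np.zeros(values.size, dtype=bool)
    labels[idx] = True
    return values, labels

def roc_curve_points(labels, scores):
    """Sweep a threshold over the scores and return (fpr, tpr) arrays."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    sorted_scores = scores[order]
    tp = np.cumsum(sorted_labels)
    fp = np.cumsum(~sorted_labels)
    # Keep one ROC point per distinct score so ties are handled together.
    last = np.r_[np.diff(sorted_scores) != 0, True]
    tpr = np.r_[0.0, tp[last] / labels.sum()]
    fpr = np.r_[0.0, fp[last] / (~labels).sum()]
    return fpr, tpr

def partial_auc(fpr, tpr, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for i in range(1, len(fpr)):
        x0, x1, y0, y1 = fpr[i - 1], fpr[i], tpr[i - 1], tpr[i]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Usage: independent Gaussian draws stand in for simulated yield (assumption).
rng = np.random.default_rng(0)
clean = rng.normal(5.0, 0.5, size=500)
contaminated, is_outlier = inject_point_outliers(clean, 25, 3.0, 0.5, rng)
scores = np.abs(contaminated - np.median(contaminated))  # stand-in detector
fpr, tpr = roc_curve_points(is_outlier, scores)
print("AUC:", partial_auc(fpr, tpr))
print("partial AUC up to FPR 0.1:", partial_auc(fpr, tpr, max_fpr=0.1))
```

Any detector that returns one score per observation can be substituted for the stand-in, which is what makes the comparison across techniques a classification exercise.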
Results indicate that for point outliers, spatial outlier techniques that account for spatial autocorrelation tend to outperform standard spatial outlier techniques, with higher sensitivity, a lower false positive detection rate, and more consistent performance. They are also more resistant to changes in the neighbourhood definition. For region outliers, standard techniques tend to outperform spatial autocorrelation techniques in all performance aspects because they are less affected by masking and swamping effects. One spatial autocorrelation technique in particular, Averaged Difference, is superior to all other techniques in both the point and region outlier scenarios because it incorporates spatial autocorrelation while also revealing the variation between nearest neighbours.
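To make the role of the neighbourhood definition concrete, the following sketch computes a generic nearest-neighbour difference score. It is not the Averaged Difference algorithm evaluated in the thesis, whose details are not given here; the function name, the brute-force distance computation, and the default `k` are assumptions made for illustration.

```python
import numpy as np

def knn_difference_scores(coords, values, k=8):
    """Score each point by how far its value departs from its k nearest neighbours.

    Generic illustration only: each value is compared with the mean of its
    k nearest neighbours, and the differences are standardised across points.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Brute-force pairwise squared distances; fine for small illustrative data.
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist2, axis=1)[:, :k]
    local_diff = values - values[neighbours].mean(axis=1)
    return np.abs(local_diff - local_diff.mean()) / local_diff.std()

# Usage: a smooth surface on a 10 x 10 grid with one contaminated observation.
xs, ys = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([xs.ravel(), ys.ravel()])
values = coords[:, 0] + coords[:, 1]
values[55] += 50.0  # hypothetical point outlier
scores = knn_difference_scores(coords, values, k=8)
print("highest-scoring observation:", int(np.argmax(scores)))
```

Changing `k` changes which observations count as neighbours, so the same data can yield different scores under different neighbourhood definitions.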
In terms of decision-making, the algorithms led to slightly different coefficient estimates and may therefore result in distinct decisions for site-specific management.
The results outlined here will allow improved removal of potentially problematic crop yield data points. Based on these results, the Averaged Difference algorithm is recommended for cleaning spatial outliers in yield datasets. Identifying the optimal nearest neighbour parameter for the neighbourhood aggregation function is still non-trivial. The recommendation is to specify a number of nearest neighbours large enough to capture the region size. Lastly, the unbiased coefficient estimates obtained with Averaged Difference suggest that it is the better method for pre-processing spatial outliers in crop yield data, which underlines its suitability for detecting spatial outliers in the context of on-farm research.
Identifier | oai:union.ndltd.org:WATERLOO/oai:uwspace.uwaterloo.ca:10012/6347 |
Date | 29 September 2011 |
Creators | Chu Su, Peter |
Source Sets | University of Waterloo Electronic Theses Repository |
Language | English |
Detected Language | English |
Type | Thesis or Dissertation |