Knowledge discovery on spatio-temporal datasets has attracted<br />growing interests. Recent advances on remote sensing technology mean<br />that massive amounts of spatio-temporal data are being collected,<br />and its volume keeps increasing at an ever faster pace. It becomes<br />critical to design efficient algorithms for identifying novel and<br />meaningful patterns from massive spatio-temporal datasets. Different<br />from the other data sources, this data exhibits significant<br />space-time statistical dependence, and the assumption of i.i.d. is<br />no longer valid. The exact modeling of space-time dependence will<br />render the exponential growth of model complexity as the data size<br />increases. This research focuses on the construction of efficient<br />and effective approaches using approximate inference techniques for<br />three main mining tasks, including spatial outlier detection, robust<br />spatio-temporal prediction, and novel applications to real world<br />problems.<br /><br />Spatial novelty patterns, or spatial outliers, are those data points<br />whose characteristics are markedly different from their spatial<br />neighbors. There are two major branches of spatial outlier detection<br />methodologies, which can be either global Kriging based or local<br />Laplacian smoothing based. The former approach requires the exact<br />modeling of spatial dependence, which is time extensive; and the<br />latter approach requires the i.i.d. assumption of the smoothed<br />observations, which is not statistically solid. These two approaches<br />are constrained to numerical data, but in real world applications we<br />are often faced with a variety of non-numerical data types, such as<br />count, binary, nominal, and ordinal. To summarize, the main research<br />challenges are: 1) how much spatial dependence can be eliminated via<br />Laplace smoothing; 2) how to effectively and efficiently detect<br />outliers for large numerical spatial datasets; 3) how to generalize<br />numerical detection methods and develop a unified outlier detection<br />framework suitable for large non-numerical datasets; 4) how to<br />achieve accurate spatial prediction even when the training data has<br />been contaminated by outliers; 5) how to deal with spatio-temporal<br />data for the preceding problems.<br /><br />To address the first and second challenges, we mathematically<br />validated the effectiveness of Laplacian smoothing on the<br />elimination of spatial autocorrelations. This work provides<br />fundamental support for existing Laplacian smoothing based methods.<br />We also discovered a nontrivial side-effect of Laplacian smoothing,<br />which ingests additional spatial variations to the data due to<br />convolution effects. To capture this extra variability, we proposed<br />a generalized local statistical model, and designed two fast forward<br />and backward outlier detection methods that achieve a better balance<br />between computational efficiency and accuracy than most existing<br />methods, and are well suited to large numerical spatial datasets.<br /><br />We addressed the third challenge by mapping non-numerical variables<br />to latent numerical variables via a link function, such as logit<br />function used in logistic regression, and then utilizing<br />error-buffer artificial variables, which follow a Student-t<br />distribution, to capture the large valuations caused by outliers. We<br />proposed a unified statistical framework, which integrates the<br />advantages of spatial generalized linear mixed model, robust spatial<br />linear model, reduced-rank dimension reduction, and Bayesian<br />hierarchical model. A linear-time approximate inference algorithm<br />was designed to infer the posterior distribution of the error-buffer<br />artificial variables conditioned on observations. We demonstrated<br />that traditional numerical outlier detection methods can be directly<br />applied to the estimated artificial variables for outliers<br />detection. To the best of our knowledge, this is the first<br />linear-time outlier detection algorithm that supports a variety of<br />spatial attribute types, such as binary, count, ordinal, and<br />nominal.<br /><br />To address the fourth and fifth challenges, we proposed a robust<br />version of the Spatio-Temporal Random Effects (STRE) model, namely<br />the Robust STRE (R-STRE) model. The regular STRE model is a recently<br />proposed statistical model for large spatio-temporal data that has a<br />linear order time complexity, but is not best suited for<br />non-Gaussian and contaminated datasets. This deficiency can be<br />systemically addressed by increasing the robustness of the model<br />using heavy-tailed distributions, such as the Huber, Laplace, or<br />Student-t distribution to model the measurement error, instead of<br />the traditional Gaussian. However, the resulting R-STRE model<br />becomes analytical intractable, and direct application of<br />approximate inferences techniques still has a cubic order time<br />complexity. To address the computational challenge, we reformulated<br />the prediction problem as a maximum a posterior (MAP) problem with a<br />non-smooth objection function, transformed it to a equivalent<br />quadratic programming problem, and developed an efficient<br />interior-point numerical algorithm with a near linear order<br />complexity. This work presents the first near linear time robust<br />prediction approach for large spatio-temporal datasets in both<br />offline and online cases. / Ph. D.
Identifer | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/19220 |
Date | 21 January 2013 |
Creators | Chen, Feng |
Contributors | Computer Science, Lu, Chang-Tien, Ramakrishnan, Naren, Wang, Yue J., Lou, Wenjing, Chen, Ing-Ray |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |
Page generated in 0.005 seconds