
Efficient Algorithms for Mining Large Spatio-Temporal Data

Knowledge discovery on spatio-temporal datasets has attracted growing interest. Recent advances in remote sensing technology mean that massive amounts of spatio-temporal data are being collected, and the volume keeps increasing at an ever faster pace. It has become critical to design efficient algorithms for identifying novel and meaningful patterns in massive spatio-temporal datasets. Unlike other data sources, such data exhibit significant space-time statistical dependence, so the i.i.d. assumption is no longer valid. Exact modeling of space-time dependence leads to exponential growth in model complexity as the data size increases. This research focuses on constructing efficient and effective approaches, based on approximate inference techniques, for three main mining tasks: spatial outlier detection, robust spatio-temporal prediction, and novel applications to real-world problems.

Spatial novelty patterns, or spatial outliers, are data points whose characteristics differ markedly from those of their spatial neighbors. There are two major branches of spatial outlier detection methodologies: global Kriging-based methods and local Laplacian-smoothing-based methods. The former require exact modeling of spatial dependence, which is computationally expensive; the latter require the smoothed observations to be i.i.d., which is not statistically sound. Both approaches are also restricted to numerical data, whereas real-world applications often involve a variety of non-numerical data types, such as count, binary, nominal, and ordinal. In summary, the main research challenges are: 1) how much spatial dependence can be eliminated via Laplacian smoothing; 2) how to effectively and efficiently detect outliers in large numerical spatial datasets; 3) how to generalize numerical detection methods into a unified outlier detection framework suitable for large non-numerical datasets; 4) how to achieve accurate spatial prediction even when the training data have been contaminated by outliers; and 5) how to handle spatio-temporal data in the preceding problems.

To address the first and second challenges, we mathematically validated the effectiveness of Laplacian smoothing in eliminating spatial autocorrelation. This work provides fundamental support for existing Laplacian-smoothing-based methods. We also discovered a nontrivial side effect of Laplacian smoothing: it injects additional spatial variation into the data due to convolution effects. To capture this extra variability, we proposed a generalized local statistical model and designed two fast forward and backward outlier detection methods that achieve a better balance between computational efficiency and accuracy than most existing methods and are well suited to large numerical spatial datasets.
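As a simplified illustration of the local, Laplacian-smoothing-based family of detectors discussed above, the following Python sketch flags points whose deviation from their spatial-neighborhood mean is unusually large. It is a minimal baseline under assumed inputs (an adjacency matrix `W` and a z-score threshold), not the forward/backward algorithms proposed in the dissertation.

```python
import numpy as np

def local_smoothing_outliers(y, W, threshold=3.0):
    """Flag spatial outliers by comparing each value to its neighborhood mean.

    y: (n,) numerical attribute values; W: (n, n) binary spatial adjacency
    matrix; threshold: z-score cutoff. Illustrative baseline only.
    """
    y = np.asarray(y, dtype=float)
    W = np.asarray(W, dtype=float)

    # Neighborhood mean via row-normalized adjacency (Laplacian-smoothing style).
    neighbor_mean = (W @ y) / np.maximum(W.sum(axis=1), 1.0)

    # Standardize the smoothed residuals and flag large deviations.
    residual = y - neighbor_mean
    z = (residual - residual.mean()) / residual.std()
    return np.where(np.abs(z) > threshold)[0]
```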
We addressed the third challenge by mapping non-numerical variables to latent numerical variables via a link function, such as the logit function used in logistic regression, and then utilizing error-buffer artificial variables, which follow a Student-t distribution, to capture the large deviations caused by outliers. We proposed a unified statistical framework that integrates the advantages of the spatial generalized linear mixed model, the robust spatial linear model, reduced-rank dimension reduction, and Bayesian hierarchical modeling. A linear-time approximate inference algorithm was designed to infer the posterior distribution of the error-buffer artificial variables conditioned on the observations. We demonstrated that traditional numerical outlier detection methods can be applied directly to the estimated artificial variables for outlier detection. To the best of our knowledge, this is the first linear-time outlier detection algorithm that supports a variety of spatial attribute types, such as binary, count, ordinal, and nominal.

To address the fourth and fifth challenges, we proposed a robust version of the Spatio-Temporal Random Effects (STRE) model, namely the Robust STRE (R-STRE) model. The regular STRE model is a recently proposed statistical model for large spatio-temporal data with linear time complexity, but it is not well suited to non-Gaussian and contaminated datasets. This deficiency can be systematically addressed by increasing the robustness of the model, using heavy-tailed distributions, such as the Huber, Laplace, or Student-t distribution, to model the measurement error instead of the traditional Gaussian. However, the resulting R-STRE model becomes analytically intractable, and direct application of approximate inference techniques still has cubic time complexity. To address this computational challenge, we reformulated the prediction problem as a maximum a posteriori (MAP) problem with a non-smooth objective function, transformed it into an equivalent quadratic programming problem, and developed an efficient interior-point numerical algorithm with near-linear complexity. This work presents the first near-linear-time robust prediction approach for large spatio-temporal datasets in both offline and online settings. / Ph. D.
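The error-buffer construction for non-numerical data described above can be pictured with a small generative sketch: each site carries a latent numerical value, a heavy-tailed Student-t buffer term absorbs outlying deviations, and a link function (here a log link for count data) connects the sum to the observations. The parameterization below is an assumed toy version for illustration, not the dissertation's exact hierarchical model or its linear-time inference procedure.

```python
import numpy as np
from scipy import stats

def simulate_counts_with_error_buffers(mu_latent, df=3.0, buffer_scale=1.0, seed=None):
    """Toy generative sketch of the error-buffer idea for count-valued data.

    mu_latent: latent numerical intensities per site; delta: Student-t
    error buffers that absorb outlying deviations; observed counts are
    Poisson with a log link applied to mu_latent + delta.
    """
    rng = np.random.default_rng(seed)
    n = len(mu_latent)
    delta = stats.t.rvs(df, scale=buffer_scale, size=n, random_state=rng)
    rate = np.exp(np.asarray(mu_latent, dtype=float) + delta)  # inverse log link
    counts = rng.poisson(rate)
    return counts, delta
```

In the framework summarized above, the posterior of the buffer variables given the observations is what the linear-time approximate inference targets; a numerical detector such as the one sketched earlier can then be run on the estimated buffers.

To make the robust-MAP formulation concrete, the sketch below minimizes a Huber measurement-error loss plus a Gaussian prior penalty with a generic quasi-Newton solver. It only illustrates the shape of the non-smooth objective; the dissertation instead recasts such problems as an equivalent quadratic program and solves it with a specialized near-linear-time interior-point method, which this toy solver does not reproduce.

```python
import numpy as np
from scipy.optimize import minimize

def robust_map_predict(y, H, prior_prec, delta=1.0):
    """Toy robust MAP estimate: sum_i huber(y_i - (H x)_i) + 0.5 * x' P x.

    y: observations; H: observation matrix; prior_prec: prior precision P
    on the latent state x; delta: Huber transition point. Illustrative only.
    """
    def objective(x):
        r = y - H @ x
        # Huber loss: quadratic near zero, linear in the tails (robust to outliers).
        huber = np.where(np.abs(r) <= delta,
                         0.5 * r ** 2,
                         delta * (np.abs(r) - 0.5 * delta))
        return huber.sum() + 0.5 * x @ prior_prec @ x

    x0 = np.zeros(H.shape[1])
    return minimize(objective, x0, method="L-BFGS-B").x
```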

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/19220
Date: 21 January 2013
Creators: Chen, Feng
Contributors: Computer Science, Lu, Chang-Tien, Ramakrishnan, Naren, Wang, Yue J., Lou, Wenjing, Chen, Ing-Ray
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Detected Language: English
Type: Dissertation
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
