1 |
The wild bootstrap resampling in regression imputation algorithm with a Gaussian Mixture Model
Mat Jasin, A., Neagu, Daniel, Csenki, Attila, 08 July 2018
Unsupervised learning of a finite Gaussian mixture model (FGMM) is used to learn the distribution of population data. This paper proposes the use of wild bootstrapping to create variability in the imputed data in single missing-data imputation. We compare the performance and accuracy of the proposed method in single imputation against multiple imputation from the R package Amelia II using RMSE, R-squared, MAE and MAPE. The proposed method shows better performance when compared with multiple imputation (MI), which is widely regarded as the gold standard among missing data imputation techniques.
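As an illustration of the idea, here is a minimal sketch of single regression imputation with wild-bootstrap residuals, assuming a plain linear model fitted to the complete cases (the paper couples this step with the FGMM, which is omitted here); the Rademacher weights and the residual resampling scheme are illustrative assumptions, not necessarily the authors' exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def wild_bootstrap_impute(X_obs, y_obs, X_mis):
    """Single imputation: regression prediction plus a wild-bootstrap residual.

    Sketch only: the paper embeds this step in a Gaussian mixture model
    of the population, omitted here.
    """
    # Fit OLS on the fully observed cases.
    Xd = np.column_stack([np.ones(len(X_obs)), X_obs])
    beta, *_ = np.linalg.lstsq(Xd, y_obs, rcond=None)
    resid = y_obs - Xd @ beta

    # Wild bootstrap step: resample residuals and flip their signs with
    # Rademacher weights (+1/-1 with equal probability), preserving scale
    # while randomizing direction.
    e_star = rng.choice(resid, size=len(X_mis)) * rng.choice([-1.0, 1.0], size=len(X_mis))

    # Impute: prediction for the incomplete rows plus the perturbed residual.
    Xm = np.column_stack([np.ones(len(X_mis)), X_mis])
    return Xm @ beta + e_star
```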
2 |
Missing imputation methods explored in big data analytics
Brydon, Humphrey Charles, January 2018
Philosophiae Doctor - PhD (Statistics and Population Studies)
The aim of this study is to examine the methods and processes involved in imputing missing data and, more specifically, complete missing blocks of data. A further aim is to examine the effect that the imputed data have on the accuracy of various predictive models constructed on those data, and hence to determine whether the imputation method involved is suitable.
Identifying the missingness mechanism present in the data should be the first step in selecting an imputation method. Selecting a suitable method is easier if the mechanism can be identified as one of the following: missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR).
Predictive models are shown to be less accurate when constructed on data sets that employed a hot-deck imputation method. Data sets imputed with either single or multiple Markov chain Monte Carlo (MCMC) or Fully Conditional Specification (FCS) methods are shown to yield more accurate predictive models.
The addition of an iterative bagging technique to the modelling procedure is shown to produce highly accurate prediction estimates. The bagging technique is applied to variants of neural network, decision tree and multiple linear regression (MLR) modelling procedures. A stochastic gradient boosted decision tree (SGBT) is also constructed as a comparison to the bagged decision tree.
Final models are constructed from 200 iterations of the various modelling procedures using a 60% sampling ratio in the bagging procedure. It is further shown that, under certain conditions, adding the bagging technique to the MLR modelling procedure can produce an MLR model more accurate than the other, more advanced modelling procedures.
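As a sketch of the bagging setup just described (200 iterations, 60% sampling ratio, applied here to MLR), assuming scikit-learn's LinearRegression; the function name is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bagged_mlr_predict(X, y, X_new, n_iter=200, ratio=0.6, seed=0):
    """Bagged MLR: average the predictions of models fit on repeated
    60% bootstrap subsamples (iteration count and sampling ratio follow
    the study's settings)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((n_iter, len(X_new)))
    for b in range(n_iter):
        # Draw a 60% subsample with replacement and fit one base model.
        idx = rng.choice(n, size=int(ratio * n), replace=True)
        preds[b] = LinearRegression().fit(X[idx], y[idx]).predict(X_new)
    return preds.mean(axis=0)  # the bagged prediction
```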
The evaluation of the predictive models constructed on imputed data is shown to vary with the type of fit statistic used. The average squared error reports little difference in accuracy levels, whereas the Mean Absolute Prediction Error (MAPE) magnifies the differences in the reported prediction errors. The Normalized Mean Bias Error (NMBE) results show that all predictive models produced estimates that were over-predictions, although these varied with the data set and modelling procedure used.
The Nash-Sutcliffe efficiency (NSE) was used as a comparison statistic for the accuracy of the predictive models in the context of imputed data. The NSE statistic showed that estimates from models constructed on data sets imputed with a multiple imputation method were highly accurate, whereas estimates from models constructed on the hot-deck imputed data were inaccurate; a mean substitution based on the fully observed data would have been a better method of imputation.
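For concreteness, the fit statistics discussed above can be computed as follows; the normalization and sign conventions (e.g., dividing NMBE by the observed mean, with positive NMBE meaning over-prediction) are common choices and may differ in detail from the study's:

```python
import numpy as np

def ase(y, yhat):
    """Average squared error."""
    return np.mean((y - yhat) ** 2)

def mape(y, yhat):
    """Mean absolute prediction error, as a percentage of the observed values."""
    return 100.0 * np.mean(np.abs(y - yhat) / np.abs(y))

def nmbe(y, yhat):
    """Normalized mean bias error; positive means over-prediction here."""
    return 100.0 * np.mean(yhat - y) / np.mean(y)

def nse(y, yhat):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 is no better than
    predicting the observed mean, negative is worse than the mean."""
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)
```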
The conclusion reached in this study is that the choice of imputation method, as well as that of the predictive model, depends on the data used. Four unique combinations of imputation method and modelling procedure were identified for the data considered in this study.
3 |
Anomaly detection in unknown environments using wireless sensor networks
Li, YuanYuan, 01 May 2010
This dissertation addresses the problem of distributed anomaly detection in Wireless Sensor Networks (WSN). A challenge of designing such systems is that the sensor nodes are battery powered, often have different capabilities and generally operate in dynamic environments. Programming such sensor nodes at a large scale can be a tedious job if the system is not carefully designed. Data modeling in distributed systems is important for determining the normal operation mode of the system. Being able to model the expected sensor signatures for typical operations greatly simplifies the human designer’s job by enabling the system to autonomously characterize the expected sensor data streams. This, in turn, allows the system to perform autonomous anomaly detection to recognize when unexpected sensor signals are detected. This type of distributed sensor modeling can be used in a wide variety of sensor networks, such as detecting the presence of intruders, detecting sensor failures, and so forth. The advantage of this approach is that the human designer does not have to characterize the anomalous signatures in advance.
The contributions of this approach include: (1) providing a way for a WSN to autonomously model sensor data with no prior knowledge of the environment; (2) enabling a distributed system to detect anomalies in both sensor signals and temporal events online; (3) providing a way to automatically extract semantic labels from temporal sequences; (4) providing a way for WSNs to save communication power by transmitting compressed temporal sequences; (5) enabling the system to detect time-related anomalies without prior knowledge of abnormal events; and (6) providing a novel missing data estimation method that utilizes temporal and spatial information to replace missing values, illustrated below. The algorithms have been designed, developed, evaluated and validated experimentally on synthesized data and in real-world sensor network applications.
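The abstract does not spell out the estimation method of contribution (6); purely as an illustration of combining temporal and spatial information, a missing reading could be estimated by blending a node's own recent history with its neighbors' current readings (the blending weight, window length and function name below are assumptions, not the dissertation's algorithm):

```python
import numpy as np

def estimate_missing(history, neighbor_values, alpha=0.5, window=5):
    """Illustrative spatio-temporal estimate for one missing sensor reading:
    mix a temporal prediction (mean of this node's recent readings) with a
    spatial prediction (mean of neighboring nodes' current readings)."""
    temporal = np.mean(history[-window:])   # temporal information
    spatial = np.mean(neighbor_values)      # spatial information
    return alpha * temporal + (1 - alpha) * spatial
```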
4 |
The Single Imputation Technique in the Gaussian Mixture Model Framework
Aisyah, Binti M.J., January 2018
Missing data is a common issue in data analysis, and numerous techniques have been proposed to deal with it. Imputation, the process of replacing missing values with plausible values, is the most popular strategy for handling missing data. The two imputation techniques most frequently cited in the literature are single imputation and multiple imputation.
Multiple imputation, also known as the golden imputation technique, was proposed by Rubin in 1987 to address missing data. However, inconsistency is the major problem in the multiple imputation technique. Single imputation is less popular in missing data research due to its bias and reduced-variability issues. One solution for improving the single imputation technique is the basic regression model with an added residual: the main motivation is that adding a residual mitigates the bias and restores the variability. The residual is drawn under a normality assumption, with a mean of 0 and a variance equal to the residual variance. Although newer single imputation methods, such as the stochastic regression model and hot-deck imputation, might improve the variability and bias issues, single imputation techniques still fail to reflect imputation uncertainty, which may lead to underestimation of the R-squared or standard error in the analysis results.
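A minimal sketch of the residual-augmented regression imputation just described, drawing the residual from a normal distribution with mean 0 and variance equal to the estimated residual variance (the OLS fit and function name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_regression_impute(X_obs, y_obs, X_mis):
    """Impute by regression prediction plus e ~ N(0, sigma^2), where
    sigma^2 is the residual variance of the model fitted to the
    observed cases."""
    # Fit OLS on the observed cases and estimate the residual variance.
    Xd = np.column_stack([np.ones(len(X_obs)), X_obs])
    beta, *_ = np.linalg.lstsq(Xd, y_obs, rcond=None)
    resid = y_obs - Xd @ beta
    sigma = np.sqrt(resid @ resid / (len(y_obs) - Xd.shape[1]))
    # Prediction for the incomplete rows plus a random normal residual.
    Xm = np.column_stack([np.ones(len(X_mis)), X_mis])
    return Xm @ beta + rng.normal(0.0, sigma, size=len(X_mis))
```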
The research reported in this thesis provides two imputation solutions for the single imputation technique. In the first imputation procedure, the wild bootstrap is proposed to improve the treatment of uncertainty in the residual variance of the regression model. In the second solution, predictive mean matching (PMM) is enhanced: the regression model takes the main role of generating the recipient values, while the donor observations are taken from the observed values. The missing values are then imputed by randomly drawing one of the observations in the donor pool. The size of the donor pool is significant in determining the quality of the imputed values. A fixed donor pool size has been employed in many existing works on the PMM imputation technique, but it might not be appropriate in certain circumstances, such as when the data distribution has high-density regions. Instead of a fixed donor pool size, the proposed method applies a radius-based solution to determine the size of the donor pool. Both proposed imputation procedures are combined with the Gaussian mixture model framework to preserve the original data distribution.
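A hedged sketch of PMM with a radius-based donor pool: donors whose regression predictions fall within a given radius of a recipient's prediction form that recipient's pool, so the pool size adapts to the local density of the data. The exact radius rule and the nearest-donor fallback are assumptions about the thesis's method.

```python
import numpy as np

rng = np.random.default_rng(2)

def pmm_radius_impute(y_obs_pred, y_obs, y_mis_pred, radius):
    """Predictive mean matching with a radius-based donor pool.

    y_obs_pred : regression predictions for the observed (donor) cases
    y_obs      : the donors' observed values
    y_mis_pred : regression predictions for the missing (recipient) cases
    radius     : donors within this distance of a recipient's prediction
                 form its pool, so pool size varies with local density
    """
    imputed = np.empty(len(y_mis_pred))
    for i, p in enumerate(y_mis_pred):
        pool = np.where(np.abs(y_obs_pred - p) <= radius)[0]
        if pool.size == 0:                        # fall back to nearest donor
            pool = np.array([np.argmin(np.abs(y_obs_pred - p))])
        imputed[i] = y_obs[rng.choice(pool)]      # draw one donor's value
    return imputed
```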
The results reported in this thesis, from experiments on benchmark and artificial data sets, confirm improvements for further data analysis. The proposed approaches are therefore worth considering for further investigation and experiments.
5 |
Statistical Modeling and Analysis of Bivariate Spatial-Temporal Data with the Application to Stream Temperature Study
Li, Han, 04 November 2014
Water temperature is a critical factor for the quality and biological condition of streams. Among the various factors affecting stream water temperature, air temperature is one of the most important. To appropriately quantify the relationship between water and air temperatures over a large geographic region, it is important to accommodate the spatial and temporal information of the stream temperature. In this dissertation, I devote effort to several statistical modeling techniques for analyzing bivariate spatial-temporal data in a stream temperature study.
In the first part, I focus the analysis on individual streams. A time-varying coefficient model (VCM) is used to study the relationship between air temperature and water temperature for each stream. The VCM enables dynamic modeling of the relationship and can therefore enhance the understanding of water and air temperature relationships. The proposed model is applied to 10 streams in Maryland, West Virginia, Virginia, North Carolina and Georgia using daily maximum temperatures. The VCM approach increases the prediction accuracy by more than 50% compared to the simple linear regression model and the nonlinear logistic model.
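To illustrate the varying-coefficient idea, here is a crude rolling-window estimate of the intercept and slope curves in water(t) = a(t) + b(t) * air(t); the dissertation's estimator (typically kernel or spline smoothing) and the window size are not given in the abstract, so both are assumptions:

```python
import numpy as np

def rolling_vcm(air, water, window=30):
    """Estimate the intercept curve a(t) and slope curve b(t) of
    water ~ air by ordinary least squares in a sliding window
    centered on each day."""
    n = len(air)
    a, b = np.full(n, np.nan), np.full(n, np.nan)
    h = window // 2
    for t in range(n):
        lo, hi = max(0, t - h), min(n, t + h + 1)
        X = np.column_stack([np.ones(hi - lo), air[lo:hi]])
        coef, *_ = np.linalg.lstsq(X, water[lo:hi], rcond=None)
        a[t], b[t] = coef
    return a, b
```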
The VCM that describes the relationship between water and air temperatures for each stream is represented by the slope and intercept curves from the fitted model. In the second part, I consider water and air temperatures for different streams that are spatially correlated. I focus on clustering multiple streams using the intercept and slope curves estimated from the VCM. Spatial information is incorporated to make the clustering results geographically meaningful. I further propose a weighted distance as a dissimilarity measure for streams, which provides a flexible framework for interpreting the clustering results under different weights. Real data analysis shows that streams in the same cluster share similar geographic features such as solar radiation, percent forest and elevation.
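As a minimal sketch of the weighted-distance idea: a tunable weight trades similarity of the fitted curves against geographic proximity (the specific distance choices and any prior rescaling are assumptions):

```python
import numpy as np

def stream_dissimilarity(curve_i, curve_j, loc_i, loc_j, w=0.5):
    """Weighted dissimilarity between two streams: a convex combination of
    the distance between their intercept/slope curves (evaluated on a
    common grid and rescaled beforehand) and their geographic distance."""
    d_curve = np.linalg.norm(curve_i - curve_j)  # functional distance
    d_geo = np.linalg.norm(loc_i - loc_j)        # spatial distance
    return w * d_curve + (1.0 - w) * d_geo
```

Varying w between 0 and 1 recovers purely geographic clustering at one end and purely curve-based clustering at the other, which is what makes the weighted distance a flexible framework for interpretation.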
In the third part, I develop a spatial-temporal VCM (STVCM) to deal with missing data. The STVCM takes both spatial and temporal variation of water temperature into account. I develop a novel estimation method that emphasizes the time effect and treats the space effect as a varying coefficient for the time effect. A simulation study shows that the STVCM performs better on missing data imputation than several existing methods, such as the neural network and the Gaussian process. The STVCM is also applied to all 156 streams in this study to obtain a complete data record.
6 |
A Study of Missing Data Imputation and Predictive Modeling of Strength Properties of Wood Composites
Zeng, Yan, 01 August 2011
Problem: Real-time process and destructive test data were collected from a wood composite manufacturer in the U.S. to develop real-time predictive models of two key strength properties, Modulus of Rupture (MOR) and Internal Bond (IB), of a wood composite manufacturing process. Sensor malfunctions and data “send/retrieval” problems led to null fields in the company’s data warehouse, which resulted in information loss. Many manufacturers attempt to build accurate predictive models by excluding entire records with null fields or by using summary statistics such as the mean or median in place of a null field. However, predictive model errors in validation may be higher in the presence of such information loss. In addition, the selection of predictive modeling methods poses another challenge to many wood composite manufacturers.
Approach: This thesis consists of two parts addressing the above issues: 1) how to improve data quality using missing data imputation; 2) which predictive modeling method is better in terms of prediction precision (measured by root mean square error, RMSE). The first part summarizes an application of missing data imputation methods in predictive modeling. After variable selection, two missing data imputation methods were selected from a comparison of six candidate methods. Predictive models of the imputed data were developed using partial least squares regression (PLSR) and compared with models of non-imputed data using ten-fold cross-validation; the root mean square error of prediction (RMSEP) and the normalized RMSEP (NRMSEP) were calculated. The second part presents a series of comparisons among four predictive modeling methods using imputed data without variable selection.
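A sketch of the ten-fold cross-validated evaluation for a PLSR model, assuming scikit-learn's PLSRegression; the number of components and the normalization of NRMSEP by the observed range are assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def plsr_cv_rmsep(X, y, n_components=3, seed=0):
    """Ten-fold cross-validated RMSEP for a PLS regression model, plus
    NRMSEP (here RMSEP as a percentage of the observed range)."""
    sq_errs = []
    for train, test in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        model = PLSRegression(n_components=n_components).fit(X[train], y[train])
        sq_errs.append((model.predict(X[test]).ravel() - y[test]) ** 2)
    rmsep = np.sqrt(np.mean(np.concatenate(sq_errs)))
    return rmsep, 100.0 * rmsep / (y.max() - y.min())
```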
Results: The first part concludes that the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov chain Monte Carlo (MCMC) simulation achieved the most precise results. Predictive models based on imputed data sets generated more precise predictions (average NRMSEP of 5.8% for the MOR model and 7.2% for the IB model) than models based on non-imputed data sets (average NRMSEP of 6.3% for MOR and 8.1% for IB). The second part finds that the Bayesian Additive Regression Tree (BART) produced more precise predictions (average NRMSEP of 7.7% for the MOR model and 8.6% for the IB model) than the other three models: PLSR, LASSO, and adaptive LASSO.