171. Performance of Imputation Algorithms on Artificially Produced Missing at Random Data
Oketch, Tobias O. (01 May 2017)
Missing data is one of the principal challenges in building valid statistical models today. It reduces the representativeness of data samples; hence, population estimates and model parameters estimated from such data are likely to be biased.
However, the missing-data problem remains an active area of study, and improved statistical procedures have been proposed to mitigate its effects. In this paper, we review the causes of missing data and the principal methods of handling it. Our main focus is evaluating multiple imputation (MI) methods from the multivariate imputation by chained equations (MICE) package in the statistical software R. We assess how these MI methods perform at different percentages of missing data. A multiple regression model is fit to each imputed data set and to the complete data set, and the regression coefficients from the imputed-data models are compared statistically with those from the complete data.
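As an illustration of the chained-equations workflow this abstract describes, here is a minimal Python sketch using scikit-learn's IterativeImputer, which is modeled on the R mice package; the simulated data, the missingness rate, and the number of imputations are assumptions for illustration, not the thesis's settings.

```python
# Hypothetical sketch of multiple imputation followed by a regression
# comparison; the data and settings are invented, not the thesis's.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

data = np.column_stack([y, X])
# Mask ~20% of the covariate entries (MAR is approximated here by MCAR).
mask = rng.random(data[:, 1:].shape) < 0.20
incomplete = data.copy()
incomplete[:, 1:][mask] = np.nan

# Draw m imputed data sets; sample_posterior=True makes each draw stochastic,
# which is what turns single imputation into multiple imputation.
m = 5
coefs = []
for seed in range(m):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imp.fit_transform(incomplete)
    fit = sm.OLS(completed[:, 0], sm.add_constant(completed[:, 1:])).fit()
    coefs.append(fit.params)

pooled = np.mean(coefs, axis=0)  # pooled point estimate (Rubin's rules)
full_fit = sm.OLS(y, sm.add_constant(X)).fit()
print("pooled (imputed):", pooled.round(3))
print("complete data:  ", full_fit.params.round(3))
```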
172. Development of Multiple Regression Models to Predict Sources of Fecal Pollution
Hall, Kimberlee K.; Scheuerman, Phillip R. (01 November 2017)
This study assessed the usefulness of multivariate statistical tools to characterize watershed dynamics and prioritize streams for remediation. Three multiple regression models were developed using water quality data collected from Sinking Creek in the Watauga River watershed in Northeast Tennessee. Model 1 included all water quality parameters, model 2 included parameters identified by stepwise regression, and model 3 was developed using canonical discriminant analysis. Models were evaluated in seven creeks to determine if they correctly classified land use and level of fecal pollution. At the watershed level, the models were statistically significant (p < 0.001) but with low r² values (Model 1 r² = 0.02, Model 2 r² = 0.01, Model 3 r² = 0.35). Model 3 correctly classified land use in five of seven creeks. These results suggest this approach can be used to set priorities and identify pollution sources, but may be limited when applied across entire watersheds.
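A minimal sketch of the model-comparison idea follows: fit a full water-quality regression and a reduced model, then compare fit statistics. The variable names (turbidity, conductivity, temperature) and simulated data are placeholders, not the study's measurements.

```python
# Sketch: full vs. reduced multiple regression on simulated water-quality data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "turbidity":    rng.normal(10, 3, n),
    "conductivity": rng.normal(300, 50, n),
    "temperature":  rng.normal(18, 4, n),
})
# Fecal indicator (e.g., log E. coli counts), driven mainly by turbidity here.
df["log_ecoli"] = 0.3 * df["turbidity"] + rng.normal(0, 1, n)

full    = smf.ols("log_ecoli ~ turbidity + conductivity + temperature", df).fit()
reduced = smf.ols("log_ecoli ~ turbidity", df).fit()

print(f"full:    r2 = {full.rsquared:.3f}, p = {full.f_pvalue:.4f}")
print(f"reduced: r2 = {reduced.rsquared:.3f}, p = {reduced.f_pvalue:.4f}")
```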
173. Improved Methods and Selecting Classification Types for Time-Dependent Covariates in the Marginal Analysis of Longitudinal Data
Chen, I-Chen (01 January 2018)
Generalized estimating equations (GEE) are popularly utilized for the marginal analysis of longitudinal data. In order to obtain consistent regression parameter estimates, these estimating equations must be unbiased. However, when certain types of time-dependent covariates are present, these equations can be biased unless an independence working correlation structure is employed. Moreover, in this case regression parameter estimation can be very inefficient because not all valid moment conditions are incorporated within the corresponding estimating equations. Therefore, approaches using the generalized method of moments or quadratic inference functions have been proposed to utilize all valid moment conditions. However, we have found that such methods will not always provide valid inference and can also be improved upon in terms of finite-sample regression parameter estimation. Therefore, we propose a modified GEE approach and a selection method that will both ensure the validity of inference and improve regression parameter estimation.
In addition, these modified approaches assume the data analyst knows the type of time-dependent covariate, although this likely is not the case in practice. Whereas hypothesis testing has been used to determine covariate type, we propose a novel strategy to select a working covariate type in order to avoid potentially high type II error rates with these hypothesis testing procedures. Parameter estimates resulting from our proposed method are consistent and have overall improved mean squared error relative to hypothesis testing approaches.
Finally, for some real-world examples the use of mean regression models may be sensitive to skewness and outliers in the data. Therefore, we extend our approaches to marginal quantile regression, modeling the conditional quantiles of the response variable rather than its mean. Existing and proposed methods are compared in simulation studies and application examples.
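As a sketch of the marginal GEE setup the abstract starts from, here is a Python example using statsmodels with an independence working correlation, the structure noted above as required for certain time-dependent covariates; the simulated longitudinal data are an assumption.

```python
# Sketch: marginal GEE with an independence working correlation structure.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
subjects, times = 100, 4
df = pd.DataFrame({
    "id":   np.repeat(np.arange(subjects), times),
    "time": np.tile(np.arange(times), subjects),
})
df["x"] = rng.normal(size=len(df)) + 0.2 * df["time"]  # time-dependent covariate
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(size=len(df))

model = smf.gee("y ~ x + time", groups="id", data=df,
                cov_struct=sm.cov_struct.Independence(),
                family=sm.families.Gaussian())
result = model.fit()
print(result.summary())
```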
174. THE USE OF 3-D HIGHWAY DIFFERENTIAL GEOMETRY IN CRASH PREDICTION MODELING
Amiridis, Kiriakos (01 January 2019)
The objective of this research is to evaluate and introduce a new methodology for rural highway safety. Current practices rely on crash prediction models that utilize specific explanatory variables, with the Highway Safety Manual (HSM) serving as the repository of past research. Most of the prediction models in the HSM identify the effect of individual geometric elements on crash occurrence and consider their combination in a multiplicative manner, where each effect is multiplied with the others to determine their combined influence. Three-dimensional (3-D) representations of the roadway surface have also been explored in the past, aiming to model the highway structure and optimize the roadway alignment. Differential geometry has recently been applied to the 3-D roadway surface to derive new metrics that identify and express roadway geometric elements, and the results indicate that this may be a new approach to representing the combined effects of all geometry features in single variables. This research will further explore this potential and examine the possibility of using 3-D differential geometry to represent the roadway surface, utilizing its associated metrics to consider the combined effect of roadway features on crashes. It is anticipated that a series of single metrics could combine horizontal and vertical alignment features and eventually predict roadway crashes in a more robust manner.
It should also be noted that the main purpose of this research is not simply to suggest predictive crash models, but to demonstrate in a statistically concrete manner that 3-D metrics of differential geometry, e.g., Gaussian curvature and mean curvature, can assist in analyzing highway design and safety. Therefore, the value of this research is oriented toward a proof of concept of the link between 3-D geometry in highway design and safety. This thesis presents the steps and rationale of the procedure followed in order to complete the proposed research. Finally, the results of the suggested methodology are compared with those derived from the state-of-the-art Interactive Highway Safety Design Model (IHSDM), the software currently in use, which is based on the findings of the HSM.
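To make the two curvature metrics concrete, here is a minimal sketch that computes Gaussian curvature K and mean curvature H for a roadway surface expressed as a height field z = f(x, y), using the standard Monge-patch formulas with numerical derivatives; the synthetic surface is an assumption used only to exercise the formulas, not the thesis's roadway data.

```python
# Sketch: Gaussian and mean curvature of a surface z = f(x, y).
import numpy as np

x = np.linspace(0, 100, 201)   # longitudinal station (m)
y = np.linspace(-5, 5, 51)     # lateral offset (m)
X, Y = np.meshgrid(x, y, indexing="ij")
Z = 0.5 * np.sin(X / 20.0) + 0.02 * Y**2   # toy vertical curve plus crown

dx, dy = x[1] - x[0], y[1] - y[0]
fx, fy = np.gradient(Z, dx, dy)
fxx, fxy = np.gradient(fx, dx, dy)
_, fyy = np.gradient(fy, dx, dy)

# Monge-patch curvature formulas for z = f(x, y):
#   K = (f_xx f_yy - f_xy^2) / (1 + f_x^2 + f_y^2)^2
#   H = ((1 + f_y^2) f_xx - 2 f_x f_y f_xy + (1 + f_x^2) f_yy)
#       / (2 (1 + f_x^2 + f_y^2)^(3/2))
denom = 1.0 + fx**2 + fy**2
K = (fxx * fyy - fxy**2) / denom**2
H = ((1 + fy**2) * fxx - 2 * fx * fy * fxy + (1 + fx**2) * fyy) / (2 * denom**1.5)

print("Gaussian curvature range:", K.min(), K.max())
print("Mean curvature range:   ", H.min(), H.max())
```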
175. TRANSFORMS IN SUFFICIENT DIMENSION REDUCTION AND THEIR APPLICATIONS IN HIGH DIMENSIONAL DATA
Weng, Jiaying (01 January 2019)
The big data era poses great challenges, as well as opportunities, for researchers to develop efficient statistical approaches to analyzing massive data. Sufficient dimension reduction (SDR) is one such important tool in modern data analysis and has received extensive attention in both academia and industry.
In this dissertation, we introduce inverse regression estimators using Fourier transforms, which are superior to existing SDR methods in two respects: (1) they avoid slicing of the response variable, and (2) they can be readily extended to the high-dimensional data problem. For the ultra-high-dimensional problem, we investigate both eigenvalue decomposition and minimum discrepancy approaches to achieve optimal solutions, and we develop a novel and efficient optimization algorithm to obtain the sparse estimates. We derive asymptotic properties of the proposed estimators and demonstrate their efficiency gains over the traditional estimators. The oracle properties of the sparse estimates are also derived. Simulation studies and real data examples illustrate the effectiveness of the proposed methods.
The wavelet transform is another tool that effectively extracts information localized in time and frequency. Parallel to our proposed Fourier transform methods, we also develop a wavelet transform version of the approach and derive the asymptotic properties of the resulting estimators.
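The following is a hedged sketch of the slicing-free idea behind Fourier-transform inverse regression (in the spirit of this line of work, not the dissertation's exact estimator): replace slices of the response with the characteristic-function moments E[X cos(wY)] and E[X sin(wY)] over a frequency grid, then eigendecompose the accumulated kernel matrix. The data, frequency grid, and normalization are assumptions.

```python
# Sketch: Fourier-transform inverse regression for sufficient dimension reduction.
import numpy as np

rng = np.random.default_rng(3)
n, p = 1000, 6
X = rng.normal(size=(n, p))            # covariance ~ identity, so no whitening here
beta = np.zeros(p)
beta[0] = 1.0                          # true central-subspace direction
y = np.sin(X @ beta) + 0.1 * rng.normal(size=n)

Xc = X - X.mean(axis=0)
M = np.zeros((p, p))
for w in np.linspace(0.1, 3.0, 30):    # frequency grid (assumption)
    a = Xc.T @ np.cos(w * y) / n       # real part of E[X e^{iwY}]
    b = Xc.T @ np.sin(w * y) / n       # imaginary part
    M += np.outer(a, a) + np.outer(b, b)

eigvals, eigvecs = np.linalg.eigh(M)
est = eigvecs[:, -1]                   # leading eigenvector spans the estimate
print("estimated direction:", np.round(est / np.sign(est[0]), 3))
```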
176. ACCOUNTING FOR SPATIAL AUTOCORRELATION IN MODELING THE DISTRIBUTION OF WATER QUALITY VARIABLES
Miralha, Lorrayne (01 January 2018)
Several studies in hydrology have reported differences in outcomes between models in which spatial autocorrelation (SAC) is accounted for and those in which it is not. However, the capacity to predict the magnitude of such differences is still ambiguous. In this thesis, I hypothesized that the SAC inherently possessed by a response variable influences spatial modeling outcomes. I selected ten watersheds in the USA and analyzed them to determine whether water quality variables with higher Moran's I values undergo greater increases in the coefficient of determination (R²) and greater decreases in residual SAC (rSAC) after spatial modeling. I compared non-spatial ordinary least squares to two spatial regression approaches, namely, spatial lag and error models. The predictors were the principal components of topographic, land cover, and soil group variables. The results revealed that water quality variables with higher inherent SAC showed more substantial increases in R² and decreases in rSAC after performing spatial regressions. In this study, I found a generally linear relationship between the spatial model outcomes (R² and rSAC) and the degree of SAC in each water quality variable. I suggest that the inherent level of SAC in response variables can predict improvements in models before spatial regression is performed. The benefits of this study go beyond model selection and performance: it has the potential to uncover hydrologic connectivity patterns that can serve as insights for water quality managers and policy makers.
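Here is a minimal sketch of computing Moran's I, the SAC diagnostic used above; the simulated field and the rook-adjacency weights on a grid are assumptions for illustration, not the thesis's watershed data.

```python
# Sketch: Moran's I for a variable on a regular grid with rook-neighbor weights.
import numpy as np

def morans_i(values, W):
    """Moran's I: values is length-n, W an n-by-n spatial weights matrix."""
    z = values - values.mean()
    n, s0 = len(values), W.sum()
    return (n / s0) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(4)
side = 10
# Cumulative sums create a spatially smooth field, i.e., positive SAC.
field = np.cumsum(rng.normal(size=(side, side)), axis=0).ravel()

W = np.zeros((side * side, side * side))
for i in range(side):
    for j in range(side):
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ni, nj = i + di, j + dj
            if 0 <= ni < side and 0 <= nj < side:
                W[i * side + j, ni * side + nj] = 1.0

print(f"Moran's I = {morans_i(field, W):.3f}")  # positive values indicate clustering
```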
177. Generalizing Multistage Partition Procedures for Two-parameter Exponential Populations
Wang, Rui (06 August 2018)
Analysis of variance (ANOVA) is a classic tool for multiple comparisons and has been widely used in numerous disciplines due to its simplicity and convenience. The ANOVA procedure is designed to test whether a number of different populations are all the same, and it is followed by the usual multiple comparison tests to rank the populations. However, the ANOVA procedure does not guarantee that the probability of selecting the best population exceeds some desired prespecified level. This shortcoming of the ANOVA procedure was overcome by researchers in the early 1950s by designing experiments with the goal of selecting the best population. In this dissertation, a single-stage procedure is introduced to partition k treatments into "good" and "bad" groups with respect to a control population, assuming some key parameters are known. Next, the proposed partition procedure is generalized to the case when the parameters are unknown, and a purely sequential procedure and a two-stage procedure are derived. Theoretical asymptotic properties, such as first-order and second-order properties, of the proposed procedures are derived to document their efficiency. These theoretical properties are studied via Monte Carlo simulations to document the performance of the procedures for small and moderate sample sizes.
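To illustrate the partition idea only, here is a hedged simulation sketch: k two-parameter exponential treatments are compared to a control via their estimated location parameters and split into "good" and "bad" groups relative to the control plus a margin. The margin delta, sample sizes, and the simple minimum-based location estimate are assumptions, not the dissertation's design constants or procedure.

```python
# Sketch: partitioning exponential treatments against a control by location.
import numpy as np

rng = np.random.default_rng(5)
k, n, delta = 8, 50, 0.5
control_loc, scale = 1.0, 1.0
true_locs = control_loc + rng.uniform(-1.0, 1.5, size=k)

def loc_estimate(sample):
    """MLE of the location of a two-parameter exponential: the sample minimum."""
    return sample.min()

control = control_loc + rng.exponential(scale, size=n)
ctrl_hat = loc_estimate(control)
for i, loc in enumerate(true_locs):
    x = loc + rng.exponential(scale, size=n)
    label = "good" if loc_estimate(x) >= ctrl_hat + delta else "bad"
    print(f"treatment {i}: true location {loc:.2f} -> {label}")
```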
178. Detection of Sand Boils from Images using Machine Learning Approaches
Kuchi, Aditi S. (23 May 2019)
Levees provide protection for vast amounts of commercial and residential property. However, these structures degrade over time due to the impact of severe weather, sand boils, subsidence of land, seepage, etc. In this research, we focus on detecting sand boils. Sand boils occur when water under pressure wells up to the surface through a bed of sand, and they make levees especially vulnerable. Object detection is a good approach to confirming the presence of sand boils from satellite or drone imagery, and it can assist an automated levee-monitoring methodology. Since sand boils have distinct features, applying object detection algorithms to them can result in accurate detection. To the best of our knowledge, this work is the first approach to detecting sand boils from images. In this research, we compare some of the latest deep learning methods, the Viola-Jones algorithm, and other non-deep-learning methods to determine the best-performing one. We also train a stacking-based machine learning method for the accurate prediction of sand boils. The accuracy of our robust model is 95.4%.
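As a sketch of the stacking idea mentioned above, here is a Python example with scikit-learn's StackingClassifier, where base learners feed a logistic-regression meta-learner. The simulated feature matrix stands in for extracted image features (e.g., HOG descriptors); the base learners and data are assumptions, not the thesis's actual pipeline.

```python
# Sketch: a stacking ensemble for binary classification.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"held-out accuracy: {stack.score(X_te, y_te):.3f}")
```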
179. Making Models with Bayes
Olid, Pilar (01 December 2017)
Bayesian statistics is an important approach to modern statistical analysis. It allows us to use our prior knowledge of the unknown parameters to construct a model for our data set. The foundation of Bayesian analysis is Bayes' rule, which in its proportional form states that the posterior is proportional to the prior times the likelihood. We demonstrate how to apply Bayesian statistical techniques to fit a linear regression model and a hierarchical linear regression model to a data set. We show how to apply different distributions in Bayesian analyses and how the choice of prior affects the model. We also compare the Bayesian approach with the traditional frequentist approach to data analysis.
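A minimal sketch of "posterior ∝ prior × likelihood" in the linear regression setting follows, using a conjugate Gaussian prior with known noise variance; the prior hyperparameters and simulated data are assumptions, and this is the textbook conjugate update rather than the thesis's specific models.

```python
# Sketch: conjugate Bayesian linear regression (known noise variance).
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])   # design matrix with intercept
true_beta = np.array([2.0, 0.7])
sigma2 = 1.0                            # known noise variance (assumption)
y = X @ true_beta + rng.normal(0, np.sqrt(sigma2), n)

# Prior: beta ~ N(mu0, Sigma0)
mu0 = np.zeros(2)
Sigma0 = 10.0 * np.eye(2)

# Conjugate update: Sigma_n = (Sigma0^-1 + X'X/sigma2)^-1
#                   mu_n    = Sigma_n (Sigma0^-1 mu0 + X'y/sigma2)
Sigma0_inv = np.linalg.inv(Sigma0)
Sigma_n = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)
mu_n = Sigma_n @ (Sigma0_inv @ mu0 + X.T @ y / sigma2)

print("posterior mean:", mu_n.round(3))  # shrinks OLS toward the prior mean
print("posterior sd:  ", np.sqrt(np.diag(Sigma_n)).round(3))
```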
180. ESTIMATING THE RESPIRATORY LUNG MOTION MODEL USING TENSOR DECOMPOSITION ON DISPLACEMENT VECTOR FIELD
Kang, Kingston (01 January 2018)
Modern big data often emerge as tensors. Standard statistical methods are inadequate to deal with datasets of large volume, high dimensionality, and complex structure. Therefore, it is important to develop algorithms such as low-rank tensor decomposition for data compression, dimensionality reduction, and approximation.
With the advancement of technology, high-dimensional images are becoming ubiquitous in the medical field. In lung radiation therapy, the respiratory motion of the lung introduces variability during treatment as the tumor inside the lung moves, which brings challenges to the precise delivery of radiation to the tumor. Several approaches to quantifying this uncertainty propose using a model to formulate the motion through a mathematical function over time. [Li et al., 2011] uses principal component analysis (PCA) to propose one such model, treating each image as a long vector. However, the images come as multidimensional arrays, and vectorization breaks the spatial structure. Driven by the need to develop low-rank tensor decompositions, and provided with 4DCT images and their displacement vector fields (DVFs), we introduce two tensor decompositions, population value decomposition (PVD) and population Tucker decomposition (PTD), to estimate the respiratory lung motion with high levels of accuracy and data compression. The first algorithm generalizes PVD [Crainiceanu et al., 2011] to higher-order tensors. The second generalizes the concept of PVD using the Tucker decomposition. Both algorithms are tested on clinical and phantom DVFs. New metrics for measuring model performance are developed in our research, and the results of the two new algorithms are compared to those of the PCA algorithm.
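To make the Tucker decomposition concrete, here is a sketch of a truncated higher-order SVD (HOSVD), one standard way to compute a Tucker factorization; the low-rank 3-way tensor standing in for a DVF and the chosen ranks are assumptions, not clinical data or the thesis's PVD/PTD algorithms.

```python
# Sketch: truncated HOSVD (a Tucker decomposition) of a 3-way tensor.
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_mult(T, U, mode):
    """Mode-n product: multiply matrix U along the given mode of T."""
    moved = np.moveaxis(T, mode, 0)
    return np.moveaxis(np.tensordot(U, moved, axes=1), 0, mode)

def hosvd(T, ranks):
    """Factor matrices from each unfolding's SVD, then project to get the core."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        core = mode_mult(core, U.T, m)
    return core, factors

# Build a noisy low-rank tensor as a stand-in for a DVF.
rng = np.random.default_rng(7)
G = rng.normal(size=(5, 5, 3))
A, B, C = rng.normal(size=(20, 5)), rng.normal(size=(20, 5)), rng.normal(size=(10, 3))
T = np.einsum("abc,ia,jb,kc->ijk", G, A, B, C) + 0.01 * rng.normal(size=(20, 20, 10))

core, factors = hosvd(T, ranks=(5, 5, 3))
approx = core
for m, U in enumerate(factors):
    approx = mode_mult(approx, U, m)
print("relative error:", np.linalg.norm(T - approx) / np.linalg.norm(T))
```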