551 |
Functional Component Analysis and Regression Using Elastic Methods
Constructing generative models for functional observations is an important task in functional data analysis. In general, functional data contain both phase (or x, horizontal) and amplitude (or y, vertical) variability. Traditional methods often ignore the phase variability and focus solely on the amplitude variation, using cross-sectional techniques such as functional principal component analysis for dimension reduction and regression for data modeling. Ignoring phase variability leads to a loss of structure in the data and to inefficient data models. Moreover, most methods use a "pre-processing" alignment step to remove the phase variability, without considering a more natural joint solution. This dissertation presents three approaches to this problem. The first relies on separating the phase (x-axis) and amplitude (y-axis) components and then modeling them using joint distributions. This separation, in turn, is performed using a technique called elastic alignment of functions, which involves a new mathematical representation of functional data. Separate principal component analyses of the phase and amplitude components are then used to impose joint probability models on the principal coefficients while respecting the nonlinear geometry of the phase representation space. The second approach incorporates the phase variability into the objective functions of two component analysis methods, functional principal component analysis and functional partial least squares. This yields a more complete solution, as the phase variability is removed while the components are extracted simultaneously. The third approach incorporates the phase variability into the functional linear regression model and then extends the model to logistic and multinomial logistic regression. Incorporating the phase variability produces a more parsimonious regression model and, therefore, more accurate prediction of observations. These models are then easily extended from functional data to curves (which are essentially functions in R^2) to perform regression with curves as predictors. These ideas are demonstrated using random sampling from models estimated on simulated and real datasets, showing their superiority over models that ignore phase-amplitude separation. Furthermore, the models are applied to classification of functional data and achieve high performance in applications involving SONAR signals of underwater objects, handwritten signatures, periodic body movements recorded by smartphones, and physiological data. (A brief code sketch of the elastic SRSF representation follows this record.) / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of
Doctor of Philosophy. / Summer Semester, 2014. / May 20, 2014. / Amplitude Variability, Functional Data Analysis, Function Alignment, Functional Regression, Function Principal Component Analysis, Phase Variability / Includes bibliographical references. / Anuj Srivastava, Professor Co-Directing Dissertation; Wei Wu, Professor Co-Directing Dissertation; Eric Klassen, University Representative; Fred Huffer, Committee Member.
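The record above builds on the elastic, square-root slope function (SRSF) representation of functional data. Below is a minimal illustrative sketch, not taken from the dissertation, of the SRSF transform and a plain L2 comparison of two SRSFs; full phase-amplitude separation would additionally optimize over warping functions (typically via dynamic programming), which is omitted here.

```python
import numpy as np

def srsf(f, t):
    """Square-root slope function: q = sign(f') * sqrt(|f'|).
    Under this representation, elastic comparison of functions reduces to an
    L2 comparison of (optimally warped) SRSFs."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

def srsf_l2_distance(q1, q2, t):
    """Plain L2 distance between two SRSFs on a common grid (no warping applied)."""
    dt = t[1] - t[0]
    return np.sqrt(np.sum((q1 - q2) ** 2) * dt)

# Toy example: two unit bumps that differ mainly in phase (peak location).
t = np.linspace(0.0, 1.0, 200)
f1 = np.exp(-(t - 0.4) ** 2 / 0.01)
f2 = np.exp(-(t - 0.6) ** 2 / 0.01)
print("SRSF L2 distance before alignment:", srsf_l2_distance(srsf(f1, t), srsf(f2, t)))
```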
|
552 |
Sparse Generalized PCA and Dependency Learning for Large-Scale Applications Beyond Gaussianity
The age of big data has renewed interest in dimension reduction. How to cope with high-dimensional data remains a difficult problem in statistical learning. In this study, we consider the task of dimension reduction: projecting data onto a lower-rank subspace while preserving maximal information. We investigate the pitfalls of classical PCA and propose a set of algorithms that function under high dimension, extend to all exponential family distributions, perform feature selection at the same time, and take missing values into consideration. Based upon the best-performing one, we develop the SG-PCA algorithm. With acceleration techniques and a progressive screening scheme, it demonstrates superior scalability and accuracy compared to existing methods. Concerned with the independence assumption underlying most dimension reduction techniques, we propose a novel framework, Generalized Indirect Dependency Learning (GIDL), to learn and incorporate association structure in multivariate statistical analysis. Without constraints on the particular distribution of the data, GIDL takes any pre-specified smooth loss function and is able both to extract the association structure and to infuse it into regression, classification, or dimension reduction problems. Experiments demonstrate its efficacy. (A short sketch of sparse exponential-family PCA follows this record.) / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements
for the degree of Doctor of Philosophy. / Spring Semester 2016. / March 29, 2016. / Includes bibliographical references. / Yiyuan She, Professor Directing Dissertation; Teng Ma, University Representative; Xufeng Niu,
Committee Member; Debajyoti Sinha, Committee Member; Elizabeth Slate, Committee Member.
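As context for the record above, here is a rough illustrative sketch, not the SG-PCA algorithm itself, of sparse PCA in a non-Gaussian, exponential-family setting: a low-rank natural-parameter model for binary data fit by gradient descent on the Bernoulli negative log-likelihood, with an L1 soft-thresholding (proximal) step on the loadings for feature selection. The function names and settings are assumptions for illustration.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_bernoulli_pca(X, rank=2, lam=0.05, lr=0.1, iters=500, seed=0):
    """Low-rank natural-parameter model Theta = U @ V.T for binary X, fit by
    gradient descent on the Bernoulli negative log-likelihood, with an L1
    proximal step on the loadings V to induce feature sparsity."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    U = 0.01 * rng.standard_normal((n, rank))
    V = 0.01 * rng.standard_normal((p, rank))
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))   # Bernoulli means
        G = P - X                              # gradient of the NLL w.r.t. Theta
        U -= lr * (G @ V) / n
        V -= lr * (G.T @ U) / n
        V = soft_threshold(V, lr * lam)        # proximal step for the L1 penalty
    return U, V

# Toy binary data with a planted sparse low-rank structure.
rng = np.random.default_rng(1)
scores = rng.standard_normal((100, 2))
loads = np.zeros((20, 2))
loads[:5, 0] = 2.0
loads[5:10, 1] = 2.0
probs = 1.0 / (1.0 + np.exp(-(scores @ loads.T)))
X = (rng.random((100, 20)) < probs).astype(float)
U, V = sparse_bernoulli_pca(X)
print("nonzero loadings per component:", (np.abs(V) > 1e-8).sum(axis=0))
```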
|
553 |
Predictive Accuracy Measures for Binary Outcomes: Impact of Incidence Rate and Optimization Techniques
Evaluating the performance of models predicting a binary outcome can be done using a variety of measures. While some measures describe the model's overall fit, others describe the model's ability to discriminate between the two outcomes. If a model fits well but doesn't discriminate well, what does that tell us? Given two models, if one discriminates well but has poor fit while the other fits well but discriminates poorly, which of the two should we choose? The measures of interest for our research include the area under the ROC curve, the Brier score, the discrimination slope, log-loss, R-squared, and the F-score. To examine the underlying relationships among all of the measures, real data and simulation studies are used. The real data come from multiple cardiovascular research studies, and the simulation studies are run under general conditions as well as for incidence rates ranging from 2% to 50%. The results of these analyses provide insight into the relationships among the measures and raise concerns about scenarios in which the measures may yield different conclusions. The impact of incidence rate on these relationships provides a basis for exploring alternatives to the maximization routine used in logistic regression. While most of the measures are easily optimized using the Newton-Raphson algorithm, maximizing the area under the ROC curve requires optimization of a nonlinear, non-differentiable function. Use of the Nelder-Mead simplex algorithm, together with close connections to research in economics, yields unique parameter estimates and general asymptotic conditions. Comparing AUC optimization with logistic regression on real and simulated data further reveals the impact of incidence rate on the relationships, significant increases in the achievable area under the ROC curve, and differences in conclusions about whether to include a variable in a model. (A short sketch of direct AUC maximization via Nelder-Mead follows this record.) / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements
for the degree of Doctor of Philosophy. / Spring Semester 2016. / April 8, 2016. / auc, brier score, incidence rate, logistic regression, optimization / Includes bibliographical references. / Daniel McGee, Professor Co-Directing Thesis; Elizabeth Slate, Professor Co-Directing Thesis;
Isaac Eberstein, University Representative; Fred Huffer, Committee Member.
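A minimal illustrative sketch of the kind of comparison described above: fit a logistic regression, then maximize the non-differentiable AUC of the linear score directly with the Nelder-Mead simplex. The simulated coefficients and settings are assumptions for illustration, not values from the dissertation.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Simulated binary-outcome data (illustrative coefficients only).
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.standard_normal((n, p))
true_beta = np.array([1.0, -0.5, 0.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(int)

def neg_auc(beta):
    # The AUC depends on the linear score only through its ranking, so it is a
    # piecewise-constant, non-differentiable function of beta (and invariant to
    # rescaling beta), which is why a simplex search is used instead of Newton-Raphson.
    return -roc_auc_score(y, X @ beta)

# Start the simplex search from the (essentially unpenalized) logistic-regression fit.
beta_logit = LogisticRegression(C=1e6).fit(X, y).coef_.ravel()
res = minimize(neg_auc, beta_logit, method="Nelder-Mead")

print("in-sample AUC, logistic regression:", roc_auc_score(y, X @ beta_logit))
print("in-sample AUC, Nelder-Mead maximum:", -res.fun)
```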
|
554 |
Intensity Estimation in Poisson Processes with Phase Variability
Intensity estimation for Poisson processes is a classical problem and has been extensively studied over the past few decades. However, current methods of intensity estimation assume that phase variability, or compositional noise (i.e., a nonlinear shift along the time axis), is absent from the data, which is an unreasonable assumption for practical observations. The key challenge is that these observations are not "aligned", and registration procedures are required for successful estimation. As a result, existing estimation methods can yield estimators that are inefficient or that underperform in simulations and applications. This dissertation summarizes two key projects that examine estimation of the intensity of a Poisson process in the presence of phase variability. The first project proposes an alignment-based framework for intensity estimation. First, it is shown that the intensity function is area-preserved with respect to compositional noise. This property implies that the time warping is encoded only in the density, or normalized intensity, function. The intensity function can then be decomposed into the product of the estimated total intensity (a scalar value) and the estimated density function. The estimation of the density relies on a metric that measures the phase difference between two density functions. An asymptotic study shows that the proposed estimation algorithm provides a consistent estimator of the normalized intensity. The success of the proposed algorithm is illustrated using two simulations, and the new framework is applied to a real data set of neural spike trains, showing that the proposed method yields improved classification accuracy over previous methods. The second project utilizes 2014 Florida data from the Healthcare Cost and Utilization Project's State Inpatient Database and State Emergency Department Database (provided to the U.S. Department of Health and Human Services, Agency for Healthcare Research and Quality by the Florida Agency for Health Care Administration) to examine heart failure emergency department arrival times. Current approaches to emergency department arrival data ignore its functional nature and rely on naive analyses. In this dissertation, the arrivals are treated as a Poisson process and the intensity of the process is estimated using existing density estimation and function registration methods. The results of these analyses show the importance of considering the functional nature of emergency department arrival data and the critical role that function registration plays in intensity estimation of the arrival process. (A short sketch of the total-intensity-times-density decomposition follows this record.) / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements
for the degree of Doctor of Philosophy. / Fall Semester 2016. / October 7, 2016. / emergency department utilization, functional data analysis, function registration, intensity estimation,
Poisson process, spike train / Includes bibliographical references. / Wei Wu, Professor Directing Dissertation; James Whyte, IV, University Representative; Anuj
Srivastava, Committee Member; Eric Chicken, Committee Member.
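A toy illustrative sketch of the decomposition used in the record above: estimate the intensity of a Poisson process as (estimated total intensity) times (estimated density of event times). The registration of densities across warped realizations, which is the central contribution of the dissertation, is omitted here; the simulated event times are simply pooled.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def warp(t, a):
    """Simple boundary-preserving warping of [0, 1] (the 'compositional noise')."""
    return t ** np.exp(a)

def simulate_train(total_rate=80.0, a=0.0):
    """One realization: a Poisson number of events from a two-bump density, then warped."""
    n = rng.poisson(total_rate)
    u = np.where(rng.random(n) < 0.5,
                 rng.normal(0.3, 0.05, n),
                 rng.normal(0.7, 0.05, n))
    return warp(np.clip(u, 0.0, 1.0), a)

trains = [simulate_train(a=rng.normal(0.0, 0.3)) for _ in range(20)]

# Decomposition: intensity(t) = (expected total count) x (density of event times).
total_hat = np.mean([len(s) for s in trains])   # scalar total intensity
pooled = np.concatenate(trains)                 # event times (not registered here)
density_hat = gaussian_kde(pooled)              # normalized intensity estimate
t_grid = np.linspace(0.0, 1.0, 200)
intensity_hat = total_hat * density_hat(t_grid)
print("estimated total intensity:", total_hat)
print("estimated peak intensity :", intensity_hat.max())
```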
|
555 |
Sparse Feature and Element Selection in High-Dimensional Vector Autoregressive Models
This thesis identifies the underlying structures of multivariate time series and proposes a methodology for constructing predictive VAR models. Due to the complexity of high-dimensional multivariate time series, forecasting a target series with many predictors in a VAR model poses a challenge in statistical learning and modeling. The quadratically increasing dimension of the parameter space, known as the "curse of dimensionality", poses considerable challenges for multivariate time series models. Meanwhile, two facts motivate dimension reduction in multivariate time series: first, some nuisance series exist and are better removed; second, a target series is typically driven by a few dependent elements constructed from some indices. To address these challenges, our approach is to reduce both the number of series and the features involved in each series simultaneously. As a result, the original high-dimensional structure can be modeled using a lower-dimensional time series, and the forecasting performance is subsequently improved. The methodology introduced in this work is called Sparse Feature and Element Selection (SFES). It employs an "L1 + group L1" penalty to conduct group selection and variable selection within each group simultaneously. Our contributions in this thesis are twofold. First, the doubly-constrained regularization in SFES poses a convex optimization problem, which we solve using a fast but simple-to-implement algorithm. We evaluate this algorithm on a large-scale dataset and prove that it has guaranteed strict iterative convergence and global optimality. Second, we present non-asymptotic theoretical results based on a combined statistical and computational analysis. A sharp oracle inequality is proved to reveal its power in predictive learning. We compare SFES with the related Sparse Group Lasso (SGL) to show that the proposed method is both computationally efficient and theoretically justified. Experiments using simulated data and real-world macroeconomic time series demonstrate the efficiency and efficacy of the proposed SFES in practice. (A short sketch of the "L1 + group L1" proximal step follows this record.) / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements
for the degree of Doctor of Philosophy. / Fall Semester 2016. / October 28, 2016. / consistency, Feature Selection, Sparse, VAR / Includes bibliographical references. / Xufeng Niu, Professor Co-Directing Dissertation; Yiyuan She, Professor Co-Directing Dissertation;
Yingmei Cheng, University Representative; Fred Huffer, Committee Member; Wei Wu, Committee Member.
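A small illustrative sketch of the proximal (shrinkage) operator for the "L1 + group L1" penalty named above, i.e., the sparse group lasso penalty; a proximal-gradient solver would alternate a least-squares gradient step with this operator. This is a generic building block shown for context, not the SFES algorithm itself.

```python
import numpy as np

def prox_sparse_group_lasso(beta, groups, lam1, lam2, step=1.0):
    """Proximal operator of step * (lam1 * ||beta||_1 + lam2 * sum_g ||beta_g||_2):
    element-wise soft-thresholding followed by group-wise shrinkage."""
    out = np.sign(beta) * np.maximum(np.abs(beta) - step * lam1, 0.0)
    for g in groups:
        norm = np.linalg.norm(out[g])
        if norm > 0.0:
            out[g] *= max(0.0, 1.0 - step * lam2 / norm)
    return out

# Toy illustration: coefficients for three "series" (groups) of four lags each.
beta = np.array([0.9, 0.8, 0.7, 0.6,      # strong group: kept, individually shrunk
                 0.05, 0.04, 0.03, 0.02,  # weak group: zeroed out entirely
                 0.5, 0.0, 0.0, 0.4])     # mixed group: within-group sparsity
groups = [slice(0, 4), slice(4, 8), slice(8, 12)]
print(prox_sparse_group_lasso(beta, groups, lam1=0.05, lam2=0.3))
```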
|
556 |
First Steps towards Image Denoising under Low-Light Conditions
Noise reduction, or denoising, of images is an important topic in the fields of computer vision and computational photography. Many popular state-of-the-art denoising algorithms are trained and evaluated using images with artificial noise, and these algorithms and their evaluations on synthetic data may lead to incorrect conclusions about their performance. In this work we first introduce a benchmark dataset of uncompressed color images corrupted by natural noise due to low-light conditions, together with spatially and intensity-aligned low-noise images of the same scenes. The dataset contains over 100 scenes and more than 500 images, including both RAW-format images and 8-bit BMP images that are pixel- and intensity-aligned. We also introduce a method for estimating the true noise level in each of our images, since even the low-noise images contain a small amount of noise. Using this noise estimation method, we construct a convolutional neural network model for automatic noise estimation in single noisy images. Finally, we improve upon a state-of-the-art denoising algorithm, block-matching and 3D filtering (BM3D), by learning a specialized denoising parameter with another convolutional neural network. (A short sketch of a classical single-image noise-level estimator follows this record.) / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of
Philosophy. / Fall Semester 2016. / November 18, 2016. / Convolutional Neural Networks, Image Denoising, Machine Learning, Mobile Phone RAW data, RAW Uncompressed
Images / Includes bibliographical references. / Anke Meyer-Baese, University Representative; Antonio Linero, Committee Member; Jinfeng Zhang,
Committee Member.
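For context on single-image noise-level estimation, a minimal sketch of a classical filter-based estimator (Immerkaer's fast method) applied to a toy image with synthetic Gaussian noise. This is an assumed stand-in for illustration, not the convolutional-network estimator or the real low-light benchmark developed in the dissertation.

```python
import numpy as np
from scipy.signal import convolve2d

def estimate_noise_sigma(img):
    """Fast noise standard-deviation estimate (Immerkaer, 1996): convolve with a
    Laplacian-difference mask and take a robust average of the absolute response."""
    mask = np.array([[ 1.0, -2.0,  1.0],
                     [-2.0,  4.0, -2.0],
                     [ 1.0, -2.0,  1.0]])
    h, w = img.shape
    resp = convolve2d(img, mask, mode="valid")
    return np.sqrt(np.pi / 2.0) * np.sum(np.abs(resp)) / (6.0 * (h - 2) * (w - 2))

# Toy check: add Gaussian noise of known sigma to a smooth synthetic image.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 256)
clean = np.outer(np.sin(4.0 * x), np.cos(3.0 * x))
noisy = clean + rng.normal(0.0, 0.05, clean.shape)
print("true sigma: 0.05, estimated:", estimate_noise_sigma(noisy))
```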
|
557 |
Modeling Multivariate Data with Parameter-Based Subspaces
When modeling multivariate data such as vectorized images, one might have an extra parameter of contextual information that could be used to treat some observations as more similar to others. For example, images of faces can vary by yaw rotation, and one would expect a face rotated 65 degrees to the left to have characteristics more similar to a face rotated 55 degrees to the left than to a face rotated 65 degrees to the right. We introduce a novel method, parameterized principal component analysis (PPCA), that can model data with linear variation like principal component analysis (PCA), but can also take advantage of this parameter of contextual information, such as yaw rotation. Like PCA, PPCA models an observation using a mean vector and the product of observation-specific coefficients and basis vectors. Unlike PCA, PPCA treats the elements of the mean vector and basis vectors as smooth, piecewise linear functions of the contextual parameter. PPCA is fit by a penalized optimization that penalizes candidate models that have overly large differences between corresponding mean or basis vector elements for similar parameter values. The penalty ensures that each observation's projection will share information with observations that have similar parameter values, but not with observations that have dissimilar parameter values. We tested PPCA on artificial data based on known, smooth functions of an added parameter, as well as on three real datasets with different types of parameters. We compared PPCA to independent principal component analysis (IPCA), which groups observations by their parameter values and projects each group using principal component analysis with no sharing of information across groups. PPCA recovers the known functions with less error and projects the datasets' test set observations with consistently less reconstruction error than IPCA does. PPCA's performance is particularly strong, relative to IPCA, when training data are limited. We also tested the use of spectral clustering to form the groups in an IPCA model. In our experiment, the clustered IPCA model had very similar error to the parameter-based IPCA model, suggesting that spectral clustering might be a viable alternative if one did not know the parameter values for an application. / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / Summer Semester 2016. / May 18, 2016. / Includes bibliographical references. / Adrian Barbu, Professor Directing Dissertation; Anke Meyer-Baese, University Representative; Yiyuan She, Committee Member; Jinfeng Zhang, Committee Member.
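A minimal illustrative sketch of the IPCA baseline described above (group observations by parameter value and fit a separate PCA per group, with no information sharing); the penalized, parameter-smoothed PPCA model itself is not reproduced. The toy rotation-like data and settings are assumptions for illustration.

```python
import numpy as np

def ipca_fit(X, params, n_components=2):
    """Independent PCA (IPCA): a separate mean and principal basis per parameter value."""
    models = {}
    for v in np.unique(params):
        group = X[params == v]
        mu = group.mean(axis=0)
        _, _, vt = np.linalg.svd(group - mu, full_matrices=False)
        models[v] = (mu, vt[:n_components])     # top principal directions
    return models

def ipca_reconstruct(x, v, models):
    mu, basis = models[v]
    return mu + (x - mu) @ basis.T @ basis

# Toy data: 1-D "images" whose shape shifts with a rotation-like parameter.
rng = np.random.default_rng(0)
params = np.repeat([-30, 0, 30], 50)
X = np.array([np.sin(np.linspace(0.0, np.pi, 20) + np.deg2rad(v))
              + 0.1 * rng.standard_normal(20) for v in params])
models = ipca_fit(X, params)
errors = [np.mean((x - ipca_reconstruct(x, v, models)) ** 2) for x, v in zip(X, params)]
print("mean IPCA reconstruction error:", np.mean(errors))
```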
|
558 |
Bayesian Inference and Novel Models for Survival Data with Cured Fraction
Existing cure-rate survival models are generally not convenient for modeling and estimating the survival quantiles of a patient with specified covariate values, nor do they allow inference on the change in the number of clonogens over time. This dissertation proposes two novel classes of cure-rate models, the transform-both-sides cure-rate model (TBSCRM) and the clonogen proliferation cure-rate model (CPCRM). Both can be used to make inference about the cure rate and the survival probabilities over time. The TBSCRM can also produce estimates of a patient's survival-time quantiles, and the CPCRM can produce estimates of a patient's expected number of clonogens at each time. We develop Bayesian methods, based on Markov chain Monte Carlo (MCMC) tools, for inference about covariate effects on relevant quantities such as the cure rate. We also show that the TBSCRM-based and CPCRM-based Bayesian methods perform well in simulation studies and outperform existing cure-rate models in application to breast cancer survival data from the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) database. / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / Summer Semester 2016. / July 14, 2016. / Includes bibliographical references. / Debajyoti Sinha, Professor Directing Dissertation; Robert Glueckauf, University Representative; Elizabeth Slate, Committee Member; Debdeep Pati, Committee Member.
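For background, a small sketch of the classical mixture cure-rate model, a standard baseline rather than the TBSCRM or CPCRM proposed above, with population survival S_pop(t) = pi + (1 - pi) * S_u(t). It is fit here by maximum likelihood on simulated right-censored data instead of the dissertation's Bayesian MCMC approach; all settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, t, delta):
    """Mixture cure-rate model with an exponential latency distribution:
    S_pop(t) = pi + (1 - pi) * exp(-rate * t).
    theta = (logit of cure fraction pi, log of event rate); delta = 1 if the event was observed."""
    pi = 1.0 / (1.0 + np.exp(-theta[0]))
    rate = np.exp(theta[1])
    f_u = rate * np.exp(-rate * t)                 # latency density (uncured subjects)
    s_pop = pi + (1.0 - pi) * np.exp(-rate * t)    # population survival
    loglik = delta * np.log((1.0 - pi) * f_u) + (1.0 - delta) * np.log(s_pop)
    return -np.sum(loglik)

# Toy data: 40% cured, exponential(1) event times for the uncured, censoring at t = 3.
rng = np.random.default_rng(0)
n = 2000
cured = rng.random(n) < 0.4
t_event = np.where(cured, np.inf, rng.exponential(1.0, n))
t_obs = np.minimum(t_event, 3.0)
delta = (t_event <= 3.0).astype(float)
fit = minimize(neg_log_lik, x0=np.zeros(2), args=(t_obs, delta))
print("estimated cure fraction:", 1.0 / (1.0 + np.exp(-fit.x[0])))
```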
|
559 |
Investigating the Chi-Square-Based Model-Fit Indexes for WLSMV and ULSMV Estimators
In structural equation modeling (SEM), researchers use the model chi-square statistic and model fit indexes to evaluate model-data fit. The root mean square error of approximation (RMSEA), the comparative fit index (CFI), and the Tucker-Lewis index (TLI) are widely applied model fit indexes. When data are ordered and categorical, the most popular estimator is the diagonally weighted least squares (DWLS) estimator. Robust corrections have been proposed to adjust the uncorrected chi-square statistic from DWLS so that its first and second order moments are in alignment with the target central chi-square distribution under correctly specified models. DWLS with such a correction is called the mean- and variance-adjusted weighted least squares (WLSMV) estimator. An alternative to WLSMV is the mean- and variance-adjusted unweighted least squares (ULSMV) estimator, which has been shown to perform as well as, or slightly better than, WLSMV. Because the chi-square statistic is corrected, the chi-square-based RMSEA, CFI, and TLI are also corrected by replacing the uncorrected chi-square statistic with the robust chi-square statistic. The robust model fit indexes calculated in this way are named the population-corrected robust (PR) model fit indexes, following Brosseau-Liard, Savalei, and Li (2012). The PR model fit indexes are currently reported in almost every application in which WLSMV or ULSMV is used. Nevertheless, previous studies have found that the PR model fit indexes from WLSMV are sensitive to several factors, such as sample size, model size, and the thresholds for categorization. The first focus of this dissertation is the dependency of model fit indexes on the thresholds for ordered categorical data. Because the weight matrix in the WLSMV fit function and the correction factors for both WLSMV and ULSMV include the asymptotic variances of the thresholds and polychoric correlations, the model fit indexes are very likely to depend on the thresholds. This dependency is not a desirable property, because when the misspecification lies in the factor structure (e.g., cross-loadings are ignored or two factors are treated as a single factor), model fit indexes should reflect that misspecification rather than the threshold values. As alternatives to the PR model fit indexes, Brosseau-Liard et al. (2012), Brosseau-Liard and Savalei (2014), and Li and Bentler (2006) proposed the sample-corrected robust (SR) model fit indexes. The PR fit indexes are found to converge to distorted asymptotic values, whereas the SR fit indexes converge to their definitions asymptotically. However, the SR model fit indexes were proposed for continuous data and have been neither investigated nor implemented in SEM software when WLSMV and ULSMV are applied. This dissertation therefore investigates the PR and SR model fit indexes for WLSMV and ULSMV. The first part of the simulation study examines the dependency of the model fit indexes on the thresholds when the model misspecification results from omitting cross-loadings or collapsing factors in confirmatory factor analysis. The study is conducted on extremely large computer-generated datasets in order to approximate the asymptotic values of the model fit indexes. The results show that only the SR fit indexes from ULSMV are independent of the population threshold values, given the other design factors. The PR fit indexes from ULSMV, and the PR and SR fit indexes from WLSMV, are influenced by the thresholds, especially when data are binary and the hypothesized model is greatly misspecified. The second part of the simulation varies the sample size from 100 to 1000 to investigate whether the SR fit indexes in finite samples are more accurate estimates of the defined values of RMSEA, CFI, and TLI than the uncorrected fit indexes (without robust correction) and the PR fit indexes. Results show that the SR fit indexes are more accurate in general. However, when the thresholds differ across items, data are binary, and the sample size is less than 500, all versions of these indexes can be very inaccurate; in such situations, larger sample sizes are needed. In addition, the conventional cutoffs developed for continuous data with maximum likelihood (e.g., RMSEA < .06, CFI > .95, and TLI > .95; Hu & Bentler, 1999) have been applied to WLSMV and ULSMV despite arguments against such a practice (e.g., Marsh, Hau, & Wen, 2004). For comparison purposes, this dissertation also reports the RMSEA, CFI, and TLI based on the continuous data using maximum likelihood before the variables are categorized to create the ordered categorical data. Results show that the model fit indexes from maximum likelihood are very different from those from WLSMV and ULSMV, suggesting that the conventional rules should not be applied to WLSMV and ULSMV. (A short sketch of the chi-square-based index formulas follows this record.) / A Dissertation submitted to the Department of Educational Psychology and Learning Systems in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / Summer Semester 2016. / July 5, 2016. / Model Fit Indexes, Ordered Categorical Data, Structural Equation Modeling, ULSMV, WLSMV / Includes bibliographical references. / Yanyun Yang, Professor Directing Dissertation; Fred W. Huffer, University Representative; Russell G. Almond, Committee Member; Betsy J. Becker, Committee Member; Insu Paek, Committee Member.
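For reference, a small sketch of the usual chi-square-based definitions of RMSEA, CFI, and TLI into which a robust or uncorrected chi-square can be plugged. The n - 1 convention in the RMSEA denominator varies across software, and the numeric values below are purely illustrative.

```python
import numpy as np

def fit_indexes(chisq, df, chisq_base, df_base, n):
    """Chi-square-based model fit indexes (target model vs. baseline model)."""
    rmsea = np.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))
    cfi = 1.0 - max(chisq - df, 0.0) / max(chisq_base - df_base, chisq - df, 0.0)
    tli = ((chisq_base / df_base) - (chisq / df)) / ((chisq_base / df_base) - 1.0)
    return {"RMSEA": rmsea, "CFI": cfi, "TLI": tli}

# Hypothetical chi-square values for illustration only.
print(fit_indexes(chisq=85.3, df=40, chisq_base=1200.0, df_base=55, n=500))
```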
|
560 |
Nonparametric Detection of Arbitrary Changes to Distributions and Methods of Regularization of Piecewise Constant Functional Data
Nonparametric statistical methods can refer to a wide variety of techniques. In this dissertation, we focus on two problems in statistics that are common applications of nonparametric statistics. The main body of the dissertation focuses on distribution-free process control for the detection of arbitrary changes to the distribution of an underlying random variable. A secondary problem, also under the broad umbrella of nonparametric statistics, is the proper approximation of a function. Statistical process control minimizes disruptions to a properly controlled process and quickly terminates out-of-control processes. Strict distributional assumptions, although rarely satisfied in practice, are often needed to monitor these processes, and previous models have often focused exclusively on monitoring changes in the mean or variance of the underlying process. The proposed model establishes a monitoring method that requires few distributional assumptions while detecting arbitrary changes in the underlying distribution generating the data. No assumptions are made on the form of the in-control distribution other than independence within and between observed samples. Windowing is employed to reduce the computational complexity of the algorithm and to ensure fast detection of changes. Results indicate quicker detection of large jumps than in many previously established methods. It is now common to analyze large quantities of data generated by sensors over time, and traditional analysis techniques do not incorporate the inherent functional structure often present in this type of data. The second focus of this dissertation is the development of an analysis method for functional data where the range of the function has a discrete, ordinal structure. Spline-based methods are used with a piecewise constant function approximation. After a large amount of data reduction is achieved, generalized linear mixed model methodology is employed to model the data. / A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / Spring Semester 2017. / April 6, 2017. / functional data, nonparametric, process control, regularization / Includes bibliographical references. / Eric Chicken, Professor Directing Dissertation; Guosheng Liu, University Representative; Debdeep Pati, Committee Member; Minjing Tao, Committee Member.
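A toy illustrative sketch of windowed, distribution-free monitoring in the spirit of the first project above, using a two-sample Kolmogorov-Smirnov test between an in-control reference window and each incoming window. It is a generic sketch under assumed settings, not the dissertation's proposed procedure.

```python
import numpy as np
from scipy.stats import ks_2samp

def monitor(stream, window=50, alpha=0.001):
    """Compare each new window against an in-control reference window with a
    two-sample KS test; signal a change when the p-value drops below alpha."""
    reference = stream[:window]
    for start in range(window, len(stream) - window + 1, window):
        current = stream[start:start + window]
        if ks_2samp(reference, current).pvalue < alpha:
            return start          # start index of the first flagged window
    return None                   # no change signaled

# Toy stream: standard normal while in control, then an arbitrary (non-normal)
# distributional change at index 600.
rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0.0, 1.0, 600), rng.exponential(1.0, 400)])
print("change signaled at index:", monitor(stream))
```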
|