Advances in Model Selection Techniques with Applications to Statistical Network Analysis and Recommender Systems / Franco Saldana, Diego, January 2016
This dissertation focuses on developing novel model selection techniques, where model selection is the process by which a statistician selects one of a number of competing models of varying dimensions, under an array of different statistical assumptions on the observed data. Traditionally, researchers have advocated two main reasons for preferring model selection strategies over classical maximum likelihood estimates (MLEs). The first reason is prediction accuracy: by shrinking some model parameters or setting them to zero, one sacrifices the unbiasedness of MLEs for a reduced variance, which in turn leads to an overall improvement in predictive performance. The second reason relates to the interpretability of the selected models in the presence of a large number of predictors: in order to obtain a parsimonious representation of the relationship between the response and covariates, we are willing to sacrifice some of the smaller details brought in by spurious predictors. In the first part of this work, we revisit the family of variable selection techniques known as sure independence screening procedures for generalized linear models and the Cox proportional hazards model. After combining some of its most powerful variants, we propose new extensions based on the idea of sample splitting, data-driven thresholding, and combinations thereof. A publicly available package developed in the R statistical software demonstrates considerable improvements in model selection, with competitive computational time, for our enhanced variable selection procedures relative to traditional penalized likelihood methods applied directly to the full set of covariates. Next, we develop model selection techniques within the framework of statistical network analysis for two frequent problems arising in the context of stochastic blockmodels: community number selection and change-point detection.
In the second part of this work, we propose a composite likelihood based approach for selecting the number of communities in stochastic blockmodels and its variants, with robustness consideration against possible misspecifications in the underlying conditional independence assumptions of the stochastic blockmodel. Several simulation studies, as well as two real data examples, demonstrate the superiority of our composite likelihood approach when compared to the traditional Bayesian Information Criterion or variational Bayes solutions. In the third part of this thesis, we extend our analysis on static network data to the case of dynamic stochastic blockmodels, where our model selection task is the segmentation of a time-varying network into temporal and spatial components by means of a change-point detection hypothesis testing problem. We propose a corresponding test statistic based on the idea of data aggregation across the different temporal layers through kernel-weighted adjacency matrices computed before and after each candidate change-point, and illustrate our approach on synthetic data and the Enron email corpus. The matrix completion problem consists in the recovery of a low-rank data matrix based on a small sampling of its entries. In the final part of this dissertation, we extend prior work on nuclear norm regularization methods for matrix completion by incorporating a continuum of penalty functions between the convex nuclear norm and nonconvex rank functions. We propose an algorithmic framework for computing a family of nonconvex penalized matrix completion problems with warm-starts, and present a systematic study of the resulting spectral thresholding operators. We demonstrate that our proposed nonconvex regularization framework leads to improved model selection properties in terms of finding low-rank solutions with better predictive performance on a wide range of synthetic data and the famous Netflix data recommender system.
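The "spectral thresholding operators" mentioned at the end of the abstract above can be illustrated with a small sketch. The abstract does not name the penalty family, so the minimax concave penalty (MCP) used here is an assumption chosen only because its thresholding rule interpolates between soft thresholding (the nuclear norm case) and hard thresholding. All function names are hypothetical, and the rules are applied to a plain list of singular values rather than to a full matrix decomposition.

```python
def soft_threshold(d, lam):
    """Convex (nuclear norm) case: shrink a singular value d >= 0 by lam."""
    return max(d - lam, 0.0)


def hard_threshold(d, lam):
    """Hard-thresholding limit: keep d unchanged if it exceeds lam, else zero."""
    return d if d > lam else 0.0


def mcp_threshold(d, lam, gamma):
    """MCP thresholding of a singular value d >= 0, with gamma > 1 (assumed family).

    gamma -> infinity approaches soft thresholding;
    gamma -> 1 from above approaches hard thresholding at level lam.
    """
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1")
    if d <= lam:
        return 0.0
    if d <= gamma * lam:
        return (d - lam) / (1.0 - 1.0 / gamma)
    return d


def threshold_spectrum(singular_values, rule):
    """Apply a scalar thresholding rule to every singular value."""
    return [rule(d) for d in singular_values]


if __name__ == "__main__":
    spectrum = [5.0, 2.0, 0.5]
    print(threshold_spectrum(spectrum, lambda d: soft_threshold(d, 1.0)))
    print(threshold_spectrum(spectrum, lambda d: mcp_threshold(d, 1.0, 3.0)))
    print(threshold_spectrum(spectrum, lambda d: hard_threshold(d, 1.0)))
```

In a matrix completion iteration, a rule of this kind would be applied to the singular values of the current filled-in matrix; small singular values are set to zero, which is what produces low-rank solutions.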
Weighted quantile regression and oracle model selection. / CUHK electronic theses & dissertations collection / January 2009
In this dissertation I suggest a new (regularized) weighted quantile regression estimation approach for nonlinear regression models and double threshold ARCH (DTARCH) models. I allow the number of parameters in the nonlinear regression models to be fixed or to diverge. The proposed estimation method is robust and efficient and is applicable to other models. I use the adaptive-LASSO and SCAD regularization to select parameters in the nonlinear regression models. I simultaneously estimate the AR and ARCH parameters in the DTARCH model using the proposed weighted quantile regression. The merits of the proposed methodology are demonstrated. / Keywords: Weighted quantile regression, Adaptive-LASSO, High dimensionality, Model selection, Oracle property, SCAD, DTARCH models. / Under regularity conditions, I establish asymptotic distributions of the proposed estimators, which show that the model selection methods perform as well as if the correct submodels were known in advance. I also suggest an algorithm for fast implementation of the proposed methodology. Simulations are conducted to compare different estimators, and a real example is used to illustrate their performance. / Jiang, Xuejun. / Adviser: Xinyuan Song. / Source: Dissertation Abstracts International, Volume: 73-01, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2009. / Includes bibliographical references (leaves 86-92). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong,  System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Lee Shun-yi. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. / Includes bibliographical references (leaves 57-61). / Abstracts in English and Chinese. / Chapter 1 --- Introduction / Chapter 1.1 --- Introduction / Chapter 1.2 --- Tests of Hypotheses / Chapter 1.2.1 --- Likelihood Ratio Statistic / Chapter 1.2.2 --- Rao's Score Statistic / Chapter 1.2.3 --- Wald's Statistic / Chapter 1.3 --- Notation / Chapter 2 --- Fixed Effects Model / Chapter 2.1 --- Introduction / Chapter 2.2 --- Pearson Chi-square Statistic / Chapter 2.3 --- Logistic Regression Model / Chapter 2.3.1 --- Testing Linear Hypotheses about the Regression Coefficients / Chapter 2.4 --- Combining Proportions / Chapter 2.4.1 --- Classical Estimators / Chapter 2.4.2 --- Jackknife Estimator / Chapter 2.4.3 --- Cross-validatory estimators / Chapter 3 --- Random Effects Model / Chapter 3.1 --- Introduction / Chapter 3.2 --- DerSimonian and Laird Method / Chapter 3.3 --- Generalized linear model with random effect / Chapter 3.3.1 --- Quasi-Likelihood / Chapter 3.3.2 --- Testing Linear Hypotheses about the Regression Coefficients / Chapter 3.3.3 --- MINQUE / Chapter 3.3.4 --- Score Test / Chapter 4 --- Overdispersion and Intraclass Correlation / Chapter 4.1 --- Introduction / Chapter 4.2 --- C(α) Test / Chapter 4.2.1 --- Correlated Binomial model and Beta-Binomial model / Chapter 4.2.2 --- C(α) Statistic Based On Quasi-likelihood / Chapter 4.3 --- Donner Statistic / Chapter 4.4 --- Rao and Scott Statistic / Chapter 5 --- Example and Discussion / Bibliography
Foster, Scott David
This thesis concerns the identification of quantitative trait loci (QTL) for important traits in cattle line crosses. One of these traits is birth weight of calves, which affects both animal production and welfare through correlated effects on parturition and subsequent growth. Birth weight was one of the traits measured in the Davies' Gene Mapping Project. These data form the motivation for the methods presented in this thesis. Multiple QTL models have been previously proposed and are likely to be superior to single QTL models. The multiple QTL models can be loosely divided into two categories: (1) model building methods that aim to generate good models that contain only a subset of all the potential QTL; and (2) methods that consider all the observed marker explanatory variables. The first set of methods can be misleading if an incorrect model is chosen. The second set of methods does not have this limitation. However, a full fixed effect analysis is generally not possible as the number of marker explanatory variables is typically large with respect to the number of observations. This can be overcome by using constrained estimation methods or by making the marker effects random. One method of constrained estimation is the least absolute shrinkage and selection operator (LASSO). This method has the appealing ability to produce predictions of effects that are identically zero. The LASSO can also be specified as a random model where the effects follow a double exponential distribution. In this thesis, the LASSO is investigated from a random effects model perspective. Two methods to approximate the marginal likelihood are presented. The first uses the standard form for the double exponential distribution and requires adjustment of the score equations for unbiased estimation. The second is based on an alternative probability model for the double exponential distribution.
It was developed late in the candidature and gives similar dispersion parameter estimates to the first approximation, but does so in a more direct manner. The alternative LASSO model suggests some novel types of predictors. Methods for a number of different types of predictors are specified and are compared for statistical efficiency. Initially, inference for the LASSO effects is performed using simulation. Essentially, this treats the random effects as fixed effects and tests the null hypothesis that the effect is zero. In simulation studies, it is shown to be a useful method to identify important effects. However, the effects are random, so such a test is not strictly appropriate. After the specification of the alternative LASSO model, a method for making probability statements about the random effects being above or below zero is developed. This method is based on the predictive distribution of the random effects (posterior in Bayesian terminology). The random LASSO model is not sufficiently flexible to model most QTL mapping data. Typically, these data arise from large experiments and require models containing terms for experimental design. For example, the Davies' Gene Mapping experiment requires fixed effects for different sires, a covariate for birthdate within season and random normal effects for management group. To accommodate these sources of variation a mixed model is employed. The marker effects are included into this model as random LASSO effects. Estimation of the dispersion parameters is based on an approximate restricted likelihood (an extension of the first method of estimation for the simple random effects model). Prediction of the random effects is performed using a generalisation of Henderson's mixed model equations. The performance of the LASSO linear mixed model for QTL identification is assessed via simulation. It performs well against other commonly used methods but it may lack power for lowly heritable traits in small experiments. 
However, the rate of false positives in such situations is much lower. Also, the LASSO method is more precise in locating the correct marker rather than a marker in its vicinity. Analysis of the Davies' Gene Mapping Data using the methods described in this thesis identified five non-zero marker-within-sire effects (there were 570 such effects). This analysis clearly shows that most of the genome does not affect the trait of interest. The simulation results and the analysis of the Davies' Gene Mapping Project Data show that the LASSO linear mixed model is a competitive method for QTL identification. It provides a flexible method to model the genetic and experimental effects simultaneously. / Thesis (Ph.D.)--School of Agriculture, Food and Wine, 2006.
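The connection between the LASSO penalty and double exponential random effects described in the abstract above can be written out in a few lines. This is a sketch under assumed notation (the symbols y, X, Z, β, u, σ², λ and q do not appear in the original record) for the simple case of independent normal errors, not the thesis's own derivation:

```latex
\begin{aligned}
y \mid u &\sim N\!\left(X\beta + Zu,\ \sigma^2 I\right), \qquad
p(u_j) = \frac{\lambda}{2}\exp\!\left(-\lambda\,|u_j|\right), \quad j = 1,\dots,q, \\
-\log p(y, u) &= \frac{1}{2\sigma^2}\,\lVert y - X\beta - Zu \rVert_2^2
  + \lambda \sum_{j=1}^{q} |u_j| + \text{const}, \\
\hat{u} &= \arg\min_{u}\ \Bigl\{ \lVert y - X\beta - Zu \rVert_2^2
  + 2\sigma^2\lambda \sum_{j=1}^{q} |u_j| \Bigr\}.
\end{aligned}
```

Under these assumptions the joint mode in u is a LASSO estimate with penalty weight 2σ²λ, which is why some predicted marker effects are exactly zero.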
Residual maximum likelihood (REML) estimation is a popular method of estimation for variance parameters in linear mixed models, which typically requires an iterative scheme. The aim of this thesis is to review several popular iterative schemes and to develop an improved iterative strategy that will work for a wide class of models. The average information (AI) algorithm is a computationally convenient and efficient algorithm to use when starting values are in the neighbourhood of the REML solution. However, when reasonable starting values are not available, the algorithm can fail to converge. The expectation-maximisation (EM) algorithm and the parameter expanded EM (PXEM) algorithm are good alternatives in these situations but they can be very slow to converge. The formulation of these algorithms for a general linear mixed model is presented, along with their convergence properties. A series of hybrid algorithms is presented. EM or PXEM iterations are used initially to obtain variance parameter estimates that are in the neighbourhood of the REML solution, and then AI iterations are used to ensure rapid convergence. Composite local EM/AI and local PXEM/AI schemes are also developed; the local EM and local PXEM algorithms update only the random effect variance parameters, with the estimates of the residual error variance parameters held fixed. Techniques for determining when to use EM-type iterations and when to switch to AI iterations are investigated. Methods for obtaining starting values for the iterative schemes are also presented. The performance of these various schemes is investigated for several different linear mixed models. A number of data sets are used, including published data sets and simulated data. The performance of the basic algorithms is compared to that of the various hybrid algorithms, using both uninformed and informed starting values. The theoretical and empirical convergence rates are calculated and compared for the basic algorithms.
The direct comparison of the AI and PXEM algorithms shows that the PXEM algorithm, although an improvement over the EM algorithm, still falls well short of the AI algorithm in terms of speed of convergence. However, when the starting values are too far from the REML solution, the AI algorithm can be unstable. Instability is most likely to arise in models with a more complex variance structure. The hybrid schemes use EM-type iterations to move close enough to the REML solution to enable the AI algorithm to successfully converge. They are shown to be robust to choice of starting values like the EM and PXEM algorithms, while demonstrating fast convergence like the AI algorithm. / Thesis (Ph.D.) - University of Adelaide, School of Agriculture, Food and Wine, 2008
Liu, Qing, 1961-
31 August 1993
This thesis considers likelihood inferences for generalized linear models with additional random effects. The likelihood function involved ordinarily cannot be evaluated in closed form and numerical integration is needed. The theme of the thesis is a closed-form approximation based on Laplace's method. We first consider a special yet important case of the above general setting -- the Mantel-Haenszel-type model with overdispersion. It is seen that the Laplace approximation is very accurate for likelihood inferences in that setting. The approach and results on accuracy apply directly to the more general setting involving multiple parameters and covariates. Attention is then given to how to maximize out nuisance parameters to obtain the profile likelihood function for parameters of interest. In evaluating the accuracy of the Laplace approximation, we utilized Gauss-Hermite quadrature. Although this is commonly used, it was found that in practice inadequate thought has been given to the implementation. A systematic method is proposed for transforming the variable of integration to ensure that the Gauss-Hermite quadrature is effective. We found that under this approach the Laplace approximation is a special case of the Gauss-Hermite quadrature. / Graduation date: 1994
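The closing remark of the abstract above, that the Laplace approximation is a special case of Gauss-Hermite quadrature once the variable of integration is suitably transformed, can be checked numerically. The sketch below is an illustration under assumptions that are not from the thesis: a single binomial observation with a normal random intercept, a transformation that centres at the mode of the integrand and scales by its curvature, and hard-coded 1-point and 3-point rules. All function names and parameter values are hypothetical.

```python
import math

# Gauss-Hermite rules for the weight exp(-x^2), stored as (node, weight) pairs.
GH_RULES = {
    1: [(0.0, math.sqrt(math.pi))],
    3: [
        (-math.sqrt(1.5), math.sqrt(math.pi) / 6.0),
        (0.0, 2.0 * math.sqrt(math.pi) / 3.0),
        (math.sqrt(1.5), math.sqrt(math.pi) / 6.0),
    ],
}


def log_integrand(u, y, m, eta, tau):
    """Log of the binomial kernel times the N(0, tau^2) density at random effect u."""
    lin = eta + u
    log_lik = y * lin - m * math.log1p(math.exp(lin))
    log_prior = -0.5 * (u / tau) ** 2 - 0.5 * math.log(2.0 * math.pi * tau ** 2)
    return log_lik + log_prior


def mode_and_scale(y, m, eta, tau, iters=50):
    """Newton iterations for the mode of the integrand and its curvature-based scale."""
    u = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-(eta + u)))
        grad = y - m * p - u / tau ** 2
        hess = -m * p * (1.0 - p) - 1.0 / tau ** 2
        u -= grad / hess
    p = 1.0 / (1.0 + math.exp(-(eta + u)))
    hess = -m * p * (1.0 - p) - 1.0 / tau ** 2
    return u, 1.0 / math.sqrt(-hess)


def laplace(y, m, eta, tau):
    """Laplace approximation to the integral of exp(log_integrand) over u."""
    u_hat, scale = mode_and_scale(y, m, eta, tau)
    return math.sqrt(2.0 * math.pi) * scale * math.exp(log_integrand(u_hat, y, m, eta, tau))


def adaptive_gauss_hermite(y, m, eta, tau, n_nodes):
    """Gauss-Hermite quadrature after the substitution u = u_hat + sqrt(2) * scale * x."""
    u_hat, scale = mode_and_scale(y, m, eta, tau)
    total = 0.0
    for x, w in GH_RULES[n_nodes]:
        u = u_hat + math.sqrt(2.0) * scale * x
        total += w * math.exp(x * x + log_integrand(u, y, m, eta, tau))
    return math.sqrt(2.0) * scale * total


if __name__ == "__main__":
    args = (7, 10, 0.0, 1.0)  # y successes out of m trials, linear predictor, random effect s.d.
    print("Laplace:          ", laplace(*args))
    print("Adaptive GH, n=1: ", adaptive_gauss_hermite(*args, 1))
    print("Adaptive GH, n=3: ", adaptive_gauss_hermite(*args, 3))
```

With one node the quadrature sum reduces to the Laplace formula, and adding nodes refines it; without the centring and scaling step, a small number of nodes can miss the region where the integrand is concentrated.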
23 February 1994
Graduation date: 1994
Batten, Douglas James,
Thesis (M.A.S.), Memorial University of Newfoundland, 2000. / Bibliography: leaves 71-73.
Item and person parameter estimation using hierarchical generalized linear models and polytomous item response theory models / Williams, Natasha Jayne. January 2003
Thesis (Ph. D.)--University of Texas at Austin, 2003. / Vita. Includes bibliographical references. Available also from UMI Company.
Item and person parameter estimation using hierarchical generalized linear models and polytomous item response theory models / Williams, Natasha Jayne, 27 July 2011
Not available / text