41 |
A THEORY BASED, DATA DRIVEN SELECTION FOR THE REGULARIZATION PARAMETER FOR LASSO
Daniel Martins Coutinho, 25 March 2021
We provide a new way to select the regularization parameter for the LASSO and adaLASSO. It is based on the theory and incorporates an estimate of the variance of the noise. We present theoretical properties of the procedure, together with Monte Carlo simulations showing that it can handle more variables in the active set than other popular choices of the regularization parameter.
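To illustrate what a theory-based penalty that uses a noise-variance estimate can look like, here is a minimal sketch. It is not the rule proposed in the thesis: it uses the familiar textbook rate sigma * sqrt(2 log p / n), whose constant varies across references and depends on how the loss is scaled. The function names and the residual-based variance estimate are assumptions made for illustration.

```python
import math


def estimate_noise_sd(residuals, dof):
    """Noise standard deviation from residuals of a pilot fit.

    `dof` is the residual degrees of freedom (assumed known here).
    """
    rss = sum(r * r for r in residuals)
    return math.sqrt(rss / dof)


def universal_lambda(sigma_hat, n, p):
    """Textbook penalty level sigma * sqrt(2 log p / n).

    Illustrative only; not the selection rule from the thesis.
    """
    return sigma_hat * math.sqrt(2.0 * math.log(p) / n)


# Example: 100 observations, 500 candidate variables, unit noise.
lam = universal_lambda(sigma_hat=1.0, n=100, p=500)
print(lam)
```

The penalty grows with the estimated noise level and with log p, and shrinks as the sample size grows, which is the qualitative behaviour a data-driven choice of this kind is meant to capture.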
|
42 |
Feature selection for spatial point processes
Choiruddin, Achmad, 15 September 2017
Recent applications such as forestry datasets involve the observation of spatial point pattern data combined with the observation of many spatial covariates. In this thesis we consider the problem of estimating a parametric form of the intensity function in such a context. The thesis develops feature selection procedures and gives some guarantees on their validity. In particular, we propose two different feature selection approaches: lasso-type methods and Dantzig selector-type procedures. For the lasso-type methods, we derive asymptotic properties of the estimates obtained from estimating functions derived from Poisson and logistic regression likelihoods penalized by a large class of penalties. We prove that the estimates obtained from such procedures satisfy consistency, sparsity, and asymptotic normality. For the Dantzig selector part, we develop a modified version of the Dantzig selector, which we call the adaptive linearized Dantzig selector (ALDS), to obtain the intensity estimates. More precisely, the ALDS estimates are defined as the solution to an optimization problem that minimizes the sum of the coefficients of the estimates, subject to a constraint given by a linear approximation of the score vector.
We find that the estimates obtained from such methods have asymptotic properties similar to those obtained previously using an adaptive lasso regularization term. We investigate the computational aspects of the methods developed using either the lasso-type procedures or the Dantzig selector-type approaches. We establish links between intensity estimation for spatial point processes and generalized linear models (GLMs), so that we only have to deal with feature selection procedures for GLMs. Thus, easier computational procedures are implemented and a computationally fast algorithm is proposed. Simulation experiments are conducted to highlight the finite-sample performance of the estimates from each of the two proposed approaches. Finally, our methods are applied to model the spatial locations of a species of tree in a forest observed with a large number of environmental factors.
|
43 |
Contributions to Structured Variable Selection Towards Enhancing Model Interpretation and Computation Efficiency
Shen, Sumin, 07 February 2020
Advances in data-collecting technologies provide great opportunities to access large-sample data sets with high dimensionality. Variable selection is an important procedure for extracting useful knowledge from such complex data. In many real-data applications, appropriate selection of variables should also facilitate model interpretation and computational efficiency. It is thus important to incorporate domain knowledge of the underlying data generation mechanism when selecting key variables to improve model performance. However, general variable selection techniques, such as best subset selection and the Lasso, often do not take the underlying data generation mechanism into consideration. This thesis proposal aims to develop statistical modeling methodologies with a focus on structured variable selection for better model interpretation and computational efficiency. Specifically, the proposal consists of three parts: an additive heredity model with coefficients incorporating multi-level data, a regularized dynamic generalized linear model with piecewise constant functional coefficients, and a structured variable selection method within the best subset selection framework.
In Chapter 2, an additive heredity model is proposed for analyzing mixture-of-mixtures (MoM) experiments. The MoM experiment differs from the classical mixture experiment in that the mixture component in MoM experiments, known as the major component, is made up of sub-components, known as the minor components. The proposed model uses an additive structure to inherently connect the major components with the minor components. To enable a meaningful interpretation of the estimated model, we apply the hierarchical and heredity principles by using the nonnegative garrote technique for model selection. The performance of the additive heredity model was compared to several conventional methods in both unconstrained and constrained MoM experiments. The additive heredity model was then successfully applied to a real problem, previously studied in the literature, of optimizing the Pringles® potato crisp.
In Chapter 3, we consider the dynamic effects of variables in generalized linear models such as logistic regression. This work is motivated by an engineering problem in which the effects of process variables on product quality vary because of equipment degradation. To address this challenge, we propose a penalized dynamic regression model that can flexibly estimate the dynamic coefficient structure. The proposed method models the functional coefficient parameters as piecewise constant functions. Specifically, under the penalized regression framework, the fused lasso penalty is adopted for detecting changes in the dynamic coefficients, and the group lasso penalty is applied to enable a sparse selection of variables. Moreover, an efficient parameter estimation algorithm is developed based on the alternating direction method of multipliers. The performance of the dynamic coefficient model is evaluated in numerical studies and three real-data examples.
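As a reading aid, a penalized objective of the kind described for Chapter 3 can be written generically as follows. The notation is assumed for illustration and is not taken from the thesis: \ell is the GLM log-likelihood, \beta_{j,t} is the coefficient of variable j at time t, and \lambda_1, \lambda_2 \ge 0 are tuning parameters.

```latex
\min_{\beta}\; -\ell(\beta)
  \;+\; \lambda_1 \sum_{j=1}^{p} \sum_{t=2}^{T} \bigl|\beta_{j,t} - \beta_{j,t-1}\bigr|
  \;+\; \lambda_2 \sum_{j=1}^{p} \bigl\|\beta_{j,\cdot}\bigr\|_2
```

The first penalty (fused lasso) pushes consecutive coefficients of the same variable to be equal, which yields piecewise constant coefficient paths with a small number of change points. The second penalty (group lasso) can set the entire path \beta_{j,\cdot} of a variable to zero, which removes that variable from the model.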
In Chapter 4, we develop a structured variable selection method within the best subset selection framework. In the literature, many techniques within the LASSO framework have been developed to address structured variable selection issues. However, less attention has been paid to structured best subset selection problems. In this work, we propose a sparse Ridge regression method to address structured variable selection issues. The key idea of the proposed method is to reconstruct the regression matrix from the perspective of experimental design. We employ the estimation-maximization algorithm to formulate the best subset selection problem as an iterative linear integer optimization (LIO) problem, with the mixed integer optimization algorithm as the selection step. We demonstrate the power of the proposed method in various structured variable selection problems. Moreover, the proposed method can be extended to ridge-penalized best subset selection problems. The performance of the proposed method is evaluated in numerical studies.
Doctor of Philosophy
Advances in data-collecting technologies provide great opportunities to access large-sample data sets with high dimensionality. Variable selection is an important procedure for extracting useful knowledge from such complex data. In many real-data applications, appropriate selection of variables should also facilitate model interpretation and computational efficiency. It is thus important to incorporate domain knowledge of the underlying data generation mechanism when selecting key variables to improve model performance. However, general variable selection techniques often do not take the underlying data generation mechanism into consideration. This thesis proposal aims to develop statistical modeling methodologies with a focus on structured variable selection for better model interpretation and computational efficiency. The proposed approaches have been applied to real-world problems to demonstrate their performance.
|
44 |
Predictor Selection in Linear Regression: L1 regularization of a subset of parameters and Comparison of L1 regularization and stepwise selection
Hu, Qing, 11 May 2007
Background: Feature selection, also known as variable selection, is a technique that selects a subset from a large collection of possible predictors to improve the prediction accuracy of a regression model. The first objective of this project is to investigate under which data structures LASSO outperforms the forward stepwise method. The second objective is to develop a feature selection method, Feature Selection by L1 Regularization of Subset of Parameters (LRSP), which selects the model by combining prior knowledge about the inclusion of some covariates, if any, with the information collected from the data. Mathematically, LRSP minimizes the residual sum of squares subject to the sum of the absolute values of a subset of the coefficients being less than a constant. In this project, LRSP is compared with LASSO, Forward Selection, and Ordinary Least Squares to investigate their relative performance for different data structures. Results: Simulation results indicate that, for a moderate number of small effects, forward selection outperforms LASSO in both prediction accuracy and variable selection performance when the variance of the model error term is smaller, regardless of the correlations among the covariates. Forward selection also performs better in variable selection when the variance of the error term is larger but the correlations among the covariates are smaller. LRSP was shown to be an efficient method for problems in which prior knowledge about the inclusion of covariates is available, and it can also be applied to problems with nuisance parameters, such as linear discriminant analysis.
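The verbal description of LRSP above can be written as a constrained least-squares problem. The symbols are assumed for illustration: S is the index set of coefficients subject to the constraint, t > 0 is the constant, and the constraint is written with the usual non-strict inequality.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2
\quad \text{subject to} \quad \sum_{j \in S} |\beta_j| \le t
```

Coefficients outside S, for example those of covariates known in advance to belong in the model, are left unconstrained. Taking S to be all coefficients gives the constrained form of the LASSO.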
|
45 |
Room Correction for Smart Speakers
Mårtensson, Simon, January 2019
Portable smart speakers with wireless connections have become more popular in recent years. These speakers are often moved to new locations and placed in different positions in different rooms, which affects the sound a listener hears from the speaker. They usually have built-in microphones, typically used for voice recording. This thesis aims to provide a way to compensate for the effect of the speaker's position on the sound (so-called room correction) using the speaker itself and its microphones. Firstly, the room frequency response is estimated for several different speaker positions in a room. The room frequency response is the frequency response between the speaker and the listener. From these estimates, the relationship between the speaker's position and the room frequency response is modeled. Secondly, an algorithm that estimates the speaker's position is developed. The algorithm estimates the position by detecting reflections from nearby walls using the microphones on the speaker. The acquired position estimates are used as input to the room frequency response model, which makes it possible to apply room correction automatically when the speaker is placed in a new position. The room correction is shown to correct the room frequency response so that the bass has the same power as the mid- and high-frequency sounds from the speaker, in line with the research aim. The room correction is also shown to make the room frequency response vary less with the speaker's position.
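The geometric idea behind estimating position from wall reflections can be illustrated with a very small sketch. It is not the algorithm developed in the thesis: it assumes the loudspeaker and microphone are effectively at the same point, that a single echo delay has already been measured, and that sound travels at about 343 m/s. The function name and constant are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def wall_distance_m(echo_delay_s, speed=SPEED_OF_SOUND_M_S):
    """Distance to a reflecting wall from a round-trip echo delay.

    The sound travels to the wall and back, so the one-way distance
    is half of speed * delay.
    """
    if echo_delay_s < 0:
        raise ValueError("echo delay must be non-negative")
    return speed * echo_delay_s / 2.0


# A reflection arriving 10 ms after the direct sound.
print(wall_distance_m(0.010))
```

Distances to two or more walls obtained this way would constrain where in the room the speaker sits, which is the kind of input a position-dependent room response model needs.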
|
46 |
Analysis and comparison of some alternative methods of selection of predictor variables in linear regression models
Marques, Matheus Augustus Pumputis, 04 June 2018
This work studies some new variable selection methods for linear regression that have appeared in the last 15 years, specifically LARS (Least Angle Regression), NAMS (Noise Addition Model Selection), the False Selection Rate (FSR), the Bayesian LASSO and the Spike-and-Slab LASSO. The methodology consisted of analyzing and comparing the methods studied. Following this study, applications to real data sets are presented, as well as a simulation study, in which all methods are shown to be promising, with the Bayesian methods showing the best results.
|
47 |
Penalised regression for high-dimensional data: an empirical investigation and improvements via ensemble learning
Wang, Fan, January 2019
In a wide range of applications, datasets are generated for which the number of variables p exceeds the sample size n. Penalised likelihood methods are widely used to tackle regression problems in these high-dimensional settings. In this thesis, we carry out an extensive empirical comparison of the performance of popular penalised regression methods in high-dimensional settings and propose new methodology that uses ensemble learning to enhance the performance of these methods. The relative efficacy of different penalised regression methods in finite-sample settings remains incompletely understood. Through a large-scale simulation study, consisting of more than 1,800 data-generating scenarios, we systematically consider the influence of various factors (for example, sample size and sparsity) on method performance. We focus on three related goals (prediction, variable selection and variable ranking) and consider six widely used methods. The results are supported by a semi-synthetic data example. Our empirical results complement existing theory and provide a resource for comparing performance across a range of settings and metrics. We then propose a new ensemble learning approach for improving the performance of penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, which builds and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by the variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner".
In simulations, we show that STRANDS typically improves upon its base learner, and demonstrate that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space is explored. We propose another ensemble learning method to improve the prediction performance of Ridge Regression in sparse settings. Specifically, we combine Bayesian Ridge Regression with a probabilistic forward selection procedure, where inclusion of a variable at each stage is probabilistically determined by a Bayes factor. We compare the prediction performance of the proposed method to that of penalised regression methods using simulated data.
|
48 |
Approaches to modelling functional time series with an application to electricity generation data
Jin, Zehui, January 2018
We study the half-hourly electricity generation by coal and by gas in the UK over a period of three years from 2012 to 2014. Because the series is sampled at high frequency, daily cycles, along with seasonality and trend across days, can be seen in the data for each fuel. Taylor (2003), Taylor et al. (2006), and Taylor (2008) studied time series with similar features by introducing double seasonality into methods for a single univariate time series. As we are interested in the continuous variation in generation within a day, the half-hourly observations within a day are treated as a continuous function. In this way, a time series of half-hourly discrete observations is transformed into a time series of daily functions. The idea of a time series of functions can also be seen in Shang (2013), Shang and Hyndman (2011) and Hyndman and Ullah (2007). We improve their methods in a few ways. Firstly, we identify the systematic effect due to factors that act over the long term, such as weather and fuel prices, and the intrinsic differences between the days of the week. The systematic effect is modeled and removed before we study the day-by-day random variation in the functions. Secondly, we extend functional principal component analysis (PCA), which was applied to one group of functions in Shang (2013), Shang and Hyndman (2011) and Hyndman and Ullah (2007), to partial common PCA, in order to consider the covariance structures of two groups of functions and their similarities. A test of the goodness of the approximation to the functions given by the common eigenfunctions is also proposed. The idea of bootstrapping residuals from the approximation, seen in Shang (2014), is employed but improved by using non-overlapping blocks and moving blocks of residuals. Thirdly, we use a vector autoregressive (VAR) model, which is a multivariate approach, to model the scores on the common eigenfunctions of a group so that the cross-correlation between the scores can be taken into account.
We include Lasso penalties in the VAR model to select the significant covariates and refit the selected model with ordinary least squares to reduce the bias. Our method is compared with the stepwise procedure of Pfaff (2007), and is shown to be less variable and more accurate in estimation and prediction. Finally, we propose a method for producing point forecasts of the daily functions. It is more complicated than the methods of Shang (2013), Shang and Hyndman (2011) and Hyndman and Ullah (2007) because the systematic effect needs to be included. An adjustment interval is also given along with each point forecast, representing the range within which the true function might vary. Our methods for producing the point forecast and the adjustment interval incorporate information that becomes available after the training period, which is not considered in the classical prediction equations of VAR and GARCH seen in Tsay (2013) and Engle and Bollerslev (1986).
|
49 |
Analysis and comparison of some alternative methods of selection of predictor variables in linear regression models
Matheus Augustus Pumputis Marques, 04 June 2018
This work studies some new variable selection methods for linear regression that have appeared in the last 15 years, specifically LARS (Least Angle Regression), NAMS (Noise Addition Model Selection), the False Selection Rate (FSR), the Bayesian LASSO and the Spike-and-Slab LASSO. The methodology consisted of analyzing and comparing the methods studied. Following this study, applications to real data sets are presented, as well as a simulation study, in which all methods are shown to be promising, with the Bayesian methods showing the best results.
|
50 |
High-Dimensional Analysis of Convex Optimization-Based Massive MIMO Decoders
Ben Atitallah, Ismail, 04 1900
A wide range of modern large-scale systems rely on recovering a signal from noisy linear measurements. In many applications, the useful signal has inherent properties, such as sparsity, low-rankness, or boundedness, and making use of these properties and structures allows a more efficient recovery. Hence, a significant amount of work has been dedicated to developing and analyzing algorithms that can take advantage of the signal structure. In particular, since the advent of Compressed Sensing (CS) there has been significant progress in this direction. Generally speaking, the signal structure can be harnessed by solving an appropriate regularized or constrained M-estimator.
In modern Multi-input Multi-output (MIMO) communication systems, all transmitted signals are drawn from finite constellations and are thus bounded. In addition, most recent modulation schemes, such as Generalized Space Shift Keying (GSSK) or Generalized Spatial Modulation (GSM), yield signals that are inherently sparse. In the recovery procedure, sparsity and boundedness can be promoted by using ℓ1-norm regularization and by imposing an ℓ∞-norm constraint, respectively.
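One generic M-estimator that combines the two devices just described is sketched here. The notation is assumed for illustration and is not taken from the thesis: y is the received vector, A the channel matrix, \lambda \ge 0 a regularization weight, and B > 0 a bound on the constellation amplitude.

```latex
\hat{x} \;=\; \arg\min_{\|x\|_{\infty} \le B}\;
  \tfrac{1}{2}\,\|y - A x\|_2^2 \;+\; \lambda \,\|x\|_1
```

The \ell_1 term encourages many entries of x to be exactly zero, while the \ell_\infty constraint keeps every entry within [-B, B]. Both are convex, so the decoder remains a convex program.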
In this thesis, we propose novel optimization algorithms to recover certain classes of structured signals, with emphasis on MIMO communication systems. The exact analysis permits a clear characterization of how well these systems perform, and it also allows automatic tuning of the parameters. In each context, we define the appropriate performance metrics and analyze them exactly in the High Dimensional Regime (HDR).
The framework we use for the analysis is based on Gaussian process inequalities; in particular, on a new strong and tight version of a classical comparison inequality (due to Gordon, 1988) in the presence of additional convexity assumptions. The new framework that emerged from this inequality is called the Convex Gaussian Min-max Theorem (CGMT).
|