81 |
[en] FORECASTING INDUSTRIAL PRODUCTION IN BRAZIL USING MANY PREDICTORS / [pt] PREVENDO A PRODUÇÃO INDUSTRIAL BRASILEIRA USANDO MUITOS PREDITORES
LEONARDO DE PAOLI CARDOSO DE CASTRO 23 December 2016
In this article, we compare the forecasting accuracy of unrestricted and penalized regressions that use many predictors of the Brazilian industrial production index. We focus on the least absolute shrinkage and selection operator (Lasso) and its extensions. We also propose a combination of penalized regressions and a variable search algorithm (PVSA). Factor-based models serve as our benchmark specification. Our study produced three main findings. First, Lasso-based models outperformed the benchmark in short-term forecasts. Second, the PVSA outperformed the benchmark regardless of the forecast horizon. Finally, the variables with the greatest predictive power were consistently selected by all methods considered. As expected, these variables are closely related to Brazilian industrial activity. Examples include vehicle production and cardboard production.
|
82 |
Regulariserad linjär regression för modellering av företags valutaexponering / Regularised Linear Regression for Modelling of Companies' Currency Exposure
Hahn, Karin, Tamm, Erik January 2021
Quantitative methods are used in fund management to predict the change in companies' revenues at the next quarterly report compared to the corresponding quarter the year before. The Swedish bank SEB currently uses multiple linear regression with change in revenue as the dependent variable and changes in exchange rates as independent variables. This is problematic for three reasons. Firstly, currencies often exhibit strong multicollinearity, which yields unstable estimates. Secondly, a company's revenue may depend on only a subset of the currencies included in the dataset, so, with the multicollinearity in mind, it is beneficial not to regress against all the currencies. Thirdly, newer data is more relevant for the predictions. These issues can be handled by using regularisation and selection methods, more specifically elastic net and weighted regression. We evaluate these methods for a large number of companies by comparing the mean absolute error of multiple linear regression with that of regularised linear regression with weighting. The evaluation shows that such a model performs better for 65.0% of the companies included in a large global share index, with a mean absolute error of 14 percentage points. The conclusion is that elastic net and weighted regression address the problems with the original model and can be used for better predictions of how revenues depend on exchange rates.
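One common way to write down a model of the kind this abstract describes is a weighted elastic net objective. The formula below is a sketch under stated assumptions, not the authors' exact specification: the observation weights $w_i$ (larger for more recent quarters), the penalty strength $\lambda$ and the mixing parameter $\alpha$ are symbols introduced here for illustration.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \;
  \sum_{i=1}^{n} w_i \left( y_i - \beta_0 - x_i^\top \beta \right)^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right)
```

Here $y_i$ would be the change in revenue and $x_i$ the vector of exchange-rate changes for observation $i$. The L1 term can set coefficients exactly to zero, which selects a subset of currencies; the L2 term stabilises the estimates when currencies are strongly correlated; and the weights let newer data count more.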
|
83 |
Le lasso linéaire : une méthode pour des données de petites et grandes dimensions en régression linéaire / The linear lasso: a method for low- and high-dimensional data in linear regression
Watts, Yan 04 1900
In this thesis, we are interested in a geometric way of looking at the Lasso method in the context of linear regression. The Lasso is a method that simultaneously estimates the coefficients associated with the predictors and selects the important predictors to explain the response variable. The coefficients are calculated using computational algorithms. Despite its virtues, the Lasso method is forced to select at most n variables in high-dimensional contexts (p > n). Moreover, in a group of correlated variables, the Lasso selects a variable "at random", without regard to which variable is chosen.
To address these two problems, we turn to the Linear Lasso. The response vector is then seen as the focal point of the space, and all the explanatory variable vectors orbit around the response vector. The angles formed between the response vector and the explanatory variables are assumed to be fixed, and are used as a basis for constructing the method. The information contained in the explanatory variables is projected onto the response vector. The theory of normal linear models allows us to use ordinary least squares (OLS) for the coefficients of the Linear Lasso.
The Linear Lasso (LL) is performed in two steps. First, variables are dropped from the model based on their correlation with the response variable; the number of variables dropped (or ordered) in this step depends on a tuning parameter γ. Then, an exclusion criterion based on the variance of the distribution of the response variable is introduced to remove (or order) the remaining variables. Repeated cross-validation guides the choice of the final model.
Simulations are presented to study the algorithm for different values of the tuning parameter γ. Comparisons are made between the Linear Lasso and competing methods in low dimensions (Ridge, Lasso, SCAD, etc.). Improvements in the implementation of the method are suggested, for example the use of the one-standard-error (1se) rule, which yields more parsimonious models. An implementation of the LL algorithm is provided in the R function linlasso, available at https://github.com/yanwatts/linlasso.
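The 1se rule mentioned in the abstract can be illustrated with a small, generic sketch. It is not taken from the linlasso code: the function name `one_se_rule`, the candidate model sizes and the fold errors below are all hypothetical, and the rule is shown in its usual form (pick the smallest model whose mean cross-validation error is within one standard error of the best mean error).

```python
import math


def one_se_rule(model_sizes, fold_errors):
    """Return (chosen, best): indices of the 1se model and the minimum-error model.

    model_sizes : number of selected variables for each candidate model
    fold_errors : fold_errors[j][k] is the CV error of candidate j on fold k
    """
    means, ses = [], []
    for errs in fold_errors:
        k = len(errs)
        mean = sum(errs) / k
        # Sample variance across folds, then standard error of the mean.
        var = sum((e - mean) ** 2 for e in errs) / (k - 1)
        means.append(mean)
        ses.append(math.sqrt(var / k))

    best = min(range(len(means)), key=lambda j: means[j])
    threshold = means[best] + ses[best]

    # Among candidates within one SE of the best, prefer the smallest model.
    eligible = [j for j in range(len(means)) if means[j] <= threshold]
    chosen = min(eligible, key=lambda j: model_sizes[j])
    return chosen, best


# Hypothetical candidates: 12, 5 and 1 selected variables, three CV folds each.
model_sizes = [12, 5, 1]
fold_errors = [[1.0, 1.2, 0.8], [1.05, 1.15, 0.95], [2.0, 2.2, 1.8]]
chosen, best = one_se_rule(model_sizes, fold_errors)
print(model_sizes[chosen], model_sizes[best])  # 5 12
```

In this toy example the 12-variable model has the lowest mean error, but the 5-variable model is statistically indistinguishable from it, so the rule returns the more parsimonious one.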
|
84 |
Three Essays in Economics
Daniel G Kebede (16652025) 03 August 2023
The overall theme of my dissertation is applying frontier econometric models to interesting economic problems. The first chapter starts from the observation that the analysis of how individual consumption responds to permanent and transitory income shocks is limited by model misspecification and data availability. The misspecification arises from ignoring unemployment risk while estimating income shocks. I employ the Heckman two-step regression model to consistently estimate income shocks. Moreover, to deal with data sparsity, I propose identifying partial consumption insurance and the heterogeneity in income and consumption volatility at the household level using the Least Absolute Shrinkage and Selection Operator (LASSO). Using PSID data, I estimate partial consumption insurance against permanent shocks of 63% and 49% for white and black household heads, respectively; white and black household heads self-insure against 100% and 90% of transitory income shocks, respectively. Moreover, I find that income and consumption volatilities and partial consumption insurance parameters vary across time. In the second chapter, I recast the smooth structural break test proposed by Chen and Hong (2012) in a predictive regression setting. The regressors are characterized using the local to non-stationarity framework. I conduct a Monte Carlo experiment to evaluate the finite-sample performance of the test statistic and examine an empirical example to demonstrate its practical application. The Monte Carlo simulations show that the test statistic has better power and size than the popular SupF and LM tests. Empirically, compared to SupF and LM, the test statistic rejects the null hypothesis of no structural break more frequently when a structural break is actually present in the data. The third chapter is a collaboration with James Reeder III. We study the effects of using promotions to drive public policy diffusion in regions with polarized political beliefs. We estimate a model that allows for heterogeneous effects at the county level, based upon state-level promotional offerings to drive vaccine adoption during COVID-19. Central to our empirical application is accounting for the endogenous action of state-level agents in generating promotional schemes. To address this challenge, we synthesize various sources of county-level data and leverage advances in both the Bass Diffusion model and machine learning. Studying vaccination rates at the county level within the United States, we find evidence that the use of promotions actually reduced overall vaccine adoption rates, a stark difference from other studies examining more localized vaccination rates. The negative average effect is driven primarily by the large number of counties that are described as Republican-leaning based upon their voting record in the 2020 election. Even when directly accounting for the population's vaccine hesitancy, this result still stands. Thus, our analysis suggests that in the polarized setting of the United States electorate, more localized policies on contentious topics may yield better outcomes than broad, state-level dictates.
|
85 |
MTG-kortsprissättning: en regressionsanalys för att bestämma nyckelfaktorer för kortpriser / MTG Card Pricing: a Regression Analysis of Determining Key Factors of Card Prices
Michael, Adam January 2023
By analyzing the card properties of Magic the Gathering cards, models have been developed to determine their impact on card prices. Previous studies have not focused on gameplay properties, which distinguishes this work from previous research. To model the effect of gameplay properties, they have been quantified and examined using the least squares method and Lasso regression, with the help of the programming language R. The results indicate that factors directly related to collectability and playability have the greatest impact on the price of Magic the Gathering cards. These results have been discussed from various perspectives, such as those of Wizards of the Coast (the publisher of Magic the Gathering), players, collectors, and investors. By focusing on gameplay properties, this study has contributed to the field in a way that previous research has not, providing a more comprehensive understanding of the value of Magic the Gathering cards.
|
86 |
Regularized and robust regression methods for high dimensional data
Hashem, Hussein Abdulahman January 2014
Recently, variable selection in high-dimensional data has attracted much research interest. Classical stepwise subset selection methods are widely used in practice, but when the number of predictors is large these methods are difficult to implement. In these cases, modern regularization methods have become a popular choice as they perform variable selection and parameter estimation simultaneously. However, the estimation procedure becomes more difficult and challenging when the data suffer from outliers or when the assumption of normality is violated, such as in the case of heavy-tailed errors. In these cases, quantile regression is the most appropriate method to use. In this thesis we combine these two classical approaches to produce regularized quantile regression methods. Chapter 2 presents a comparative simulation study of regularized and robust regression methods when the response variable is continuous. In chapter 3, we develop a quantile regression model with a group lasso penalty for binary response data when the predictors have a grouped structure and when the data suffer from outliers. In chapter 4, we extend this method to the case of censored response variables. Numerical examples on simulated and real data are used to evaluate the performance of the proposed methods in comparison with other existing methods.
|
87 |
Tuning Parameter Selection in L1 Regularized Logistic Regression
Shi, Shujing 05 December 2012
Variable selection is an important topic in regression analysis and is intended to select the best subset of predictors. The least absolute shrinkage and selection operator (Lasso) was introduced by Tibshirani in 1996. This method can serve as a tool for variable selection because it shrinks some coefficients to exactly zero through a constraint on the sum of the absolute values of the regression coefficients. For logistic regression, the Lasso modifies the traditional parameter estimation method, maximum log likelihood, by adding the L1 norm of the parameters to the negative log likelihood function, so it turns a maximization problem into a minimization one. To solve this problem, we first need to set the value of the parameter that multiplies the L1 norm, called the tuning parameter. Since the tuning parameter affects both coefficient estimation and variable selection, we want to find the value of the tuning parameter that gives the most accurate coefficient estimates and the best subset of predictors in the L1 regularized regression model. There are two popular methods for selecting the optimal value of the tuning parameter: the Bayesian information criterion (BIC) and cross-validation (CV). The objective of this paper is to evaluate and compare these two methods for selecting the optimal value of the tuning parameter, in terms of coefficient estimation accuracy and variable selection, through simulation studies.
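The optimization problem described in this abstract can be written out as follows. This is a sketch in standard notation rather than the thesis's own: the symbols $\beta_0$, $\beta$, $\lambda$ and $\mathrm{df}(\lambda)$ are introduced here, the response is assumed to be coded $y_i \in \{0, 1\}$, and taking the degrees of freedom to be the number of nonzero coefficients is a common convention that the abstract does not confirm.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta_0,\,\beta}
  \left\{ -\ell(\beta_0, \beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\ell(\beta_0, \beta) = \sum_{i=1}^{n}
  \left[ y_i \left( \beta_0 + x_i^\top \beta \right)
  - \log\!\left( 1 + e^{\beta_0 + x_i^\top \beta} \right) \right]

\mathrm{BIC}(\lambda) = -2\, \ell\!\left( \hat{\beta}_0(\lambda), \hat{\beta}(\lambda) \right)
  + \log(n)\, \mathrm{df}(\lambda)
```

Under these assumptions, BIC picks the $\lambda$ that minimizes $\mathrm{BIC}(\lambda)$ on the full sample, whereas cross-validation picks the $\lambda$ with the smallest average held-out error across folds.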
|
88 |
Real Time Frequency Analysis of Signals From Lasso Catheter For Radiofrequency Ablation During Atrial Fibrillation
Yadav, Prashant 01 January 2005
Real-time spectrum analysis of signals obtained through a lasso catheter during radiofrequency ablation of the pulmonary veins was performed to determine the channel with the dominant frequency (DF). A threshold algorithm was used for signals that could be classified as type I and type II AF. Type III AF signals, which were highly fractionated or differentiated, were evaluated for frequency content by performing a Fast Fourier Transform. Data from seven patients were collected, and an episode of 180 ± 40 seconds was recorded and analyzed for each pulmonary vein that showed electrical activation. Frequency spectra were determined for one-second segments of the signal on each channel. The frequencies of the channels were then compared to determine the channel with the highest, or dominant, frequency. In most cases the frequency of a single channel varied erratically between 1 and 10 Hz from one one-second segment to the next, which made DF detection among the channels unreliable, and a single channel with the dominant frequency could not be determined. Five-second averaging for each channel did not produce a stable DF output, and the improvement was minimal. The erratic frequency behavior could be attributed to the spatial shift of micro-reentrant circuits or to temporal variation in waveform overlap at the point of detection. To determine the DF more precisely, either an increase in the number of electrodes or an increase in the time segment block used for DF calculation is warranted. Increasing the time segment block would defeat the purpose of real-time analysis; thus, an increase in the number of electrodes mapping the area of interest would be appropriate to resolve the issue.
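The per-segment dominant-frequency comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the sampling rate `fs`, the channel names and the synthetic signals are hypothetical, and a direct discrete Fourier transform is used in place of an FFT so that the example needs only the standard library (it computes the same spectrum, only more slowly).

```python
import cmath
import math


def dominant_frequency(segment, fs, f_min=1.0, f_max=10.0):
    """Return the frequency in Hz with the largest DFT magnitude in [f_min, f_max]."""
    n = len(segment)
    mean = sum(segment) / n
    centered = [x - mean for x in segment]  # remove the DC offset
    best_f, best_mag = None, -1.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n  # frequency of bin k; bins are fs / n Hz apart
        if f < f_min or f > f_max:
            continue
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_f, best_mag = f, abs(coeff)
    return best_f


fs = 200  # assumed sampling rate in Hz (not stated in the abstract)

# Two synthetic one-second "channels": 6 Hz and 4 Hz dominant components.
channels = {
    "ch1": [math.sin(2 * math.pi * 6 * t / fs)
            + 0.3 * math.sin(2 * math.pi * 3 * t / fs) for t in range(fs)],
    "ch2": [math.sin(2 * math.pi * 4 * t / fs) for t in range(fs)],
}

df_per_channel = {name: dominant_frequency(sig, fs) for name, sig in channels.items()}
df_channel = max(df_per_channel, key=df_per_channel.get)
print(df_per_channel, df_channel)  # {'ch1': 6.0, 'ch2': 4.0} ch1
```

A one-second segment gives a frequency resolution of 1 Hz (`fs / n`), which is one way to see why the abstract concludes that a longer time block would allow the DF to be determined more precisely, at the cost of real-time operation.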
|
89 |
Ridle for sparse regression with mandatory covariates with application to the genetic assessment of histologic grades of breast cancer
Zhai, Jing, Hsu, Chiu-Hsieh, Daye, Z. John 25 January 2017
Background: Many questions in statistical genomics can be formulated in terms of variable selection of candidate biological factors for modeling a trait or quantity of interest. Often, in these applications, additional covariates describing clinical, demographic or experimental effects must be included a priori as mandatory covariates while allowing the selection of a large number of candidate or optional variables. As genomic studies routinely require mandatory covariates, it is of interest to propose principled methods of variable selection that can incorporate mandatory covariates. Methods: In this article, we propose the ridge-lasso hybrid estimator (ridle), a new penalized regression method that simultaneously estimates coefficients of mandatory covariates while allowing selection for others. The ridle provides a principled approach to mitigate the effects of multicollinearity among the mandatory covariates and possible dependency between mandatory and optional variables. We provide detailed empirical and theoretical studies to evaluate our method. In addition, we develop an efficient algorithm for the ridle. Software, based on efficient Fortran code with R-language wrappers, is publicly and freely available at https://sites.google.com/site/zhongyindaye/software. Results: The ridle is useful when mandatory predictors are known to be significant due to prior knowledge or must be kept for additional analysis. Both theoretical and comprehensive simulation studies have shown the ridle to be advantageous when mandatory covariates are correlated with the irrelevant optional predictors or are highly correlated among themselves. A microarray gene expression analysis of the histologic grades of breast cancer has identified 24 genes, of which 2 genes are selected only by the ridle among current methods and are found to be associated with tumor grade.
Conclusions: In this article, we proposed the ridle as a principled sparse regression method for the selection of optional variables while incorporating mandatory ones. Results suggest that the ridle is advantageous when mandatory covariates are correlated with the irrelevant optional predictors or are highly correlated among themselves.
|
90 |
Statistical methods for the testing and estimation of linear dependence structures on paired high-dimensional data : application to genomic data
Mestres, Adrià Caballé January 2018
This thesis provides novel methodology for statistical analysis of paired high-dimensional genomic data, with the aim of identifying gene interactions specific to each group of samples as well as the gene connections that change between the two classes of observations. An example of such groups can be patients under two medical conditions, in which the estimation of gene interaction networks is relevant to biologists as part of discerning gene regulatory mechanisms that control a disease process such as cancer. We construct these interaction networks from data by considering the non-zero structure of correlation matrices, which measure linear dependence between random variables, and their inverse matrices, which are commonly known as precision matrices and determine linear conditional dependence instead. In this regard, we study three statistical problems related to the testing, single estimation and joint estimation of (conditional) dependence structures. Firstly, we develop hypothesis testing methods to assess the equality of two correlation matrices, and also two correlation sub-matrices, corresponding to two classes of samples, and hence the equality of the underlying gene interaction networks. We consider statistics based on the average of squares, maximum and sum of exceedances of sample correlations, which are suitable for both independent and paired observations. We derive the limiting distributions for the test statistics where possible and, for practical needs, we present an approach based on permuted samples to find their corresponding non-parametric distributions. Cases where such hypothesis testing presents enough evidence against the null hypothesis of equality of two correlation matrices give rise to the problem of estimating two correlation (or precision) matrices.
However, before that we address the statistical problem of estimating conditional dependence between random variables in a single class of samples when data are high-dimensional, which is the second topic of the thesis. We study the graphical lasso method, which employs an L1 penalized likelihood expression to estimate the precision matrix and its underlying non-zero graph structure. The lasso penalization term is given by the L1 norm of the precision matrix elements scaled by a regularization parameter, which determines the trade-off between sparsity of the graph and fit to the data, and its selection is our main focus of investigation. We propose several procedures to select the regularization parameter in the graphical lasso optimization problem that rely on network characteristics such as clustering or connectivity of the graph. Thirdly, we address the more general problem of estimating two precision matrices that are expected to be similar, when datasets are dependent, focusing on the particular case of paired observations. We propose a new method to estimate these precision matrices simultaneously, a weighted fused graphical lasso estimator. The analogous joint estimation method concerning two regression coefficient matrices, which we call weighted fused regression lasso, is also developed in this thesis under the same paired and high-dimensional setting. The two joint estimators maximize penalized marginal log likelihood functions, which encourage both sparsity and similarity in the estimated matrices and are solved using an alternating direction method of multipliers (ADMM) algorithm. Sparsity and similarity of the matrices are determined by two tuning parameters, and we propose to choose them by controlling the corresponding average error rates related to the expected number of false positive edges in the estimated conditional dependence networks.
These testing and estimation methods are implemented within the R package ldstatsHD, and are applied to a comprehensive range of simulated data sets as well as to high-dimensional real case studies of genomic data. We employ testing approaches with the purpose of discovering pathway lists of genes that present significantly different correlation matrices on healthy and unhealthy (e.g., tumor) samples. Besides, we use hypothesis testing problems on correlation sub-matrices to reduce the number of genes for estimation. The proposed joint estimation methods are then considered to find gene interactions that are common between medical conditions as well as interactions that vary in the presence of unhealthy tissues.
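For reference, the graphical lasso problem described in this abstract is usually written as the penalized log-likelihood below. This is the standard textbook form, given as a sketch rather than as the exact expression used in the thesis or in ldstatsHD; the symbols $S$ (sample covariance or correlation matrix), $\Theta$ (precision matrix) and $\lambda$ (regularization parameter) are notation introduced here.

```latex
\hat{\Theta}(\lambda) = \arg\max_{\Theta \succ 0}
  \left\{ \log\det\Theta - \operatorname{tr}(S\Theta)
  - \lambda \lVert \Theta \rVert_1 \right\},
\qquad
\lVert \Theta \rVert_1 = \sum_{j,k} \lvert \theta_{jk} \rvert
```

Larger values of $\lambda$ force more entries of $\hat{\Theta}$ to zero, which gives a sparser conditional dependence graph at the cost of fit to the data. Some formulations penalize only the off-diagonal entries; which variant the thesis uses is not stated in the abstract.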
|