
Supervised ridge regression in high dimensional linear regression. / 高維線性回歸的監督嶺回歸 / CUHK electronic theses & dissertations collection / Gao wei xian xing hui gui de jian du ling hui gui

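In machine learning we usually have many features with which to determine the behavior of some response. In gene testing problems, for example, tens of thousands of genes serve as features, and their relationship to certain diseases needs to be determined. Without specific knowledge available, the simplest and most basic way to model such a problem is a linear model. Many existing methods solve linear regression problems, such as conventional ordinary least squares, ridge regression and LASSO regression. Let N be the number of samples and p the number of features. In the ordinary setting we usually have enough samples (N > p), and ordinary linear regression methods such as ridge regression usually give reasonable predictions of future values of the response. With the development of modern statistics we often meet high-dimensional problems (N << p), such as DNA microarray data testing problems. In these high-dimensional problems, determining the relationship between the features and the response is quite difficult without further assumptions. In many real problems, although a large number of features are present, it is entirely possible that only a very small number of them are directly related to the response, while most of the others are spurious. Conventional linear regression methods such as the LASSO and ridge regression have their limitations in high-dimensional problems. When applied to high-dimensional problems, the LASSO performs poorly because of measurement noise, which leads to very low prediction accuracy. Ridge regression also has an obvious limitation: it cannot separate the true features from the spurious ones. The new method I propose aims to overcome the limitations of both methods in high-dimensional linear regression, leading to more accurate and stable predictions. The idea is simple: rather than performing a single-step linear regression, we divide the regression into two steps. In the first step, we discard the features whose correlation with the response is very small or zero. In the second step, having obtained a reduced feature set, we perform a ridge regression between this set and the response to obtain the desired result. / In the field of statistical learning, we usually have many features with which to determine the behavior of some response. For example, in gene testing problems we have many genes as features, and their relations to a certain disease need to be determined. Without specific knowledge available, the simplest and most fundamental way to model this kind of problem is a linear model. There are many existing methods for solving linear regression, such as conventional ordinary least squares, ridge regression and the LASSO (least absolute shrinkage and selection operator). Let N denote the number of samples and p the number of predictors. In ordinary settings where we have enough samples (N > p), ordinary linear regression methods such as ridge regression will usually give reasonable predictions of future values of the response. In modern statistical learning, however, we quite often meet high-dimensional problems (N << p), such as document classification problems and microarray data testing problems. In high-dimensional problems it is generally quite difficult to identify the relationship between the predictors and the response without further assumptions.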
Although many predictors are available for prediction, in a lot of real problems most of them are actually spurious. A predictor being spurious means that it is not directly related to the response. For example, in microarray data testing problems millions of genes may be available for prediction, but only a few hundred genes are actually related to the target disease. Conventional linear regression techniques such as the LASSO and ridge regression both have their limitations in high-dimensional problems. The LASSO is one of the state-of-the-art techniques for sparsity recovery, but when it is applied to high-dimensional problems its performance degrades considerably in the presence of measurement noise, which results in high-variance predictions and large prediction error. Ridge regression, on the other hand, is more robust to additive measurement noise, but has the obvious limitation of not being able to separate true predictors from spurious ones. As mentioned above, in many high-dimensional problems a large number of the predictors may be spurious; in these cases ridge regression's inability to separate spurious and true predictors results in poor interpretability of the model as well as poor prediction performance. The new technique that I propose in this thesis aims to address the limitations of these two methods, and thus to give more accurate and stable predictions in a high-dimensional linear regression problem with significant measurement noise. The idea is simple: instead of doing a single-step regression, we divide the regression procedure into two steps. In the first step we try to distinguish the seemingly relevant predictors from those that are obviously spurious by calculating the univariate correlations between the predictors and the response. We then discard those predictors that have very small or zero correlation with the response.
After the first step we should have obtained a reduced predictor set. In the second step we perform a ridge regression between the reduced predictor set and the response; the result of this ridge regression is then our desired output. The thesis is organized as follows. First I start with a literature review of the linear regression problem, introduce ridge regression and the LASSO in detail, and explain more precisely their limitations in high-dimensional problems. Then I introduce my new method, called supervised ridge regression, and show the reasons why it should dominate ridge regression and the LASSO in high-dimensional problems; some simulation results are presented to strengthen my argument. Finally I conclude with the possible limitations of my method and point out possible directions for further investigation. / Detailed summary in vernacular field only. / Zhu, Xiangchen. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 68-69). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts also in Chinese.
Chapter 1. --- BASICS ABOUT LINEAR REGRESSION --- p.2 / Chapter 1.1 --- Introduction --- p.2 / Chapter 1.2 --- Linear Regression and Least Squares --- p.2 / Chapter 1.2.1 --- Standard Notations --- p.2 / Chapter 1.2.2 --- Least Squares and Its Geometric Meaning --- p.4 / Chapter 2. --- PENALIZED LINEAR REGRESSION --- p.9 / Chapter 2.1 --- Introduction --- p.9 / Chapter 2.2 --- Deficiency of the Ordinary Least Squares Estimate --- p.9 / Chapter 2.3 --- Ridge Regression --- p.12 / Chapter 2.3.1 --- Introduction to Ridge Regression --- p.12 / Chapter 2.3.2 --- Expected Prediction Error And Noise Variance Decomposition of Ridge Regression --- p.13 / Chapter 2.3.3 --- Shrinkage effects on different principal components by ridge regression --- p.18 / Chapter 2.4 --- The LASSO --- p.22 / Chapter 2.4.1 --- Introduction to the LASSO --- p.22 / Chapter 2.4.2 --- The Variable Selection Ability and Geometry of LASSO --- p.25 / Chapter 2.4.3 --- Coordinate Descent Algorithm to solve for the LASSO --- p.28 / Chapter 3. --- LINEAR REGRESSION IN HIGH-DIMENSIONAL PROBLEMS --- p.31 / Chapter 3.1 --- Introduction --- p.31 / Chapter 3.2 --- Spurious Predictors and Model Notations for High-dimensional Linear Regression --- p.32 / Chapter 3.3 --- Ridge and LASSO in High-dimensional Linear Regression --- p.34 / Chapter 4. --- THE SUPERVISED RIDGE REGRESSION --- p.39 / Chapter 4.1 --- Introduction --- p.39 / Chapter 4.2 --- Definition of Supervised Ridge Regression --- p.39 / Chapter 4.3 --- An Underlying Latent Model --- p.43 / Chapter 4.4 --- Ridge LASSO and Supervised Ridge Regression --- p.45 / Chapter 4.4.1 --- LASSO vs SRR --- p.45 / Chapter 4.4.2 --- Ridge regression vs SRR --- p.46 / Chapter 5. --- TESTING AND SIMULATION --- p.49 / Chapter 5.1 --- A Simulation Example --- p.49 / Chapter 5.2 --- More Experiments --- p.54 / Chapter 5.2.1 --- Correlated Spurious and True Predictors --- p.55 / Chapter 5.2.2 --- Insufficient Amount of Data Samples --- p.59 / Chapter 5.2.3 --- Low Dimensional Problem --- p.62 / Chapter 6. --- CONCLUSIONS AND DISCUSSIONS --- p.66 / Chapter 6.1 --- Conclusions --- p.66 / Chapter 6.2 --- References and Related Works --- p.68
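
The two-step procedure outlined in the abstract (screen predictors by their univariate correlation with the response, then run ridge regression on the reduced set) might be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the function name `supervised_ridge`, the parameters `corr_threshold` and `ridge_lambda`, their default values, and the use of the closed-form ridge solution on centered data are all assumptions made here for illustration.

```python
import numpy as np


def supervised_ridge(X, y, corr_threshold=0.3, ridge_lambda=1.0):
    """Hypothetical two-step sketch: correlation screening, then ridge.

    corr_threshold and ridge_lambda are illustrative tuning parameters,
    not values taken from the thesis.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Center predictors and response so no intercept is needed in the fit.
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean

    # Step 1: absolute univariate (Pearson) correlation of each predictor
    # with the response; constant columns get correlation 0.
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = np.zeros(X.shape[1])
    ok = denom > 0
    corr[ok] = np.abs(Xc[:, ok].T @ yc) / denom[ok]
    keep = np.flatnonzero(corr >= corr_threshold)

    # Step 2: closed-form ridge regression on the retained predictors only.
    beta = np.zeros(X.shape[1])
    if keep.size > 0:
        Xk = Xc[:, keep]
        gram = Xk.T @ Xk + ridge_lambda * np.eye(keep.size)
        beta[keep] = np.linalg.solve(gram, Xk.T @ yc)

    intercept = y_mean - x_mean @ beta
    return beta, intercept, keep


if __name__ == "__main__":
    # Synthetic N << p example: 5 true predictors among 500, noisy response.
    rng = np.random.default_rng(0)
    n_samples, n_predictors = 50, 500
    X = rng.standard_normal((n_samples, n_predictors))
    y = 2.0 * X[:, :5].sum(axis=1) + rng.standard_normal(n_samples)
    beta, intercept, keep = supervised_ridge(X, y)
    print("predictors kept:", keep.size, "of", n_predictors)
```

In practice the screening threshold and the ridge penalty would presumably be chosen by cross-validation; fixed defaults are used here only to keep the sketch short.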

Identifier: oai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_328085
Date: January 2013
Contributors: Zhu, Xiangchen., Chinese University of Hong Kong Graduate School. Division of Information Engineering.
Source Sets: The Chinese University of Hong Kong
Language: English, Chinese
Detected Language: English
Type: Text, bibliography
Format: electronic resource, electronic resource, remote, 1 online resource ([7], 69 leaves) : ill. (some col.)
Rights: Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)
