
Statistical Methods for High Dimensional Variable Selection

<p> This thesis focuses on high dimensional variable selection and addresses the limitations of existing penalized likelihood-based prediction models, as well as multiple hypothesis testing issues in jump detection. In the first project, we proposed a weighted sparse network learning method that allows users to first estimate a data-driven network with a sparsity property. The estimated network is then optimally combined, through a weighted approach, with a known or partially known network structure. We adapted the &ell;<sub>1</sub> penalties and proved the oracle property of our proposed model, which aims to improve the accuracy of parameter estimation and achieve a parsimonious model in high dimensional settings. We further implemented a stability selection method for tuning the parameters and compared its performance to the cross-validation approach. We implemented our proposed framework for several generalized linear models, including the Gaussian, logistic, and Cox proportional hazards (partial likelihood) models. We carried out extensive Monte Carlo simulations and compared the performance of our proposed model to existing methods. Results showed that, in the absence of prior information for constructing a known network, our approach, using the data-driven estimated network structure, showed significant improvement over the elastic net models. On the other hand, if the prior network is correctly specified in advance, our prediction model significantly outperformed the other methods. Results further showed that our proposed method is robust to network misspecification and that the &ell;<sub>1</sub> penalty improves prediction and variable selection regardless of the magnitude of the effect sizes. We also found that the stability selection method achieved more robust parameter tuning results than the cross-validation approach for all three phenotypes (continuous, binary, and survival) considered in our simulation studies. 
Case studies on ovarian cancer proteomic data and skin cutaneous melanoma gene expression data further demonstrated that our proposed model achieved good operating characteristics in predicting response to platinum-based chemotherapy and survival risk. We further extended our work on statistical predictive learning to nonlinear prediction, where traditional generalized linear models are insufficient. Nonlinear methods such as kernel methods show great power in mapping a nonlinear space to a linear space, which can be easily incorporated into generalized linear models. This thesis demonstrated how to apply multiple kernel tricks to generalized linear models. Simulation results show that our proposed multiple kernel learning method can successfully identify the nonlinear likelihood functions under various scenarios. </p><p> The second project concerns jump detection in high frequency financial data. Nonparametric tests are popular and efficient methods for detecting jumps in high frequency financial data. Each method has its own advantages and disadvantages, and their performance can be affected by the underlying noise and dynamic structure. To address this, we proposed a robust <i>p</i>-value pooling method that aims to combine the advantages of each method. We focused on model validation within a Monte Carlo framework to assess reproducibility and the false discovery rate. Reproducibility was analyzed with replicates, via correspondence curves and the irreproducible discovery rate, to study local dependency and robustness across replicates. Extensive simulation studies of high frequency trading data at the minute level were carried out, and the operating characteristics of these methods were compared within the false discovery rate (FDR) control framework. Our proposed method was robust across all scenarios under reproducibility and FDR analysis. 
Finally, we applied the method to minute level data from the Limit Order Book System&mdash;the Efficient Reconstruction System (LOBSTER). An R package <b>JumpTest</b> implementing these methods is made available on the Comprehensive R Archive Network (CRAN).</p><p>

Identifier: oai:union.ndltd.org:PROQUEST/oai:pqdtoai.proquest.com:13427291
Date: 19 April 2019
Creators: Li, Kaiqiao
Publisher: State University of New York at Stony Brook
Source Sets: ProQuest.com
Language: English
Detected Language: English
Type: thesis
