1 |
Statistical Discovery of Biomarkers in Metagenomics
Abdul Wahab, Ahmad Hakeem (January 2015)
Metagenomics holds considerable potential for uncovering relationships within microbial communities that have yet to be discovered, particularly because the field circumvents the need to isolate and culture microbes from their natural environmental settings. A common research objective is to detect biomarkers: microbes that are associated with changes in a status. For instance, identifying such microbes across conditions such as healthy and diseased groups allows researchers to identify pathogens and probiotics. This is often achieved via analysis of the differential abundance of microbes. The problem is that differential abundance analysis looks at each microbe individually, without considering the associations the microbes may have with each other. This is undesirable, since microbes rarely act individually but rather within intricate communities involving other microbes. An alternative is to use variable selection techniques such as the Lasso or the Elastic Net, which consider all the microbes simultaneously and conduct selection. However, the Lasso often selects only one representative feature from a cluster of correlated features, and the Elastic Net may incorrectly select unimportant features too frequently and erratically because of the high levels of sparsity and variation in the data.
In this research, the proposed method, AdaLassop, is an augmented variable selection technique that overcomes these shortcomings of the Lasso and the Elastic Net. It provides researchers with a holistic model that takes into account the effects of selected biomarkers in the presence of other important biomarkers. AdaLassop performs variable selection on sparse ultra-high-dimensional data using the Adaptive Lasso, with p-values extracted from Zero-Inflated Negative Binomial regressions as augmented weights. Comprehensive simulations involving varying correlation structures indicate that AdaLassop has optimal performance in the presence of multicollinearity, especially as the sample size grows. Application of AdaLassop to a metagenome-wide study of diabetic patients reveals both pathogens and probiotics that have been researched in the medical field.
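The weighting idea in this abstract can be illustrated with a minimal sketch. It assumes a plain linear model and uses each feature's raw p-value as its penalty weight, so features with small p-values are penalized less. The abstract does not give the exact form of the AdaLassop weights or of the zero-inflated model fit, so the function names, the toy data, and the p-values below are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam * sum_j weights[j]*|b_j|.

    X is a list of rows, y a list of responses. No intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual
            # (the residual with feature j's own contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy data (hypothetical): y depends on feature 0 only; feature 1 is noise.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2], [4.0, -0.4]]
y = [2.0, 4.0, 6.0, 8.0]
p_values = [0.001, 0.8]  # assumed per-feature p-values used as penalty weights
beta = weighted_lasso(X, y, lam=0.5, weights=p_values)
print(beta)
```

With these weights the noise feature is dropped while the informative feature is barely shrunk, which is the behaviour the adaptive weighting is meant to encourage.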
|
2 |
ELASTIC NET FOR CHANNEL ESTIMATION IN MASSIVE MIMO
Peken, Ture; Tandon, Ravi; Bose, Tamal (10 1900)
Next-generation wireless systems will support higher data rates, improved spectral efficiency, and lower latency. Massive multiple-input multiple-output (MIMO) has been proposed to satisfy these demands. In massive MIMO, many benefits come from employing hundreds of antennas at the base station (BS) and serving dozens of user terminals (UTs) per cell. As the number of antennas at the BS increases, the channel becomes sparse. By exploiting the sparse channel in massive MIMO, compressive sensing (CS) methods can be used to estimate the channel. With CS methods, pilot sequences can be shorter than in pilot-based methods. In this paper, a novel channel estimation algorithm based on a CS method called the elastic net is proposed. The channel estimation accuracy of pilot-based, lasso-based, and elastic-net-based methods in massive MIMO is compared. It is shown that the elastic-net-based method gives the best performance in terms of error for fewer pilot symbols and lower SNR values.
|
3 |
Regularized Markov Model for Modeling Disease Transitioning
Huang, Shuang (January 2017)
In longitudinal studies of chronic diseases, the disease states of individuals are often collected at several pre-scheduled clinical visits, but the exact states and the times of transitioning from one state to another between observations are not observed. Such data are commonly referred to as "panel data". Panel data pose statistical challenges in identifying the predictors that govern transitions between disease states when only a partially observed disease history is available. Continuous-time Markov models (CTMMs) are commonly used to analyze panel data, and allow maximum likelihood estimation without making any assumptions about the unobserved states and transition times. By assuming that the underlying disease process is Markovian, CTMMs yield a tractable likelihood. However, CTMMs generally allow covariate effects to differ across transitions, so the number of coefficients to be estimated is much higher than the number of covariates, and model overfitting can easily happen in practice. In three papers, I develop a regularized CTMM using the elastic net penalty for panel data, and implement it in an R package. The proposed method is capable of simultaneous variable selection and estimation even when the dimension of the covariates is high.
In the first paper (Section 2), I use the elastic net penalty to regularize the CTMM, and derive an efficient coordinate descent algorithm to solve the corresponding optimization problem. The algorithm takes advantage of the multinomial state distribution under the non-informative observation scheme assumption to simplify computation of key quantities. A simulation study shows that this method can effectively select the true non-zero predictors while reducing model size.
In the second paper (Section 3), I extend the regularized CTMM developed in the first paper to accommodate exact death times and censored states. Death is commonly included as an endpoint in longitudinal studies, and the exact time of death can be easily obtained, but the state path leading to death is usually unknown. I show that exact death times result in a very different form of likelihood, and that the dependency of death time on the model requires significantly different numerical methods for computing the derivatives of the log likelihood, a key quantity for the coordinate descent algorithm. I propose to use numerical differentiation to compute the derivatives of the log likelihood. Computation of the derivatives of the log likelihood from a transition involving a censored state is also discussed. I carry out a simulation study to evaluate the performance of this extension, which shows consistently good variable selection properties and prediction accuracy comparable to that of oracle models in which only the true non-zero coefficients are fitted. I then apply the regularized CTMM to the airflow limitation data from the TESAOD (The Tucson Epidemiological Study of Airway Obstructive Disease) study with exact death times and censored states, and obtain a prediction model with a large reduction in size from a total of 220 potential parameters.
The methods developed in the first two papers are implemented in an R package, markovnet, and the key functionalities of the package are introduced in detail using a simulated data set in the third paper (Section 4). Finally, some concluding remarks are given and directions for future work are discussed (Section 5).
The outline of this dissertation is as follows. Section 1 presents an in-depth background on panel data, CTMMs, and penalized regression methods, as well as a brief description of the TESAOD study design. Section 2 describes the first paper, entitled "Regularized continuous-time Markov model via elastic net". Section 3 describes the second paper, entitled "Regularized continuous-time Markov model with exact death times and censored states". Section 4 describes the third paper, "Regularized continuous-time Markov model for panel data: the markovnet package for R". Section 5 gives an overall summary and a discussion of future work.
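For orientation, an elastic-net penalized likelihood and a coordinate descent step of the kind this abstract mentions can be written out as below. This is a generic sketch, not the dissertation's derivation: the glmnet-style parametrization with mixing parameter \alpha, the scaling by n, and the symbols g_j and h_j are assumptions introduced here, and the update comes from a one-dimensional quadratic approximation of the negative log likelihood around the current iterate.

```latex
% Elastic-net penalized log likelihood (glmnet-style parametrization, assumed)
\hat{\beta} = \arg\min_{\beta}\left\{ -\frac{1}{n}\,\ell(\beta)
  + \lambda\left[\alpha\,\lVert\beta\rVert_1
  + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]\right\}

% One coordinate update, where g_j and h_j > 0 are the first and second partial
% derivatives of -\ell(\beta)/n with respect to \beta_j at the current iterate
\beta_j \leftarrow
  \frac{S\!\left(h_j\beta_j - g_j,\ \lambda\alpha\right)}{h_j + \lambda(1-\alpha)},
\qquad
S(z,t) = \operatorname{sign}(z)\,\max\left(\lvert z\rvert - t,\ 0\right)
```

Under this form, the numerical differentiation described for exact death times would supply g_j and h_j when closed-form derivatives are unavailable.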
|
4 |
LASSO與其衍生方法之特性比較 / Property comparison of LASSO and its derivative methods
黃昭勳 (Huang, Jau-Shiun) (Unknown Date)
In this study, we compare several methods for estimating the coefficients of linear models, including LASSO, Elastic Net, LAD-LASSO, EBLASSO and EBENet. Unlike ordinary least squares (OLS), these methods estimate coefficients and perform variable selection simultaneously: unimportant predictors are eliminated and only important predictors remain in the model. In the age of big data, datasets keep growing, and data with hundreds or even thousands of predictors are not uncommon. For such data, variable selection becomes all the more important. The primary goal of this thesis is to evaluate the properties, strengths, and weaknesses of the different estimation methods. Two simulation studies and two real-data applications are included. The simulation results show that every method has its own characteristics, and that no single method is best for all data.
|
5 |
分類蛋白質質譜資料變數選取的探討 / On Variable Selection of Classifying Proteomic Spectra Data
林婷婷 (Unknown Date)
This study uses prostate cancer proteomic mass spectra data provided by Eastern Virginia Medical School. The data are available both in raw form and in preprocessed form, and the empirical analysis here uses the preprocessed data. Because data of this kind are usually high-dimensional, high correlation among variables is common. Selecting the important features from a large number of candidate features in order to determine accurately the degree of pathological change in the prostate has therefore become a common and important problem. The purpose of this study is to examine the variable selection results of several (penalized) regression models for classifying proteomic spectra data. For LARS, Stagewise, LASSO, Group LASSO and Elastic Net, the order in which each model enters the variables is treated as a ranking of those variables. The classification results obtained from these rankings are compared with those obtained from ranking by test statistic (t-test, ANOVA F-test and Kruskal-Wallis test) and from ranking by SVM misclassification rate. The analysis shows that, across the six pairwise classifications, the misclassification rate of Group LASSO behaves more stably than that of the other methods, without large swings, and its minimum misclassification rate is almost always better than that of the other methods.
Moreover, in the four-class problem Group LASSO attains the lowest misclassification rate among the methods compared. In other words, when four classes must be distinguished at the same time, Group LASSO gives the best classification accuracy.
|
6 |
Regulariserad linjär regression för modellering av företags valutaexponering / Regularised Linear Regression for Modelling of Companies' Currency Exposure
Hahn, Karin; Tamm, Erik (January 2021)
Quantitative methods are used in fund management to predict how companies' revenues will change at the next quarterly report compared with the corresponding quarter of the previous year. The Swedish bank SEB currently uses multiple linear regression with the change in revenue as the dependent variable and changes in exchange rates as independent variables. This is problematic for three reasons. Firstly, currencies often exhibit strong multicollinearity, which yields unstable estimates. Secondly, a company's revenue may depend on only a subset of the currencies included in the dataset, so the regression should not be run against all of them.
Thirdly, newer data is more relevant for predictions. These issues can be handled by using regularisation and selection methods, more specifically elastic net and weighted regression. We evaluate these methods for a large number of companies by comparing the mean absolute error of multiple linear regression with that of regularised linear regression with weighting. The evaluation shows that such a model performs better for 65.0% of the companies included in a large global share index, with a mean absolute error of 14 percentage points. The conclusion is that elastic net and weighted regression address the problems with the original model and can be used for better predictions of how revenues depend on exchange rates.
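The weighting and evaluation steps named in this abstract can be sketched as follows. The abstract does not state the weighting scheme, so the exponential decay with a half-life is an assumption, as are all function names and the toy series. For brevity the sketch fits a single predictor without regularisation; it is not the elastic net model of the thesis.

```python
def recency_weights(n, half_life):
    """Weights for n observations ordered oldest to newest.

    The newest observation gets weight 1, and the weight halves every
    `half_life` observations going back in time (assumed scheme).
    """
    return [0.5 ** ((n - 1 - t) / half_life) for t in range(n)]


def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return y_mean - b * x_mean, b


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical quarterly series: exchange-rate change (x) and revenue change (y).
x = [0.01, -0.02, 0.03, 0.00, 0.02]
y = [0.02, -0.03, 0.05, 0.01, 0.04]
w = recency_weights(len(x), half_life=4)
a, b = weighted_line_fit(x, y, w)
predictions = [a + b * xi for xi in x]
print(mean_absolute_error(y, predictions))
```

Comparing this error between an unweighted and a weighted fit, company by company, mirrors the kind of evaluation the abstract describes.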
|
7 |
STATISTICAL METHODS IN MICROARRAY DATA ANALYSIS
Huang, Liping (01 January 2009)
This dissertation includes three topics. First topic: Regularized estimation in the AFT model with high dimensional covariates. Second topic: A novel application of quantile regression for identification of biomarkers exemplified by equine cartilage microarray data. Third topic: Normalization and analysis of cDNA microarray using linear contrasts.
|
8 |
Marginal false discovery rate approaches to inference on penalized regression models
Miller, Ryan (01 August 2018)
Data containing a large number of variables are becoming increasingly common, and sparsity-inducing penalized regression methods, such as the lasso, have become a popular analysis tool for these datasets because they naturally perform variable selection. However, quantifying the importance of the variables selected by these models is a difficult task. These difficulties are compounded by the tendency of the most predictive models, for example those chosen using procedures like cross-validation, to include a substantial number of noise variables with no real relationship with the outcome. To address the task of performing inference on penalized regression models, this thesis proposes false discovery rate approaches for a broad class of penalized regression models. This work includes the development of an upper bound for the number of noise variables in a model, as well as local false discovery rate approaches that quantify the likelihood of each individual selection being a false discovery. These methods are applicable to a wide range of penalties, such as the lasso, elastic net, SCAD, and MCP; a wide range of models, including linear regression, generalized linear models, and Cox proportional hazards models; and are also extended to the group regression setting under the group lasso penalty. In addition to studying these methods in numerous simulation studies, the practical utility of these methods is demonstrated using real data from several high-dimensional genome-wide association studies.
|
9 |
Statistical Methods for Functional Metagenomic Analysis Based on Next-Generation Sequencing Data
Pookhao, Naruekamol (January 2014)
Metagenomics is the study of the collective microbial genetic content recovered directly from natural (e.g., soil, ocean, and freshwater) or host-associated (e.g., human gut, skin, and oral) environmental communities that contain microorganisms, i.e., microbiomes. The rapid development of next-generation sequencing (NGS) technologies, which can sequence tens or hundreds of millions of short DNA fragments (or reads) in a single run, facilitates the study of multiple microorganisms living in environmental communities. Metagenomics, a relatively new but fast-growing field, allows us to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem. It also helps us to identify significantly different metabolic potentials in different environments. In particular, metagenomic analysis based on functional features (e.g., pathways, subsystems, functional roles) makes it possible to relate the genomic content of microbes to human health, and helps us to understand how microbes affect human health, by analyzing metagenomic data from two or more populations with different clinical phenotypes (e.g., diseased and healthy, or different treatments). Currently, metagenomic analysis has a substantial impact not only on genetic and environmental areas, but also on clinical applications. In our study, we focus on the development of computational and statistical methods for functional metagenomic analysis of sequencing data obtained from various environmental microbial samples/communities.
|
10 |
Statistical Research on COVID-19 Response
Huang, Xiaolin (06 June 2022)
COVID-19 has affected the lives of millions of people worldwide. This thesis includes two statistical studies on the response to COVID-19. The first study explores the impact of lockdown timing on COVID-19 transmission across US counties. We used functional principal component analysis to extract COVID-19 transmission patterns from county-wise case counts, and used supervised machine learning to identify risk factors, with the timing of lockdowns being the most significant. In particular, we found a critical time point for lockdowns, as lockdowns implemented after this time point were associated with significantly more cases and faster spread. The second study proposes an adaptive sample pooling strategy for efficient COVID-19 diagnostic testing. When testing a cohort, our strategy dynamically updates the prevalence estimate after each test if possible, and uses the updated information to choose the optimal pool size for the subsequent test. Simulation studies show that compared to traditional pooling strategies, our strategy reduces the number of tests required to test a cohort and is more resilient to inaccurate prevalence inputs. We have developed a dashboard application to guide the clinicians through the test procedure when using our strategy. / Graduate / 2023-05-27
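The step of choosing a pool size from a prevalence estimate can be sketched with the classical two-stage (Dorfman) pooling calculation. This is a stand-in, not the adaptive strategy of the thesis: it assumes a perfectly accurate test, and the function names and the cap on pool size are hypothetical.

```python
def expected_tests_per_person(prevalence, pool_size):
    """Expected tests per person under two-stage (Dorfman) pooling.

    One pooled test is run per group; if it is positive, every member is
    retested individually. A perfectly accurate test is assumed.
    """
    if pool_size == 1:
        return 1.0  # individual testing, no pooling
    p_pool_negative = (1.0 - prevalence) ** pool_size
    return 1.0 / pool_size + (1.0 - p_pool_negative)


def best_pool_size(prevalence, max_pool_size=32):
    """Pool size in 1..max_pool_size that minimizes expected tests per person."""
    return min(range(1, max_pool_size + 1),
               key=lambda k: expected_tests_per_person(prevalence, k))


# Re-running the choice as the prevalence estimate changes mimics, in a very
# simplified way, updating the pool size between tests.
for estimate in (0.01, 0.05, 0.5):
    print(estimate, best_pool_size(estimate))
```

At high prevalence the calculation falls back to individual testing, which is why an inaccurate prevalence input can make a fixed pool size costly.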
|