41 |
Advancements in Degradation Modeling, Uncertainty Quantification and Spatial Variable Selection / Xie, Yimeng / 30 June 2016
This dissertation focuses on three research projects: 1) construction of simultaneous prediction intervals/bounds for at least k out of m future observations; 2) a semi-parametric degradation model for accelerated destructive degradation test (ADDT) data; and 3) spatial variable selection with application to Lyme disease data in Virginia. Following the general introduction in Chapter 1, the rest of the dissertation consists of three main chapters. Chapter 2 presents the construction of two-sided simultaneous prediction intervals (SPIs) or one-sided simultaneous prediction bounds (SPBs) to contain at least k out of m future observations, based on complete or right-censored data from the (log-)location-scale family of distributions. An SPI/SPB calculated by the proposed procedure has exact coverage probability for complete and Type II censored data. In the Type I censoring case, it has asymptotically correct coverage probability and gives reasonably good results for small samples. The proposed procedures can be extended to multiply censored or randomly censored data. Chapter 3 focuses on the analysis of ADDT data. We use a general degradation path model with a correlated covariance structure to describe ADDT data. Monotone B-splines are used to model the underlying degradation process. A likelihood-based iterative procedure for parameter estimation is developed. Confidence intervals for the parameters are calculated using the nonparametric bootstrap. Both simulated data and real datasets are used to compare the semi-parametric model with existing parametric models. Chapter 4 studies the emergence of Lyme disease in Virginia. The objective is to find important environmental and demographic covariates that are associated with Lyme disease emergence. To address the high-dimensional integral in the log-likelihood function, we consider the penalized quasi-log-likelihood and an approximate log-likelihood based on the Laplace approximation. 
We impose the adaptive elastic net penalty to obtain sparse parameter estimates and thereby select the important variables. The proposed methods are investigated in simulation studies and applied to Lyme disease data in Virginia. Finally, Chapter 5 contains general conclusions and a discussion of future work. / Ph. D.
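For readers unfamiliar with the adaptive elastic net penalty mentioned in this abstract, its usual form can be written as follows. This is a sketch only: the symbols (the approximate log-likelihood \ell, the tuning parameters \lambda_1, \lambda_2, \gamma, and the pilot estimate \tilde\beta) are notation assumed here, not taken from the dissertation, and any rescaling constants are omitted.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}
  \Big\{ -\ell(\beta)
         \;+\; \lambda_1 \sum_{j=1}^{p} \hat{w}_j \,\lvert \beta_j \rvert
         \;+\; \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \Big\},
\qquad
\hat{w}_j \;=\; \lvert \tilde{\beta}_j \rvert^{-\gamma}, \quad \gamma > 0 .
```

The weighted L1 term sets small coefficients exactly to zero (variable selection), while the ridge term stabilises the estimates when covariates are correlated.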
|
42 |
On the Use of Grouped Covariate Regression in Oversaturated Models / Loftus, Stephen Christopher / 11 December 2015
As data collection techniques improve, the number of covariates often exceeds the number of observations. When this happens, regression models become oversaturated and, thus, inestimable. Many classical and Bayesian techniques have been designed to combat this difficulty, each addressing the oversaturation in a different way. However, these techniques can be tricky to implement well, difficult to interpret, and unstable.
What is proposed is a technique that takes advantage of the natural clustering of variables that can often be found in biological and ecological datasets known as omics datasets. Generally speaking, omics datasets attempt to classify host species structure or function by characterizing a group of biological molecules, such as genes (genomics), proteins (proteomics), and metabolites (metabolomics). By clustering the covariates and regressing on a single value for each cluster, the model becomes both estimable and stable. In addition, the technique can account for the variability within each cluster, allow for the inclusion of expert judgment, and provide a probability of inclusion for each cluster. / Ph. D.
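The core idea of regressing on one value per cluster can be illustrated with a minimal sketch. It assumes the cluster labels are already known and uses the cluster mean as the single value; the function names `cluster_means` and `ols` are hypothetical, and the sketch leaves out the within-cluster variability, expert judgment and inclusion probabilities that the thesis describes.

```python
def cluster_means(X, groups):
    """Collapse the columns of X into one column per cluster (row-wise cluster mean).

    groups[j] is the cluster label (0..K-1) of covariate j; labels are assumed given.
    """
    k = max(groups) + 1
    sizes = [groups.count(g) for g in range(k)]
    Z = []
    for row in X:
        sums = [0.0] * k
        for value, g in zip(row, groups):
            sums[g] += value
        Z.append([s / m for s, m in zip(sums, sizes)])
    return Z


def ols(Z, y):
    """Least squares with intercept via the normal equations (Gauss-Jordan elimination).

    Raises ZeroDivisionError if the reduced design matrix is rank deficient.
    """
    A = [[1.0] + list(row) for row in Z]
    q = len(A[0])
    # Augmented normal equations [A'A | A'y].
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(q)]
        + [sum(r[i] * t for r, t in zip(A, y))]
        for i in range(q)
    ]
    for c in range(q):
        pivot = max(range(c, q), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(q):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][q] / M[i][i] for i in range(q)]


if __name__ == "__main__":
    # Four observations, six covariates: ordinary regression on X is oversaturated,
    # but regression on the two cluster means is estimable.
    X = [[1, 2, 3, 0, 0, 0], [0, 0, 0, 1, 2, 3], [1, 1, 1, 1, 1, 1], [2, 2, 2, 0, 3, 0]]
    groups = [0, 0, 0, 1, 1, 1]
    y = [5.0, -1.0, 2.0, 4.0]
    print(ols(cluster_means(X, groups), y))
```

With six covariates and four observations the full model cannot be fitted, whereas the reduced model has only an intercept and two cluster coefficients.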
|
43 |
Variable Selection and Decision Trees: The DiVaS and ALoVaS Methods / Roberts, Lucas R. / 06 November 2014
In this thesis we propose a novel modification to Bayesian decision tree methods. We provide a historical survey of the statistics and computer science research in decision trees. Our approach facilitates covariate selection explicitly in the model, something not present in previous research. We define a transformation that allows us to use priors from linear models to facilitate covariate selection in decision trees. Using this transform, we modify many common approaches to variable selection in the linear model and bring these methods to bear on the problem of explicit covariate selection in decision tree models. We also provide theoretical guidelines, including a theorem, which gives necessary and sufficient conditions for consistency of decision trees in infinite dimensional spaces. Our examples and case studies use both simulated and real data cases with moderate to large numbers of covariates. The examples support the claim that our approach is to be preferred in large dimensional datasets. Moreover, our approach shown here has, as a special case, the model known as Bayesian CART. / Ph. D.
|
44 |
Robust Feature Screening Procedures for Mixed Type of Data / Sun, Jinhui / 16 December 2016
High dimensional data have been frequently collected in many fields of scientific research and technological development. The traditional idea of best subset selection methods, which use penalized L_0 regularization, is computationally too expensive for many modern statistical applications. A large number of variable selection approaches via various forms of penalized least squares or likelihood have been developed to select significant variables and estimate their effects simultaneously in high dimensional statistical inference. However, in modern applications in areas such as genomics and proteomics, ultra-high dimensional data are often collected, where the dimension of the data may grow exponentially with the sample size. In such problems, the regularization methods can become computationally unstable or even infeasible. To deal with the ultra-high dimensionality, Fan and Lv (2008) proposed a variable screening procedure via correlation learning to reduce dimensionality in sparse ultra-high dimensional models. Since then, many authors have further developed the procedure and applied it to various statistical models. However, they all focused on a single type of predictor, that is, the predictors are either all continuous or all discrete. In practice, we often collect data of mixed type, containing both continuous and discrete predictors. For example, in genetic studies, we can collect information on both gene expression profiles and single nucleotide polymorphism (SNP) genotypes. Furthermore, outliers are often present in the observations due to experimental errors and other reasons. Moreover, the true trend underlying the data might not follow the parametric models assumed in many existing screening procedures. Hence a screening procedure that is robust against outliers and model misspecification is desired. In my dissertation, I shall propose a robust feature screening procedure for mixed type of data. 
To gain insight into screening for individual types of data, I first studied feature screening procedures for a single type of data in Chapter 2, based on marginal quantities. For each type of data, new feature screening procedures are proposed and simulation studies are performed to compare their performance with that of existing procedures. The aim is to identify the best robust screening procedure for each type of data. In Chapter 3, I combine these best screening procedures to form the robust feature screening procedure for mixed type of data. Its performance will be assessed by simulation studies. I shall further illustrate the proposed procedure by the analysis of a real example. / Ph. D. / In modern applications in areas such as genomics and proteomics, ultra-high dimensional data are often collected, where the dimension of the data may grow exponentially with the sample size. To deal with the ultra-high dimensionality, Fan and Lv (2008) proposed a variable screening procedure via correlation learning to reduce dimensionality in sparse ultra-high dimensional models. Since then, many authors have further developed the procedure and applied it to various statistical models. However, they all focused on a single type of predictor, that is, the predictors are either all continuous or all discrete. In practice, we often collect data of mixed type, containing both continuous and discrete predictors. Furthermore, outliers are often present in the observations due to experimental errors and other reasons. Hence a screening procedure that is robust against outliers and model misspecification is desired. In my dissertation, I shall propose a robust feature screening procedure for mixed type of data. I first studied feature screening procedures for a single type of data based on marginal quantities. For each type of data, new feature screening procedures are proposed and simulation studies are performed to compare their performance with that of existing procedures. 
The aim is to identify the best robust screening procedure for each type of data. Then I combined these best screening procedures to form the robust feature screening procedure for mixed type of data. Its performance will be assessed by simulation studies and the analysis of real examples.
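The baseline that this abstract builds on, screening by correlation learning in the sense of Fan and Lv (2008), can be sketched in a few lines. The sketch below is not the robust mixed-type procedure proposed in the dissertation: it ranks continuous predictors by absolute Pearson correlation with the response and keeps the top d, with d defaulting to about n / log(n). The function name `sis_screen` and the simulated data are assumptions made for illustration.

```python
import math
import random
from statistics import fmean, pstdev


def sis_screen(X, y, d=None):
    """Rank predictors by absolute marginal correlation with y and keep the top d.

    X is a list of n rows, each a list of p numbers; y is a list of n numbers.
    Returns the sorted column indices of the retained predictors.
    """
    n, p = len(X), len(X[0])
    if d is None:
        # Submodel size of order n / log(n), a common default for this kind of screening.
        d = max(1, int(n / math.log(n)))
    y_mean, y_sd = fmean(y), pstdev(y)
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        c_mean, c_sd = fmean(col), pstdev(col)
        if c_sd == 0 or y_sd == 0:
            scores.append(0.0)  # a constant column carries no marginal signal
            continue
        cov = fmean((a - c_mean) * (b - y_mean) for a, b in zip(col, y))
        scores.append(abs(cov / (c_sd * y_sd)))
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


if __name__ == "__main__":
    rng = random.Random(0)
    n, p = 100, 500  # more predictors than observations
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    # Only columns 0 and 5 drive the response in this simulated example.
    y = [5 * row[0] - 4 * row[5] + rng.gauss(0, 1) for row in X]
    kept = sis_screen(X, y)
    print(len(kept), 0 in kept, 5 in kept)
```

Because Pearson correlation is sensitive to outliers and is not designed for discrete predictors, it illustrates why the abstract calls for a robust alternative that handles both types.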
|
45 |
Variable Selection in High-Dimensional Data / Reichhuber, Sarah, Hallberg, Johan / January 2021
Estimating the variables of importance in inferential modelling is of significant interest in many fields of science, engineering, biology, medicine, finance and marketing. However, variable selection in high-dimensional data, where the number of variables is relatively large compared to the number of observed data points, is a major challenge and requires more research in order to enhance reliability and accuracy. In this bachelor thesis project, several known methods of variable selection, namely orthogonal matching pursuit (OMP), ridge regression, lasso, adaptive lasso, elastic net, adaptive elastic net and multivariate adaptive regression splines (MARS), were implemented on a high-dimensional data set. The aim of this bachelor thesis project was to analyze and compare these variable selection methods. Furthermore, their performance on the same data set, extended so that the numbers of variables and observations were of similar size, was analyzed and compared as well. This was done by generating models for the different variable selection methods using built-in packages in R and coding in MATLAB. The models were then used to predict the observations, and these estimates were compared to the real observations. The performance of the different variable selection methods was analyzed using different evaluation methods. It could be concluded that some of the variable selection methods provided more accurate models for the implemented high-dimensional data set than others. Elastic net, for example, was one of the methods that performed better. Additionally, the combination of final models could provide further insight into which variables are crucial for the observations in the given data set, where, for example, variables 112 and 23 appeared to be of importance. / Estimating which variables are important in inferential modelling is of great interest in many research fields, industries, biology, medicine, economics and marketing. 
Variable selection in high-dimensional data, where the number of variables is relatively large compared with the number of observed data points, is however a major challenge and requires more research to increase the credibility and accuracy of the results. In this project, several known variable selection methods, namely orthogonal matching pursuit (OMP), ridge regression, lasso, elastic net, adaptive lasso, adaptive elastic net and multivariate adaptive regression splines (MARS), were implemented on a high-dimensional data set. The aim of this bachelor thesis was to analyze and compare the results of these methods. Furthermore, the methods' results on the same data set, extended so that the numbers of variables and observations were roughly equal, were analyzed and compared. This was done by generating models for the different variable selection methods via built-in packages in R and programming in MATLAB. These models were then used to predict observations, and the estimates were then compared with the actual observations. The results of the different variable selection methods were then analyzed using several evaluation methods. It could be established that some of the implemented variable selection methods gave more relevant models for the data than others. For example, elastic net was one of the methods that performed better. In addition, it was concluded that combining the results of the final models could give deeper insight into which variables are important for the observations, where, for example, variables 112 and 23 appeared to be of significance. / Bachelor's thesis in electrical engineering 2021, KTH, Stockholm
|
46 |
Statistical Applications of Linear Programming for Feature Selection via Regularization Methods / Yao, Yonggang / 01 October 2008
No description available.
|
47 |
Consistent bi-level variable selection via composite group bridge penalized regression / Seetharaman, Indu / January 1900
Master of Science / Department of Statistics / Kun Chen / We study composite group bridge penalized regression methods for conducting bi-level variable selection in high dimensional linear regression models with a diverging number of predictors. The proposed method combines the ideas of bridge regression (Huang et al., 2008a) and group bridge regression (Huang et al., 2009) to achieve variable selection consistency at both the individual and group levels simultaneously, i.e., the important groups and the important individual variables within each group can both be correctly identified with probability approaching one as the sample size increases to infinity. The method takes full advantage of the prior grouping information, and the established bi-level oracle properties ensure that the method is immune to possible group misidentification. A related adaptive group bridge estimator, which uses adaptive penalization to improve bi-level selection, is also investigated. Simulation studies show that the proposed methods have superior performance in comparison to many existing methods.
|
48 |
Prediction and variable selection in sparse ultrahigh dimensional additive models / Ramirez, Girly Manguba / January 1900
Doctor of Philosophy / Department of Statistics / Haiyan Wang / Advances in technology have enabled many fields to collect datasets in which the number of covariates (p) tends to be much larger than the number of observations (n), the so-called ultrahigh dimensionality. In this setting, classical regression methodologies are invalid. There is a great need to develop methods that can explain the variation of the response variable using only a parsimonious set of covariates. In recent years, there have been significant developments in variable selection procedures. However, the available procedures usually select too many false variables. In addition, most of them are appropriate only when the response variable is linearly associated with the covariates. Motivated by these concerns, we propose another procedure for variable selection in the ultrahigh dimensional setting that is able to reduce the number of false positive variables. Moreover, this procedure can be applied when the response variable is continuous or binary, and when the response variable is linearly or non-linearly related to the covariates. Inspired by the Least Angle Regression approach, we develop two multi-step algorithms to select variables in sparse ultrahigh dimensional additive models. The variables go through a series of nonlinear dependence evaluations following a Most Significant Regression (MSR) algorithm. In addition, the MSR algorithm is designed to predict the response variable. The first algorithm, called MSR-continuous (MSRc), is appropriate for a dataset with a continuous response variable. Simulation results demonstrate that this algorithm works well. Comparisons with other methods, such as greedy-INIS by Fan et al. (2011) and the generalized correlation procedure by Hall and Miller (2009), showed that MSRc not only has a false positive rate that is significantly lower than both methods, but also has accuracy and a true positive rate comparable with greedy-INIS. The second algorithm, called MSR-binary (MSRb), is appropriate when the response variable is binary. Simulations demonstrate that MSRb is competitive in terms of prediction accuracy and true positive rate, and better than GLMNET in terms of false positive rate. An application of MSRb to real datasets is also presented. In general, the MSR algorithm usually selects fewer variables while preserving the accuracy of predictions.
|
49 |
Variable selection for kernel methods with application to binary classification / Oosthuizen, Surette / 03 1900
Thesis (PhD (Statistics and Actuarial Science))—University of Stellenbosch, 2008. / The problem of variable selection in binary kernel classification is addressed in this thesis.
Kernel methods are fairly recent additions to the statistical toolbox, having originated
approximately two decades ago in machine learning and artificial intelligence. These
methods are growing in popularity and are already frequently applied in regression and
classification problems.
Variable selection is an important step in many statistical applications. Thereby a better
understanding of the problem being investigated is achieved, and subsequent analyses of
the data frequently yield more accurate results if irrelevant variables have been eliminated.
It is therefore obviously important to investigate aspects of variable selection for kernel
methods.
Chapter 2 of the thesis is an introduction to the main part presented in Chapters 3 to 6. In
Chapter 2 some general background material on kernel methods is firstly provided, along
with an introduction to variable selection. Empirical evidence is presented substantiating
the claim that variable selection is a worthwhile enterprise in kernel classification
problems. Several aspects which complicate variable selection in kernel methods are
discussed.
An important property of kernel methods is that the original data are effectively
transformed before a classification algorithm is applied to it. The space in which the
original data reside is called input space, while the transformed data occupy part of a
feature space. In Chapter 3 we investigate whether variable selection should be performed
in input space or rather in feature space. A new approach to selection, so-called feature-to-input
space selection, is also proposed. This approach has the attractive property of
combining information generated in feature space with easy interpretation in input space. An empirical study reveals that effective variable selection requires utilisation of at least
some information from feature space.
Having confirmed in Chapter 3 that variable selection should preferably be done in feature
space, the focus in Chapter 4 is on two classes of selection criteria operating in feature
space: criteria which are independent of the specific kernel classification algorithm and
criteria which depend on this algorithm. In this regard we concentrate on two kernel
classifiers, viz. support vector machines and kernel Fisher discriminant analysis, both of
which are described in some detail in Chapter 4. The chapter closes with a simulation
study showing that two of the algorithm-independent criteria are very competitive with the
more sophisticated algorithm-dependent ones.
In Chapter 5 we incorporate a specific strategy for searching through the space of variable
subsets into our investigation. Evidence in the literature strongly suggests that backward
elimination is preferable to forward selection in this regard, and we therefore focus on
recursive feature elimination. Zero- and first-order forms of the new selection criteria
proposed earlier in the thesis are presented for use in recursive feature elimination and their
properties are investigated in a numerical study. It is found that some of the simpler zero-order
criteria perform better than the more complicated first-order ones.
Up to the end of Chapter 5 it is assumed that the number of variables to select is known.
We do away with this restriction in Chapter 6 and propose a simple criterion which uses the
data to identify this number when a support vector machine is used. The proposed criterion
is investigated in a simulation study and compared to cross-validation, which can also be
used for this purpose. We find that the proposed criterion performs well.
The thesis concludes in Chapter 7 with a summary and several suggestions for further
research.
|
50 |
Influential data cases when the C-p criterion is used for variable selection in multiple linear regression / Uys, Daniel Wilhelm / January 2003
Dissertation (PhD)--Stellenbosch University, 2003. / ENGLISH ABSTRACT: In this dissertation we study the influence of data cases when the Cp criterion of Mallows (1973)
is used for variable selection in multiple linear regression. The influence is investigated in
terms of the predictive power and the predictor variables included in the resulting model when
variable selection is applied. In particular, we focus on the importance of identifying and
dealing with these so-called selection influential data cases before model selection and fitting
are performed. For this purpose we develop two new selection influence measures, both based
on the Cp criterion. The first measure is specifically developed to identify individual selection
influential data cases, whereas the second identifies subsets of selection influential data cases.
The success with which these influence measures identify selection influential data cases is
evaluated in example data sets and in simulation. All results are derived in the coordinate-free
context, with special application in multiple linear regression. / AFRIKAANS SUMMARY: Influential observations when the C-p criterion is used for variable selection in multiple linear regression: In this dissertation we investigate the influence of observations when the Cp criterion of Mallows
(1973) is used for variable selection in multiple linear regression. The
influence of observations on the predictive power and on the independent variables included
in the final selected model is investigated. In particular, we focus on
the importance of identifying and dealing with so-called selection influential
observations before model selection and fitting are done. For this purpose two
new influence measures, both based on the Cp criterion, are developed. The first measure
is specifically developed to measure the influence of individual observations, while the second
measures the influence of subsets of observations on the selection process. The success
with which these influence measures identify selection influential observations is
assessed in example data sets and in simulation. All results are derived within the coordinate-free
context, with special application in multiple linear regression.
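For reference, the Cp criterion of Mallows that this abstract relies on is usually written as below. The notation is assumed here rather than taken from the dissertation: SSE_p is the residual sum of squares of a candidate submodel with p parameters (intercept included), \hat{\sigma}^2 is the residual mean square of the full model, and n is the number of data cases.

```latex
C_p \;=\; \frac{\mathrm{SSE}_p}{\hat{\sigma}^{2}} \;-\; n \;+\; 2p
```

Submodels with small C_p, ideally close to p, are preferred. Because both SSE_p and \hat{\sigma}^2 depend on every data case, deleting a single case can change which submodel minimises C_p, which is the kind of selection influence the abstract describes.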
|