41

Variable Selection and Decision Trees: The DiVaS and ALoVaS Methods

Roberts, Lucas R. 06 November 2014
In this thesis we propose a novel modification to Bayesian decision tree methods. We provide a historical survey of the statistics and computer science research on decision trees. Our approach facilitates covariate selection explicitly in the model, something not present in previous research. We define a transformation that allows us to use priors from linear models to facilitate covariate selection in decision trees. Using this transform, we modify many common approaches to variable selection in the linear model and bring these methods to bear on the problem of explicit covariate selection in decision tree models. We also provide theoretical guidelines, including a theorem giving necessary and sufficient conditions for the consistency of decision trees in infinite-dimensional spaces. Our examples and case studies use both simulated and real data with moderate to large numbers of covariates. The examples support the claim that our approach is preferable for high-dimensional datasets. Moreover, the approach presented here includes the model known as Bayesian CART as a special case. / Ph. D.
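The transform itself is not given in this abstract. As a hedged illustration of the general idea (an assumption about the form, not necessarily the DiVaS/ALoVaS construction), an additive logistic map sends unconstrained coefficients \theta_1, \dots, \theta_p, on which standard linear-model shrinkage priors can be placed, to a point on the simplex of split-variable probabilities:

    s_j = \frac{\exp(\theta_j)}{\sum_{k=1}^{p} \exp(\theta_k)}, \qquad j = 1, \dots, p.

Any prior on \theta then induces a prior on which covariate a tree node splits on, which is one way priors from the linear-model literature can drive covariate selection in trees.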
42

Robust Feature Screening Procedures for Mixed Type of Data

Sun, Jinhui 16 December 2016
High dimensional data have been frequently collected in many fields of scientific research and technological development. The traditional idea of best subset selection, which uses penalized L_0 regularization, is computationally too expensive for many modern statistical applications. A large number of variable selection approaches via various forms of penalized least squares or likelihood have been developed to select significant variables and estimate their effects simultaneously in high dimensional statistical inference. However, in modern applications in areas such as genomics and proteomics, ultra-high dimensional data are often collected, where the dimension of the data may grow exponentially with the sample size. In such problems, regularization methods can become computationally unstable or even infeasible. To deal with the ultra-high dimensionality, Fan and Lv (2008) proposed a variable screening procedure via correlation learning to reduce dimensionality in sparse ultra-high dimensional models. Since then many authors have further developed the procedure and applied it to various statistical models. However, they all focused on a single type of predictor: the predictors are either all continuous or all discrete. In practice, we often collect mixed types of data, containing both continuous and discrete predictors. For example, in genetic studies, we can collect information on both gene expression profiles and single nucleotide polymorphism (SNP) genotypes. Furthermore, outliers are often present in the observations due to experimental errors and other reasons, and the true trend underlying the data might not follow the parametric models assumed in many existing screening procedures. Hence a screening procedure robust against outliers and model misspecification is desired. In my dissertation, I propose a robust feature screening procedure for mixed types of data. To gain insight on screening for individual types of data, I first study feature screening procedures for a single type of data in Chapter 2, based on marginal quantities. For each type of data, new feature screening procedures are proposed and simulation studies are performed to compare their performance with existing procedures. The aim is to identify the best robust screening procedure for each type of data. In Chapter 3, I combine these best screening procedures to form the robust feature screening procedure for mixed types of data. Its performance is assessed by simulation studies, and I further illustrate the proposed procedure with the analysis of a real example. / Ph. D.
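The marginal quantities studied in Chapter 2 are not specified in this abstract. As a minimal sketch of robust marginal screening in the spirit of Fan and Lv (2008), assuming a rank correlation as the marginal quantity and a synthetic dataset (both illustrative choices, not the dissertation's procedure), one can rank predictors by |Kendall's tau| with the response and keep the top d:

    import numpy as np
    from scipy.stats import kendalltau

    def rank_screen(X, y, d):
        # Rank predictors by absolute Kendall rank correlation with y
        # and keep the indices of the top d; rank correlations are
        # insensitive to outliers and monotone model misspecification.
        taus = np.array([kendalltau(X[:, j], y)[0] for j in range(X.shape[1])])
        return np.argsort(-np.abs(taus))[:d]

    rng = np.random.default_rng(0)
    n, p = 100, 1000                                   # ultra-high dimensional: p >> n
    X = rng.standard_normal((n, p))
    y = 2 * X[:, 0] - X[:, 1] + rng.standard_t(2, n)   # heavy-tailed errors
    print(rank_screen(X, y, d=int(n / np.log(n))))     # SIS-style cutoff d = n/log(n)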
43

Advancements in Degradation Modeling, Uncertainty Quantification and Spatial Variable Selection

Xie, Yimeng 30 June 2016
This dissertation focuses on three research projects: 1) construction of simultaneous prediction intervals/bounds for at least k out of m future observations; 2) a semi-parametric degradation model for accelerated destructive degradation test (ADDT) data; and 3) spatial variable selection with an application to Lyme disease data in Virginia. Following the general introduction in Chapter 1, the rest of the dissertation consists of three main chapters. Chapter 2 presents the construction of two-sided simultaneous prediction intervals (SPIs) or one-sided simultaneous prediction bounds (SPBs) to contain at least k out of m future observations, based on complete or right-censored data from the (log)-location-scale family of distributions. An SPI/SPB calculated by the proposed procedure has exact coverage probability for complete and Type II censored data. In the Type I censoring case, it has asymptotically correct coverage probability and reasonably good results for small samples. The proposed procedures can be extended to multiply censored or randomly censored data. Chapter 3 focuses on the analysis of ADDT data. We use a general degradation path model with a correlated covariance structure to describe ADDT data. Monotone B-splines are used to model the underlying degradation process. A likelihood-based iterative procedure for parameter estimation is developed. The confidence intervals of parameters are calculated using a nonparametric bootstrap procedure. Both simulated data and real datasets are used to compare the semi-parametric model with existing parametric models. Chapter 4 studies the emergence of Lyme disease in Virginia. The objective is to find important environmental and demographic covariates that are associated with Lyme disease emergence. To address the high-dimensional integral in the loglikelihood function, we consider the penalized quasi-loglikelihood and the approximated loglikelihood based on the Laplace approximation. We impose the adaptive elastic net penalty to obtain sparse estimates of the parameters and thus achieve selection of important variables. The proposed methods are investigated in simulation studies, and we also apply them to the Lyme disease data in Virginia. Finally, Chapter 5 contains general conclusions and discussion of future work. / Ph. D.
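The adaptive elastic net penalty mentioned above has the standard form (Zou and Zhang, 2009); the weight convention shown is a common one and is an assumption here, since the dissertation's exact choice is not stated in the abstract:

    P_{\lambda_1,\lambda_2}(\beta) = \lambda_1 \sum_{j=1}^{p} \hat{w}_j |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2, \qquad \hat{w}_j = \bigl( |\hat{\beta}_j^{\mathrm{init}}| + 1/n \bigr)^{-\gamma},

where \hat{\beta}^{\mathrm{init}} is an initial consistent estimate (for example an ordinary elastic net fit) and \gamma > 0 controls the adaptivity; the \ell_1 term produces sparsity while the \ell_2 term stabilizes estimation among correlated covariates.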
44

Variable Selection in High-Dimensional Data

Reichhuber, Sarah, Hallberg, Johan January 2021
Estimating the variables of importance in inferential modelling is of significant interest in many fields of science, engineering, biology, medicine, finance and marketing. However, variable selection in high-dimensional data, where the number of variables is relatively large compared to the number of observed data points, is a major challenge and requires more research in order to enhance reliability and accuracy. In this bachelor thesis project, several known methods of variable selection, namely orthogonal matching pursuit (OMP), ridge regression, lasso, adaptive lasso, elastic net, adaptive elastic net and multivariate adaptive regression splines (MARS), were implemented on a high-dimensional data set. The aim of this bachelor thesis project was to analyze and compare these variable selection methods. Furthermore, their performance on the same data set but extended, with the number of variables and observations being of similar size, was analyzed and compared as well. This was done by generating models for the different variable selection methods using built-in packages in R and coding in MATLAB. The models were then used to predict the observations, and these estimations were compared to the real observations. The performances of the different variable selection methods were analyzed utilizing different evaluation methods. It could be concluded that some of the variable selection methods provided more accurate models for the implemented high-dimensional data set than others. Elastic net, for example, was one of the methods that performed better. Additionally, the combination of final models could provide further insight into which variables are crucial for the observations in the given data set, where, for example, variables 112 and 23 appeared to be of importance. / Kandidatexjobb i elektroteknik (bachelor thesis in electrical engineering) 2021, KTH, Stockholm
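The thesis fit its models with built-in R packages and MATLAB code. As a minimal sketch of the same kind of comparison in Python (an illustrative assumption; the original data and code are not shown here, and the signal is planted at indices 23 and 112 only to echo the variables flagged above):

    import numpy as np
    from sklearn.linear_model import ElasticNetCV, LassoCV

    rng = np.random.default_rng(1)
    n, p = 60, 200                               # high-dimensional: p > n
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[[23, 112]] = 3.0                        # hypothetical true signal
    y = X @ beta + rng.standard_normal(n)

    # Cross-validated lasso and elastic net; compare the selected supports.
    for name, model in [("lasso", LassoCV(cv=5)),
                        ("elastic net", ElasticNetCV(l1_ratio=0.5, cv=5))]:
        fit = model.fit(X, y)
        print(name, "selects variables:", np.flatnonzero(fit.coef_))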
45

Statistical Applications of Linear Programming for Feature Selection via Regularization Methods

Yao, Yonggang 01 October 2008
No description available.
46

Consistent bi-level variable selection via composite group bridge penalized regression

Seetharaman, Indu January 1900
Master of Science / Department of Statistics / Kun Chen / We study composite group bridge penalized regression methods for conducting bi-level variable selection in high dimensional linear regression models with a diverging number of predictors. The proposed method combines the ideas of bridge regression (Huang et al., 2008a) and group bridge regression (Huang et al., 2009) to achieve variable selection consistency at both the individual and group levels simultaneously, i.e., the important groups and the important individual variables within each group can both be correctly identified with probability approaching one as the sample size increases to infinity. The method takes full advantage of the prior grouping information, and the established bi-level oracle properties ensure that the method is immune to possible group misidentification. A related adaptive group bridge estimator, which uses adaptive penalization for improving bi-level selection, is also investigated. Simulation studies show that the proposed methods have superior performance in comparison to many existing methods.
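For reference, the group bridge penalty of Huang et al. (2009) on which the composite method builds is, for predictor groups A_1, \dots, A_J with group weights c_j,

    P_\lambda(\beta) = \lambda \sum_{j=1}^{J} c_j \Bigl( \sum_{k \in A_j} |\beta_k| \Bigr)^{\gamma}, \qquad 0 < \gamma < 1.

The concave power \gamma < 1 on each group norm is what zeroes out entire groups; the composite construction additionally applies a bridge-type concave penalty to the individual coefficients within groups, yielding sparsity at both levels (the exact composite form is not reproduced in this abstract).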
47

Prediction and variable selection in sparse ultrahigh dimensional additive models

Ramirez, Girly Manguba January 1900
Doctor of Philosophy / Department of Statistics / Haiyan Wang / Advances in technology have enabled many fields to collect datasets where the number of covariates (p) tends to be much larger than the number of observations (n), the so-called ultrahigh dimensionality. In this setting, classical regression methodologies are invalid. There is a great need to develop methods that can explain the variation in the response variable using only a parsimonious set of covariates. In recent years, there have been significant developments in variable selection procedures. However, these available procedures usually result in the selection of too many false variables. In addition, most of the available procedures are appropriate only when the response variable is linearly associated with the covariates. Motivated by these concerns, we propose another procedure for variable selection in the ultrahigh dimensional setting which has the ability to reduce the number of false positive variables. Moreover, this procedure can be applied when the response variable is continuous or binary, and when the response variable is linearly or non-linearly related to the covariates. Inspired by the Least Angle Regression approach, we develop two multi-step algorithms to select variables in sparse ultrahigh dimensional additive models. The variables go through a series of nonlinear dependence evaluations following a Most Significant Regression (MSR) algorithm, which is also designed to implement prediction of the response variable. The first algorithm, called MSR-continuous (MSRc), is appropriate for a dataset with a continuous response variable. Simulation results demonstrate that this algorithm works well. Comparisons with other methods, such as greedy-INIS by Fan et al. (2011) and the generalized correlation procedure by Hall and Miller (2009), showed that MSRc not only has a false positive rate significantly lower than both methods, but also has accuracy and a true positive rate comparable with greedy-INIS. The second algorithm, called MSR-binary (MSRb), is appropriate when the response variable is binary. Simulations demonstrate that MSRb is competitive in terms of prediction accuracy and true positive rate, and better than GLMNET in terms of false positive rate. Application of MSRb to real datasets is also presented. In general, the MSR algorithm usually selects fewer variables while preserving the accuracy of predictions.
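The MSR algorithm itself is not detailed in this abstract. The following is a hedged sketch of a generic "most significant regression" style loop under simplifying assumptions (a cubic-polynomial marginal fit and an R^2 gain rule, both illustrative choices rather than the dissertation's algorithm): at each step, fit each remaining variable nonlinearly to the current residuals and add the most significant one.

    import numpy as np

    def msr_forward(X, y, max_steps=10, tol=1e-3):
        # Greedy forward selection for a sparse additive model: at each
        # step, regress the current residuals on a cubic basis of every
        # remaining variable and add the variable with the largest R^2.
        n, p = X.shape
        selected, resid = [], y - y.mean()
        for _ in range(max_steps):
            best_j, best_r2, best_fit = None, tol, None
            for j in set(range(p)) - set(selected):
                B = np.vander(X[:, j], 4)                  # cubic basis in X_j
                coef, *_ = np.linalg.lstsq(B, resid, rcond=None)
                fit = B @ coef
                r2 = 1.0 - np.sum((resid - fit) ** 2) / np.sum(resid ** 2)
                if r2 > best_r2:
                    best_j, best_r2, best_fit = j, r2, fit
            if best_j is None:                             # no useful gain: stop
                break
            selected.append(best_j)
            resid = resid - best_fit                       # backfit-style update
        return selected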
48

Variable selection for kernel methods with application to binary classification

Oosthuizen, Surette 03 1900
Thesis (PhD (Statistics and Actuarial Science))—University of Stellenbosch, 2008. / The problem of variable selection in binary kernel classification is addressed in this thesis. Kernel methods are fairly recent additions to the statistical toolbox, having originated approximately two decades ago in machine learning and artificial intelligence. These methods are growing in popularity and are already frequently applied in regression and classification problems. Variable selection is an important step in many statistical applications: a better understanding of the problem being investigated is achieved, and subsequent analyses of the data frequently yield more accurate results if irrelevant variables have been eliminated. It is therefore important to investigate aspects of variable selection for kernel methods. Chapter 2 of the thesis is an introduction to the main part presented in Chapters 3 to 6. In Chapter 2 some general background material on kernel methods is provided, along with an introduction to variable selection. Empirical evidence is presented substantiating the claim that variable selection is a worthwhile enterprise in kernel classification problems. Several aspects which complicate variable selection in kernel methods are discussed. An important property of kernel methods is that the original data are effectively transformed before a classification algorithm is applied to them. The space in which the original data reside is called input space, while the transformed data occupy part of a feature space. In Chapter 3 we investigate whether variable selection should be performed in input space or rather in feature space. A new approach to selection, so-called feature-to-input space selection, is also proposed. This approach has the attractive property of combining information generated in feature space with easy interpretation in input space. An empirical study reveals that effective variable selection requires utilisation of at least some information from feature space. Having confirmed in Chapter 3 that variable selection should preferably be done in feature space, the focus in Chapter 4 is on two classes of selection criteria operating in feature space: criteria which are independent of the specific kernel classification algorithm and criteria which depend on this algorithm. In this regard we concentrate on two kernel classifiers, viz. support vector machines and kernel Fisher discriminant analysis, both of which are described in some detail in Chapter 4. The chapter closes with a simulation study showing that two of the algorithm-independent criteria are very competitive with the more sophisticated algorithm-dependent ones. In Chapter 5 we incorporate a specific strategy for searching through the space of variable subsets into our investigation. Evidence in the literature strongly suggests that backward elimination is preferable to forward selection in this regard, and we therefore focus on recursive feature elimination. Zero- and first-order forms of the new selection criteria proposed earlier in the thesis are presented for use in recursive feature elimination, and their properties are investigated in a numerical study. It is found that some of the simpler zero-order criteria perform better than the more complicated first-order ones. Up to the end of Chapter 5 it is assumed that the number of variables to select is known. We do away with this restriction in Chapter 6 and propose a simple criterion which uses the data to identify this number when a support vector machine is used. The proposed criterion is investigated in a simulation study and compared to cross-validation, which can also be used for this purpose. We find that the proposed criterion performs well. The thesis concludes in Chapter 7 with a summary and several suggestions for further research.
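As a minimal sketch of recursive feature elimination with a support vector machine, assuming a linear kernel so that the fitted weight vector supplies a simple zero-order ranking criterion (the thesis's criteria for general kernels are more involved):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    # Synthetic binary classification data (illustrative assumption).
    X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                               random_state=0)

    # RFE refits the SVM each round and drops the lowest-ranked feature;
    # with a linear kernel, features are ranked by the fitted |w_j|.
    selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=5, step=1)
    selector.fit(X, y)
    print("retained variables:", np.flatnonzero(selector.support_))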
49

Influential data cases when the C-p criterion is used for variable selection in multiple linear regression

Uys, Daniel Wilhelm January 2003
Dissertation (PhD)--Stellenbosch University, 2003. / In this dissertation we study the influence of data cases when the Cp criterion of Mallows (1973) is used for variable selection in multiple linear regression. The influence is investigated in terms of the predictive power and the predictor variables included in the resulting model when variable selection is applied. In particular, we focus on the importance of identifying and dealing with these so-called selection influential data cases before model selection and fitting are performed. For this purpose we develop two new selection influence measures, both based on the Cp criterion. The first measure is specifically developed to identify individual selection influential data cases, whereas the second identifies subsets of selection influential data cases. The success with which these influence measures identify selection influential data cases is evaluated in example data sets and in simulation. All results are derived in the coordinate-free context, with special application in multiple linear regression.
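For reference, Mallows' Cp for a candidate subset model with p parameters is

    C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2p,

where \mathrm{SSE}_p is the residual sum of squares of the subset model and \hat{\sigma}^2 is the error variance estimate from the full model; subsets with C_p close to p exhibit little bias, which is why a few influential cases can change which subsets look adequate.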
50

Exact Markov chain Monte Carlo and Bayesian linear regression

Bentley, Jason Phillip January 2009
In this work we investigate the use of perfect sampling methods within the context of Bayesian linear regression. We focus on inference problems related to the marginal posterior model probabilities. Model averaged inference for the response and Bayesian variable selection are considered. Perfect sampling is an alternate form of Markov chain Monte Carlo that generates exact sample points from the posterior of interest. This approach removes the need for burn-in assessment faced by traditional MCMC methods. For model averaged inference, we find the monotone Gibbs coupling from the past (CFTP) algorithm is the preferred choice. This requires the predictor matrix be orthogonal, preventing variable selection, but allowing model averaging for prediction of the response. Exploring choices of priors for the parameters in the Bayesian linear model, we investigate sufficiency for monotonicity assuming Gaussian errors. We discover that a number of other sufficient conditions exist, besides an orthogonal predictor matrix, for the construction of a monotone Gibbs Markov chain. Requiring an orthogonal predictor matrix, we investigate new methods of orthogonalizing the original predictor matrix. We find that a new method using the modified Gram-Schmidt orthogonalization procedure performs comparably with existing transformation methods, such as generalized principal components. Accounting for the effect of using an orthogonal predictor matrix, we discover that inference using model averaging for in-sample prediction of the response is comparable between the original and orthogonal predictor matrix. The Gibbs sampler is then investigated for sampling when using the original predictor matrix and the orthogonal predictor matrix. We find that a hybrid method, using a standard Gibbs sampler on the orthogonal space in conjunction with the monotone CFTP Gibbs sampler, provides the fastest computation and convergence to the posterior distribution. We conclude the hybrid approach should be used when the monotone Gibbs CFTP sampler becomes impractical, due to large backwards coupling times. We demonstrate large backwards coupling times occur when the sample size is close to the number of predictors, or when hyper-parameter choices increase model competition. The monotone Gibbs CFTP sampler should be taken advantage of when the backwards coupling time is small. For the problem of variable selection we turn to the exact version of the independent Metropolis-Hastings (IMH) algorithm. We reiterate the notion that the exact IMH sampler is redundant, being a needlessly complicated rejection sampler. We then determine a rejection sampler is feasible for variable selection when the sample size is close to the number of predictors and using Zellner’s prior with a small value for the hyper-parameter c. Finally, we use the example of simulating from the posterior of c conditional on a model to demonstrate how the use of an exact IMH view-point clarifies how the rejection sampler can be adapted to improve efficiency.
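The model-averaged inference referred to above takes the standard Bayesian model averaging form, with marginal posterior model probabilities as weights:

    p(\tilde{y} \mid y) = \sum_{M} p(\tilde{y} \mid y, M)\, p(M \mid y), \qquad p(M \mid y) = \frac{p(y \mid M)\, p(M)}{\sum_{M'} p(y \mid M')\, p(M')}.

Perfect samplers such as monotone Gibbs CFTP draw exactly from p(M \mid y), so these sums can be estimated without the burn-in diagnostics required by conventional MCMC.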
