  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
61

Canonical Variable Selection for Ecological Modeling of Fecal Indicators

Gilfillan, Dennis, Hall, Kimberlee, Joyner, Timothy Andrew, Scheuerman, Phillip R. 20 September 2018
More than 270,000 km of rivers and streams are impaired due to fecal pathogens, creating an economic and public health burden. Fecal indicator organisms such as Escherichia coli are used to determine if surface waters are pathogen impaired, but they fail to identify human health risks, provide source information, or have unique fate and transport processes. Statistical and machine learning models can be used to overcome some of these weaknesses, including identifying ecological mechanisms influencing fecal pollution. In this study, canonical correlation analysis (CCorA) was performed to select parameters for the machine learning model, Maxent, to identify how chemical and microbial parameters can predict E. coli impairment and F+-somatic bacteriophage detections. Models were validated using a bootstrapping cross-validation. Three suites of models were developed; initial models using all parameters, models using parameters identified in CCorA, and optimized models after further sensitivity analysis. Canonical correlation analysis reduced the number of parameters needed to achieve the same degree of accuracy in the initial E. coli model (84.7%), and sensitivity analysis improved accuracy to 86.1%. Bacteriophage model accuracies were 79.2, 70.8, and 69.4% for the initial, CCorA, and optimized models, respectively; this suggests complex ecological interactions of bacteriophages are not captured by CCorA. Results indicate distinct ecological drivers of impairment depending on the fecal indicator organism used. Escherichia coli impairment is driven by increased hardness and microbial activity, whereas bacteriophage detection is inhibited by high levels of coliforms in sediment. Both indicators were influenced by organic pollution and phosphorus limitation.
62

Penalized methods and algorithms for high-dimensional regression in the presence of heterogeneity

Yi, Congrui 01 December 2016
In fields such as statistics, economics and biology, heterogeneity is an important topic concerning validity of data inference and discovery of hidden patterns. This thesis focuses on penalized methods for regression analysis in the presence of heterogeneity in a potentially high-dimensional setting. Two possible strategies to deal with heterogeneity are: robust regression methods that provide heterogeneity-resistant coefficient estimation, and direct detection of heterogeneity while simultaneously estimating coefficients accurately. We consider the first strategy for two robust regression methods, Huber loss regression and quantile regression with Lasso or Elastic-Net penalties, which have been studied theoretically but lack efficient algorithms. We propose a new algorithm, Semismooth Newton Coordinate Descent, to solve them. The algorithm is a novel combination of the Semismooth Newton Algorithm and Coordinate Descent that applies to penalized optimization problems with both nonsmooth loss and nonsmooth penalty. We prove its convergence properties and show its computational efficiency through numerical studies. We also propose a nonconvex penalized regression method, Heterogeneity Discovery Regression (HDR), as a realization of the second strategy. We establish theoretical results that guarantee statistical precision for any local optimum of the objective function with high probability. We also compare the numerical performance of HDR with competitors including Huber loss regression, quantile regression and least squares through simulation studies and a real data example. In these experiments, HDR methods are able to detect heterogeneity accurately, and also largely outperform the competitors in terms of coefficient estimation and variable selection.
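As an illustration of the two robust losses and the Elastic-Net penalty named in this abstract, here is a minimal Python sketch. It is not the thesis's Semismooth Newton Coordinate Descent algorithm. The function names, the `delta` and `tau` parameters, and the particular Huber parametrization are assumptions chosen for the example; the thesis may scale these quantities differently.

```python
def huber_loss(r, delta=1.0):
    """Huber loss of a residual r: quadratic for |r| <= delta, linear beyond.

    This is one common parametrization (an assumption, not taken from the thesis).
    """
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


def check_loss(r, tau=0.5):
    """Quantile-regression (check) loss: weight tau on positive residuals,
    weight 1 - tau on negative residuals."""
    return r * (tau - (1.0 if r < 0 else 0.0))


def elastic_net_penalty(beta, lam, alpha):
    """Elastic-Net penalty: lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm).

    alpha = 1 gives the Lasso penalty; alpha = 0 gives a ridge penalty.
    """
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)
```

Both losses are nonsmooth (the check loss at zero, the Huber loss in its second derivative) and the L1 part of the penalty is nonsmooth at zero, which is the setting the abstract describes.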
63

Passive detection of radionuclides from weak and poorly resolved gamma-ray energy spectra

Kump, Paul 01 July 2012
Large passive detectors used in screening for special nuclear materials at ports of entry are characterized by poor spectral resolution, making identification of radionuclides a difficult task. Most identification routines, which fit empirical shapes and use derivatives, are impractical in these situations. Here I develop new, physics-based methods to determine the presence of spectral signatures of one or more of a set of isotopes. Gamma-ray counts are modeled as Poisson processes, where the average part is taken to be the model and the difference between the observed gamma-ray counts and the average is considered random noise. In the linear part, the unknown coefficients represent the intensities of the isotopes. Therefore, it is of great interest not to estimate each coefficient, but rather to determine if the coefficient is non-zero, corresponding to the presence of the isotope. This thesis provides new selection algorithms, and, since detector data is undoubtedly finite, this unique work emphasizes selection when data is fixed and finite.
64

Grouped variable selection in high dimensional partially linear additive Cox model

Liu, Li 01 December 2010
In the analysis of survival outcomes supplemented with both clinical information and high-dimensional gene expression data, the traditional Cox proportional hazards model fails to meet some emerging needs in biological research. First, the number of covariates is generally much larger than the sample size. Second, predicting an outcome with individual gene expressions is inadequate because a gene's expression is regulated by multiple biological processes and functional units. There is a need to understand the impact of changes at a higher level such as molecular function, cellular component, biological process, or pathway. The change at a higher level is usually measured with a set of gene expressions related to the biological process. That is, we need to model the outcome with gene sets as variable groups, and the gene sets may partially overlap. In this thesis work, we investigate the impact of a penalized Cox regression procedure on regularization, parameter estimation, variable group selection, and nonparametric modeling of nonlinear effects with a time-to-event outcome. We formulate the problem as a partially linear additive Cox model with high-dimensional data. We group genes into gene sets and approximate the nonparametric components by truncated series expansions with B-spline bases. After grouping and approximation, the problem of variable selection becomes that of selecting groups of coefficients in a gene set or in an approximation. We apply the group Lasso to obtain an initial solution path and reduce the dimension of the problem and then update the whole solution path with the adaptive group Lasso. We also propose a generalized group lasso method to provide more freedom in specifying the penalty and excluding covariates from being penalized. A modified Newton-Raphson method is designed for stable and rapid computation. The core programs are written in the C language.
A user-friendly R interface is implemented to perform all the calculations by calling the core programs. We demonstrate the asymptotic properties of the proposed methods. Simulation studies are carried out to evaluate the finite sample performance of the proposed procedure using several tuning parameter selection methods for choosing the point on the solution path as the final estimator. We also apply the proposed approach to two real data examples.
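To show why a group Lasso penalty selects or drops whole groups of coefficients at once, the following Python sketch implements the standard group soft-thresholding operator (the proximal operator of `lam` times the Euclidean norm of a group). This is a generic textbook building block, not the modified Newton-Raphson procedure or the C/R implementation described in the abstract; the function name and arguments are hypothetical.

```python
import math


def group_soft_threshold(z, lam):
    """Shrink the coefficient group z toward zero as a block.

    If the Euclidean norm of z is at most lam, the entire group is set to
    zero (the group is deselected); otherwise every coefficient is scaled
    by the same factor 1 - lam / ||z||. Assumes lam >= 0.
    """
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        return [0.0] * len(z)
    scale = 1.0 - lam / norm
    return [scale * v for v in z]
```

For example, `group_soft_threshold([3.0, 4.0], 1.0)` scales both coefficients by 0.8, while a threshold of 5.0 or more zeroes the whole group. An adaptive variant would, under the usual construction, replace `lam` with a group-specific `lam * w_g`, where the weight `w_g` comes from an initial estimate.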
65

Variable selection and neural networks for high-dimensional data analysis: application in infrared spectroscopy and chemometrics

Benoudjit, Nabil 24 November 2003
This thesis focuses particularly on the application of chemometrics in the field of analytical chemistry. Chemometrics (or multivariate analysis) consists of finding a relationship between two groups of variables, often called dependent and independent variables. In infrared spectroscopy, for instance, chemometrics consists of predicting a quantitative variable that is difficult to obtain (it requires a chemical analysis and a qualified operator), such as the concentration of a component present in the studied product, from spectral data measured at various wavelengths or wavenumbers (several hundred, even several thousand). In this research we propose a methodology in the field of chemometrics to handle chemical data (spectrophotometric data), which are often high-dimensional. To handle these data, we first propose a new incremental (step-by-step) method for the selection of spectral data based on the combination of three principles: linear or non-linear regression, an incremental procedure for the variable selection, and the use of a validation set. This procedure makes it possible, on the one hand, to benefit from the advantages of non-linear methods to predict chemical data (there is often a non-linear relationship between dependent and independent variables) and, on the other hand, to avoid overfitting, one of the most crucial problems encountered with non-linear models. Secondly, we propose to improve the previous method by a judicious choice of the first selected variable, which has a very important influence on the final performance of the prediction. The idea is to use a measure of the mutual information between the independent and dependent variables to select the first one; then the previous incremental (step-by-step) method is used to select the next variables.
The variable selected by mutual information can have a good interpretation from the spectrochemical point of view, and does not depend on the data distribution in the training and validation sets. In contrast, the traditional chemometric linear methods such as PCR or PLSR produce new variables which do not have any interpretation from the spectrochemical point of view. Four real-life datasets (wine, orange juice, milk powder and apples) are presented in order to show the efficiency and advantages of both proposed procedures compared to the traditional chemometric linear methods often used, such as MLR, PCR and PLSR.
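The idea of choosing the first variable by mutual information can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it uses a plug-in estimate on already discretized data, whereas spectral data are continuous and the thesis may use a different estimator. The function names and the toy data (`toy_columns`, `toy_target`) are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (px[a] * py[b]))
    return mi


def pick_first_variable(columns, y):
    """Return the name of the candidate column sharing the most information with y."""
    return max(columns, key=lambda name: mutual_information(columns[name], y))


# Toy data: "band_a" determines the target exactly, "band_b" is independent of it.
toy_target = [0, 0, 1, 1]
toy_columns = {"band_a": [0, 0, 1, 1], "band_b": [0, 1, 0, 1]}
first = pick_first_variable(toy_columns, toy_target)
```

In the procedure the abstract describes, the remaining variables would then be added one at a time, keeping each addition only if it improves the prediction error on the validation set.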
66

Stochastic Stepwise Ensembles for Variable Selection

Xin, Lu 30 April 2009
Ensemble methods such as AdaBoost, Bagging and Random Forest have attracted much attention in the statistical learning community in the last 15 years. Zhu and Chipman (2006) proposed the idea of using ensembles for variable selection. Their implementation used a parallel genetic algorithm (PGA). In this thesis, I propose a stochastic stepwise ensemble for variable selection, which improves upon PGA. Traditional stepwise regression (Efroymson 1960) combines forward and backward selection. One step of forward selection is followed by one step of backward selection. In the forward step, each variable other than those already included is added to the current model, one at a time, and the one that can best improve the objective function is retained. In the backward step, each variable already included is deleted from the current model, one at a time, and the one that can best improve the objective function is discarded. The algorithm continues until no improvement can be made by either the forward or the backward step. Instead of adding or deleting one variable at a time, the Stochastic Stepwise Algorithm (STST) adds or deletes a group of variables at a time, where the group size is randomly decided. In traditional stepwise, the group size is one and each candidate variable is assessed. When the group size is larger than one, as is often the case for STST, the total number of variable groups can be quite large. Instead of evaluating all possible groups, only a few randomly selected groups are assessed and the best one is chosen. From a methodological point of view, the improvement of the STST ensemble over PGA is due to the use of a more structured way to construct the ensemble; this gives us better control over the strength-diversity tradeoff established by Breiman (2001). In fact, there is no mechanism to control this fundamental tradeoff in PGA.
Empirically, the improvement is most prominent when a true variable in the model has a relatively small coefficient (relative to other true variables). I show empirically that PGA has a much higher probability of missing that variable.
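The two search procedures described in this abstract can be sketched in Python as below. This is a minimal illustration, not the thesis's implementation: the objective is passed in as an arbitrary function of a variable subset (larger is better), and the way group sizes are drawn, the number of sampled groups (`n_groups`), the stopping rule of the stochastic version, and the toy objective are all assumptions made for the example.

```python
import random


def stepwise(n_vars, objective):
    """Traditional stepwise search: one forward step, then one backward step,
    repeated until neither step improves the objective."""
    selected = frozenset()
    best = objective(selected)
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each excluded variable, keep the best addition.
        forward = [selected | {j} for j in range(n_vars) if j not in selected]
        # Backward step candidates are built after the forward step is resolved.
        for build in ("forward", "backward"):
            cands = forward if build == "forward" else [selected - {j} for j in selected]
            if not cands:
                continue
            cand = max(cands, key=objective)
            score = objective(cand)
            if score > best:
                selected, best, improved = cand, score, True
    return selected, best


def stochastic_stepwise(n_vars, objective, rng, n_groups=5):
    """Stochastic variant: each step adds or deletes a randomly sized group,
    and only a few randomly sampled groups are assessed."""
    selected = frozenset()
    best = objective(selected)
    improved = True
    while improved:
        improved = False
        for forward in (True, False):
            pool = sorted(set(range(n_vars)) - selected) if forward else sorted(selected)
            if not pool:
                continue
            size = rng.randint(1, len(pool))  # random group size (assumed uniform)
            groups = [frozenset(rng.sample(pool, size)) for _ in range(n_groups)]
            cands = [selected | g if forward else selected - g for g in groups]
            cand = max(cands, key=objective)
            score = objective(cand)
            if score > best:
                selected, best, improved = cand, score, True
    return selected, best


def toy_objective(subset):
    """Hypothetical objective: variables 0 and 3 are 'true', with a large and a
    small effect; every included variable costs 0.5."""
    gains = {0: 5.0, 3: 1.0}
    return sum(gains.get(j, 0.0) for j in subset) - 0.5 * len(subset)
```

On the toy objective with six candidate variables, `stepwise(6, toy_objective)` ends at the subset {0, 3}. An ensemble in the spirit of the abstract would run `stochastic_stepwise` many times with different random states and rank variables by how often they are selected.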
67

Parametric classification and variable selection by the minimum integrated squared error criterion

January 2012
This thesis presents a robust solution to the classification and variable selection problem when the dimension of the data, or number of predictor variables, may greatly exceed the number of observations. When faced with the problem of classifying objects given many measured attributes of the objects, the goal is to build a model that makes the most accurate predictions using only the most meaningful subset of the available measurements. The introduction of ℓ1-regularized model fitting has inspired many approaches that simultaneously do model fitting and variable selection. If parametric models are employed, the standard approach is some form of regularized maximum likelihood estimation. While this is an asymptotically efficient procedure under very general conditions, it is not robust. Outliers can negatively impact both estimation and variable selection. Moreover, outliers can be very difficult to identify as the number of predictor variables becomes large. Minimizing the integrated squared error, or L2 error, while less efficient, has been shown to generate parametric estimators that are robust to a fair amount of contamination in several contexts. In this thesis, we present a novel robust parametric regression model for the binary classification problem based on L2 distance, the logistic L2 estimator (L2E). To perform simultaneous model fitting and variable selection among correlated predictors in the high dimensional setting, an elastic net penalty is introduced. A fast computational algorithm for minimizing the elastic net penalized logistic L2E loss is derived and results on the algorithm's global convergence properties are given. Through simulations we demonstrate the utility of the penalized logistic L2E at robustly recovering sparse models from high dimensional data in the presence of outliers and inliers. Results on real genomic data are also presented.
69

A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data

Zuber, Verena 17 December 2012
In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine that depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is a complicated challenge due to the characteristic structure of omics data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation, or external, latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation. Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR scores from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule.
Furthermore, we discuss properties of both scores and relations to established quantities. Above all, the CAT score decomposes Hotelling’s T² and the CAR score the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores. To render our approach applicable to high-dimensional omics data we devise an efficient algorithm for shrinkage estimates of the CAT and CAR scores. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches in terms of more true positives selected and a lower model error. Finally, we illustrate the application of CAT and CAR scores on real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that CAT and CAR scores are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error.
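The statement that CAR scores decompose the proportion of variance explained can be checked numerically in the smallest non-trivial case. The Python sketch below assumes the CAR score vector is the inverse square root of the predictor correlation matrix applied to the vector of marginal correlations with the outcome, as the abstract's definition suggests, and uses the closed-form inverse square root available for two predictors. The function names are hypothetical, and population correlations are supplied directly rather than estimated by shrinkage as in the thesis.

```python
import math


def car_scores_two_predictors(r12, r1y, r2y):
    """CAR scores for two predictors with mutual correlation r12 and
    marginal correlations r1y, r2y with the outcome.

    For P = [[1, r12], [r12, 1]] the eigenvalues are 1 + r12 and 1 - r12,
    so P^(-1/2) = [[a, b], [b, a]] with the a and b computed below.
    """
    if not -1.0 < r12 < 1.0:
        raise ValueError("r12 must lie strictly between -1 and 1")
    u = 1.0 / math.sqrt(1.0 + r12)
    v = 1.0 / math.sqrt(1.0 - r12)
    a = 0.5 * (u + v)
    b = 0.5 * (u - v)
    return (a * r1y + b * r2y, b * r1y + a * r2y)


def r_squared_two_predictors(r12, r1y, r2y):
    """Proportion of variance explained by a linear model with two predictors."""
    return (r1y ** 2 + r2y ** 2 - 2.0 * r12 * r1y * r2y) / (1.0 - r12 ** 2)
```

For any valid set of correlations, the squared CAR scores returned by the first function sum to the value returned by the second. When the predictors are uncorrelated (`r12 = 0`), the CAR scores reduce to the marginal correlations.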
70

Bayesian Adjustment for Multiplicity

Scott, James Gordon January 2009
This thesis is about Bayesian approaches for handling multiplicity. It considers three main kinds of multiple-testing scenarios: tests of exchangeable experimental units, tests for variable inclusion in linear regression models, and tests for conditional independence in jointly normal vectors. Multiplicity adjustment in these three areas will be seen to have many common structural features. Though the modeling approach throughout is Bayesian, frequentist reasoning regarding error rates will often be employed.
Chapter 1 frames the issues in the context of historical debates about Bayesian multiplicity adjustment. Chapter 2 confronts the problem of large-scale screening of functional data, where control over Type-I error rates is a crucial issue. Chapter 3 develops new theory for comparing Bayes and empirical-Bayes approaches for multiplicity correction in regression variable selection. Chapters 4 and 5 describe new theoretical and computational tools for Gaussian graphical-model selection, where multiplicity arises in performing many simultaneous tests of pairwise conditional independence. Chapter 6 introduces a new approach to sparse-signal modeling based upon local shrinkage rules. Here the focus is not on multiplicity per se, but rather on using ideas from Bayesian multiple-testing models to motivate a new class of multivariate scale-mixture priors. Finally, Chapter 7 describes some directions for future study, many of which are the subjects of my current research agenda.
