21

Statistical Methods for High Dimensional Data in Environmental Genomics

Sofer, Tamar January 2012 (has links)
In this dissertation, we propose methodology to analyze high dimensional genomics data, in which the observations have a large number of outcome variables in addition to exposure variables. In Chapter 1, we investigate methods for genetic pathway analysis, where we have a small number of exposure variables. We propose two Canonical Correlation Analysis based methods that select outcomes either sequentially or by screening, and show that the performance of the proposed methods depends on the correlation between the genes in the pathway. We also propose and investigate a criterion for fixing the number of outcomes, and a powerful test for the exposure effect on the pathway. The methodology is applied to show that air pollution exposure affects gene methylation of a few genes from the asthma pathway. In Chapter 2, we study penalized multivariate regression as an efficient and flexible method to study the relationship between a large number of covariates and multiple outcomes. We use penalized likelihood to shrink model parameters to zero and to select only the important effects, and we use the Bayesian Information Criterion (BIC) to select tuning parameters for the employed penalty, showing that it chooses the right tuning parameter with high probability. These are combined in the “two-stage procedure”, and asymptotic results show that it yields a consistent, sparse, and asymptotically normal estimator of the regression parameters. The method is illustrated on gene expression data from normal and diabetic patients. In Chapter 3, we propose a method for estimating covariate-dependent principal components analysis (PCA) and covariance matrices. Covariates, such as smoking habits, can affect the variation in a set of gene methylation values. We develop a penalized regression method that incorporates covariates in the estimation of principal components. We show that the parameter estimates are consistent and sparse, and that using the BIC to select the tuning parameter for the penalty functions yields good models. We also propose the scree plot residual variance criterion for selecting the number of principal components. The proposed procedure is applied to show that the first three principal components of gene methylation in the asthma pathway differ between people who did not smoke and people who did.
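[Editor's note: a minimal sketch of the flavor of method described in Chapter 2 above — penalized multivariate (multi-task) regression with a BIC-style choice of tuning parameter. The dissertation's actual estimator and two-stage procedure are not reproduced; the particular BIC formula, the use of scikit-learn's MultiTaskLasso, and the synthetic data are illustrative assumptions.]

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, p, q = 200, 50, 5                      # samples, covariates, outcomes
X = rng.standard_normal((n, p))
B = np.zeros((p, q))
B[:3] = rng.standard_normal((3, q))       # only 3 covariates truly matter
Y = X @ B + 0.5 * rng.standard_normal((n, q))

def bic(model, X, Y):
    """BIC-style score; degrees of freedom counted as nonzero coefficient rows."""
    resid = Y - model.predict(X)
    rss = np.sum(resid ** 2)
    df = np.sum(np.any(model.coef_ != 0, axis=0))   # number of selected covariates
    return n * q * np.log(rss / (n * q)) + np.log(n) * df * q

# fit over a grid of tuning parameters, keep the BIC minimizer
alphas = np.logspace(-3, 0, 20)
fits = [MultiTaskLasso(alpha=a).fit(X, Y) for a in alphas]
best = min(fits, key=lambda m: bic(m, X, Y))
selected = np.where(np.any(best.coef_ != 0, axis=0))[0]
print("selected covariates:", selected)
```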
22

Robust Approaches to Marker Identification and Evaluation for Risk Assessment

Dai, Wei January 2013 (has links)
Assessment of risk has been a key element in efforts to identify factors associated with disease, to assess potential targets of therapy, and to enhance disease prevention and treatment. Considerable work has been done to develop methods that identify markers, construct risk prediction models, and evaluate such models. This dissertation aims to develop robust approaches for these tasks. In Chapter 1, we present a robust, flexible, yet powerful approach to identify genetic variants that are associated with disease risk in genome-wide association studies when some subjects are related. In Chapter 2, we focus on identifying important genes predictive of survival outcomes when the number of covariates greatly exceeds the number of observations, via a nonparametric transformation model. We propose a rank-based estimator that imposes minimal assumptions and develop an efficient …
23

Statistical Discovery of Biomarkers in Metagenomics

Abdul Wahab, Ahmad Hakeem January 2015 (has links)
Metagenomics holds great potential for uncovering relationships within microbial communities that have yet to be discovered, particularly because the field circumvents the need to isolate and culture microbes from their natural environmental settings. A common research objective is to detect biomarkers: microbes whose abundance is associated with changes in a status of interest. Determining such microbes across conditions, for instance healthy and diseased groups, allows researchers to identify pathogens and probiotics. This is often achieved via analysis of differential abundance of microbes. The problem is that differential abundance analysis looks at each microbe individually, without considering the possible associations the microbes may have with each other. This is unfavorable, since microbes rarely act individually but within intricate communities involving other microbes. An alternative is variable selection techniques such as the Lasso or the Elastic Net, which consider all the microbes simultaneously and conduct selection. However, the Lasso often selects only a single representative feature from a correlated cluster of features, and the Elastic Net may incorrectly select unimportant features too frequently and erratically due to high levels of sparsity and variation in the data.

In this research, the proposed method AdaLassop is an augmented variable selection technique that overcomes these shortcomings of the Lasso and the Elastic Net. It provides researchers with a holistic model that accounts for the effects of selected biomarkers in the presence of other important biomarkers. For AdaLassop, variable selection on sparse, ultra-high dimensional data is implemented using the Adaptive Lasso, with p-values extracted from zero-inflated negative binomial regressions as adaptive weights. Comprehensive simulations involving varying correlation structures indicate that AdaLassop has optimal performance in the presence of multicollinearity, especially as the sample size grows. Application of AdaLassop to a metagenome-wide study of diabetic patients reveals both pathogens and probiotics that have been researched in the medical field.
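[Editor's note: a rough sketch of the AdaLassop idea as described above — fit a per-microbe zero-inflated negative binomial regression to get p-values, turn them into adaptive penalty weights, and run a weighted Lasso. The weight mapping, the binary-outcome setup, and the fallback when a ZINB fit fails are assumptions for illustration, not the paper's implementation.]

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 120, 30                       # samples, microbes
group = rng.integers(0, 2, n)        # 0 = healthy, 1 = diseased
# zero-inflated counts; microbes 0-2 shift with disease status
lam = np.exp(rng.normal(1.0, 0.3, (n, p)))
lam[:, :3] *= np.exp(0.9 * group[:, None])
counts = rng.poisson(lam) * (rng.random((n, p)) > 0.4)

# Step 1: per-microbe ZINB regression of counts on group -> p-values
pvals = np.ones(p)
exog = sm.add_constant(group.astype(float))
for j in range(p):
    try:
        fit = ZeroInflatedNegativeBinomialP(
            counts[:, j], exog, exog_infl=np.ones((n, 1))
        ).fit(method="bfgs", maxiter=200, disp=0)
        pvals[j] = fit.pvalues[-2]   # group coefficient (dispersion is last)
    except Exception:
        pass                         # keep weight 1 if the fit fails

# Step 2: adaptive Lasso of status on abundances, penalties scaled by p-values
w = np.clip(pvals, 1e-4, 1.0)        # small p-value -> weak penalty
Xs = np.log1p(counts) / w            # column rescaling implements the weights
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Xs, group)
beta = clf.coef_.ravel() / w         # map back to the original scale
print("selected microbes:", np.where(beta != 0)[0])
```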
24

Comparisons of statistical modeling for constructing gene regulatory networks

Chen, Xiaohui 11 1900 (has links)
Genetic regulatory networks are of great scientific interest and practical medical importance. Since a number of high-throughput measurement technologies are available, such as microarrays and sequencing, regulatory networks have been studied intensively over the last decade. Given these high-throughput data sets, statistical interpretation of these billions of bits is crucial for biologists to extract meaningful results. In this thesis, we compare a variety of existing regression models and apply them to construct regulatory networks spanning transcription factors and microRNAs. We also propose an extended algorithm to address the local-optimum issue in finding the Maximum A Posteriori estimator. An E. coli mRNA expression microarray data set with known bona fide interactions is used to evaluate our models, and we show that our regression networks with a properly chosen prior can perform comparably to the state-of-the-art regulatory network construction algorithm. Finally, we apply our models to a p53-related data set, the NCI-60 data. By further incorporating available prior structural information from sequencing data, we identify several significantly enriched interactions with cell proliferation function. In both data sets, we select specific examples to show that many regulatory interactions can be confirmed by previous studies or by functional enrichment analysis. Through comparing statistical models, we conclude that combining different models with over-representation analysis and prior structural information can improve the quality of prediction and facilitate biological interpretation. Keywords: regulatory network, variable selection, penalized maximum likelihood estimation, optimization, functional enrichment analysis.
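[Editor's note: a minimal sketch of the kind of regression-based network construction this thesis compares — regress each target gene's expression on candidate regulators with an l1 penalty and keep the nonzero coefficients as predicted edges. The cross-validated Lasso stands in for the thesis's family of penalized models and priors; the synthetic expression matrix is an assumption.]

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, n_regulators, n_targets = 100, 40, 10
R = rng.standard_normal((n, n_regulators))            # regulator expression
W = np.zeros((n_regulators, n_targets))
W[rng.integers(0, n_regulators, 15), rng.integers(0, n_targets, 15)] = 1.0
T = R @ W + 0.3 * rng.standard_normal((n, n_targets))  # target expression

edges = []
for j in range(n_targets):                  # one sparse regression per target
    fit = LassoCV(cv=5).fit(R, T[:, j])
    for i in np.where(fit.coef_ != 0)[0]:
        edges.append((i, j, fit.coef_[i]))  # regulator -> target edge

print(f"predicted {len(edges)} regulatory edges")
```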
25

Dimensionality Reduction in the Creation of Classifiers and the Effects of Correlation, Cluster Overlap, and Modelling Assumptions.

Petrcich, William 31 August 2011 (has links)
Discriminant analysis and random forests are used to create models for classification. The number of variables to be tested for inclusion in a model can be large, so the goal of this work was to create an efficient and effective variable-selection program. The first method used was based on the work of others. The resulting models underperformed, so another approach was adopted: models were built by greedily adding the variable that maximized new-model accuracy. The two programs were used to generate discriminant-analysis and random-forest models for three data sets; an existing software package was also used. The second program outperformed the alternatives, and for the small number of runs produced in this study it outperformed the method that inspired this work. The data sets were studied to identify determinants of performance. No definite conclusions were reached, but the results suggest topics for future study.
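[Editor's note: a sketch of the greedy strategy described above — at each step, add whichever remaining variable most improves cross-validated accuracy. The random-forest classifier, the stopping rule (stop when no candidate improves accuracy), and the example data set are illustrative assumptions.]

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def cv_accuracy(cols):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, cols], y, cv=5).mean()

selected, best_acc = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # score each candidate variable added to the current model
    scores = {j: cv_accuracy(selected + [j]) for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_acc:    # no variable improves accuracy: stop
        break
    selected.append(j_best)
    remaining.remove(j_best)
    best_acc = scores[j_best]

print(f"selected {len(selected)} variables, CV accuracy {best_acc:.3f}")
```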
26

A Multivariate Process Analysis on a Paper Production Process

Löfroth, Jaime, Wiklund, Samuel January 2018 (has links)
A big challenge in managing large-scale industrial processes, like those in the paper and pulp industry, is to reduce downtime and reduce sources of product quality variability to a minimum, while staying cost effective. Accomplishing this requires understanding the complex nature of the process variables and quantifying the causal relationships between them and both product quality and the amount of output. Paper and pulp industry processes consist mainly of chemical processes, and the relatively low cost of sensors today enables collection of huge amounts of data, both variables and observations at frequent time intervals. These masses of data usually come with the intrinsic problem of multicollinearity, which requires efficient multivariate statistical tools to extract useful insights from the noise. One goal in this multivariate situation is to break through the noise and find a relatively small subset of variables that are important, that is, variable selection. The purpose of this master's thesis is to help SCA Obbola, a large paper manufacturer that has had variable production output, reach conclusions that can help it ensure long-term high production quantity and quality. We apply different variable selection approaches that have proven successful in the literature. Our results are of mixed success, but we manage to find both variables that SCA Obbola knows affect specific response variables and variables that it finds interesting for further investigation.
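[Editor's note: the abstract does not name the selection methods used, so the following is only a hedged sketch of one common approach to variable selection under multicollinearity in process data — an elastic net, whose l2 component tends to keep groups of correlated sensors rather than one arbitrary representative. The latent-factor data generator is an assumption.]

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 500, 60                        # time points, process sensors
latent = rng.standard_normal((n, 5))  # a few drivers -> highly correlated sensors
X = latent @ rng.standard_normal((5, p)) + 0.1 * rng.standard_normal((n, p))
y = X[:, 0] - 0.5 * X[:, 7] + rng.standard_normal(n)   # e.g. a quality measure

Xs = StandardScaler().fit_transform(X)
# l1_ratio < 1 mixes in a ridge penalty to stabilize correlated selections
model = ElasticNetCV(l1_ratio=[0.3, 0.5, 0.7, 0.9], cv=5).fit(Xs, y)
print("selected sensors:", np.where(model.coef_ != 0)[0])
```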
27

Bayesian Latent Class Analysis with Shrinkage Priors: An Application to the Hungarian Heart Disease Data

Grün, Bettina, Malsiner-Walli, Gertraud January 2018 (has links)
Latent class analysis explains dependency structures in multivariate categorical data by assuming the presence of latent classes. We investigate the specification of suitable priors for the Bayesian latent class model to determine the number of classes and perform variable selection. Estimation is possible using standard tools implementing general-purpose Markov chain Monte Carlo sampling techniques, such as the software JAGS. However, class-specific inference requires suitable post-processing in order to eliminate label switching. The proposed Bayesian specification and analysis method is applied to the Hungarian heart disease data set to determine the number of classes and identify relevant variables, and results are compared to those obtained with the standard prior for the component-specific parameters.
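[Editor's note: the paper's own approach is Bayesian, with shrinkage priors and MCMC, e.g. via JAGS. As a rough non-Bayesian analogue, here is an EM point-estimate sketch of a latent class model for binary items; the two-class structure, the independent-Bernoulli likelihood, and the synthetic data are assumptions for illustration.]

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, K = 400, 8, 2                       # subjects, binary items, classes
true_theta = np.array([[0.8] * d, [0.2] * d])   # item probabilities per class
z = rng.integers(0, K, n)
Y = (rng.random((n, d)) < true_theta[z]).astype(float)

# EM for a mixture of independent Bernoullis (classical latent class analysis)
pi = np.full(K, 1.0 / K)                  # class weights
theta = rng.uniform(0.3, 0.7, (K, d))     # item probabilities
for _ in range(200):
    # E-step: posterior class membership probabilities
    log_lik = (Y @ np.log(theta).T) + ((1 - Y) @ np.log(1 - theta).T)
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: update weights and item probabilities
    pi = post.mean(axis=0)
    theta = np.clip((post.T @ Y) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)

print("class weights:", np.round(pi, 2))
print("item probabilities:\n", np.round(theta, 2))
```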
28

An Information Based Optimal Subdata Selection Algorithm for Big Data Linear Regression and a Suitable Variable Selection Algorithm

January 2017 (has links)
This article proposes a new information-based optimal subdata selection (IBOSS) algorithm, the Squared Scaled Distance Algorithm (SSDA). It is based on the invariance of the determinant of the information matrix under orthogonal transformations, especially rotations. Extensive simulation results show that the new IBOSS algorithm retains the nice asymptotic properties of IBOSS and gives a larger determinant of the subdata information matrix. It has the same order of time complexity as the D-optimal IBOSS algorithm; however, it exploits the advantages of vectorized calculation, avoiding explicit loops, and is approximately 6 times as fast as the D-optimal IBOSS algorithm in R. The robustness of SSDA is studied from three aspects: nonorthogonality, the inclusion of interaction terms, and variable misspecification. A new, accurate variable selection algorithm is proposed to help the implementation of IBOSS algorithms when a large number of variables are present with sparse important variables among them. By aggregating random-subsample results, this variable selection algorithm is much more accurate than the LASSO method applied to the full data. Since its time complexity depends on the number of variables only, it is also very computationally efficient when the number of variables is fixed, and not massively large, as n increases. More importantly, by using subsamples it solves the problem that the full data cannot be stored in memory when a data set is too large. / Dissertation/Thesis / Masters Thesis Statistics 2017
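[Editor's note: the SSDA itself is not reproduced here. As context, the following is a sketch of the baseline D-optimal IBOSS rule it is compared against, which for each covariate in turn keeps the rows with the most extreme values among those not yet selected. The even split of the subdata budget across covariates and the synthetic data are assumptions.]

```python
import numpy as np

def d_optimal_iboss(X, k):
    """For each covariate in turn, keep the r smallest and r largest
    remaining rows, where r = k // (2 * p)."""
    n, p = X.shape
    r = k // (2 * p)
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.where(available)[0]
        order = np.argsort(X[idx, j])
        picks = np.concatenate([idx[order[:r]], idx[order[-r:]]])
        available[picks] = False
        chosen.append(picks)
    return np.concatenate(chosen)

rng = np.random.default_rng(5)
X = rng.standard_normal((100_000, 5))
sub = d_optimal_iboss(X, k=1000)
full_det = np.linalg.det(X.T @ X)
sub_det = np.linalg.det(X[sub].T @ X[sub])
print(f"subdata size {sub.size}, information determinant ratio {sub_det / full_det:.2e}")
```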
30

Subgroup Identification in Clinical Trials

Li, Xiaochen 04 1900 (has links)
Subgroup analyses assess the heterogeneity of treatment effects in groups of patients defined by patients' baseline characteristics. Identifying subgroups of patients with differential treatment effects is crucial for tailored therapeutics and personalized medicine. Model-based variable selection methods are well developed and widely applied to select significant treatment-by-covariate interactions for subgroup analyses. Machine learning and data-driven methods for subgroup identification have also been developed. In this dissertation, I consider two different types of subgroup identification methods: one is nonparametric and machine learning based, and the other is model based. In the first part, the problem of subgroup identification is cast as an optimization problem, and a stochastic search technique is implemented to partition the whole population into disjoint subgroups with differential treatment effects. In the second approach, an integrative three-step model-based variable selection method is proposed for subgroup analyses in longitudinal data. Using this three-step variable selection framework, informative features and their interactions with the treatment indicator can be identified for subgroup analysis in longitudinal data; the method can be extended to longitudinal binary or categorical data. Simulation studies and real data examples are used to demonstrate the performance of the proposed methods.
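[Editor's note: a minimal sketch of the model-based route described above — build treatment-by-covariate interaction terms and let an l1 penalty pick out the interactions that signal a differential treatment effect. The cross-sectional (rather than longitudinal) setup, the Lasso as the selector, and the synthetic data are assumptions for illustration.]

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
n, p = 500, 10
X = rng.standard_normal((n, p))          # baseline covariates
trt = rng.integers(0, 2, n)              # randomized treatment indicator
# true model: treatment helps only when covariate 3 is high
y = X[:, 0] + trt * (1.0 + 2.0 * X[:, 3]) + rng.standard_normal(n)

# design: main effects, treatment, and treatment-by-covariate interactions
design = np.hstack([X, trt[:, None], trt[:, None] * X])
fit = LassoCV(cv=5).fit(design, y)
inter = fit.coef_[p + 1:]                # the interaction block
print("covariates with differential treatment effect:",
      np.where(inter != 0)[0])
```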
