41

Bayesian variable selection for linear mixed models when p is much larger than n with applications in genome wide association studies

Williams, Jacob Robert Michael 05 June 2023 (has links)
Genome-wide association studies (GWAS) seek to identify single nucleotide polymorphisms (SNPs) causing phenotypic responses in individuals. Commonly, GWAS analyses are done by single marker association testing (SMA), which investigates the effect of one SNP at a time and selects a candidate set of SNPs using a strict multiple-testing penalty. Because SNPs are not independent but strongly correlated, SMA methods lead to such high false discovery rates (FDR) that the results are difficult for wet-lab scientists to use. To address this, the dissertation proposes three novel Bayesian methods: BICOSS, BGWAS, and IEB. From a Bayesian modeling point of view, SNP search can be seen as a variable selection problem in linear mixed models (LMMs) where $p$ is much larger than $n$. To deal with the $p \gg n$ issue, our three proposed methods use novel Bayesian approaches based on two steps: a screening step and a model selection step. To control false discoveries, we link the screening and model selection steps through a common probability of a null SNP. For model selection, we propose novel priors that extend nonlocal priors, Zellner's g-prior, the unit information prior, and the Zellner-Siow prior to LMMs. For each method, extensive simulation studies and case studies show that these methods improve the recall of true causal SNPs and, more importantly, drastically decrease the FDR. Because our Bayesian methods provide more focused and precise results, they may speed up the discovery of important SNPs and significantly contribute to scientific progress in the areas of biology, agricultural productivity, and human health. / Doctor of Philosophy / Genome-wide association studies (GWAS) seek to identify locations in DNA, known as single nucleotide polymorphisms (SNPs), that are the underlying cause of observable traits such as height or breast cancer. Commonly, GWAS analyses are performed by investigating each SNP individually and seeing which SNPs are highly correlated with the response. However, as the SNPs themselves are highly correlated, investigating each one individually leads to a high number of false positives. To address this, the dissertation proposes three advanced statistical methods: BICOSS, BGWAS, and IEB. Through extensive simulations, our methods are shown to not only drastically reduce the number of falsely detected SNPs but also increase the detection rate of true causal SNPs. Because our novel methods provide more focused and precise results, they may speed up the discovery of important SNPs and significantly contribute to scientific progress in the areas of biology, agricultural productivity, and human health.
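The two-step screening/model-selection structure described in this abstract can be illustrated with a toy sketch. This is not the dissertation's BICOSS/BGWAS/IEB code: it uses ordinary least squares with a BIC score as a stand-in for the LMM with nonlocal or g-type priors, ignores the kinship random effect, and the screening size and simulated data are illustrative assumptions.

```python
# Toy two-step SNP search: marginal screening, then greedy model selection.
# BIC over OLS fits stands in for the Bayesian priors used in the dissertation.
import numpy as np

def bic(y, X):
    """BIC of an ordinary least-squares fit of y on X (intercept included)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X]) if X.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return n * np.log(rss / n) + Xd.shape[1] * np.log(n)

def screen_then_select(y, G, keep=50):
    """Step 1: rank SNPs by marginal BIC improvement and keep the best `keep`.
    Step 2: greedy forward selection among the screened SNPs."""
    base = bic(y, np.empty((len(y), 0)))
    gains = [base - bic(y, G[:, [j]]) for j in range(G.shape[1])]
    screened = np.argsort(gains)[::-1][:keep]

    model, current, improved = [], base, True
    while improved:
        improved = False
        for j in screened:
            if j in model:
                continue
            score = bic(y, G[:, model + [j]])
            if score < current:           # smaller BIC = better model
                current, model = score, model + [j]
                improved = True
    return sorted(model)

# Simulated example: n = 200 individuals, p = 2000 SNPs, 3 causal SNPs.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(200, 2000)).astype(float)
y = G[:, [5, 40, 900]] @ np.array([0.8, -0.6, 0.5]) + rng.normal(size=200)
print(screen_then_select(y, G))
```

In the actual methods, a random effect for relatedness and a common null-SNP probability replace the fixed `keep` cutoff used here.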
42

Cognitive Diagnostic Model, a Simulated-Based Study: Understanding Compensatory Reparameterized Unified Model (CRUM)

Galeshi, Roofia 28 November 2012 (has links)
A recent trend in education has been toward formative assessments that enable teachers, parents, and administrators to help students succeed. Cognitive diagnostic modeling (CDM) has the potential to provide valuable information for stakeholders to help students identify their skill deficiencies in specific academic subjects. Cognitive diagnosis models are mainly viewed as a family of latent class confirmatory probabilistic models; they allow the mapping of students' skill profiles and academic ability. Using complex simulation studies, the methodological issues in one of the existing cognitive models, the compensatory reparameterized unified model (CRUM) from the log-linear family of CDMs, were investigated. For practitioners to implement these models, their item parameter recovery and examinee classification need to be studied in detail. A series of complex simulated datasets were generated with the following designs: three attributes with seven items, three attributes with thirty-five items, four attributes with fifteen items, and five attributes with thirty-one items. Each dataset was generated with 50, 100, 500, 1,000, 5,000, and 10,000 examinees. The first manuscript reports how accurately CRUM could recover item parameters and classify examinees under true Q-matrix specification and various research designs. The results suggested that the test length relative to the number of attributes, together with the sample size, affects item parameter recovery and examinee classification accuracy. The second manuscript reports the sensitivity of relative fit indices in detecting misfit for over- and opposite-Q-matrix misspecifications. The relative fit indices under investigation were the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the sample-size-adjusted Bayesian information criterion (ssaBIC). The results suggested that the CRUM can be a robust model given consideration of the number of observations and the item/attribute combinations. The findings of this dissertation fill some of the existing methodological gaps regarding cognitive models' applicability and generalizability, and help practitioners design tests in the CDM framework in order to attain reliable and valid results. / Ph. D.
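For reference, the three relative fit indices named above can be computed from a fitted model's log-likelihood as sketched below. The sample-size-adjusted BIC uses the common (n + 2)/24 adjustment; the exact variant used in the dissertation's software is an assumption here, so verify before relying on exact values.

```python
# Relative fit indices (AIC, BIC, ssaBIC) from a fitted model's log-likelihood.
import math

def fit_indices(loglik: float, n_params: int, n_obs: int) -> dict:
    aic = -2 * loglik + 2 * n_params
    bic = -2 * loglik + n_params * math.log(n_obs)
    ssabic = -2 * loglik + n_params * math.log((n_obs + 2) / 24)
    return {"AIC": aic, "BIC": bic, "ssaBIC": ssabic}

# Hypothetical example: correctly specified vs. over-specified Q-matrix fit.
print(fit_indices(loglik=-5432.1, n_params=21, n_obs=1000))
print(fit_indices(loglik=-5428.7, n_params=29, n_obs=1000))
```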
43

Stacking Ensemble for auto_ml

Ngo, Khai Thoi 13 June 2018 (has links)
Machine learning has been a subject of intense study across many different industries and academic research areas. Companies and researchers have taken full advantage of various machine learning approaches to solve their problems; however, considerable understanding and study of the field is required for developers to fully harness the potential of different machine learning models and to achieve efficient results. This thesis therefore begins by comparing auto_ml with other hyper-parameter optimization techniques. auto_ml is a fully autonomous framework that lessens the knowledge prerequisite for accomplishing complicated machine learning tasks. The auto_ml framework automatically selects the best features from a given dataset and chooses the best model to fit and predict the data. Through multiple tests, auto_ml outperforms MLP and other similar frameworks on various datasets while using a small amount of processing time. The thesis then proposes and implements a stacking ensemble technique in the auto_ml framework in order to build protection against over-fitting on small datasets. Stacking is a technique that combines a collection of machine learning models' predictions to arrive at a final prediction. The stacked auto_ml ensemble results are more stable and consistent than those of the original framework across different training sizes for all analyzed small datasets. / Master of Science
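A generic illustration of the stacking idea (not the auto_ml framework's own implementation) is sketched below: base learners' cross-validated predictions become features for a final meta-learner, which is the usual guard against over-fitting on small datasets. The dataset, base learners, and meta-learner are illustrative choices.

```python
# Generic stacking sketch with scikit-learn: two base learners feed a Ridge
# meta-learner trained on their out-of-fold predictions.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
                ("svr", SVR(C=1.0))],
    final_estimator=Ridge(),   # meta-learner combining the base predictions
    cv=5,
)
print(cross_val_score(stack, X, y, cv=5).mean())
```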
44

Classification et inférence de réseaux pour les données RNA-seq / Clustering and network inference for RNA-seq data

Gallopin, Mélina 09 December 2015 (has links)
This thesis gathers methodological contributions to the statistical analysis of data from transcriptome sequencing technologies (RNA-seq). RNA-seq count data are discrete, and the number of samples sequenced is usually small because of the cost of the technology; these two points are the main statistical challenges in modelling RNA-seq data. The first part of the thesis is dedicated to co-expression analysis of RNA-seq data using model-based clustering, with the aim of detecting modules of co-expressed genes. A natural model for discrete RNA-seq data is a Poisson mixture model; however, a Gaussian mixture model applied to simply transformed data is a reasonable alternative. We propose to compare, for each RNA-seq dataset, the two modelling choices using an objective, data-driven criterion that selects the model best suited to the data. In addition, we present a model selection criterion that takes external biological information on the genes into account, which makes it easier to obtain biologically interpretable clusters. This criterion is not specific to RNA-seq data; it is useful in any co-expression analysis based on mixture models that aims to enrich functional gene annotation databases. The second part of the thesis is dedicated to network inference using graphical models, where the goal is to detect dependence relationships among gene expression levels. We propose a network inference model based on Poisson distributions, taking into account the discrete nature and high inter-sample variability of RNA-seq data. However, network inference methods require a large number of samples. In the framework of the Gaussian graphical model, a competing model, we present a non-asymptotic approach for selecting relevant subsets of genes by decomposing the covariance matrix into diagonal blocks. This method is not specific to RNA-seq data and reduces the dimension of any network inference problem based on the Gaussian graphical model.
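The block-diagonal idea from the second part can be sketched as follows: threshold the empirical gene-gene correlation matrix and take connected components as candidate blocks, so network inference can then be run block by block. The threshold and simulated data are arbitrary illustrations, not the non-asymptotic criterion developed in the thesis.

```python
# Sketch: detect candidate gene blocks from a thresholded correlation matrix.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(1)
expr = rng.normal(size=(30, 200))          # 30 samples x 200 genes (log scale)
corr = np.corrcoef(expr, rowvar=False)     # gene-gene correlation matrix

adj = (np.abs(corr) > 0.5).astype(int)     # illustrative threshold
np.fill_diagonal(adj, 0)
n_blocks, labels = connected_components(csr_matrix(adj), directed=False)
print(f"{n_blocks} blocks; largest has {np.bincount(labels).max()} genes")
```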
45

Robust estimation of the number of components for mixtures of linear regression

Meng, Li January 1900 (has links)
Master of Science / Department of Statistics / Weixin Yao / In this report, we investigate robust estimation of the number of components in mixtures of regression models using a trimmed information criterion. Compared to traditional information criteria, the trimmed criterion is robust and not sensitive to outliers. The superiority of the trimmed methods over traditional information criterion methods is illustrated through a simulation study. A real data application is also used to illustrate the effectiveness of the trimmed model selection methods.
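The trimmed-criterion idea can be illustrated with a small sketch: per-observation log-likelihoods are sorted and the lowest fraction (putative outliers) is dropped before the criterion is computed. A plain Gaussian mixture stands in here for the mixture of regressions studied in the report, and the trimming fraction and data are illustrative assumptions.

```python
# Illustrative trimmed BIC for choosing the number of mixture components.
import numpy as np
from sklearn.mixture import GaussianMixture

def trimmed_bic(x, k, alpha=0.05):
    gm = GaussianMixture(n_components=k, random_state=0).fit(x)
    ll = np.sort(gm.score_samples(x))[int(alpha * len(x)):]   # drop worst 5%
    d = x.shape[1]
    n_par = (k - 1) + k * (d + d * (d + 1) / 2)               # weights, means, covariances
    return -2 * ll.sum() + n_par * np.log(len(ll))

rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 1, (150, 1)), rng.normal(6, 1, (150, 1)),
               rng.normal(40, 1, (5, 1))])                    # 5 gross outliers
print({k: round(trimmed_bic(x, k), 1) for k in (1, 2, 3, 4)})
```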
46

Machine learning approaches for assessing moderate-to-severe diarrhea in children < 5 years of age, rural western Kenya 2008-2012

Ayers, Tracy L 13 May 2016 (has links)
Worldwide, diarrheal disease is a leading cause of morbidity and mortality in children less than five years of age, and incidence and disease severity remain highest in sub-Saharan Africa. Kenya has an estimated 400,000 severe diarrhea episodes and 9,500 diarrhea-related deaths per year in children. Current statistical methods for estimating etiological and exposure risk factors for moderate-to-severe diarrhea (MSD) in children are constrained by the inability to assess a large number of parameters, owing to limitations of sample size, complex relationships, correlated predictors, and model assumptions of linearity. This dissertation examines machine learning methods that address weaknesses of traditional logistic regression models. The studies presented here investigate data from a 4-year, prospective, matched case-control study of MSD among children less than five years of age in rural Kenya, part of the Global Enteric Multicenter Study. Three machine learning approaches were used to examine associations with MSD: the least absolute shrinkage and selection operator (lasso), classification trees, and random forests. A principal finding in all three studies was that machine learning approaches are useful and feasible to implement in epidemiological studies. All three provided additional information and understanding of the data beyond what logistic regression models alone could offer, and their results were supported by comparable logistic regression results, indicating their usefulness as epidemiological tools. This dissertation offers an exploration of methodological alternatives that should be considered more frequently in diarrheal disease epidemiology, and in public health in general.
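The three learners named above can be compared against a logistic regression baseline with a simple cross-validated sketch. The data are synthetic and the matched case-control structure of the study is ignored here for brevity; all settings are illustrative assumptions.

```python
# Cross-validated comparison of logistic regression, lasso-penalized logistic
# regression, a classification tree, and a random forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=60, n_informative=8,
                           random_state=0)
models = {
    "logistic":       LogisticRegression(max_iter=2000),
    "lasso-logistic": LogisticRegression(penalty="l1", C=0.1,
                                         solver="liblinear", max_iter=2000),
    "tree":           DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest":  RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```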
47

Fully Bayesian Analysis of Multivariate Latent Class Models with an Application to Metric Conjoint Analysis

Frühwirth-Schnatter, Sylvia, Otter, Thomas, Tüchler, Regina January 2000 (has links) (PDF)
In this paper we pursue a fully Bayesian analysis of the latent class model with an a priori unknown number of classes. Estimation is carried out by means of Markov chain Monte Carlo (MCMC) methods. We deal explicitly with the consequences that the unidentifiability of this type of model has for MCMC estimation. Joint Bayesian estimation of all latent variables, model parameters, and parameters determining the probability law of the latent process is carried out by a new MCMC method called permutation sampling. In a first run we use the random permutation sampler to sample from the unconstrained posterior. We demonstrate that a lot of important information, such as estimates of the subject-specific regression coefficients, is available from such an unidentified model. The MCMC output of the random permutation sampler is explored in order to find suitable identifiability constraints. In a second run we use the permutation sampler to sample from the constrained posterior by imposing these identifiability constraints. The unknown number of classes is determined by formal Bayesian model comparison through exact model likelihoods. We apply a new method of computing model likelihoods for latent class models that is based on bridge sampling. The approach is applied to simulated data and to data from a metric conjoint analysis in the Austrian mineral water market. (author's abstract) / Series: Forschungsberichte / Institut für Statistik
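A minimal illustration of the random-permutation step is given below: after each sweep, the class labels of the stored draws are randomly permuted, so the output covers all labelings of the label-invariant unconstrained posterior. The draws here are fake placeholders standing in for real MCMC output, not the paper's sampler.

```python
# Random relabelling of stored component-specific draws (toy illustration).
import numpy as np

rng = np.random.default_rng(3)
n_draws, n_classes = 1000, 3
mu_draws = rng.normal(loc=[0.0, 2.0, 5.0], scale=0.1, size=(n_draws, n_classes))

for m in range(n_draws):
    perm = rng.permutation(n_classes)      # random relabelling of the classes
    mu_draws[m] = mu_draws[m, perm]

# After permutation the marginal distribution of each class mean is identical,
# which is why identifiability constraints are found by inspecting this output.
print(mu_draws.mean(axis=0))
```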
48

Narrowing the gap between network models and real complex systems

Viamontes Esquivel, Alcides January 2014 (has links)
Simple network models that focus only on graph topology or, at best, basic interactions are often insufficient to capture all the aspects of a dynamic complex system. In this thesis, I explore those limitations and some concrete methods of resolving them. I argue that, in order to succeed at interpreting and influencing complex systems, we need to take slightly more complex parts, interactions, and information flows into account in our models. This thesis supports that claim with five examples of applied research. Each case study takes a closer look at the dynamics of the studied problem and complements the network model with techniques from information theory, machine learning, discrete mathematics and/or ergodic theory. By using these techniques to study the concrete dynamics of each system, we could obtain interesting new information. Concretely, we could obtain better models of network walks, which are used in everyday applications such as journal ranking. We could also uncover asymptotic characteristics of an agent-based information propagation model, which we think underlies phenomena such as belief propagation and technology adoption in society. And finally, we could spot associations between antibiotic resistance genes in bacterial populations, a problem that is becoming more serious every day.
49

Model selection strategies in genome-wide association studies

Keildson, Sarah January 2011 (has links)
Unravelling the genetic architecture of common diseases is a continuing challenge in human genetics. While genome-wide association studies (GWAS) have proven successful in identifying many new disease susceptibility loci, the extension of these studies beyond single-SNP methods of analysis has been limited. The incorporation of multi-locus methods of analysis may, however, increase the power of GWAS to detect genes of smaller effect size, as well as genes that interact with each other and the environment. This investigation carried out large-scale simulations of four multi-locus model selection techniques, namely forward and backward selection, Bayesian model averaging (BMA), and least angle regression with a lasso modification (lasso), in order to compare the type I error rates and power of each method. At a type I error rate of ~5%, lasso showed the highest power across varied effect sizes, disease frequencies, and genetic models. Lasso-penalized regression was then used to perform three different types of analysis on GWAS data. Firstly, lasso was applied to the Wellcome Trust Case Control Consortium (WTCCC) data and identified many of the WTCCC SNPs that had a moderate-to-strong association (p < 10^-5) with type 2 diabetes (T2D), as well as some of the moderate WTCCC associations (p < 10^-4) that have since been replicated in a large-scale meta-analysis. Secondly, lasso was used to fine-map the 17q21 childhood asthma risk locus and identified putative secondary signals in the 17q21 region that may further contribute to childhood asthma risk. Finally, lasso identified three interaction effects potentially contributing to coronary artery disease (CAD) risk. While the validity of these findings hinges on their replication in follow-up studies, the results suggest that lasso may provide scientists with exciting new methods of dissecting, and ultimately understanding, the complex genetic framework underlying common human diseases.
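The lasso step can be sketched on a simulated SNP genotype matrix using a LARS-based lasso implementation (the "least angle regression with a lasso modification" referenced above), with the penalty chosen by cross-validation. The simulated data and the use of a quantitative trait instead of case-control status are illustrative assumptions.

```python
# LARS-based lasso on a simulated genotype matrix; penalty chosen by CV.
import numpy as np
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(4)
G = rng.binomial(2, 0.25, size=(1500, 300)).astype(float)   # individuals x SNPs
trait = G[:, [10, 120]] @ np.array([0.4, -0.3]) + rng.normal(size=1500)

fit = LassoLarsCV(cv=5).fit(G, trait)
print(np.flatnonzero(fit.coef_))   # SNPs retained in the selected model
```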
50

Statistical Models and Analysis of Growth Processes in Biological Tissue

Xia, Jun 15 December 2016 (has links)
The mechanisms that control growth processes in biological tissues have attracted continuous research interest despite their complexity. With the emergence of big-data experimental approaches, there is an urgent need to develop statistical and computational models that fit the experimental data and can be used to make predictions to guide future research. In this work we apply statistical methods to the growth processes of different biological tissues, focusing on the development of neuron dendrites and tumor cells. We first examine the neuron growth process, which has implications for neural tissue regeneration, using a computational model with a uniform branching probability and a maximum overall length constraint. One crucial outcome is that we can relate the parameter fits from our model to real data from our experimental collaborators, in order to examine the usefulness of our model under different biological conditions. Our methods can now directly compare branching probabilities across experimental conditions and provide confidence intervals for these population-level measures. In addition, we have obtained analytical results showing that the underlying probability distribution for this process increases as a geometric progression at nearby distances and decreases as an approximately geometric series in faraway regions, which can be used to estimate the spatial location of the maximum of the probability distribution. This result is important, since we would expect the maximum number of dendrites in this region; the estimate is related to the probability of success in finding a neural target at that distance during a blind search. We then examined tumor growth processes, which evolve in a similar way in the sense that an initial rapid growth eventually becomes limited by resource constraints. For the tumor cells, we found that an exponential growth model best describes the experimental data, based on the accuracy and robustness of the candidate models. Furthermore, we incorporated this growth rate model into logistic regression models that predict the growth rate of each patient from biomarkers; this formulation can be very useful for clinical trials. Overall, this study aimed to assess the molecular and clinicopathological determinants of breast cancer (BC) growth rate in vivo.
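The exponential growth model mentioned above can be fitted to a short series of tumor-volume measurements with a standard least-squares routine. The numbers below are synthetic, chosen only for illustration, and are not the study's data.

```python
# Toy fit of an exponential growth model to synthetic tumor-volume measurements.
import numpy as np
from scipy.optimize import curve_fit

def exp_growth(t, v0, r):
    return v0 * np.exp(r * t)

t = np.array([0.0, 30, 60, 90, 120, 150])        # days since baseline
v = np.array([1.0, 1.3, 1.8, 2.3, 3.1, 4.0])     # relative tumor volume
(v0, r), _ = curve_fit(exp_growth, t, v, p0=(1.0, 0.01))
print(f"growth rate r = {r:.4f} per day, doubling time = {np.log(2) / r:.1f} days")
```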
