41

Bayesian variable selection for linear mixed models when p is much larger than n with applications in genome wide association studies

Williams, Jacob Robert Michael 05 June 2023 (has links)
Genome-wide association studies (GWAS) seek to identify single nucleotide polymorphisms (SNPs) causing phenotypic responses in individuals. Commonly, GWAS analyses are done by single marker association testing (SMA), which investigates the effect of one SNP at a time and selects a candidate set of SNPs using a strict multiple-testing penalty. Because SNPs are not independent but strongly correlated, SMA methods lead to such high false discovery rates (FDR) that the results are difficult for wet-lab scientists to use. To address this, the dissertation proposes three novel Bayesian methods: BICOSS, BGWAS, and IEB. From a Bayesian modeling point of view, SNP search can be seen as a variable selection problem in linear mixed models (LMMs) where $p$ is much larger than $n$. To deal with the $p \gg n$ issue, our three proposed methods use novel Bayesian approaches based on two steps: a screening step and a model selection step. To control false discoveries, we link the screening and model selection steps through a common probability of a null SNP. For model selection, we propose novel priors that extend nonlocal priors, Zellner's g-prior, the unit information prior, and the Zellner-Siow prior to LMMs. For each method, extensive simulation studies and case studies show that these methods improve the recall of true causal SNPs and, more importantly, drastically decrease the FDR. Because our Bayesian methods provide more focused and precise results, they may speed up the discovery of important SNPs and significantly contribute to scientific progress in the areas of biology, agricultural productivity, and human health. / Doctor of Philosophy / Genome-wide association studies (GWAS) seek to identify locations in DNA, known as single nucleotide polymorphisms (SNPs), that are the underlying cause of observable traits such as height or breast cancer. Commonly, GWAS analyses are performed by investigating each SNP individually and seeing which SNPs are highly correlated with the response. However, as the SNPs themselves are highly correlated, investigating each one individually leads to a high number of false positives. To address this, the dissertation proposes three advanced statistical methods: BICOSS, BGWAS, and IEB. Through extensive simulations, our methods are shown to not only drastically reduce the number of falsely detected SNPs but also increase the detection rate of true causal SNPs. Because our novel methods provide more focused and precise results, they may speed up the discovery of important SNPs and significantly contribute to scientific progress in the areas of biology, agricultural productivity, and human health.
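The two-step screening/model-selection structure described in this abstract can be illustrated with a toy sketch. This is not the dissertation's BICOSS/BGWAS/IEB code: it uses ordinary least squares with a BIC score as a stand-in for the LMM with nonlocal or g-type priors, ignores the kinship random effect, and the screening size and simulated data are illustrative assumptions.

```python
# Toy two-step SNP search: marginal screening, then greedy model selection.
# BIC over OLS fits stands in for the Bayesian priors used in the dissertation.
import numpy as np

def bic(y, X):
    """BIC of an ordinary least-squares fit of y on X (intercept included)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X]) if X.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return n * np.log(rss / n) + Xd.shape[1] * np.log(n)

def screen_then_select(y, G, keep=50):
    """Step 1: rank SNPs by marginal BIC improvement and keep the best `keep`.
    Step 2: greedy forward selection among the screened SNPs."""
    base = bic(y, np.empty((len(y), 0)))
    gains = [base - bic(y, G[:, [j]]) for j in range(G.shape[1])]
    screened = np.argsort(gains)[::-1][:keep]

    model, current, improved = [], base, True
    while improved:
        improved = False
        for j in screened:
            if j in model:
                continue
            score = bic(y, G[:, model + [j]])
            if score < current:           # smaller BIC = better model
                current, model = score, model + [j]
                improved = True
    return sorted(model)

# Simulated example: n = 200 individuals, p = 2000 SNPs, 3 causal SNPs.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(200, 2000)).astype(float)
y = G[:, [5, 40, 900]] @ np.array([0.8, -0.6, 0.5]) + rng.normal(size=200)
print(screen_then_select(y, G))
```

In the actual methods, a random effect for relatedness and a common null-SNP probability replace the fixed `keep` cutoff used here.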
42

Cognitive Diagnostic Model, a Simulated-Based Study: Understanding Compensatory Reparameterized Unified Model (CRUM)

Galeshi, Roofia 28 November 2012 (has links)
A recent trend in education has been toward formative assessments that enable teachers, parents, and administrators to help students succeed. Cognitive diagnostic modeling (CDM) has the potential to provide valuable information for stakeholders to help students identify their skill deficiencies in specific academic subjects. Cognitive diagnosis models are mainly viewed as a family of latent class confirmatory probabilistic models; they allow the mapping of students' skill profiles and academic ability. Using complex simulation studies, the methodological issues in one of the existing cognitive models, the compensatory reparameterized unified model (CRUM) from the log-linear family of CDMs, were investigated. For practitioners to implement these models, their item parameter recovery and examinee classification need to be studied in detail. A series of complex simulated datasets were generated with the following designs: three attributes with seven items, three attributes with thirty-five items, four attributes with fifteen items, and five attributes with thirty-one items. Each dataset was generated with 50, 100, 500, 1,000, 5,000, and 10,000 examinees. The first manuscript reports how accurately CRUM could recover item parameters and classify examinees under true Q-matrix specification and various research designs. The results suggested that the test length relative to the number of attributes, together with the sample size, affects item parameter recovery and examinee classification accuracy. The second manuscript reports the sensitivity of relative fit indices in detecting misfit for over- and opposite-Q-matrix misspecifications. The relative fit indices under investigation were the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the sample-size-adjusted Bayesian information criterion (ssaBIC). The results suggested that the CRUM can be a robust model given consideration of the number of observations and the item/attribute combinations. The findings of this dissertation fill some of the existing methodological gaps regarding cognitive models' applicability and generalizability, and help practitioners design tests in the CDM framework in order to attain reliable and valid results. / Ph. D.
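For reference, the three relative fit indices named above can be computed from a fitted model's log-likelihood as sketched below. The sample-size-adjusted BIC uses the common (n + 2)/24 adjustment; the exact variant used in the dissertation's software is an assumption here, so verify before relying on exact values.

```python
# Relative fit indices (AIC, BIC, ssaBIC) from a fitted model's log-likelihood.
import math

def fit_indices(loglik: float, n_params: int, n_obs: int) -> dict:
    aic = -2 * loglik + 2 * n_params
    bic = -2 * loglik + n_params * math.log(n_obs)
    ssabic = -2 * loglik + n_params * math.log((n_obs + 2) / 24)
    return {"AIC": aic, "BIC": bic, "ssaBIC": ssabic}

# Hypothetical example: correctly specified vs. over-specified Q-matrix fit.
print(fit_indices(loglik=-5432.1, n_params=21, n_obs=1000))
print(fit_indices(loglik=-5428.7, n_params=29, n_obs=1000))
```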
43

Stacking Ensemble for auto_ml

Ngo, Khai Thoi 13 June 2018 (has links)
Machine learning has been a subject of intense study across many different industries and academic research areas. Companies and researchers have taken full advantage of various machine learning approaches to solve their problems; however, considerable understanding and study of the field is required for developers to fully harness the potential of different machine learning models and to achieve efficient results. This thesis therefore begins by comparing auto_ml with other hyper-parameter optimization techniques. auto_ml is a fully autonomous framework that lessens the knowledge prerequisite for accomplishing complicated machine learning tasks. The auto_ml framework automatically selects the best features from a given dataset and chooses the best model to fit and predict the data. Through multiple tests, auto_ml outperforms MLP and other similar frameworks on various datasets while using a small amount of processing time. The thesis then proposes and implements a stacking ensemble technique in the auto_ml framework in order to build protection against over-fitting on small datasets. Stacking is a technique that combines a collection of machine learning models' predictions to arrive at a final prediction. The stacked auto_ml ensemble results are more stable and consistent than those of the original framework across different training sizes for all analyzed small datasets. / Master of Science
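A generic illustration of the stacking idea (not the auto_ml framework's own implementation) is sketched below: base learners' cross-validated predictions become features for a final meta-learner, which is the usual guard against over-fitting on small datasets. The dataset, base learners, and meta-learner are illustrative choices.

```python
# Generic stacking sketch with scikit-learn: two base learners feed a Ridge
# meta-learner trained on their out-of-fold predictions.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
                ("svr", SVR(C=1.0))],
    final_estimator=Ridge(),   # meta-learner combining the base predictions
    cv=5,
)
print(cross_val_score(stack, X, y, cv=5).mean())
```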
44

Classification et inférence de réseaux pour les données RNA-seq / Clustering and network inference for RNA-seq data

Gallopin, Mélina 09 December 2015 (has links)
This thesis gathers methodological contributions to the statistical analysis of data from transcriptome sequencing technologies (RNA-seq). RNA-seq count data are discrete, and the number of samples sequenced is usually small because of the cost of the technology; these two points are the main statistical challenges in modelling RNA-seq data. The first part of the thesis is dedicated to co-expression analysis of RNA-seq data using model-based clustering, with the aim of detecting modules of co-expressed genes. A natural model for discrete RNA-seq data is a Poisson mixture model; however, a Gaussian mixture model applied to simply transformed data is a reasonable alternative. We propose to compare, for each RNA-seq dataset, the two modelling choices using an objective, data-driven criterion that selects the model best suited to the data. In addition, we present a model selection criterion that takes external biological information on the genes into account, which makes it easier to obtain biologically interpretable clusters. This criterion is not specific to RNA-seq data; it is useful in any co-expression analysis based on mixture models that aims to enrich functional gene annotation databases. The second part of the thesis is dedicated to network inference using graphical models, where the goal is to detect dependence relationships among gene expression levels. We propose a network inference model based on Poisson distributions, taking into account the discrete nature and high inter-sample variability of RNA-seq data. However, network inference methods require a large number of samples. In the framework of the Gaussian graphical model, a competing model, we present a non-asymptotic approach for selecting relevant subsets of genes by decomposing the covariance matrix into diagonal blocks. This method is not specific to RNA-seq data and reduces the dimension of any network inference problem based on the Gaussian graphical model.
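The block-diagonal idea from the second part can be sketched as follows: threshold the empirical gene-gene correlation matrix and take connected components as candidate blocks, so network inference can then be run block by block. The threshold and simulated data are arbitrary illustrations, not the non-asymptotic criterion developed in the thesis.

```python
# Sketch: detect candidate gene blocks from a thresholded correlation matrix.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(1)
expr = rng.normal(size=(30, 200))          # 30 samples x 200 genes (log scale)
corr = np.corrcoef(expr, rowvar=False)     # gene-gene correlation matrix

adj = (np.abs(corr) > 0.5).astype(int)     # illustrative threshold
np.fill_diagonal(adj, 0)
n_blocks, labels = connected_components(csr_matrix(adj), directed=False)
print(f"{n_blocks} blocks; largest has {np.bincount(labels).max()} genes")
```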
45

Robust estimation of the number of components for mixtures of linear regression

Meng, Li January 1900 (has links)
Master of Science / Department of Statistics / Weixin Yao / In this report, we investigate robust estimation of the number of components in mixtures of regression models using a trimmed information criterion. Compared to traditional information criteria, the trimmed criterion is robust and not sensitive to outliers. The superiority of the trimmed methods over traditional information criterion methods is illustrated through a simulation study. A real data application is also used to illustrate the effectiveness of the trimmed model selection methods.
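The trimmed-criterion idea can be illustrated with a small sketch: per-observation log-likelihoods are sorted and the lowest fraction (putative outliers) is dropped before the criterion is computed. A plain Gaussian mixture stands in here for the mixture of regressions studied in the report, and the trimming fraction and data are illustrative assumptions.

```python
# Illustrative trimmed BIC for choosing the number of mixture components.
import numpy as np
from sklearn.mixture import GaussianMixture

def trimmed_bic(x, k, alpha=0.05):
    gm = GaussianMixture(n_components=k, random_state=0).fit(x)
    ll = np.sort(gm.score_samples(x))[int(alpha * len(x)):]   # drop worst 5%
    d = x.shape[1]
    n_par = (k - 1) + k * (d + d * (d + 1) / 2)               # weights, means, covariances
    return -2 * ll.sum() + n_par * np.log(len(ll))

rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 1, (150, 1)), rng.normal(6, 1, (150, 1)),
               rng.normal(40, 1, (5, 1))])                    # 5 gross outliers
print({k: round(trimmed_bic(x, k), 1) for k in (1, 2, 3, 4)})
```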
46

Machine learning approaches for assessing moderate-to-severe diarrhea in children < 5 years of age, rural western Kenya 2008-2012

Ayers, Tracy L 13 May 2016 (has links)
Worldwide, diarrheal disease is a leading cause of morbidity and mortality in children less than five years of age, and incidence and disease severity remain highest in sub-Saharan Africa. Kenya has an estimated 400,000 severe diarrhea episodes and 9,500 diarrhea-related deaths per year in children. Current statistical methods for estimating etiological and exposure risk factors for moderate-to-severe diarrhea (MSD) in children are constrained by the inability to assess a large number of parameters, owing to limitations of sample size, complex relationships, correlated predictors, and model assumptions of linearity. This dissertation examines machine learning methods that address weaknesses of traditional logistic regression models. The studies presented here investigate data from a 4-year, prospective, matched case-control study of MSD among children less than five years of age in rural Kenya, part of the Global Enteric Multicenter Study. Three machine learning approaches were used to examine associations with MSD: the least absolute shrinkage and selection operator (lasso), classification trees, and random forests. A principal finding in all three studies was that machine learning approaches are useful and feasible to implement in epidemiological studies. All three provided additional information and understanding of the data beyond what logistic regression models alone could offer, and their results were supported by comparable logistic regression results, indicating their usefulness as epidemiological tools. This dissertation offers an exploration of methodological alternatives that should be considered more frequently in diarrheal disease epidemiology, and in public health in general.
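The three learners named above can be compared against a logistic regression baseline with a simple cross-validated sketch. The data are synthetic and the matched case-control structure of the study is ignored here for brevity; all settings are illustrative assumptions.

```python
# Cross-validated comparison of logistic regression, lasso-penalized logistic
# regression, a classification tree, and a random forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=60, n_informative=8,
                           random_state=0)
models = {
    "logistic":       LogisticRegression(max_iter=2000),
    "lasso-logistic": LogisticRegression(penalty="l1", C=0.1,
                                         solver="liblinear", max_iter=2000),
    "tree":           DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest":  RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```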
47

Fully Bayesian Analysis of Multivariate Latent Class Models with an Application to Metric Conjoint Analysis

Frühwirth-Schnatter, Sylvia, Otter, Thomas, Tüchler, Regina January 2000 (has links) (PDF)
In this paper we pursue a fully Bayesian analysis of the latent class model with an a priori unknown number of classes. Estimation is carried out by means of Markov chain Monte Carlo (MCMC) methods. We deal explicitly with the consequences that the unidentifiability of this type of model has for MCMC estimation. Joint Bayesian estimation of all latent variables, model parameters, and parameters determining the probability law of the latent process is carried out by a new MCMC method called permutation sampling. In a first run we use the random permutation sampler to sample from the unconstrained posterior. We demonstrate that a lot of important information, such as estimates of the subject-specific regression coefficients, is available from such an unidentified model. The MCMC output of the random permutation sampler is explored in order to find suitable identifiability constraints. In a second run we use the permutation sampler to sample from the constrained posterior by imposing these identifiability constraints. The unknown number of classes is determined by formal Bayesian model comparison through exact model likelihoods. We apply a new method of computing model likelihoods for latent class models that is based on bridge sampling. The approach is applied to simulated data and to data from a metric conjoint analysis in the Austrian mineral water market. (author's abstract) / Series: Forschungsberichte / Institut für Statistik
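A minimal illustration of the random-permutation step is given below: after each sweep, the class labels of the stored draws are randomly permuted, so the output covers all labelings of the label-invariant unconstrained posterior. The draws here are fake placeholders standing in for real MCMC output, not the paper's sampler.

```python
# Random relabelling of stored component-specific draws (toy illustration).
import numpy as np

rng = np.random.default_rng(3)
n_draws, n_classes = 1000, 3
mu_draws = rng.normal(loc=[0.0, 2.0, 5.0], scale=0.1, size=(n_draws, n_classes))

for m in range(n_draws):
    perm = rng.permutation(n_classes)      # random relabelling of the classes
    mu_draws[m] = mu_draws[m, perm]

# After permutation the marginal distribution of each class mean is identical,
# which is why identifiability constraints are found by inspecting this output.
print(mu_draws.mean(axis=0))
```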
48

Narrowing the gap between network models and real complex systems

Viamontes Esquivel, Alcides January 2014 (has links)
Simple network models that focus only on graph topology or, at best, basic interactions are often insufficient to capture all the aspects of a dynamic complex system. In this thesis, I explore those limitations and some concrete methods of resolving them. I argue that, in order to succeed at interpreting and influencing complex systems, we need to take slightly more complex parts, interactions, and information flows into account in our models. This thesis supports that claim with five examples of applied research. Each case study takes a closer look at the dynamics of the studied problem and complements the network model with techniques from information theory, machine learning, discrete mathematics and/or ergodic theory. By using these techniques to study the concrete dynamics of each system, we could obtain interesting new information. Concretely, we could obtain better models of network walks, which are used in everyday applications such as journal ranking. We could also uncover asymptotic characteristics of an agent-based information propagation model, which we think underlies phenomena such as belief propagation and technology adoption in society. And finally, we could spot associations between antibiotic resistance genes in bacterial populations, a problem that is becoming more serious every day.
49

Model selection strategies in genome-wide association studies

Keildson, Sarah January 2011 (has links)
Unravelling the genetic architecture of common diseases is a continuing challenge in human genetics. While genome-wide association studies (GWAS) have proven successful in identifying many new disease susceptibility loci, the extension of these studies beyond single-SNP methods of analysis has been limited. The incorporation of multi-locus methods of analysis may, however, increase the power of GWAS to detect genes of smaller effect size, as well as genes that interact with each other and the environment. This investigation carried out large-scale simulations of four multi-locus model selection techniques, namely forward and backward selection, Bayesian model averaging (BMA), and least angle regression with a lasso modification (lasso), in order to compare the type I error rates and power of each method. At a type I error rate of ~5%, lasso showed the highest power across varied effect sizes, disease frequencies, and genetic models. Lasso-penalized regression was then used to perform three different types of analysis on GWAS data. Firstly, lasso was applied to the Wellcome Trust Case Control Consortium (WTCCC) data and identified many of the WTCCC SNPs that had a moderate-to-strong association (p < 10^-5) with type 2 diabetes (T2D), as well as some of the moderate WTCCC associations (p < 10^-4) that have since been replicated in a large-scale meta-analysis. Secondly, lasso was used to fine-map the 17q21 childhood asthma risk locus and identified putative secondary signals in the 17q21 region that may further contribute to childhood asthma risk. Finally, lasso identified three interaction effects potentially contributing to coronary artery disease (CAD) risk. While the validity of these findings hinges on their replication in follow-up studies, the results suggest that lasso may provide scientists with exciting new methods of dissecting, and ultimately understanding, the complex genetic framework underlying common human diseases.
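The lasso step can be sketched on a simulated SNP genotype matrix using a LARS-based lasso implementation (the "least angle regression with a lasso modification" referenced above), with the penalty chosen by cross-validation. The simulated data and the use of a quantitative trait instead of case-control status are illustrative assumptions.

```python
# LARS-based lasso on a simulated genotype matrix; penalty chosen by CV.
import numpy as np
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(4)
G = rng.binomial(2, 0.25, size=(1500, 300)).astype(float)   # individuals x SNPs
trait = G[:, [10, 120]] @ np.array([0.4, -0.3]) + rng.normal(size=1500)

fit = LassoLarsCV(cv=5).fit(G, trait)
print(np.flatnonzero(fit.coef_))   # SNPs retained in the selected model
```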
50

Statistical Models and Analysis of Growth Processes in Biological Tissue

Xia, Jun 15 December 2016 (has links)
The mechanisms that control growth processes in biological tissues have attracted continuous research interest despite their complexity. With the emergence of big-data experimental approaches, there is an urgent need to develop statistical and computational models that fit the experimental data and can be used to make predictions to guide future research. In this work we apply statistical methods to the growth processes of different biological tissues, focusing on the development of neuron dendrites and tumor cells. We first examine the neuron growth process, which has implications for neural tissue regeneration, using a computational model with a uniform branching probability and a maximum overall length constraint. One crucial outcome is that we can relate the parameter fits from our model to real data from our experimental collaborators, in order to examine the usefulness of our model under different biological conditions. Our methods can now directly compare branching probabilities across experimental conditions and provide confidence intervals for these population-level measures. In addition, we have obtained analytical results showing that the underlying probability distribution for this process increases as a geometric progression at nearby distances and decreases as an approximately geometric series in faraway regions, which can be used to estimate the spatial location of the maximum of the probability distribution. This result is important, since we would expect the maximum number of dendrites in this region; the estimate is related to the probability of success in finding a neural target at that distance during a blind search. We then examined tumor growth processes, which evolve in a similar way in the sense that an initial rapid growth eventually becomes limited by resource constraints. For the tumor cells, we found that an exponential growth model best describes the experimental data, based on the accuracy and robustness of the candidate models. Furthermore, we incorporated this growth rate model into logistic regression models that predict the growth rate of each patient from biomarkers; this formulation can be very useful for clinical trials. Overall, this study aimed to assess the molecular and clinicopathological determinants of breast cancer (BC) growth rate in vivo.
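The exponential growth model mentioned above can be fitted to a short series of tumor-volume measurements with a standard least-squares routine. The numbers below are synthetic, chosen only for illustration, and are not the study's data.

```python
# Toy fit of an exponential growth model to synthetic tumor-volume measurements.
import numpy as np
from scipy.optimize import curve_fit

def exp_growth(t, v0, r):
    return v0 * np.exp(r * t)

t = np.array([0.0, 30, 60, 90, 120, 150])        # days since baseline
v = np.array([1.0, 1.3, 1.8, 2.3, 3.1, 4.0])     # relative tumor volume
(v0, r), _ = curve_fit(exp_growth, t, v, p0=(1.0, 0.01))
print(f"growth rate r = {r:.4f} per day, doubling time = {np.log(2) / r:.1f} days")
```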
