1 |
A Bayesian Analysis of a Multiple Choice Test
Luo, Zhisui 24 April 2013
In a multiple choice test, examinees earn points according to the number of correct responses. This traditional grading, however, assumes that the questions in the test are replications of each other. We apply an item response theory model to estimate students' abilities, taking into account the features of each item in a midterm test. Our Bayesian logistic item response theory model relates the probability of a correct response to three parameters: one measures the student's ability, and the other two measure an item's difficulty and its discrimination. In this model the ability and discrimination parameters are not identifiable. To address this issue, we construct a hierarchical Bayesian model that nullifies the effects of non-identifiability. A Gibbs sampler is used to make inference and to obtain posterior distributions of the three parameters. For a "nonparametric" approach, we implement the item response theory model using a Dirichlet process mixture model. This new approach enables us to grade and cluster students automatically based on their "ability". Although the Dirichlet process mixture model has very good clustering properties, it suffers from expensive and complicated computations. A slice sampling algorithm has been proposed to address this issue. We apply our methodology to a real dataset from a multiple choice test in WPI's Applied Statistics I (Spring 2012), which illustrates how a student's ability relates to the observed scores.
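The non-identifiability mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact parameterization, so the code below assumes the common two-parameter logistic form, P(correct) = 1 / (1 + exp(-a(theta - b))); the function name `p_correct` and all numeric values are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Probability of a correct response under an assumed two-parameter logistic form."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical values: one student (ability theta) and one item
# (discrimination a, difficulty b).
theta, a, b = 0.8, 1.5, 0.2
p_original = p_correct(theta, a, b)

# Rescale ability and difficulty by c and discrimination by 1/c.
# The product a * (theta - b) is unchanged, so the likelihood alone
# cannot distinguish the two parameter sets.
c = 2.0
p_rescaled = p_correct(c * theta, a / c, c * b)

print(p_original, p_rescaled)
```

Both calls give the same probability, which is why some extra structure, such as the hierarchical prior described in the abstract, is needed to pin down the scale.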
|
2 |
The Cauchy-Net Mixture Model for Clustering with Anomalous Data
Slifko, Matthew D. 11 September 2019
We live in the data explosion era. The unprecedented amount of data offers a potential wealth of knowledge but also brings about concerns regarding ethical collection and usage. Mistakes stemming from anomalous data have the potential for severe, real-world consequences, such as when building prediction models for housing prices. To combat anomalies, we develop the Cauchy-Net Mixture Model (CNMM). The CNMM is a flexible Bayesian nonparametric tool that employs a mixture between a Dirichlet Process Mixture Model (DPMM) and a Cauchy distributed component, which we call the Cauchy-Net (CN). Each portion of the model offers benefits: the DPMM removes the need to fix the number of components in advance, and the CN uses its heavy tails to capture observations that do not belong to the well-defined components. By isolating the anomalous observations in a single component, we simultaneously flag the observations in the net as warranting further inspection and prevent them from interfering with the formation of the remaining components. The result is a framework that allows for simultaneously clustering observations and making predictions in the face of anomalous data. We demonstrate the usefulness of the CNMM in a variety of experimental situations and apply the model to predicting housing prices in Fairfax County, Virginia. / Doctor of Philosophy / We live in the data explosion era. The unprecedented amount of data offers a potential wealth of knowledge but also brings about concerns regarding ethical collection and usage. Mistakes stemming from anomalous data have the potential for severe, real-world consequences, such as when building prediction models for housing prices. To combat anomalies, we develop the Cauchy-Net Mixture Model (CNMM). The CNMM is a flexible tool for identifying and isolating the anomalies, while simultaneously discovering cluster structure and making predictions among the nonanomalous observations. The result is a framework that allows for simultaneously clustering and predicting in the face of anomalous data. We demonstrate the usefulness of the CNMM in a variety of experimental situations and apply the model to predicting housing prices in Fairfax County, Virginia.
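The role of the heavy-tailed component can be sketched in one dimension. This is not the CNMM itself: a single standard Gaussian stands in for the DPMM clusters, and the mixing weight `w_net` and the Cauchy location and scale are assumed values chosen only for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def cauchy_pdf(x, loc, scale):
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))

def net_responsibility(x, w_net=0.05):
    """Posterior probability that x belongs to the Cauchy component (assumed weights)."""
    p_cluster = (1.0 - w_net) * normal_pdf(x, mu=0.0, sigma=1.0)
    p_net = w_net * cauchy_pdf(x, loc=0.0, scale=1.0)
    return p_net / (p_cluster + p_net)

# Points near the cluster centre stay with the Gaussian; far points go to the net.
for x in (0.5, 3.0, 8.0):
    print(x, round(net_responsibility(x), 3))
```

Because the Gaussian density decays much faster than the Cauchy density, the net's share of the posterior grows with distance from the cluster, even though its prior weight is small.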
|
3 |
Advanced Nonparametric Bayesian Functional Modeling
Gao, Wenyu 04 September 2020
Functional analyses have gained more interest as we have easier access to massive data sets. However, such data sets often exhibit substantial heterogeneity, noise, and high dimensionality. When generalizing the analyses from vectors to functions, classical methods might not work directly. This dissertation considers noisy information reduction in functional analyses from two perspectives: functional variable selection to reduce the dimensionality and functional clustering to group similar observations and thus reduce the sample size. The complicated data structures and relations can be easily modeled by a Bayesian hierarchical model, or developed from a more generic one by changing the prior distributions. Hence, this dissertation focuses on the development of Bayesian approaches for functional analyses because of their flexibility.
A nonparametric Bayesian approach, such as the Dirichlet process mixture (DPM) model, has a nonparametric distribution as the prior. This approach provides flexibility and reduces assumptions, especially for functional clustering, because the DPM model has an automatic clustering property, so the number of clusters does not need to be specified in advance. Furthermore, a weighted Dirichlet process mixture (WDPM) model accommodates more heterogeneity in the data by assuming more than one unknown prior distribution. It also gathers more information from the data by introducing a weight function that assigns different candidate priors, such that less similar observations are more separated. Thus, the WDPM model improves the clustering and model estimation results.
In this dissertation, we used an advanced nonparametric Bayesian approach to study functional variable selection and functional clustering methods. We proposed 1) a stochastic search functional selection method with application to 1-M matched case-crossover studies for aseptic meningitis, to examine the time-varying unknown relationship and identify important covariates affecting disease contraction; 2) a functional clustering method via the WDPM model, with application to three pathways related to genetic diabetes data, to identify essential genes distinguishing between normal and disease groups; and 3) a combined functional clustering, with the WDPM model, and variable selection approach with application to high-frequency spectral data, to select wavelengths associated with breast cancer racial disparities. / Doctor of Philosophy / As we have easier access to massive data sets, functional analyses have attracted growing interest as a way to analyze data that provide information about curves, surfaces, or other objects varying over a continuum. However, such data sets often contain substantial heterogeneity and noise. When generalizing the analyses from vectors to functions, classical methods might not work directly. This dissertation considers noisy information reduction in functional analyses from two perspectives: functional variable selection to reduce the dimensionality and functional clustering to group similar observations and thus reduce the sample size. The complicated data structures and relations can be easily modeled by a Bayesian hierarchical model because of its flexibility. Hence, this dissertation focuses on the development of nonparametric Bayesian approaches for functional analyses. Our proposed methods can be applied in various settings: epidemiological studies on aseptic meningitis with clustered binary data, genetic diabetes data, and breast cancer racial disparities.
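The automatic clustering property of the DPM model mentioned above can be illustrated through the Chinese restaurant process, the partition prior implied by a Dirichlet process. The sketch below only draws from that prior; it is not the dissertation's WDPM model or a posterior sampler, and the function name `crp_assignments` and the concentration value `alpha=1.0` are assumptions.

```python
import random

def crp_assignments(n, alpha, rng):
    """Assign n observations to clusters by the Chinese restaurant process."""
    counts = []   # counts[k] = number of observations currently in cluster k
    labels = []
    for _ in range(n):
        # An existing cluster k has weight counts[k]; a new cluster has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        labels.append(k)
    return labels, counts

rng = random.Random(0)
labels, counts = crp_assignments(200, alpha=1.0, rng=rng)
print(len(counts), counts)
```

The number of clusters is an outcome of the draw rather than an input, which is the sense in which it does not need to be specified in advance.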
|
4 |
Bayesian variable selection in clustering via Dirichlet process mixture models
Kim, Sinae 17 September 2007
The increased collection of high-dimensional data in various fields has raised a strong interest in clustering algorithms and variable selection procedures. In this dissertation, I propose a model-based method that addresses the two problems simultaneously. I use Dirichlet process mixture models to define the cluster structure and to introduce in the model a latent binary vector to identify discriminating variables. I update the variable selection index using a Metropolis algorithm and obtain inference on the cluster structure via a split-merge Markov chain Monte Carlo technique. I evaluate the method on simulated data and illustrate an application with a DNA microarray study. I also show that the methodology can be adapted to the problem of clustering functional high-dimensional data. There I employ wavelet thresholding methods in order to reduce the dimension of the data and to remove noise from the observed curves. I then apply variable selection and sample clustering methods in the wavelet domain. Thus my methodology is wavelet-based and aims at clustering the curves while identifying wavelet coefficients describing discriminating local features. I exemplify the method on high-dimensional and high-frequency tidal volume traces measured under an induced panic attack model in normal humans.
|
5 |
A Non-parametric Bayesian Method for Hierarchical Clustering of Longitudinal Data
Ren, Yan 23 October 2012
No description available.
|
6 |
Approaches to Find the Functionally Related Experiments Based on Enrichment Scores: Infinite Mixture Model Based Cluster Analysis for Gene Expression Data
Li, Qian 18 October 2013
No description available.
|
7 |
Bayesian Nonparametric Reliability Analysis Using Dirichlet Process Mixture Model
Cheng, Nan 03 October 2011
No description available.
|
8 |
Out-of-distribution Recognition and Classification of Time-Series Pulsed Radar Signals / Out-of-distribution Igenkänning och Klassificering av Pulserade Radar Signaler
Hedvall, Paul January 2022
This thesis investigates out-of-distribution recognition for time-series data of pulsed radar signals. The classifier is a naive Bayesian classifier based on Gaussian mixture models and Dirichlet process mixture models. In the mixture models, we model the distribution of three pulse features in the time series, namely the radio frequency in the pulse, the duration of the pulse, and the pulse repetition interval, which is the time between pulses. We found that simple thresholds on the likelihood can effectively determine whether samples are out-of-distribution or belong to one of the classes trained on. In addition, we present a simple method that can be used for deinterleaving/pulse classification and show that it can robustly classify 100 interleaved signals and simultaneously determine whether pulses are out-of-distribution. / This thesis investigates how a machine learning model can be adapted to recognize when pulsed radar signals do not belong to the distribution the model was trained on, while also recognizing whether a signal belongs to a previously known class. The classification model used here is a naive Bayesian classifier that uses Gaussian mixture models and Dirichlet process mixture models. The model forms a distribution over the time-series data of pulsed radar signals, specifically over the frequency of each pulse, the length of the pulse, and the time to the next pulse. By setting thresholds on the likelihood of each pulse or of a sequence, we can recognize whether the data are unknown or belong to a previously known class. We also present a simple method for classifying specific pulses in settings where several signals overlap, and show that the method can be used to robustly determine whether pulses are unknown.
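The likelihood-threshold idea can be sketched as follows. This is a minimal illustration, not the thesis's classifier: the class names, mixture parameters, feature units, and the threshold of -20 are all hypothetical, and each feature is modeled by an independent one-dimensional Gaussian mixture in the naive Bayes style.

```python
import math

def log_normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_mixture_pdf(x, components):
    """Log density of a 1-D Gaussian mixture given (weight, mean, std) triples."""
    terms = [math.log(w) + log_normal_pdf(x, mu, sigma) for w, mu, sigma in components]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))  # log-sum-exp

# Hypothetical classes: one mixture per pulse feature, in the order
# (radio frequency, pulse duration, pulse repetition interval).
CLASSES = {
    "emitter_a": [[(1.0, 9.4, 0.05)], [(1.0, 1.0, 0.1)], [(0.5, 1.0, 0.02), (0.5, 1.2, 0.02)]],
    "emitter_b": [[(1.0, 3.1, 0.05)], [(1.0, 5.0, 0.3)], [(1.0, 2.5, 0.05)]],
}

def classify(pulse, threshold=-20.0):
    """Return the most likely class, or flag the pulse if its log-likelihood is too low."""
    scores = {
        name: sum(log_mixture_pdf(x, comps) for x, comps in zip(pulse, features))
        for name, features in CLASSES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "out-of-distribution"

print(classify((9.41, 1.05, 1.19)))
print(classify((3.08, 4.8, 2.52)))
print(classify((6.0, 2.0, 4.0)))
```

In practice the threshold would have to be tuned on held-out data; the fixed value here only shows where it enters the decision.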
|
9 |
Statistical methods for variant discovery and functional genomic analysis using next-generation sequencing data
Tang, Man 03 January 2020
The development of high-throughput next-generation sequencing (NGS) techniques produces massive amounts of data, allowing the identification of biomarkers in early disease diagnosis and driving the transformation of most disciplines in biology and medicine. Greater effort is needed to develop novel, powerful, and efficient tools for NGS data analysis. This dissertation focuses on modeling "omics" data in various NGS applications with a primary goal of developing novel statistical methods to identify sequence variants, find transcription factor (TF) binding patterns, and decode the relationship between TF and gene expression levels. Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in NGS applications. Existing methods for calling these variants often make the simplifying assumption of positional independence and fail to leverage the dependence of genotypes at nearby loci induced by linkage disequilibrium. We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short read data. Simulation experiments show that, under various sequencing depths, vi-HMM outperforms existing methods in terms of sensitivity and F1 score. When applied to human whole genome sequencing data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs. One important NGS application is chromatin immunoprecipitation followed by sequencing (ChIP-seq), which characterizes protein-DNA relations through genome-wide mapping of TF binding sites. Multiple TFs, binding to DNA sequences, often show complex binding patterns, which indicate how TFs with similar functionalities work together to regulate the expression of target genes. To help uncover the transcriptional regulation mechanism, we propose a novel nonparametric Bayesian method to detect the clustering pattern of multiple-TF bindings from ChIP-seq datasets. A simulation study demonstrates that our method performs best with regard to precision, recall, and F1 score, in comparison to traditional methods. We also apply the method to real data and observe several TF clusters that have been recognized previously in mouse embryonic stem cells. Recent advances in ChIP-seq and RNA sequencing (RNA-Seq) technologies provide more reliable and accurate characterization of TF binding sites and gene expression measurements, which serve as a basis for studying the regulatory functions of TFs on gene expression. We propose a log Gaussian Cox process with a wavelet-based functional model to quantify the relationship between TF binding site locations and gene expression levels. Through a simulation study, we demonstrate that our method performs well, especially with large sample size and small variance. It also shows a remarkable ability to distinguish real local features in the function estimates. / Doctor of Philosophy / The development of high-throughput next-generation sequencing (NGS) techniques produces massive amounts of data and brings about innovations in biology and medicine. Greater effort is needed to develop novel, powerful, and efficient tools for NGS data analysis. In this dissertation, we mainly focus on three problems closely related to NGS and its applications: (1) how to improve variant calling accuracy, (2) how to model transcription factor (TF) binding patterns, and (3) how to quantify the contribution of TF binding to gene expression. We develop novel statistical methods to identify sequence variants, find TF binding patterns, and explore the relationship between TF binding and gene expression. We expect our findings will be helpful in promoting a better understanding of disease causality and facilitating the design of personalized treatments.
|
10 |
Recurrent-Event Models for Change-Points Detection
Li, Qing 23 December 2015
The driving risk of novice teenagers is highest during the initial period after licensure but decreases rapidly. This dissertation develops recurrent-event change-point models to detect the time when driving risk decreases significantly for novice teenage drivers. The dissertation consists of three major parts: the first part applies recurrent-event change-point models with identical change-points for all subjects; the second part proposes models that allow change-points to vary among drivers through a hierarchical Bayesian finite mixture model; the third part develops a non-parametric Bayesian model with a Dirichlet process prior. In the first part, two recurrent-event change-point models to detect the time of change in driving risks are developed. The models are based on a non-homogeneous Poisson process with piecewise constant intensity functions. It is shown that the change-points only occur at the event times and the maximum likelihood estimators are consistent. The proposed models are applied to the Naturalistic Teenage Driving Study, which continuously recorded in situ driving behaviour of 42 novice teenage drivers for the first 18 months after licensure using sophisticated in-vehicle instrumentation. The results indicate that the crash and near-crash rate decreases significantly after 73 hours of independent driving after licensure. The models in part one assume identical change-points for all drivers. However, several studies showed that different patterns of risk change over time might exist among teenagers, which implies that the change-points might not be identical among drivers. In the second part, change-points are allowed to vary among drivers through a hierarchical Bayesian finite mixture model, considering that clusters exist among the teenagers. The prior for the mixture proportions is a Dirichlet distribution, and a Markov chain Monte Carlo algorithm is developed to sample from the posterior distributions. DIC is used to determine the best number of clusters. In the simulation study, the model performs well under different scenarios. For the Naturalistic Teenage Driving Study data, three clusters exist among the teenagers: the change-points are 52.30, 108.99 and 150.20 hours of driving after first licensure, respectively, for the three clusters; the intensity rate increases for the first cluster and decreases for the other two clusters; the change-point of the first cluster is the earliest and its average intensity rate is the highest. In the second part, model selection is conducted to determine the number of clusters. An alternative is the Bayesian non-parametric approach. In the third part, a Dirichlet process mixture model is proposed, where the change-points are assigned a Dirichlet process prior. A Markov chain Monte Carlo algorithm is developed to sample from the posterior distributions. Automatic clustering is expected based on change-points without specifying the number of latent clusters. Based on the Dirichlet process mixture model, three clusters exist among the teenage drivers in the Naturalistic Teenage Driving Study. The change-points of the three clusters are 96.31, 163.83, and 279.19 hours. The results provide critical information for safety education, safety countermeasure development, and Graduated Driver Licensing policy making. / Ph. D.
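The first part's setting, a Poisson process with piecewise constant intensity, can be sketched for a single subject with one change-point. The code below is a simplified illustration under those assumptions, not the dissertation's models: the function names, event times, and observation horizon are hypothetical. It uses the property stated in the abstract that change-points occur at event times, so only those are searched.

```python
import math

def profile_loglik(events, tau, horizon):
    """Profile log-likelihood of a Poisson process whose rate changes once at tau."""
    n1 = sum(1 for t in events if t <= tau)
    n2 = len(events) - n1
    total = 0.0
    for n, length in ((n1, tau), (n2, horizon - tau)):
        if n > 0:
            rate = n / length                          # MLE of the rate on this segment
            total += n * math.log(rate) - rate * length
    return total

def estimate_change_point(events, horizon):
    """Search candidate change-points over the observed event times only."""
    candidates = [t for t in events if 0.0 < t < horizon]
    return max(candidates, key=lambda tau: profile_loglik(events, tau, horizon))

# Hypothetical event times (hours of driving): frequent early, sparse later.
events = [2, 5, 9, 12, 16, 19, 23, 27, 30, 55, 80, 110, 140, 170]
tau_hat = estimate_change_point(events, horizon=200.0)
print(tau_hat)
```

For each candidate the segment rates are replaced by their maximum likelihood estimates (event count divided by segment length), so the search is over the change-point alone.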
|