71 |
Analysis of Three-Way Data and Other Topics in Clustering and Classification
Gallaugher, Michael Patrick Brian January 2020
Clustering and classification are processes for finding underlying group structure in heterogeneous data. With the rise of the “big data” phenomenon, more complex data structures have often made traditional clustering methods inadvisable or infeasible. This thesis presents methodology for analyzing three different examples of these more complex data types. The first is three-way (matrix variate) data, or data that come in the form of matrices. A large emphasis is placed on clustering skewed three-way data and high-dimensional three-way data. The second is clickstream data, which considers a user’s internet search patterns. Finally, co-clustering methodology is discussed for very high-dimensional two-way (multivariate) data. Parameter estimation for all these methods is based on the expectation-maximization (EM) algorithm. Both simulated and real data are used for illustration. / Thesis / Doctor of Philosophy (PhD)
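As a rough illustration of the EM algorithm mentioned in this abstract, the sketch below fits a two-component univariate Gaussian mixture. It is not the thesis's matrix variate, clickstream, or co-clustering methodology; the function names, starting values, and simulated data are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=100):
    """Fit a univariate Gaussian mixture by EM; returns (weights, means, variances)."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted updates of mixing proportions, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor guards against a collapsing component
    return weights, means, variances


# Simulated data from two well-separated groups (hypothetical example values).
rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(300)] + [rng.gauss(3.0, 1.0) for _ in range(300)]
weights, means, variances = em_gmm_1d(
    data, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

The responsibilities computed in the E-step are what a model-based clustering method would use to assign each observation to a group.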
|
72 |
Outlier Detection in Gaussian Mixture Models
Clark, Katharine January 2020
Unsupervised classification is a problem often plagued by outliers, yet there is a paucity of work on handling outliers in unsupervised classification. Mixtures of Gaussian distributions are a popular choice in model-based clustering. A single outlier can affect parameter estimation and, as such, must be accounted for. This issue is further complicated by the presence of multiple outliers. Predicting the proportion of outliers correctly is paramount as it minimizes misclassification error. It is proved that, for a finite Gaussian mixture model, the log-likelihoods of the subset models are distributed according to a mixture of beta-type distributions. This relationship is leveraged in two ways. First, an algorithm is proposed that predicts the proportion of outliers by measuring the adherence of a set of subset log-likelihoods to a beta-type mixture reference distribution. This algorithm removes the least likely points, which are deemed outliers, until model assumptions are met. Second, a hypothesis test is developed, which, at a chosen significance level, can test whether a dataset contains a single outlier. / Thesis / Master of Science (MSc)
|
73 |
The wild bootstrap resampling in regression imputation algorithm with a Gaussian Mixture Model
Mat Jasin, A., Neagu, Daniel, Csenki, Attila 08 July 2018
Unsupervised learning of a finite Gaussian mixture model (FGMM) is used to learn the distribution of population data. This paper proposes the use of wild bootstrapping to create variability in the imputed data in single missing data imputation. We compare the performance and accuracy of the proposed method in single imputation against multiple imputation from the R package Amelia II using RMSE, R-squared, MAE and MAPE. The proposed method shows better performance when compared with multiple imputation (MI), which is widely regarded as the gold standard among missing data imputation techniques.
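The idea of adding wild-bootstrap variability to a regression imputation can be sketched as follows. This is only an assumption-laden toy: it uses a simple linear regression rather than the paper's Gaussian mixture model, Rademacher (plus or minus one) weights, and a residual borrowed at random from the observed cases, since a missing case has no residual of its own. The function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def wild_bootstrap_impute(xs, ys, rng):
    """Replace None entries of ys with a fitted value plus a sign-flipped residual."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])
    residuals = [y - (intercept + slope * x) for x, y in observed]
    completed = []
    for x, y in zip(xs, ys):
        if y is None:
            residual = rng.choice(residuals)   # borrowed from an observed case (assumption)
            weight = rng.choice((-1.0, 1.0))   # Rademacher wild-bootstrap weight
            completed.append(intercept + slope * x + weight * residual)
        else:
            completed.append(y)                # observed values are left untouched
    return completed


# Hypothetical data with two missing responses.
rng = random.Random(1)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.2, 2.9, None, 7.1, 8.8, None, 13.2, 14.9]
completed = wild_bootstrap_impute(xs, ys, rng)
```

Without the random residual term, every imputed value would sit exactly on the regression line, which understates the variability of the completed data.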
|
74 |
Minimum virgin binder limits in recycled Superpave (SR) mixes in Kansas
Tavakol, Masoumeh January 1900
Master of Science / Civil Engineering / Mustaque A. Hossain / Use of recycled materials in asphalt pavement has become widespread recently due to rising costs of virgin binder and increased attention to sustainability. Historically, recycled asphalt pavement (RAP) has been the most commonly used recycled material for hot-mix asphalt (HMA). However, recycled asphalt shingle (RAS), another recycled material, has recently become popular. Although there are some guidelines regarding use of RAP and RAS in HMA, their effects on mixture performance, especially on mixtures containing RAS, are not thoroughly understood.
In this research, three recycled Superpave mixture designs from the Kansas Department of Transportation (KDOT) with 9.5 mm (SR-9.5A) and 19 mm (SR-19A) Nominal Maximum Aggregate Size (NMAS) were selected as control mixtures. Mixtures containing higher percentages of recycled materials (RAP and RAS) were developed using KDOT blending charts. A total of nine mixtures with varying virgin binder contents were designed and assessed for moisture susceptibility, rutting resistance, and fatigue cracking propensity using modified Lottman, Hamburg Wheel Tracking Device, flow number, Dynamic Modulus, and S-VECD direct tension fatigue tests.
Results confirmed the effect of NMAS and material source on mixture performance. For SR-9.5A, the mixtures showed increased susceptibility to moisture and rutting damage below virgin binder content of 75%. For SR-19A, mixtures with virgin binder content of 70% showed satisfactory performance properties. Mixtures with virgin binder contents lower than 60% definitely showed inferior performance.
|
75 |
Statistical methods for species richness estimation using count data from multiple sampling units
Argyle, Angus Gordon 23 April 2012
The planet is experiencing a dramatic loss of species. The majority of species are unknown to science, and it is usually infeasible to conduct a census of a region to acquire a complete inventory of all life forms. Therefore, it is important to estimate and conduct statistical inference on the total number of species in a region based on samples obtained from field observations. Such estimates may suggest the number of species new to science and at potential risk of extinction.
In this thesis, we develop novel methodology to conduct statistical inference, based on abundance-based data collected from multiple sampling locations, on the number of species within a taxonomic group residing in a region. The primary contribution of this work is the formulation of novel statistical methodology for analysis in this setting, where abundances of species are recorded at multiple sampling units across a region. This particular area has received relatively little attention in the literature.
In the first chapter, the problem of estimating the number of species is formulated in a broad context, one that occurs in several seemingly unrelated fields of study. Estimators are commonly developed from statistical sampling models. Depending on the organisms or objects under study, different sampling techniques are used, and consequently, a variety of statistical models have been developed for this problem. A review of existing estimation methods, categorized by the associated sampling model, is presented in the second chapter.
The third chapter develops a new negative binomial mixture model. The negative binomial model is employed to account for the common tendency of individuals of a particular species to occur in clusters. An exponential mixing distribution permits inference on the number of species that exist in the region, but were in fact absent from the sampling units. Adopting a classical approach for statistical inference, we develop the maximum likelihood estimator, and a corresponding profile-log-likelihood interval estimate of species richness. In addition, a Gaussian-based confidence interval based on large-sample theory is presented.
The fourth chapter further extends the hierarchical model developed in Chapter 3 into a Bayesian framework. The motivation for the Bayesian paradigm is explained, and a hierarchical model based on random effects and discrete latent variables is presented. Computing the posterior distribution in this case is not straightforward. A data augmentation technique that indirectly places priors on species richness is employed to fit the model using a Metropolis-Hastings algorithm.
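For readers unfamiliar with Metropolis-Hastings, a generic random-walk version can be sketched as below. It does not reproduce the thesis's data augmentation scheme or hierarchical model; the target here is an assumed toy posterior for a Poisson rate under a Gamma(1, 1) prior, chosen because its exact posterior mean is known, and all names and data values are hypothetical.

```python
import math
import random


def log_posterior(lam, counts, a=1.0, b=1.0):
    """Log of Gamma(a, b) prior times Poisson likelihood, up to an additive constant."""
    if lam <= 0.0:
        return float("-inf")
    return (a - 1.0 + sum(counts)) * math.log(lam) - (b + len(counts)) * lam


def metropolis_hastings(log_target, start, n_samples, step, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Symmetric proposal, so the acceptance ratio is just the target ratio.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


rng = random.Random(42)
counts = [3, 5, 4, 6, 2, 4, 5, 3]  # hypothetical count data
samples = metropolis_hastings(lambda lam: log_posterior(lam, counts), 3.0, 20000, 0.5, rng)
kept = samples[2000:]               # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

In this toy case the exact posterior is Gamma(33, 9), so the sampled mean can be checked against 33/9; in models like the one described above, no such closed form is available, which is why sampling is needed.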
The fifth chapter examines the performance of our new methodology. Simulation studies are used to examine the mean-squared error of our proposed estimators. Comparisons to several commonly-used non-parametric estimators are made. Several conclusions emerge, and settings where our approaches can yield superior performance are clarified.
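The abstract does not name the non-parametric estimators used for comparison. One widely used example, assumed here purely for illustration, is the Chao1 estimator, which adds a correction based on the number of species seen exactly once (f1) and exactly twice (f2) to the observed species count.

```python
from collections import Counter


def chao1(abundances):
    """Chao1 lower-bound estimate of species richness from per-species counts."""
    observed = [a for a in abundances if a > 0]
    freq = Counter(observed)
    f1, f2 = freq[1], freq[2]  # singletons and doubletons
    if f2 > 0:
        return len(observed) + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used when no species was seen exactly twice.
    return len(observed) + f1 * (f1 - 1) / 2.0


# Hypothetical abundances: 7 observed species, 3 singletons, 2 doubletons.
estimate = chao1([5, 1, 1, 2, 2, 1, 3, 0])
```

Note that Chao1 pools all individuals into a single sample, whereas the methodology in this thesis is designed to use abundances recorded separately at multiple sampling units.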
In the sixth chapter, we present a case study. The methodology is applied to a real data set of oribatid mites (a taxonomic order of micro-arthropods) collected from multiple sites in a tropical rainforest in Panama. We adjust our statistical sampling models to account for the varying masses of material sampled from the sites. The resulting estimates of species richness for the oribatid mites are useful, and contribute to a wider investigation, currently underway, examining the species richness of all arthropods in the rainforest.
Our approaches are the only existing methods that can make full use of the abundance-based data from multiple sampling units located in a single region. The seventh and final chapter concludes the thesis with a discussion of key considerations related to implementation and modeling assumptions, and describes potential avenues for further investigation. / Graduate
|
76 |
A Gamma-Poisson topic model for short text
Mazarura, Jocelyn Rangarirai January 2020
Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in literature are admixture models, making the assumption that a document is generated from a mixture of topics.
In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model, which makes this one-topic-per-document assumption, is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.
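The basic building block behind a Gamma-Poisson model is the standard result that a Poisson count whose rate has a Gamma prior is marginally negative binomial. The sketch below evaluates that marginal for a single count; it is the textbook identity only, not the thesis's full mixture model or collapsed Gibbs sampler, and the shape and rate values are assumptions.

```python
import math


def gamma_poisson_pmf(x, shape, rate):
    """P(X = x) when X | lam ~ Poisson(lam) and lam ~ Gamma(shape, rate)."""
    log_p = (
        math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
        + shape * math.log(rate / (rate + 1.0))
        - x * math.log(rate + 1.0)
    )
    return math.exp(log_p)


# Hypothetical prior: shape 2, rate 0.5, so the marginal mean is shape / rate = 4.
shape, rate = 2.0, 0.5
probs = [gamma_poisson_pmf(x, shape, rate) for x in range(500)]
total = sum(probs)
mean = sum(x * p for x, p in enumerate(probs))
```

Integrating the rate out in closed form like this is what makes a collapsed sampler possible: word counts can be scored against a topic without sampling the Poisson rates explicitly.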
The application of GPM was then extended to a further real-world task: that of distinguishing between semantically similar and dissimilar texts. The objective was to determine whether GPM could produce semantic representations that allow the user to determine the relevance of new, unseen documents to a corpus of interest. The challenge of addressing this problem in short text from small corpora was of key interest. Corpora of small size are not uncommon. For example, at the start of the Coronavirus pandemic, limited research was available on the topic. Handling short text is challenging not only because of the sparsity of such text, but also because some corpora, such as chats between people, tend to be noisy. The performance of GPM was compared to that of word2vec under these challenging conditions on labelled corpora. It was found that GPM was able to produce better results based on accuracy, precision and recall in most cases. In addition, unlike word2vec, GPM was shown to be applicable to unlabelled datasets, and a methodology for this was also presented. Finally, a relevance index metric was introduced. This relevance index translates the similarity distance between a corpus of interest and a test document into the probability that the test document is semantically similar to the corpus of interest. / Thesis (PhD (Mathematical Statistics))--University of Pretoria, 2020. / Statistics / PhD (Mathematical Statistics) / Unrestricted
|
77 |
Approaches to Find the Functionally Related Experiments Based on Enrichment Scores: Infinite Mixture Model Based Cluster Analysis for Gene Expression Data
Li, Qian 18 October 2013
No description available.
|
78 |
Bayesian Nonparametric Reliability Analysis Using Dirichlet Process Mixture Model
Cheng, Nan 03 October 2011
No description available.
|
79 |
Extending Growth Mixture Models and Handling Missing Values via Mixtures of Non-Elliptical Distributions
Wei, Yuhong January 2017
Growth mixture models (GMMs) are used to model intra-individual change and inter-individual differences in change and to detect underlying group structure in longitudinal studies. Regularly, these models are fitted under the assumption of normality, an assumption that is frequently invalid. To this end, this thesis focuses on the development of novel non-elliptical growth mixture models to better fit real data. Two non-elliptical growth mixture models, via the multivariate skew-t distribution and the generalized hyperbolic distribution, are developed and applied to simulated and real data. Furthermore, these two non-elliptical growth mixture models are extended to accommodate missing values, which are near-ubiquitous in real data.
Recently, finite mixtures of non-elliptical distributions have flourished and facilitated the flexible clustering of data featuring longer tails and asymmetry. However, in practice, real data often have missing values, and so work in this direction is also pursued. A novel approach, via mixtures of the generalized hyperbolic distribution and mixtures of the multivariate skew-t distributions, is presented to handle missing values in a mixture model-based clustering context. To increase parsimony, families of mixture models have been developed by imposing constraints on the component scale matrices whenever missing data occur. Next, a mixture of generalized hyperbolic factor analyzers model is also proposed to cluster high-dimensional data with different patterns of missing values. Two missingness indicator matrices are also introduced to ease the computational burden. The algorithms used for parameter estimation are presented, and the performance of the methods is illustrated on simulated and real data. / Thesis / Doctor of Philosophy (PhD)
|
80 |
Hydrodynamic and gasification behavior of coal and biomass fluidized beds and their mixtures
Estejab, Bahareh 29 March 2016
This study uses CFD to improve understanding of the fluidization and gasification behavior of Geldart A particles. An extensive Eulerian-Eulerian numerical study was carried out, and simulations were validated against experiments conducted at Utah State University. In order to improve numerical predictions using an Eulerian-Eulerian model, drag models were assessed to determine if they were suitable for fine particles classified as Geldart A. The results showed that if static regions of mass in fluidized beds are neglected, most drag models work well with Geldart A particles. The most reliable drag model for both single and binary mixtures proved to be the Gidaspow-blend model. In order to capture the overshoot of pressure in homogeneous fluidization regions, a new modeling technique was proposed that modified the definition of the critical velocity in the Ergun correlation. The new modeling technique showed promising results for predicting fluidization behavior of fine particles. The fluidization behavior of three different mixtures of coal and poplar wood was studied. Although results indicated good mixing characteristics for all mixtures, there was a tendency for better mixing with higher percentages of poplar wood.
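For context, the standard Ergun equation and the minimum fluidization velocity it implies can be sketched as follows. This is the textbook form for spherical particles, not the modified critical-velocity definition proposed in the dissertation, and the particle and gas properties below are hypothetical values chosen only to resemble a fine powder in air.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def ergun_coefficients(d, eps, mu, rho_f):
    """Viscous (B) and inertial (A) coefficients of the Ergun pressure gradient."""
    b = 150.0 * mu * (1.0 - eps) ** 2 / (eps ** 3 * d ** 2)
    a = 1.75 * rho_f * (1.0 - eps) / (eps ** 3 * d)
    return a, b


def ergun_pressure_gradient(u, d, eps, mu, rho_f):
    """Pressure drop per unit bed height (Pa/m) at superficial velocity u."""
    a, b = ergun_coefficients(d, eps, mu, rho_f)
    return b * u + a * u ** 2


def minimum_fluidization_velocity(d, eps, mu, rho_f, rho_p):
    """Velocity at which the Ergun pressure gradient equals the buoyant bed weight."""
    a, b = ergun_coefficients(d, eps, mu, rho_f)
    weight = (1.0 - eps) * (rho_p - rho_f) * G
    # Numerically stable positive root of a*u^2 + b*u - weight = 0.
    return 2.0 * weight / (b + math.sqrt(b * b + 4.0 * a * weight))


# Hypothetical values: 75 micron particles, voidage 0.45, air at ambient conditions.
d, eps, mu, rho_f, rho_p = 75e-6, 0.45, 1.8e-5, 1.2, 1400.0
u_mf = minimum_fluidization_velocity(d, eps, mu, rho_f, rho_p)
gradient_at_umf = ergun_pressure_gradient(u_mf, d, eps, mu, rho_f)
```

For particles this small the viscous term dominates, so the quadratic is solved in a form that avoids subtracting two nearly equal numbers.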
The study then modeled the co-gasification of coal and biomass. Comparing the coal gasification of large (Geldart B) and fine (Geldart A) particles showed that using finer particles had a pronounced effect on gas yields: the CO mass fraction increased, although the H2 and CH4 mass fractions slightly decreased. The gas yields of coal gasification with fine particles were also compared using three different gasification agents. Modeling the co-gasification of coal and switchgrass, for both fine (Geldart A) and larger (Geldart B) particles, showed no synergistic effect in terms of H2 and CH4 gas yields. The gas yields of CO, however, showed a significant increase during co-gasification. The effects of gasification temperature on gas yields were also investigated. / Ph. D.
|