Global ETD Search

21	Mixture Model Averaging for Clustering Wei, Yuhong 30 April 2012
Model-based clustering is based on a finite mixture of distributions, where each mixture component corresponds to a different group, cluster, subpopulation, or part thereof. Gaussian mixture distributions are most often used. Criteria commonly used in choosing the number of components in a finite mixture model include the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood. The best model is taken to be the one with the highest (or lowest) value of a given criterion. This approach is not reasonable because it is practically impossible to decide what to do when the difference between the best values of two models under such a criterion is ‘small’. Furthermore, it is not clear how such values should be calibrated in different situations with respect to sample size and the random variables in the model, nor does the approach take into account the magnitude of the likelihood. It is, therefore, worthwhile considering a model-averaging approach. We consider an averaging of the top M mixture models and consider applications in clustering and classification. In the course of model averaging, the top M models often have different numbers of mixture components. Therefore, we propose a method of merging Gaussian mixture components in order to get the same number of clusters for the top M models. The idea is to list all the combinations of components for merging, and then choose the combination corresponding to the largest adjusted Rand index (ARI) with the ‘reference model’. A weight is defined to quantify the importance of each model. The effectiveness of mixture model averaging for clustering is demonstrated on simulated and real data using the pgmm package, where the ARI from mixture model averaging is greater than that of the corresponding best model. The attractive feature of mixture model averaging is its computational efficiency; it only uses the conditional membership probabilities. Herein, Gaussian mixture models are used but the approach could be applied effectively without modification to other mixture models. / Paul McNicholas
mclust; merging mixture component; mixture model; model averaging; Model selection; model-based clustering; parameter estimation; pgmm; adjusted Rand index
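The component-merging step described in entry 21 (list every way of combining components, then keep the combination whose clustering has the largest ARI against the reference model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names `adjusted_rand_index`, `partitions_into_k` and `best_merge`, the list-of-rows layout of the membership matrix `z`, and the toy numbers are all hypothetical, and the model weights mentioned in the abstract are not modelled here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label sequences of equal length."""
    total = comb(len(a), 2)
    if total == 0:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def partitions_into_k(items, k):
    """Yield every split of `items` into exactly k non-empty groups."""
    items = list(items)
    if len(items) < k or k < 1:
        return
    if k == 1:
        yield [items]
        return
    if len(items) == k:
        yield [[x] for x in items]
        return
    first, rest = items[0], items[1:]
    # Case 1: `first` forms a group on its own.
    for p in partitions_into_k(rest, k - 1):
        yield [[first]] + p
    # Case 2: `first` joins one group of a k-way split of the rest.
    for p in partitions_into_k(rest, k):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]


def best_merge(z, reference_labels, k):
    """Merge the G columns of z (n x G membership probabilities) into k
    clusters, choosing the merge with the largest ARI against the
    reference labels. Returns (ari, groups, merged_probabilities)."""
    best = None
    for groups in partitions_into_k(range(len(z[0])), k):
        # Merging components = summing their membership probabilities.
        merged = [[sum(row[j] for j in grp) for grp in groups] for row in z]
        labels = [max(range(k), key=r.__getitem__) for r in merged]
        score = adjusted_rand_index(labels, reference_labels)
        if best is None or score > best[0]:
            best = (score, groups, merged)
    return best


# Toy data (made up): a 3-component model reduced to 2 clusters.
z = [[0.70, 0.20, 0.10], [0.20, 0.70, 0.10],
     [0.10, 0.10, 0.80], [0.05, 0.15, 0.80]]
score, groups, merged = best_merge(z, [0, 0, 1, 1], 2)
print(score, groups)
```

Exhaustive enumeration grows as a Stirling number of the second kind, so a sketch like this is only practical for the small component counts typical of such models.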
22	If and How Many 'Races'? The Application of Mixture Modeling to World-Wide Human Craniometric Variation Algee-Hewitt, Bridget Frances Beatrice 01 December 2011
Studies in human cranial variation are extensive and widely discussed. While skeletal biologists continue to focus on questions of biological distance and population history, group-specific knowledge is being increasingly used for human identification in medico-legal contexts. The importance of this research has often been overshadowed by both philosophic and methodological concerns. Many analyses have been constrained in their scope by the limited availability of representative samples and readily criticized for adopting statistical techniques that require user guidance and a priori information. A multi-part project is presented here that implements model-based clustering as an alternative approach for population studies using craniometric traits. This project also introduces the use of force-directed graphing and mixture-based supervised classification methods as statistically robust and practically useful techniques. This project considers three well-documented craniometric sources, whose samples collectively permit large-scale analyses and tests of population structure at a variety of partitions and for different goals. The craniofacial measurements drawn from the world-wide data sets collected by Howells and Hanihara permit rigorous tests for group differences and cryptic population structure. The inclusion of modern American samples from the Forensic Anthropology Data Bank allows for investigations into the importance of biosocial race and biogeographic ancestry in forensic anthropology. Demographic information from the United States Census Bureau is used to contextualize these samples within the range of the racial diversity represented in the American population at large. This project's findings support the presence of population structure, the utility of finite mixture methods for questions of biological classification, and the validity of supervised discrimination methods as reliable tools. They also attest to the importance of context for producing the most useful information on identity and affinity. These results suggest that a meaningful relationship between statistically inferred clusters and predefined groups does exist and that population-informative differences in cranial morphology can be detected with measured degrees of statistical certainty, even when true memberships are unknown. They imply, in turn, that the estimation of biogeographic ancestry and the identification of biosocial race in forensic anthropology can provide useful information for modern American casework that can be evidenced by scientific methods.
craniometrics; human variation; model-based clustering; finite mixture analysis; forensic anthropology; race; Biological and Physical Anthropology; Statistical Methodology
23	Identifying mixtures of mixtures using Bayesian estimation Malsiner-Walli, Gertraud, Frühwirth-Schnatter, Sylvia, Grün, Bettina January 2017
The use of a finite mixture of normal distributions in model-based clustering makes it possible to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and is in general achieved either by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior where the hyperparameters are carefully selected such that they reflect the cluster structure aimed at. In addition, this prior allows the model to be estimated using standard MCMC sampling methods. In combination with a post-processing approach which resolves the label switching issue and results in an identified model, our approach makes it possible to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semi-parametric way using finite mixtures of normals and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark data sets.
24	Model-Based Clustering of Covid-19 in US Counties Olayemi, Ismail Adigun 22 September 2021
No description available.
Statistics; Applied Mathematics; Public Health; COVID-19; Algebra; Statistics; Public Health; Policy Making; Model-Based Clustering; US Counties
25	Digital Education Resource Mining for Decision Support AL Fanah, Muna M.S. January 2021
Education has become a competitive and challenging domain, both nationally and internationally, in terms of quality, visibility and experience of academic delivery, affecting institutions, applicants and regulatory bodies. Data are now more widely available for general and public use and play an increasingly significant role in decision support for education. For example, world university rankings (WUR) such as Quacquarelli Symonds (QS), the Center for World University Rankings (CWUR) and Times Higher Education (Times), as well as national university rankings (e.g. the Guardian newspaper's Best UK Universities and the Complete University Guide league tables), have published their data for many years and are increasingly used in such decision-making processes by institutions and the general public.
University rankings; e-learners; Classification; Prediction; Decision Trees; Model-based clustering; Markov Chains; Education Resource mining; Decision support
26	Crop decision planning under yield and price uncertainties Kantanantha, Nantachai 25 June 2007
This research focuses on developing a crop decision planning model to help farmers make decisions for an upcoming crop year. The decisions consist of which crops to plant, the amount of land to allocate to each crop, when to grow, when to harvest, and when to sell. The objective is to maximize the overall profit subject to available resources under yield and price uncertainties. To help achieve this objective, we develop yield and price forecasting models to estimate the probable outcomes of these uncertain factors. The outputs from both forecasting models are incorporated into the crop decision planning model, which enables farmers to investigate and analyze the possible scenarios and eventually determine the appropriate decisions for each situation. This dissertation has three major components: yield forecasting, price forecasting, and crop decision planning. For yield forecasting, we propose a crop-weather regression model under a semiparametric framework. We use temperature and rainfall information during the cropping season and a GDP macroeconomic indicator as predictors in the model. We apply a functional principal components analysis technique to reduce the dimensionality of the model and to extract meaningful information from the predictors. We compare the prediction results from our model with a series of other yield forecasting models. For price forecasting, we develop a futures-based model which predicts a cash price from the futures price and the commodity basis. We focus on forecasting the commodity basis rather than the cash price because of the availability of futures price information and the low uncertainty of the commodity basis. We adopt a model-based approach to estimate the density function of the commodity basis distribution, which is further used to estimate the confidence intervals of the commodity basis and the cash price. Finally, for crop decision planning, we propose a stochastic linear programming model, which provides the optimal policy. We also develop three heuristic models that generate a feasible solution at a low computational cost. We investigate the robustness of the proposed models to the uncertainties and prior probabilities. A numerical study of the developed approaches is performed for the case of a representative farmer who grows corn and soybean in Illinois.
Yield forecasting; Functional principal component analysis; Crop decision planning; Price forecasting; Heuristic; Stochastic programming; Cropping systems; Decision support systems; Crop yields; Sales forecasting
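The futures-based price forecast in entry 26 rests on the standard identity cash price = futures price + basis, so an interval for the basis translates directly into an interval for the cash price. The sketch below illustrates that step only. It swaps in a single fitted normal distribution for the basis, which is a simplifying assumption and not the model-based density estimate used in the dissertation; the function name `cash_price_interval` and the sample numbers are hypothetical.

```python
from statistics import NormalDist


def cash_price_interval(futures_price, historical_basis, level=0.95):
    """Point forecast and interval for the cash price.

    Assumes basis = cash - futures and (as a simplification) that the
    basis is normally distributed with moments taken from past values.
    """
    basis = NormalDist.from_samples(historical_basis)
    tail = (1 - level) / 2
    point = futures_price + basis.mean
    low = futures_price + basis.inv_cdf(tail)
    high = futures_price + basis.inv_cdf(1 - tail)
    return point, low, high


# Made-up basis history (same currency unit as the futures price).
history = [-0.30, -0.25, -0.35, -0.20, -0.40]
point, low, high = cash_price_interval(4.00, history)
print(round(point, 2), round(low, 2), round(high, 2))
```

Because the futures price is observed, all forecast uncertainty in this sketch comes from the basis, which mirrors the abstract's reason for forecasting the basis rather than the cash price directly.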
27	Variational Approximations and Other Topics in Mixture Models Dang, Sanjeena 24 August 2012
Mixture model-based clustering has become an increasingly popular data analysis technique since its introduction almost fifty years ago. Families of mixture models are said to arise when the component parameters, usually the component covariance matrices, are decomposed and a number of constraints are imposed. Within the family setting, it is necessary to choose the member of the family (i.e., the appropriate covariance structure) in addition to the number of mixture components. To date, the Bayesian information criterion (BIC) has proved most effective for this model selection process, and the expectation-maximization (EM) algorithm has been predominantly used for parameter estimation. We deviate from the EM-BIC rubric, using variational Bayes approximations for parameter estimation and the deviance information criterion (DIC) for model selection. The variational Bayes approach alleviates some of the computational complexities associated with the EM algorithm. We apply this approach to the best-known family of Gaussian mixture models, the Gaussian parsimonious clustering models (GPCM). These models have an eigen-decomposed covariance structure. Cluster-weighted modelling (CWM) is another flexible statistical framework for modelling local relationships in heterogeneous populations on the basis of weighted combinations of local models. In particular, we extend cluster-weighted models to include an underlying latent factor structure of the independent variable, resulting in a novel family of models known as parsimonious cluster-weighted factor analyzers. The EM-BIC rubric is utilized for parameter estimation and model selection. Some work on a mixture of multivariate t-distributions is also presented, with a linear model for the mean and a modified Cholesky-decomposed covariance structure leading to a novel family of mixture models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters are estimated using the EM algorithm, and an approach to model selection other than the BIC is also considered. / NSERC PGS-D
High-dimensional data; Variational Bayes Approximations; Mixture Models; EM Algorithm; Factor Analyzers; Longitudinal Data; Gene Expression Data; Cluster-Weighted Models; Classification; Clustering; Model-based clustering; Family of Mixture Models; Model-based Classification; Cluster-Weighted Factor Analyzers
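The EM-BIC rubric mentioned in entry 27 chooses both the covariance structure and the number of components by comparing BIC values across fitted models. The sketch below shows only the bookkeeping, for three common covariance structures. It uses the convention BIC = -2 log L + p log n (lower is better); some software reports the opposite sign. The function names and the log-likelihood values in the usage lines are made up for illustration, and the three structures are a small subset, not the full GPCM family.

```python
from math import log


def n_free_params(g, d, covariance="full"):
    """Free parameters of a g-component Gaussian mixture in d dimensions.

    Counts (g - 1) mixing proportions, g*d mean entries, and one
    unconstrained covariance matrix of the given type per component.
    """
    per_component = {
        "full": d * (d + 1) // 2,  # symmetric d x d matrix
        "diagonal": d,             # one variance per dimension
        "spherical": 1,            # a single shared variance
    }[covariance]
    return (g - 1) + g * d + g * per_component


def bic(loglik, n_params, n_obs):
    """BIC with the 'lower is better' sign convention."""
    return -2.0 * loglik + n_params * log(n_obs)


# Hypothetical maximized log-likelihoods for n = 150 points in d = 4.
fits = {(2, "full"): -215.0, (3, "full"): -180.0, (3, "diagonal"): -190.0}
scores = {m: bic(ll, n_free_params(m[0], 4, m[1]), 150) for m, ll in fits.items()}
best_model = min(scores, key=scores.get)
print(best_model, round(scores[best_model], 1))
```

The penalty term is what lets a constrained covariance structure win over a full one with a slightly higher likelihood, which is the motivation for treating the choice of structure as part of model selection.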
28	Analyse statistique de données fonctionnelles à structures complexes / Statistical analysis of functional data with complex structures Adjogou, Adjobo Folly Dzigbodi 05 1900
No description available.
Longitudinal data; Functional clustering; Unsupervised classification; Model-based clustering; Functional data analysis; EM algorithm; Bayesian framework; Sparse longitudinal data; Gene expression; Mixture student; PRRSV; Lasso penalization
29	Classification de données multivariées multitypes basée sur des modèles de mélange : application à l'étude d'assemblages d'espèces en écologie / Model-based clustering for multivariate and mixed-mode data : application to multi-species spatial ecological data Georgescu, Vera 17 December 2010
In population ecology, species spatial patterns are studied in order to infer the existence of underlying processes, such as interactions within and between species and species responses to environmental heterogeneity. We propose to analyze spatial multi-species data by defining species assemblages, which we consider in terms of absolute abundances rather than species diversity. Species assemblages are one of the signatures of the local spatial interactions of species with one another and with their environment. Studying them can make it possible to detect several types of spatial equilibria and to relate them to the effect of environmental variables. Species assemblages are defined here by a non-spatial classification of the multivariate observations of species abundances. Model-based clustering procedures using mixture models were chosen in order to obtain an estimate of the classification uncertainty and to model an assemblage by a multivariate probability distribution. We propose:
1. An exploratory tool for the study of spatial multivariate observations of species abundances, which defines species assemblages by a model-based clustering procedure, and then maps the assemblages and analyzes their spatial structure. Common distributions, such as the multivariate Gaussian, are used to model the assemblages.
2. A hierarchical model for abundance assemblages that cannot be modeled with common distributions. This model can easily be adapted to mixed-mode data, which are frequent in ecology.
3. A clustering procedure for mixed-mode data based on mixtures of the hierarchical models defined in 2.
Two ecological case studies guided and illustrated this work: the small-scale study of the assemblages of two aphid species on the leaves of clementine trees, and the large-scale study of the assemblages of a host plant, Plantago lanceolata, and its pathogen, the powdery mildew, on the Åland Islands in south-west Finland.
Species assemblages; Finite mixture models; Coexistence; Mixed mode data; Multivariate data; Spatial multivariate data; Latent gaussian model; Hierarchical model; Monte Carlo EM; Model-based clustering; Spatial data
30	Contributions à l'analyse de données fonctionnelles multivariées, application à l'étude de la locomotion du cheval de sport / Contributions to the analysis of multivariate functional data, application to the study of the sport horse's locomotion Schmutz, Amandine 15 November 2019
With the growth of the smart-device market, which aims to provide athletes and trainers with systematic, objective and reliable monitoring, more and more parameters are recorded for a single individual. An alternative to laboratory evaluation methods is the use of inertial sensors, which make it possible to track performance without hindering it, without space limits and without tedious initialization procedures. Data collected by these sensors can be treated as multivariate functional data: quantitative measurements that evolve over time and are collected simultaneously for the same statistical individual. The aim of this thesis is to find parameters for analysing the locomotion of the athlete horse using a sensor placed in the saddle. This connected device (an inertial measurement unit, IMU) for equestrian sports records acceleration and angular velocity over time in the three spatial directions at a sampling frequency of 100 Hz. The database used for model development comprises 3221 canter strides from 58 ridden show-jumping horses of different ages and levels of competition. Two protocols were used to collect data: one on a straight path and one on a curved path. We restricted our work to the prediction of three parameters: speed per stride, stride length and jump quality. To address the first two objectives, we developed a multivariate functional clustering method that divides the database into smaller sub-groups that are more homogeneous with respect to the collected signals. This method characterizes each group by its average profile, which eases understanding and interpretation. Surprisingly, however, this clustering model did not improve the speed prediction results: the Support Vector Machine (SVM) remained the model with the lowest percentage of errors above 0.6 m/s. The same applied to stride length, for which an accuracy of 20 cm was reached with the SVM model. These results may be explained by the fact that our database is built from only 58 horses, a very small number of individuals for a clustering method. We then extended this method to the co-clustering of multivariate functional data in order to ease the mining of data collected for the same horse over time. This method might allow the detection and prevention of locomotor disorders, the main cause of interruption of a show jumper's activity. Lastly, we investigated the links between jump quality and the signals collected by the IMU. Our first results show that the signals collected in the saddle alone are not sufficient to discriminate jump quality finely. Additional information will be needed, for example from complementary sensors or by expanding the database to cover a more diverse range of horses and jump profiles.
Functional data; Clustering; Model based clustering; Latent block model; SEM-Gibbs; EM algorithm; Multivariate functional co-clustering; 510
