Global ETD Search

171	Modèles de mélange de von Mises-Fisher / Von Mises-Fisher mixture models Parr Bouberima, Wafia 15 November 2013
Directional data now arise in most fields, in many forms and often in large sizes and high dimensions, which creates a need for effective methods of analysis. For clustering, the probabilistic approach has become standard. It rests on a simple idea: since the g classes differ from one another, each class is assumed to follow a known probability distribution whose parameters generally differ from class to class; this is a mixture model. Under this assumption, the data are treated as a sample of a d-dimensional random variable whose density is a mixture of g class-specific probability distributions. This thesis addresses the clustering of directional data with the methods best suited to it, under two approaches: geometric and probabilistic. The first explores and compares k-means-type algorithms. The second estimates the model parameters directly by maximizing the log-likelihood with the EM algorithm, and the partition is then deduced from the estimates. For this second approach we take up the mixture of von Mises-Fisher (vMF) distributions and propose variants of the EMvMF algorithm: CEMvMF, SEMvMF and SAEMvMF. In the same context we address the choice of the number of components and of the mixture model using several information criteria: Bic, Aic, Aic3, Aic4, Aicc, Aicu, Caic, Clc, Icl-Bic, Ll, Icl and Awe. The study ends with a comparison of the vMF model with a simpler exponential model, which assumes that the data lie on a hypersphere of predefined radius ρ greater than or equal to one, rather than on the unit hypersphere as in the vMF model. We propose an improvement of this exponential model that adds a step estimating the radius ρ within the NEM algorithm; this yields the new variants NEMρ, NCEMρ and NSEMρ, which gave better results in most of our applications. The proposed algorithms were tested on textual data, genetic data and data simulated from the vMF model, and these applications gave a better understanding of the approaches studied throughout the thesis.
Data analysis; Cluster analysis; Directional data; Mixture model; Von Mises-Fisher distribution; 519.2
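As a rough illustration of how information criteria such as those named in entry 171 trade goodness of fit against model complexity, the sketch below computes a few of them from a maximized log-likelihood. The function names `information_criteria` and `best_model` and their inputs (`loglik`, the number of free parameters `k`, the sample size `n`) are assumptions made for illustration and are not taken from the thesis; the formulas are the usual textbook definitions.

```python
import math

def information_criteria(loglik, k, n):
    """Return common penalized-likelihood criteria (lower is better).

    loglik: maximized log-likelihood of the fitted mixture
    k:      number of free parameters (hypothetical input)
    n:      number of observations
    """
    deviance = -2.0 * loglik
    aic = deviance + 2 * k
    return {
        "Aic": aic,
        "Aic3": deviance + 3 * k,
        "Aicc": aic + 2 * k * (k + 1) / (n - k - 1),  # small-sample correction
        "Bic": deviance + k * math.log(n),
        "Caic": deviance + k * (math.log(n) + 1),
    }

def best_model(fits, n, criterion="Bic"):
    """Pick the number of components g whose fit minimizes the criterion.

    fits maps g -> (loglik, k) for each candidate mixture.
    """
    return min(fits, key=lambda g: information_criteria(*fits[g], n)[criterion])
```

In practice one would fit the mixture for each candidate number of components, collect the log-likelihoods, and pass them to `best_model`; different criteria can select different models because their penalties differ.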
172	Classification et inférence de réseaux pour les données RNA-seq / Clustering and network inference for RNA-seq data Gallopin, Mélina 09 December 2015
This thesis gathers methodological contributions to the statistical analysis of transcriptome sequencing (RNA-seq) data. RNA-seq count data are discrete, and the number of samples is usually small because sequencing is costly; these two features are the main statistical challenges in modelling them. The first part of the thesis is devoted to co-expression analysis using model-based clustering, with the aim of detecting modules of co-expressed genes. A natural model for discrete RNA-seq data is a Poisson mixture model, but a Gaussian mixture model applied after a simple transformation of the data is a reasonable alternative. We propose to compare the alternatives with a data-driven criterion that selects the model best suited to each dataset. In addition, we present a model selection criterion that takes external biological annotations of the genes into account and makes it easier to obtain biologically interpretable clusters. This criterion is not specific to RNA-seq data: it is useful in any co-expression analysis based on mixture models that aims to enrich functional annotation databases. The second part of the thesis is devoted to network inference using graphical models, with the aim of detecting dependence relationships between gene expression levels. We propose a network inference model based on Poisson distributions that accounts for the discrete nature and high inter-sample variability of RNA-seq data. However, network inference methods require a large number of samples. For the Gaussian graphical model, a competitor of the Poisson model, we present a non-asymptotic approach for selecting relevant subsets of genes by decomposing the covariance matrix into diagonal blocks. This method is not specific to RNA-seq data and reduces the dimension of any network inference problem based on the Gaussian graphical model.
Mixture model; Graphical model; RNA-Seq data; Clustering; Network inference; Model selection
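To make the idea of a block-diagonal decomposition in entry 172 concrete, here is a minimal sketch. It assumes that blocks are found by thresholding the absolute entries of a covariance matrix and taking connected components of the resulting graph; the abstract does not state how the blocks are detected, so the thresholding rule, the function name `covariance_blocks` and the `threshold` parameter are all illustrative assumptions.

```python
def covariance_blocks(cov, threshold):
    """Group variables into blocks: i and j share a block when they are
    linked by a chain of entries with |cov[i][j]| > threshold.

    cov is a square matrix given as a list of lists (assumed symmetric).
    Returns a list of sorted index lists, one per block.
    """
    p = len(cov)
    seen = [False] * p
    blocks = []
    for start in range(p):
        if seen[start]:
            continue
        seen[start] = True
        stack, block = [start], []
        while stack:  # depth-first search over the thresholded graph
            i = stack.pop()
            block.append(i)
            for j in range(p):
                if not seen[j] and abs(cov[i][j]) > threshold:
                    seen[j] = True
                    stack.append(j)
        blocks.append(sorted(block))
    return blocks
```

Each block can then be handled as a separate, smaller network inference problem, which is where the reduction in dimension comes from.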
173	Optimalizace modelování gaussovských směsí v podprostorech a jejich skórování v rozpoznávání mluvčího / Optimization of Gaussian Mixture Subspace Models and Related Scoring Algorithms in Speaker Verification Glembek, Ondřej January 2012
This thesis deals with modelling in a subspace of the parameters of Gaussian mixture models for speaker recognition. It consists of three parts. The first part is devoted to scoring methods used when joint factor analysis models the speaker. The methods studied differ mainly in how they handle the channel variability of the test recordings. They are presented in relation to the general form of the likelihood function for joint factor analysis and compared in terms of both accuracy and speed. It is shown that a linear approximation of the likelihood function gives results comparable to the standard likelihood evaluation while dramatically simplifying the mathematical formulation and thereby speeding up evaluation. The second part deals with the extraction of so-called i-vectors, that is, low-dimensional representations of recordings. The thesis presents two approaches to simplifying the extraction. The motivation was to speed up i-vector extraction, to deploy this successful technique on simple devices such as mobile phones, and to obtain a mathematical simplification that allows numerical optimization methods to be used for discriminative training. The results show that on long recordings the speed-up comes at the cost of lower recognition accuracy, whereas on short recordings, where recognition accuracy is low, the differences in accuracy fade. The third part deals with discriminative training in speaker recognition. It summarizes findings from earlier work on the topic, builds on the findings of the previous two parts and discusses discriminative training of the parameters of the i-vector extractor. The results show that, with classical training of the extractor followed by discriminative retraining, these methods improve accuracy.
174	[en] IMPACT OF MOLECULAR DIFFUSION MODELS IN THE PREDICTION OF WAX DEPOSITION / [pt] IMPACTO DE MODELOS DE DIFUSÃO MOLECULAR NA PREVISÃO DE DEPOSIÇÃO DE PARAFINA PAULO GUSTAVO CANDIDO DE OLIVEIRA 21 November 2022
Petroleum is made up of hydrocarbons, which precipitate as solid paraffin particles when the temperature drops below a threshold known as the Wax Appearance Temperature (WAT). These particles can deposit on the inner walls of pipelines and obstruct the flow, which can cause losses on the order of millions of dollars. For this reason, the ability to predict and control wax deposition is of fundamental importance to both pipeline designers and operators. To deal with this problem, the scientific community has made a great effort to improve wax deposition prediction methodologies. Species diffusion is often modelled with Fick's law, which is valid for binary mixtures, although the hydrocarbons present in oil form a multicomponent mixture. The present work proposes to evaluate the diffusive mass flux of the species with the Stefan-Maxwell model, which is compatible with multicomponent systems. To determine the axial and temporal evolution of the wax deposit thickness, the flow was modelled as a liquid/solid mixture, and the conservation equations of energy, mass, linear momentum and species continuity were solved, coupled with the thermodynamic model of multiple solid solutions to determine paraffin precipitation. The conservation equations were solved with the open-source software OpenFOAM (trademark). A comparison of the predictions of the Fick and Stefan-Maxwell models with experimental data showed that the impact of the diffusion model is negligible at the beginning of the deposition process. As time passes, however, the Stefan-Maxwell model predicts a greater increase in the concentration of the heaviest species inside the wax deposit than Fick's law does.
WAX DEPOSITION; FICK'S LAW; STEFAN-MAXWELL; MULTI-COMPONENT DIFFUSION; MIXTURE MODEL
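For reference, the two diffusion closures compared in entry 174 are commonly written as below in transport textbooks. The notation is an assumption and is not taken from the thesis: $\mathbf{J}_1$ is the diffusive mass flux of species 1, $\rho$ the mixture density, $w_1$ a mass fraction, $x_i$ a mole fraction, $c$ the total molar concentration, $\mathbf{N}_i$ the molar flux of species $i$, and $D_{ij}$ the binary diffusion coefficients. The Stefan-Maxwell form shown assumes an ideal mixture at constant temperature and pressure.

```latex
% Fick's law for a binary mixture of species 1 and 2
\mathbf{J}_1 = -\rho\, D_{12}\, \nabla w_1

% Stefan-Maxwell relations for n species (ideal mixture, constant T and p)
\nabla x_i = \sum_{j \ne i} \frac{x_i \mathbf{N}_j - x_j \mathbf{N}_i}{c\, D_{ij}},
\qquad i = 1, \dots, n
```

In the Stefan-Maxwell form the gradient of each species is tied to the fluxes of all the others, which is why its predictions can drift away from Fick's law in a multicomponent mixture.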
175	Bayesian Solution to the Analysis of Data with Values below the Limit of Detection (LOD) Jin, Yan January 2008
No description available.
Mathematics; Statistics; Bayesian; Censored data; Limit of detection; LOD; Censoring; Repeated measure; Nested random effects; Mixture model; Gibbs sampling; Metropolis-Hastings algorithm; DIC
176	Improved Methodologies for the Simultaneous Study of Two Motor Systems: Reticulospinal and Corticospinal Cooperation and Competition for Motor Control Ortiz-Rosario, Alexis 31 October 2016
No description available.
Biomedical Engineering; Computer Science; Neurosciences
177	Algorithm for comparing large scale protein-DNA interaction data Taslim, Cenny 28 July 2011
No description available.
Biomedical Research; Biostatistics; Comparative; Computer Engineering; Computer Science; machine learning; nonlinear normalization; model-based classification; mixture model; ChIP-seq; differential identification
178	Regression Modeling of Time to Event Data Using the Ornstein-Uhlenbeck Process Erich, Roger Alan 16 August 2012
No description available.
Biostatistics; Statistics; cancer clinical trial; cure rate model; first hitting time model; Gaussian process; mixture model; random effects model
179	Unsupervised Anomaly Detection and Root Cause Analysis in HFC Networks : A Clustering Approach Forsare Källman, Povel January 2021
Following the significant transition from the traditional production industry to an information-based economy, the telecommunications industry faced an explosion of innovation, resulting in a continuous change in user behaviour. The industry has made efforts to adapt to a more data-driven future, which has given rise to larger and more complex systems. Troubleshooting functions such as anomaly detection and root cause analysis are therefore essential for maintaining service quality and facilitating daily operations. This study explores the possibilities, benefits and drawbacks of using cluster analysis for anomaly detection in hybrid fiber-coaxial (HFC) networks. Based on a literature review of unsupervised anomaly detection and an assumption about anomalous behaviour in HFC network data, k-means, Self-Organizing Map and Gaussian Mixture Model were implemented both with and without Principal Component Analysis (PCA). Analysis of the results showed an increase in performance for all models when PCA was applied, with k-means outperforming both Self-Organizing Map and Gaussian Mixture Model. On this basis, it is recommended to apply PCA for clustering-based anomaly detection. Further research is necessary to determine whether cluster analysis is the most appropriate unsupervised anomaly detection approach.
Anomaly Detection; Root Cause Analysis; Cluster Analysis; k-means; Self-Organizing Map; Gaussian Mixture Model; Dimensionality Reduction; Principal Component Analysis; Hybrid Fiber-Coaxial Network; Computer and Information Sciences
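A minimal sketch of clustering-based anomaly detection in the spirit of entry 179 follows. It assumes that the anomaly score of a point is its distance to the nearest k-means centroid; the abstract does not say how anomalies are scored, so this rule, the function names `kmeans` and `anomaly_scores`, and the deterministic initialization from the first `k` points are all illustrative assumptions. PCA is left out to keep the example short.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd iterations; centroids start at the first k points
    (a deterministic choice made only for this illustration)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    return centroids

def anomaly_scores(points, centroids):
    """Score each point by its distance to the nearest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]
```

Points with the largest scores are the candidates for anomalies; where to place the cut-off is a separate decision that this sketch does not address.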
180	A multi-wavelength study of a sample of galaxy clusters / Susan Wilson Wilson, Susan January 2012
In this dissertation we perform a multi-wavelength analysis of galaxy clusters. We discuss various clustering methods used to determine the physical parameters of galaxy clusters required for this type of study. A selection of galaxy clusters was drawn from four papers (Popesso et al. 2007b, Yoon et al. 2008, Loubser et al. 2008, Brownstein & Moffat 2006) and restricted by redshift and galactic latitude, giving a sample of 40 galaxy clusters with 0.0 < z < 0.15. Data mining with the Virtual Observatory (VO) and a literature survey provided background information on each galaxy cluster in the sample with respect to optical, radio and X-ray data. Using the Kaye's Mixture Model (KMM) and the Gaussian Mixture Model (GMM), we determine the most likely cluster member candidates for each source in our sample. We compare the results with SIMBAD's hierarchy method. We show that the GMM provides a very robust method for determining member candidates, but to ensure that the right candidates are chosen we apply a selection of outlier tests to our sources. We arrive at a method based on a combination of the GMM, the QQ plot and the Rosner test that provides a robust and consistent way of determining galaxy cluster members. Comparison of the calculated physical parameters (velocity dispersion, radius, mass and temperature) with literature values shows that they agree within the $3\sigma$ range for the majority of our galaxy clusters. Inconsistencies are thought to be due to dynamically active clusters that have substructure or are undergoing mergers, which makes identifying member galaxies difficult. Six correlations between different physical parameters in the optical and X-ray wavelengths were consistent with published results. Comparing the velocity dispersion with the X-ray temperature, we found a relation of $\sigma \propto T^{0.43}$, compared with $\sigma \propto T^{0.5}$ obtained by Bird et al. (1995). The X-ray luminosity-temperature and X-ray luminosity-velocity dispersion relations gave $L_X \propto T^{2.44}$ and $L_X \propto \sigma^{2.40}$, which lie within the uncertainty of the results given by Rozgacheva & Kuvshinova (2010). These results all suggest that our method for determining galaxy cluster members is efficient and that its application to higher-redshift sources can be considered. Further studies of galaxy clusters with substructure are needed to improve the method. In future work, the physical parameters obtained here will be compared further with X-ray and radio properties to determine a link between bent radio sources and the galaxy cluster environment. / MSc (Space Physics), North-West University, Potchefstroom Campus, 2013
Galaxy kinematics and dynamics; Galaxy Clusters; Statistical analysis; Clustering algorithms; Abell clusters; Mass determination; Multi-wavelength view; Kayes Mixing Model; Gaussian Mixture Model; Multi-modality; Radio galaxies; Data mining; Velocity dispersion; Kernel density estimation; Outlier detection techniques
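Scaling relations such as those quoted in entry 180 (for example velocity dispersion against X-ray temperature) are power laws, and one simple way to estimate an exponent is a straight-line fit in log-log space. The sketch below assumes ordinary least squares with no measurement errors; the dissertation's actual fitting procedure is not given in the abstract, and the function name `fit_power_law` is hypothetical.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b).

    All values in x and y must be strictly positive.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx               # slope of the log-log line = exponent
    a = math.exp(my - b * mx)   # intercept gives the normalization
    return a, b
```

With real cluster data both quantities carry uncertainties and intrinsic scatter, so a fit that accounts for errors in both variables would usually be preferred over this unweighted version.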
