Global ETD Search

1	Algoritmo Wang-Landau e agrupamento de dados superparamagnético [Wang-Landau algorithm and superparamagnetic data clustering]
RAMEH, Leila Milfont. 26 August 2010
Made available in DSpace on 2016-08-02. Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES

The unsupervised data clustering method proposed by Domany and coworkers maps the problem onto an inhomogeneous granular magnetic system whose properties can be investigated with a Monte Carlo method. Each data item consists of n numeric attributes, corresponds to a point in an n-dimensional Euclidean space, and is associated with a Potts spin. The interaction between spins decays exponentially with the distance between them, which favors the alignment of spins associated with similar objects. The physical system is a disordered ferromagnet described by the Hamiltonian of a q-state Potts model.

The magnetic system is expected to exhibit three regimes as its temperature is varied. At very low temperatures the system is completely ordered. At the other extreme, at high temperatures, it shows no magnetic order. In an intermediate range of temperatures, the spins within certain regions remain tightly coupled, forming grains, but one grain does not influence the behavior of another. That is, the grains are uncorrelated, and this intermediate state is called the superparamagnetic phase. The transition from one regime to another can be identified by peaks in the curve of specific heat versus temperature.

We apply the method to several artificial and real-life data sets, covering flower classification, medical data, and image identification: the real iris plant data and the medical data set known as BUPA, the synthetic data set known as Ruspini, and a data set that we generated, consisting of two overlapping three-dimensional figures, a sphere and a torus. We classify the data by measuring the spin-spin correlation at several temperatures. Contrary to the claims of Domany and coworkers, we found that clustering in the superparamagnetic phase is not always ideal: the best classification of the data occurred outside the superparamagnetic phase.

Superparamagnetic clustering of data; Monte Carlo method; Wang-Landau algorithm
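
As a minimal sketch of the mapping described in the abstract of entry 1, the Python below assigns a Potts spin to each data point and evaluates the energy H = -sum_{i<j} J_ij * delta(s_i, s_j). The abstract only states that the interaction decays exponentially with distance; the specific form J_ij = exp(-d_ij / a), the length scale `a`, the toy points, and the function names `couplings` and `potts_energy` are assumptions made for illustration, not the thesis's implementation.

```python
import math
from itertools import combinations


def couplings(points, a=1.0):
    """Return {(i, j): J_ij} for every pair i < j.

    Assumed form: J_ij = exp(-d_ij / a), where d_ij is the Euclidean
    distance between points i and j and `a` is a hypothetical length scale.
    """
    J = {}
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        J[(i, j)] = math.exp(-d / a)
    return J


def potts_energy(spins, J):
    """Potts energy: only pairs in the same spin state contribute -J_ij."""
    return -sum(Jij for (i, j), Jij in J.items() if spins[i] == spins[j])


# Toy data: two tight pairs of points that are far from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
J = couplings(points)

aligned = potts_energy([0, 0, 0, 0], J)  # all spins in the same state
grains = potts_energy([0, 0, 1, 1], J)   # each pair aligned on its own
```

With two well-separated pairs of points, the configuration in which each pair forms its own aligned grain costs almost no energy relative to the fully aligned state, because the couplings between distant points are very small.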
2	Phase transitions in novel superfluids and systems with correlated disorder
Meier, Hannes. January 2015

Condensed matter systems undergoing phase transitions rarely allow exact solutions. The presence of disorder makes the situation even worse, but collective Monte Carlo methods and parallel algorithms allow numerical descriptions. This thesis considers classical phase transitions in disordered spin systems in general, and in effective models of superfluids with disorder and novel interactions in particular. Quantum phase transitions are considered via a quantum-to-classical mapping. The central questions are whether the presence of defects changes universal properties and what qualitative implications follow for experiments. Common to the cases considered is that the disorder maps out correlated structures. All results are obtained using large-scale Monte Carlo simulations of effective models capturing the relevant degrees of freedom at the transition.

Considering a model system for superflow aided by a defect network, we find that the onset properties are significantly altered compared to the $\lambda$-transition in $^{4}$He. This has qualitative implications for the expected experimental signatures in a defect supersolid scenario. For the Bose glass to superfluid quantum phase transition in 2D, we determine the quantum correlation time by an anisotropic finite-size scaling approach. Without a priori assumptions on critical parameters, we find the critical exponent $z=1.8 \pm 0.05$, contradicting the long-standing result $z=d$. Using a 3D effective model for multi-band type-1.5 superconductors, we find that these systems possibly feature a strong first-order vortex-driven phase transition. Despite its short-range nature, details of the interaction are shown to play an important role.

Phase transitions in disordered spin models exposed to correlated defect structures, obtained via rapid quenches of critical loop and spin models, are investigated. On long length scales the correlations are shown to decay algebraically. The decay exponents are expressed through known critical exponents of the disorder-generating models. For cases where the disorder correlations imply the existence of a new long-range-disorder fixed point, we determine the critical exponents of the disordered systems via finite-size scaling of Monte Carlo data and find good agreement with theoretical expectations.

condensed matter physics; phase transitions; critical phenomena; spin models; quantum phase transitions; quantum fluids; superfluidity; superconductivity; disordered systems; Bose glass; dirty bosons; vortex pinning; statistical mechanics; Monte Carlo simulation; Wolff algorithm; classical worm algorithm; Wang-Landau algorithm
3	Modélisation des bi-grappes et sélection des variables pour des données de grande dimension : application aux données d’expression génétique [Bicluster modelling and variable selection for high-dimensional data: application to gene expression data]
Chekouo Tekougang, Thierry. 08 1900

Clustering is a classical method for analysing gene expression matrices. When clustering is applied to the rows (genes), each column (experimental condition) belongs to every resulting cluster. However, it is often observed that subsets of genes are co-regulated and co-expressed only under a subset of conditions, and behave almost independently under the others. Biclustering techniques were therefore proposed to uncover such sub-matrices of genes and conditions. A biclustering is a simultaneous clustering of the rows and columns of a data matrix. Most biclustering algorithms proposed in the literature have no statistical foundation. It is nevertheless worthwhile to examine the models underlying these algorithms and to develop statistical models that yield significant biclusters.

In this thesis, we review the biclustering algorithms that seem to be the most popular. We group them according to the type of homogeneity within the bicluster and the type of overlap that may be encountered, and we shed light on the statistical models that can justify them. It turns out that some techniques can be justified in a Bayesian framework. We develop an extension of the plaid biclustering model in a Bayesian framework and propose a measure of biclustering complexity. The deviance information criterion (DIC) is used to select the number of biclusters. Studies on gene expression data and simulated data give satisfactory results.

To our knowledge, existing biclustering algorithms assume that genes and experimental conditions are independent entities, and they do not incorporate prior biological information that may be available on genes and conditions. We introduce a new Bayesian plaid model for gene expression data that integrates biological knowledge and takes into account pairwise interactions between genes and between conditions through a Gibbs field. Dependence between these entities is built from relational graphs, one for the genes and one for the conditions. Each graph is constructed from k-nearest neighbors and makes it possible to define the prior distribution of the labels as auto-logistic models. Gene similarities are computed using the Gene Ontology (GO). The parameters are estimated with a hybrid procedure that mixes MCMC with a variant of the Wang-Landau algorithm. Experiments on simulated and real data show the performance of our approach.

Microarray data may contain several noise variables, that is, variables that are unable to discriminate between the groups. These variables may mask the true clustering structure. Inspired by the plaid model, we propose a new model that simultaneously recovers the true clustering structure and identifies the discriminating variables. It assumes an additive superposition of clusters, that is, an observation can be explained by more than one cluster. The problem is addressed using a binary latent vector, and estimation is carried out with the Monte Carlo EM algorithm. Importance sampling is used to reduce the computational cost of the Monte Carlo sampling at each step of the EM algorithm. Numerical examples demonstrate the usefulness of our methods in terms of variable selection and clustering.

The simulations were implemented in Java.

Clustering; Gene Ontology; Gene expression; Deviance information criterion; Wang-Landau algorithm; Auto-logistic models; Variable selection; Plaid model; Monte Carlo EM algorithm; Importance sampling
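
The abstract of entry 3 selects the number of biclusters with the deviance information criterion. As a hedged illustration of the standard definition, DIC = Dbar + p_D with p_D = Dbar - D(theta_bar), where D is the deviance (-2 times the log-likelihood), Dbar is its posterior mean, and theta_bar is the posterior mean of the parameters, the Python sketch below computes DIC for a toy Normal model with known variance. The toy model, the function names `deviance` and `dic`, and the hard-coded posterior draws are assumptions for illustration; the Bayesian plaid model of the thesis is not reproduced here.

```python
import math
from statistics import fmean


def deviance(data, mu, sigma=1.0):
    """D(mu) = -2 * log-likelihood of i.i.d. Normal(mu, sigma^2) data."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return n * math.log(2 * math.pi * sigma ** 2) + squared_error / sigma ** 2


def dic(data, mu_samples):
    """Return (DIC, p_D) computed from posterior draws of mu."""
    d_bar = fmean(deviance(data, mu) for mu in mu_samples)  # posterior mean deviance
    d_hat = deviance(data, fmean(mu_samples))               # deviance at posterior mean
    p_d = d_bar - d_hat                                     # effective number of parameters
    return d_bar + p_d, p_d


# Hypothetical data and posterior draws, chosen only to keep the arithmetic simple.
data = [0.0, 1.0, 2.0]
mu_samples = [0.5, 1.5]
dic_value, p_d = dic(data, mu_samples)
```

In a model-selection setting, the same quantity would be computed for each candidate model (for instance, each candidate number of biclusters), and the candidate with the smallest DIC would be preferred.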
5	Non-convex Bayesian Learning via Stochastic Gradient Markov Chain Monte Carlo
Wei Deng (11804435). 18 December 2021

The rise of artificial intelligence (AI) hinges on the efficient training of modern deep neural networks (DNNs) for non-convex optimization and uncertainty quantification, which boils down to a non-convex Bayesian learning problem. A standard tool for this problem is Langevin Monte Carlo, which approximates the posterior distribution with theoretical guarantees. However, non-convex Bayesian learning in real big data applications can be arbitrarily slow and often fails to capture the uncertainty or the informative modes within a limited time. As a result, advanced techniques are still required.

In this thesis, we start with replica exchange Langevin Monte Carlo (also known as parallel tempering), a Markov jump process that proposes appropriate swaps between exploration and exploitation to achieve acceleration. However, the naive extension of swaps to big data problems leads to a large bias, so bias-corrected swaps are required. Such a mechanism leads to few effective swaps and insignificant acceleration. To alleviate this issue, we first propose a control variates method to reduce the variance of the noisy energy estimators and show its potential to accelerate the exponential convergence. We also present the population-chain replica exchange and propose a generalized deterministic even-odd scheme to track the non-reversibility and obtain an optimal round trip rate. Further approximations based on stochastic gradient descent make the method user-friendly for large-scale uncertainty approximation tasks, without much tuning cost.

In the second part of the thesis, we study scalable dynamic importance sampling algorithms based on stochastic approximation. Traditional dynamic importance sampling algorithms have been successful in bioinformatics and statistical physics; however, their lack of scalability has greatly limited their extension to big data applications. To handle this scalability issue, we resolve the vanishing gradient problem and propose two dynamic importance sampling algorithms based on stochastic gradient Langevin dynamics. Theoretically, we establish the stability condition for the underlying ordinary differential equation (ODE) system and guarantee the asymptotic convergence of the latent variable to the desired fixed point. Interestingly, this result still holds for non-convex energy landscapes. In addition, we propose a pleasingly parallel version of these algorithms with interacting latent variables. We show that the interacting algorithm can be theoretically more efficient than the single-chain alternative with an equivalent computational budget.

Statistics; Stochastic Analysis and Modelling; Monte Carlo Algorithm; Artificial intelligence; Importance sampling; Computer vision; Langevin Dynamics; Variance reduction techniques; Wang-Landau algorithm; Interacting particles; Hamiltonian Monte Carlo; Log-Sobolev inequality; Metropolis-Hastings; Deep neural network; Stochastic variance-reduced gradient; Wasserstein distance; Convolutional neural network; Deterministic even-odd scheme; Non-reversibility; Stochastic approximation Monte Carlo; Stochastic differential equation; Stochastic gradient descent; Parallel tempering; Stochastic approximation; Replica exchange; Stochastic gradient Langevin dynamics; Markov Chain Monte Carlo
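
The swap move at the heart of replica exchange (parallel tempering), mentioned in the abstract of entry 5, can be sketched in Python as follows. This is the textbook Metropolis swap rule for two replicas with exactly known energies; it does not include the bias correction for noisy energy estimators that the abstract refers to. The function name `swap_probability` and the numeric values are hypothetical.

```python
import math


def swap_probability(energy_low, energy_high, temp_low, temp_high):
    """Probability of swapping the states of two replicas.

    energy_low  : energy of the state currently held by the low-temperature chain
    energy_high : energy of the state currently held by the high-temperature chain
    The acceptance ratio is exp((1/temp_low - 1/temp_high) * (energy_low - energy_high)),
    capped at 1.
    """
    log_ratio = (1.0 / temp_low - 1.0 / temp_high) * (energy_low - energy_high)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


# The hot chain has found a lower-energy state: the swap is always accepted.
p_favourable = swap_probability(energy_low=2.0, energy_high=1.0,
                                temp_low=1.0, temp_high=2.0)

# The cold chain already holds the lower-energy state: the swap is accepted
# only with probability exp(-0.5).
p_unfavourable = swap_probability(energy_low=1.0, energy_high=2.0,
                                  temp_low=1.0, temp_high=2.0)
```

The rule hands low-energy states found by the exploratory high-temperature chain to the low-temperature chain, while keeping the joint distribution of the two replicas invariant.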
