  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
71

Modeling and Characterization of Dynamic Changes in Biological Systems from Multi-platform Genomic Data

Zhang, Bai 30 September 2011
Biological systems constantly evolve and adapt in response to changing environments and external stimuli at the molecular and genomic levels. Building statistical models that characterize such dynamic changes in biological systems is one of the key objectives in bioinformatics and computational biology. Recent advances in high-throughput genomic and molecular profiling technologies such as gene expression and copy number microarrays provide ample opportunities to study cellular activities at the individual gene and network levels. The aim of this dissertation is to formulate dynamic changes in biological networks and DNA copy numbers mathematically, to develop machine learning algorithms to learn these statistical models from high-throughput biological data, and to demonstrate their applications in systems biological studies. The first part (Chapters 2-4) of the dissertation focuses on the dynamic changes taking place at the biological network level. Biological networks are context-specific and dynamic in nature. Under different conditions, different regulatory components and mechanisms are activated and the topology of the underlying gene regulatory network changes. We report a differential dependency network (DDN) analysis to detect statistically significant topological changes in the transcriptional networks between two biological conditions. Further, we formalize and extend the DDN approach to an effective learning strategy to extract structural changes in graphical models using l1-regularization-based convex optimization. We discuss the key properties of this formulation and introduce an efficient implementation by the block coordinate descent algorithm. Another type of dynamic change in biological networks is the observation that a group of genes involved in certain biological functions or processes coordinate to respond to outside stimuli, producing distinct time course patterns. 
We apply the echo state network, a new architecture of recurrent neural networks, to model temporal gene expression patterns and analyze the theoretical properties of echo state networks with random matrix theory. The second part (Chapter 5) of the dissertation focuses on the changes at the DNA copy number level, especially in cancer cells. Somatic DNA copy number alterations (CNAs) are key genetic events in the development and progression of human cancers, and frequently contribute to tumorigenesis. We propose a statistically principled in silico approach, Bayesian Analysis of COpy number Mixtures (BACOM), to accurately detect genomic deletion type, estimate normal tissue contamination, and accordingly recover the true copy number profile in cancer cells. / Ph. D.
72

Multiple Uses of Frequent Episodes in Temporal Process Modeling

Patnaik, Debprakash 19 August 2011
This dissertation investigates algorithmic techniques for temporal process discovery in many domains. Many different formalisms have been proposed for modeling temporal processes such as motifs, dynamic Bayesian networks and partial orders, but the direct inference of such models from data has been computationally intensive or even intractable. In this work, we propose the mining of frequent episodes as a bridge to inferring more formal models of temporal processes. This enables us to combine the advantages of frequent episode mining, which conducts level wise search over constrained spaces, with the formal basis of process representations, such as probabilistic graphical models and partial orders. We also investigate the mining of frequent episodes in infinite data streams which further expands their applicability into many modern data mining contexts. To demonstrate the usefulness of our methods, we apply them in different problem contexts such as: sensor networks in data centers, multi-neuronal spike train analysis in neuroscience, and electronic medical records in medical informatics. / Ph. D.
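As a rough illustration of what counting a frequent episode involves, the sketch below counts non-overlapped occurrences of a serial episode (event types that must appear in order, possibly with other events in between) in an event sequence. The abstract does not say which frequency definition the dissertation uses, so the non-overlapped count, the function name `count_non_overlapped`, and the toy event stream are all assumptions made for this example.

```python
from typing import Hashable, Sequence


def count_non_overlapped(events: Sequence[Hashable], episode: Sequence[Hashable]) -> int:
    """Count non-overlapped occurrences of a serial episode.

    A greedy left-to-right scan: wait for the episode's event types in
    order, and start looking for the next occurrence only after the
    current one has completed.
    """
    if not episode:
        raise ValueError("episode must contain at least one event type")
    count = 0
    position = 0  # index of the episode element we are currently waiting for
    for event in events:
        if event == episode[position]:
            position += 1
            if position == len(episode):
                count += 1
                position = 0  # occurrence complete; start over
    return count


# Hypothetical event stream (illustration only, not data from the dissertation).
stream = list("ABXCABCAC")
support = count_non_overlapped(stream, list("ABC"))
print(support)
```

A level-wise miner would typically call a counter like this for each candidate episode of a given length and extend only those candidates whose count meets a support threshold.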
73

Uncovering Structure in High-Dimensions: Networks and Multi-task Learning Problems

Kolar, Mladen 01 July 2013
Extracting knowledge from noisy high-dimensional data sets and providing insights into the complex mechanisms underlying them is of utmost importance in many scientific domains. Statistical modeling has become ubiquitous in the analysis of high-dimensional functional data in search of a better understanding of cognition mechanisms, in the exploration of large-scale gene regulatory networks in the hope of developing drugs for lethal diseases, and in the prediction of stock market volatility in the hope of beating the market. Statistical analysis of these high-dimensional data sets is possible only if an estimation procedure exploits hidden structures underlying the data. This thesis develops flexible estimation procedures with provable theoretical guarantees for uncovering unknown hidden structures underlying the data-generating process. Of particular interest are procedures that can be used on high-dimensional data sets where the number of samples n is much smaller than the ambient dimension p. Learning in high dimensions is difficult due to the curse of dimensionality; however, the special problem structure makes inference possible. Due to its importance for scientific discovery, we put emphasis on consistent structure recovery throughout the thesis. Particular focus is given to two important problems: semi-parametric estimation of networks and feature selection in multi-task learning.
74

Measuring Interestingness in Outliers with Explanation Facility using Belief Networks

Masood, Adnan 01 January 2014
This research explores the potential of improving the explainability of outliers using Bayesian Belief Networks as background knowledge. Outliers are deviations from the usual trends of data. Mining outliers may help discover potential anomalies and fraudulent activities. Meaningful outliers can be retrieved and analyzed by using domain knowledge. Domain knowledge (or background knowledge) is represented using probabilistic graphical models such as Bayesian belief networks. Bayesian networks are a graph-based representation used to model and encode mutual relationships between entities. Due to their probabilistic graphical nature, Belief Networks are an ideal way to capture the sensitivity, causal inference, uncertainty and background knowledge in real-world data sets. Bayesian Networks effectively present the causal relationships between different entities (nodes) using conditional probability. This probabilistic relationship shows the degree of belief between entities. A quantitative measure which computes changes in this degree of belief acts as a sensitivity measure. The first contribution of this research is enhancing the performance for measurement of sensitivity based on earlier research work, the Interestingness Filtering Engine Miner algorithm. The algorithm developed (IBOX - Interestingness based Bayesian outlier eXplainer) provides progressive improvement in the performance and sensitivity scoring of earlier works. Earlier approaches compute sensitivity by measuring divergence among conditional probabilities of training and test data, while using only a couple of probabilistic interestingness measures such as Mutual information and Support to calculate belief sensitivity. With ingrained support from the literature as well as quantitative evidence, IBOX provides a framework to use multiple interestingness measures resulting in better performance and improved sensitivity analysis. 
The results provide improved performance, and therefore explainability of rare class entities. This research quantitatively validated probabilistic interestingness measures as an effective sensitivity analysis technique in rare class mining. This results in a novel, original, and progressive research contribution to the areas of probabilistic graphical models and outlier analysis.
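To make the idea of "measuring divergence among conditional probabilities of training and test data" concrete, here is a minimal sketch. The abstract does not name the divergence used, so Kullback-Leibler divergence is an assumption, as are the function names and the two toy conditional probability tables; none of this is taken from IBOX itself.

```python
import math
from typing import Dict, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of states")
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def cpt_divergence(
    train_cpt: Dict[str, Sequence[float]], test_cpt: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Divergence of the test CPT from the training CPT, per parent state."""
    return {state: kl_divergence(test_cpt[state], train_cpt[state]) for state in train_cpt}


# Hypothetical tables for P(alert | amount), one row per parent state.
train = {"low": [0.9, 0.1], "high": [0.5, 0.5]}
test = {"low": [0.9, 0.1], "high": [0.1, 0.9]}

scores = cpt_divergence(train, test)
print(scores)
```

In this toy setting the "low" row is unchanged and scores zero, while the "high" row shifts and receives a positive score, which is the kind of change in degree of belief a sensitivity measure is meant to flag.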
75

Modèles graphiques discriminants pour l'étiquetage de séquences : application à la reconnaissance d'entités nommées radiophoniques / Discriminative graphical models for sequence labelling : application to named entity recognition in audio broadcast news

Zidouni, Azeddine 08 December 2010
Le traitement automatique des données complexes et variées est un processus fondamental dans les applications d'extraction d'information. L'explosion combinatoire dans la composition des textes journalistiques et l'évolution du vocabulaire rend la tâche d'extraction d'indicateurs sémantiques, tel que les entités nommées, plus complexe par les approches symboliques. Les modèles stochastiques structurels tel que les champs conditionnels aléatoires (CRF) permettent d'optimiser des systèmes d'extraction d'information avec une importante capacité de généralisation. La première contribution de cette thèse est consacrée à la définition du contexte optimal pour l'extraction des régularités entre les mots et les annotations dans la tâche de reconnaissance d'entités nommées. Nous allons intégrer diverses informations dans le but d'enrichir les observations et améliorer la qualité de prédiction du système. Dans la deuxième partie nous allons proposer une nouvelle approche d'adaptation d'annotations entre deux protocoles différents. Le principe de cette dernière est basé sur l'enrichissement d'observations par des données générées par d'autres systèmes. Ces travaux seront expérimentés et validés sur les données de la campagne ESTER. D'autre part, nous allons proposer une approche de couplage entre le niveau signal représenté par un indice de la qualité de voisement et le niveau sémantique. L'objectif de cette étude est de trouver le lien entre le degré d'articulation du locuteur et l'importance de son discours / Recent researches in Information Extraction are designed to extract fixed types of information from data. Sequence annotation systems are developed to associate structured annotations to input data presented in sequential form. The named entity recognition (NER) task consists of identifying and classifying every word in a document into some predefined categories such as person name, locations, organizations, and dates. 
The complexity of NER is largely related to the definition of the task and to the complexity of the relationships between words and their associated semantics. Our first contribution is devoted to solving the NER problem using discriminative graphical models. The proposed approach investigates the use of various contexts of the words to improve recognition. NER systems are fixed in accordance with a specific annotation protocol, so new applications must be developed for new protocols. The challenge is how to adapt an annotation system built for one specific application to another target application. In this work we propose an adaptation approach for the sequence labelling task based on annotation enrichment using conditional random fields (CRF). Experimental results show that the proposed approach outperforms a rule-based approach on the NER task. Finally, we propose a multimodal approach to NER that integrates low-level features as contextual information in radio broadcast news data. The objective of this study is to measure the correlation between the speaker's voicing quality and the importance of his speech.
76

Desenvolvimento de modelos de causalidade com informações de QTLs para estudo do relacionamento de caracteres fenotípicos relativos à absorção de fósforo em milho / Development of causal models with QTL information to the study of relationship among traits associated with phosphorus uptake in maize

Gianotto, Adriana Cheavegatti 26 March 2015
Metodologias de mapeamento de QTLs modernas empregam abordagem multivariada e se beneficiam da matriz de covariâncias fenotípicas para melhorar as estimativas de localização e efeitos de QTLs. No entanto, a correlação fenotípica pode ser em parte atribuída às relações de causalidade entre os fenótipos e mesmo as abordagens de mapeamento de QTLs multivariadas atuais têm desconsiderado tais relacionamentos. Dentre as metodologias científicas desenvolvidas para o estudo da causalidade em dados observacionais, destacam-se os modelos de equações estruturais e os modelos gráficos. Neste trabalho, foi estudado um conjunto de caracteres fenotípicos relacionados à morfologia de raízes, absorção de fósforo e acúmulo de biomassa em uma população composta de 145 linhagens endogâmicas recombinantes (RILs) do programa de melhoramento de milho da EMBRAPA Milho e Sorgo. O mapeamento de QTLs para os caracteres fenotípicos foi realizado utilizando mapeamento de múltiplos intervalos univariado (MIM) e multivariado (MT-MIM). A análise MIM revelou QTLs afetando diâmetro de raízes, área de superfície de raízes finas, peso seco da parte aérea e concentração de fósforo na parte aérea e nas raízes. A análise MT-MIM revelou 12 QTLs, com diferentes padrões de pleiotropia, com efeitos marginais para as sete variáveis analisadas. Um modelo de relacionamento causal entre os caracteres fenotípicos foi desenvolvido utilizando conhecimento prévio e modelagem de equações estruturais. O modelo de equações estruturais apresentou fluxo unidirecional de causalidade entre as variáveis, com as variáveis de morfologia de raízes exercendo efeito sobre as variáveis de acúmulo de biomassa, que por sua vez, têm efeito sobre as variáveis de absorção de fósforo. 
A aplicação do algoritmo PC para a descoberta de causalidade automatizada baseada nos padrões de independências condicionais não foi capaz de orientar todas as relações de causalidade descobertas, porém revelou um relacionamento mais complexo que o modelo de equações estruturais, com potenciais ciclos de retroalimentação causais. O emprego de algoritmos de descoberta de causalidade baseados em informações de QTLs, chamados QDG e QPSO, permitiu a orientação de todos os relacionamentos de causalidade encontrados pelo algoritmo PC e confirmou a existência de dois ciclos vizinhos de relacionamento causais entre as variáveis estudadas. Como regra geral, os QTLs pleiotrópicos detectados pela metodologia MT-MIM apresentaram efeitos sobre caracteres fenotípicos alinhados causalmente nos modelos propostos pelos algoritmos PC e QDG, sugerindo que alguns dos QTLs detectados são na realidade efeitos indiretos de QTLs situados em posição mais elevada no modelo causal. O emprego da abordagem MT-MIM aliada à análise de causalidade permitiu melhor compreensão da arquitetura genética dos caracteres de morfologia de raiz, acumulação de biomassa e aquisição de fósforo em milho. / Modern QTL mapping approaches are multivariate and take advantage of the phenotypic covariance matrix to improve estimates of QTL positions and effects. However, phenotypic correlation can also be assigned to the causal relationship among phenotypes, and even modern multivariate QTL analysis does not take these relationships into account. Structural equation models and graphical models are the main methodologies to study causality from observational data. We studied a set of phenotypes related to root morphology, biomass accumulation and phosphorus acquisition in maize. These phenotypes were measured in a maize population from the EMBRAPA breeding program composed of 145 recombinant inbred lines (RILs) derived from the crossing of two divergent lines for phosphorus acquisition efficiency. 
QTL mapping for the traits was performed using univariate (MIM) and multivariate (MT-MIM) multiple interval mapping. MIM analysis revealed QTL affecting root diameter, fine root surface area, shoot dry weight and root dry weight. MT-MIM analysis revealed 12 QTL with different pleiotropy patterns and QTL with marginal effects affecting all seven studied characters. A causal model for phenotype characters was developed using a priori knowledge and structural equation model techniques. The structural equation model presented a unidirectional causal flow among the variables, with root morphological traits exerting causal effects on biomass traits, which in turn affect phosphorus acquisition traits. The PC algorithm, used for an automatic search of causal models based on conditional independence, was not able to orient all discovered causal relationships among traits but revealed a more intricate relationship than the structural equation model, with potential causal feedback loops among the traits. Employing causal search algorithms based on QTL information (named QDG and QPSO) allowed the orientation of all causal relationships detected by the PC algorithm and also confirmed the presence of two neighboring causal cycles among the studied traits. As a general rule, pleiotropic QTL detected by the MT-MIM approach exerted effects on traits according to the causal model discovered by the PC and QDG algorithms, suggesting that some of the detected QTL effects were indirect effects of QTL located upstream in the proposed causal model. Combining the MT-MIM approach with causal analysis has allowed a better comprehension of the genetic architecture underlying root morphology, biomass accumulation and phosphorus acquisition traits in maize.
77

Learning probabilistic relational models: a novel approach. / Aprendendo modelos probabilísticos relacionais: uma nova abordagem.

Mormille, Luiz Henrique Barbosa 17 August 2018
While most statistical learning methods are designed to work with data stored in a single table, many large datasets are stored in relational database systems. Probabilistic Relational Models (PRM) extend Bayesian networks by introducing relations and individuals, thus making it possible to represent information in a relational database. However, learning a PRM from relational data is a more complex task than learning a Bayesian Network from "flat" data. The main difficulties that arise while learning a PRM are establishing which dependency structures are legal, searching for possible structures, and scoring them. This thesis focuses on the development of a novel approach to learn the structure of a PRM, describes a package in the R language to support the learning framework, and applies it to a real, large-scale scenario: the city of Atibaia, in the state of São Paulo, Brazil. The research is based on a database combining three different tables, each representing one class in the domain of study. The first table contains 27 attributes from 110,816 citizens of Atibaia. The second table contains 9 attributes from 20,162 companies located in the city. Finally, the third table has 8 attributes from 327 census sectors (small territorial units that comprise the city of Atibaia). The proposed framework is applied to learn a PRM structure and parameters from the database. The model is used to verify whether the social class of a person can be explained by the location where they live, their neighbors, and the companies nearby. Preliminary experiments were conducted and a paper was published in the 2017 Symposium on Knowledge Discovery, Mining and Learning (KDMiLe). The algorithm's performance was further evaluated by extensive experimentation, and a broader study using Serasa Experian data was conducted. Finally, the package in the R language that supports our method was refined along with proper documentation and a tutorial. 
/ Embora a maioria dos métodos de aprendizado estatístico tenha sido desenvolvida para se trabalhar com dados armazenados em uma única tabela, muitas bases de dados estão armazenadas em bancos de dados relacionais. Modelos Probabilísticos Relacionai (PRM) estendem Redes Bayesianas introduzindo relações e indivíduos, tornando possível a representação de informação em uma base de dados relacional. Entretanto, aprender um PRM através de dados relacionais é uma tarefa mais complexa que aprender uma Rede Bayesiana de uma única tabela. As maiores dificuldades que se impõe enquanto se aprende um PRM são estabelecer quais são as estruturas de dependência legais, procurar por possíveis estruturas, e avalia-las. Esta tese foca em desenvolver um novo método de aprendizado de estruturas de PRM, descrever um pacote na linguagem R que suporte este método e aplica-lo a um cenário real e de grande escala, a cidade de Atibaia, no estado de São Paulo, Brasil. Esta pesquisa está baseada em uma base de dados combinando três tabelas distintas, cada uma representando uma classe no domínio de estudo. A primeira tabela contém 27 atributos de 110.816 habitantes de Atibaia, e a segunda tabela contém 9 atributos de 20.162 empresas da cidade. Por fim, a terceira tabela possui 8 atributos para 327 setores censitários (pequenas unidades territoriais que formam a cidade de Atibaia). A proposta é aplicada para aprender-se a estrutura de um PRM e seus parâmetros através desta base de dados. O modelo foi utilizado para verificar se a classe social de uma pessoa pode ser explicada pelo local onde ela vive, seus vizinhos e as companhias próximas. Experimentos preliminares foram conduzidos e um artigo foi publicado no Symposium on Knowledge Discovery, Mining and Learning (KDMiLe). O desempenho do algoritmo foi reavaliada através de extensiva experimentação, e um estudo mais amplo foi conduzido com os dados da Serasa Experian. 
Por fim, o pacote em R que suporta o método proposto foi refinado, e documentação e tutorial apropriado foram descritos.
78

Apprentissage de structures musicales en contexte d'improvisation / Learning of musical structures in the context of improvisation

Déguernel, Ken 06 March 2018
Les systèmes actuels d’improvisation musicales sont capables de générer des séquences musicales unidimensionnelles par recombinaison du matériel musical. Cependant, la prise en compte de plusieurs dimensions (mélodie, harmonie...) et la modélisation de plusieurs niveaux temporels sont des problèmes difficiles. Dans cette thèse, nous proposons de combiner des approches probabilistes et des méthodes issues de la théorie des langages formels afin de mieux apprécier la complexité du discours musical à la fois d’un point de vue multidimensionnel et multi-niveaux dans le cadre de l’improvisation où la quantité de données est limitée. Dans un premier temps, nous présentons un système capable de suivre la logique contextuelle d’une improvisation représentée par un oracle des facteurs tout en enrichissant son discours musical à l’aide de connaissances multidimensionnelles représentées par des modèles probabilistes interpolés. Ensuite, ces travaux sont étendus pour modéliser l’interaction entre plusieurs musiciens ou entre plusieurs dimensions par un algorithme de propagation de croyance afin de générer des improvisations multidimensionnelles. Enfin, nous proposons un système capable d’improviser sur un scénario temporel avec des informations multi-niveaux représenté par une grammaire hiérarchique. Nous proposons également une méthode d’apprentissage pour l’analyse automatique de structures temporelles hiérarchiques. Tous les systèmes sont évalués par des musiciens et improvisateurs experts lors de sessions d’écoute / Current musical improvisation systems are able to generate unidimensional musical sequences by recombining their musical contents. However, considering several dimensions (melody, harmony...) and several temporal levels are difficult issues. 
In this thesis, we propose to combine probabilistic approaches with formal language theory in order to better assess the complexity of a musical discourse, both from a multidimensional and multi-level point of view in the context of improvisation where the amount of data is limited. First, we present a system able to follow the contextual logic of an improvisation modelled by a factor oracle whilst enriching its musical discourse with multidimensional knowledge represented by interpolated probabilistic models. Then, this work is extended to create another system using a belief propagation algorithm representing the interaction between several musicians, or between several dimensions, in order to generate multidimensional improvisations. Finally, we propose a system able to improvise on a temporal scenario with multi-level information modelled with a hierarchical grammar. We also propose a learning method for the automatic analysis of hierarchical temporal structures. Every system is evaluated by professional musicians and improvisers during listening sessions
79

Multi-label classification based on sum-product networks / Classificação multi-rótulo baseada em redes soma-produto

Llerena, Julissa Giuliana Villanueva 06 September 2017
Multi-label classification consists of learning a function that is capable of mapping an object to a set of relevant labels. It has applications such as the association of genes with biological functions, semantic classification of scenes and text categorization. Traditional classification (i.e., single-label) is therefore a particular case of multi-label classification in which each object is associated with exactly one label. A successful approach to constructing classifiers is to obtain a probabilistic model of the relation between object attributes and labels. This model can then be used to classify objects, finding the most likely prediction by computing the marginal probability or the most probable explanation (MPE) of the labels given the attributes. Depending on the family of probabilistic models chosen, such inferences may be intractable when the number of labels is large. Sum-Product Networks (SPN) are deep probabilistic models that allow tractable marginal inference. Nevertheless, as with many other probabilistic models, performing MPE inference is NP-hard. Although SPNs have already been used successfully for traditional classification tasks (i.e., single-label), there is no in-depth investigation on the use of SPNs for multi-label classification. In this work we investigate the use of SPNs for multi-label classification. We compare several algorithms for learning SPNs combined with different proposed approaches for classification. We show that SPN-based multi-label classifiers are competitive against state-of-the-art classifiers, such as Random k-Labelsets with Support Vector Machine and MPE inference on CutNets, in a collection of benchmark datasets. / A classificação Multi-Rótulo consiste em aprender uma função que seja capaz de mapear um objeto para um conjunto de rótulos relevantes. Ela possui aplicações como associação de genes com funções biológicas, classificação semântica de cenas e categorização de texto. 
A classificação tradicional, de rótulo único é, portanto, um caso particular da Classificação Multi-Rótulo, onde cada objeto está associado com exatamente um rótulo. Uma abordagem bem sucedida para classificação é obter um modelo probabilístico da relação entre atributos do objeto e rótulos. Esse modelo pode então ser usado para classificar objetos, encontrando a predição mais provável por meio da probabilidade marginal ou a explicação mais provável dos rótulos dados os atributos. Dependendo da família de modelos probabilísticos escolhidos, tais inferências podem ser intratáveis quando o número de rótulos é grande. As redes Soma-Produto (SPN, do inglês Sum Product Network) são modelos probabilísticos profundos, que permitem inferência marginal tratável. No entanto, como em muitos outros modelos probabilísticos, a inferência da explicação mais provável é NP-difícil. Embora SPNs já tenham sido usadas com sucesso para tarefas de classificação tradicionais, não existe investigação aprofundada no uso de SPNs para classificação Multi-Rótulo. Neste trabalho, investigamos o uso de SPNs para classificação Multi-Rótulo. Comparamos vários algoritmos de aprendizado de SPNs combinados com diferentes abordagens propostas para classificação. Mostramos que os classificadores Multi-Rótulos baseados em SPN são competitivos contra classificadores estado-da-arte, como Random k-Labelsets usando Máquinas de Suporte Vetorial e inferência exata da explicação mais provável em CutNets, em uma coleção de conjuntos de dados de referência.
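The two prediction rules mentioned in the abstract, thresholding per-label marginal probabilities and taking the most probable explanation (MPE), can disagree. The sketch below shows this on a tiny hand-written distribution over two binary labels. The numbers, the 0.5 threshold, and the function names are assumptions chosen for illustration; the distribution is enumerated directly and no Sum-Product Network is involved.

```python
from typing import Dict, List, Tuple

Labels = Tuple[int, ...]

# Hypothetical conditional distribution P(y1, y2 | x) for one object x.
joint: Dict[Labels, float] = {
    (0, 0): 0.4,
    (1, 0): 0.3,
    (1, 1): 0.3,
    (0, 1): 0.0,
}


def mpe_prediction(dist: Dict[Labels, float]) -> Labels:
    """Return the single most probable joint label assignment."""
    return max(dist, key=dist.get)


def label_marginals(dist: Dict[Labels, float]) -> List[float]:
    """Return P(y_i = 1 | x) for each label i by summing over the joint."""
    n_labels = len(next(iter(dist)))
    return [
        sum(p for labels, p in dist.items() if labels[i] == 1)
        for i in range(n_labels)
    ]


def marginal_prediction(dist: Dict[Labels, float], threshold: float = 0.5) -> Labels:
    """Predict each label independently by thresholding its marginal."""
    return tuple(int(m > threshold) for m in label_marginals(dist))


print(mpe_prediction(joint))       # joint mode
print(marginal_prediction(joint))  # per-label decisions
```

Here the joint mode is the empty label set, while the first label has marginal probability above one half, so the two rules return different label sets. Brute-force enumeration like this grows exponentially with the number of labels, which is why the tractability of each kind of inference matters.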
80

Métodos Bayesianos aplicados em taxonomia molecular / Bayesian methods applied in molecular taxonomy

Edwin Rafael Villanueva Talavera 31 August 2007
Neste trabalho são apresentados dois métodos de agrupamento de dados visados para aplicações em taxonomia molecular. Estes métodos estão baseados em modelos probabilísticos, o que permite superar alguns problemas apresentados nos métodos não probabilísticos existentes, como a dificuldade na escolha da métrica de distância e a falta de tratamento e aproveitamento do conhecimento a priori disponível. Os métodos apresentados combinam por meio do teorema de Bayes a informação extraída dos dados com o conhecimento a priori que se dispõe, razão pela qual são denominados métodos Bayesianos. O primeiro método, método de agrupamento hierárquico Bayesiano, está baseado no algoritmo HBC (Hierarchical Bayesian Clustering). Este método constrói uma hierarquia de partições (dendrograma) baseado no critério da máxima probabilidade a posteriori de cada partição. O segundo método é baseado em um tipo de modelo gráfico probabilístico conhecido como redes Gaussianas condicionais, o qual foi adaptado para problemas de agrupamento. Ambos métodos foram avaliados em três bancos de dados nos quais se conhece o rótulo da classe. Os métodos foram usados também em um problema de aplicação real: a taxonomia de uma coleção brasileira de estirpes de bactérias do gênero Bradyrhizobium (conhecidas por sua capacidade de fixar o N₂ do ar no solo). Este banco de dados é composto por dados genotípicos resultantes da análise do RNA ribossômico. Os resultados mostraram que o método hierárquico Bayesiano gera dendrogramas de boa qualidade, em alguns casos superior que o melhor dos algoritmos hierárquicos analisados. O método baseado em redes gaussianas condicionais também apresentou resultados aceitáveis, mostrando um adequado aproveitamento do conhecimento a priori sobre as classes tanto na determinação do número ótimo de grupos, quanto no melhoramento da qualidade dos agrupamentos. / This work presents two clustering methods intended for applications in molecular taxonomy. 
These methods are based on probabilistic models, which overcome some problems observed in traditional clustering methods, such as the difficulty of choosing a distance metric or the lack of treatment of available prior information. The proposed methods use Bayes' theorem to combine the information in the data with the available prior information, which is why they are called Bayesian methods. The first method implemented in this work is hierarchical Bayesian clustering, an agglomerative hierarchical method that constructs a hierarchy of partitions (dendrogram) guided by the criterion of maximum Bayesian posterior probability of the partition. The second method is based on a type of probabilistic graphical model known as a conditional Gaussian network, which was adapted for data clustering. Both methods were validated on three datasets where the labels are known. The methods were also applied to a real problem: the clustering of a Brazilian collection of bacterial strains belonging to the genus Bradyrhizobium, known for their capacity to transform atmospheric nitrogen (N₂) into nitrogen compounds useful for the host plants. This dataset consists of genetic data resulting from the analysis of ribosomal RNA. The results showed that the hierarchical Bayesian clustering method built dendrograms of good quality, in some cases better than those of the other hierarchical methods. The method based on conditional Gaussian networks also produced acceptable results, showing adequate use of the prior information about the clusters to determine the optimal number of clusters and to improve the quality of the groups.
