1 |
Algorithms for Gene Clustering Analysis on GenomesYi, Gang Man 2011 May 1900 (has links)
The increased availability of data in biological databases provides many opportunities for understanding biological processes through these data. As recent attention has shifted from sequence analysis to higher-level analysis of genes across multiple genomes, there is a need to develop efficient algorithms for these large-scale applications that can help us understand the functions of genes.
The overall objective of my research was to develop improved methods which can automatically assign groups of functionally related genes in large-scale data sets by applying new gene clustering algorithms. Proposed gene clustering algorithms that can help us understand gene function and genome evolution include new algorithms
for protein family classification, a window-based strategy for gene clustering on chromosomes, and an exhaustive strategy that allows all clusters of small size to be enumerated. I investigate the problems of gene clustering in multiple genomes, and define gene clustering problems using mathematical methodology and solve the problems by developing efficient and effective algorithms.
For protein family classification, I developed two supervised classification algorithms that can assign proteins to existing protein families in public databases and, by taking into account similarities between the unclassified proteins, allows for progressive construction of new families from proteins that cannot be assigned. This approach is useful for rapid assignment of protein sequences from genome sequencing projects to protein families. A comparative analysis of the method to other previously developed methods shows that the algorithm has a higher accuracy rate and lower mis-classification rate when compared to algorithms that are based on the use of multiple sequence alignments and hidden Markov models. The proposed algorithm performs well even on families with very few proteins and on families with low sequence similarity.
Apart from the analysis of individual sequences, identifying genomic regions that descended from a common ancestor helps us study gene function and genome evolution. In distantly related genomes, clusters of homologous gene pairs serve as evidence used in function prediction, operon detection, etc. Thus, reliable identification of gene clusters is critical to functional annotation and analysis of genes. I developed an efficient gene clustering algorithm that can be applied on hundreds of genomes at the same time. This approach allows for large-scale study of evolutionary relationships
of gene clusters and study of operon formation and destruction. By placing a stricter limit on the maximum cluster size, I developed another algorithm that uses a different formulation based on constraining the overall size of a cluster and statistical estimates that allow direct comparisons of clusters of different size. A comparative analysis of proposed algorithms shows that more biological insight can be obtained by analyzing gene clusters across hundreds of genomes, which can help us understand operon occurrences, gene orientations and gene rearrangements.
|
2 |
An algorithm for identifying clusters of functionally related genes in genomesYi, Gang Man 15 May 2009 (has links)
An increasing body of literature shows that genomes of eukaryotes can contain
clusters of functionally related genes. Most approaches to identify gene clusters utilize
microarray data or metabolic pathway databases to find groups of genes on chromo-
somes that are linked by common attributes. A generalized method that can find
gene clusters, regardless of the mechanism of origin, would provide researchers with
an unbiased method for finding clusters and studying the evolutionary forces that
give rise to them.
I present a basis of algorithm to identify gene clusters in eukaryotic genomes
that utilizes functional categories defined in graph-based vocabularies such as the
Gene Ontology (GO). Clusters identified in this manner need only have a common
function and are not constrained by gene expression or other properties. I tested the
algorithm by analyzing genomes of a representative set of species. I identified species
specific variation in percentage of clustered genes as well as in properties of gene
clusters, including size distribution and functional annotation. These properties may
be diagnostic of the evolutionary forces that lead to the formation of gene clusters.
The approach finds all gene clusters in the data set and ranks them by their likelihood
of occurrence by chance. The method successfully identified clusters.
|
3 |
Genetic Algorithm Application to Queuing Network and Gene-Clustering ProblemsHourani, Mouin 25 February 2004 (has links)
No description available.
|
4 |
LARGE-SCALE MICROARRAY DATA ANALYSIS USING GPU- ACCELERATED LINEAR ALGEBRA LIBRARIESZhang, Yun 01 August 2012 (has links)
The biological datasets produced as a result of high-throughput genomic research such as specifically microarrays, contain vast amounts of knowledge for entire genome and their expression affiliations. Gene clustering from such data is a challenging task due to the huge data size and high complexity of the algorithms as well as the visualization needs. Most of the existing analysis methods for genome-wide gene expression profiles are sequential programs using greedy algorithms and require subjective human decision. Recently, Zhu et al. proposed a parallel Random matrix theory (RMT) based approach for generating transcriptional networks, which is much more resistant to high level of noise in the data [9] without human intervention. Nowadays GPUs are designed to be used more efficiently for general purpose computing [1] and are vastly superior to CPUs [6] in terms of threading performance. Our kernel functions running on GPU utilizes the functions from both the libraries of Compute Unified Basic Linear Algebra Subroutines (CUBLAS) and Compute Unified Linear Algebra (CULA) which implements the Linear Algebra Package (LAPACK). Our experiment results show that GPU program can achieve an average speed-up of 2~3 times for some simulated datasets.
|
5 |
Alternative Approach to Dose-Response Modeling of Toxicogenomic Data with an Application in Risk Assessment of Engineered NanomaterialsDavidson, Sarah E. 04 October 2021 (has links)
No description available.
|
6 |
Building a History of Horizontal Gene Transfer in E. ColiWilber, Matthew 01 January 2016 (has links)
Bacteria's ability to pass entire genes between one another, a process called Horizontal Gene Transfer (HGT), has a major impact on bacterial evolution. In an ongoing project at Harvey Mudd, computational methods have been used to catalogue the HGT events that have impacted a group of closely related bacteria.
This thesis builds on that project, by improving our ability to identify gene families --- groups of genes in different strains that are related. Previously, similarity was measured only by comparing two genes' DNA sequences, ignoring their positions on the organism's DNA. Here, we leverage genes' relative position to make a better measurement of gene similarity. These improved similarity measurements will improve the existing pipeline's ability to identify HGT events.
|
7 |
Dynamic machine learning for supervised and unsupervised classification / Apprentissage automatique dynamique pour la classification supervisée et non superviséeSîrbu, Adela-Maria 06 June 2016 (has links)
La direction de recherche que nous abordons dans la thèse est l'application des modèles dynamiques d'apprentissage automatique pour résoudre les problèmes de classification supervisée et non supervisée. Les problèmes particuliers que nous avons décidé d'aborder dans la thèse sont la reconnaissance des piétons (un problème de classification supervisée) et le groupement des données d'expression génétique (un problème de classification non supervisée). Les problèmes abordés sont représentatifs pour les deux principaux types de classification et sont très difficiles, ayant une grande importance dans la vie réelle. La première direction de recherche que nous abordons dans le domaine de la classification non supervisée dynamique est le problème de la classification dynamique des données d'expression génétique. L'expression génétique représente le processus par lequel l'information d'un gène est convertie en produits de gènes fonctionnels : des protéines ou des ARN ayant différents rôles dans la vie d'une cellule. La technologie des micro-réseaux moderne est aujourd'hui utilisée pour détecter expérimentalement les niveaux d'expression de milliers de gènes, dans des conditions différentes et au fil du temps. Une fois que les données d'expression génétique ont été recueillies, l'étape suivante consiste à analyser et à extraire des informations biologiques utiles. L'un des algorithmes les plus populaires traitant de l'analyse des données d'expression génétique est le groupement, qui consiste à diviser un certain ensemble en groupes, où les composants de chaque groupe sont semblables les uns aux autres données. Dans le cas des ensembles de données d'expression génique, chaque gène est représenté par ses valeurs d'expression (caractéristiques), à des points distincts dans le temps, dans les conditions contrôlées. Le processus de regroupement des gènes est à la base des études génomiques qui visent à analyser les fonctions des gènes car il est supposé que les gènes qui sont similaires dans leurs niveaux d'expression sont également relativement similaires en termes de fonction biologique. Le problème que nous abordons dans le sens de la recherche de classification non supervisée dynamique est le regroupement dynamique des données d'expression génique. Dans notre cas, la dynamique à long terme indique que l'ensemble de données ne sont pas statiques, mais elle est sujette à changement. Pourtant, par opposition aux approches progressives de la littérature, où l'ensemble de données est enrichie avec de nouveaux gènes (instances) au cours du processus de regroupement, nos approches abordent les cas lorsque de nouvelles fonctionnalités (niveaux d'expression pour de nouveaux points dans le temps) sont ajoutés à la gènes déjà existants dans l'ensemble de données. À notre connaissance, il n'y a pas d'approches dans la littérature qui traitent le problème de la classification dynamique des données d'expression génétique, définis comme ci-dessus. Dans ce contexte, nous avons introduit trois algorithmes de groupement dynamiques que sont capables de gérer de nouveaux niveaux d'expression génique collectés, en partant d'une partition obtenue précédente, sans la nécessité de ré-exécuter l'algorithme à partir de zéro. L'évaluation expérimentale montre que notre méthode est plus rapide et plus précis que l'application de l'algorithme de classification à partir de zéro sur la fonctionnalité étendue ensemble de données... / The research direction we are focusing on in the thesis is applying dynamic machine learning models to salve supervised and unsupervised classification problems. We are living in a dynamic environment, where data is continuously changing and the need to obtain a fast and accurate solution to our problems has become a real necessity. The particular problems that we have decided te approach in the thesis are pedestrian recognition (a supervised classification problem) and clustering of gene expression data (an unsupervised classification. problem). The approached problems are representative for the two main types of classification and are very challenging, having a great importance in real life.The first research direction that we approach in the field of dynamic unsupervised classification is the problem of dynamic clustering of gene expression data. Gene expression represents the process by which the information from a gene is converted into functional gene products: proteins or RNA having different roles in the life of a cell. Modern microarray technology is nowadays used to experimentally detect the levels of expressions of thousand of genes, across different conditions and over time. Once the gene expression data has been gathered, the next step is to analyze it and extract useful biological information. One of the most popular algorithms dealing with the analysis of gene expression data is clustering, which involves partitioning a certain data set in groups, where the components of each group are similar to each other. In the case of gene expression data sets, each gene is represented by its expression values (features), at distinct points in time, under the monitored conditions. The process of gene clustering is at the foundation of genomic studies that aim to analyze the functions of genes because it is assumed that genes that are similar in their expression levels are also relatively similar in terms of biological function.The problem that we address within the dynamic unsupervised classification research direction is the dynamic clustering of gene expression data. In our case, the term dynamic indicates that the data set is not static, but it is subject to change. Still, as opposed to the incremental approaches from the literature, where the data set is enriched with new genes (instances) during the clustering process, our approaches tackle the cases when new features (expression levels for new points in time) are added to the genes already existing in the data set. To our best knowledge, there are no approaches in the literature that deal with the problem of dynamic clustering of gene expression data, defined as above. In this context we introduced three dynamic clustering algorithms which are able to handle new collected gene expression levels, by starting from a previous obtained partition, without the need to re-run the algorithm from scratch. Experimental evaluation shows that our method is faster and more accurate than applying the clustering algorithm from scratch on the feature extended data set...
|
8 |
Estudo de coeficientes de correlação para medidas de proximidade em dados de expressão gênica / A study of correlation coefficients as proximity measures for gene expression dataJaskowiak, Pablo Andretta 02 March 2011 (has links)
O desenvolvimento da tecnologia de microarray tornou possível a mediçao dos níveis de expressão de centenas ou até mesmo milhares de genes simultaneamente para diversas condições experimentais. A grande quantidade de dados disponível gerou a demanda por métodos computacionais que permitam sua análise de forma eficiente e automatizada. Em muitos dos métodos computacionais empregados durante a análise de dados de expressão gênica é necessária a escolha de uma medida de proximidade apropriada entre genes ou amostras. Dentre as medidas de proximidade disponíveis, coeficientes de correlação têm sido amplamente empregados, em virtude da sua capacidade em capturar similaridades entre tendências das sequências numéricas comparadas (genes ou amostras). O presente trabalho possui como objetivo comparar diferentes medidas de correlação para as três principais tarefas envolvidas na análise de dados de expressão gênica: agrupamento, seleção de atributos e classificação. Dessa forma, é apresentada nesta dissertação uma visão geral da análise de dados de expressão gênica e das diferentes medidas de correlação consideradas para tal comparação. São apresentados também resultados empíricos obtidos a partir da comparação dos coeficientes de correlação para agrupamento de genes, agrupamento de amostras, seleção de genes para o problema de classificação de amostras e classificação de amostras / The development of microarray technology made possible the expression level measurement of hundreds or even thousands of genes simultaneously for various experimental conditions. The huge amount of available data generated the need for computational methods that allow its analysis in an effcient and automated way. In many of the computational methods employed during gene expression data analysis the choice of a proximity measure is necessary. Among the proximity measures available, correlation coefficients have been widely employed because of their ability to capture similarity trends among the compared numeric sequences (genes or samples). The present work has as objective to compare different correlation measures for the three major tasks involved in the analysis of gene expression data: clustering, feature selection and classification. To this extent, in this dissertation an overview of gene expression data analysis and the different correlation measures considered for this comparison are presented. In the present work are also presented empirical results obtained from the comparison of correlation coefficients for gene clustering, sample clustering, gene selection for sample classification and sample classification
|
9 |
Estudo de coeficientes de correlação para medidas de proximidade em dados de expressão gênica / A study of correlation coefficients as proximity measures for gene expression dataPablo Andretta Jaskowiak 02 March 2011 (has links)
O desenvolvimento da tecnologia de microarray tornou possível a mediçao dos níveis de expressão de centenas ou até mesmo milhares de genes simultaneamente para diversas condições experimentais. A grande quantidade de dados disponível gerou a demanda por métodos computacionais que permitam sua análise de forma eficiente e automatizada. Em muitos dos métodos computacionais empregados durante a análise de dados de expressão gênica é necessária a escolha de uma medida de proximidade apropriada entre genes ou amostras. Dentre as medidas de proximidade disponíveis, coeficientes de correlação têm sido amplamente empregados, em virtude da sua capacidade em capturar similaridades entre tendências das sequências numéricas comparadas (genes ou amostras). O presente trabalho possui como objetivo comparar diferentes medidas de correlação para as três principais tarefas envolvidas na análise de dados de expressão gênica: agrupamento, seleção de atributos e classificação. Dessa forma, é apresentada nesta dissertação uma visão geral da análise de dados de expressão gênica e das diferentes medidas de correlação consideradas para tal comparação. São apresentados também resultados empíricos obtidos a partir da comparação dos coeficientes de correlação para agrupamento de genes, agrupamento de amostras, seleção de genes para o problema de classificação de amostras e classificação de amostras / The development of microarray technology made possible the expression level measurement of hundreds or even thousands of genes simultaneously for various experimental conditions. The huge amount of available data generated the need for computational methods that allow its analysis in an effcient and automated way. In many of the computational methods employed during gene expression data analysis the choice of a proximity measure is necessary. Among the proximity measures available, correlation coefficients have been widely employed because of their ability to capture similarity trends among the compared numeric sequences (genes or samples). The present work has as objective to compare different correlation measures for the three major tasks involved in the analysis of gene expression data: clustering, feature selection and classification. To this extent, in this dissertation an overview of gene expression data analysis and the different correlation measures considered for this comparison are presented. In the present work are also presented empirical results obtained from the comparison of correlation coefficients for gene clustering, sample clustering, gene selection for sample classification and sample classification
|
10 |
ON APPLICATIONS OF STATISTICAL LEARNING TO BIOPHYSICSCAO, BAOQIANG 03 April 2007 (has links)
No description available.
|
Page generated in 0.1092 seconds