11 |
Discovering Protein Sequence-Structure Motifs and Two Applications to Structural PredictionTang, Thomas Cheuk Kai January 2004 (has links)
This thesis investigates the correlations between short protein peptide sequences and local tertiary structures. In particular, it introduces a novel algorithm for partitioning short protein segments into clusters of local sequence-structure motifs, and demonstrates that these motif clusters contain useful structural information via two applications to structural prediction. The first application utilizes motif clusters to predict local protein tertiary structures. A novel dynamic programming algorithm that performs comparably with some of the best existing algorithms is described. The second application exploits the capability of motif clusters in recognizing regular secondary structures to improve the performance of secondary structure prediction based on Support Vector Machines. Empirical results show significant improvement in overall prediction accuracy with no performance degradation in any specific aspect being measured. The encouraging results obtained illustrate the great potential of using local sequence-structure motifs to tackle protein structure predictions and possibly other important problems in computational biology.
|
12 |
Protein-DNA Binding: Discovering Motifs and Distinguishing Direct from Indirect InteractionsGordan, Raluca Mihaela January 2009 (has links)
<p>The initiation of two major processes in the eukaryotic cell, gene transcription and DNA replication, is regulated largely through interactions between proteins or protein complexes and DNA. Although a lot is known about the interacting proteins and their role in regulating transcription and replication, the specific DNA binding motifs of many regulatory proteins and complexes are still to be determined. For this purpose, many computational tools for DNA motif discovery have been developed in the last two decades. These tools employ a variety of strategies, from exhaustive search to sampling techniques, with the hope of finding over-represented motifs in sets of co-regulated or co-bound sequences. Despite the variety of computational tools aimed at solving the problem of motif discovery, their ability to correctly detect known DNA motifs is still limited. The motifs are usually short and many times degenerate, which makes them difficult to distinguish from genomic background. We believe the most efficient strategy for improving the performance of motif discovery is not to use increasingly complex computational and statistical methods and models, but to incorporate more of the biology into the computational techniques, in a principled manner. To this end, we propose a novel motif discovery algorithm: PRIORITY. Based on a general Gibbs sampling framework, PRIORITY has a major advantage over other motif discovery tools: it can incorporate different types of biological information (e.g., nucleosome positioning information) to guide the search for DNA binding sites toward regions where these sites are more likely to occur (e.g., nucleosome-free regions). </p><p>We use transcription factor (TF) binding data from yeast chromatin immunoprecipitation (ChIP-chip) experiments to test the performance of our motif discovery algorithm when incorporating three types of biological information: information about nucleosome positioning, information about DNA double-helical stability, and evolutionary conservation information. In each case, incorporating additional biological information has proven very useful in increasing the accuracy of motif finding, with the number of correctly identified motifs increasing with up to 52%. PRIORITY is not restricted to TF binding data. In this work, we also analyze origin recognition complex (ORC) binding data and show that PRIORITY can utilize DNA structural information to predict the binding specificity of the yeast ORC. </p><p>Despite the improvement obtained using additional biological information, the success of motif discovery algorithms in identifying known motifs is still limited, especially when applied to sequences bound in vivo (such as those of ChIP-chip) because the observed protein-DNA interactions are not necessarily direct. Some TFs associate with DNA only indirectly via protein partners, while others exhibit both direct and indirect binding. We propose a novel method to distinguish between direct and indirect TF-DNA interactions, integrating in vivo TF binding data, in vivo nucleosome occupancy data, and in vitro motifs from protein binding microarrays. When applied to yeast ChIP-chip data, our method reveals that only 48% of the ChIP-chip data sets can be readily explained by direct binding of the profiled TF, while 16% can be explained by indirect DNA binding. In the remaining 36%, we found that none of the motifs used in our analysis was able to explain the ChIP-chip data, either because the data was too noisy or because the set of motifs was incomplete. As more in vitro motifs become available, our method can be used to build a complete catalog of direct and indirect TF-DNA interactions.</p> / Dissertation
|
13 |
Motif representation and discoveryCarvalho, A.M. 01 July 2011 (has links) (PDF)
An important part of gene regulation is mediated by specific proteins, called transcription factors, which influence the transcription of a particular gene by binding to specific sites on DNA sequences, called transcription factor binding sites (TFBS) or, simply, motifs. Such binding sites are relatively short segments of DNA, normally 5 to 25 nucleotides long, over- represented in a set of co-regulated DNA sequences. There are two different problems in this setup: motif representation, accounting for the model that describes the TFBS's; and motif discovery, focusing in unravelling TFBS's from a set of co-regulated DNA sequences. This thesis proposes a discriminative scoring criterion that culminates in a discriminative mixture of Bayesian networks to distinguish TFBS's from the background DNA. This new probabilistic model supports further evidence in non-additivity among binding site positions, providing a superior discriminative power in TFBS's detection. On the other hand, extra knowledge carefully selected from the literature was incorporated in TFBS discovery in order to capture a variety of characteristics of the TFBS's patterns. This extra knowledge was combined during the process of motif discovery leading to results that are considerably more accurate than those achieved by methods that rely in the DNA sequence alone.
|
14 |
Αλγόριθμοι διαχείρισης και ανάλυσης ακολουθιών βιολογικών δεδομένων με εφαρμογή σε προβλήματα βιοπληροφορικής / Algorithms for the analysis of biological sequences with application on bioinformatics problemsΠερδικούρη, Αικατερίνη 26 February 2009 (has links)
Αντικείμενο της παρούσας διδακτορικής διατριβής είναι η μελέτη και η σχεδίαση αποδοτικών αλγορίθμων για τη διαχείριση και ανάλυση ακολουθιών βιολογικών δεδομένων. Οι αλγόριθμοι που θα περιγράψουμε εφαρμόζονται σε προβλήματα Βιοπληροφορικής, όπως η αναγνώριση γνωστών ή άγνωστων μοτίβων του DNA και RNA, που εμπλέκονται σε ποικίλες βιολογικές διεργασίες καθώς και η ανακάλυψη περιοδικοτήτων.
Ειδικότερα οι αλγόριθμοι που θα παρουσιάσουμε χρησιμοποιούνται για την ανάλυση Βιολογικών Ακολουθιών με “αδιάφορους χαρακτήρες” και Βιολογικών Ακολουθιών με Βάρη. Οι Βιολογικές Ακολουθίες με “αδιάφορους χαρακτήρες” αναπαριστούν συνήθως οικογένειες πρωτεϊνών ενώ οι Βιολογικές Ακολουθίες με βάρη αναπαριστούν συναρμολογούμένες ακολουθίες γονιδιωμάτων που έχουν πρόσφατα αλληλουχηθεί.
Στις Βιολογικές Ακολουθίες με αδιάφορους χαρακτήρες παρουσιάζουμε δυο αποδοτικούς αλγορίθμους γραμμικού χρόνου για τον υπολογισμό της περιόδου και τον υπολογισμό του καλύμματος. Ο δεύτερος αλγόριθμος εφαρμόζεται και σε κυκλικά (circular DNAs).
Στις Βιολογικές Ακολουθίες με βάρη παρουσιάζουμε δυο αλγορίθμους για τον υπολογισμό των βασικών περιοδικοτήτων: της περιόδου και του καλύμματος ενώ επιλύουμε και το πρόβλημα της εύρεσης προτύπου. Η ανάγκη για αποδοτική διαχείριση βιολογικών ακολουθιών με βάρη μας ώθησε να εισάγουμε μια νέα αποδοτική δομή η οποία επιλύει αποδοτικά τα 2 προηγούμενα προβλήματα. Η δομή αυτή ονομάζεται Δέντρο Επιθεμάτων με Βάρη. Χρησιμοποιώντας το Δέντρο Επιθεμάτων με Βάρη επιλύουμε διάφορες παραλλαγές του προβλήματος εξαγωγής μοτίβων από Βιολογικές Ακολουθίες με Βάρη.
Τέλος αποφασίσαμε να μελετήσουμε τη χρήση των Γενετικών Αλγορίθμων και του Εξελικτικού Προγραμματισμού στην ανάλυση ακολουθιών βιολογικών δεδομένων. Αποτέλεσμα αυτής της μελέτης είναι η περιγραφή ενός γενετικού αλγορίθμου που υπολογίζει τις επαναλήψεις σε μια βιολογική ακολουθία. / The object of this doctoral thesis is the study and the design of efficient algorithms for the analysis of sequences of biological data. The algorithms that we describe have application on Bioinformatics problems, such as the recognition of known or unknown patterns in DNA and RNA that are involved in various biological activities, as well as the discovery of periodicities.
More specifically the algorithms that we present are used for the analysis of Biological Sequences with “don't care characters”' and Weighted Biological Sequences. Biological Sequences with “don't care characters”, usually represent protein families while Weighted Biological Sequences represent assembled sequences of genomes that they have been recently sequenced.
In Biological Sequences with “don't care characters”' we present two efficient algorithms of linear time for the computation of the period and the cover. The second algorithm is also applied in circular DNAs .
In Weighted Biological Sequences we present two algorithms for the computation of basic periodicities: the period and the cover, while we also solve the problem of pattern matching. The need for efficient management of biological sequences with weights prompted us to introduce a new efficient data structure which solves efficiently the two precedents problems. This structure is named Weighted Suffix Tree. Using the Weighted Suffix Tree we solve various instances of the motif discovery problem in Biological Weighted Sequences.
Finally we decided to study the use of Genetic Algorithms and Evolutionary Programming in the analysis of biological sequences. The result of this study is the description of a genetic algorithm that computes the repetitions in a biological sequence.
|
15 |
Scalable and explainable self-supervised motif discovery in temporal dataBakhtiari Ramezani, Somayeh 08 December 2023 (has links) (PDF)
The availability of a scalable and explainable rule extraction technique via motif discovery is crucial for identifying the health states of a system. Such a technique can enable the creation of a repository of normal and abnormal states of the system and identify the system’s state as we receive data. In complex systems such as ECG, each activity session can consist of a long sequence of motifs that form different global structures. As a result, applying machine learning algorithms without first identifying the local patterns is not feasible and would result in low performance. Thus, extracting unique local motifs and establishing a database of prototypes or signatures is a crucial first step in analyzing long temporal data that reduces the computational cost and overcomes imbalanced data. The present research aims to streamline the extraction of motifs and add explainability to their analysis by identifying their differences. We have developed a novel framework for unsupervised motif extraction. We also offer a robust algorithm to identify unique motifs and their signatures, coupled with a proper distance metric to compare the signatures of partially similar motifs. Defining such distance metrics allows us to assign a degree of semblance between two motifs that may have different lengths or contain noise. We have tested our framework against five different datasets and observed excellent results, including extraction of motifs from 100 million samples in 8.02 seconds, 99.90% accuracy in self-supervised ECG data classification, and an average error of 16.66% in RUL prediction of bearing failure.
|
16 |
Soluções aproximadas para algoritmos escaláveis de mineração de dados em domínios de dados complexos usando GPGPU / On approximate solutions to scalable data mining algorithms for complex data problems using GPGPUMamani, Alexander Victor Ocsa 22 September 2011 (has links)
A crescente disponibilidade de dados em diferentes domínios tem motivado o desenvolvimento de técnicas para descoberta de conhecimento em grandes volumes de dados complexos. Trabalhos recentes mostram que a busca em dados complexos é um campo de pesquisa importante, já que muitas tarefas de mineração de dados, como classificação, detecção de agrupamentos e descoberta de motifs, dependem de algoritmos de busca ao vizinho mais próximo. Para resolver o problema da busca dos vizinhos mais próximos em domínios complexos muitas abordagens determinísticas têm sido propostas com o objetivo de reduzir os efeitos da maldição da alta dimensionalidade. Por outro lado, algoritmos probabilísticos têm sido pouco explorados. Técnicas recentes relaxam a precisão dos resultados a fim de reduzir o custo computacional da busca. Além disso, em problemas de grande escala, uma solução aproximada com uma análise teórica sólida mostra-se mais adequada que uma solução exata com um modelo teórico fraco. Por outro lado, apesar de muitas soluções exatas e aproximadas de busca e mineração terem sido propostas, o modelo de programação em CPU impõe restrições de desempenho para esses tipos de solução. Uma abordagem para melhorar o tempo de execução de técnicas de recuperação e mineração de dados em várias ordens de magnitude é empregar arquiteturas emergentes de programação paralela, como a arquitetura CUDA. Neste contexto, este trabalho apresenta uma proposta para buscas kNN de alto desempenho baseada numa técnica de hashing e implementações paralelas em CUDA. A técnica proposta é baseada no esquema LSH, ou seja, usa-se projeções em subespac¸os. O LSH é uma solução aproximada e tem a vantagem de permitir consultas de custo sublinear para dados em altas dimensões. Usando implementações massivamente paralelas melhora-se tarefas de mineração de dados. Especificamente, foram desenvolvidos soluções de alto desempenho para algoritmos de descoberta de motifs baseados em implementações paralelas de consultas kNN. As implementações massivamente paralelas em CUDA permitem executar estudos experimentais sobre grandes conjuntos de dados reais e sintéticos. A avaliação de desempenho realizada neste trabalho usando GeForce GTX470 GPU resultou em um aumento de desempenho de até 7 vezes, em média sobre o estado da arte em buscas por similaridade e descoberta de motifs / The increasing availability of data in diverse domains has created a necessity to develop techniques and methods to discover knowledge from huge volumes of complex data, motivating many research works in databases, data mining and information retrieval communities. Recent studies have suggested that searching in complex data is an interesting research field because many data mining tasks such as classification, clustering and motif discovery depend on nearest neighbor search algorithms. Thus, many deterministic approaches have been proposed to solve the nearest neighbor search problem in complex domains, aiming to reduce the effects of the well-known curse of dimensionality. On the other hand, probabilistic algorithms have been slightly explored. Recently, new techniques aim to reduce the computational cost relaxing the quality of the query results. Moreover, in large-scale problems, an approximate solution with a solid theoretical analysis seems to be more appropriate than an exact solution with a weak theoretical model. On the other hand, even though several exact and approximate solutions have been proposed, single CPU architectures impose limits on performance to deliver these kinds of solution. An approach to improve the runtime of data mining and information retrieval techniques by an order-of-magnitude is to employ emerging many-core architectures such as CUDA-enabled GPUs. In this work we present a massively parallel kNN query algorithm based on hashing and CUDA implementation. Our method, based on the LSH scheme, is an approximate method which queries high-dimensional datasets with sub-linear computational time. By using the massively parallel implementation we improve data mining tasks, specifically we create solutions for (soft) realtime time series motif discovery. Experimental studies on large real and synthetic datasets were carried out thanks to the highly CUDA parallel implementation. Our performance evaluation on GeForce GTX 470 GPU resulted in average runtime speedups of up to 7x on the state-of-art of similarity search and motif discovery solutions
|
17 |
Soluções aproximadas para algoritmos escaláveis de mineração de dados em domínios de dados complexos usando GPGPU / On approximate solutions to scalable data mining algorithms for complex data problems using GPGPUAlexander Victor Ocsa Mamani 22 September 2011 (has links)
A crescente disponibilidade de dados em diferentes domínios tem motivado o desenvolvimento de técnicas para descoberta de conhecimento em grandes volumes de dados complexos. Trabalhos recentes mostram que a busca em dados complexos é um campo de pesquisa importante, já que muitas tarefas de mineração de dados, como classificação, detecção de agrupamentos e descoberta de motifs, dependem de algoritmos de busca ao vizinho mais próximo. Para resolver o problema da busca dos vizinhos mais próximos em domínios complexos muitas abordagens determinísticas têm sido propostas com o objetivo de reduzir os efeitos da maldição da alta dimensionalidade. Por outro lado, algoritmos probabilísticos têm sido pouco explorados. Técnicas recentes relaxam a precisão dos resultados a fim de reduzir o custo computacional da busca. Além disso, em problemas de grande escala, uma solução aproximada com uma análise teórica sólida mostra-se mais adequada que uma solução exata com um modelo teórico fraco. Por outro lado, apesar de muitas soluções exatas e aproximadas de busca e mineração terem sido propostas, o modelo de programação em CPU impõe restrições de desempenho para esses tipos de solução. Uma abordagem para melhorar o tempo de execução de técnicas de recuperação e mineração de dados em várias ordens de magnitude é empregar arquiteturas emergentes de programação paralela, como a arquitetura CUDA. Neste contexto, este trabalho apresenta uma proposta para buscas kNN de alto desempenho baseada numa técnica de hashing e implementações paralelas em CUDA. A técnica proposta é baseada no esquema LSH, ou seja, usa-se projeções em subespac¸os. O LSH é uma solução aproximada e tem a vantagem de permitir consultas de custo sublinear para dados em altas dimensões. Usando implementações massivamente paralelas melhora-se tarefas de mineração de dados. Especificamente, foram desenvolvidos soluções de alto desempenho para algoritmos de descoberta de motifs baseados em implementações paralelas de consultas kNN. As implementações massivamente paralelas em CUDA permitem executar estudos experimentais sobre grandes conjuntos de dados reais e sintéticos. A avaliação de desempenho realizada neste trabalho usando GeForce GTX470 GPU resultou em um aumento de desempenho de até 7 vezes, em média sobre o estado da arte em buscas por similaridade e descoberta de motifs / The increasing availability of data in diverse domains has created a necessity to develop techniques and methods to discover knowledge from huge volumes of complex data, motivating many research works in databases, data mining and information retrieval communities. Recent studies have suggested that searching in complex data is an interesting research field because many data mining tasks such as classification, clustering and motif discovery depend on nearest neighbor search algorithms. Thus, many deterministic approaches have been proposed to solve the nearest neighbor search problem in complex domains, aiming to reduce the effects of the well-known curse of dimensionality. On the other hand, probabilistic algorithms have been slightly explored. Recently, new techniques aim to reduce the computational cost relaxing the quality of the query results. Moreover, in large-scale problems, an approximate solution with a solid theoretical analysis seems to be more appropriate than an exact solution with a weak theoretical model. On the other hand, even though several exact and approximate solutions have been proposed, single CPU architectures impose limits on performance to deliver these kinds of solution. An approach to improve the runtime of data mining and information retrieval techniques by an order-of-magnitude is to employ emerging many-core architectures such as CUDA-enabled GPUs. In this work we present a massively parallel kNN query algorithm based on hashing and CUDA implementation. Our method, based on the LSH scheme, is an approximate method which queries high-dimensional datasets with sub-linear computational time. By using the massively parallel implementation we improve data mining tasks, specifically we create solutions for (soft) realtime time series motif discovery. Experimental studies on large real and synthetic datasets were carried out thanks to the highly CUDA parallel implementation. Our performance evaluation on GeForce GTX 470 GPU resulted in average runtime speedups of up to 7x on the state-of-art of similarity search and motif discovery solutions
|
18 |
Algorithmic Methods for Multi-Omics Biomarker DiscoveryLi, Yichao January 2018 (has links)
No description available.
|
19 |
Optimizing Biomarkers From an Ensemble Learning PipelineKuntala, Prashant Kumar January 2017 (has links)
No description available.
|
Page generated in 0.0963 seconds