1 |
Finding Similar Protein Structures Efficiently and EffectivelyCui, Xuefeng 23 April 2014 (has links)
To assess the similarities and the differences among protein structures, a
variety of structure alignment algorithms and programs have been designed and
implemented. We introduce a low-resolution approach and a high-resolution
approach to evaluate the similarities among protein structures. Our results
show that both the low-resolution approach and the high-resolution approach
outperform state-of-the-art methods.
For the low-resolution approach, we eliminate false positives through the
comparison of both local similarity and remote similarity with little
compromise in speed. Two kinds of contact libraries (ContactLib) are introduced
to fingerprint protein structures effectively and efficiently. Each contact
group from the contact library consists of one local or two remote fragments
and is represented by a concise vector. These vectors are then indexed and used
to calculate a new combined hit-rate score to identify similar protein
structures effectively and efficiently.
We tested our ContactLibs on the high-quality protein structure subset of
SCOP30, which contains 3,297 protein structures. For each protein structure
of the subset, we retrieved its neighbor protein structures from the rest of
the subset. The best area under the ROC curve, archived by a ContactLib, is as
high as 0.960. This is a significant improvement over 0.747, the best
result achieved by the state-of-the-art method, FragBag.
For the high-resolution approach, our PROtein STructure Alignment method
(PROSTA) relies on and verifies the fact that the optimal protein structure
alignment always contains a small subset of aligned residue pairs, called a
seed, such that the rotation and translation (ROTRAN), which minimizes the RMSD
of the seed, yields both the optimal ROTRAN and the optimal alignment score.
Thus, ROTRANs minimizing the RMSDs of small subsets of residues are sampled,
and global alignments are calculated directly from the sampled ROTRANs.
Moreover, our method incorporates remote information and filters similar
ROTRANs (or alignments) by clustering, rather than by an exhaustive method, to
overcome the computational inefficiency.
Our high-resolution protein structure alignment method, when applied to
optimizing the TM-score and the GDT-TS score, produces a significantly better
result than state-of-the-art protein structure alignment methods.
Specifically, if the highest TM-score found by TM-align is lower than 0.6 and
the highest TM-score found by one of the tested methods is higher than 0.5,
our alignment method tends to discover better protein structure alignments with
(up to 0.21) higher TM-scores. In such cases, TM-align fails to find TM-scores
higher than 0.5 with a probability of 42%; however, our alignment method
fails the same task with a probability of only 2%.
In addition, existing protein structure alignment scoring functions focus on
atom coordinate similarity alone and simply ignore other important
similarities, such as sequence similarity. Our scoring function has the
capacity for incorporating multiple similarities into the scoring function. Our
result shows that sequence similarity aids in finding high quality protein
structure alignments that are more consistent with HOMSTRAD alignments, which
are protein structure alignments examined by human experts. When atom
coordinate similarity itself fails to find alignments with any consistency to
HOMSTRAD alignments, our scoring function remains capable of finding alignments
highly similar to, or even identical to, HOMSTRAD alignments.
|
2 |
Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence ComparisonZhao, Zhiyu 07 August 2008 (has links)
Sequence analysis and structure analysis are two of the fundamental areas of bioinformatics research. This dissertation discusses, specifically, protein structure related problems including protein structure alignment and query, and genome sequence related problems including haplotype reconstruction and genome rearrangement. It first presents an algorithm for pairwise protein structure alignment that is tested with structures from the Protein Data Bank (PDB). In many cases it outperforms two other well-known algorithms, DaliLite and CE. The preliminary algorithm is a graph-theory based approach, which uses the concept of \stars" to reduce the complexity of clique-finding algorithms. The algorithm is then improved by introducing \double-center stars" in the graph and applying a self-learning strategy. The updated algorithm is tested with a much larger set of protein structures and shown to be an improvement in accuracy, especially in cases of weak similarity. A protein structure query algorithm is designed to search for similar structures in the PDB, using the improved alignment algorithm. It is compared with SSM and shows better performance with lower maximum and average Q-score for missing proteins. An interesting problem dealing with the calculation of the diameter of a 3-D sequence of points arose and its connection to the sublinear time computation is discussed. The diameter calculation of a 3-D sequence is approximated by a series of sublinear time deterministic, zero-error and bounded-error randomized algorithms and we have obtained a series of separations about the power of sublinear time computations. This dissertation also discusses two genome sequence related problems. A probabilistic model is proposed for reconstructing haplotypes from SNP matrices with incomplete and inconsistent errors. The experiments with simulated data show both high accuracy and speed, conforming to the theoretically provable e ciency and accuracy of the algorithm. Finally, a genome rearrangement problem is studied. The concept of non-breaking similarity is introduced. Approximating the exemplar non-breaking similarity to factor n1..f is proven to be NP-hard. Interestingly, for several practical cases, several polynomial time algorithms are presented.
|
3 |
Waveform Mapping and Time-Frequency Processing of Biological Sequences and StructuresJanuary 2011 (has links)
abstract: Genomic and proteomic sequences, which are in the form of deoxyribonucleic acid (DNA) and amino acids respectively, play a vital role in the structure, function and diversity of every living cell. As a result, various genomic and proteomic sequence processing methods have been proposed from diverse disciplines, including biology, chemistry, physics, computer science and electrical engineering. In particular, signal processing techniques were applied to the problems of sequence querying and alignment, that compare and classify regions of similarity in the sequences based on their composition. However, although current approaches obtain results that can be attributed to key biological properties, they require pre-processing and lack robustness to sequence repetitions. In addition, these approaches do not provide much support for efficiently querying sub-sequences, a process that is essential for tracking localized database matches. In this work, a query-based alignment method for biological sequences that maps sequences to time-domain waveforms before processing the waveforms for alignment in the time-frequency plane is first proposed. The mapping uses waveforms, such as time-domain Gaussian functions, with unique sequence representations in the time-frequency plane. The proposed alignment method employs a robust querying algorithm that utilizes a time-frequency signal expansion whose basis function is matched to the basic waveform in the mapped sequences. The resulting WAVEQuery approach is demonstrated for both DNA and protein sequences using the matching pursuit decomposition as the signal basis expansion. The alignment localization of WAVEQuery is specifically evaluated over repetitive database segments, and operable in real-time without pre-processing. It is demonstrated that WAVEQuery significantly outperforms the biological sequence alignment method BLAST for queries with repetitive segments for DNA sequences. A generalized version of the WAVEQuery approach with the metaplectic transform is also described for protein sequence structure prediction. For protein alignment, it is often necessary to not only compare the one-dimensional (1-D) primary sequence structure but also the secondary and tertiary three-dimensional (3-D) space structures. This is done after considering the conformations in the 3-D space due to the degrees of freedom of these structures. As a result, a novel directionality based 3-D waveform mapping for the 3-D protein structures is also proposed and it is used to compare protein structures using a matched filter approach. By incorporating a 3-D time axis, a highly-localized Gaussian-windowed chirp waveform is defined, and the amino acid information is mapped to the chirp parameters that are then directly used to obtain directionality in the 3-D space. This mapping is unique in that additional characteristic protein information such as hydrophobicity, that relates the sequence with the structure, can be added as another representation parameter. The additional parameter helps tracking similarities over local segments of the structure, this enabling classification of distantly related proteins which have partial structural similarities. This approach is successfully tested for pairwise alignments over full length structures, alignments over multiple structures to form a phylogenetic trees, and also alignments over local segments. Also, basic classification over protein structural classes using directional descriptors for the protein structure is performed. / Dissertation/Thesis / Ph.D. Electrical Engineering 2011
|
4 |
Large-scale Comparative Study of Hi-C-based Chromatin 3D Structure Modeling MethodsWang, Cheng 17 May 2018 (has links)
Chromatin is a complex polymer molecule in eukaryotic cells, primarily consisting of DNA and histones. Many works have shown that the 3D folding of chromatin structure plays an important role in DNA expression. The recently proposed Chro- mosome Conformation Capture technologies, especially the Hi-C assays, provide us an opportunity to study how the 3D structures of the chromatin are organized. Based on the data from Hi-C experiments, many chromatin 3D structure modeling methods have been proposed. However, there is limited ground truth to validate these methods and no robust chromatin structure alignment algorithms to evaluate the performance of these methods.
In our work, we first made a thorough literature review of 25 publicly available population Hi-C-based chromatin 3D structure modeling methods. Furthermore, to evaluate and to compare the performance of these methods, we proposed a novel data simulation method, which combined the population Hi-C data and single-cell Hi-C data without ad hoc parameters. Also, we designed a global and a local alignment algorithms to measure the similarity between the templates and the chromatin struc- tures predicted by different modeling methods. Finally, the results from large-scale comparative tests indicated that our alignment algorithms significantly outperform the algorithms in literature.
|
5 |
Variations on RNA folding and alignment: lessons from BenasqueBompfünewerer, Athanasius F., Backofen, Rolf, Bernhart, Stephan H., Hertel, Jana, Hofacker, Ivo L., Stadler, Peter F., Will, Sebastian 09 November 2018 (has links)
Dynamic Programming Algorithms solve many standard problems of RNA bioinformatics in polynomial time. In this contribution we discuss a series of variations on these standard methods that implement refined biophysical models, such as a restriction of RNA folding to canonical structures, and an extension of structural alignments to an explicit scoring of stacking propensities. Furthermore, we demonstrate that a local structural alignment can be employed for ncRNA gene finding. In this context we discuss scanning variants for folding and alignment algorithms.
|
6 |
PROTEIN STRUCTURE ALIGNMENT USING A GENERALIZED ALIGNMENT MODELSUBRAMANIAN, SUCHITHA January 2007 (has links)
No description available.
|
7 |
Structural Information and Hidden Markov Models for Biological Sequence AnalysisTångrot, Jeanette January 2008 (has links)
Bioinformatics is a fast-developing field, which makes use of computational methods to analyse and structure biological data. An important branch of bioinformatics is structure and function prediction of proteins, which is often based on finding relationships to already characterized proteins. It is known that two proteins with very similar sequences also share the same 3D structure. However, there are many proteins with similar structures that have no clear sequence similarity, which make it difficult to find these relationships. In this thesis, two methods for annotating protein domains are presented, one aiming at assigning the correct domain family or families to a protein sequence, and the other aiming at fold recognition. Both methods use hidden Markov models (HMMs) to find related proteins, and they both exploit the fact that structure is more conserved than sequence, but in two different ways. Most of the research presented in the thesis focuses on the structure-anchored HMMs, saHMMs. For each domain family, an saHMM is constructed from a multiple structure alignment of carefully selected representative domains, the saHMM-members. These saHMM-members are collected in the so called "midnight ASTRAL set", and are chosen so that all saHMM-members within the same family have mutual sequence identities below a threshold of about 20%. In order to construct the midnight ASTRAL set and the saHMMs, a pipe-line of software tools are developed. The saHMMs are shown to be able to detect the correct family relationships at very high accuracy, and perform better than the standard tool Pfam in assigning the correct domain families to new domain sequences. We also introduce the FI-score, which is used to measure the performance of the saHMMs, in order to select the optimal model for each domain family. The saHMMs are made available for searching through the FISH server, and can be used for assigning family relationships to protein sequences. The other approach presented in the thesis is secondary structure HMMs (ssHMMs). These HMMs are designed to use both the sequence and the predicted secondary structure of a query protein when scoring it against the model. A rigorous benchmark is used, which shows that HMMs made from multiple sequences result in better fold recognition than those based on single sequences. Adding secondary structure information to the HMMs improves the ability of fold recognition further, both when using true and predicted secondary structures for the query sequence. / Bioinformatik är ett område där datavetenskapliga och statistiska metoder används för att analysera och strukturera biologiska data. Ett viktigt område inom bioinformatiken försöker förutsäga vilken tredimensionell struktur och funktion ett protein har, utifrån dess aminosyrasekvens och/eller likheter med andra, redan karaktäriserade, proteiner. Det är känt att två proteiner med likande aminosyrasekvenser också har liknande tredimensionella strukturer. Att två proteiner har liknande strukturer behöver dock inte betyda att deras sekvenser är lika, vilket kan göra det svårt att hitta strukturella likheter utifrån ett proteins aminosyrasekvens. Den här avhandlingen beskriver två metoder för att hitta likheter mellan proteiner, den ena med fokus på att bestämma vilken familj av proteindomäner, med känd 3D-struktur, en given sekvens tillhör, medan den andra försöker förutsäga ett proteins veckning, d.v.s. ge en grov bild av proteinets struktur. Båda metoderna använder s.k. dolda Markov modeller (hidden Markov models, HMMer), en statistisk metod som bland annat kan användas för att beskriva proteinfamiljer. Med hjälp en HMM kan man förutsäga om en viss proteinsekvens tillhör den familj modellen representerar. Båda metoderna använder också strukturinformation för att öka modellernas förmåga att känna igen besläktade sekvenser, men på olika sätt. Det mesta av arbetet i avhandlingen handlar om strukturellt förankrade HMMer (structure-anchored HMMs, saHMMer). För att bygga saHMMerna används strukturbaserade sekvensöverlagringar, vilka genereras utifrån hur proteindomänerna kan läggas på varandra i rymden, snarare än utifrån vilka aminosyror som ingår i deras sekvenser. I varje proteinfamilj används bara ett särskilt, representativt urval av domäner. Dessa är valda så att då sekvenserna jämförs parvis, finns det inget par inom familjen med högre sekvensidentitet än ca 20%. Detta urval görs för att få så stor spridning som möjligt på sekvenserna inom familjen. En programvaruserie har utvecklats för att välja ut representanter för varje familj och sedan bygga saHMMer baserade på dessa. Det visar sig att saHMMerna kan hitta rätt familj till en hög andel av de testade sekvenserna, med nästan inga fel. De är också bättre än den ofta använda metoden Pfam på att hitta rätt familj till helt nya proteinsekvenser. saHMMerna finns tillgängliga genom FISH-servern, vilken alla kan använda via Internet för att hitta vilken familj ett intressant protein kan tillhöra. Den andra metoden som presenteras i avhandlingen är sekundärstruktur-HMMer, ssHMMer, vilka är byggda från vanliga multipla sekvensöverlagringar, men också från information om vilka sekundärstrukturer proteinsekvenserna i familjen har. När en proteinsekvens jämförs med ssHMMen används en förutsägelse om sekundärstrukturen, och den beräknade sannolikheten att sekvensen tillhör familjen kommer att baseras både på sekvensen av aminosyror och på sekundärstrukturen. Vid en jämförelse visar det sig att HMMer baserade på flera sekvenser är bättre än sådana baserade på endast en sekvens, när det gäller att hitta rätt veckning för en proteinsekvens. HMMerna blir ännu bättre om man också tar hänsyn till sekundärstrukturen, både då den riktiga sekundärstrukturen används och då man använder en teoretiskt förutsagd. / Jeanette Hargbo.
|
8 |
Training of Template-Specific Weighted Energy Function for Sequence-to-Structure AlignmentLee, En-Shiun Annie January 2008 (has links)
Threading is a protein structure prediction method that uses a library of template protein structures in the following steps: first the target sequence is matched to the template library and the best template structure is selected, secondly the predicted target structure of the target sequence is modeled by this selected template structure. The deceleration of new folds which are added to the protein data bank promises completion of the template structure library. This thesis uses a new set of template-specific weights to improve the energy function for sequence-to-structure alignment in the template selection step of the threading process. The weights are estimated using least squares methods with the quality of the modelling step in the threading process as the label. These new weights show an average 12.74% improvement in estimating the label. Further family analysis show a correlation between the performance of the new weights to the number of seeds in pFam.
|
9 |
Training of Template-Specific Weighted Energy Function for Sequence-to-Structure AlignmentLee, En-Shiun Annie January 2008 (has links)
Threading is a protein structure prediction method that uses a library of template protein structures in the following steps: first the target sequence is matched to the template library and the best template structure is selected, secondly the predicted target structure of the target sequence is modeled by this selected template structure. The deceleration of new folds which are added to the protein data bank promises completion of the template structure library. This thesis uses a new set of template-specific weights to improve the energy function for sequence-to-structure alignment in the template selection step of the threading process. The weights are estimated using least squares methods with the quality of the modelling step in the threading process as the label. These new weights show an average 12.74% improvement in estimating the label. Further family analysis show a correlation between the performance of the new weights to the number of seeds in pFam.
|
10 |
Computational Protein Structure Analysis : Kernel And Spectral MethodsBhattacharya, Sourangshu 08 1900 (has links)
The focus of this thesis is to develop computational techniques for analysis of protein structures. We model protein structures as points in 3-dimensional space which in turn are modeled as weighted graphs. The problem of protein structure comparison is posed as a weighted graph matching problem and an algorithm motivated from the spectral graph matching techniques is developed. The thesis also proposes novel similarity measures by deriving kernel functions. These kernel functions allow the data to be mapped to a suitably defined Reproducing kernel Hilbert Space(RKHS), paving the way for efficient algorithms for protein structure classification.
Protein structure comparison (structure alignment)is a classical method of determining overall similarity between two protein structures. This problem can be posed as the approximate weighted subgraph matching problem, which is a well known NP-Hard problem. Spectral graph matching techniques provide efficient heuristic solution for the weighted graph matching problem using eigenvectors of adjacency matrices of the graphs. We propose a novel and efficient algorithm for protein structure comparison using the notion of neighborhood preserving projections (NPP) motivated from spectral graph matching. Empirically, we demonstrate that comparing the NPPs of two protein structures gives the correct equivalences when the sizes of proteins being compared are roughly similar. Also, the resulting algorithm is 3 -20 times faster than the existing state of the art techniques. This algorithm was used for retrieval of protein structures from standard databases with accuracies comparable to the state of the art.
A limitation of the above method is that it gives wrong results when the number of unmatched residues, also called insertions and deletions (indels), are very high. This problem was tackled by matching neighborhoods, rather than entire structures. For each pair of neighborhoods, we grow the neighborhood alignments to get alignments for entire structures. This results in a robust method that has outperformed the existing state of the art methods on standard benchmark datasets. This method was also implemented using MPI on a cluster for database search.
Another important problem in computational biology is classification of protein structures into classes exhibiting high structural similarity. Many manual and semi-automatic structural classification databases exist. Kernel methods along with support vector machines (SVM) have proved to be a robust and principled tool for classification. We have proposed novel positive semidefinite kernel functions on protein structures based on spatial neighborhoods. The kernels were derived using a general technique called convolution kernel, and showed to be related to the spectral alignment score in a limiting case. These kernels have outperformed the existing tools when validated on a well known manual classification scheme called SCOP. The kernels were designed keeping the general problem of capturing structural similarity in mind, and have been successfully applied to problems in other domains, e.g. computer vision.
|
Page generated in 0.0816 seconds