Finding Similar Protein Structures Efficiently and Effectively

To assess the similarities and the differences among protein structures, a
variety of structure alignment algorithms and programs have been designed and
implemented. We introduce a low-resolution approach and a high-resolution
approach to evaluate the similarities among protein structures. Our results
show that both the low-resolution approach and the high-resolution approach
outperform state-of-the-art methods.

For the low-resolution approach, we eliminate false positives through the
comparison of both local similarity and remote similarity with little
compromise in speed. Two kinds of contact libraries (ContactLib) are introduced
to fingerprint protein structures. Each contact group from a contact library
consists of one local or two remote fragments and is represented by a concise
vector. These vectors are then indexed and used to calculate a new combined
hit-rate score that identifies similar protein structures effectively and
efficiently.
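
As a rough illustration of a hit-rate score over fingerprint vectors, the sketch below counts how many query vectors have a close match among a target structure's vectors. It is not the ContactLib implementation: the function names, the Euclidean distance, the `radius` threshold, the linear scan (standing in for an index), and the weighted average used to combine local and remote rates are all assumptions made for this example.

```python
import math


def hit_rate(query_vectors, target_vectors, radius):
    """Fraction of query vectors with at least one target vector within `radius`.

    Hypothetical helper: a linear scan is used here in place of a real index.
    """
    if not query_vectors:
        return 0.0
    hits = sum(
        1
        for q in query_vectors
        if any(math.dist(q, t) <= radius for t in target_vectors)
    )
    return hits / len(query_vectors)


def combined_hit_rate(local_rate, remote_rate, weight=0.5):
    """Combine local and remote hit rates.

    The weighted average is an assumption of this sketch, not a rule
    stated in the abstract.
    """
    return weight * local_rate + (1.0 - weight) * remote_rate


if __name__ == "__main__":
    query = [(0.0, 0.0), (10.0, 10.0)]
    target = [(0.1, 0.0), (5.0, 5.0)]
    local = hit_rate(query, target, radius=0.5)
    print(combined_hit_rate(local, remote_rate=1.0))
```

A real retrieval system would replace the inner scan with a spatial index so that each query vector is matched in sublinear time.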

We tested our ContactLibs on the high-quality protein structure subset of
SCOP30, which contains 3,297 protein structures. For each protein structure
of the subset, we retrieved its neighbor protein structures from the rest of
the subset. The best area under the ROC curve, achieved by a ContactLib, is as
high as 0.960. This is a significant improvement over 0.747, the best
result achieved by the state-of-the-art method, FragBag.
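
For readers unfamiliar with the metric, the area under the ROC curve for one retrieval ranking can be computed from pairwise comparisons between true neighbors and non-neighbors. The sketch below is a generic illustration, not the evaluation code behind the figures above; the function name and the toy scores are made up for the example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy ranking: label 1 marks a true structural neighbor of the query.
    print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The quadratic loop keeps the example short; sorting the scores once gives the same value in O(n log n) for large candidate sets.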

For the high-resolution approach, our PROtein STructure Alignment method
(PROSTA) relies on and verifies the fact that the optimal protein structure
alignment always contains a small subset of aligned residue pairs, called a
seed, such that the rotation and translation (ROTRAN), which minimizes the RMSD
of the seed, yields both the optimal ROTRAN and the optimal alignment score.
Thus, ROTRANs minimizing the RMSDs of small subsets of residues are sampled,
and global alignments are calculated directly from the sampled ROTRANs.
Moreover, our method incorporates remote information and filters similar
ROTRANs (or alignments) by clustering, rather than by an exhaustive method, to
overcome the computational inefficiency.
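
The ROTRAN that minimizes the RMSD of a seed can be written as a standard least-squares superposition problem. The notation below (seed pairs p_i and q_i, seed size k) is introduced here for illustration, and the closed-form solution shown is the well-known Kabsch construction; the abstract does not say which solver PROSTA uses.

```latex
% Seed: k aligned residue pairs (p_i, q_i), i = 1..k (notation assumed here)
(R^{*}, t^{*}) = \arg\min_{R \in SO(3),\; t \in \mathbb{R}^{3}}
  \sqrt{\frac{1}{k} \sum_{i=1}^{k} \lVert R\,p_i + t - q_i \rVert^{2}}

% Closed-form solution (Kabsch), with centroids \bar{p} and \bar{q}:
H = \sum_{i=1}^{k} (p_i - \bar{p})(q_i - \bar{q})^{\top} = U \Sigma V^{\top}

R^{*} = V \,\mathrm{diag}\bigl(1,\, 1,\, \det(V U^{\top})\bigr)\, U^{\top},
\qquad
t^{*} = \bar{q} - R^{*}\,\bar{p}
```

Because each seed is small, one such ROTRAN is cheap to compute, which is what makes sampling many seeds practical.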

Our high-resolution protein structure alignment method, when applied to
optimizing the TM-score and the GDT-TS score, produces a significantly better
result than state-of-the-art protein structure alignment methods.
Specifically, if the highest TM-score found by TM-align is lower than 0.6 and
the highest TM-score found by one of the tested methods is higher than 0.5,
our alignment method tends to discover better protein structure alignments, with
TM-scores up to 0.21 higher. In such cases, TM-align fails to find TM-scores
higher than 0.5 with a probability of 42%; however, our alignment method
fails the same task with a probability of only 2%.
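
For reference, the TM-score of a fixed alignment under a fixed superposition follows the standard published definition, sketched below. This is only the scoring step, not the PROSTA search; the function names are chosen for this example, and flooring d0 at 0.5 Å for short targets is an assumption of the sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms) of the TM-score."""
    if target_length <= 15:
        return 0.5  # assumed floor; the cube-root formula is undefined here
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)  # assumed floor for short targets


def tm_score(distances, target_length):
    """TM-score from the distances (angstroms) between aligned residue pairs.

    `distances` holds one entry per aligned pair after superposition;
    the sum is normalized by the length of the target structure.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


if __name__ == "__main__":
    # Toy case: 80 of 100 target residues aligned, each pair 2 angstroms apart.
    print(tm_score([2.0] * 80, target_length=100))
```

Since every aligned pair contributes at most 1, the score lies in (0, 1], and a value of 0.5 is the threshold the comparison above refers to.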

In addition, existing protein structure alignment scoring functions focus on
atom coordinate similarity alone and simply ignore other important
similarities, such as sequence similarity. Our scoring function can
incorporate multiple similarities. Our results show that sequence
similarity aids in finding high-quality protein
structure alignments that are more consistent with HOMSTRAD alignments, which
are protein structure alignments examined by human experts. When atom
coordinate similarity itself fails to find alignments with any consistency to
HOMSTRAD alignments, our scoring function remains capable of finding alignments
highly similar to, or even identical to, HOMSTRAD alignments.

http://hdl.handle.net/10012/8349

Bioinformatics

Protein Structure Retrieval

Protein Structure Alignment

Identifier	oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OWTU.10012/8349
Date	23 April 2014
Creators	Cui, Xuefeng
Source Sets	Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language	English
Detected Language	English
Type	Thesis or Dissertation
