Integrating Sequence and Structure for Annotating Proteins in the Twilight Zone: A Machine Learning Approach

Determining protein structure and function experimentally is both costly and time-consuming. Transferring function-related annotations with homology-based methods is relatively straightforward for proteins that share more than 40% sequence identity. However, many proteins lie in the "twilight zone", where sequence similarity to any other protein is very weak even though the protein is structurally similar to several others. Such cases require methods that can exploit both sequence and structural similarity. Understanding how such methods can and should be designed is the focus of this study. In this thesis, models that use both sequence and structure features are applied to two protein prediction problems that are particularly challenging when relying on sequence alone. Enzyme classification benefits from both kinds of features because, on the one hand, enzymes can have identical function with limited sequence similarity, while on the other hand, proteins with a similar fold may have disparate enzyme class annotations. This thesis shows that the full integration of protein sequence and structure-related features (via the use of kernels) automatically places proteins with similar biological properties closer together, leading to superior classification accuracy using Support Vector Machines. Disulfide bonds link residues in a protein structure that may be distant in sequence. Sequence similarity reflecting such structural properties is thus very hard to detect. Structural similarity is sufficient for accurate prediction of disulfide bonds, but structural information is very scarce, so predictors that rely on protein structure are not nearly as useful as those operating on sequence alone. This thesis proposes a novel approach based on Kernel Canonical Correlation Analysis that uses structural features during training only. It does so by finding sequence representations that correlate with the structural features that are essential for a disulfide bond. The resulting representations enable high prediction accuracy for a range of disulfide-bond prediction problems. The proposed model thus exploits the advantages of structural features without requiring protein structure to be available at prediction time. The merits of this approach should apply to a number of open protein structure prediction problems.

Identifier: oai:union.ndltd.org:ADTP/279310
Creators: Isye Arieshanti
Source Sets: Australasian Digital Theses Program
Detected Language: English
