Machine Learning Applications in Proteins: Interaction Prediction and Structure Prediction

This thesis presents two research projects that apply machine learning techniques to protein-related problems. The first project uses protein sequences and the protein interaction graph to address the protein-protein interaction prediction problem. The second project leverages the sequences of protein loops, within and beyond homologs, to predict protein loop structures. In the protein-protein interaction prediction project, we used pretrained language models, trained on large sets of protein sequences, as one protein feature extraction method. The other feature extraction method was graph learning on the protein interaction graph. The graph learning embeddings and the language model embeddings were fed into classification models to predict whether two proteins interact. We trained and tested our methods on the S. cerevisiae dataset and the human dataset. Our results are comparable to or better than those of other state-of-the-art methods, with the advantages that our method is faster at the sample preparation step and has a broader application scope because it requires only protein sequences. We also ran experiments on the human dataset with different similarity cutoffs between the training and test sets, and our method predicted effectively even under a strict similarity cutoff.
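The feature-fusion step described above can be illustrated with a minimal sketch. It assumes each protein already has a language model embedding and a graph embedding stored as rows of two NumPy arrays, that a pair is represented by simple concatenation, and that the classifier is a plain logistic regression; none of these choices is stated in the abstract, and the names `pair_features`, `train_logistic`, and `predict_proba` are hypothetical.

```python
import numpy as np


def pair_features(lm_emb, graph_emb, i, j):
    """Build one feature vector for the protein pair (i, j).

    Assumption: both embeddings of both proteins are simply concatenated.
    """
    return np.concatenate([lm_emb[i], graph_emb[i], lm_emb[j], graph_emb[j]])


def predict_proba(x, w, b):
    """Probability that each row of x describes an interacting pair."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def train_logistic(x, y, lr=0.1, epochs=500):
    """Fit a logistic regression by batch gradient descent (stand-in classifier)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        grad = predict_proba(x, w, b) - y  # gradient of the log loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder embeddings for 6 proteins; real ones would come from the
    # pretrained language model and the graph learning step.
    lm_emb = rng.normal(size=(6, 16))
    graph_emb = rng.normal(size=(6, 8))
    pairs = [(0, 1), (2, 3), (4, 5), (0, 5)]
    labels = np.array([1.0, 0.0, 1.0, 0.0])  # made-up labels, illustration only
    x = np.stack([pair_features(lm_emb, graph_emb, i, j) for i, j in pairs])
    w, b = train_logistic(x, labels)
    print(predict_proba(x, w, b))
```

Concatenation makes the pair vector depend on the order of the two proteins; a symmetric combination would be another reasonable choice, and the abstract does not say which was used.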

In the protein loop prediction project, we used attention-based encoder-decoder language models to predict protein loop inter-residue distances from protein loop sequences. The model takes a loop sequence as input and outputs an array of numbers representing the distances between each C_α pair in the loop. We used two different strategies to reconstruct the loops from the predicted distances. The first calculates the C_α coordinates from the predicted distances and then applies a fast full-atom reconstruction method, starting from the C_α coordinates, to build the local loop structures. The local loop structures predicted this way are very competitive, with low local RMSDs and, in particular, the lowest standard deviations. The second integrates the predicted inter-residue distances as constraints into the de novo loop prediction method PLOP (Jacobson et al. 2004). We tested the loop reconstruction process on the 8-residue and 12-residue loop benchmark sets. This method performed best compared with other state-of-the-art methods, and incorporating the machine learning step reduced the computing time relative to the standalone PLOP program.
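The step from predicted distances to C_α coordinates, and the local RMSD used to judge the result, can be sketched as follows. The abstract does not say which algorithm was used for either step; this sketch assumes classical multidimensional scaling for the coordinates and Kabsch superposition for the RMSD. The function names are hypothetical, and the demo loop is synthetic.

```python
import numpy as np


def coords_from_distances(dist):
    """Classical MDS: 3D coordinates from an n x n distance matrix.

    The result is defined only up to rotation, translation, and reflection.
    """
    d2 = np.asarray(dist, dtype=float) ** 2
    d2 = 0.5 * (d2 + d2.T)  # symmetrise, since predicted distances may be noisy
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering  # Gram matrix of centred coordinates
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:3]  # three largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def kabsch_rmsd(a, b):
    """RMSD between point sets a and b after optimal proper rotation of a."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    sign = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0  # exclude reflections
    rotation = u @ np.diag([1.0, 1.0, sign]) @ vt
    return float(np.sqrt(np.mean(np.sum((a @ rotation - b) ** 2, axis=1))))


def local_rmsd(pred, ref):
    """Local RMSD, trying both mirror images because distances do not fix chirality."""
    mirrored = pred * np.array([1.0, 1.0, -1.0])
    return min(kabsch_rmsd(pred, ref), kabsch_rmsd(mirrored, ref))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic 8-residue "loop": random walk with 3.8 Angstrom steps.
    steps = rng.normal(size=(8, 3))
    steps /= np.linalg.norm(steps, axis=1, keepdims=True)
    reference = np.cumsum(3.8 * steps, axis=0)
    distances = np.linalg.norm(reference[:, None] - reference[None], axis=-1)
    rebuilt = coords_from_distances(distances)
    print("local RMSD:", local_rmsd(rebuilt, reference))
```

Choosing the mirror image by comparison with a reference works only for benchmarking; a real pipeline would need another way to resolve chirality, for example from the known geometry of the loop anchor residues.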

https://doi.org/10.7916/d8-cbq5-fm68

Chemistry

Machine learning

Amino acid sequence

Protein-protein interactions

Identifier	oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/d8-cbq5-fm68
Date	January 2021
Creators	Sun, Mengzhen
Source Sets	Columbia University
Language	English
Detected Language	English
Type	Theses
