This thesis presents two research projects that apply machine learning techniques to protein-related problems. The first project uses protein sequences and the interaction graph to address the protein-protein interaction prediction problem. The second project leverages the sequences of protein loops within and beyond homologs to predict protein loop structures. In the protein-protein interaction prediction project, we used pretrained language models, trained on large sets of protein sequences, as one protein feature extraction method. The other feature extraction method is graph learning on the protein interaction graph. The graph learning embeddings and the language model embeddings were fed into classification models to predict whether two proteins interact. We trained and tested our methods on the S. cerevisiae dataset and the human dataset. Our results are comparable to or better than those of other state-of-the-art methods, with the advantages that our method is faster at the sample preparation step and has a broader application scope because it requires only protein sequences. We also ran experiments on versions of the human dataset built with different similarity cutoffs between the training and test sets, and our method remained effective even under a strict similarity cutoff.
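As a rough illustration of how two embedding sources can be combined into one input for a pair classifier, the sketch below concatenates a sequence embedding and a graph embedding for each protein and then builds a single vector per protein pair. The function name `pair_features`, the embedding sizes, and the symmetric sum/absolute-difference combination are assumptions made for this example; the abstract does not state how the thesis combines the two embeddings.

```python
import numpy as np

def pair_features(lm_emb, graph_emb, i, j):
    """Build one feature vector for the protein pair (i, j).

    lm_emb and graph_emb are (n_proteins, d) arrays; both are hypothetical
    stand-ins for language-model and graph-learning embeddings.
    """
    # Per-protein feature: sequence embedding followed by graph embedding.
    per_protein = np.concatenate([lm_emb, graph_emb], axis=1)
    a, b = per_protein[i], per_protein[j]
    # Assumed design choice: a symmetric combination, so that the pairs
    # (i, j) and (j, i) map to the same feature vector.
    return np.concatenate([a + b, np.abs(a - b)])

rng = np.random.default_rng(0)
lm_emb = rng.normal(size=(5, 8))     # hypothetical: 5 proteins, 8-dim sequence embeddings
graph_emb = rng.normal(size=(5, 4))  # hypothetical: 4-dim graph embeddings
x = pair_features(lm_emb, graph_emb, 0, 3)
```

A vector such as `x` would then be passed, together with an interacting or non-interacting label, to whatever classification model is used.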
In the protein loop prediction project, we used attention-based encoder-decoder language models to predict protein loop inter-residue distances from protein loop sequences. The model takes a loop sequence as input and outputs an array of numbers representing the distances between each C_α pair in the loop. We used two different strategies to reconstruct the loops from the predicted distances. The first calculates the C_α coordinates from the predicted distances and then applies a fast full-atom reconstruction method, starting from the C_α coordinates, to build the local loop structures. The local loop structure predictions from this method are very competitive, with low local RMSDs and, in particular, the lowest standard deviations. The second integrates the predicted inter-residue distances as constraints into the de novo loop prediction method PLOP (Jacobson et al. 2004). We tested this loop reconstruction process on the 8-residue and 12-residue loop benchmark sets. This method performs best among the state-of-the-art methods compared, and incorporating the machine learning step reduced the computing time relative to the standalone PLOP program.
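The step from a predicted distance matrix to C_α coordinates can be sketched with classical multidimensional scaling (MDS). This is one standard technique for the task, chosen here for illustration; the abstract does not say which procedure the thesis uses. The function names and the random eight-point test structure are likewise assumptions.

```python
import numpy as np

def coords_from_distances(dist, dim=3):
    """Recover coordinates from a full pairwise distance matrix (classical MDS).

    The result is defined only up to rotation, reflection, and translation.
    """
    n = dist.shape[0]
    # Double-centre the squared distances to obtain the Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the `dim` largest eigenpairs; clip small negative eigenvalues,
    # which can appear when the distances are noisy predictions.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dim]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

def pairwise_distances(coords):
    """Return the matrix of Euclidean distances between all rows of coords."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
true_coords = rng.normal(scale=5.0, size=(8, 3))  # stand-in for an 8-residue loop
dist = pairwise_distances(true_coords)            # plays the role of predicted distances
rebuilt = coords_from_distances(dist)
max_error = np.abs(pairwise_distances(rebuilt) - dist).max()
```

Because distances alone cannot distinguish a structure from its mirror image, a reconstruction of this kind would still need a separate chirality check before full-atom rebuilding.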
Identifier | oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/d8-cbq5-fm68 |
Date | January 2021 |
Creators | Sun, Mengzhen |
Source Sets | Columbia University |
Language | English |
Detected Language | English |
Type | Theses |