
Deep Learning Based Proteomic Language Modelling for in-silico Protein Generation

A protein is a biopolymer of amino acids whose sequence encodes a particular function. Because any of 20 amino acids can occupy each site, even a short protein of 100 amino acids has $20^{100}$ possible variants, making it unrealistic to evaluate every sequence in sequence space. The search space can be reduced by recognizing that billions of years of evolution, exerting constant pressure, have left only a small subset of protein sequences that carry out particular cellular functions. The portion of amino acid sequence space occupied by proteins found in nature is therefore much smaller than what is possible \cite{kauffman1993origins}. By examining related proteins that share a conserved function and a common evolutionary history (hereafter referred to as protein families), it is possible to identify shared motifs. Examining these motifs allows us to characterize protein families in greater depth and even to generate new "in silico" proteins that are not found in nature but exhibit properties of a particular protein family. Novel deep learning approaches, combined with the large volume of genomic data now available from high-throughput DNA sequencing, make it possible to examine protein families at a scale and resolution never before possible. In this work, we use this abundance of data to learn high-dimensional representations of amino acid sequences and show that it is possible to generate novel sequences from a particular protein family. Such a deep sequential model-based approach has great value for bioinformatics and biotechnological applications due to its rapid sampling abilities. / Master of Science / Proteins are among the most important functional elements in biology. They are composed of amino acids that link together to form different shapes, which may encode a particular function.
These proteins may act independently or may form "complexes" to carry out a particular function. Understanding them is therefore of utmost importance. Because there are 20 amino acids, even a protein sequence fragment of length 5 has $20^5 = 3.2$ million possible combinations. Given that proteins can be 1000 amino acids long, examining all the possibilities is next to impossible. In this work, by leveraging the "deep learning" paradigm and the vast amount of data available, we model these proteins and generate new proteins belonging to a specific "protein family." This approach has great value for bioinformatics and biotechnological applications due to its rapid sampling abilities.
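
The counting argument in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the thesis; the function name `sequence_space_size` and the constant `ALPHABET_SIZE` are hypothetical names introduced here. It assumes each position is chosen independently from the 20 standard amino acids.

```python
# Illustrative sketch (not from the thesis): size of the sequence space
# when each position can hold any of the 20 standard amino acids.
ALPHABET_SIZE = 20  # assumption: the 20 standard amino acids


def sequence_space_size(length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Return the number of distinct sequences of the given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # Python integers have arbitrary precision, so this is exact.
    return alphabet_size ** length


fragment = sequence_space_size(5)         # fragment of length 5
short_protein = sequence_space_size(100)  # short protein of length 100

print(f"length 5:   {fragment:,}")
# The digit count minus one gives the order of magnitude.
print(f"length 100: about 10^{len(str(short_protein)) - 1}")
```

A fragment of length 5 gives exactly 3,200,000 sequences, while length 100 already gives a number with 131 digits, which is why exhaustive evaluation is ruled out and a generative model that samples plausible sequences is attractive.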

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/109435
Date: 29 September 2020
Creators: Kesavan Nair, Nitin
Contributors: Electrical and Computer Engineering; Xuan, Jianhua; Aylward, Frank O.; Abbott, A. Lynn
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Detected Language: English
Type: Thesis
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
