Large-scale genomic sequencing efforts have resulted in a massive inflow of raw sequence data. This raw data, when appropriately processed and analyzed, can provide insight to a trained biologist and aid in hypothesis-driven research. Given the time and resources that biological experiments require, computational predictions of gene function can help reduce a large list of candidate genes to a few promising targets. Various computational solutions have been proposed and developed for gene function prediction. These solutions use various forms of data, such as DNA/RNA/protein sequences, protein structures, interaction networks, literature mining, and combinations of these data sources. However, these methods do not always produce precise results because the underlying data sets used for training or modeling are quite sparse. We developed and used a massive sequence similarity network built over 108 million known protein sequences to aid in protein function prediction. Predictions are made by aligning query sequences to the representative sequences of clusters derived from this network. Each derived cluster aggregates information (particularly Gene Ontology annotations) from its members, which we then consolidate through a novel weighted path method. We evaluate our method on four holdout datasets using CAFA evaluation metrics. Our results suggest that clustering significantly reduces time and memory requirements, with a marginal impact on predictive power. At lower sequence similarity thresholds, our method outperforms other gold standard methods. / Master of Science / We often think of a protein as a nutritional requirement. However, proteins are far more than just food; they play countless and underappreciated roles in facilitating life.
These roles range from transporting nutrients in the body, synthesizing hormones, and functioning as enzymes that expedite chemical reactions, to serving as the scaffold for cells and tissues and protecting the body against foreign pathogens. On a molecular level, each protein is made up of chains of 20 different amino acids, just like a chain of beads, that are then folded to create a 3-dimensional structure. Variations in the ordering of amino acids result in different types of proteins. There are millions of genes across known life, and they perform different functions when translated into proteins. Nature has given us many proteins with interesting properties, and the low cost of sequencing their precursors (DNA) has resulted in large amounts of sequence data that are not yet associated with a function. Biological experiments to determine the function of a protein can be time-consuming and expensive. We built a massive network encompassing 108 million protein sequences based on sequence similarity. This ensures that we make use of as much data as possible to make better predictions. Specifically, our work focuses on using information from similar proteins to aid in predicting the functions of a protein given its sequence. It is based on the idea of guilt by association: if two proteins are similar in sequence, they are assumed to perform similar functions. We show that, using computationally efficient methods and large datasets, one can achieve fast and highly precise predictions.
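The guilt-by-association idea described above can be illustrated with a minimal Python sketch. It is not the thesis's weighted path method, which this abstract does not specify. The function name `transfer_terms`, the toy clusters, and the scoring rule (alignment identity multiplied by the fraction of cluster members carrying a term, keeping the best score per term) are all assumptions made for illustration.

```python
from collections import defaultdict


def transfer_terms(hits, cluster_term_freq):
    """Score Gene Ontology terms for a query from its cluster hits.

    hits: list of (cluster_id, identity) pairs, identity in [0, 1],
        standing in for alignments to cluster representatives.
    cluster_term_freq: cluster_id -> {go_term: fraction of members annotated}.

    Scoring rule (an assumption, not the thesis's method): for each term,
    keep the maximum of identity * frequency over all hits.
    """
    scores = defaultdict(float)
    for cluster_id, identity in hits:
        for term, freq in cluster_term_freq.get(cluster_id, {}).items():
            scores[term] = max(scores[term], identity * freq)
    # Highest score first; ties broken by term ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: two clusters with aggregated annotations.
cluster_term_freq = {
    "c1": {"GO:0003824": 0.9, "GO:0005215": 0.2},  # mostly catalytic activity
    "c2": {"GO:0005215": 0.8},                     # mostly transporter activity
}
# Hypothetical alignment results for one query sequence.
hits = [("c1", 0.5), ("c2", 0.75)]

predictions = transfer_terms(hits, cluster_term_freq)
for term, score in predictions:
    print(f"{term}\t{score:.2f}")
```

Aligning a query against one representative per cluster rather than against every member is what keeps the number of comparisons small; in this sketch the cluster's aggregated term frequencies stand in for the annotations of all its members.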
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/106704 |
Date | 29 May 2020 |
Creators | Vora, Parth Harish |
Contributors | Computer Science, Kale, Shiv D., Murali, T. M., Heath, Lenwood S. |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |