Submitted to the faculty of the School of Informatics in partial fulfillment of the requirements for the degree Master of Science in Bioinformatics in the School of Informatics, Indiana University, August 2004.
Prediction of the secondary structure of a protein from its amino acid sequence remains an important task. Not only has the growth of databases holding only protein sequences outpaced that of solved protein structures, but successful predictions can also provide a starting point for direct tertiary structure modeling, and they can significantly improve sequence analysis and sequence-structure threading, aiding structure and function determination. Previous work on predicting protein secondary structure has reported best accuracies ranging from 63% to 71%. These numbers, however, should be taken with caution, since the performance of a method may vary when it is trained on a different training set. Improving secondary structure prediction poses three challenges: establishing an appropriate database, representing the protein sequence appropriately, and finding an appropriate method of classification. Two of these three challenges thus concern the database and the characteristic features. Here, we report the development of a database of non-identical segments of secondary structure elements and fragments with missing electron densities (disordered fragments) extracted from the Protein Data Bank and categorized into groups of equal length, from 6 to 40 residues. The number of residues in each category is 219,788 for α-helices, 82,070 for β-sheets, 179,388 for coils, and 74,724 for disorder. The total number of fragments in the database is 49,544, of which 17,794 are α-helices, 10,216 β-sheets, 16,318 coils, and 5,216 disordered regions. Across the whole range of lengths, α-helices were enriched in L, A, E, I, and R; β-sheets in V, I, F, Y, and L; coils in P, G, N, D, and S; and disordered regions in S, G, P, H, and D.
In addition to the amino acid sequence, for each fragment of every structural type we calculated the distance between the residues immediately flanking its termini. The observed distances range between 3 and 30 Å. For the three secondary structure types, the average distance between the bookending residues increases linearly with sequence length, while distances were more constant for disorder. For each length between 6 and 40, we compared the amino acid compositions of all four structural types and found a strong compositional dependence on length only for the β-sheet fragments, while the other three types showed virtually no change with length. Using the Kullback-Leibler (KL) distance between amino acid compositions, we quantified the differences between the four categories. The closest pair in terms of KL distance was coil and disorder (dKL = 0.06 bits), followed by α-helix and β-sheet (dKL = 0.14 bits), while all other pairs were almost equidistant from one another (dKL ≈ 0.25 bits). With increasing segment length, the KL distance decreased between sheet and coil, sheet and disorder, and disorder and helix. Hierarchical clustering of the lengths from 6 to 18 within each of the four types showed that the coil group had the closest proximity among its lengths, followed by helix and disorder, while the sheet group differed the most across its lengths. In the sheet and coil groups, fragments of length 17 were the most distant from the other lengths, whereas in the disorder and helix groups fragments of length 6 were the most distant.
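The comparison of amino acid compositions by KL distance can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say whether the distance was symmetrized or how zero counts were handled, so the symmetric average, the pseudocount, the function names, and the toy fragments below are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(fragments, pseudocount=1.0):
    """Relative amino acid frequencies pooled over a list of fragments.

    The pseudocount is an assumption; it keeps every frequency nonzero
    so that the logarithms below are always defined.
    """
    counts = Counter()
    for seq in fragments:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits (log base 2)."""
    return sum(p[a] * log2(p[a] / q[a]) for a in AMINO_ACIDS)


def symmetric_kl_bits(p, q):
    """Average of both directions (assumed here; the abstract does not
    specify which variant was used)."""
    return 0.5 * (kl_bits(p, q) + kl_bits(q, p))


# Toy fragments invented for illustration; they are not from the database.
helix = ["LAEELRKALEA", "AELLRRIAEEL"]
sheet = ["VIVFYVTVKI", "TVYIFVLEVR"]

d = symmetric_kl_bits(composition(helix), composition(sheet))
print(f"helix vs. sheet: {d:.3f} bits")
```

Applying `composition` separately to the fragments of each length would give the per-length compositions whose pairwise distances could then be passed to a hierarchical clustering routine.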
08 August 2005
Indiana University-Purdue University Indianapolis