Global ETD Search

11	Discovery and Extraction of Protein Sequence Motif Information that Transcends Protein Family Boundaries Chen, Bernard 17 July 2009 Protein sequence motifs are attracting growing attention in the field of sequence analysis. These recurring patterns have the potential to determine the conformation, function and activities of proteins. In our work, we obtained protein sequence motifs that are universally conserved across protein family boundaries. Unlike most popular motif discovery algorithms, our approach therefore takes an extremely large input dataset, so an efficient technique is essential. We use two granular computing models, Fuzzy Improved K-means (FIK) and Fuzzy Greedy K-means (FGK), to generate protein motif information efficiently. We then develop an efficient Super Granular SVM Feature Elimination model to further extract the motif information. During the motif search, fixing the window size in advance reduces the computational complexity and increases efficiency. However, because of the fixed size, our model may deliver a number of similar motifs that are simply shifted by some bases or that include mismatches. We develop a new strategy, named Positional Association Super-Rule, to address this problem of motifs generated from a fixed window size. It combines super-rule analysis with a novel Positional Association Rule algorithm. We use the super-rule concept to construct a Super-Rule-Tree (SRT) by a modified HHK clustering, which requires no parameter setup, to identify the similarities and dissimilarities between the motifs. The positional association rule is created and applied to search for similar motifs that are shifted by some residues. Analysis of the motifs generated by our approaches shows that they are significant not only at the sequence level but also in secondary structure similarity and biochemical properties. 
Positional Association Rule Super-Rule protein sequence motif FIK model FGK model Super GSVM-FE HHK clustering algorithm Computer Sciences
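The fixed-window problem described in this abstract can be illustrated with a minimal sketch. It is not the Positional Association Rule algorithm itself, whose details the abstract does not give. The function names and the example sequence are hypothetical, and only exact shifts are detected (mismatches are ignored).

```python
def sliding_windows(sequence, window_size):
    """Return every contiguous segment of length window_size."""
    return [sequence[i:i + window_size]
            for i in range(len(sequence) - window_size + 1)]


def shift_between(motif_a, motif_b, max_shift):
    """Return the smallest shift s in 1..max_shift for which one motif equals
    the other moved by s residues (comparing the overlapping part only),
    or None if no such shift exists."""
    for s in range(1, max_shift + 1):
        if motif_a[s:] == motif_b[:-s] or motif_b[s:] == motif_a[:-s]:
            return s
    return None


# Hypothetical example: a window of 7 residues over a 9-residue sequence.
windows = sliding_windows("ACDEFGHIK", 7)
print(windows)
print(shift_between(windows[0], windows[2], max_shift=3))
```

Segments that overlap in this way would be reported as separate motifs by a fixed-window method, which is the redundancy the abstract sets out to resolve.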
12	Similarity Search And Analysis Of Protein Sequences And Structures: A Residue Contacts Based Approach Sacan, Ahmet 01 August 2008 The advent of high-throughput sequencing and structure determination techniques has had a tremendous impact on our quest to crack the language of life. Genomic and protein data are now accumulating at a phenomenal rate, with the motivation of deriving insights into the function, mechanism, and evolution of biomolecules through analysis of their similarities, differences, and interactions. The rapid increase in the size of biomolecular databases, however, calls for the development of new computational methods for sensitive and efficient management and analysis of this information. In this thesis, we propose and implement several approaches for accurate and highly efficient comparison and retrieval of protein sequences and structures. The observation that corresponding residues in related proteins share similar inter-residue contacts is exploited to derive a new set of biologically sensitive metric amino acid substitution matrices, yielding accurate alignment and comparison of proteins. The metricity of these matrices has allowed efficient indexing and retrieval of both protein sequences and structures. A landmark-guided embedding of protein sequences is developed to represent subsequences in a vector space for approximate but extremely fast spatial indexing and similarity search. Whereas protein structure comparison and search tasks were hitherto handled separately, we propose an integrated approach that serves both tasks and performs comparably to or better than other available methods. Our approach hinges on identification of similar residue contacts using distance-based indexing and provides the best of both worlds: the accuracy of detailed structure alignment algorithms at a speed comparable to that of structure retrieval algorithms. 
We expect that the methods and tools developed in this study will find use in a wide range of application areas including annotation of new proteins, discovery of functional motifs, discerning evolutionary relationships among genes and species, and drug design and targeting. QH Methods of Research, Technique 324
13	Towards a complete sequence homology concept: Limitations and applications Wong, Wing-Cheong 11 August 2011 Historically, the paradigm that similarity of protein sequences implies common structure, function and ancestry was generalized from studies of globular domains. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended to them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs), where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Since the matching of SPs/TMs creates the illusion of matching hydrophobic cores, the inclusion of SPs/TMs in domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited solely to the SP/TM part among clearly unrelated proteins. More worryingly, this work shows explicit examples in which the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74, using conservative criteria, this study finds that between 2.1% and 13.6% of the annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors in which the hit has nothing in common with the problematic domain model except the SP/TM region itself. A workflow for flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users is provided. 
While E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences, it can also complicate the annotation problem. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as the E-value for the domain hit in the query sequence. We demonstrated that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g., 0.1), a critical score region exists with conflicting E-values from the logistic function (below the threshold) and from the EVD (above the threshold). Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. Examples of false annotations are provided and the appropriateness of a logistic function as an alternative to the EVD is critically discussed. This work shows that misguided E-value computation, coupled with non-globular regions embedded in domain model libraries, not only causes annotation errors in public databases but also limits the extrapolation power of protein function prediction tasks. 
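The conflict between the two E-value estimates described above can be sketched as follows. The abstract names only "a logistic function" and an EVD, so the functional forms used here (a base-2 logistic bound and a Gumbel-type tail) are assumptions, as are the parameter names `mu`, `lam`, `n_targets` and all numeric values.

```python
import math


def logistic_pvalue(score):
    """Assumed logistic form: P = 1 / (1 + 2**score), with score in bits."""
    return 1.0 / (1.0 + 2.0 ** score)


def evd_pvalue(score, mu, lam):
    """Assumed Gumbel-type tail: P = 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalues(score, mu, lam, n_targets):
    """Return (logistic E-value, EVD E-value, reported E-value).
    The reported value is the lower of the two, as the abstract states."""
    e_logistic = n_targets * logistic_pvalue(score)
    e_evd = n_targets * evd_pvalue(score, mu, lam)
    return e_logistic, e_evd, min(e_logistic, e_evd)


def is_conflicting(score, mu, lam, n_targets, threshold=0.1):
    """True when the logistic E-value passes the threshold but the EVD one does not."""
    e_logistic, e_evd, _ = evalues(score, mu, lam, n_targets)
    return e_logistic < threshold < e_evd


# Hypothetical model parameters chosen only to show a conflicting score.
print(evalues(10.0, mu=-10.0, lam=0.1, n_targets=1))
print(is_conflicting(10.0, mu=-10.0, lam=0.1, n_targets=1))
```

A hit flagged by `is_conflicting` is reported as significant only because the minimum of the two estimates is taken, which is the situation the abstract calls the critical score region.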
So far, the preceding work has demonstrated that sequence homology considerations widely used to transfer functional annotation to uncharacterized protein sequences require special precautions in the case of non-globular sequence segments, including membrane-spanning stretches of non-polar residues. We found that there are two types of transmembrane helices (TMs) in membrane-associated proteins. On the one hand, there are so-called simple TMs with elevated hydrophobicity, low sequence complexity and extraordinary enrichment in long aliphatic residues. They merely serve as membrane-anchoring devices. In contrast, so-called complex TMs have lower hydrophobicity, higher sequence complexity and some functional residues. These TMs have additional roles besides membrane anchoring, such as intramembrane complex formation, ligand binding or a catalytic role. Simple and complex TMs can occur in both single- and multi-membrane-spanning proteins, essentially in any type of topology. Whereas simple TMs have the potential to confuse searches for sequence homologues and to generate unrelated hits with seemingly convincing statistical significance, complex TMs contain essential evolutionary information. To extend the homology concept to membrane proteins, we provide a necessary quantitative criterion to distinguish simple TMs in query sequences prior to their use in homology searches, based on assessment of the hydrophobicity and sequence complexity of the TM sequence segments. Theoretical insights from this work were applied to problems of function prediction for specific uncharacterized gene/protein sequences (for example, APMAP and ARXES) and for the functional classification of TM-containing proteins. info:eu-repo/classification/ddc/000 ddc:000
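The abstract above says that simple TMs are distinguished by hydrophobicity and sequence complexity but does not state the criterion. The sketch below only illustrates the two quantities: mean Kyte-Doolittle hydropathy and Shannon entropy of the residue composition stand in for them, and both thresholds are hypothetical placeholders rather than the thesis's values.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(segment):
    """Average Kyte-Doolittle hydropathy over the segment."""
    return sum(KYTE_DOOLITTLE[res] for res in segment) / len(segment)


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition, a simple complexity proxy."""
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in Counter(segment).values())


def looks_simple_tm(segment, min_hydrophobicity=2.5, max_entropy=2.5):
    """Flag a segment as a candidate simple TM.
    Both thresholds are hypothetical, chosen only for illustration."""
    return (mean_hydrophobicity(segment) >= min_hydrophobicity
            and shannon_entropy(segment) <= max_entropy)


# Hypothetical segments: a poly-leucine anchor and a polar, diverse stretch.
print(looks_simple_tm("LLLLLLLLLLLLLLLLLLLL"))
print(looks_simple_tm("GSTNQHYWPKDERGSTNQHY"))
```

Segments flagged this way would be candidates for masking before a homology search, in the spirit of the precaution the abstract recommends.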
14	Restricted Boltzmann machines : from compositional representations to protein sequence analysis / Machines de Boltzmann restreintes : des représentations compositionnelles à l'analyse des séquences de protéines Tubiana, Jérôme 29 November 2018 Restricted Boltzmann machines (RBM) are graphical models that jointly learn a probability distribution and a representation of data. Despite their simple architecture, they can learn complex data distributions very well, such as the MNIST database of handwritten digits. Moreover, they are empirically known to learn compositional representations of data, i.e. representations that effectively decompose configurations into their constitutive parts. However, not all variants of RBM perform equally well, and few theoretical arguments exist to explain these empirical observations. In the first part of this thesis, we ask how such a simple model can learn such complex probability distributions and representations. 
By analyzing an ensemble of RBM with random weights using the replica method, we have characterised a compositional regime for RBM, and shown under which conditions (statistics of weights, choice of transfer function) it can and cannot arise. Both qualitative and quantitative predictions obtained with our theoretical analysis are in agreement with observations from RBM trained on real data. In the second part, we present an application of RBM to protein sequence analysis and design. Owing to the large size of proteins, it is very difficult to run physical simulations of them and to predict their structure and function. It is, however, possible to infer information about a protein structure from the way its sequence varies across organisms. For instance, Boltzmann Machines can leverage correlations of mutations to predict the spatial proximity of the amino acids in a sequence. Here, we have shown on several synthetic and real protein families that, provided a compositional regime is enforced, RBM can go beyond structure and extract extended motifs of coevolving amino acids that reflect phylogenetic, structural and functional constraints within proteins. Moreover, RBM can be used to design new protein sequences with putative functional properties by recombining these motifs at will. Lastly, we have designed new training algorithms and model parametrizations that significantly improve RBM generative performance, to the point where it can compete with state-of-the-art generative models such as Generative Adversarial Networks or Variational Autoencoders on medium-scale data. Statistical physics Machine learning Protein sequence analysis Disordered systems Generative models Coevolution 530
15	Fragments structuraux : comparaison, prédictibilité à partir de la séquence et application à l'identification de protéines de virus / Structural fragments : comparison, predictability from the sequence and application to the identification of viral structural proteins Galiez, Clovis 08 December 2015 This thesis investigates the local characterization of protein families at both the structural and the sequence level. We introduce contact fragments (CF) as parts of a protein structure that reconcile spatial locality with sequential neighborhood. We show that the predictability of CF from the sequence is better than that of contiguous fragments and of structurally distant pairs of fragments. 
In order to compare CF structurally, we introduce ASD, a novel alignment-free dissimilarity measure that respects the triangle inequality while being tolerant to sequence shifts and indels. We show that ASD outperforms classical scores for fragment comparison in practical experiments such as unsupervised classification and structural mining. Finally, by integrating the identification of CF from the sequence into a statistical machine learning framework, we developed VIRALpro, a tool that enables the detection of sequences of viral structural proteins. Structural biology Bioinformatics Computational biology Protein fragments Protein sequence Virus Capsid Fourier transform
16	Development And Applications Of Computational Methods To Aid Recognition Of Protein Functions And Interactions Krishnadev, O 03 1900 Protein homology detection has played a central role in the understanding of the evolution of protein structures, functions and interactions. Many of the developments in protein bioinformatics can be traced back to an initial step of homology detection. It is not surprising, then, that the extension of remote homology detection has gained a lot of attention in the recent past. The explosive growth of genome sequences and the slow pace of experimental techniques have thrust computational analyses into the limelight. Many of the traditionally experimental areas, such as gene expression analysis, recognition of function and recognition of 3-D structure, have been addressed effectively by computational approaches. The idea behind homology-based bioinformatics work is that hereditary mechanisms ensure that the parent generation gives rise to a very similar offspring generation. Since the biological functions of the proteins of an organism are a product of the expression of its genetic material, it follows that the genes of an organism should show conservation from one generation to another (with very few mutations if the parent and offspring generations are to be nearly identical). Thus, if it can be established that two proteins have descended from a common ancestor, it can be inferred that the biological functions of the two proteins could be very similar. Homology-based information transfer from one protein to another has therefore become a commonly used procedure in protein bioinformatics. The ability to recognize homologs of a protein solely from amino acid sequences has increased steadily in the last two decades. However, there is still a large number of proteins of known amino acid sequence and as yet unknown function. 
Thus, a major goal of current computational work is to extend the limits of remote homology detection to enable the functional characterization of proteins of unknown function. Since proteins do not work in isolation in a cell, it has become essential to understand the in vivo context of the function of a protein. For this purpose, it is essential to have an understanding of all the molecules that interact with a particular protein. Thus, another major area of bioinformatics has been to integrate biological information with protein-protein interactions to enable a better understanding of the molecular processes. Such attempts have been made successfully for the interaction network of proteins within an organism. The extension of the interaction network analysis to a host-pathogen scenario can lead to useful insights into pathophysiology of diseases. The work done as part of the thesis explores both the ideas mentioned above, namely, the extension of limits of remote homology detection and prediction of protein-protein interactions between a pathogen and its host. Since the work can logically be divided into two different areas though there is a connection, the thesis is organized as two parts. The first part of the thesis (comprising Chapters 2, 3, 4 and 5) describes the development and application of remote homology detection tools for function/structure annotation. The second part of the thesis (comprising of Chapters 6, 7, 8 and 9) describes the development and application of a homology-based procedure for detection of host-pathogen protein-protein interactions. Chapter 1 provides a background and literature survey in the areas of homology detection and prediction of protein-protein interactions. It is argued that homology-based information transfer is currently an important tool in the prediction and recognition of protein structures, functions and interactions. 
The development of remote homology detection methods and its effect on function recognition has been highlighted. Recent work in the area of prediction of protein-protein interactions using homology to known interaction templates is described and it is implied to be a successful approach for prediction of protein-protein interactions on a genome scale. The importance of further improvements in remote homology detection (as done in the first part of the thesis), is emphasized for annotation of proteins in newly sequenced genomes. The importance of application of homology detection methods in predicting protein-protein interactions across host-pathogen organisms is also explored. Chapter 2 analyzes the performance of the PSI-BLAST, one of the well-known and very effective approaches for recognition of related proteins, for remote homology detection. The chapter describes in detail the working of the PSI-BLAST algorithm and focuses on three parameters that determine the time required for searching in a large database, and also provide a ceiling for the sensitivity of the search procedure. The parameters that have been analyzed are the window size for two-hit method, the threshold for extension of an initial hit to dynamic programming and the extent of dependence on the query as encompassed in the profile generation step. The procedure followed for the analysis is to consider a large database of known evolutionary relationships (SCOP database was chosen for the analysis), and use the PSI-BLAST program at different values of three parameters to find out the effect on sensitivity (defined as the normalized number of correct SCOP superfamily relationships found in a search), and the time required for completion of the search. For the demonstration of the effect on the query dependence, a multiple sequence alignment (MSA) of a SCOP family (generated from all family sequences using ClustalW), was used with multiple queries to derive profiles in PSI-BLAST runs. 
The increase in sensitivity and the increase in time required for completion of each search were then monitored. Changing the two PSI-BLAST internal parameters, the score threshold for extension of word hits and the window size for the two-hit method, does not result in a significant increase in sensitivity. Since PSI-BLAST uses the amino acid residues present in the query sequence to derive the Position Specific Scoring Matrix (PSSM) parameters, the sensitivity of each PSSM depends strongly on the query. Using multiple PSSMs derived from a single MSA can thus help overcome the query dependence and increase the sensitivity. In this chapter such an approach, named MulPSSM, has been demonstrated to have higher sensitivity than the single-profile approach (by up to a factor of two) in a benchmark dataset of 100 randomly chosen SCOP folds. Strategies to optimize sensitivity and the time required in searching MulPSSM have been explored, and it is found that using a non-redundant set of queries to generate MulPSSM can reduce the time required for each search without affecting the sensitivity to a large degree. The application of the MulPSSM approach to function annotation of proteins in completely sequenced genomes was explored by searching genomic sequences against a MulPSSM database of Pfam families. The association of function to proteins has been assessed using both a single-profile-per-family database and a MulPSSM database of families. It is found that, across a comprehensive list of 291 genomes of Prokaryotes, 44 genomes of Eukaryotes and 40 genomes of Archaea, MulPSSM is able on average to identify evolutionary relationships for 10% more proteins in a genome than the single-profile approach. Such an enhancement in the recognition of evolutionary relationships, which has implications for obtaining clues to function, can help in more efficient exploration of newly sequenced genomes. 
Identification of evolutionary relationships involving some of the proteins of M. tuberculosis and M. leprae has been possible due to the use of the multiple-profile search approach discussed in this chapter. The examples of annotations provided in the chapter include enzymes involved in glycolipid synthesis, which are vital for the survival of the pathogens inside the host, and such annotations can help in expanding our knowledge of these processes. Chapter 3 describes the development and assessment of a sensitive remote homology detection method. The sensitivity of remote homology detection methods has been steadily increasing in the past decade and profile analysis has become a mainstay of such efforts. The profile is a probabilistic model of the substitutions allowed at each position in a sequence family, and hence captures the essential features of a family. Alignment of two such profiles is thus considered to provide a more sensitive and accurate method than the alignment of two sequences. The performance of HMMs (Hidden Markov Models) has been shown to be higher than that of PSSMs (Position Specific Scoring Matrices). Thus, a profile-profile alignment using HMMs can in principle give the best possible sensitivity in remote homology detection. Many investigators have incorporated residue conservation and secondary structure information to align two HMMs, and such additional information has been demonstrated to provide better sensitivity in remote homology detection (for instance in the HHSearch program). The work presented in Chapter 3 extends this idea by incorporating additional information such as explicit hydrophobicity, along with conservation and predicted secondary structure, over a window of Multiple Sequence Alignment (MSA) columns when aligning HMMs. The new algorithm is named AlignHUSH (Alignment of HMMs Using Secondary structure and Hydrophobicity). 
The HMMs used in the work are derived from structural alignments using the HMMER program and are taken from the publicly available superfamily database, which provides HMMs for all the SCOP families. In the AlignHUSH algorithm, the HMMs are converted into two-state HMMs by collapsing the ‘insert’ and ‘delete’ states into a ‘non-match’ state. The two-state HMMs enable the use of dynamic programming methods and keep the position-specific gap penalties intact. The two-state HMMs can also be more readily extended to the alignment of PSSMs. Secondary structure information is incorporated using predictions made by the PSIPRED program. The hydrophobicity information is calculated using the Kyte-Doolittle hydrophobicity values. The alignment is generated by scoring each position using the values present in a window of residues. Alignment accuracy is assessed by comparison with the manually curated alignments present in the BaliBASE database. A detailed description of the optimization steps followed to obtain the weight of each score contribution (conservation, secondary structure and hydrophobicity) is provided. The assessment revealed that a high weight for the conservation score (18.0) and low weights for the secondary structure score (1.5) and hydrophobicity (1.0) are optimal. The use of residue windows in alignment has been shown to dramatically increase the sensitivity (by around 30% on a small dataset comprising 10% of total SCOP domains). An all-against-all comparison of the SCOP 1.69 database demonstrates that AlignHUSH has better sensitivity than the other HMM-HMM alignment methods HHSearch and PRC (by approximately 10% and 5%, respectively). 
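The weighted, windowed scoring described above can be sketched as follows. Only the three weights (18.0, 1.5 and 1.0) come from the abstract. How each component score is computed, the window width, and the averaging over the window are all assumptions made for illustration, so this should not be read as the AlignHUSH implementation.

```python
# Weights reported in the abstract for the three score contributions.
W_CONSERVATION = 18.0
W_SECONDARY_STRUCTURE = 1.5
W_HYDROPHOBICITY = 1.0


def combined_score(conservation, secondary_structure, hydrophobicity):
    """Weighted sum of the three per-column-pair component scores."""
    return (W_CONSERVATION * conservation
            + W_SECONDARY_STRUCTURE * secondary_structure
            + W_HYDROPHOBICITY * hydrophobicity)


def windowed_pair_score(cons, ss, hyd, i, j, half_window=2):
    """Score aligning column i of one profile with column j of the other.

    cons, ss and hyd are hypothetical precomputed matrices (lists of lists)
    holding component scores for every column pair. The score is averaged
    over the diagonal window (i+k, j+k), skipping positions that fall off
    either profile; the averaging is an assumption of this sketch."""
    total, used = 0.0, 0
    for k in range(-half_window, half_window + 1):
        a, b = i + k, j + k
        if 0 <= a < len(cons) and 0 <= b < len(cons[0]):
            total += combined_score(cons[a][b], ss[a][b], hyd[a][b])
            used += 1
    return total / used


# Hypothetical 2x2 component matrices.
cons = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(windowed_pair_score(cons, zeros, zeros, 0, 0, half_window=1))
```

A matrix of such windowed scores could then be fed to a standard dynamic programming alignment, which the abstract says the two-state HMMs make possible.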
The alignment accuracy calculated as the ratio of correctly aligned residues and all alignment positions in BaliBASE alignments reveals that AlignHUSH algorithm provides an accuracy comparable or marginally higher than both HHSearch and PRC (25% for AlignHUSH and roughly 17% for both HHSearch and PRC). A few examples of structural relationships between SCOP families belonging to different folds and/or classes are presented in the chapter to illustrate the strength of AlignHUSH in detecting very remote relationships. Chapter 4 describes a database of evolutionary relationships identified between Pfam families. The grouping of Pfam families is important for obtaining better understanding on evolutionary relationships and in obtaining clues to functions of proteins in families of yet unknown function. Much effort has been taken by various investigators in bringing many proteins in the sequence databases within homology modeling distance with a protein of known structure. Structural genomics initiatives spend considerable effort in achieving this goal. The results from such experiments suggest that in many cases after the structure has been solved using X-ray crystallography or NMR methods, the protein is seen to have structural similarity to a protein of already known structure. Thus, an inability to detect such remote relationships severely impairs the efficiency of structural genomics initiatives. The development of the SUPFAM method was made earlier in the group to enable detection of distant relationships between Pfam families. In SUPFAM approach, relationships are detected by mapping the Pfam families to SCOP families. Further, using the implicit or explicit evolutionary relationship information present in the SCOP database relationships between Pfam families are detected. The work presented in this chapter is an improvement of previous development using the significantly more sensitive AlignHUSH method to uncover more relationships. 
The new database follows a procedure slightly different than the older SUPFAM database and hence is called SUPFAM+. The relative improvement brought by SUPFAM+ has been discussed in detail in the chapter. The methodology followed for the analysis is to first generate SUPFAM database by recognition of relationships between Pfam families and SCOP families using PSI BLAST / RPS BLAST. For the generation of SUPFAM+ database, recognition of relationships between Pfam families and SCOP families is done using AlignHUSH. The criteria are kept stringent at this stage to minimize the rate of false positives. In cases of a Pfam family mapping to two or more SCOP superfamilies, a semi-automated decision tree is used to assign the Pfam family to a single SCOP superfamily. Some of the Pfam families which remain without a mapping to a SCOP family are mapped indirectly to a SCOP family by identifying relationships between such Pfam families and other Pfam families which are already mapped to a SCOP family. In the final step, the Pfam families still without a SCOP family mapping are mapped onto one another to form ‘Potential New Superfamilies’ (PNSF), which are excellent targets for structural genomics since none of the proteins in such PNSFs have a recognizable homologue of known structure. The clustering of Pfam families into Superfamilies belonging to SCOP 1.69 version, were then queried to check if a structure has been solved for these Pfam families subsequent to the release of the SCOP 1.69 database. The latest SCOP database reveals that for close to 87 Pfam families a structure was solved which is at best related at a SCOP superfamily level with a family present in SCOP 1.69. An analysis of the mappings provided by SUPFAM+ database reveals that the mappings are correct in 85% of the cases at the SCOP superfamily level. An in-depth analysis revealed that among the rest of the cases, only one can be adjudged as an incorrect mapping. 
Many of the inconsistent mappings were found to be due to the absence of the SCOP fold in the SCOP 1.69 release, although interestingly the mapping provided by the SUPFAM+ database shows structural similarity to the actual fold found subsequently for the Pfam family. A straightforward comparison with a similar database (the Pfam Clans database) reveals that the SUPFAM+ database could suggest four times more pairwise relationships between Pfam families than the Pfam Clans database. Thus, since the structural mappings provided in the SUPFAM+ database are very accurate, the relationships found in the database could help in function annotation of uncharacterized protein families (explored in Chapter 5). The accuracy of mapping would be similar for the PNSFs, and hence these clusters can be excellent targets for structural genomics initiatives. The classification of families based on sequence/structural similarities can also be useful for function annotation of families of uncharacterized proteins, and such an idea is explored in the next chapter. Chapter 5 describes the attempts made to obtain clues to the structure and/or function of the DUF (Domain of Unknown Function) families present in the Pfam database. Currently, the DUF families populate around 21% of the Pfam database (2260 out of 10340). Thus, although homologues for each of the proteins in these families can be recognized in sequence databases, the homology does not provide obvious insight into the function of these proteins. The annotation of such difficult targets is a major goal of computational biologists in the post-genomic era. The development of a sensitive profile-profile alignment method as part of this thesis gives an excellent opportunity to increase the number of annotations for proteins, especially in the DUF families, since a profile for these families exists in the Pfam database.
The method followed for the analysis is similar to the SUPFAM+ development, and involved generation of Pfam profiles compatible with the AlignHUSH method. For the analysis presented in the chapter, relationships found between DUF families and SCOP families were analyzed. In benchmarks using the AlignHUSH method, it was found that a Z score of 5.0 gives a 10% error rate and a Z score of 7.5 gives an error rate of 1%, and hence a minimum Z score cutoff of 7.5 was used in the analysis. A very high Z score in AlignHUSH is usually seen in cases where sequence identity is also high, so a maximum Z score cutoff of 12.0 was used to find DUF families which are difficult to annotate using other profile-based methods (such as PSI-BLAST). For some of the DUF families, subsequent structure determination of one of the proteins had been reported in the literature, and these cases were used to assess the accuracy of structural annotation using AlignHUSH. In other cases, fold recognition was done using the PHYRE method to ensure that the structure mappings are corroborated by fold recognition. In all cases studied, the alignment of the DUF family with the SCOP family was generated and queried for conservation of active site residues reported for each homologous SCOP family in the CSA (Catalytic Site Atlas) database. The assessment of 8 DUF families for which a structure was solved subsequent to the SCOP release used in the analysis reveals that in all cases the correct structure was identified using the AlignHUSH procedure. In the eight cases of validated structure annotation, conservation of active site residues was seen, pointing to the effectiveness of AlignHUSH and its use in function annotation. In the 27 cases in which no structure is known for any of the proteins in the DUF family, the fold recognition results corroborate the suggestion made by AlignHUSH in all cases.
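The Z score window described above (at least 7.5, at most 12.0) amounts to a simple filter over profile-profile hits. The sketch below is illustrative only: the hit tuples, the family identifiers and the inclusive treatment of both boundaries are assumptions, not details taken from the thesis.

```python
Z_MIN = 7.5   # cutoff reported in the abstract to give about a 1% error rate
Z_MAX = 12.0  # above this, sequence identity is usually high as well


def select_remote_hits(hits, z_min=Z_MIN, z_max=Z_MAX):
    """Keep (duf_family, scop_family, z_score) hits inside the Z score window.

    Treating both boundaries as inclusive is an assumption.
    """
    return [hit for hit in hits if z_min <= hit[2] <= z_max]


# Hypothetical hits; the identifiers are placeholders, not real families.
hits = [
    ("DUF_A", "scop_family_1", 5.2),   # below the minimum: too error-prone
    ("DUF_B", "scop_family_2", 8.1),   # inside the window
    ("DUF_C", "scop_family_3", 13.4),  # above the maximum: likely an easy case
    ("DUF_D", "scop_family_4", 7.5),   # exactly on the lower boundary
]
selected = select_remote_hits(hits)
print([hit[0] for hit in selected])
```

The lower bound controls false positives, while the upper bound deliberately discards easy cases so that the remaining hits are the ones other profile-based methods are likely to miss.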
The alignments of each of the DUF families with the suggested homologous SCOP family reveal that in many cases the active site residues are not conserved or are substituted by different residues. An in-depth analysis of some cases reveals that the non-conservation of residues occurs between two SCOP families in the same SCOP superfamily. Thus, although structure annotation can be reliably provided for all the DUF families studied, the exact biochemical function could be detected only for those cases in which active site conservation is seen even among distantly related families (such as two SCOP families in the same SCOP superfamily). Methods for remote homology detection have been developed and applied successfully, and it has been demonstrated in the first part of the thesis that there is scope for extending the limits of remote homology detection. The use of sequence-derived information in aligning profiles makes the procedure generally applicable, and it has been applied successfully to structure/function recognition in the DUF families. In the next part of the thesis, a method for prediction of protein-protein interactions between a host and a pathogen organism and its application to three groups of pathogens is presented. Chapter 6 describes the development of a procedure for prediction of protein-protein interactions (PPI) between a pathogen and its host organism. In the past, prediction of PPI has been attempted for proteins of a given organism. This was often approached by identifying proteins of the organism of interest that are homologous to two interacting proteins of another organism. Studies of conservation of interactions as a function of sequence identity have been made in the past by various groups, and they reveal that homologues sharing a sequence identity greater than about 30% interact in a similar way.
This fact can be used, along with a high-quality database of protein-protein interactions, to predict interactions between proteins of the same organism. The work done in this thesis is one of the first attempts at extending the idea to the prediction of interactions between two different organisms. Homology of proteins from a pathogen and its host to proteins which are known to interact with each other would suggest that the proteins from pathogen and host can interact. The feasibility of such an interaction occurring under in vivo conditions needs to be addressed for biologically meaningful predictions. These issues have been dealt with in this part of the thesis. One of the main steps in the procedure for the prediction of PPI is identification of homologues of pathogen and host proteins among the interacting proteins listed in PPI databases. Two template PPI databases have been used in this work. One of the databases is the DIP database, which provides a list of interactions based on genome-scale yeast two-hybrid data or small-scale experiments. The other database used is the iPfam database, which provides interaction templates (Pfam families) based on protein complexes of known structure present in the Protein Data Bank (PDB). Thus, the two databases are both comprehensive and of high quality. The search for homologues in the DIP database was made using PSI-BLAST with stringent cutoffs for various parameters to minimize false positives. The search in the iPfam database is done using RPS-BLAST and MulPSSM with stringent cutoffs. The cutoffs for the searches were fixed based on an assessment of conservation of putative interacting residues in the host and pathogen proteins as compared to the protein complexes of known structure. The predictions made are analyzed manually to assess their importance to the pathogenesis of the disease under consideration.
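The core idea of the procedure described above, transferring a known template interaction to a pathogen-host pair when each partner has a homologue on the other side, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the dictionary layout and all protein identifiers are hypothetical, and the homology search itself (PSI-BLAST, RPS-BLAST, MulPSSM) is assumed to have been run beforehand.

```python
from itertools import product


def predict_interactions(template_pairs, pathogen_homologs, host_homologs):
    """Transfer template interactions to pathogen-host protein pairs.

    template_pairs: iterable of (template_a, template_b) known interactions.
    pathogen_homologs / host_homologs: dict mapping a template protein to the
    set of pathogen (or host) proteins found homologous to it.
    """
    predictions = set()
    for a, b in template_pairs:
        # A template pair is undirected, so try both assignments of partners.
        for x, y in ((a, b), (b, a)):
            pathogen_side = pathogen_homologs.get(x, ())
            host_side = host_homologs.get(y, ())
            predictions.update(product(pathogen_side, host_side))
    return predictions


# Hypothetical identifiers used purely for illustration.
template_pairs = [("T1", "T2"), ("T3", "T4")]
pathogen_homologs = {"T1": {"pathA"}, "T4": {"pathB"}}
host_homologs = {"T2": {"hostX", "hostY"}, "T3": {"hostZ"}}

predicted = predict_interactions(template_pairs, pathogen_homologs, host_homologs)
print(sorted(predicted))
```

A sketch like this only generates candidates; the in vivo feasibility checks and the manual analysis that the abstract describes would follow as separate steps.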
In this chapter, in order to obtain an idea about the robustness of this approach, PPI prediction was made for the phage-bacteria system and the herpes virus – human system, which have been experimentally studied extensively and hence offer opportunities to compare the “predictions” with experimental results. The prediction of phage – bacteria interactions suggests that the gross biological features of the pathogenesis have been captured in the predictions. The GO (Gene Ontology) based annotations for the bacterial proteins predicted to interact suggest that the predictions involve proteins participating in DNA replication and protein synthesis. Many of the known interactions, such as that between the lambda phage repressor and the RecA protein of bacteria, were also ‘predicted’ in the analysis. A few novel interactions were predicted. For example, an interaction between a tail component protein and a protein of unknown function, YeeJ in E.coli, has been predicted. The prediction of interactions between Herpes Virus 8 and the human host and its comparison to a set of experimentally verified interactions reported in the literature suggested that close to 50% of the known interactions were ‘predicted’ by the procedure followed. A few novel cases of interaction between the viral proteins and the p53 protein have also been predicted, which might help in understanding the tumorigenesis of the viral disease. A comparison between the procedure followed in this thesis and the results from another genome-scale method (proposed by Andrej Sali and coworkers) suggests that although the proteins involved in predicted interactions from the two methods may differ, the functions of the proteins concerned, as suggested by GO annotations, are highly correlated (greater than 98%). In the next few chapters, the prediction of interactions for different host-pathogen systems is described. In Chapter 7, the prediction of PPI between a eukaryotic malarial pathogen, P.falciparum, and its human host is described.
The malarial parasite was chosen because of the extensive work reported in the literature on this pathogen in recent years. Also, the gene expression patterns in the pathogen are highly correlated to the human tissue types, with each stage of the pathogen occurring in a distinct tissue type. Thus, the biological context of the PPI can be explicitly assessed, which makes this example a well-suited case for the procedure described in Chapter 6 of this thesis. The pathogen is important from a medical perspective since there has been a recent emergence of P.falciparum-induced malaria which is unresponsive to conventional drugs. Thus, studies of this parasite have gained importance in the post-genomic era. The difficulty in identifying homologues of many of the P.falciparum proteins makes this a challenging case study. Prediction of PPI between the malarial parasite and the human proteins has been approached in the same way as described in Chapter 6, with the cutoffs in homology searches kept stringent. However, in this case effective use of available additional biological data has been possible. The tissue-specific expression information for human proteins has been obtained from the Atlas of Human transcriptome and the NCBI GEO database. The pathogen stage-specific expression data has been obtained from multiple genome-scale experiments reported in the literature. The subcellular localization of both human and pathogen proteins has been predicted, and hence this information is given low weight in subsequent analysis. The prediction of PPI between the malarial parasite and human resulted in a total of more than 30,000 interactions which were compatible under in vivo conditions according to the expression data. Further reduction in the set of predicted interactions was made by incorporating the subcellular localization predictions (reduced to around 2000 interactions).
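The expression-based compatibility check described above can be pictured as a filter that keeps a predicted pair only if some pathogen stage expressing the pathogen protein occurs in a tissue where the human protein is also expressed. The sketch below is a loose illustration: the data layout, the stage-to-tissue table and all protein identifiers are assumptions, and the thesis may combine the evidence differently.

```python
def compatible_in_vivo(pair, stage_expression, tissue_expression, stage_tissue):
    """Return True if the two proteins could meet according to expression data.

    stage_expression: pathogen protein -> set of stages where it is expressed.
    tissue_expression: human protein -> set of tissues where it is expressed.
    stage_tissue: pathogen stage -> human tissue in which that stage occurs.
    """
    pathogen_protein, human_protein = pair
    human_tissues = tissue_expression.get(human_protein, ())
    for stage in stage_expression.get(pathogen_protein, ()):
        tissue = stage_tissue.get(stage)
        if tissue is not None and tissue in human_tissues:
            return True
    return False


# Illustrative tables; protein identifiers P1, P2, H1, H2 are hypothetical.
stage_tissue = {"sporozoite": "liver", "merozoite": "blood"}
stage_expression = {"P1": {"sporozoite"}, "P2": {"merozoite"}}
tissue_expression = {"H1": {"liver"}, "H2": {"brain"}}

candidates = [("P1", "H1"), ("P1", "H2"), ("P2", "H1")]
kept = [
    pair
    for pair in candidates
    if compatible_in_vivo(pair, stage_expression, tissue_expression, stage_tissue)
]
print(kept)
```

A second pass using predicted subcellular localization could be written the same way; since the abstract gives that information low weight, it would more naturally rank the surviving pairs than exclude them outright.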
Manual analysis of each of these interactions, aided by the literature on malarial parasites, reveals that many of the known PPI are also ‘predicted’ in the analysis, such as the interaction between the SSP2 protein of P.falciparum and human ICAMs. For many proteins known to be important for pathogenesis, such as the RESA antigen, novel interactions were predicted that could help in better understanding of the pathogen. For some of the novel predicted interactions, such as that between the parasite Plasmepsin and human Spectrin, there exists circumstantial experimental evidence of interaction. Among many other novel interactions, the procedure used could predict interactions for 441 ‘hypothetical proteins’ of unknown function coded in the genome of the pathogen. The comprehensive list of predictions made using the procedure and an exploration of its biological significance can lead to novel hypotheses regarding the pathogenesis of malaria, and hence the work presented in this chapter can be helpful for further experimental exploration of the pathogen. The success of the procedure in predicting known interactions as well as novel interactions in a eukaryotic pathogen suggests that the procedure developed is generally applicable. However, it must be pointed out that in many cases of host-pathogen systems, such extensive expression and localization data may not be available, which makes the analysis difficult due to the large number of interactions predicted. One such difficult case is the interaction between Mycobacterial species and the human host, which is described in the next chapter. Chapter 8 describes the prediction of PPI between human and M.tuberculosis as well as three pathogens closely related to M.tuberculosis. Each of the pathogens has been seen to re-emerge due to drug resistance and other causes. M.tuberculosis is becoming a global problem due to the limited number of drugs available to treat TB, which are susceptible to resistance.
M.leprae has also shown signs of emergence of drug resistance, whereas C.diphtheriae, another pathogen studied in this chapter, is seen as an emerging pathogen in Eastern Europe and in the Indian subcontinent. Nocardial infections have also seen a rise due to the prevalence of AIDS, which leads to susceptibility to Nocardia infections. Thus, there is a need to understand further the pathogens in this important family, in order to better direct drug development. An important area for such endeavors is the mapping of the PPI between the pathogens and the human host. The procedure developed as part of the thesis can be used to predict such interactions. The procedure for prediction of interactions is the same as that followed in Chapter 6 and involves identification of homologues for the pathogen and host proteins among the proteins listed in the two template datasets, DIP and iPfam, using PSI-BLAST and RPS-BLAST (MulPSSM). In addition to the homology to the proteins involved in PPI, information / prediction on subcellular localization is used to assess the biological significance of the interaction. An experimentally derived dataset of exported proteins in M.tuberculosis was used to supplement the predictions from the PSORTb database, which provides subcellular localization for bacterial proteins. In order to minimize the number of predictions explored manually and to maximize the biological relevance of predicted interactions, the predictions were made only for proteins present on the membrane of the pathogen or exported into the host. Prediction of interactions between human proteins and the proteins of the four pathogens studied revealed that some of the interactions which were known from earlier experiments were “predicted” by the present procedure. For example, the M.leprae exported Serine protease is known to interact with Ras-like proteins in the human host, and this interaction was ‘predicted’.
Among other predicted interactions, several novel interactions have been suggested for proteins important for pathogenesis, such as the MPT70 protein of M.tuberculosis, which has been predicted to interact with TGFβ-associated proteins that could play an important role in the pathogenesis of the disease. Some of the human proteins are known to play an important role in pathogenesis, especially the toll-like receptors. A C.diphtheriae protein, Mycosin, has been predicted to interact with the toll-like receptors, raising the possibility that the Mycosins may play an important role in pathogenesis. Several hypothetical proteins of unknown function in the pathogens have been predicted to interact with human proteins. A few such cases from M.tuberculosis have been described in the thesis, and these proteins are predicted to interact with proteins involved in post-translational modification in the human host. The prediction of novel interactions along with known interactions in four bacterial species thus points to the fact that the procedure can be used for almost any host-pathogen pair. In the next chapter, the application of the method to three other bacterial species belonging to the Enterobacteriaceae family is presented. Chapter 9 describes the analysis performed on the predicted interactions between human and three pathogens in the Enterobact
Protein Functions Bioactive Proteins Computational Biochemistry Protein-Protein Interactions Protein Homology Detection - Algorithms Protein Bioinformatics Protein Sequence Homology (Biology) Remote Homology Detection Biochemistry
17	Topology-based Sequence Design For Proteins Structures And Statistical Potentials Sensitive To Local Environments Jha, Anupam Nath 11 1900 (has links) (PDF) Proteins, which regulate most biological activities, perform their functions through their unique three-dimensional structures. The folding process of this three-dimensional structure from a one-dimensional sequence is not well understood. The available facts suggest that protein structures are mostly conserved while sequences are more tolerant to mutations, i.e. a number of sequences can adopt the same fold. The search for optimal sequences for a chosen conformation is known as inverse protein folding, and this thesis takes this approach to solve the enigmatic problem. This thesis presents a protein sequence design method based on the native state topology of the protein structure. The structural importance of the amino acid positions has been converted into a topological parameter of the protein conformation. This scheme of extracting the topology of structures has been successfully applied to three-dimensional lattice structures, and in turn sequences with minimum energy for a given structure are obtained. This technique, along with a reduced amino acid alphabet (a reduced amino acid alphabet is any clustering of the twenty amino acids based on some measure of their relative similarity), has been applied to protein structures, and optimal amino acid sequences for a given structure have thereby been designed. These designed sequences are energetically much better than the native amino acid sequence. The utility of this method is further confirmed by showing the similarity between naturally occurring and designed sequences. In summary, a computationally efficient method of designing optimal sequences for a given structure is given. The physical interaction energy between amino acids is an important part of the study of protein-protein interaction, structure prediction, modeling, docking, etc.
The local environment of amino acids makes a difference between the same amino acid pairs in the protein structure, and so the pair-wise interaction energy of amino acid residues should depend on their respective environment. A local-environment-dependent, knowledge-based potential energy function is developed in this thesis. Two different environments have been considered: the local degree (number of contacts) and the secondary structural element of the amino acids. The investigations have shown that the environment-based interaction preferences of amino acids are able to provide good potential energy functions which perform exceedingly well in discriminating the native structure from structures with random interactions. Further, membrane proteins are located in a completely different physico-chemical environment, with a different amino acid composition, than water-soluble proteins. This work provides reliable potential energy functions which take account of the different environment for the investigation (modeling/prediction) of the structure of helical membrane proteins. Three different environments, parallel and perpendicular to the lipid bilayer and number of amino acid contacts, are explored to analyze the environmental effects on the potential functions. These environment-dependent scoring functions perform exceedingly well in discriminating the native sequence from a set of random sequences. Hydrophobicity of amino acids is a measure of buriedness or exposure to the aqueous environment. The lack of uniformity within the protein environment gives rise to different values of hydrophobicity for the same amino acid, which depend completely on its location inside the protein. The contact-based, environment-dependent hydrophobicity values of all amino acids, separately for globular and membrane proteins, have also been evaluated in this thesis.
Apart from developing scoring functions, the packing of helices in membrane proteins is investigated by an approach based on the local backbone geometry and side chain atom-atom contacts of amino acids. A parameter defined in this study is able to capture the essential features of inter-helical packing, which may prove to be useful in modeling of helical membrane proteins. In conclusion, this thesis has described a novel technique to design energetically minimized amino acid sequences which can fold into a given conformation. Also, the environment-dependent interaction preference of amino acids in globular proteins is captured in an efficient manner. In particular, the environment-dependent scoring function for helical membrane proteins is a first successful attempt in this direction.
Protein Structure Membrane Proteins Protein Folding Protein Design Amino Acid Sequence Membrane Proteins - Helix Packing Protein Sequences Globular Proteins Protein Sequence Design Biochemistry
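The abstract above defines a reduced amino acid alphabet as any clustering of the twenty amino acids by some similarity measure. A minimal sketch of such a mapping follows; the two-group hydrophobic/polar clustering used here is a hypothetical choice made only for illustration and is not the alphabet used in the thesis.

```python
# Hypothetical two-letter clustering (H = hydrophobic, P = polar).
# The thesis does not state its grouping; this one is illustrative only.
GROUPS = {
    "H": "AVLIMFWC",
    "P": "GSTYPNQDEKRH",
}

# Invert the grouping into a per-residue lookup table.
REDUCED_LETTER = {aa: letter for letter, members in GROUPS.items() for aa in members}


def reduce_sequence(sequence):
    """Rewrite a one-letter amino acid sequence in the reduced alphabet."""
    return "".join(REDUCED_LETTER[aa] for aa in sequence.upper())


# Short made-up peptide used only to show the mapping.
reduced = reduce_sequence("MKTAYIAK")
print(reduced)
```

Shrinking the alphabet this way reduces the number of candidate sequences for a structure with n positions from 20^n to k^n for k groups, which is what makes an exhaustive or near-exhaustive sequence search more tractable.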
18	Predikce aktivních míst v proteinech / Protein hot spots prediction Kašpárek, Jan January 2013 (has links) Knowledge of protein hot spots and the ability to successfully predict them using only the primary protein structure has been a worldwide scientific goal for several decades. This thesis describes the importance of hot spots and sums up the advances achieved in this field of study so far. In addition, we introduce a hot spot prediction algorithm that uses only the primary protein structure and is based primarily on signal processing techniques. To convert the protein sequence to a numerical signal we use the EIIP attribute, while further processing is carried out by means of the S-transform. The algorithm achieves a sensitivity of more than 60 %, its positive predictive value exceeds 50 %, and its main advantage over competing algorithms is its simplicity and low computational requirements.
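The two figures of merit quoted in this abstract, sensitivity and positive predictive value, follow directly from the overlap between predicted and true hot spot positions. The sketch below uses the standard definitions; the residue positions are invented for illustration and do not come from the thesis.

```python
def sensitivity_and_ppv(predicted, actual):
    """Return (sensitivity, positive predictive value) for two sets of positions.

    sensitivity = TP / (TP + FN), PPV = TP / (TP + FP).
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if not actual or not predicted:
        raise ValueError("both sets must be non-empty")
    return true_positives / len(actual), true_positives / len(predicted)


# Hypothetical residue positions: 3 of 5 true hot spots are recovered,
# and 3 of 5 predictions are correct.
actual_hot_spots = {3, 7, 12, 20, 25}
predicted_hot_spots = {3, 7, 12, 14, 30}
sens, ppv = sensitivity_and_ppv(predicted_hot_spots, actual_hot_spots)
print(sens, ppv)
```

Reporting both values matters because a predictor can trivially raise sensitivity by flagging more residues, at the cost of a lower positive predictive value.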
19	Analyses of All Possible Point Mutations within a Protein Reveals Relationships between Function and Experimental Fitness: A Dissertation Roscoe, Benjamin P. 25 March 2014 (has links) The primary amino acid sequence of a protein governs its specific cellular functions. Since the cracking of the genetic code in the late 1950’s, it has been possible to predict the amino acid sequence of a given protein from the DNA sequence of a gene. Nevertheless, the ability to predict a protein’s function from its primary sequence remains a great challenge in biology. In order to address this problem, we combined recent advances in next generation sequencing technologies with systematic mutagenesis strategies to assess the function of thousands of protein variants in a single experiment. Using this strategy, my dissertation describes the effects of most possible single point mutants in the multifunctional Ubiquitin protein in yeast. The effects of these mutants on the essential activation of ubiquitin by the ubiquitin activating protein (E1, Uba1p) as well as their effects on overall yeast growth were measured. Ubiquitin mutants defective for E1 activation were found to correlate with growth defects, although in a non-linear fashion. Further examination of select point mutants indicated that E1 activation deficiencies predict downstream defects in Ubiquitin function, resulting in the observed growth phenotypes. These results indicate that there may be selective pressure for the activity of the E1 enzyme to selectively activate ubiquitin protein variants that do not result in functional downstream defects. Additionally, I will describe the use of similar techniques to discover drug resistant mutants of the oncogenic protein BRAFV600E in human melanoma cell lines as an example of the widespread applicability of our strategy for addressing the relationship between protein function and biological fitness.
Amino Acid Sequence Mutagenesis Point Mutation Protein Sequence Analysis Ubiquitin Dissertations, UMMS Sequence Analysis, Protein Biochemistry Cellular and Molecular Physiology Molecular Biology Molecular Genetics
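The scale of a systematic point-mutation scan such as the one described in this entry can be made concrete by enumerating every single amino acid substitution of a sequence: a protein of length n has 19 × n such variants. The sketch below is an in silico enumeration only, with a short made-up peptide; it says nothing about how the mutant libraries were actually constructed in the dissertation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residues


def single_point_mutants(sequence):
    """Yield (label, mutant_sequence) for every single-residue substitution.

    Labels use the common wild type / position / mutant form, e.g. "M1A".
    """
    for index, wild_type in enumerate(sequence):
        for replacement in AMINO_ACIDS:
            if replacement != wild_type:
                label = f"{wild_type}{index + 1}{replacement}"
                yield label, sequence[:index] + replacement + sequence[index + 1:]


# Short illustrative peptide; a real scan would use the full protein sequence.
mutants = list(single_point_mutants("MQIF"))
print(len(mutants), mutants[0])
```

For a 76-residue protein the same enumeration gives 19 × 76 = 1444 variants, which is consistent with the abstract's description of assessing thousands of variants in a single experiment once several proteins or conditions are considered.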
20	Novel statistical methods for evaluation of metabolic biomarkers applied to human cancer cell lines Wang, Bo 05 May 2014 (has links) No description available. Chemistry Biochemistry Biostatistics Bioanalytical methods Bioassays Biological samples Chemometrics Statistics NMR Metabolomics Metabonomics Principal components analysis Protein sequence analysis Cancer biology Metabolic regulation Biomarker validation

Search results